Friday, June 11, 2010

What To Make With Rasperry Bacardi

Deploying predictive models with R

Industrialization is the final stage of data mining. In the predictive framework, where the goal is to classify an individual from its description, it relies on the ability to save, distribute, and run in an operational environment the classifier built during the learning phase. This is what we call deployment.

In this tutorial, we present a deployment strategy for R. It relies on the ability to save models in binary files via the filehash package. Admittedly, we still need the R software in the industrialization phase (to classify new individuals), but several points argue in favor of this strategy: R is freely available and usable in any context; it runs equally well on Windows, Linux, and MacOS (http://www.r-project.org/); and it can be driven in batch mode, i.e. any program can call R, hand it a task to execute, and retrieve the results.
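
As a rough illustration of saving a model with filehash, here is a minimal sketch. The data set (Pima.tr and Pima.te from the MASS package), the database file name, and the key "tree" are assumptions made for the example; they are not taken from the tutorial's own files.

```r
library(filehash)  # key-value database package cited in the references
library(rpart)     # decision trees
library(MASS)      # used here only for the Pima.tr / Pima.te sample data (assumption)

# Fit a classifier on labeled training data.
tree_model <- rpart(type ~ ., data = Pima.tr, method = "class")

# Create a file-backed key-value database and store the model under a key.
db_path <- file.path(tempdir(), "models.db")  # hypothetical file name
unlink(db_path)                               # start from a clean file
dbCreate(db_path)
db <- dbInit(db_path)
dbInsert(db, "tree", tree_model)              # "tree" is an arbitrary key

# Later, typically in another R session: reopen the file and fetch the model.
db2 <- dbInit(db_path)
restored <- dbFetch(db2, "tree")

# The restored object behaves like the original one.
pred <- predict(restored, newdata = Pima.te, type = "class")
```

Saved as a script, code of this kind can be run without an interactive session, for instance with the Rscript command-line tool, which is one way for another program to hand a task to R.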

We write three separate programs to distinguish the stages. The first builds the models from the training data and stores them in a binary file. The second loads the models and uses them to classify the individuals of a second, unlabeled data set; the predictions are saved in a CSV file. The third loads the predictions and the true class memberships, which are stored in a third file, then builds the confusion matrices and computes the error rates. The following data mining methods are used: decision trees (rpart), logistic regression (glm), linear discriminant analysis (lda), and discriminant analysis on the factors of a PCA (princomp + lda). With this last case, we show that the strategy remains operational even when the prediction requires a complex sequence of operations.
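
The three stages can be sketched as follows for the princomp + lda case. This is an illustration under stated assumptions, not the tutorial's code: it uses the Pima.tr and Pima.te data from the MASS package as stand-ins for the tutorial's files, keeps three factors arbitrarily, uses hypothetical file names, and, to stay dependency-free, stores the objects with base R's saveRDS/readRDS in place of filehash. In practice each stage would live in its own script.

```r
library(MASS)  # lda(), plus the Pima.tr / Pima.te sample data (stand-ins, assumption)

work_dir   <- tempdir()
model_file <- file.path(work_dir, "pca_lda_model.rds")  # hypothetical file names
pred_file  <- file.path(work_dir, "predictions.csv")

predictors <- setdiff(names(Pima.tr), "type")

# --- Program 1: learn PCA then LDA on the factors, save everything needed ---
pca <- princomp(Pima.tr[, predictors], cor = TRUE)
n_factors <- 3  # number of retained factors, chosen arbitrarily for the sketch
train_df <- data.frame(pca$scores[, 1:n_factors], type = Pima.tr$type)
lda_fit <- lda(type ~ ., data = train_df)
saveRDS(list(pca = pca, lda = lda_fit, n_factors = n_factors,
             predictors = predictors), model_file)

# --- Program 2: reload, project the new individuals, classify, write a CSV ---
# MASS must be loaded in this program too, so that predict() works on the lda object.
m <- readRDS(model_file)
new_scores <- predict(m$pca, newdata = Pima.te[, m$predictors])
new_df <- as.data.frame(new_scores[, 1:m$n_factors])
predicted <- predict(m$lda, newdata = new_df)$class
write.csv(data.frame(prediction = predicted), pred_file, row.names = FALSE)

# --- Program 3: compare the predictions with the true classes ---
truth <- Pima.te$type
pred <- factor(read.csv(pred_file)$prediction, levels = levels(truth))
confusion <- table(truth = truth, predicted = pred)
error_rate <- 1 - sum(diag(confusion)) / sum(confusion)
print(confusion)
print(error_rate)
```

The saved object bundles the PCA, the LDA model, and the number of retained factors, because the second program has to replay the same sequence (standardize, project, then classify) on the new individuals.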

Keywords: R software, deployment, industrialization, rpart, lda, pca, glm, decision trees, discriminant analysis, logistic regression, principal components analysis, discriminant analysis on factors
Link: fr_Tanagra_Deploying_Predictive_Models_with_R.pdf
Data: Pima-model-deployment.zip
References:
R package, "Filehash: Simple key-value database"
Kdnuggets, "Data mining deployment Poll"
