June 2010

Monday, June 28, 2010

Filtering discrete predictors

Variable selection is a crucial aspect of supervised learning. It seeks to isolate the subset of predictors that efficiently explains the values of the target variable.

Three approaches are generally cited in the literature. "Embedded" methods integrate the selection directly into the learning process. "Wrapper" methods explicitly optimize an accuracy criterion, most often the error rate. They do not rely in any way on the characteristics of the learning algorithm, which is used as a black box.

Finally, the third and last approach, the one we study in this tutorial: "filter" methods act upstream, before the learning technique is run, and have no direct connection with it. They therefore assume that an independent process based on an ad hoc criterion can identify the relevant predictors regardless of the learning algorithm used downstream. The gamble is bold, even risky. And yet, some experiments show that the approach is viable even when the learning method itself embeds a variable selection mechanism (decision trees with C4.5, for example).

We are interested in filter methods based on the following principle: the selected subset of predictors should be composed of variables that are strongly associated with the target variable (relevance) but weakly related to each other (no redundancy). Two questions must be answered in this scheme: (1) how to measure the association between variables, given that we restrict ourselves to discrete predictors; (2) how to express redundancy within a subset of variables.
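The relevance/redundancy principle can be sketched in a few lines of Python. This is only an illustration under stated assumptions: symmetrical uncertainty is used as the correlation measure for discrete variables (one common choice, not necessarily the one used by each method in the tutorial), and the greedy score "relevance minus mean redundancy" is a simplified stand-in, not the exact CFS, MIFS or FCBF criterion. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """Normalized mutual information in [0, 1]: 2 * I(X;Y) / (H(X) + H(Y))."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def greedy_filter(predictors, target, k):
    """Greedily pick k predictors that are relevant to the target
    and weakly associated with the predictors already selected."""
    relevance = {name: symmetrical_uncertainty(col, target)
                 for name, col in predictors.items()}
    selected = []
    while len(selected) < min(k, len(predictors)):
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                symmetrical_uncertainty(predictors[name], predictors[s])
                for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        candidates = [n for n in predictors if n not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Toy data (made up): x2 is an exact copy of x1, x3 is weakly relevant
# but almost unrelated to x1.
target = ["a", "a", "b", "b", "a", "b", "a", "b"]
predictors = {
    "x1": [0, 0, 1, 1, 0, 1, 0, 0],
    "x2": [0, 0, 1, 1, 0, 1, 0, 0],
    "x3": [0, 1, 0, 1, 0, 1, 1, 1],
}
selected = greedy_filter(predictors, target, k=2)
print(selected)
```

On this toy example the duplicate x2 is rejected in favor of x3, even though x2 is individually far more relevant: that is the redundancy penalty at work.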

In this tutorial, we describe several filter methods based on a correlation measure for discrete variables. We apply them to a dataset specially prepared to highlight their behavior. We then evaluate their performance by building a naive Bayes model from the selected subsets of variables. We conduct the experiment with Tanagra; afterwards, we review the filter methods implemented in several free data mining tools (Weka 3.6.0, Orange 2.0b, RapidMiner 4.6.0, R 2.9.2 - package FSelector).
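To make the evaluation step concrete, here is a minimal sketch of a naive Bayes classifier for discrete predictors, written in plain Python. It is not Tanagra's implementation: the Laplace smoothing, the function names and the toy data are assumptions made for the illustration. Evaluating a selected subset then amounts to fitting this model on the retained columns only.

```python
from collections import Counter, defaultdict
from math import log


def fit_naive_bayes(rows, labels):
    """Count class frequencies and, per class, the value frequencies of each predictor."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (predictor index, class) -> value counts
    levels = defaultdict(set)            # predictor index -> observed values
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            levels[j].add(value)
    return class_counts, value_counts, levels


def predict_naive_bayes(model, row):
    """Return the class with the highest log-posterior (Laplace-smoothed estimates)."""
    class_counts, value_counts, levels = model
    total = sum(class_counts.values())

    def log_posterior(label):
        score = log(class_counts[label] / total)
        for j, value in enumerate(row):
            count = value_counts[(j, label)][value]
            score += log((count + 1) / (class_counts[label] + len(levels[j])))
        return score

    return max(class_counts, key=log_posterior)


# Toy data (made up): two discrete predictors, two classes.
rows = [("y", "n"), ("y", "y"), ("n", "n"), ("n", "y"), ("y", "n"), ("n", "n")]
labels = ["pos", "pos", "neg", "neg", "pos", "neg"]
model = fit_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("y", "y")))
print(predict_naive_bayes(model, ("n", "n")))
```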

Keywords: filter methods, filter approach, correlation based measure, naive Bayes model, conditional independence model
Components: FEATURE RANKING, CFS FILTERING, MIFS FILTERING, FCBF FILTERING, MODTREE FILTERING, NAIVE BAYES, BOOTSTRAP
Link: fr_Tanagra_Filter_Method_Discrete_Predictors.pdf
Data: vote_filter_approach.zip
References:
R. Rakotomalala, S. Lallich, "Construction of decision trees by optimization", Journal of Knowledge Extraction and Learning, Vol. 16, No. 6/2002, pp. 685-703, 2002.
Tanagra tutorials, "STEPDISC - Discriminant analysis"; "Wrapper strategy for variable selection"; "Wrapper for variable selection (continued)"

Tuesday, June 15, 2010

Data Mining under R - The rattle package

The creator of Tanagra is also a fan of R. That may seem strange, even contradictory. But really, I am above all a big fan of data mining, and software is an essential component of it. I spend a great deal of time dissecting tools, evaluating how they behave on data and analyzing their source code where possible; in short, studying them inside and out. This work fascinates me, and I have always done it. With the Internet, I can share the fruit of my reflections with others.

In this tutorial, we present rattle, an R package specialized in data mining. It does not add new learning methods; rather, it adds a graphical user interface (GUI) to R. Thus, a practitioner unfamiliar with the R programming language can nevertheless drive an analysis simply by clicking on menus and buttons, much as in the "Explorer" mode of Weka. Nothing revolutionary, then, but very important for novice users who want to get straight to the point: process their data with R without investing in the tedious learning of programming.

To describe how rattle works, we follow the outline of the paper published by its author in The R Journal (see references). We carry out the following sequence of operations: load the file, split it into training and test samples, define the role of the variables (target vs. predictors), compute some descriptive statistics and graphs to understand the data, build predictive models on the training sample, and evaluate them on the test sample with the usual assessment tools (confusion matrix, a few curves).
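The splitting step of this sequence can be sketched outside any GUI. The snippet below is a plain Python illustration, not what rattle does internally: the 70/30 proportion, the seed, the function name and the toy records are all assumptions.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=42):
    """Shuffle row indices with a fixed seed, then cut them into two disjoint samples."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Toy records (made up); "target" plays the role of the class attribute.
rows = [{"id": i, "target": "presence" if i % 3 == 0 else "absence"} for i in range(10)]
train, test = split_train_test(rows)
print(len(train), len(test))
```

Fixing the seed makes the partition reproducible, so that models built in different sessions are compared on the same test sample.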

Keywords: R software, rpart, random forest, glm, decision trees, logistic regression, random forests
Link: fr_Tanagra_Rattle_Package_for_R.pdf
Data: heart_for_rattle.txt
References:
Togaware, "Rattle"
CRAN, "Rattle Package - Graphical user interface for data mining in R"
G.J. Williams, "Rattle: A Data Mining GUI for R", The R Journal, Vol. 1/2, pages 45-55, December 2009.

Friday, June 11, 2010

Deploying predictive models with R

Industrialization is the ultimate stage of data mining. In the predictive framework, the goal is to classify an individual from its description. This relies on the ability to save, distribute and run, in an operational environment, the classifier built during the learning phase. This is what we call deployment.

In this tutorial, we present a deployment strategy for R. It rests on the ability to save models in binary files via the filehash package. Admittedly, we still need R in the deployment phase (to classify new individuals), but several points argue in favor of this strategy: R is freely accessible and usable in any context; it works equally well on Windows, Linux and MacOS (http://www.r-project.org/); and we can drive it in batch mode, i.e. any program can call R, hand it a task to execute, and retrieve the results.

We write three separate programs to distinguish the stages. The first builds the models from the training data and stores them in a binary file. The second loads the models and uses them to classify the individuals of a second, unlabeled dataset; the predictions are saved in a CSV file. The third loads the predictions and the true class memberships, stored in a third file, then builds the confusion matrices and computes the error rates. The data mining methods used are decision trees (rpart), logistic regression (glm), linear discriminant analysis (lda) and discriminant analysis on the factors of a PCA (princomp + lda). With the latter, we show that the strategy remains operational even when the prediction requires a complex sequence of operations.
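The three-stage organization can be sketched as follows. This is an illustration in Python rather than R, so several things are swapped in and should be read as assumptions: `pickle` stands in for the binary storage provided by filehash, the "model" is a deliberately trivial nearest-class-mean rule on a single numeric variable (not one of the methods listed above), and all function names, file names and data are hypothetical.

```python
import csv
import os
import pickle
import tempfile
from collections import Counter


def train_and_save(values, labels, model_path):
    """Stage 1: learn one mean per class and serialize the model to a binary file."""
    sums, counts = Counter(), Counter()
    for v, y in zip(values, labels):
        sums[y] += v
        counts[y] += 1
    model = {y: sums[y] / counts[y] for y in counts}
    with open(model_path, "wb") as f:
        pickle.dump(model, f)


def load_and_predict(values, model_path, predictions_path):
    """Stage 2: reload the model, classify unlabeled values, write predictions to CSV."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(predictions_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prediction"])
        for v in values:
            # Assign the class whose mean is closest to the value.
            writer.writerow([min(model, key=lambda y: abs(v - model[y]))])


def evaluate(predictions_path, true_labels):
    """Stage 3: reload the predictions, build the confusion matrix and the error rate."""
    with open(predictions_path, newline="") as f:
        predicted = [row["prediction"] for row in csv.DictReader(f)]
    confusion = Counter(zip(true_labels, predicted))  # (true, predicted) -> count
    errors = sum(n for (truth, pred), n in confusion.items() if truth != pred)
    return confusion, errors / len(true_labels)


# Toy run (made-up data); in practice each stage would be a separate program.
workdir = tempfile.mkdtemp()
model_path = os.path.join(workdir, "model.bin")
predictions_path = os.path.join(workdir, "predictions.csv")

train_and_save([1.0, 1.2, 3.0, 3.4], ["neg", "neg", "pos", "pos"], model_path)
load_and_predict([0.9, 3.1, 2.0], model_path, predictions_path)
confusion, error_rate = evaluate(predictions_path, ["neg", "pos", "pos"])
print(dict(confusion), error_rate)
```

The only things shared between the stages are files on disk (the binary model, the CSV of predictions), which is what lets each stage run on a different machine or at a different time.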

Keywords: R software, deployment, industrialization, rpart, lda, pca, glm, decision trees, discriminant analysis, logistic regression, principal components analysis, discriminant analysis on factors
Link: fr_Tanagra_Deploying_Predictive_Models_with_R.pdf
Data: Pima-model-deployment.zip
References:
R package, "Filehash: Simple key-value database"
Kdnuggets, "Data mining deployment Poll"

Wednesday, June 2, 2010

Processing very large files with R

Processing large files is a recurring problem in data mining. In this tutorial, we study a solution available for R in the form of a library. The "filehash" package makes it possible to copy (to "dump", quite simply) all kinds of objects to disk: datasets, but also models. It uses a standard database format. It has one huge advantage: standard statistical functions, or functions from other packages, can be used without any adaptation. Instead of manipulating a data frame in memory, they work transparently on a data frame stored on disk. It's pretty amazing, I must admit. Processing capacity is greatly improved and, at the same time, the increase in computing time is not prohibitive.
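The idea of a disk-backed key-value store can be illustrated with Python's standard `shelve` module. To be clear, this is an analogy and not filehash itself: the storage format differs, and unlike what is described above, ordinary functions do not work transparently on the stored object, so each variable is loaded explicitly. The function names, the one-key-per-variable layout and the data are assumptions.

```python
import os
import shelve
import tempfile


def dump_columns(db_path, columns):
    """Write each column to a disk-backed key-value store, one key per variable."""
    with shelve.open(db_path) as db:
        for name, values in columns.items():
            db[name] = values


def mean_from_disk(db_path, name):
    """Load a single column from disk and compute its mean; the others stay on disk."""
    with shelve.open(db_path) as db:
        values = db[name]
    return sum(values) / len(values)


# Toy dataset (made up), stored in a temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "dataset_db")
dump_columns(db_path, {"v1": [1.0, 2.0, 3.0], "v2": [10.0, 20.0, 30.0, 40.0]})
v2_mean = mean_from_disk(db_path, "v2")
print(v2_mean)
```

Only the requested variable is brought into memory at any time, which is the trade made by this kind of approach: less memory pressure in exchange for disk reads.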

Nevertheless, we find that R functions are not specifically designed for handling large datasets: when we push the size of the data further, the calculations become impossible even though the resources are not fully used. This is more or less the limit of generic approaches. Modifying the learning algorithms is often necessary to exploit the particularities of the context. One should even go further: to obtain really convincing results, one would have to both adapt the learning algorithms and organize the data on disk accordingly. A solution that suits every type of analysis is difficult, even illusory.

To evaluate the solution provided by the "filehash" package, we study the computation time and memory usage, with and without swapping to disk, for the calculation of descriptive statistics, the induction of a decision tree with rpart (from the package of the same name), and modeling by discriminant analysis with the lda function of the MASS library.
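This kind of measurement can be sketched in Python with the standard `time` and `tracemalloc` modules. It is only an illustration of the protocol, not the instrumentation used in the tutorial: `tracemalloc` reports memory allocated by the Python interpreter rather than the process footprint seen by the operating system, and the two toy tasks (a mean computed from a full in-memory list versus a streamed one) are hypothetical stand-ins for the "without swap" and "with swap" settings.

```python
import time
import tracemalloc


def measure(task, *args):
    """Run task(*args); return (result, elapsed seconds, peak traced memory in bytes).
    Tracing slows execution down, so the elapsed time is an upper bound."""
    tracemalloc.start()
    start = time.perf_counter()
    result = task(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


def mean_in_memory(n):
    """Materialize all values in a list before averaging."""
    values = [float(i) for i in range(n)]
    return sum(values) / n


def mean_streaming(n):
    """Average the same values one at a time, without storing them."""
    return sum(float(i) for i in range(n)) / n


n = 100_000
result_mem, time_mem, peak_mem = measure(mean_in_memory, n)
result_stream, time_stream, peak_stream = measure(mean_streaming, n)
print(f"in memory: {time_mem:.4f}s, peak {peak_mem} bytes")
print(f"streaming: {time_stream:.4f}s, peak {peak_stream} bytes")
```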

We carry out the same operations in SIPINA. Indeed, it also offers a disk-swapping solution for handling very large databases. We can thus compare the performance of the strategies implemented.

Keywords: high-volume, very large files, large databases, decision tree, discriminant analysis, SIPINA, C4.5, rpart, lda
Link: fr_Tanagra_Dealing_Very_Large_Dataset_With_R.pdf
Data: wave2M.txt.zip
References:
R package, "Filehash: Simple key-value database"
Tanagra tutorial, "Processing large volumes - Software comparison"
Tanagra tutorial, "Sipina - Processing very large files"
Yu-Sung Su's Blog, "Dealing with large dataset in R"