Wednesday, June 2, 2010


Processing very large files with R

Handling large files is a recurring problem in data mining. In this tutorial, we investigate a solution implemented in R as a package. The "filehash" package lets you copy ("dump") all kinds of objects to disk, not only data but also models. It uses a standard database format. Its major advantage is that standard statistical functions, and functions from other packages, can be used without any modification: instead of manipulating a data frame in memory, they work transparently on the data frame stored on disk. It is pretty impressive, I must admit. Processing capacity is greatly increased and, at the same time, the increase in computing time is not prohibitive.
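As a rough illustration of the workflow described above, here is a minimal sketch that assumes the filehash package is installed. The small simulated data frame, the variable names `x` and `y`, and the temporary database name are placeholders chosen for the example; they do not come from the tutorial, which works on a much larger file.

```r
# Sketch only: assumes the filehash package is installed from CRAN.
library(filehash)

# Small stand-in data frame (hypothetical; not the tutorial's data set).
set.seed(1)
df <- data.frame(x = rnorm(1000), y = rnorm(1000))

# Dump the data frame to disk: each column is stored under its own key.
db_path <- tempfile("demo_db")   # hypothetical database name
db <- dumpDF(df, dbName = db_path)

# Expose the on-disk columns through an environment; a column is read
# from disk only when it is accessed.
env <- db2env(db)

# The in-memory copy is no longer needed.
rm(df)

# Standard functions accept the environment where they would take a data frame.
m <- with(env, mean(x))
fit <- lm(y ~ x, data = env)
print(m)
print(coef(fit))
```

The point of the sketch is that `lm` is called exactly as usual; only the `data` argument changes from a data frame to the environment returned by `db2env`.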

Nevertheless, because R functions are not specifically designed to handle large data sets, when we increase our demands further, the calculations become impossible even though the machine's resources are not fully used. This is the limit of generic approaches. Modifying the learning algorithms is often necessary to exploit the particularities of the context. We should go even further: to obtain really convincing results, we would need both to adapt the learning algorithms and to organize the data on disk accordingly. A solution that suits every type of analysis is difficult to achieve, perhaps even illusory.

To evaluate the solution provided by the "filehash" package, we study the computation time and memory usage, with and without swapping to disk, for three tasks: computing descriptive statistics, inducing a decision tree with the rpart function from the package of the same name, and fitting a discriminant analysis with the lda function from the MASS package.
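The comparison described above could be sketched along the following lines. This is only an outline under stated assumptions: the simulated data frame, its column names (`class`, `v1`, `v2`, `v3`) and its size are hypothetical stand-ins for the tutorial's data file, and the timings it prints depend entirely on the machine. It assumes the filehash, rpart and MASS packages are installed.

```r
# Sketch only: assumes filehash, rpart and MASS are installed.
library(filehash)
library(rpart)
library(MASS)

# Hypothetical stand-in data: one class attribute and three predictors.
set.seed(1)
n <- 5000
df <- data.frame(
  class = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  v1 = rnorm(n), v2 = rnorm(n), v3 = rnorm(n)
)
f <- class ~ v1 + v2 + v3

# Size of the data frame when held in memory.
print(object.size(df), units = "Kb")

# Elapsed time of each task on the in-memory data frame.
t_mem <- c(
  stats = system.time(sapply(df[c("v1", "v2", "v3")], mean))[["elapsed"]],
  rpart = system.time(rpart(f, data = df))[["elapsed"]],
  lda   = system.time(lda(f, data = df))[["elapsed"]]
)

# Same tasks with the data swapped to disk through filehash.
db <- dumpDF(df, dbName = tempfile("wave_db"))  # hypothetical database name
env <- db2env(db)
t_disk <- c(
  stats = system.time(sapply(c("v1", "v2", "v3"),
                             function(k) mean(env[[k]])))[["elapsed"]],
  rpart = system.time(rpart(f, data = env))[["elapsed"]],
  lda   = system.time(lda(f, data = env))[["elapsed"]]
)

# Side-by-side elapsed times in seconds.
print(rbind(in_memory = t_mem, on_disk = t_disk))
```

On a data set this small the two rows will be close; the difference only becomes meaningful when the file approaches the limits of available memory.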

We will carry out the same operations in SIPINA, which also offers a disk-swapping solution for handling very large databases. We can thus compare the performance of the two strategies.

Keywords: high-volume, very large files, large databases, decision tree, discriminant analysis, SIPINA, C4.5, rpart, lda
Link: fr_Tanagra_Dealing_Very_Large_Dataset_With_R.pdf
Data: wave2M.txt.zip
References:
R package, "filehash: Simple key-value database"
Tanagra tutorial, "Processing large volumes - Comparison of software"
Tanagra tutorial, "Sipina - Treatment of very large files"
Yu-Sung Su's Blog, "Dealing with large dataset in R"
