Saturday, August 27, 2011

Data Mining with R - The Rattle Package

R (http://www.r-project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is absolutely justified (see Kdnuggets Polls - Data Mining/ Analytic Tools Used - 2011). Among the reasons which explain this success, we distinguish two very interesting characteristics: (1) we can extend almost indefinitely the features of the tool with the packages; (2) we have a programming language which allows to perform easily sequences of complex operations.

But this second property can be also a drawback. Indeed, some users do not want to learn a new programming language before being able to realize projects. For this reason, tools which allow to define the sequence of commands with diagrams (such as Tanagra, Knime, RapidMiner, etc.) still remain a valuable alternative with the data miners.

In this tutorial, we present the "Rattle" package which allows to the data miners to use R without needing to know the associated programming language. All the operations are performed with simple clicks, such as for any software driven by menus. But, in addition, all the commands are stored. We can save them in a file. Then, in a new working session, we can easily repeat all the operations. Thus, we find one of the important properties which miss to the tools driven by menus.

To describe the use of the rattle package, we perform an analysis similar to the one suggested by the rattle's author in its presentation paper (G.J. Williams, " Rattle : A Data Mining GUI for R", in The R Journal, volume 1 / 2, pages 45-55, December 2009). We perform the following steps: loading the data file; partitioning the instances into learning and test samples; specifying the types of the variables (target or input); computing some descriptive statistics; learning the predictive models from the learning sample; assessing the models on the test sample (confusion matrix, error rate, some curves).

Keywords: R software, R project, rpart, random forest, glm, decision tree, classification tree, logistic regression
Tutorial: en_Tanagra_Rattle_Package_for_R.pdf
Dataset: heart_for_rattle.txt
References:
Togaware, "Rattle"
CRAN, "Package rattle - Graphical user interface for data mining in R"
G.J. Williams, "Rattle: A Data Mining GUI for R", in The R Journal, Vol. 1/2, pages 45--55, december 2009.