Title: Using RapidMiner for Research: Experimental Evaluation of Learners
Summary
Chapter 24 presents a complex data mining research use case: evaluating and comparing the performance of several classification learning algorithms, including Naive Bayes, k-NN, Decision Trees, Random Forests, and Support Vector Machines (SVM), across many different datasets. Nested process control structures (loops over datasets, loops over learning algorithms, and cross-validation) automate the validation and the selection of the best model for each application dataset. Statistical tests such as the t-test and the ANOVA test (ANalysis Of VAriance) determine whether performance differences between learning techniques are statistically significant or may simply be due to chance. A custom-built Groovy script within RapidMiner extracts meta-attributes about the datasets. These can be used for meta-learning, i.e., learning to predict the performance of each learner in a given set of learners on a new dataset, so that the learner with the best expected accuracy can be selected for that dataset. The performance of fast learners, called landmarkers, on a new dataset can be combined with the metadata extracted from that dataset to predict the performance of another learner on it. The RapidMiner Extension for Pattern Recognition Engineering (PaREn) and its Automatic System Construction Wizard perform this kind of meta-learning for automated learner selection and parameter optimization on a given dataset.
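The nested evaluation scheme described in the summary (an outer loop over datasets, an inner loop over learners, and cross-validation at the core) can be illustrated outside RapidMiner with a minimal Python sketch. This is not the chapter's RapidMiner process. The two toy learners, the two synthetic datasets, and all function names below are assumptions made only for illustration, and the simple interleaved fold assignment stands in for whatever sampling the real cross-validation uses.

```python
from statistics import mean

# Hypothetical toy learners: each takes training data, a list of
# (features, label) pairs, and returns a predict(features) function.
def majority_learner(train):
    """Always predict the most frequent training label (ties: smallest label)."""
    labels = [y for _, y in train]
    top = max(sorted(set(labels)), key=labels.count)
    return lambda x: top

def one_nn_learner(train):
    """1-nearest-neighbour with squared Euclidean distance (ties: first in list)."""
    def predict(x):
        nearest = min(train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)))
        return nearest[1]
    return predict

def cross_validate(learner, data, k=5):
    """Mean accuracy over k folds; fold i holds out examples i, i+k, i+2k, ..."""
    fold_scores = []
    for i in range(k):
        test = data[i::k]
        train = [ex for j, ex in enumerate(data) if j % k != i]
        model = learner(train)
        fold_scores.append(mean(1.0 if model(x) == y else 0.0 for x, y in test))
    return mean(fold_scores)

def evaluate(datasets, learners, k=5):
    """Outer loop over datasets, inner loop over learners, logging each accuracy."""
    results = {}
    for dataset_name, data in datasets.items():
        for learner_name, learner in learners.items():
            results[(dataset_name, learner_name)] = cross_validate(learner, data, k)
    return results

def best_learner_per_dataset(results):
    """Select, for each dataset, the learner with the highest logged accuracy."""
    best = {}
    for (dataset_name, learner_name), acc in results.items():
        if dataset_name not in best or acc > results[(dataset_name, best[dataset_name])]:
            best[dataset_name] = learner_name
    return best

# Synthetic datasets (assumptions, not the chapter's data).
datasets = {
    # Label depends on the single feature: 0 below 10, 1 from 10 upwards.
    "separable": [((float(i),), 0 if i < 10 else 1) for i in range(20)],
    # Mostly label 1, with three isolated examples of label 0.
    "imbalanced": [((float(i),), 0 if i in (3, 11, 17) else 1) for i in range(20)],
}
learners = {"majority": majority_learner, "1-NN": one_nn_learner}

results = evaluate(datasets, learners)
best = best_learner_per_dataset(results)

for (dataset_name, learner_name), acc in sorted(results.items()):
    print(f"{dataset_name:<11} {learner_name:<9} accuracy={acc:.2f}")
print("best learner per dataset:", best)
```

The `results` dictionary plays the role of the result log: one accuracy per dataset and learner pair. A significance test would need the per-fold accuracies rather than only their mean, so a fuller version would keep `fold_scores` for each pair.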
Table of Contents
24.1 Introduction
24.2 Research of Learning Algorithms
24.2.1 Sources of Variation and Control
24.2.2 Example of an Experimental Setup
24.3 Experimental Evaluation in RapidMiner
24.3.1 Setting Up the Evaluation Scheme
24.3.2 Looping Through a Collection of Datasets
24.3.3 Looping Through a Collection of Learning Algorithms
24.3.4 Logging and Visualizing the Results
24.3.5 Statistical Analysis of the Results
24.3.6 Exception Handling and Parallelization
24.3.7 Setup for Meta-Learning
24.4 Conclusions
Bibliography