Title: Visualising Clustering Validity Measures
Summary
Chapter 11 introduces clustering, the k-Means clustering algorithm, several cluster validity measures, and their visualizations. Clustering algorithms group cases into clusters of similar cases. Classification requires a training set of examples with predefined categories in order to train a classifier that assigns new cases to one of those categories. Clustering algorithms, by contrast, need no labeled training examples and automatically group unlabeled examples into clusters of similar cases. The predictive accuracy of a classification algorithm can be measured easily by comparing the known category labels of examples with the categories the algorithm predicts, but in clustering no labels are known in advance, so an objective evaluation of a clustering result is harder to achieve. Visualizing cluster validity measures can help people evaluate the quality of a set of clusters. This chapter applies k-Means clustering to a medical dataset to find groups of similar E. coli bacteria with regard to where protein localization occurs in them, and explains how to judge the quality of the resulting clusters using visualized cluster validity measures. Cluster validity measures implemented in the open source statistics package R are integrated into and used within RapidMiner processes through the R extension for RapidMiner.
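The idea of judging a clustering without labels can be illustrated with a minimal sketch. It is not the chapter's RapidMiner process or its R code; it is a plain Python version of the standard k-Means iteration applied to six made-up 2-D points, and it reports the within-cluster sum of squared distances as one simple example of an internal validity measure. The function names, the toy data, and the way the initial centroids are picked are all assumptions made for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-Means (Lloyd's iteration).

    Assumption for illustration: the initial centroids are k input points
    taken at evenly spaced positions, so the result is deterministic.
    """
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids

def within_cluster_sse(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(
        math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels)
    )

# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]

labels_by_k = {}
sse_by_k = {}
for k in (1, 2, 3):
    labels, centroids = kmeans(points, k)
    labels_by_k[k] = labels
    sse_by_k[k] = within_cluster_sse(points, labels, centroids)
    print(k, round(sse_by_k[k], 3))
```

On data like this, the measure drops sharply between k = 1 and k = 2 and only slightly afterwards. Plotting such a measure against k is one way a visualization can suggest a sensible number of clusters.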
Table of Contents
11.1 Overview
11.2 Clustering
11.2.1 A Brief Explanation of k-Means
11.3 Cluster Validity Measures
11.3.1 Internal Validity Measures
11.3.2 External Validity Measures
11.3.3 Relative Validity Measures
11.4 The Data
11.4.1 Artificial Data
11.4.2 E-coli Data
11.5 Setup
11.5.1 Download and Install R Extension
11.5.2 Processes and Data
11.6 The Process in Detail
11.6.1 Import Data (A)
11.6.2 Generate Clusters (B)
11.6.3 Generate Ground Truth Validity Measures (C)
11.6.4 Generate External Validity Measures (D)
11.6.5 Generate Internal Validity Measures (E)
11.6.6 Output Results (F)
11.7 Running the Process and Displaying Results
11.8 Results and Interpretation
11.8.1 Artificial Data
11.8.2 E-coli Data
11.9 Conclusion
11.10 Bibliography