Chapter 13

Title: Detecting Text Message Spam

Summary

Chapters 13 to 15 cover text mining applications. Chapter 13 introduces text mining, i.e., the application of data mining techniques such as classification to text documents such as e-mail messages, mobile phone text messages (SMS = Short Message Service), or web pages collected from the World Wide Web. To detect text message spam, preprocessing steps from the RapidMiner Text Processing extension transform the unstructured texts into document vectors of equal length, which makes the data usable with standard classification techniques. One such technique, Naïve Bayes, is then trained to automatically separate legitimate mobile phone text messages from spam messages.
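The idea behind this pipeline can be sketched outside RapidMiner. The following minimal Python example is an illustration only, not the RapidMiner process built in the chapter: it tokenizes messages, maps each one to a term-count vector of equal length, and trains a multinomial Naïve Bayes classifier with add-one smoothing. The function names and the four toy messages are invented for this sketch, and the multinomial variant is an assumption chosen for simplicity; RapidMiner's Naïve Bayes operator may model the attributes differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(messages):
    """Return the sorted distinct tokens; this fixes the vector length."""
    return sorted({tok for msg in messages for tok in tokenize(msg)})


def to_vector(message, vocabulary):
    """Return a term-count vector with one entry per vocabulary word."""
    counts = Counter(tokenize(message))
    return [counts[word] for word in vocabulary]


def train_naive_bayes(vectors, labels):
    """Estimate a log prior and add-one-smoothed log likelihoods per class."""
    n_words = len(vectors[0])
    model = {}
    for cls in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == cls]
        totals = [sum(col) for col in zip(*rows)]  # word counts in this class
        denom = sum(totals) + n_words              # add-one smoothing
        model[cls] = (
            math.log(len(rows) / len(labels)),
            [math.log((t + 1) / denom) for t in totals],
        )
    return model


def classify(vector, model):
    """Return the class with the highest posterior log score."""
    def score(cls):
        log_prior, log_likelihoods = model[cls]
        return log_prior + sum(c * ll for c, ll in zip(vector, log_likelihoods))
    return max(model, key=score)


# Invented toy messages, not taken from the SMS Spam Collection.
train_texts = [
    "Win a free prize now",
    "Free entry claim your prize",
    "Are we still meeting for lunch",
    "See you at home tonight",
]
train_labels = ["spam", "spam", "ham", "ham"]

vocabulary = build_vocabulary(train_texts)
train_vectors = [to_vector(t, vocabulary) for t in train_texts]
model = train_naive_bayes(train_vectors, train_labels)

prediction_spam = classify(to_vector("Claim your free prize", vocabulary), model)
prediction_ham = classify(to_vector("Lunch at home", vocabulary), model)
print(prediction_spam, prediction_ham)
```

Every message becomes a vector with one entry per vocabulary word, which is what lets a standard classifier treat free text like an ordinary table of numeric attributes. Words that do not occur in the training vocabulary are simply ignored when a new message is vectorized.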

Table of Contents

13.1 Overview
13.2 Applying This Technique in Other Domains
13.3 Installing the Text Processing Extension
13.4 Getting the Data
13.5 Loading the Text
13.5.1 Data Import Wizard Step 1
13.5.2 Data Import Wizard Step 2
13.5.3 Data Import Wizard Step 3
13.5.4 Data Import Wizard Step 4
13.5.5 Step 5
13.6 Examining the Text
13.6.1 Tokenizing the Document
13.6.2 Creating the Word List and Word Vector
13.6.3 Examining the Word Vector
13.7 Processing the Text for Classification
13.7.1 Text Processing Concepts
13.8 The Naïve Bayes Algorithm
13.8.1 How It Works
13.9 Classifying the Data as Spam or Ham
13.10 Validating the Model
13.11 Applying the Model to New Data
13.11.1 Running the Model on New Data
13.12 Improvements
13.13 Summary

Dataset: Download the SMS Spam Collection dataset from http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection