Chapter 14

Title: Robust Language Identification with RapidMiner: A Text Mining Use Case

Summary

The second text mining use case uses classification to automatically identify the language of a text from its characters, character sequences, and/or words. Chapter 14 discusses the character encodings of different European, Arabic, and Asian languages. It describes text representations by characters, by tokens such as words, and by character sequences of a fixed length, also called n-grams. Transforming document texts into document vectors also involves weighting the attributes by metrics based on term frequency and document frequency, such as TF/IDF, which the chapter also describes. The classification techniques Naive Bayes and Support Vector Machines (SVM) are then trained and evaluated on four different multilingual text corpora, including, for example, dictionary texts from Wikipedia and book texts from Project Gutenberg. Finally, the chapter shows how to deploy the RapidMiner language detection through RapidAnalytics as a web service for the automated language identification of web pages.
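
To make the character n-gram representation and the TF/IDF weighting mentioned above more concrete, here is a minimal Python sketch. It is not the chapter's RapidMiner process. The function names char_ngrams and tfidf_vectors are hypothetical, and the weighting formula (relative term frequency times log(N / document frequency)) is one common TF/IDF variant chosen for illustration; RapidMiner's operators may compute it differently.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Map each document to a dict of n-gram -> TF/IDF weight.

    Assumed variant: (count / total n-grams in doc) * log(N / df).
    """
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Document frequency: number of documents that contain each n-gram.
    df = Counter(g for c in counts for g in c)
    num_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            g: (k / total) * math.log(num_docs / df[g])
            for g, k in c.items()
        })
    return vectors


if __name__ == "__main__":
    docs = ["the cat sat", "der Hund lief"]
    for doc, vec in zip(docs, tfidf_vectors(docs)):
        # Show the three most heavily weighted trigrams per document.
        top = sorted(vec, key=vec.get, reverse=True)[:3]
        print(doc, "->", top)
```

With this variant, an n-gram that occurs in every document receives weight zero, so only n-grams that distinguish documents (and hence languages) contribute to the vectors a classifier such as Naive Bayes or an SVM would be trained on.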

Table of Contents

14.1 Introduction
14.2 The Problem of Language Identification
14.3 Text Representation
14.3.1 Encoding
14.3.2 Token-based Representation
14.3.3 Character-based Representation
14.3.4 Bag-of-Words Representation
14.4 Classification Models
14.5 Implementation in RapidMiner
14.5.1 Datasets
14.5.2 Importing Data
14.5.3 Frequent Words Model
14.5.4 Character n-Grams Model
14.5.5 Similarity-based Approach
14.6 Application
14.6.1 RapidAnalytics
14.6.2 Web Page Language Identification
14.7 Summary
Acknowledgment
Glossary
Bibliography

Dataset: Download the dataset from http://corpora.informatik.uni-leipzig.de/download.html
