Chapter 23

Title: Anomaly Detection


Chapter 23 gives an overview of a broad range of anomaly detection methods and introduces the RapidMiner Anomaly Detection Extension. Anomaly detection is the process of finding patterns in a given dataset which deviate from the characteristics of the majority. These outstanding patterns are also known as anomalies, outliers, intrusions, exceptions, misuses, or fraud. Anomaly detection identifies single records in datasets which significantly deviate from the normal data. Application domains include, among others, network security, intrusion detection, computer virus detection, fraud detection, misuse detection, complex system supervision, and finding suspicious records in medical data. Anomaly detection for fraud detection is used to detect fraudulent credit card transactions caused by stolen credit cards, fraud in Internet payments, and suspicious transactions in financial accounting data. In the medical domain, anomaly detection is also used, for example, for detecting tumors in medical images or monitoring patient data (e.g., electrocardiograms) to get early warnings of life-threatening situations. Furthermore, a variety of other specific applications exists, such as anomaly detection in surveillance camera data, fault detection in complex systems, or detecting forgeries in document forensics. Despite the differences between the various application domains, the basic principle remains the same: multivariate normal data needs to be modeled, and the few deviations need to be detected, preferably with a score indicating their “outlierness”, i.e., a score indicating their extent of being an outlier. In the case of univariate data, such an outlier factor could, for example, be the number of standard deviations by which an outlier differs from the mean of this variable.
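The univariate case mentioned above amounts to a z-score. A minimal sketch (illustrative only, not part of the extension; the sample data is invented):

```python
import statistics

def z_scores(values):
    """Score each value by how many standard deviations it lies from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [abs(v - mean) / stdev for v in values]

# five values near 10 and one far-away value
data = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0]
scores = z_scores(data)
# the last value receives by far the largest score, marking it as the outlier
```

Note that the outlier itself inflates both the mean and the standard deviation, which is one reason the multivariate methods described in this chapter go beyond this simple statistic.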

The overview of anomaly detection methods provided in this chapter distinguishes three different types of anomalies, namely (1) point anomalies, which are single data records deviating from others, (2) contextual anomalies, which are anomalous only with respect to their context, for example, with respect to time, and (3) collective anomalies, where a set of related data points jointly constitutes the anomaly. Most anomaly detection algorithms detect point anomalies only, which leads to the requirement of transforming contextual and collective anomaly problems into point anomaly problems using appropriate pre-processing, thus generating processable data views. Furthermore, anomaly detection algorithms can be categorized with respect to their operation mode, namely (1) supervised algorithms with training and test data, as used in traditional machine learning, (2) semi-supervised algorithms, which require anomaly-free training data for one-class learning, and (3) unsupervised approaches, which do not require any labeled data. Anomaly detection is, in most cases, associated with an unsupervised setup, which is also the focus of this chapter. In this context, all available unsupervised algorithms from the RapidMiner Anomaly Detection Extension are described, and the most well-known algorithm, the Local Outlier Factor (LOF), is explained in detail in order to provide a deeper understanding of the approaches themselves. The unsupervised anomaly detection algorithms covered in this chapter include Grubbs’ outlier test and noise removal procedure, k-NN Global Anomaly Score, Local Outlier Factor (LOF), Connectivity-Based Outlier Factor (COF), Influenced Outlierness (INFLO), Local Outlier Probability (LoOP), Local Correlation Integral (LOCI) and aLOCI, Cluster-Based Local Outlier Factor (CBLOF), and Local Density Cluster-Based Outlier Factor (LDCOF).
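The simplest of the unsupervised algorithms listed above, the k-NN Global Anomaly Score, can be sketched as follows. The variant shown here (the mean Euclidean distance to the k nearest neighbors) is one common formulation and is not necessarily the exact parameterization of the RapidMiner operator; the example points are invented:

```python
import math

def knn_global_anomaly_scores(points, k):
    """Average Euclidean distance to the k nearest neighbors, per point."""
    scores = []
    for i, p in enumerate(points):
        # distances to all other points, sorted ascending
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# a tight cluster around the origin plus one distant point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_global_anomaly_scores(points, k=2)
# the distant point (10, 10) receives the highest score
```

Because the score is based on absolute distances, it captures global outliers well; the local methods (LOF, COF, INFLO, LoOP) instead compare each point's density to that of its neighborhood.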
The semi-supervised anomaly detection algorithms covered in this chapter include a one-class Support Vector Machine (SVM) and a two-step approach with clustering and distance computations for detecting anomalies.
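The two-step approach mentioned above can be sketched as: cluster the anomaly-free training data, then score unseen points by their distance to the nearest cluster centroid. This is an illustrative sketch, not the chapter's Groovy implementation; the naive k-means routine (first-k initialization) and the sample data are assumptions:

```python
import math

def kmeans(points, k, iters=20):
    """Minimal k-means; returns centroids of the (anomaly-free) training data.
    Naive initialization: the first k points (an assumption for this sketch)."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[c]
            for c, cl in enumerate(clusters)
        ]
    return centroids

def anomaly_score(point, centroids):
    """Distance to the nearest centroid of the normal data."""
    return min(math.dist(point, c) for c in centroids)

# two "normal" clusters in the anomaly-free training data
train = [(0, 0), (0.2, 0.1), (-0.1, 0.2), (5, 5), (5.1, 4.9), (4.8, 5.2)]
centroids = kmeans(train, k=2)
normal_score = anomaly_score((0.1, 0.1), centroids)   # near a cluster: low
outlier_score = anomaly_score((10, 10), centroids)    # far from both: high
```

A threshold on this score (or the top-n scores) then separates anomalies from normal points; the chapter's Groovy script follows the same cluster-then-measure-distance idea.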

Besides a simple example consisting of a two-dimensional mixture of Gaussians, which is ideal for first experiments, two real-world datasets are analyzed. For unsupervised anomaly detection, a dataset of NBA regular-season basketball player statistics from 1946 to 2009 is analyzed for outstanding players, including all necessary pre-processing. The UCI NASA shuttle dataset is used to illustrate how semi-supervised anomaly detection can be performed in RapidMiner to find suspicious states during a NASA shuttle mission. In this context, a Groovy script is implemented for a simple semi-supervised cluster-distance-based anomaly detection approach, showing how easily RapidMiner can be extended with your own operators or scripts.

Table of Contents

23.1 Introduction
23.2 Categorizing an Anomaly Detection Problem
23.2.1 Type of Anomaly Detection Problem (Pre-processing)
23.2.2 Local versus Global Problems
23.2.3 Availability of Labels
23.3 A Simple Artificial Unsupervised Anomaly Detection Example
23.4 Unsupervised Anomaly Detection Algorithms
23.4.1 k-NN Global Anomaly Score
23.4.2 Local Outlier Factor (LOF)
23.4.3 Connectivity-Based Outlier Factor (COF)
23.4.4 Influenced Outlierness (INFLO)
23.4.5 Local Outlier Probability (LoOP)
23.4.6 Local Correlation Integral (LOCI) and aLOCI
23.4.7 Cluster-Based Local Outlier Factor (CBLOF)
23.4.8 Local Density Cluster-Based Outlier Factor (LDCOF)
23.5 An Advanced Unsupervised Anomaly Detection Example
23.6 Semi-supervised anomaly detection
23.6.1 Using a One-Class Support Vector Machine (SVM)
23.6.2 Clustering and distance computations for detecting anomalies
23.7 Summary
23.8 Glossary
23.9 Bibliography
