Outlier Discovery/Anomaly Detection
Data Mining: Concepts and
* Techniques 1
Anomaly/Outlier Detection
■ What are anomalies/outliers?
■ The set of data points that are considerably different than
the remainder of the data
■ Variants of Anomaly/Outlier Detection Problems
■ Given a database D, find all the data points x ∈ D with
anomaly scores greater than some threshold t
■ Given a database D, find all the data points x ∈ D having
the top-n largest anomaly scores f(x)
■ Given a database D, containing mostly normal (but
unlabeled) data points, and a test point x, compute the
anomaly score of x with respect to D
* Data Mining: Concepts and Techniques 2
Applications
■ Credit card fraud detection
■ telecommunication fraud detection
■ network intrusion detection
■ fault detection
■ many more
* Data Mining: Concepts and Techniques 3
Anomaly Detection
■ Challenges
■ How many outliers are there in the data?
■ Method is unsupervised
■ Validation can be quite challenging (just like for
clustering)
■ Finding needle in a haystack
■ Working assumption:
■ There are considerably more “normal”
observations than “abnormal” observations
(outliers/anomalies) in the data
* Data Mining: Concepts and Techniques 4
Anomaly Detection Schemes
■ General Steps
■ Build a profile of the “normal” behavior
■ Profile can be patterns or summary statistics for the overall
population
■ Use the “normal” profile to detect anomalies
■ Anomalies are observations whose characteristics
differ significantly from the normal profile
■ Types of anomaly detection
schemes
■ Graphical & Statistical-based
■ Distance-based
■ Model-based
* Data Mining: Concepts and Techniques 5
Graphical Approaches
■ Boxplot (1-D), Scatter plot (2-D), Spin plot (3-D)
■ Limitations
■ Time consuming
■ Subjective
* Data Mining: Concepts and Techniques 6
Convex Hull Method
■ Extreme points are assumed to be outliers
■ Use convex hull method to detect extreme values
■ What if the outlier occurs in the middle of the
data?
* Data Mining: Concepts and Techniques 7
Statistical Approaches
■ Assume a parametric model describing the distribution of the data (e.g.,
normal distribution)
■ Apply a statistical test that depends on
■ Data distribution
■ Parameter of distribution (e.g., mean, variance)
■ Number of expected outliers (confidence limit)
* Data Mining: Concepts and Techniques 8
Grubbs’ Test
■ Detect outliers in univariate data
■ Assume data comes from normal distribution
■ Detects one outlier at a time, remove the outlier, and
repeat
■ H0: There is no outlier in data
■ HA: There is at least one outlier
■ Grubbs’ test statistic:
■ Reject H0 if:
* Data Mining: Concepts and Techniques 9
Statistical-based – Likelihood
Approach
■ Assume the data set D contains samples from a mixture of two
probability distributions:
■ M (majority distribution)
■ A (anomalous distribution)
■ General Approach:
■ Initially, assume all the data points belong to M
■ Let Lt(D) be the log likelihood of D at time t
■ For each point xt that belongs to M, move it to A
■ Let L (D) be the new log likelihood.
t+1
■ Compute the difference, Δ = L (D) – L (D)
t t+1
■ If Δ > c (some threshold), then x is declared as an anomaly
t
and moved permanently from M to A
* Data Mining: Concepts and Techniques 10
Statistical-based – Likelihood
Approach
■ Data distribution, D = (1 – λ) M + λ A
■ M is a probability distribution estimated from data
■ Can be based on any modeling method
■ A is initially assumed to be uniform distribution
■ Likelihood at time t:
* Data Mining: Concepts and Techniques 11
Limitations of Statistical Approaches
■ Most of the tests are for a single attribute
■ In many cases, data distribution may not be
known
■ For multi-dimensional data, it may be difficult to
estimate the true distribution
* Data Mining: Concepts and Techniques 12
Distance-based Approaches
■ Data is represented as a vector of features
■ Three major approaches
■ Nearest-neighbor based
■ Density based
■ Clustering based
* Data Mining: Concepts and Techniques 13
Nearest-Neighbor Based Approach
■ Approach:
■ Compute the distance between every pair of data points
■ There are various ways to define outliers:
■ Data points for which there are fewer than p neighboring
points within a distance D
■ The top n data points whose distance to the kth nearest
neighbor is greatest
■ The top n data points whose average distance to the k nearest
neighbors is greatest
* Data Mining: Concepts and Techniques 14
Density-based: LOF approach
■ For each point, compute the density of its local
neighborhood
■ Compute local outlier factor (LOF) of a sample p as the
average of the ratios of the density of sample p and the
density of its nearest neighbors
■ Outliers are points with largest LOF value
In the NN approach, p2 is
not considered as outlier,
while LOF approach find
both p1 and p2 as outliers
p2
× p1
×
* Data Mining: Concepts and Techniques 15
LOF
The local outlier factor LOF, is defined as follows:
where Nk(p) is the set of k-nearest neighbors to p
and
* Data Mining: Concepts and Techniques 16
Clustering-Based
■ Basic idea:
■ Cluster the data into groups of
different density
■ Choose points in small cluster
as candidate outliers
■ Compute the distance between
candidate points and
non-candidate clusters.
■ If candidate points are far
from all other non-candidate
points, they are outliers
* Data Mining: Concepts and Techniques 17
Outliers in Lower Dimensional Projections
■ In high-dimensional space, data is sparse and
notion of proximity becomes meaningless
■ Every point is an almost equally good outlier
from the perspective of proximity-based
definitions
■ Lower-dimensional projection methods
■ A point is an outlier if in some lower
dimensional projection, it is present in a local
region of abnormally low density
* Data Mining: Concepts and Techniques 18
Outliers in Lower Dimensional Projection
■ Divide each attribute into φ equal-depth intervals
■ Each interval contains a fraction f = 1/φ of the records
■ Consider a d-dimensional cube created by picking grid ranges
from d different dimensions
■ If attributes are independent, we expect region to contain
a fraction fk of the records
■ If there are N points, we can measure sparsity of a cube
D as:
■ Negative sparsity indicates cube contains smaller number
of points than expected
■ To detect the sparse cells, you have to consider all cells….
exponential to d. Heuristics can be used to find them…
* Data Mining: Concepts and Techniques 19
Example
■ N=100, φ = 5, f = 1/5 = 0.2, N × f2 = 4
* Data Mining: Concepts and Techniques 20