Anomaly Detection
Anomaly/Outlier Detection
• What are anomalies/outliers?
• The set of data points that are considerably different from the remainder of the data
• Variants of Anomaly/Outlier Detection Problems
• Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
• Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
• Given a database D, containing mostly normal (but unlabeled) data points, and a test point x,
compute the anomaly score of x with respect to D
• Applications:
• Credit card fraud detection, telecommunication fraud detection, network intrusion
detection, fault detection
Anomaly Detection
• Challenges
• How many outliers are there in the data?
• Method is unsupervised
• Validation can be quite challenging (just like for clustering)
• Like finding a needle in a haystack
• Working assumption:
• There are considerably more “normal” observations than “abnormal”
observations (outliers/anomalies) in the data
Anomaly Detection Schemes
• General Steps
• Build a profile of the “normal” behavior
• Profile can be patterns or summary statistics for the overall population
• Use the “normal” profile to detect anomalies
• Anomalies are observations whose characteristics
differ significantly from the normal profile
• Types of anomaly detection
schemes
• Graphical & Statistical-based
• Distance-based
• Model-based
Graphical Approaches
• Boxplot (1-D), Scatter plot (2-D), Spin plot (3-D)
• Limitations
• Time consuming
• Subjective
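The boxplot encodes a simple numeric rule that can also be applied programmatically. A minimal sketch, assuming the conventional 1.5 × IQR whisker rule (the cutoff multiplier k is a tunable assumption):

```python
import numpy as np

def boxplot_outliers(x, k=1.5):
    """Flag values beyond the boxplot whiskers (k * IQR outside the quartiles)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return x[(x < lo) | (x > hi)]

x = np.array([2.1, 2.3, 1.9, 2.0, 2.2, 9.7])  # 9.7 is far from the rest
print(boxplot_outliers(x))                     # -> [9.7]
```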
Convex Hull Method
• Extreme points are assumed to be outliers
• Use convex hull method to detect extreme values
• What if the outlier occurs in the middle of the data?
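A minimal sketch of the idea, assuming SciPy's ConvexHull is available; points on the hull boundary become extreme-value candidates:

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))   # mostly "normal" 2-D data

hull = ConvexHull(points)
extreme = points[hull.vertices]      # hull vertices are the extreme points
print(f"{len(hull.vertices)} extreme points flagged as outlier candidates")
```

Note the limitation from the slide: a point buried in the middle of the data can never land on the hull, so this scheme misses interior anomalies.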
Statistical Approaches
• Assume a parametric model describing the distribution of the data
(e.g., normal distribution)
• Apply a statistical test that depends on
• Data distribution
• Parameter of distribution (e.g., mean, variance)
• Number of expected outliers (confidence limit)
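A minimal sketch of such a test under an assumed normal distribution; the 3-standard-deviation cutoff is an assumption standing in for a chosen confidence limit:

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean,
    assuming the data are approximately normally distributed."""
    mu, sigma = x.mean(), x.std()
    z = np.abs(x - mu) / sigma
    return x[z > threshold]
```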
Statistical-based – Likelihood Approach
• Assume the data set D contains samples from a mixture of two
probability distributions:
• M (majority distribution)
• A (anomalous distribution)
• General Approach:
• Initially, assume all the data points belong to M
• Let LLt(D) be the log likelihood of D at time t
• For each point xt that belongs to M, move it to A
• Let LLt+1(D) be the new log likelihood
• Compute the change, Δ = LLt+1(D) − LLt(D); moving a true anomaly to A increases the likelihood, because PM(xt) is very small
• If Δ > c (some threshold), then xt is declared an anomaly and moved permanently from M to A
Statistical-based – Likelihood Approach
• Data distribution: D = (1 − λ) M + λ A, where λ is the expected fraction of anomalous points
• M is a probability distribution estimated from the data
• Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)
• A is initially assumed to be a uniform distribution
• Likelihood at time t:

$$L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)$$

$$LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$$
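A minimal sketch of the procedure, assuming a Gaussian majority distribution M (re-estimated from the points currently in M) and a uniform anomalous distribution A over the data range; the values of λ and the threshold c below are assumptions:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(x, in_A, lam, p_uniform):
    """LL(D) for the mixture: M is a Gaussian fit to the points currently
    in M; A is a uniform distribution with density p_uniform."""
    M = x[~in_A]
    n_A = int(in_A.sum())
    ll = len(M) * np.log(1 - lam) + norm.logpdf(M, M.mean(), M.std()).sum()
    if n_A:
        ll += n_A * (np.log(lam) + np.log(p_uniform))
    return ll

def likelihood_anomalies(x, lam=0.05, c=5.0):
    p_uniform = 1.0 / (x.max() - x.min())  # density of the uniform A over the data range
    in_A = np.zeros(len(x), dtype=bool)    # initially, every point belongs to M
    for i in range(len(x)):
        before = log_likelihood(x, in_A, lam, p_uniform)
        in_A[i] = True                     # tentatively move x_i from M to A
        after = log_likelihood(x, in_A, lam, p_uniform)
        if after - before <= c:            # likelihood gain too small: keep x_i in M
            in_A[i] = False
    return x[in_A]

x = np.concatenate([np.random.default_rng(0).normal(0, 1, 200), [8.0, -9.5]])
print(likelihood_anomalies(x))             # the two injected extremes should appear
```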
Limitations of Statistical Approaches
• Most of the tests are for a single attribute
• In many cases, data distribution may not be known
• For high dimensional data, it may be difficult to estimate the true
distribution
Distance-based Approaches
• Data is represented as a vector of features
• Three major approaches
• Nearest-neighbor based
• Density based
• Clustering based
Nearest-Neighbor Based Approach
• Approach:
• Compute the distance between every pair of data points
• There are various ways to define outliers:
• Data points for which there are fewer than p neighboring points within a distance D
• The top n data points whose distance to the kth nearest neighbor is greatest
• The top n data points whose average distance to the k nearest neighbors is greatest
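A minimal sketch of the second definition above (distance to the kth nearest neighbor as the anomaly score), assuming scikit-learn is available:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outliers(X, k=5, n=3):
    """Score each point by its distance to its k-th nearest neighbor;
    return the indices of the top-n highest-scoring points."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own 0-th neighbor
    dists, _ = nbrs.kneighbors(X)
    scores = dists[:, -1]                              # distance to the k-th true neighbor
    return np.argsort(scores)[-n:]                     # indices of the top-n anomaly scores
```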
Outliers in Lower Dimensional Projection
• Divide each attribute into φ equal-depth intervals
• Each interval contains a fraction f = 1/φ of the records
• Consider a k-dimensional cube created by picking grid ranges from k different dimensions
• If attributes are independent, we expect the region to contain a fraction f^k of the records
• If there are N points, we can measure the sparsity of a cube D as

$$S(D) = \frac{n(D) - N \cdot f^k}{\sqrt{N \cdot f^k \,(1 - f^k)}}$$

where n(D) is the number of points that fall in cube D
• Negative sparsity indicates that the cube contains fewer points than expected
Example
• N = 100, φ = 5, f = 1/5 = 0.2, N × f² = 4 (for k = 2)
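A minimal sketch of the sparsity computation for this example (N = 100, f = 0.2, k = 2); the cell count n(D) = 0 is an assumption chosen to show a sparse cell:

```python
import math

def sparsity(n_D, N, f, k):
    """Sparsity coefficient of a k-dimensional grid cell holding n_D points."""
    expected = N * f**k                   # expected count under independence
    return (n_D - expected) / math.sqrt(expected * (1 - f**k))

# Expected count per 2-D cell is 100 * 0.2**2 = 4; an empty cell scores
# strongly negative (sparse), marking its contents as outlier candidates.
print(sparsity(0, N=100, f=0.2, k=2))     # about -2.04
```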
Density-based: LOF approach
• For each point, compute the density of its local neighborhood
• Compute local outlier factor (LOF) of a sample p as the average of the ratios of the density
of sample p and the density of its nearest neighbors
• Outliers are points with largest LOF value
• In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers
[Figure: two clusters of different density, with points p1 and p2 (marked ×) lying outside them]
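A minimal sketch using scikit-learn's LocalOutlierFactor, which implements this density-ratio idea; the data layout and n_neighbors value are assumptions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(100, 2)),  # dense cluster
    rng.normal(3.0, 1.0, size=(100, 2)),  # sparse cluster
    [[0.0, 1.0]],                         # p1-like point: far from the dense cluster
])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)               # -1 marks outliers
print("LOF scores of flagged points:", -lof.negative_outlier_factor_[labels == -1])
```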
Clustering-Based
• Basic idea:
• Cluster the data into groups of different
density
• Choose points in small clusters as candidate outliers
• Compute the distance between
candidate points and non-candidate
clusters.
• If candidate points are far from all other
non-candidate points, they are outliers
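A minimal sketch of the clustering route, using DBSCAN as the (assumed) density-based clusterer; eps, min_samples, and the small-cluster cutoff are assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def clustering_outliers(X, eps=0.5, min_samples=5, small=3):
    """Candidate outliers: DBSCAN noise points (label -1) plus all members
    of clusters smaller than `small`."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    candidates = labels == -1
    for lbl in set(labels) - {-1}:
        if np.sum(labels == lbl) < small:
            candidates |= labels == lbl
    return np.where(candidates)[0]        # indices of candidate outliers
```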
Base Rate Fallacy
• Bayes theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

• More generally, if A_1, …, A_n partition the sample space:

$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\,P(A_j)}$$
Base Rate Fallacy (Axelsson, 1999)
Base Rate Fallacy
• Even though the test is 99% accurate, your chance of having the disease given a positive result is only about 1/100, because the population of healthy people is much larger than that of sick people
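A worked instance of the arithmetic, with S = "has the disease" and A = "test is positive"; the 1/10,000 prevalence and 99% accuracy figures are assumed here as the standard illustration:

$$P(S \mid A) = \frac{P(A \mid S)\,P(S)}{P(A \mid S)\,P(S) + P(A \mid \neg S)\,P(\neg S)} = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} \approx 0.0098 \approx \frac{1}{100}$$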
Base Rate Fallacy in Intrusion Detection
• I: intrusive behavior; ¬I: non-intrusive behavior
• A: alarm; ¬A: no alarm
• Detection rate (true positive rate): P(A|I)
• False alarm rate: P(A|¬I)
• Goal is to maximize both
• Bayesian detection rate, P(I|A)
• P(¬I|¬A)
Detection Rate vs False Alarm Rate
• Suppose intrusions are rare, so the base rate P(I) is very small
• Then, by Bayes' theorem:

$$P(I \mid A) = \frac{P(A \mid I)\,P(I)}{P(A \mid I)\,P(I) + P(A \mid \neg I)\,P(\neg I)}$$

• The false alarm term P(A|¬I)·P(¬I) dominates the denominator when P(I) is very low
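As a numeric illustration (the figures P(I) = 2×10⁻⁵, a perfect detection rate P(A|I) = 1, and a false alarm rate P(A|¬I) = 10⁻³ are assumptions in the spirit of Axelsson's analysis):

$$P(I \mid A) = \frac{1 \times 2\times 10^{-5}}{1 \times 2\times 10^{-5} + 10^{-3} \times (1 - 2\times 10^{-5})} \approx 0.02$$

Even a perfect detector with a 0.1% false alarm rate would leave only about a 2% chance that a given alarm is a real intrusion.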
Detection Rate vs False Alarm Rate
• Axelsson: We need a very low false alarm rate to achieve a reasonable Bayesian
detection rate