Outlier Discovery/Anomaly Detection
Data Mining: Concepts and
July 12, 2019          Techniques           1
                    Anomaly/Outlier Detection
     What are anomalies/outliers?
        The set of data points that are considerably different than
         the remainder of the data
     Variants of Anomaly/Outlier Detection Problems
        Given a database D, find all the data points x  D with
         anomaly scores greater than some threshold t
        Given a database D, find all the data points x  D having
         the top-n largest anomaly scores f(x)
        Given a database D, containing mostly normal (but
         unlabeled) data points, and a test point x, compute the
         anomaly score of x with respect to D
    July 12, 2019           Data Mining: Concepts and Techniques       2
                        Applications
          Credit card fraud detection
          telecommunication fraud detection
          network intrusion detection
          fault detection
          many more
July 12, 2019           Data Mining: Concepts and Techniques   3
                         Anomaly Detection
    Challenges
       How many outliers are there in the data?
       Method is unsupervised
                 Validation can be quite challenging (just like for
                clustering)
           Finding needle in a haystack
    Working assumption:
       There are considerably more “normal”
        observations than “abnormal” observations
        (outliers/anomalies) in the data
July 12, 2019                   Data Mining: Concepts and Techniques   4
                    Anomaly Detection Schemes
     General Steps
        Build a profile of the “normal” behavior
                   Profile can be patterns or summary statistics for the overall
                    population
           Use the “normal” profile to detect anomalies
                   Anomalies are observations whose characteristics
                    differ significantly from the normal profile
     Types of anomaly detection
      schemes
        Graphical & Statistical-based
        Distance-based
        Model-based
July 12, 2019                        Data Mining: Concepts and Techniques           5
                Graphical Approaches
     Boxplot (1-D), Scatter plot (2-D), Spin plot (3-D)
     Limitations
        Time consuming
        Subjective
July 12, 2019          Data Mining: Concepts and Techniques   6
                 Convex Hull Method
     Extreme points are assumed to be outliers
     Use convex hull method to detect extreme values
     What if the outlier occurs in the middle of the
      data?
July 12, 2019          Data Mining: Concepts and Techniques   7
                    Statistical Approaches
     Assume a parametric model describing the distribution of the data (e.g.,
      normal distribution)
     Apply a statistical test that depends on
        Data distribution
        Parameter of distribution (e.g., mean, variance)
        Number of expected outliers (confidence limit)
July 12, 2019                   Data Mining: Concepts and Techniques             8
                       Grubbs’ Test
    Detect outliers in univariate data
    Assume data comes from normal distribution
    Detects one outlier at a time, remove the outlier, and
     repeat
       H0: There is no outlier in data
       HA: There is at least one outlier
    Grubbs’ test statistic:      max X  X
                            G
                                               s
    Reject H0 if:
                        ( N  1)                   t (2 / N , N 2 )
                     G
                            N              N  2  t (2 / N , N 2 )
July 12, 2019            Data Mining: Concepts and Techniques           9
                Statistical-based – Likelihood
                           Approach
    Assume the data set D contains samples from a mixture of two
     probability distributions:
        M (majority distribution)
        A (anomalous distribution)
    General Approach:
        Initially, assume all the data points belong to M
        Let Lt(D) be the log likelihood of D at time t
        For each point xt that belongs to M, move it to A
           Let Lt+1 (D) be the new log likelihood.
           Compute the difference,  = Lt(D) – Lt+1 (D)
           If  > c (some threshold), then xt is declared as an anomaly
          and moved permanently from M to A
July 12, 2019               Data Mining: Concepts and Techniques           10
                Statistical-based – Likelihood
                           Approach
    Data distribution, D = (1 – ) M +  A
    M is a probability distribution estimated from data
       Can be based on any modeling method
       A is initially assumed to be uniform distribution
    Likelihood at time t:
                 N                                      |At |            
     Lt ( D )   PD ( xi )   (1   )  PM t ( xi )    PAt ( xi ) 
                                         |M t |
                i 1                           xi M t   xiAt            
     LLt ( D )  M t log( 1   )   log PM t ( xi )  At log    log PAt ( xi )
                                    xi M t                           xi At
July 12, 2019                  Data Mining: Concepts and Techniques                   11
      Limitations of Statistical Approaches
    Most of the tests are for a single attribute
    In many cases, data distribution may not be
     known
    For multi-dimensional data, it may be difficult to
     estimate the true distribution
July 12, 2019          Data Mining: Concepts and Techniques   12
                Distance-based Approaches
     Data is represented as a vector of features
     Three major approaches
        Nearest-neighbor based
        Density based
        Clustering based
July 12, 2019          Data Mining: Concepts and Techniques   13
           Nearest-Neighbor Based Approach
     Approach:
        Compute the distance between every pair of data points
          There are various ways to define outliers:
             Data points for which there are fewer than p neighboring
              points within a distance D
                   The top n data points whose distance to the kth nearest
                    neighbor is greatest
                   The top n data points whose average distance to the k nearest
                    neighbors is greatest
July 12, 2019                       Data Mining: Concepts and Techniques            14
                     Density-based: LOF approach
   For each point, compute the density of its local
    neighborhood
   Compute local outlier factor (LOF) of a sample p as the
    average of the ratios of the density of sample p and the
    density of its nearest neighbors
   Outliers are points with largest LOF value
                                                        In the NN approach, p2 is
                                                        not considered as outlier,
                                                        while LOF approach find
                                                        both p1 and p2 as outliers
                p2
                          p1
                       
July 12, 2019                   Data Mining: Concepts and Techniques                 15
                                                    LOF
      The local outlier factor LOF, is defined as follows:
                                                lrd k (o)
                                    oNk ( p ) lrd ( p)
                        LOFk ( p)                 k
                                       | N k ( p) |
 where Nk(p) is the set of k-nearest neighbors to p
                                          | N k ( p) |
 and lrd k ( p) 
                            oN k   ( p)
                                          reach  dist ( p, o)
 reach  dist k ( p)  max{ k  dist (o), dist ( p, o)}
July 12, 2019                              Data Mining: Concepts and Techniques   16
                       Clustering-Based
     Basic idea:
        Cluster the data into groups of
         different density
        Choose points in small cluster
         as candidate outliers
        Compute the distance between
         candidate points and non-
         candidate clusters.
            If candidate points are far
              from all other non-candidate
              points, they are outliers
July 12, 2019                 Data Mining: Concepts and Techniques   17
        Outliers in Lower Dimensional Projection
     Divide each attribute into  equal-depth intervals
         Each interval contains a fraction f = 1/ of the records
     Consider a d-dimensional cube created by picking grid ranges
      from d different dimensions
         If attributes are independent, we expect region to contain
          a fraction fk of the records
         If there are N points, we can measure sparsity of a cube
          D as:
            Negative sparsity indicates cube contains smaller number
             of points than expected
            To detect the sparse cells, you have to consider all cells….
             exponential to d. Heuristics can be used to find them…
    July 12, 2019              Data Mining: Concepts and Techniques         19
                       Example
    N=100,  = 5, f = 1/5 = 0.2, N  f2 = 4
July 12, 2019        Data Mining: Concepts and Techniques   20