Outlier Analysis
1
Outlier Analysis
   Outlier – data objects that are grossly different from or
    inconsistent with the remaining set of data
   Causes
       Measurement / Execution errors
       Inherent data variability
   Outliers – maybe valuable patterns
       Fraud detection
       Customized marketing
       Medical Analysis
                                                                2
    Outlier Mining
   Given n data points and k – expected number of
    outliers find the top k dissimilar objects
       Define inconsistent data
           Residuals in Regression
           Difficulties – Multi-dimensional data, non-numeric data
       Mine the outliers
           Visualization based methods
               Not applicable to cyclic plots, high dimensional data and categorical data
   Approaches
       Statistical Approach
       Distance-based approach
       Density based outlier approach
       Deviation based approach
                                                                                             3
Statistical Distribution-based Outlier
detection
   Assumes data follows a probability distribution and uses
    discordancy test
   Discordancy testing
       Working hypothesis – H: oi ∈ F i=1,2,..n
       Test verifies whether an object oi is significantly different from F
       Significance probability SP(vi) = Prob(T>vi)
       IF SP is small oi is discordant and working hypothesis is rejected
        and alternate hypothesis that oi comes from another distribution
        model G is adopted
                                                                               4
Statistical Distribution-based Outlier
detection
   Alternative distributions
       Inherent alternative distribution
           Alternative hypothesis: All objects arise from another distribution G
       Mixture alternative distribution
           Discordant values are not outliers but contaminants from G H’: oi ∈ (1-
            λ) F + λG i=1,2,..n
       Slippage alternative distribution
           Some Objects are independent observations from a modified version
            of F (different parameters)
                                                                                    5
Statistical Distribution-based Outlier
detection
   Procedures for detecting Outliers
       Block procedures
           All are outliers or all are consistent
       Consecutive Procedures
           Inside-out procedure: Least likely object is tested first
           If it is an outlier – more extreme values are also considered as outliers
   Disadvantages of Statistical Approach
       Tests are for single attributes
       Data distribution may not be known
                                                                                   6
Distance based Outlier Detection
   Distance-based outlier
       A DB(p, D)-outlier is an object O in a dataset T such that at least
        a fraction p of the objects in T lies at a distance greater than D
        from O
       Object does not have enough neighbours
       Avoids excessive computation of Statistical models
       If an object is an outlier according to a discordancy test then o is
        DB(p, D) outlier for some p and D
                                                                               7
    Distance based Outlier Detection
   Index based Algorithm
       Uses multi-dimensional indexing structures such as k-d trees and R-trees
       M – maximum number of objects within dmin neighborhood
       Once M+1 neighbours are found o is not an outlier
       O(n2k) apart from index construction
   Nested loop algorithm
       Avoids index construction
       Tries to minimize I/Os
       Divides memory buffer space into two halves and data set into several logical
        blocks
                                                                                   8
Distance based Outlier Detection
   Cell based Algorithm
       Complexity : O(ck +n) c- depends on number of cells ; k – dimensionality
       Data space is partitioned into cells: dmin / 2√k
       Two layers surround each cell
           First layer – One cell thick
           Second layer -  2√k-1  cells thick
       Algorithm processes cells instead of objects
       Maintains three counts: cell_count, cell_+_1_layer_count,
        cell_+_2_layers_count
       An object in a cell is an outlier if cell_+_1_layer_count <= M, if not, no
        objects in the cell are outliers
       If cell_+_2_layers_count, <= M then all objects in cell – Outliers
           If > M some may be outliers
           Object by object processing has to be done
                                                                                     9
Density based Outlier detection
   Previous methods assume data are uniformly
    distributed
   Data may have different density distributions
       Difficulty in choosing dmin
                                                    10
Density based Outlier detection
   Local Outlier – if its outlying relative to its local
    neighbourhood particularly wrt the density of the
    neighborhood
       O2 is a local outlier wrt C2; o1 is also an outlier; none of the objects
        in C1 are treated as outliers
   Considers degree to which an object is an outlier
       Local Outlier factor – degree depends on how isolated the object is
        wrt its surroundings
                                                                             11
Density based Outlier detection
   The k-distance of an object p is the maximal distance that p gets
    from its k-nearest neighbors d(p, o)
       there are at least k objects in D that are as close as or closer to p than o;
        for k o’ d(p, o’) <= d(p, o)
       there are at most k-1 objects that are closer to p than o; for k-1 o” d(p,
        o”) < d(p, o)
   k-distance neighborhood
       contains every object whose distance is not greater than the MinPts (k)-
        distance of p
   The reachability distance of an object p with respect to object o, is
    defined as reach_distMinPts(p, o) = max { MinPts-distance(o), d(p, o) }
                                                                                     12
OPTICS
   Complexity : O(n log n)
                              13
Density based Outlier detection
   Local reachability density of p is the inverse of the
    average reachability density based on the MinPts-
    nearest neighbors of p.
   Local outlier factor (LOF) of p captures the degree to
    which we call p an outlier.
       It is the average of the ratio of the local reachability density of p
        and those of p’s MinPts-nearest neighbors.
       LOF is higher for outliers
                                                                                14
Deviation based Outlier detection
   Identifies outliers by examining the main characteristics
    of objects in a group
   Objects that “deviate” from this description are
    considered outliers
   Sequential exception technique
       Simulates the way in which humans can distinguish unusual
        objects from among a series of supposedly like objects
                                                                    15
    Deviation based Outlier detection
   Sequential exception technique
       Given a data set D a sequence of subsets {D1, D2, ..Dm} is built
        such that Dj-1 ⊆ Dj; Dissimilarities are assessed between
        subsets in the sequence
       Exception Set – Smallest subset of objects whose removal
        results in greatest reduction of dissimilarity
       Dissimilarity function – 1/n ∑i=1 n (xi-x’)2
       Smoothing factor: Assesses how much the dissimilarity can be
        reduced by removing the subset from the original set of objects
       Can be repeated to avoid the influence of order
                                                                           16
Deviation based Outlier detection
   OLAP Data Cube technique
       Uses data cubes to identify regions of anomalies
       A cell value in a cube is an exception if it differs
        significantly from an expected value
       Visualization effects guide user
           May drill down
                                                               17