Data Outlier Detection Techniques

data mining

Uploaded by

prathap badam


Outlier Analysis

1
Outlier Analysis
• Outlier: a data object that is grossly different from, or inconsistent with, the remaining data
• Causes
  - Measurement or execution errors
  - Inherent data variability
• Outliers may themselves be valuable patterns
  - Fraud detection
  - Customized marketing
  - Medical analysis

2
Outlier Mining
• Given n data points and k, the expected number of outliers, find the top k objects that are most dissimilar from the rest of the data
• Define which data are inconsistent
  - Example: residuals in regression
  - Difficulties: multi-dimensional data, non-numeric data
• Mine the outliers
  - Visualization-based methods
  - Not applicable to cyclic plots, high-dimensional data, or categorical data
• Approaches
  - Statistical approach
  - Distance-based approach
  - Density-based approach
  - Deviation-based approach

3
Statistical Distribution-based Outlier Detection
• Assumes the data follow a probability distribution F and identifies outliers with a discordancy test
• Discordancy testing
  - Working hypothesis H: o_i ∈ F, i = 1, 2, ..., n
  - The test verifies whether an object o_i is significantly different from F
  - Significance probability: SP(v_i) = Prob(T > v_i), where v_i is the value of the test statistic T for o_i
  - If SP(v_i) is small, o_i is discordant: the working hypothesis is rejected and the alternative hypothesis, that o_i comes from another distribution model G, is adopted
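As a minimal sketch of the significance probability above, assume F is a normal distribution with known mean and standard deviation, and that the test statistic T is simply the observed value. Both are assumptions made for illustration, as are the function names and the cut-off `alpha = 0.01`.

```python
import math

def significance_probability(v, mean, std):
    """SP(v) = Prob(T > v) for T ~ Normal(mean, std): the upper-tail probability."""
    z = (v - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2))

def is_discordant(v, mean, std, alpha=0.01):
    """Reject the working hypothesis H when SP(v) falls below alpha (assumed cut-off)."""
    return significance_probability(v, mean, std) < alpha

print(is_discordant(52.0, mean=50.0, std=5.0))  # False
print(is_discordant(75.0, mean=50.0, std=5.0))  # True
```

This checks only the upper tail; a two-sided test would also look at Prob(T < v).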
4
Statistical Distribution-based Outlier Detection
• Alternative distributions
  - Inherent alternative distribution
    - Alternative hypothesis: all objects arise from another distribution G
  - Mixture alternative distribution
    - Discordant values are not outliers in F but contaminants from G
    - H': o_i ∈ (1 - λ)F + λG, i = 1, 2, ..., n
  - Slippage alternative distribution
    - Some objects are independent observations from a modified version of F (with different parameters)
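The mixture alternative (1 - λ)F + λG can be illustrated by sampling from it. In this rough sketch the samplers `draw_f` and `draw_g`, the two normal distributions and the value of `lam` are all hypothetical choices, not part of the slides.

```python
import random

def sample_mixture(n, lam, draw_f, draw_g, seed=0):
    """Draw n values from (1 - lam) * F + lam * G and record which came from G."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        from_g = rng.random() < lam          # contaminant with probability lam
        value = draw_g(rng) if from_g else draw_f(rng)
        sample.append((value, from_g))
    return sample

# Assumed example: F = Normal(0, 1), G = Normal(8, 1), 5% contamination.
data = sample_mixture(
    1000,
    lam=0.05,
    draw_f=lambda rng: rng.gauss(0.0, 1.0),
    draw_g=lambda rng: rng.gauss(8.0, 1.0),
)
contaminants = [v for v, from_g in data if from_g]
print(len(contaminants), "of", len(data), "values are contaminants from G")
```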

5
Statistical Distribution-based Outlier Detection
• Procedures for detecting outliers
  - Block procedures
    - Either all suspect objects are treated as outliers or all are accepted as consistent
  - Consecutive procedures
    - Inside-out procedure: the object least likely to be an outlier is tested first
    - If it is found to be an outlier, all more extreme values are also considered outliers; otherwise the next most extreme object is tested
• Disadvantages of the statistical approach
  - Tests are for single attributes
  - The data distribution may not be known
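The inside-out procedure can be sketched as a loop over the suspects, ordered from least to most extreme. The `extremeness` ranking and the `is_discordant` predicate are supplied by the caller and are placeholders here; the slides do not prescribe a particular test.

```python
def inside_out(suspects, extremeness, is_discordant):
    """Test suspects from least to most extreme; once one is discordant,
    it and every more extreme suspect are reported as outliers."""
    ordered = sorted(suspects, key=extremeness)
    for i, obj in enumerate(ordered):
        if is_discordant(obj):
            return ordered[i:]
    return []

values = [1, 2, 3, 10, 50]
centre = 3  # assumed reference point (the median of the values)
found = inside_out(
    values,
    extremeness=lambda v: abs(v - centre),
    is_discordant=lambda v: abs(v - centre) > 5,  # stand-in for a real discordancy test
)
print(found)  # [10, 50]
```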
6
Distance-based Outlier Detection

• Distance-based outlier
  - A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O
  - Intuitively, the object does not have enough neighbours
  - Avoids the excessive computation of fitting statistical models
  - If an object O is an outlier according to a discordancy test, then O is also a DB(p, D)-outlier for some p and D
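The DB(p, D) definition translates directly into a brute-force check. This is a naive O(n^2) sketch for illustration, using Euclidean distance as an assumed metric; the function names are not from the slides.

```python
import math

def is_db_outlier(o, data, p, D):
    """True if at least a fraction p of the objects in data lie farther than D from o."""
    far = sum(1 for x in data if math.dist(o, x) > D)
    return far >= p * len(data)

def db_outliers(data, p, D):
    """All DB(p, D)-outliers in data, found by checking every object."""
    return [o for o in data if is_db_outlier(o, data, p, D)]

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(db_outliers(points, p=0.75, D=1.0))  # [(5.0, 5.0)]
```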

7
Distance-based Outlier Detection
• Index-based algorithm
  - Uses multi-dimensional indexing structures such as k-d trees and R-trees to find the neighbours of each object within radius dmin
  - M: the maximum number of objects within the dmin-neighbourhood of an outlier
  - Once M + 1 neighbours of an object o are found, o is not an outlier
  - Worst-case complexity O(n^2 k), apart from index construction (n: number of objects; k: dimensionality)
• Nested-loop algorithm
  - Avoids index construction
  - Tries to minimize the number of I/Os
  - Divides the memory buffer space into two halves and the data set into several logical blocks
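The "stop at M + 1 neighbours" rule can be sketched with a plain in-memory double loop. This is not the block-oriented nested-loop algorithm itself (there is no buffering or I/O scheduling). It also assumes that an object counts as lying in its own dmin-neighbourhood, so that M corresponds to n(1 - p).

```python
import math

def outliers_with_early_exit(data, dmin, M):
    """Report objects with at most M objects (itself included) within distance dmin.
    The inner scan stops as soon as M + 1 such objects have been seen."""
    result = []
    for o in data:
        within = 0
        for x in data:
            if math.dist(o, x) <= dmin:
                within += 1
                if within > M:      # M + 1 found: o cannot be an outlier
                    break
        else:                       # inner loop finished without a break
            result.append(o)
    return result

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(outliers_with_early_exit(points, dmin=1.0, M=1))  # [(5.0, 5.0)]
```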

8
Distance-based Outlier Detection
• Cell-based algorithm
  - Complexity: O(c^k + n), where c is a constant that depends on the number of cells and k is the dimensionality
  - The data space is partitioned into cells with side length dmin / (2√k)
  - Two layers surround each cell
    - First layer: one cell thick
    - Second layer: ⌈2√k - 1⌉ cells thick
  - The algorithm processes cells instead of individual objects
  - Maintains three counts per cell: cell_count, cell_+_1_layer_count, cell_+_2_layers_count
  - An object in a cell can be an outlier only if cell_+_1_layer_count <= M; otherwise no object in the cell is an outlier
  - If cell_+_2_layers_count <= M, all objects in the cell are outliers
  - If cell_+_2_layers_count > M, some objects in the cell may be outliers
    - Object-by-object processing has to be done for these cells
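The cell rules above can be sketched as follows. This is a simplified in-memory illustration, not a tuned implementation: it assumes Euclidean distance, counts an object as part of its own neighbourhood, and enumerates layer cells by brute force, which is only practical for small k.

```python
import math
from collections import defaultdict
from itertools import product

def cell_based_outliers(points, dmin, M):
    """Return indices of objects with at most M objects (itself included) within dmin."""
    k = len(points[0])
    side = dmin / (2 * math.sqrt(k))            # cell side length
    l2 = math.ceil(2 * math.sqrt(k) - 1)        # thickness of the second layer
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(math.floor(x / side) for x in p)].append(idx)

    def layer(cell, lo, hi):
        """Objects in cells whose offset from `cell` (max over axes) lies in [lo, hi]."""
        found = []
        for off in product(range(-hi, hi + 1), repeat=k):
            if lo <= max(abs(o) for o in off) <= hi:
                found.extend(cells.get(tuple(c + o for c, o in zip(cell, off)), ()))
        return found

    outliers = set()
    for cell, members in cells.items():
        count1 = len(members) + len(layer(cell, 1, 1))   # cell_+_1_layer_count
        if count1 > M:
            continue                                     # no outliers in this cell
        layer2 = layer(cell, 2, 1 + l2)
        if count1 + len(layer2) <= M:                    # cell_+_2_layers_count
            outliers.update(members)                     # every object is an outlier
            continue
        for i in members:                                # object-by-object check
            near = sum(math.dist(points[i], points[j]) <= dmin for j in layer2)
            if count1 + near <= M:
                outliers.add(i)
    return outliers

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(sorted(cell_based_outliers(points, dmin=1.0, M=1)))  # [4]
```

Only objects in the second layer need an explicit distance check, because everything in the cell and its first layer is already within dmin and everything beyond the second layer is farther than dmin.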

9
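Density-based Outlier Detection

• The previous methods judge outliers against the global distribution of the data, which implicitly assumes the data are uniformly distributed
• Real data may contain regions of different density
• This makes a single global dmin difficult to choose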

10
Density-based Outlier Detection
• Local outlier: an object that is outlying relative to its local neighbourhood, particularly with respect to the density of that neighbourhood
  - o2 is a local outlier with respect to C2; o1 is also an outlier; none of the objects in C1 are treated as outliers
• Considers the degree to which an object is an outlier, rather than a yes/no decision
  - Local outlier factor (LOF): the degree depends on how isolated the object is with respect to its surroundings

11
Density-based Outlier Detection
• The k-distance of an object p is the largest distance from p to any of its k nearest neighbours. Formally, it is the distance d(p, o) to an object o ∈ D such that
  - at least k objects o' ∈ D are as close as or closer to p than o: d(p, o') <= d(p, o)
  - at most k - 1 objects o'' ∈ D are closer to p than o: d(p, o'') < d(p, o)
• k-distance neighbourhood of p
  - Contains every object whose distance from p is not greater than the k-distance of p, with k = MinPts
• The reachability distance of an object p with respect to an object o is defined as
  reach_dist_MinPts(p, o) = max { MinPts-distance(o), d(p, o) }
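These three definitions can be written down almost literally. The sketch below assumes Euclidean distance and a data set of distinct points, so that excluding p by equality is safe; the function names are illustrative.

```python
import math

def k_distance(p, data, k):
    """Distance from p to its k-th nearest neighbour in data (p itself excluded)."""
    dists = sorted(math.dist(p, o) for o in data if o != p)
    return dists[k - 1]

def k_neighborhood(p, data, k):
    """Every object (other than p) no farther from p than the k-distance of p."""
    kd = k_distance(p, data, k)
    return [o for o in data if o != p and math.dist(p, o) <= kd]

def reach_dist(p, o, data, min_pts):
    """reach_dist_MinPts(p, o) = max { MinPts-distance(o), d(p, o) }."""
    return max(k_distance(o, data, min_pts), math.dist(p, o))

data = [(0,), (1,), (2,), (10,)]
print(k_distance((0,), data, 2))            # 2.0
print(k_neighborhood((0,), data, 2))        # [(1,), (2,)]
print(reach_dist((1,), (0,), data, 2))      # 2.0
print(reach_dist((10,), (0,), data, 2))     # 10.0
```

Note how the reachability distance of a close neighbour is smoothed up to the k-distance of o, while a distant object keeps its actual distance.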
12
OPTICS

• Complexity: O(n log n) when a spatial index supports the neighbourhood queries

13
Density-based Outlier Detection
• The local reachability density of p is the inverse of the average reachability distance from p to its MinPts-nearest neighbours
• The local outlier factor (LOF) of p captures the degree to which we call p an outlier
  - It is the average of the ratios of the local reachability densities of p's MinPts-nearest neighbours to the local reachability density of p
  - LOF is close to 1 for objects deep inside a cluster and higher for outliers
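Putting the pieces together, a small LOF computation might look like the sketch below. It is a brute-force O(n^2) illustration with Euclidean distance, and it assumes there are not so many duplicate points that a reachability sum becomes zero. The function name `lof_scores` is a placeholder.

```python
import math

def lof_scores(data, min_pts):
    """Local outlier factor of every object in data (brute force)."""
    n = len(data)
    dist = [[math.dist(a, b) for b in data] for a in data]
    k_dist, neigh = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[min_pts - 1][0]                   # MinPts-distance of object i
        k_dist.append(kd)
        neigh.append([j for d, j in others if d <= kd])

    def lrd(i):
        """Inverse of the average reachability distance from i to its neighbours."""
        total = sum(max(k_dist[j], dist[i][j]) for j in neigh[i])
        return len(neigh[i]) / total

    dens = [lrd(i) for i in range(n)]
    # LOF(i): average of lrd(neighbour) / lrd(i) over the neighbours of i.
    return [sum(dens[j] for j in neigh[i]) / (dens[i] * len(neigh[i])) for i in range(n)]

data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
for point, score in zip(data, lof_scores(data, min_pts=2)):
    print(point, round(score, 2))
```

In this example the four corner points of the unit square get a LOF of 1, while the isolated point (5, 5) scores well above 1.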

14
Deviation-based Outlier Detection
• Identifies outliers by examining the main characteristics of the objects in a group
• Objects that "deviate" from this description are considered outliers
• Sequential exception technique
  - Simulates the way in which humans can distinguish unusual objects from among a series of supposedly similar objects

15
Deviation-based Outlier Detection
• Sequential exception technique
  - Given a data set D, a sequence of subsets {D_1, D_2, ..., D_m} is built such that D_{j-1} ⊆ D_j; dissimilarities are assessed between subsets in the sequence
  - Exception set: the smallest subset of objects whose removal results in the greatest reduction of dissimilarity
  - Dissimilarity function, for example the variance: (1/n) ∑_{i=1}^{n} (x_i - x̄)^2, where x̄ is the mean of the n values
  - Smoothing factor: assesses how much the dissimilarity can be reduced by removing the subset from the original set of objects
  - The procedure can be repeated with different orderings to avoid the influence of order
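To make the dissimilarity function and smoothing factor concrete, the sketch below uses the variance as the dissimilarity and scales the reduction by the size of the remaining set. That scaling is an assumed choice of cardinality function. It also searches all subsets exhaustively, which is exponential and only meant for tiny examples; the sequential technique itself avoids this by scanning a sequence of subsets.

```python
from itertools import combinations

def dissimilarity(values):
    """Variance: (1/n) * sum((x_i - mean)^2); 0 for an empty set."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def smoothing_factor(values, subset_idx):
    """Reduction in dissimilarity from removing the subset, scaled by the remainder's size."""
    rest = [x for i, x in enumerate(values) if i not in subset_idx]
    return len(rest) * (dissimilarity(values) - dissimilarity(rest))

def exception_set(values):
    """Smallest subset with the largest smoothing factor (exhaustive search)."""
    best, best_sf = (), float("-inf")
    n = len(values)
    for size in range(1, n):                      # smaller subsets are tried first
        for idx in combinations(range(n), size):
            sf = smoothing_factor(values, set(idx))
            if sf > best_sf:                      # strict: ties keep the smaller subset
                best, best_sf = idx, sf
    return [values[i] for i in best], best_sf

print(exception_set([1, 4, 4, 4]))  # ([1], 5.0625)
```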

16
Deviation-based Outlier Detection

• OLAP data cube technique
  - Uses data cubes to identify regions of anomalies
  - A cell value in a cube is an exception if it differs significantly from its expected value
  - Visual cues guide the user to the exceptions
  - The user may drill down into flagged cells
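One simple way to picture "differs significantly from an expected value" is a two-dimensional slice of a cube with an additive expectation per cell (row mean + column mean - overall mean). This model, the threshold of 2.5 standard deviations, and the function name are all assumptions for illustration; the slides do not specify how the expected value is computed.

```python
from statistics import mean, pstdev

def cube_exceptions(table, threshold=2.5):
    """Flag (row, col) cells whose residual from an additive expectation is unusually large."""
    rows, cols = len(table), len(table[0])
    row_mean = [mean(r) for r in table]
    col_mean = [mean(table[i][j] for i in range(rows)) for j in range(cols)]
    grand = mean(row_mean)
    resid = [
        [table[i][j] - (row_mean[i] + col_mean[j] - grand) for j in range(cols)]
        for i in range(rows)
    ]
    spread = pstdev([v for r in resid for v in r])
    if spread == 0:
        return []                                  # perfectly additive: nothing to flag
    return [
        (i, j)
        for i in range(rows)
        for j in range(cols)
        if abs(resid[i][j]) / spread > threshold
    ]

# Rows and columns follow a regular pattern except cell (2, 1), which is inflated by 50.
sales = [
    [10, 11, 12, 13],
    [20, 21, 22, 23],
    [30, 81, 32, 33],
    [40, 41, 42, 43],
]
print(cube_exceptions(sales))  # [(2, 1)]
```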
