Outlier Detection

The document discusses outlier detection and anomaly detection in data mining. It defines outliers as data points that are considerably different from the majority of data points. It describes different types of anomaly detection problems and applications. It also outlines several approaches to anomaly detection, including graphical, statistical, distance-based, and clustering-based methods. Each approach has its own advantages and limitations for detecting outliers in datasets.

Uploaded by Savitha Vasanthan

Outlier Discovery/Anomaly Detection

Data Mining: Concepts and Techniques (July 12, 2019)

Anomaly/Outlier Detection
• What are anomalies/outliers?
  - The set of data points that are considerably different from the remainder of the data
• Variants of anomaly/outlier detection problems
  - Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
  - Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
  - Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
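
The first two problem variants can be sketched as follows, assuming the anomaly scores f(x) have already been computed by some method. The function names and the sample scores are illustrative and do not come from the slides.

```python
def above_threshold(scores, t):
    """Indices of points whose anomaly score is greater than the threshold t."""
    return [i for i, s in enumerate(scores) if s > t]


def top_n(scores, n):
    """Indices of the n points with the largest anomaly scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


scores = [0.1, 0.9, 0.3, 0.7]  # made-up scores f(x) for four points
print(above_threshold(scores, 0.5))  # points with f(x) > 0.5
print(top_n(scores, 2))              # the two highest-scoring points
```

The threshold variant needs a meaningful scale for the scores, while the top-n variant only needs their ranking.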


Applications

• Credit card fraud detection
• Telecommunication fraud detection
• Network intrusion detection
• Fault detection
• Many more


Anomaly Detection

• Challenges
  - How many outliers are there in the data?
  - The method is unsupervised
    - Validation can be quite challenging (just like for clustering)
  - Finding a needle in a haystack
• Working assumption:
  - There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data
Anomaly Detection Schemes
• General steps
  - Build a profile of the “normal” behavior
    - The profile can be patterns or summary statistics for the overall population
  - Use the “normal” profile to detect anomalies
    - Anomalies are observations whose characteristics differ significantly from the normal profile
• Types of anomaly detection schemes
  - Graphical and statistical-based
  - Distance-based
  - Model-based
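
The two general steps can be sketched with the simplest possible profile: summary statistics. This is only an illustration under stated assumptions. The profile is taken to be the mean and standard deviation of one attribute, and "differ significantly" is taken to mean more than `max_z` standard deviations from the mean; neither choice comes from the slides.

```python
import statistics


def build_profile(values):
    """Step 1: summarize "normal" behavior as (mean, standard deviation)."""
    return statistics.fmean(values), statistics.stdev(values)


def detect_anomalies(values, profile, max_z=3.0):
    """Step 2: flag observations that deviate strongly from the profile."""
    mean, std = profile
    return [x for x in values if abs(x - mean) / std > max_z]


history = [10, 11, 9, 10, 10, 12, 8]  # made-up, mostly normal observations
profile = build_profile(history)
print(detect_anomalies([10, 30, 9], profile))  # only 30 is far from the profile
```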


Graphical Approaches

• Boxplot (1-D), scatter plot (2-D), spin plot (3-D)
• Limitations
  - Time consuming
  - Subjective
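
A boxplot's whiskers can also be computed rather than read off a chart, which removes some of the subjectivity. The sketch below assumes the common convention that points more than 1.5 × IQR beyond the quartiles are drawn as outliers; the slides do not state which whisker rule they have in mind, and the data are made up.

```python
import statistics


def boxplot_outliers(values, whisker=1.5):
    """Return values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [x for x in values if x < low or x > high]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
print(boxplot_outliers(data))  # 50 lies far above the upper fence
```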


Convex Hull Method

• Extreme points are assumed to be outliers
• Use the convex hull method to detect extreme values
• What if the outlier occurs in the middle of the data?
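
A small sketch of the idea for 2-D points follows. It uses Andrew's monotone chain algorithm to find the hull vertices; the choice of algorithm and the sample points are assumptions made for illustration, since the slides do not name a specific hull procedure.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices of 2-D points in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:  # build the lower chain left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):  # build the upper chain right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
extreme = convex_hull(points)
print(extreme)  # the four corners; interior points are never reported
```

Interior points such as (2, 2) can never appear on the hull, which is exactly the limitation the last bullet raises: an outlier sitting in a gap in the middle of the data is not an extreme point.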
Statistical Approaches
• Assume a parametric model describing the distribution of the data (e.g., normal distribution)
• Apply a statistical test that depends on
  - Data distribution
  - Parameters of the distribution (e.g., mean, variance)
  - Number of expected outliers (confidence limit)


Grubbs’ Test
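• Detects outliers in univariate data
• Assumes the data come from a normal distribution
• Detects one outlier at a time: remove the outlier and repeat
  - H0: There is no outlier in the data
  - HA: There is at least one outlier
• Grubbs’ test statistic:

  $$G = \frac{\max_i |X_i - \bar{X}|}{s}$$

• Reject H0 if:

  $$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N - 2 + t^2_{(\alpha/N,\,N-2)}}}$$

  where $\bar{X}$ and $s$ are the sample mean and standard deviation, $N$ is the number of points, and $t_{(\alpha/N,\,N-2)}$ is the critical value of the t distribution with $N-2$ degrees of freedom at significance level $\alpha/N$. (The level $\alpha/N$ corresponds to the one-sided test; the two-sided form uses $\alpha/(2N)$.)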
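
The two quantities in Grubbs' test, the statistic G = max|X_i - mean| / s and the rejection bound ((N - 1) / sqrt(N)) · sqrt(t² / (N - 2 + t²)), can be computed as in the sketch below. The t critical value is passed in by the caller (from a table or a statistics package), so no particular quantile routine is assumed; the function names and the sample data are illustrative.

```python
import math
import statistics


def grubbs_statistic(values):
    """Return (G, index) for the point farthest from the sample mean."""
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation
    index = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return abs(values[index] - mean) / s, index


def grubbs_critical(n, t_crit):
    """Rejection bound for G given the t critical value with n - 2 d.o.f."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))


values = [1, 2, 3, 4, 100]  # made-up data with one suspicious point
g, idx = grubbs_statistic(values)
print(round(g, 3), idx)  # G and the position of the candidate outlier
```

To apply the test repeatedly, compare `g` with `grubbs_critical(len(values), t_crit)`, remove `values[idx]` if H0 is rejected, and recompute on the remaining points.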
Statistical-based – Likelihood Approach
• Assume the data set D contains samples from a mixture of two probability distributions:
  - M (majority distribution)
  - A (anomalous distribution)
• General approach:
  - Initially, assume all the data points belong to M
  - Let L_t(D) be the log likelihood of D at time t
  - For each point x_t that belongs to M, move it to A
    - Let L_{t+1}(D) be the new log likelihood
    - Compute the difference, Δ = L_t(D) - L_{t+1}(D)
    - If Δ > c (some threshold), then x_t is declared an anomaly and moved permanently from M to A


Statistical-based – Likelihood Approach
• Data distribution: D = (1 - λ) M + λ A
• M is a probability distribution estimated from the data
  - Can be based on any modeling method
• A is initially assumed to be a uniform distribution
• Likelihood at time t:

  $$L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)$$

  $$LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$$
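
The procedure can be sketched for one-dimensional data as below. Several choices are assumptions made only for illustration: M is modeled as a Gaussian re-estimated from the points currently in M, A is uniform over the observed data range, and the values of λ (`lam`) and c are arbitrary. The sketch flags a point when moving it to A raises the log likelihood by more than c; the slides write the difference as L_t(D) - L_{t+1}(D), so the sign convention used here is also an assumption.

```python
import math
import statistics


def log_likelihood(m_points, a_points, lam, a_density):
    """LL = |M| log(1-lam) + sum log P_M(x) + |A| log(lam) + sum log P_A(x)."""
    ll = 0.0
    if m_points:
        mu = statistics.fmean(m_points)
        sigma = statistics.pstdev(m_points) or 1e-9  # guard against zero spread
        ll += len(m_points) * math.log(1 - lam)
        for x in m_points:
            z = (x - mu) / sigma
            ll += -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))
    ll += len(a_points) * (math.log(lam) + math.log(a_density))
    return ll


def likelihood_anomalies(data, lam=0.05, c=1.0):
    """Move points from M to A when doing so raises the log likelihood by > c."""
    a_density = 1.0 / (max(data) - min(data))  # uniform A; needs distinct values
    normal, anomalies = list(data), []
    for x in data:
        before = log_likelihood(normal, anomalies, lam, a_density)
        trial = list(normal)
        trial.remove(x)  # tentatively move x from M to A
        after = log_likelihood(trial, anomalies + [x], lam, a_density)
        if after - before > c:
            normal, anomalies = trial, anomalies + [x]
    return anomalies


data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0]  # made-up measurements
print(likelihood_anomalies(data))  # the isolated value 25.0 is moved to A
```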


Limitations of Statistical Approaches

• Most of the tests are for a single attribute
• In many cases, the data distribution may not be known
• For multi-dimensional data, it may be difficult to estimate the true distribution


Distance-based Approaches

• Data is represented as a vector of features
• Three major approaches
  - Nearest-neighbor based
  - Density based
  - Clustering based


Nearest-Neighbor Based Approach
• Approach:
  - Compute the distance between every pair of data points
  - There are various ways to define outliers:
    - Data points for which there are fewer than p neighboring points within a distance D
    - The top n data points whose distance to the kth nearest neighbor is greatest
    - The top n data points whose average distance to the k nearest neighbors is greatest
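
The second definition (top n points by distance to the kth nearest neighbor) can be sketched with a brute-force pairwise distance computation. Euclidean distance and the sample points are assumptions; the slides do not fix a distance measure.

```python
import math


def kth_neighbor_distances(points, k):
    """For each point, the distance to its kth nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_n_outliers(points, k, n):
    """Indices of the n points with the largest kth-neighbor distance."""
    scores = kth_neighbor_distances(points, k)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(top_n_outliers(points, k=2, n=1))  # the isolated point at index 4
```

Replacing `dists[k - 1]` with the mean of `dists[:k]` gives the third definition (average distance to the k nearest neighbors). The pairwise computation is quadratic in the number of points.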


Density-based: LOF approach
• For each point, compute the density of its local neighborhood
• Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of each of its nearest neighbors to the density of p
• Outliers are the points with the largest LOF values

[Figure: scatter plot with two marked points, p1 and p2.] In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.


LOF

The local outlier factor (LOF) is defined as follows:

$$\mathrm{LOF}_k(p) = \frac{\sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}}{|N_k(p)|}$$

where $N_k(p)$ is the set of k-nearest neighbors of p, and

$$\mathrm{lrd}_k(p) = \frac{|N_k(p)|}{\sum_{o \in N_k(p)} \text{reach-dist}_k(p, o)}$$

$$\text{reach-dist}_k(p, o) = \max\{\, k\text{-dist}(o),\ \mathrm{dist}(p, o) \,\}$$
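
A brute-force sketch of these definitions follows: reachability distance, local reachability density (lrd), then LOF as the average ratio of the neighbors' lrd to the point's own lrd. It assumes Euclidean distance, distinct points, and exactly k neighbors per point (ties at the kth distance are broken arbitrarily), which simplifies the general definition of N_k(p).

```python
import math


def lof_scores(points, k):
    """Local outlier factor of every point, by brute force."""
    n = len(points)
    neighbors, k_dist = [], []
    for i in range(n):
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append(ranked[:k])   # (distance, index) of the k nearest
        k_dist.append(ranked[k - 1][0])  # k-dist(i)

    def lrd(i):
        # reach-dist_k(i, o) = max(k-dist(o), dist(i, o))
        reach_sum = sum(max(k_dist[o], d) for d, o in neighbors[i])
        return len(neighbors[i]) / reach_sum

    lrds = [lrd(i) for i in range(n)]
    return [
        sum(lrds[o] for _, o in neighbors[i]) / (len(neighbors[i]) * lrds[i])
        for i in range(n)
    ]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
print([round(s, 2) for s in scores])  # about 1 inside the cluster, larger for (5, 5)
```

Scores near 1 mean a point is about as dense as its neighbors; scores well above 1 mean it sits in a sparser region than they do.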


Clustering-Based
• Basic idea:
  - Cluster the data into groups of different density
  - Choose points in small clusters as candidate outliers
  - Compute the distance between candidate points and non-candidate clusters
    - If candidate points are far from all other non-candidate points, they are outliers
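
The steps after clustering can be sketched as below. The sketch assumes cluster labels have already been produced by some clustering algorithm, and it introduces two hypothetical parameters that the slides leave open: `min_size` (clusters smaller than this are "small") and `min_dist` (how far counts as "far").

```python
import math
from collections import defaultdict


def cluster_based_outliers(points, labels, min_size, min_dist):
    """Flag points of small clusters that are far from all other points."""
    clusters = defaultdict(list)
    for p, label in zip(points, labels):
        clusters[label].append(p)

    # Points in sufficiently large clusters are the non-candidates.
    non_candidates = [
        p for pts in clusters.values() if len(pts) >= min_size for p in pts
    ]
    outliers = []
    for pts in clusters.values():
        if len(pts) >= min_size:
            continue  # not a small cluster, so its points are not candidates
        for p in pts:
            if all(math.dist(p, q) > min_dist for q in non_candidates):
                outliers.append(p)
    return outliers


points = [(0, 0), (0, 1), (1, 0), (9, 9), (1.5, 1)]
labels = [0, 0, 0, 1, 2]  # made-up labels: one cluster of three, two singletons
print(cluster_based_outliers(points, labels, min_size=3, min_dist=3.0))
```

Here (1.5, 1) is a candidate because its cluster is small, but it is close to the large cluster and is therefore not reported; (9, 9) is far from every non-candidate point.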


Outliers in Lower Dimensional Projection

• Divide each attribute into φ equal-depth intervals
  - Each interval contains a fraction f = 1/φ of the records
• Consider a d-dimensional cube created by picking grid ranges from d different dimensions
  - If the attributes are independent, we expect the region to contain a fraction f^d of the records
  - If there are N points, we can measure the sparsity of a cube D by comparing the number of points it contains with the expected count N × f^d
  - Negative sparsity indicates that the cube contains fewer points than expected
• To detect the sparse cells, you have to consider all cells, and their number is exponential in d. Heuristics can be used to find them.
Example

• N = 100, φ = 5, f = 1/5 = 0.2, N × f² = 4
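
The example can be checked numerically. The sparsity formula itself is not present in the extracted slide text, so the sketch below assumes a commonly used definition, the sparsity coefficient S(D) = (n(D) - N·f^d) / sqrt(N·f^d·(1 - f^d)), where n(D) is the number of points in the cube. Treat that formula, the function names, and the parameter names as assumptions.

```python
import math


def expected_count(n_total, phi, d):
    """Expected number of points in a d-dimensional cube: N * f^d, f = 1/phi."""
    return n_total * (1 / phi) ** d


def sparsity(n_in_cube, n_total, phi, d):
    """Assumed sparsity coefficient; negative when the cube is underpopulated."""
    p = (1 / phi) ** d
    expected = n_total * p
    return (n_in_cube - expected) / math.sqrt(expected * (1 - p))


print(expected_count(100, 5, 2))  # about 4 points expected per 2-D cell
print(sparsity(0, 100, 5, 2))     # an empty cell has negative sparsity
```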
