Outlier Detection

Anomaly detection involves identifying data points that significantly differ from the majority of the dataset, with various methods including statistical, graphical, and distance-based approaches. Applications include fraud detection and network intrusion detection, but challenges exist due to the unsupervised nature of the methods and the difficulty in validating results. The process typically involves building a profile of normal behavior and detecting anomalies based on deviations from this profile.

Outlier Discovery/Anomaly Detection

Data Mining: Concepts and Techniques
Anomaly/Outlier Detection
■ What are anomalies/outliers?
  ■ The set of data points that are considerably different than the remainder of the data
■ Variants of Anomaly/Outlier Detection Problems
  ■ Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
  ■ Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
  ■ Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
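The three problem variants differ only in how the anomaly score f(x) is used. The sketch below illustrates this with a hypothetical score, the distance from the median of D in units of the median absolute deviation. The slides do not prescribe any particular f(x), so the score and the names `anomaly_score`, `above_threshold` and `top_n` are assumptions made for illustration.

```python
import statistics

def anomaly_score(x, D):
    """Hypothetical f(x): distance from the median of D, in units of the MAD."""
    med = statistics.median(D)
    mad = statistics.median([abs(v - med) for v in D])
    return abs(x - med) / mad if mad > 0 else 0.0

def above_threshold(D, t):
    """Variant 1: all points of D whose score exceeds the threshold t."""
    return [x for x in D if anomaly_score(x, D) > t]

def top_n(D, n):
    """Variant 2: the n points of D with the largest scores."""
    return sorted(D, key=lambda x: anomaly_score(x, D), reverse=True)[:n]

D = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 25.0, 10.0, 9.7, -4.0]

flagged = above_threshold(D, t=5.0)     # variant 1
worst_two = top_n(D, n=2)               # variant 2
test_score = anomaly_score(40.0, D)     # variant 3: score a new test point against D
```

Variant 1 needs a threshold chosen in advance, variant 2 needs only the number of points to report, and variant 3 scores a point that is not part of D.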


Applications

■ Credit card fraud detection
■ Telecommunication fraud detection
■ Network intrusion detection
■ Fault detection
■ Many more


Anomaly Detection
■ Challenges
  ■ How many outliers are there in the data?
  ■ Method is unsupervised
    ■ Validation can be quite challenging (just like for clustering)
  ■ Finding needle in a haystack
■ Working assumption:
  ■ There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data
Anomaly Detection Schemes
■ General Steps
  ■ Build a profile of the “normal” behavior
    ■ Profile can be patterns or summary statistics for the overall population
  ■ Use the “normal” profile to detect anomalies
    ■ Anomalies are observations whose characteristics differ significantly from the normal profile
■ Types of anomaly detection schemes
  ■ Graphical & Statistical-based
  ■ Distance-based
  ■ Model-based
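The two general steps can be sketched with the simplest kind of profile, per-attribute summary statistics. This is a minimal illustration, not a method the slides prescribe. The mean/standard-deviation profile, the "largest deviation over attributes" rule, the default threshold of 3 and all function names are assumptions.

```python
import statistics

def build_profile(normal_rows):
    """Step 1: summarise "normal" behaviour as (mean, std) per attribute."""
    columns = list(zip(*normal_rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]

def deviation(row, profile):
    """Largest deviation from the profile over all attributes, in standard deviations.

    Assumes every attribute has non-zero spread in the profile data.
    """
    return max(abs(v - mean) / std for v, (mean, std) in zip(row, profile))

def detect(rows, profile, threshold=3.0):
    """Step 2: report rows that differ significantly from the profile."""
    return [r for r in rows if deviation(r, profile) > threshold]

# Hypothetical data: (transaction amount, items per transaction)
normal = [(100, 2.0), (110, 2.2), (95, 1.9), (105, 2.1), (90, 1.8)]
new_rows = [(102, 2.0), (300, 2.1), (98, 9.0)]

profile = build_profile(normal)
anomalies = detect(new_rows, profile)
```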


Graphical Approaches
■ Boxplot (1-D), Scatter plot (2-D), Spin plot (3-D)
■ Limitations
  ■ Time consuming
  ■ Subjective
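For the 1-D boxplot, the usual drawing convention can be turned into a rule: points beyond the whiskers, conventionally 1.5 times the interquartile range outside the quartiles, are plotted individually as outliers. The sketch below applies that convention. The factor 1.5 is the common default rather than something stated in the slides, and `boxplot_outliers` is a hypothetical helper name.

```python
import statistics

def boxplot_outliers(values, whisker=1.5):
    """Return the values that fall outside the boxplot whiskers."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

values = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 25.0, 10.0, 9.7, -4.0]
outside = boxplot_outliers(values)
```

Automating the rule removes some of the subjectivity of reading a plot, but the choice of whisker length is still a judgment call.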


Convex Hull Method
■ Extreme points are assumed to be outliers
■ Use convex hull method to detect extreme values
■ What if the outlier occurs in the middle of the data?
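A small 2-D sketch makes the weakness raised by the last question concrete. It computes the hull with Andrew's monotone chain algorithm, which is one of several ways to build a convex hull and is chosen here only because it needs no external library. The sample points are made up.

```python
def cross(o, a, b):
    """Cross product of vectors OA and OB; positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Vertices of the 2-D convex hull, in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:                      # build the lower chain left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):            # build the upper chain right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]     # endpoints are shared, so drop duplicates

points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3), (3, 1)]
extreme = convex_hull(points)          # candidates flagged as outliers
```

Only the four corner points are reported. Any interior point, however isolated it is from its neighbors, can never be flagged by this method.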
Statistical Approaches
■ Assume a parametric model describing the distribution of the data (e.g., normal distribution)
■ Apply a statistical test that depends on
  ■ Data distribution
  ■ Parameter of distribution (e.g., mean, variance)
  ■ Number of expected outliers (confidence limit)


Grubbs’ Test
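■ Detect outliers in univariate data
■ Assume data comes from a normal distribution
■ Detects one outlier at a time: remove the outlier and repeat
■ H0: There is no outlier in the data
■ HA: There is at least one outlier
■ Grubbs’ test statistic:

$$G = \frac{\max_i |X_i - \bar{X}|}{s}$$

  where \bar{X} is the sample mean and s is the sample standard deviation

■ Reject H0 (two-sided test at significance level α, with N data points) if:

$$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$$

  where t_{α/(2N), N−2} is the upper critical value of the t distribution with N − 2 degrees of freedom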
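The test can be sketched with the standard library alone. The sketch assumes the two-sided form of Grubbs' test, in which the critical value uses the upper α/(2N) point of the t distribution with N − 2 degrees of freedom. Because the standard library has no t quantile function, it is approximated here by numerical integration and bisection. In practice a statistics package would supply it, and all helper names are hypothetical.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_upper_critical(p_tail, df):
    """Value t with P(T > t) = p_tail, found by bracketing and bisection."""
    lo, hi = 0.0, 1.0
    while 1 - t_cdf(hi, df) > p_tail:
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1 - t_cdf(mid, df) > p_tail:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def grubbs_test(data, alpha=0.05):
    """One round of the two-sided test; returns (suspect, G, critical value, reject H0?)."""
    n = len(data)                      # needs n >= 3
    mean, s = statistics.mean(data), statistics.stdev(data)
    suspect = max(data, key=lambda x: abs(x - mean))
    g = abs(suspect - mean) / s
    t = t_upper_critical(alpha / (2 * n), n - 2)
    g_crit = (n - 1) / math.sqrt(n) * math.sqrt(t * t / (n - 2 + t * t))
    return suspect, g, g_crit, g > g_crit

def grubbs_outliers(data, alpha=0.05):
    """Remove one outlier at a time and repeat until H0 is no longer rejected."""
    remaining, found = list(data), []
    while len(remaining) >= 3:
        suspect, _g, _crit, reject = grubbs_test(remaining, alpha)
        if not reject:
            break
        found.append(suspect)
        remaining.remove(suspect)
    return found

sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 15.0]
outliers = grubbs_outliers(sample)
```

Repeating the test shrinks the sample each round, so the critical value is recomputed with the new N every time.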


Statistical-based – Likelihood Approach
■ Assume the data set D contains samples from a mixture of two probability distributions:
  ■ M (majority distribution)
  ■ A (anomalous distribution)
■ General Approach:
  ■ Initially, assume all the data points belong to M
  ■ Let L_t(D) be the log likelihood of D at time t
  ■ For each point x_t that belongs to M, move it to A
    ■ Let L_{t+1}(D) be the new log likelihood
    ■ Compute the difference, Δ = L_{t+1}(D) – L_t(D), i.e. the gain in log likelihood from the move
    ■ If Δ > c (some threshold), then x_t is declared as an anomaly and moved permanently from M to A; otherwise it is returned to M


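Statistical-based – Likelihood Approach
■ Data distribution, D = (1 – λ) M + λ A
■ M is a probability distribution estimated from data
  ■ Can be based on any modeling method
■ A is initially assumed to be uniform distribution
■ Likelihood at time t:

$$\prod_{i=1}^{N} P_D(x_i) = \Big( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \Big) \Big( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \Big)$$

■ Log likelihood at time t:

$$L_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$$

  where M_t and A_t are the sets of points assigned to M and A at time t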
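The procedure can be sketched for 1-D data under some explicit assumptions: M is modeled as a Gaussian re-estimated from its current members, A is uniform over the observed data range, and λ and the threshold c are fixed by hand. Δ is taken as the increase in log likelihood after a point is moved, so that a large positive Δ marks an anomaly. None of these choices are fixed by the slides, and the function names are hypothetical.

```python
import math

def log_likelihood(M, A, lam, lo, hi):
    """|M| log(1-lam) + sum log P_M(x) + |A| log(lam) + sum log P_A(x)."""
    mu = sum(M) / len(M)
    var = max(sum((x - mu) ** 2 for x in M) / len(M), 1e-12)   # Gaussian fit to M
    ll_m = sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var) for x in M)
    ll_a = -len(A) * math.log(hi - lo)                          # uniform density on [lo, hi]
    return len(M) * math.log(1 - lam) + ll_m + len(A) * math.log(lam) + ll_a

def likelihood_anomalies(data, lam=0.05, c=2.0):
    """Move each point from M to A in turn; keep the move if the gain exceeds c."""
    lo, hi = min(data), max(data)      # assumes the data are not all identical
    M, A = list(data), []
    ll_t = log_likelihood(M, A, lam, lo, hi)
    for x in data:
        trial_M = list(M)
        trial_M.remove(x)
        trial_A = A + [x]
        ll_next = log_likelihood(trial_M, trial_A, lam, lo, hi)
        if ll_next - ll_t > c:         # delta = gain in log likelihood
            M, A, ll_t = trial_M, trial_A, ll_next
    return A

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 10.1, 9.9, 10.0, 25.0]
anomalies = likelihood_anomalies(data)
```

Moving an ordinary point costs likelihood because λ is small and the uniform density is low. Moving a true anomaly pays off because the Gaussian fitted to the remaining points becomes much tighter.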


Limitations of Statistical Approaches
■ Most of the tests are for a single attribute
■ In many cases, data distribution may not be known
■ For multi-dimensional data, it may be difficult to estimate the true distribution


Distance-based Approaches
■ Data is represented as a vector of features
■ Three major approaches
  ■ Nearest-neighbor based
  ■ Density based
  ■ Clustering based


Nearest-Neighbor Based Approach
■ Approach:
  ■ Compute the distance between every pair of data points
  ■ There are various ways to define outliers:
    ■ Data points for which there are fewer than p neighboring points within a distance D
    ■ The top n data points whose distance to the kth nearest neighbor is greatest
    ■ The top n data points whose average distance to the k nearest neighbors is greatest
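The three definitions can be compared side by side with a brute-force sketch that computes all pairwise Euclidean distances. Euclidean distance and the function names are assumptions, and the quadratic cost makes this suitable for small data only.

```python
import math

def knn_distances(points, k):
    """For each point, the sorted distances to its k nearest neighbours."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[:k])
    return result

def few_neighbors(points, p_min, radius):
    """Definition 1: fewer than p_min other points within distance radius."""
    flagged = []
    for i, p in enumerate(points):
        count = sum(1 for j, q in enumerate(points)
                    if j != i and math.dist(p, q) <= radius)
        if count < p_min:
            flagged.append(p)
    return flagged

def top_n_kth_distance(points, n, k):
    """Definition 2: the n points with the greatest distance to their kth neighbour."""
    d = knn_distances(points, k)
    order = sorted(range(len(points)), key=lambda i: d[i][-1], reverse=True)
    return [points[i] for i in order[:n]]

def top_n_mean_distance(points, n, k):
    """Definition 3: the n points with the greatest mean distance to their k neighbours."""
    d = knn_distances(points, k)
    order = sorted(range(len(points)), key=lambda i: sum(d[i]) / k, reverse=True)
    return [points[i] for i in order[:n]]

# A 3 x 3 grid of "normal" points plus one isolated point.
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
```

Definition 1 needs two parameters (p and D) tuned to the data scale, while definitions 2 and 3 only rank points and leave the cut-off to the choice of n.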


Density-based: LOF approach
■ For each point, compute the density of its local neighborhood
■ Compute the local outlier factor (LOF) of a sample p as the average ratio of the density of each of its nearest neighbors to the density of p
■ Outliers are the points with the largest LOF values

[Figure: scatter plot marking points p1 and p2.] In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.


LOF

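The local outlier factor (LOF) is defined as follows:

$$\mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}$$

where N_k(p) is the set of k-nearest neighbors of p, and the local reachability density lrd is

$$\mathrm{lrd}_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \text{reach-dist}_k(p, o) \right)^{-1}, \qquad \text{reach-dist}_k(p, o) = \max\{\, k\text{-distance}(o),\ d(p, o) \,\}$$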
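A brute-force sketch of LOF follows, assuming the standard formulation: the local reachability density of p is the inverse of the mean reachability distance from p to its k nearest neighbors, and LOF is the mean ratio of the neighbors' densities to the density of p. Ties at the kth distance are broken arbitrarily and all points are assumed to be distinct. The function name is hypothetical.

```python
import math

def lof_scores(points, k):
    """Local outlier factor of every point, computed by brute force."""
    n = len(points)
    neighbors, k_dist = [], []
    for i in range(n):
        # (distance, index) pairs to all other points, nearest first
        order = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbors.append([j for _, j in order[:k]])
        k_dist.append(order[k - 1][0])          # distance to the kth nearest neighbour

    def lrd(i):
        """Local reachability density: inverse of the mean reachability distance."""
        reach = [max(k_dist[o], math.dist(points[i], points[o])) for o in neighbors[i]]
        return len(reach) / sum(reach)

    density = [lrd(i) for i in range(n)]
    return [sum(density[o] for o in neighbors[i]) / (len(neighbors[i]) * density[i])
            for i in range(n)]

# A 3 x 3 grid of "normal" points plus one isolated point (index 9).
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_scores(points, k=3)
```

Points inside a region of uniform density score close to 1, and scores well above 1 indicate a point that is much sparser than its neighbors.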


Clustering-Based
■ Basic idea:
  ■ Cluster the data into groups of different density
  ■ Choose points in small cluster as candidate outliers
  ■ Compute the distance between candidate points and non-candidate clusters
    ■ If candidate points are far from all other non-candidate points, they are outliers
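The sketch below covers the steps after clustering. It assumes cluster labels have already been produced by some clustering algorithm, which the slides leave open, and it measures the distance to a non-candidate cluster by the distance to its centroid. The parameters `min_size` and `min_dist` and the function name are hypothetical.

```python
import math
from collections import defaultdict

def cluster_based_outliers(points, labels, min_size, min_dist):
    """Flag points of small clusters that lie far from every large cluster."""
    clusters = defaultdict(list)
    for p, label in zip(points, labels):
        clusters[label].append(p)

    # Clusters with fewer than min_size members hold the candidate outliers.
    small = {label for label, members in clusters.items() if len(members) < min_size}

    # Centroids of the remaining (non-candidate) clusters.
    centroids = [tuple(sum(coord) / len(members) for coord in zip(*members))
                 for label, members in clusters.items() if label not in small]

    outliers = []
    for label, members in clusters.items():
        if label in small:
            for p in members:
                if all(math.dist(p, c) > min_dist for c in centroids):
                    outliers.append(p)
    return outliers

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 20), (1.5, 1.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 3]   # assumed output of a clustering step
found = cluster_based_outliers(points, labels, min_size=3, min_dist=5.0)
```

The point (1.5, 1.5) forms its own tiny cluster but sits next to a large one, so it stays a candidate without being reported.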


Outliers in Lower Dimensional Projections

■ In high-dimensional space, data is sparse and notion of proximity becomes meaningless
  ■ Every point is an almost equally good outlier from the perspective of proximity-based definitions
■ Lower-dimensional projection methods
  ■ A point is an outlier if in some lower dimensional projection, it is present in a local region of abnormally low density


Outliers in Lower Dimensional Projection

■ Divide each attribute into φ equal-depth intervals
  ■ Each interval contains a fraction f = 1/φ of the records
■ Consider a k-dimensional cube created by picking grid ranges from k different dimensions
  ■ If attributes are independent, we expect the region to contain a fraction f^k of the records
  ■ If there are N points, we can measure the sparsity of a cube D as:

$$S(D) = \frac{n(D) - N f^k}{\sqrt{N f^k (1 - f^k)}}$$

    where n(D) is the number of points in the cube
  ■ Negative sparsity indicates the cube contains a smaller number of points than expected
■ To detect the sparse cells, you would have to consider all cells, and their number is exponential in the dimensionality d; heuristics can be used to find them
Example
■ N = 100, φ = 5, f = 1/5 = 0.2, so a 2-dimensional cube is expected to contain N × f² = 4 points
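The sparsity coefficient can be sketched under the assumption that a cube with n(D) points out of N is compared with the count N·f^k expected under independence, scaled by the binomial standard deviation. The rank-based equal-depth binning and the function names are illustrative choices.

```python
import math
from collections import Counter

def equal_depth_bins(values, phi):
    """Assign each value to one of phi equal-depth intervals, by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * phi // len(values)
    return bins

def sparsity(n_cube, N, f, k):
    """(observed - expected) / standard deviation for a k-dimensional cube."""
    expected = N * f ** k
    return (n_cube - expected) / math.sqrt(expected * (1 - f ** k))

def cube_sparsities(rows, dims, phi):
    """Sparsity of every non-empty cube in the projection onto the attributes dims.

    Empty cubes do not appear in the result; their sparsity is sparsity(0, N, f, k).
    """
    N = len(rows)
    per_dim = [equal_depth_bins([r[d] for r in rows], phi) for d in dims]
    counts = Counter(zip(*per_dim))
    f, k = 1 / phi, len(dims)
    return {cell: sparsity(c, N, f, k) for cell, c in counts.items()}

# With N = 100, phi = 5 and k = 2, a cube is expected to hold 4 points.
empty_cube = sparsity(0, 100, 0.2, 2)      # negative: fewer points than expected
average_cube = sparsity(4, 100, 0.2, 2)    # about zero: exactly as expected
```

Enumerating every cube this way is only feasible for small k and d, which is why the slides point to heuristics for searching the space of projections.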
