Lecture23 2

Uploaded by

kingsaunawdir

Lecture 24: Anomaly Detection

What are anomalies/outliers?

– The set of data points that are considerably different
than the remainder of the data

Applications:
– Credit card fraud detection, telecommunication fraud
detection, network intrusion detection, fault detection

Ozone Depletion History
In 1985, three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels over Antarctica had dropped 10% below normal levels.

Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?

The ozone concentrations recorded by the satellite were so low that they were being treated as outliers by a computer program and discarded!

Variants of Anomaly/Outlier Detection Problems

– Given a database D, find all the data points x ∈ D with
anomaly scores f(x) greater than some threshold t

– Given a database D, find all the data points x ∈ D
having the top-n largest anomaly scores f(x)

– Given a database D containing mostly normal (but
unlabeled) data points, and a test point x, compute the
anomaly score of x with respect to D
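The three problem variants can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this sketch, and the score used for the third variant (absolute deviation from the median of D) is an assumption chosen for simplicity, not a score defined in the lecture.

```python
import statistics

def above_threshold(points, score, t):
    """Variant 1: all points whose anomaly score f(x) exceeds the threshold t."""
    return [x for x in points if score(x) > t]

def top_n(points, score, n):
    """Variant 2: the n points with the largest anomaly scores f(x)."""
    return sorted(points, key=score, reverse=True)[:n]

def score_against(reference, x):
    """Variant 3: score a test point x with respect to a mostly normal set D.

    Assumed score (illustrative only): absolute deviation from the median of D.
    """
    return abs(x - statistics.median(reference))

# Hypothetical 1-D data set with one obviously unusual value.
data = [10, 11, 9, 10, 12, 30]
f = lambda x: score_against(data, x)
flagged = above_threshold(data, f, t=5)
most_anomalous = top_n(data, f, n=1)
```

Any other scoring function f can be plugged in; the selection logic for the first two variants stays the same.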

Challenges
– How many outliers are there in the data?
– Method is unsupervised
Validation can be quite challenging (just like for clustering)
– Finding needle in a haystack

Working assumption:
– There are considerably more “normal” observations
than “abnormal” observations (outliers/anomalies) in
the data

General Steps
– Build a profile of the “normal” behavior
Profile can be patterns or summary statistics for the overall population
– Use the “normal” profile to detect anomalies
Anomalies are observations whose characteristics
differ significantly from the normal profile

Types of anomaly detection schemes
– Graphical & Statistical-based
– Distance-based

Graphical approaches: boxplot (1-D), scatter plot (2-D), spin plot (3-D)

Limitations
– Time consuming
– Subjective

Convex Hull Method

Extreme points are assumed to be outliers

Use convex hull method to detect extreme values

What if the outlier occurs in the middle of the data?
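A minimal sketch of the convex hull idea for 2-D points, using Andrew's monotone chain algorithm (the choice of algorithm is an assumption of this sketch; the slide does not name one). Points on the hull are the "extreme" points; the example also shows the caveat raised above, since an unusual point lying inside the hull is never reported.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop the last vertex while it does not make a strict left turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]

# Hypothetical data: four corner points plus two interior points.
points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
extreme_points = convex_hull(points)
```

Here the interior points (2, 2) and (1, 3) can never be flagged, however unusual they are relative to the rest of the data.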
Assume a parametric model describing the
distribution of the data (e.g., normal distribution)

Apply a statistical test that depends on

– Data distribution
– Parameter of distribution (e.g., mean, variance)
– Number of expected outliers (confidence limit)

Grubbs' Test

Detect outliers in univariate data

Assume data comes from normal distribution
Detects one outlier at a time, remove the outlier,
and repeat
– H0: There is no outlier in data
– HA: There is at least one outlier
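Grubbs' test statistic:
\[ G = \frac{\max_i |X_i - \bar{X}|}{s} \]
where \bar{X} is the sample mean and s is the sample standard deviation
Reject H0 if:
\[ G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N - 2 + t^2_{(\alpha/N,\,N-2)}}} \]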
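The test-remove-repeat procedure can be sketched as follows. The sketch uses only the standard library, so it does not compute the critical value of the t distribution itself: `t_crit_for` is a hypothetical caller-supplied function that returns the appropriate t critical value (with N − 2 degrees of freedom) for a sample of size N, for example from a table or a statistics library. All function names are illustrative.

```python
import math
import statistics

def grubbs_statistic(xs):
    """Return (G, index of the most extreme value), with G = max|X_i - mean| / s."""
    mean = statistics.fmean(xs)
    s = statistics.stdev(xs)  # sample standard deviation (N - 1 in the denominator)
    idx = max(range(len(xs)), key=lambda i: abs(xs[i] - mean))
    if s == 0:
        return 0.0, idx  # all values identical: nothing can be extreme
    return abs(xs[idx] - mean) / s, idx

def grubbs_threshold(n, t_crit):
    """Rejection threshold for G given the sample size n and a t critical value."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))

def grubbs_outliers(xs, t_crit_for):
    """Remove one outlier at a time until H0 (no outlier) is no longer rejected."""
    remaining, removed = list(xs), []
    while len(remaining) > 2:
        g, idx = grubbs_statistic(remaining)
        if g <= grubbs_threshold(len(remaining), t_crit_for(len(remaining))):
            break
        removed.append(remaining.pop(idx))
    return removed

# Hypothetical data; the constant 3.0 is a placeholder, NOT a real critical value.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 25.0]
removed = grubbs_outliers(sample, t_crit_for=lambda n: 3.0)
```

In practice the critical value depends on both the significance level and the current sample size, so `t_crit_for` has to be re-evaluated after every removal, which is why it is passed in as a function.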
Statistical-based: Likelihood Approach

Assume the data set D contains samples from a

mixture of two probability distributions:
– M (majority distribution)
– A (anomalous distribution)
General Approach:
– Initially, assume all the data points belong to M
– Let L_t(D) be the log likelihood of D at time t
– For each point x_t that belongs to M, move it to A
Let L_{t+1}(D) be the new log likelihood.
Compute the difference, ∆ = L_t(D) – L_{t+1}(D)
If ∆ > c (some threshold), then x_t is declared as an anomaly
and moved permanently from M to A

Statistical-based: Likelihood Approach

Data distribution, D = (1 – λ) M + λ A
M is a probability distribution estimated from data
– Can be based on any modeling method (naïve Bayes,
maximum entropy, etc)
A is initially assumed to be uniform distribution
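Likelihood at time t:
\[ L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right) \]
Log likelihood at time t:
\[ LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log\lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i) \]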
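The likelihood approach can be sketched for 1-D data as below. Several choices here are assumptions made for illustration: M is modelled as a Gaussian re-estimated from its current members, A is uniform over the observed data range, and the defaults λ = 0.05 and c = 1.0 are arbitrary. The sketch flags a point when moving it to A increases the log likelihood by more than c (that is, it compares the log likelihood after the move with the one before), since moving a genuinely anomalous point out of M raises the likelihood of the data.

```python
import math

def gaussian_loglik(points):
    """Log likelihood of points under a Gaussian fitted to those same points."""
    n = len(points)
    mu = sum(points) / n
    var = max(sum((x - mu) ** 2 for x in points) / n, 1e-12)  # floor avoids log(0)
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in points)

def log_likelihood(m_points, a_points, lam, a_density):
    """LL(D) = |M| log(1-lam) + sum log P_M(x) + |A| log(lam) + sum log P_A(x)."""
    ll = len(m_points) * math.log(1 - lam) + gaussian_loglik(m_points)
    ll += len(a_points) * (math.log(lam) + math.log(a_density))
    return ll

def likelihood_anomalies(data, lam=0.05, c=1.0):
    """Tentatively move each point from M to A; keep the move if the gain exceeds c."""
    a_density = 1.0 / (max(data) - min(data))  # assumes the data are not all equal
    m_points, a_points = list(data), []
    for x in data:
        if len(m_points) <= 2:
            break  # too few points left to estimate M
        before = log_likelihood(m_points, a_points, lam, a_density)
        trial_m = list(m_points)
        trial_m.remove(x)
        after = log_likelihood(trial_m, a_points + [x], lam, a_density)
        if after - before > c:
            m_points = trial_m
            a_points.append(x)
    return a_points

# Hypothetical measurements with one value far from the rest.
anomalies = likelihood_anomalies([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 25.0])
```

Because M is refitted after each accepted move, the result can depend on the order in which points are visited; a more careful version might repeat the pass until no further point moves.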

Limitations of Statistical Approaches

Most of the tests are for a single attribute

In many cases, data distribution may not be known

For high dimensional data, it may be difficult to estimate the true distribution

Distance-based Approaches

Data is represented as a vector of features

Three major approaches

– Nearest-neighbor based
– Density based
– Clustering based

Nearest-Neighbor Based Approach

Approach:
– Compute the distance between every pair of data points

– There are various ways to define outliers:

Data points for which there are fewer than p neighboring points within a distance D

The top n data points whose distance to the kth nearest neighbor is greatest

The top n data points whose average distance to the k nearest neighbors is greatest
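The three outlier definitions above can be sketched with a brute-force pairwise distance computation. This is a small illustration with made-up function names and Euclidean distance assumed; it costs O(N²) distance evaluations and is not meant for large data sets.

```python
import math

def sorted_neighbor_distances(points, i):
    """Distances from point i to every other point, in ascending order."""
    return sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)

def few_neighbors(points, p, d):
    """Definition 1: points with fewer than p neighbors within distance d."""
    return [pt for i, pt in enumerate(points)
            if sum(1 for x in sorted_neighbor_distances(points, i) if x <= d) < p]

def top_n_by_kth_distance(points, k, n):
    """Definition 2: top n points by distance to the k-th nearest neighbor."""
    score = lambda i: sorted_neighbor_distances(points, i)[k - 1]
    ranked = sorted(range(len(points)), key=score, reverse=True)
    return [points[i] for i in ranked[:n]]

def top_n_by_avg_distance(points, k, n):
    """Definition 3: top n points by average distance to the k nearest neighbors."""
    score = lambda i: sum(sorted_neighbor_distances(points, i)[:k]) / k
    ranked = sorted(range(len(points)), key=score, reverse=True)
    return [points[i] for i in ranked[:n]]

# Hypothetical data: a tight square of four points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
by_count = few_neighbors(points, p=2, d=1.5)
by_kth = top_n_by_kth_distance(points, k=2, n=1)
by_avg = top_n_by_avg_distance(points, k=2, n=1)
```

All three definitions agree on this toy example; on real data they can rank points differently, and the results are sensitive to the choice of k, p and the distance threshold.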

Outliers in Lower Dimensional Projection

In high-dimensional space, data is sparse and the notion of proximity becomes meaningless
– Every point is an almost equally good outlier from the perspective of proximity-based definitions

Lower-dimensional projection methods
– A point is an outlier if, in some lower-dimensional projection, it is present in a local region of abnormally low density

Outliers in Lower Dimensional Projection

Divide each attribute into φ equal-depth intervals
– Each interval contains a fraction f = 1/φ of the records
Consider a k-dimensional cube created by picking grid ranges from k different dimensions
– If attributes are independent, we expect the region to contain a fraction f^k of the records
– If there are N points, we can measure the sparsity of a cube D as:
\[ S(D) = \frac{n(D) - N f^k}{\sqrt{N f^k (1 - f^k)}} \]
where n(D) is the number of points in the cube
– Negative sparsity indicates that the cube contains fewer points than expected

Example: N = 100, φ = 5, f = 1/5 = 0.2, k = 2, so the expected count is N × f^2 = 4
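The numbers in the example can be checked with a short sketch. It assumes the sparsity coefficient is the standardized count S(D) = (n(D) − N·f^k) / sqrt(N·f^k·(1 − f^k)), that is, the observed count minus the expected count, divided by the standard deviation of a binomial count under attribute independence. The function name and argument names are illustrative.

```python
import math

def sparsity_coefficient(n_in_cube, n_total, phi, k):
    """Standardized deviation of a cube's point count from its expected count.

    n_in_cube: observed number of points in the k-dimensional cube
    n_total:   total number of points N
    phi:       number of equal-depth intervals per attribute (f = 1/phi)
    k:         number of dimensions used to form the cube
    """
    f = 1.0 / phi
    expected = n_total * f ** k
    std = math.sqrt(n_total * f ** k * (1 - f ** k))
    return (n_in_cube - expected) / std

# Example from the slide: N = 100, phi = 5, k = 2, so about 4 points are expected.
empty_cube = sparsity_coefficient(0, 100, 5, 2)     # fewer points than expected
typical_cube = sparsity_coefficient(4, 100, 5, 2)   # matches the expectation
crowded_cube = sparsity_coefficient(10, 100, 5, 2)  # more points than expected
```

An empty cube gets a clearly negative coefficient, a cube holding the expected 4 points scores approximately zero, and a crowded cube scores positive; points falling in cubes with strongly negative sparsity are the outlier candidates.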

Density-based: LOF Approach

For each point, compute the density of its local neighborhood

Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of each of its nearest neighbors to the density of sample p
Outliers are points with the largest LOF values
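A simplified LOF-style score can be sketched as follows. Note the simplifications, which are assumptions of this sketch: density is taken as the inverse of the average distance to the k nearest neighbors, rather than the reachability-based density of the full LOF definition, and the points are assumed to be distinct. The score of p is the average, over p's k nearest neighbors, of the neighbor's density divided by p's density, so a point in a sparser region than its neighbors scores well above 1.

```python
import math

def k_nearest(points, i, k):
    """Indices of the k nearest neighbors of point i (brute force)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def local_density(points, i, k):
    """Inverse of the average distance from point i to its k nearest neighbors."""
    nbrs = k_nearest(points, i, k)
    avg = sum(math.dist(points[i], points[j]) for j in nbrs) / len(nbrs)
    return 1.0 / avg  # assumes no duplicate points, so avg > 0

def simplified_lof(points, k):
    """Average ratio of each neighbor's density to the point's own density."""
    dens = [local_density(points, i, k) for i in range(len(points))]
    scores = []
    for i in range(len(points)):
        nbrs = k_nearest(points, i, k)
        scores.append(sum(dens[j] for j in nbrs) / len(nbrs) / dens[i])
    return scores

# Hypothetical data: a dense square of four points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = simplified_lof(points, k=2)
```

Points inside the dense square score about 1, while the isolated point scores several times higher, because its neighbors live in a much denser region than it does.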

[Figure: example data set with two marked points, p1 and p2]

Clustering-Based

Basic idea:
– Cluster the data into groups of different density
– Choose points in small clusters as candidate outliers
– Compute the distance between candidate points and non-candidate clusters.
If candidate points are far from all other non-candidate points, they are outliers
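The basic idea can be sketched with a deliberately simple clustering step. As a stand-in for a real clustering algorithm, the sketch groups points into connected components in which neighbors lie within a distance `eps` of each other; this choice, and the parameter names `eps`, `min_size` and `far`, are assumptions made for illustration.

```python
import math

def threshold_clusters(points, eps):
    """Label points by connected component: linked if within eps of each other."""
    labels = [None] * len(points)
    cluster = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue
        labels[start] = cluster
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] is None and math.dist(points[i], points[j]) <= eps:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels

def cluster_based_outliers(points, eps, min_size, far):
    """Candidates are members of small clusters; keep those far from all other points."""
    labels = threshold_clusters(points, eps)
    sizes = {}
    for label in labels:
        sizes[label] = sizes.get(label, 0) + 1
    candidates = [i for i, l in enumerate(labels) if sizes[l] < min_size]
    normal = [i for i, l in enumerate(labels) if sizes[l] >= min_size]
    outliers = []
    for i in candidates:
        nearest = min((math.dist(points[i], points[j]) for j in normal),
                      default=math.inf)
        if nearest > far:
            outliers.append(points[i])
    return outliers

# Hypothetical data: two clusters of four, one nearby straggler, one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (2, 2), (5.5, 5.5)]
outliers = cluster_based_outliers(points, eps=1.2, min_size=3, far=3.0)
```

The straggler at (2, 2) forms its own tiny cluster and becomes a candidate, but it is close to a large cluster and is not reported; only the isolated point at (5.5, 5.5) is far from every non-candidate point.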
