What is Anomaly Detection
Mr Hew Ka Kian
hew_ka_kian@rp.edu.sg
What is Anomaly Detection
• Anomaly detection (also outlier detection) is the identification of rare items,
events or observations which raise suspicions by differing significantly from the
majority of the data.
• The anomalous items may
translate to some kind of problem such
as bank fraud, network breach, a structural
defect, medical problems or errors in a text.
• Anomalies are also referred to as outliers,
novelties, noise, deviations and exceptions.
• Anomaly detection applied to unlabeled data is known as unsupervised anomaly
detection, although supervised anomaly detection is possible with labeled data
Source: https://en.wikipedia.org/wiki/Anomaly_detection
Anomaly Types
• Anomaly is a broad concept, which may refer to many different types
of events in time series.
• A spike in value, a shift in volatility, etc. could all be anomalous or
normal, depending on the specific context.
Time series data anomaly detection
• Successful anomaly detection hinges on an ability to accurately analyze time series data in real
time.
• Time series data is composed of a sequence of values over time. That means each point is
typically a pair of two items — a timestamp for when the metric was measured, and the value of
that metric.
• Time series data anomaly detection can be used for valuable metrics such as:
seismic readings, virus infection cases, power generator output, transaction
volume, login attempts and mobile app installs.
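As a minimal sketch (the dates and values below are made up for illustration), such a series can be held as a pandas Series whose index carries the timestamps and whose values carry the metric:

import pandas as pd

# Each point pairs a timestamp (the index) with the value of the metric
s = pd.Series(
    [21.5, 22.0, 35.7, 21.8],
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]),
    name="temperature",
)
print(s)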
Univariate vs. Multivariate
Univariate
• Looking at one variable.
• If we want to look at anomalous weather patterns, univariate anomaly detection
will measure a single indicator, such as temperature. We can then ask questions
like "is this temperature strange for this region?"
Multivariate
• Need to consider multiple factors and the relationship between them.
• If we want to look at anomalous weather patterns, multivariate analysis will
consider a host of factors, like precipitation, humidity and air pressure.
Anomaly Detection Toolkit
Anomaly Detection Toolkit (ADTK) is a Python package for unsupervised time
series anomaly detection.
This package offers a set of functions that make training on a dataset and
detecting anomalies easier to code.
It also provides functions to process and visualize time series and
anomaly events.
ADTK is open source; its code and many examples of how to use the
package are at https://github.com/arundo/adtk
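For orientation, the parts of ADTK used in the rest of this deck come from three submodules (these exact imports appear in the exercises that follow):

from adtk.data import validate_series        # check and clean a time series before detection
from adtk.detector import ThresholdAD        # detectors (ThresholdAD, QuantileAD, ...) live here
from adtk.visualization import plot          # plot a series together with flagged anomalies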
ADTK Anomaly Types
• ADTK can detect a point anomaly where there is a data point whose value is
significantly different from others.
• An outlier point in a time series is one that exceeds the normal range of the
series.
• To detect outliers, the normal range of time series values (baseline) is what a
detector needs to learn.
ADTK Anomaly Types
• Spike and Level Shift: In some situations, whether a time point is
normal depends on whether its value is aligned with its near past.
• An abrupt increase or decrease of value is called a spike if the change
is temporary.
• However, we should use ADTK to detect a level shift if the change is
permanent.
Source: https://cloud.google.com/ai-platform/docs/ml-solutions-overview
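As a hedged sketch of how these two cases map onto ADTK detectors (the parameter values are illustrative, not tuned): PersistAD flags temporary spikes against the recent past, while LevelShiftAD flags permanent shifts by comparing two adjacent windows.

from adtk.detector import PersistAD, LevelShiftAD

# Spike: a value deviates from its near past and then returns to normal
persist_ad = PersistAD(c=3.0, side='both')
# Level shift: the values of two adjacent 5-point windows differ persistently
level_shift_ad = LevelShiftAD(c=6.0, side='both', window=5)

# Assuming s is a validated series (validate_series is covered in Exercise D):
# spike_anomalies = persist_ad.fit_detect(s)
# shift_anomalies = level_shift_ad.fit_detect(s)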
Workflow: Source data
(Workflow steps: Source data → Prepare data → Select the algorithm → Train the model → Test the model → Use the model for inference)
• Data can come from places you have access to, like credit card transactions if you work in a bank.
• You can also get your hands on privileged data through a commercial arrangement, like paying for the data.
• There are also plenty of open (public) datasets shared with the public for free:
• https://kaggle.com is an online community for machine learning enthusiasts and it has many open datasets.
• https://data.gov.sg was first launched in 2011 as the government's one-stop portal to its publicly-available datasets from 70 public agencies. To date, more than 100 apps have been created using the government's open data.
Workflow: Prepare data
Filter the data
• The rows of interest, like those above 65 years old
• The columns of interest, basically just the datetime and the feature columns
Transform
• Combine multiple sources
• Extract features by applying mathematical functions, like a moving average of 20 data points
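A minimal pandas sketch of these preparation steps (the file name, column names and age threshold are made up for illustration):

import pandas as pd

df = pd.read_csv("data.csv", index_col="datetime", parse_dates=True)  # hypothetical source file

# Filter: the rows and columns of interest
df = df[df["age"] > 65]            # e.g. only rows above 65 years old
df = df[["reading"]]               # keep just the feature column (datetime is the index)

# Transform: extract a feature with a mathematical function,
# e.g. a moving average over 20 data points
df["reading_ma20"] = df["reading"].rolling(window=20).mean()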
Workflow: Select the algorithm
What type of anomaly? Global (point) anomalies, contextual anomalies or collective anomalies.
• ADTK threshold detector detects values outside certain threshold values
• ADTK quantile detector detects values outside a certain percentile
• ADTK volatility shift detector detects a shift of volatility by comparing 2 windows of values
• ADTK Seasonal detector detects departure from a repeating pattern
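As a sketch, the four detectors named above correspond roughly to these ADTK classes (the constructor arguments are illustrative and taken from the exercises below, except SeasonalAD which uses its defaults):

from adtk.detector import ThresholdAD, QuantileAD, VolatilityShiftAD, SeasonalAD

threshold_ad = ThresholdAD(high=30, low=15)         # values outside a fixed baseline range
quantile_ad = QuantileAD(high=0.99, low=0.01)       # values outside learned percentiles
volatility_shift_ad = VolatilityShiftAD(window=30)  # shift of volatility between 2 adjacent windows
seasonal_ad = SeasonalAD()                          # departure from a repeating (seasonal) pattern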
Workflow: Train and Test
To train the model using ADTK, call the fit(df) function.
To test, or to use the model to detect anomalies, call the detect(df) function.
fit_detect(df) is a convenience function that trains and then detects in one step.
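A short sketch of the two equivalent call patterns, assuming s is a validated series and quantile_ad is the QuantileAD detector used in Exercise D below:

# Train first, then detect, as two separate calls
quantile_ad.fit(s)                    # learn the normal range from the series
anomalies = quantile_ad.detect(s)     # flag points outside the learned range

# Or do both in one step
anomalies = quantile_ad.fit_detect(s)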
Exercise D
Pandas is a fast and powerful Python library for data
manipulation.
Import the library
• import pandas as pd
Read the content of the comma separated values (CSV) into a
Pandas Series object
• s = pd.read_csv('dataset.csv', index_col="pr_date",
parse_dates=True, infer_datetime_format=True)
• index_col is the column that holds the datetime
• infer_datetime_format=True tries to guess the date format
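One caveat, not stated on the slide: pd.read_csv returns a DataFrame, not a Series, and infer_datetime_format is deprecated in pandas 2.0+. A hedged variant that yields a true Series:

import pandas as pd

# read_csv gives a DataFrame indexed by the datetime column
df = pd.read_csv('dataset.csv', index_col="pr_date", parse_dates=True)
# If only one value column remains, squeeze it into a Series
s = df.squeeze("columns")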
Exercise D
Use ADTK's validate_series(df) to check for errors in the Series
Import the library
• from adtk.data import validate_series
Validate the Series
• s = validate_series(s)
Exercise D Detecting Simple Threshold
Use ThresholdAD to detect outliers (point anomalies) that exceed the
baseline threshold
Import the library
• from adtk.detector import ThresholdAD
Create the ThresholdAD object with the high and low threshold values. Values above the
high or below the low threshold are flagged as anomalies
• threshold_ad = ThresholdAD(high=30, low=15)
Detect anomalies
• anomalies = threshold_ad.detect(s)
Exercise D
Use the plot() function to visualize the graph
Import the library
• from adtk.visualization import plot
Plot the graph with the DataFrame and anomalies. We can specify the anomaly
marker colour and set the anomaly tag to "marker" (a dot on the graph)
• plot(s, anomaly=anomalies, anomaly_color='red',
anomaly_tag="marker");
Exercise D Percentile
• Percentile: the value below which a percentage of data falls.
Source: https://www.mathsisfun.com/data/percentiles.html
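A small worked example (made-up numbers): for the values 1 to 10, the 90th percentile is the value below which 90% of the data falls.

import pandas as pd

values = pd.Series(range(1, 11))   # 1, 2, ..., 10
print(values.quantile(0.90))       # 9.1 under pandas' default linear interpolation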
Exercise D Percentile
Use QuantileAD to detect outliers (point anomalies) that fall outside
certain percentiles of the series
Import the library
• from adtk.detector import QuantileAD
Create the object with the high and low percentiles; values outside
this boundary are considered anomalous
• quantile_ad = QuantileAD(high=0.99, low=0.01)
The detector needs to be trained on the range before detection. fit_detect() does this
in one step
• anomalies = quantile_ad.fit_detect(s)
Exercise D
Use VolatilityShiftAD() to detect a change in the volatility in the
time series
Import the library
• from adtk.detector import VolatilityShiftAD
VolatilityShiftAD() compares the volatility between 2 adjacent windows. We
have to create the object specifying how many time points the windows contain
• volatility_shift_ad = VolatilityShiftAD(window=30)
Exercise E
• Read tsla.us.txt csv file and print the content.
s = pd.read_csv('data/tsla.us.txt',
index_col="Date", parse_dates=True,
infer_datetime_format=True)
print(s)
• What are the columns?
Open, High, Low, Close, Volume, OpenInt
Exercise E
• Write the code to drop all columns except the Date and Volume columns.
s = s.drop(['High','Low','Open','Close','OpenInt'],axis=1)
• Write the code to detect the volatility shift with window of 60 values
s = validate_series(s)
volatility_shift_ad = VolatilityShiftAD(window=60)
anomalies = volatility_shift_ad.fit_detect(s)
• Plot the graph. Do you get something like the graph shown in the worksheet, which
detects the start of increased volatility?
plot(s, anomaly=anomalies, anomaly_color='red');
• What did you find that was the likely reason for the anomaly? Bear in mind that an anomaly
does not have to be bad, simply something that deviates from the standard or norm.
• Tesla announced that it was becoming profitable in 2013.
Exercise F
• Open the file weekly-infectious-disease-bulletin-cases.csv using Excel.
• What are the columns?
epi_week: week of the year
disease: disease
no._of_cases: number of cases
Exercise F
• How do we filter for the rows with ‘Dengue Fever’?
infectious = infectious[infectious['disease'] == 'Dengue Fever']
• Set the epi_week as the index column.
infectious = infectious.set_index('epi_week')
• Drop the disease column.
infectious = infectious.drop('disease',axis=1)
Student Activity
In Exercise G, we are going to do the whole ML workflow:
1. Source the data at Data.gov.sg
2. Prepare the data
• Change the datetime format
• Choose a disease to examine and filter out the irrelevant columns
3. Select the algorithm – shift in volatility
4. Train and use the model for inference
Exercise G
• Acute Upper Respiratory Tract infections
import pandas as pd
from adtk.detector import VolatilityShiftAD
from adtk.visualization import plot

infectious = pd.read_csv(
    'data/average-daily-polyclinic-attendances-for-selected-diseases.csv')
# Build a datetime index from the epi_week column (e.g. "2012-W01" -> Monday of that week)
infectious['epi_week'] = pd.to_datetime(
    infectious['epi_week'] + '-1 00:00:00', format='%Y-W%W-%w %H:%M:%S')
# Keep only the rows for the chosen disease, then drop the disease column
infectious = infectious[
    infectious['disease'] == 'Acute Upper Respiratory Tract infections']
infectious = infectious.set_index('epi_week')
infectious = infectious.drop('disease', axis=1)
# Detect volatility shifts by comparing adjacent windows of 52 weeks
volatility_shift_ad = VolatilityShiftAD(window=52)
anomalies = volatility_shift_ad.fit_detect(infectious)
plot(infectious, anomaly=anomalies, anomaly_color='red');
Problem Solution
Is there any abrupt change in the trend of visitors to Singapore?
Problem Solution
• Source data
• Get the visitor-international-arrivals-to-singapore-by-region-monthly.csv from Data.gov.sg
• The datetime column is well defined, so there is no need to modify it
s = pd.read_csv('data/visitor-international-arrivals-to-singapore-by-region-monthly.csv',
index_col="month", parse_dates=True)
• Filter for a region like Africa
s = s[s['region']=='Africa']
• Drop the region column
s = s.drop(['region'],axis=1)
• Train and inference
s = validate_series(s)
volatility_shift_ad = VolatilityShiftAD(window=12)
anomalies = volatility_shift_ad.fit_detect(s)
• Plot
plot(s, anomaly=anomalies, anomaly_color='red')
Exercise H
• Any anomaly?
Yes, around 2002-2003 (depending on the region)