
Deep Learning for Anomaly Detection:

Challenges, Methods, and Opportunities


Guansong Pang1, Longbing Cao2, Charu Aggarwal3
1 Australian Institute for Machine Learning, The University of Adelaide, Australia
2 The Data Science Lab, University of Technology Sydney, Australia
3 IBM T. J. Watson Research Center, United States

March 8, 2021
Tutorial outline

Part 1: Overview of challenges and methods (Charu Aggarwal)
30 min • Introduction to anomaly detection • Problems and challenges • Deep vs. shallow methods • Overview of deep anomaly detection approaches
5 min • Q&A

Part 2: Methods (Guansong Pang)
80 min • The modeling perspective
10 min • Break
15 min • The supervision information perspective
10 min • Implementation and evaluation
5 min • Q&A

Part 3: Conclusions and future opportunities (Longbing Cao)
15 min • Summary of the methods • Six possible directions for future research
10 min • Q&A

2
Part 1: Overview of
Challenges and Methods
• Problem definition and applications
• Challenges
• Deep vs. shallow methods
• Overview of deep anomaly detection approaches
• Taxonomy of methods

3
What are Anomalies?

• Anomalies (a.k.a. outliers, novelties): points that are significantly different from most of the data
✓ Rare
✓ Irregular

Source: Wikipedia

4
Anomaly detection: Problem Variations

Binary output versus scoring
• Binary output generates a yes/no tag
• Preferable and more general: scoring output generates a real-valued score or rank

Multiple ways to define what makes an anomaly different. Three common types of anomalies:
• Point anomalies
• Conditional anomalies
• Group anomalies

Image source: Gupta et al. CIKM Tutorial 2013
5
Real-World Application Domains
• Cybersecurity: attacks, malware, malicious apps/URLs, biometric spoofing
• Social network and web security: false/malicious accounts, false/hate/toxic information
• Video surveillance: criminal activities, road accidents, violence, etc. (e.g., fighting, road accidents, shooting, shoplifting)
• Finance: credit card/insurance frauds, market manipulation, money laundering, etc.
• Healthcare: lesions, tumours, events in IoT/ICU monitoring, etc.
• Industrial inspection: defects, micro-cracks

Image source: UCF-Crime data, MVTec AD data, etc.


6
Scientific Application Domains
• Drug discovery: rare active substances
• Rover-based space exploration: unknown textures
• Astronomy: anomalous events
• High-energy physics: Higgs boson particles
• Material science: exceptional molecule graphs
Application-Specific Complexities
Four key complexities
• Heterogeneity: different anomalies may exhibit completely different expressions, e.g., accidents, robbery vs. explosion events
• Application-specific methodologies: different methodologies are required by different application-specific definitions, e.g., credit card frauds (point anomalies) vs. malicious accounts in social media (group anomalies)
• Unknown nature (unsupervised setting): anomalies remain unknown until they actually occur
• Coverage: difficult to collect data covering all classes of anomalies

Image examples: robbery, accidents, explosion. Source: Wikipedia, UCF-Crime
8
Key Challenges

Challenge #1: Low Anomaly Detection Accuracy


• Rareness and heterogeneity of anomalies in a dataset
• Many returned anomalies are noise or uninteresting instances

Challenge #2: Contextual and High-Dimensional Data


• Anomalies are visible only in the context of implicit relations in temporal, spatial, and graph data
• Increased dimensionality also makes anomaly detection difficult

Challenge #3: Sample-Efficient Learning


• Building generalized detection models with a limited amount of labeled anomaly data

9
Key Challenges

Challenge #4: Noise-Resilient Anomaly Detection


• Data may contain normal and anomalous instances with no labels (anomaly contamination)
• Data may contain weak supervision information:
Coarse anomaly labels such as leveraging video-level labels to detect anomalous frames

Challenge #5: Complex Anomalies


• Conditional/group anomalies
• Multi-modal anomalies

Challenge #6: Anomaly Explanation


• Obtaining cues about why a specific instance is detected as an anomaly by a specific method
• Balancing interpretability and detection accuracy

10
Traditional (Shallow) Methods and Disadvantages

Statistical/probabilistic-based approaches
• Statistical test-based, depth-based, deviation-based
Proximity-based approaches
• Distance-based, density-based, clustering-based
Shallow ML models
• Construct an unsupervised (one-class) analog of a supervised ML model such as the SVM
• Use unsupervised dimensionality reduction methods, e.g., PCA, kernel PCA
Others
• Information-theoretic, subspace methods

Weaknesses
• Weak capability of capturing intricate relationships
• Lots of hand-crafting of algorithms and features [ad hoc]
• Ad hoc nature makes it difficult to incorporate supervision seamlessly

11
Advantages of Deep Learning

Integrates feature learning and anomaly scoring


• Generates a newly learned feature space → replaces uninformative, primitive feature representations [e.g., raw image pixels]
• End-to-end learning → can simultaneously learn features and relevant anomaly scores [no hand-crafting of features]
• Strong feature learning → captures intricate relations [e.g., mid-level image features]
• Diverse neural architectures → tailored to complex domains [e.g., RNNs for time series]
• Unified detection and localization of anomalies → better anomaly explanation, guaranteed by the integration of detection and localization
• Anomaly-informed models with improved accuracy → naturally integrate with labeled data (easy to navigate the spectrum of supervised and unsupervised models)

12
Deep vs Shallow [Traditional]: Example

Deep method: autoencoder vs. shallow method: iForest

13
Deep vs. Shallow [Representation]

Deep methods vs. shallow methods
• Feature space: expressive new space vs. primitive space

14
Deep vs. Shallow: [Algorithm Type]

Deep methods vs. shallow methods
• Feature space: expressive new space vs. primitive space
• Anomaly detection algorithm: defined by NN structure vs. heuristic or ad hoc

15
Deep vs. Shallow [Feature Relations]

Deep methods vs. shallow methods
• Feature space: expressive new space vs. primitive space
• Anomaly detection algorithm: defined by NN structure vs. heuristic or ad hoc
• Feature relations captured: intricate vs. simple

16
Deep vs. Shallow [Feature Learning Methods for Diverse Data Types]

Deep methods vs. shallow methods
• Feature space: expressive new space vs. primitive space
• Anomaly detection algorithm: defined by NN structure vs. heuristic or ad hoc
• Feature relations captured: intricate vs. simple
• Extracting features in diverse types of data: varying architectures and loss functions [e.g., RNN, CNN] vs. hand-crafted feature extractors/off-the-shelf methods
(MLP, CNN, RNN, GNN, etc. vs. random projection, PCA, subgraph patterns, optical flow, etc.)

17
Deep vs. Shallow Methods [Localization]

Deep methods vs. shallow methods
• Feature space: expressive new space vs. primitive space
• Anomaly detection algorithm: defined by NN structure vs. heuristic or ad hoc
• Feature relations captured: intricate vs. simple
• Extracting features in diverse types of data: varying architectures and loss functions [e.g., RNN, CNN] vs. hand-crafted feature extractors/off-the-shelf methods
• Unified anomaly detection and localization: yes vs. no

18
Deep vs. Shallow Methods [Localization]

• Deep: anomaly scores → backpropagation to obtain activation maps (Pang, Guansong, et al. "Self-trained deep ordinal regression for end-to-end video anomaly detection." In: CVPR. 2020.)
• Shallow: model-independent outlying aspect mining (Angiulli, Fabrizio, et al. "Outlying property detection with numerical attributes." Data Mining and Knowledge Discovery 31.1 (2017): 134-163.)

19
Three Principal Categories

• Deep learning for feature extraction: the simplest approaches
• Learning feature representations of normality: anomaly detection-specific feature learning; most methods belong to this category, e.g., autoencoder-, GAN-, and one-class models
• End-to-end anomaly score learning: end-to-end optimization of the pipeline with score learning; often more effective than the other two approaches
Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 20
More Detailed Taxonomy

Three high-level
categories of
methods and 11
fine-grained
subcategories of
methods

Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 21
Categorization Based on Supervision

Unsupervised approach
• Working on anomaly-contaminated unlabeled data; no manually labeled training data
• Limited work done

Semi-supervised approach
• Assuming the availability of a set of manually labeled normal training data
• Most of current deep methods belong to this approach

Weakly-supervised approach
• Assuming we have some labels for anomaly classes, yet the class labels are partial (i.e., they do not span the entire set of anomaly classes), inexact (i.e., coarse-grained labels), or inaccurate (i.e., some given labels can be incorrect)
• Limited work done

22
Supervision: Application Settings

Unsupervised: too costly to collect normal data or anomalies
• Health monitoring, unusual observation discovery in natural science research, etc.

Semi-supervised: lots of normal data obtained easily
• Video surveillance, visual defect detection, etc.

Weakly supervised: a few manually labeled anomalies are easy to obtain
• Fraud detection, intrusion detection, disease detection

23
Part 2: Methods
• The modeling perspective
• Deep learning for feature extraction
• Learning feature representations of normality
• End-to-end anomaly score learning
• Break
• The supervision information perspective
• Unsupervised approach
• Weakly-supervised approach
• Semi-supervised approach
• Implementation and Evaluation

24
Part 2: Methods
• The modeling perspective
• Deep learning for feature extraction ←
• Learning feature representations of normality
• End-to-end anomaly score learning
• Break
• The supervision information perspective
• Unsupervised approach
• Weakly-supervised approach
• Semi-supervised approach
• Implementation and Evaluation

25
Main approach I: Deep learning for
feature extraction
Leveraging existing deep models to extract low-dimensional features for downstream anomaly measures (working purely as feature extraction)
• The feature extraction and the anomaly scoring are fully disjoint
• Assumption: the extracted features preserve the discriminative information that helps separate anomalies from normal instances

General framework
1. Given a dataset 𝒳 = {𝒙₁, 𝒙₂, ⋯, 𝒙_N} with 𝒙ᵢ ∈ ℝ^D, the approach is formulated as 𝒛 = 𝜙(𝒙; Θ), where 𝜙: 𝒳 → 𝒵 is a deep-neural-network-based feature mapping with 𝒵 ⊆ ℝ^K (K ≪ D)
2. An anomaly measure 𝒇, which has no connection to 𝝓, is then applied on the new space to calculate anomaly scores
Two directions: pre-trained models vs directly training deep feature extractors on the target data

26
Direction I: Using pre-trained models

General framework (using pre-trained deep models)
1. First use a pre-trained network 𝝓, such as AlexNet, VGG, or ResNet, to extract low-dimensional features 𝒛
2. Then apply an anomaly measure 𝒇 to calculate the anomaly scores

Applications
This approach is commonly used in image or video anomaly detection; a minimal sketch follows.
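A minimal sketch of this two-step pipeline, assuming torchvision's pre-trained ResNet-18 as the feature extractor 𝝓 and a k-nearest-neighbour distance as the anomaly measure 𝒇 (both choices are illustrative, not prescribed by the tutorial):

```python
import torch
import torchvision.models as models

def build_extractor():
    # phi: pre-trained backbone with the classifier head removed (512-d output);
    # inputs are expected to be normalized per the model's preprocessing
    resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    resnet.fc = torch.nn.Identity()
    return resnet.eval()

@torch.no_grad()
def knn_anomaly_scores(train_imgs, test_imgs, k=5):
    phi = build_extractor()
    z_train = phi(train_imgs)              # (N, 512) features of normal data
    z_test = phi(test_imgs)                # (M, 512)
    d = torch.cdist(z_test, z_train)       # pairwise Euclidean distances
    # f(z): mean distance to the k nearest normal features; larger = more anomalous
    return d.topk(k, dim=1, largest=False).values.mean(dim=1)
```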

27
Example: VGG + Unmasking

Intuition
• Abnormal video frames are more distinguishable than normal video frames when compared with adjacent frames

The model
1. Take two sets of video frames, {t-w, …, t} vs. {t+1, …, t+w}
2. The (pre-trained) VGG model is used to extract features from these video frames
3. Iteratively train a binary classifier to distinguish the two sets, removing at each step the most discriminant features (analogous to unmasking)
4. Use the mean training classification accuracy as the anomaly score; a sketch of the loop follows
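A minimal sketch of the unmasking loop, assuming pre-computed frame features and using scikit-learn's LogisticRegression as a stand-in linear classifier (the classifier choice and feature counts are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def unmasking_score(F_past, F_future, n_rounds=10, k_remove=50):
    # F_past, F_future: (n_frames, n_features) features of the two frame sets;
    # n_features must exceed n_rounds * k_remove
    X = np.vstack([F_past, F_future])
    y = np.r_[np.zeros(len(F_past)), np.ones(len(F_future))]
    active = np.arange(X.shape[1])
    accs = []
    for _ in range(n_rounds):
        clf = LogisticRegression(max_iter=1000).fit(X[:, active], y)
        accs.append(clf.score(X[:, active], y))             # training accuracy
        top = np.argsort(np.abs(clf.coef_[0]))[-k_remove:]  # most discriminant
        active = np.delete(active, top)                     # remove ("unmask") them
    # easy-to-separate (abnormal) windows keep high accuracy across rounds
    return float(np.mean(accs))
```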
Tudor Ionescu, Radu, et al. "Unmasking the abnormal events in video.“ In: ICCV. 2017.
28
Direction II: Training deep feature
extraction models
General framework (learning deep feature extractors using training data)
1. First train a network 𝝓, such as an autoencoder, on the training data to extract low-dimensional features 𝒛
2. Then apply an anomaly measure 𝒇 to calculate the anomaly scores

Applications
Autoencoders are commonly used to instantiate the neural network mapping 𝝓
• Autoencoder (𝝓) + one-class SVM (𝒇) [Xu et al., BMVC 2015]
• Autoencoder (𝝓) + clustering (𝒇) [Yu et al., KDD 2018]
• Autoencoder (𝝓) + unsupervised classification (𝒇) [Tudor Ionescu et al., CVPR 2019]

Tudor Ionescu, Radu, et al. "Object-centric auto-encoders and dummy anomalies for abnormal event detection in video.“ In: CVPR. 2019.
Xu, Dan, et al. "Learning deep representations of appearance and motion for anomalous event detection.“ In: BMVC (2015). 29
Yu, Wenchao, et al. "Netwalk: A flexible deep embedding approach for anomaly detection in dynamic networks.“ In: KDD. 2018.
Section summary

Pros
• Many state-of-the-art (pre-trained) deep models and off-the-shelf anomaly detectors are readily available
• More powerful dimensionality reduction than popular linear methods
• Easy to implement

Cons
• Fully disjointing feature extraction and anomaly scoring can lead to significant loss of anomaly-discriminative information
• Pre-trained deep models are typically limited to specific types of data
• Inherent limitations of existing anomaly measures

30
Part 2: Methods
• The modeling perspective
• Deep learning for feature extraction ←
• Learning feature representations of normality
• End-to-end anomaly score learning
• Break
• The supervision information perspective
• Unsupervised approach
• Weakly-supervised approach
• Semi-supervised approach
• Implementation and Evaluation

31
Part 2: Methods
• The modeling perspective
• Deep learning for feature extraction √
• Learning feature representations of normality ←
• End-to-end anomaly score learning
• Break
• The supervision information perspective
• Unsupervised approach
• Weakly-supervised approach
• Semi-supervised approach
• Implementation and Evaluation

32
Main approach II – Learning feature
representations of normality
Integrating feature learning with anomaly scoring in some way, rather than fully decoupling them as in approach I
• Adapting popular deep approaches for normality feature learning
(𝜓 is a surrogate feature learning function, ℓ is a loss function)

e.g., autoencoder methods:
✓ 𝝓 is the encoder, 𝝍 the decoder, and 𝒇 a reconstruction error-based anomaly score

33
Main approach II – Learning feature
representations of normality
Integrating feature learning with anomaly scoring in some way, rather than fully decoupling them as in approach I
• Adapting popular deep approaches for normality feature learning
(𝜓 is a surrogate feature learning function, ℓ is a loss function)

• Anomaly measure-dependent feature learning
(𝑓 is a traditional anomaly measure, e.g., a one-class measure or the nearest-neighbour distance)

34
Taxonomy: the modeling perspective

Three high-level
categories of
methods and 11
fine-grained
subcategories of
methods

Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 35
Autoencoders

To learn a low-dimensional feature representation space in which the given data instances can be well reconstructed
• Assumption: normal instances can be better reconstructed from the compressed feature space than anomalies

General Framework
1. Bottleneck architecture + reconstruction loss
2. The larger the reconstruction error, the more abnormal the instance; a minimal sketch follows
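A minimal autoencoder anomaly scorer in PyTorch; the architecture, loss, and training loop are illustrative assumptions rather than a specific published configuration:

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, d_in, d_hidden=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(),
                                 nn.Linear(64, d_hidden))   # bottleneck
        self.dec = nn.Sequential(nn.Linear(d_hidden, 64), nn.ReLU(),
                                 nn.Linear(64, d_in))
    def forward(self, x):
        return self.dec(self.enc(x))

def train_ae(model, loader, epochs=20, lr=1e-3):
    # loader is assumed to yield (x,) batches, e.g., from a TensorDataset
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for (x,) in loader:
            loss = ((model(x) - x) ** 2).mean()   # reconstruction loss
            opt.zero_grad(); loss.backward(); opt.step()

@torch.no_grad()
def anomaly_scores(model, x):
    # per-instance reconstruction error; larger = more abnormal
    return ((model(x) - x) ** 2).mean(dim=1)
```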

Image source: Towards Data Science


36
Autoencoders – Replicator neural
network
Seminal work on using autoencoders for anomaly detection
• A feed-forward replicator network (tanh activations) trained to reproduce its input; the anomaly score is the instance's reconstruction error
• Evaluation on KDDCup99 and Wisconsin breast cancer data

Hawkins, Simon, et al. "Outlier detection using replicator neural networks.“ In: DaWaK. 2002.
37
Autoencoders – ensemble method

A set of autoencoders with randomly


connected dense layers
• Increasing the model diversity
• Empirical results on UCI datasets

Chen, Jinghui, et al. "Outlier detection with autoencoder ensembles.“ In: SDM. 2017.
38
Robust deep autoencoders

Intuition: Robust PCA + autoencoders
• The input X can be split into two parts, X = L + S, where L captures normal behaviors and S captures sparse outliers

Robust PCA
• L is a low-rank matrix and S is a sparse matrix; RPCA aims at solving min ρ(L) + λ‖S‖₀ s.t. X = L + S, where ρ(L) is the rank of L, later relaxed into a convex nuclear-norm/ℓ₁ objective

Autoencoder
• The robust deep autoencoder replaces the low-rank modeling of L with an autoencoder reconstruction term, e.g., min ‖L − D(E(L))‖₂ + λ‖S‖₁ s.t. X − L − S = 0; a simplified sketch follows
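A simplified sketch of the alternating scheme behind this idea, assuming a generic autoencoder `ae` and a training helper `train_ae_fn` (both hypothetical); the actual KDD'17 method uses an ADMM-style scheme with ℓ₁ or ℓ₂,₁ projections:

```python
import torch

def soft_threshold(M, lam):
    # elementwise shrinkage: the proximal operator of the l1 norm
    return torch.sign(M) * torch.clamp(M.abs() - lam, min=0)

def rda_split(X, ae, train_ae_fn, lam=0.1, n_outer=10):
    # alternately (i) fit the autoencoder to L = X - S and (ii) move what the
    # autoencoder cannot reconstruct into the sparse outlier part S
    S = torch.zeros_like(X)
    L = X.clone()
    for _ in range(n_outer):
        L = X - S
        train_ae_fn(ae, L)                       # assumed helper: fit ae on L
        with torch.no_grad():
            S = soft_threshold(X - ae(L), lam)   # large residuals -> outliers
    return L, S   # instances with large rows in S are flagged as anomalies
```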

Zhou, Chong, et al. "Anomaly detection with robust deep autoencoders." In: KDD. 2017.
39
Taxonomy: the modeling perspective

Three high-level
categories of
methods and 11
fine-grained
subcategories of
methods

Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 40
Generative Adversarial Networks (GANs)

To adversarially learn a latent space that captures the


normality underlying the given data
• Assumption: Normal data instances can be better generated than anomalies from the latent
feature space of the generative network in GANs

General framework
1. Train a GAN-based model
2. Calculate anomaly scores by looking into the difference between an input instance and its
counterpart generated from the latent space of the generator

41
Example – AnoGAN

Intuition
• Given an instance x, there is generally an instance z in the latent feature space of the generative network such that the generated instance G(z) and x are as similar as possible

The model
1. Train a GAN model using the standard GAN objective function
2. Explicitly search for the instance z in the latent space of normality
3. Compute the anomaly score from a weighted sum of the residual loss and a discrimination (feature-matching) loss, where h is the last feature layer of the discriminator; see the sketch below
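A minimal sketch of the AnoGAN-style latent search, assuming a trained generator `G` and `D_feat`, the discriminator truncated at its last feature layer h; the weighting λ and step counts are illustrative:

```python
import torch

def anogan_score(G, D_feat, x, z_dim, n_steps=200, lr=1e-2, lam=0.1):
    # optimize z so that G(z) matches x
    z = torch.randn(x.size(0), z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_steps):
        g = G(z)
        residual = (x - g).abs().mean()                   # residual loss
        feat = (D_feat(x) - D_feat(g)).abs().mean()       # discrimination loss
        loss = (1 - lam) * residual + lam * feat
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        g = G(z)
        r = (x - g).abs().flatten(1).mean(dim=1)
        f = (D_feat(x) - D_feat(g)).abs().flatten(1).mean(dim=1)
        return (1 - lam) * r + lam * f                    # larger = more anomalous
```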

Schlegl, Thomas, et al. "Unsupervised anomaly detection with generative adversarial networks to guide marker discovery." In: IPMI. Springer, Cham, 2017.
42
Example – EBGAN

Intuition
• To add an extra network that learns the mapping from data instances onto the latent space, i.e.,
an inverse of the generator, to avoid the costly search of the latent instance z

The model
1. Train a bi-directional GAN

2. Anomaly scoring

Zenati, Houssam, et al. "Efficient gan-based anomaly detection." arXiv preprint arXiv:1802.06222 (2018).
Image source: Donahue, Jeff, et al. "Adversarial feature learning." In: ICLR, 2017. 43
Taxonomy: the modeling perspective

Three high-level
categories of
methods and 11
fine-grained
subcategories of
methods

Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 44
Predictability modeling

Learn representations by using temporally adjacent instances


as the context to predict the current/future instances
• Assumption: Normal instances are temporally more predictable than anomalies

General framework
1. Train a current/future instance prediction network
2. Calculate the difference between the predicted instance and the actual instance as the anomaly score

45
Example – Future frame prediction
Intuition
• Leverage the difference between a predicted future frame and its ground truth to detect an abnormal event in video data

The model
• Appearance (spatial) constraints: intensity and gradient differences between the predicted and true frames
• Motion (temporal) constraint: optical flow difference

Anomaly scoring: based on the quality of the frame prediction (e.g., its PSNR); see the sketch below
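A minimal sketch of PSNR-based scoring for a trained frame predictor, following the normalized-PSNR scheme of Liu et al. (CVPR'18); `predictor`, the frame layout, and batch-wise normalization are assumptions:

```python
import torch

@torch.no_grad()
def psnr_anomaly_scores(predictor, clips, targets):
    # clips: stacked past frames per sample; targets: the true next frames,
    # both with pixel values assumed to be scaled to [0, 1]
    pred = predictor(clips)                          # (B, C, H, W)
    mse = ((pred - targets) ** 2).mean(dim=(1, 2, 3))
    psnr = 10 * torch.log10(1.0 / (mse + 1e-8))      # higher PSNR = better prediction
    # min-max normalize (the paper normalizes per video) and invert so that
    # poorly predicted frames receive high anomaly scores
    s = (psnr - psnr.min()) / (psnr.max() - psnr.min() + 1e-8)
    return 1 - s
```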

Liu, Wen, et al. "Future frame prediction for anomaly detection–a new baseline." In: CVPR. 2018.
Ye, Muchao, et al. "Anopcn: Video anomaly detection via deep predictive coding network." In: ACM Multimedia. 2019. 46
Taxonomy: the modeling perspective

Three high-level
categories of
methods and 11
fine-grained
subcategories of
methods

Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 47
Self-supervised classification

Learn representations of normality by self-supervised classification with different data augmentation operations
• Assumption: normal instances are more consistent with self-supervised classifiers than anomalies

General framework
1. Apply different augmentation operations to the data
2. Learn a multi-class classification model using instances
augmented with the same operation as one class
3. Calculate the inconsistency of the instance to the model
as anomaly score

Image source: Wang, Siqi, et al. "Effective End-to-end Unsupervised Outlier Detection via Inlier Priority of Discriminative Network.“ In: NeurIPS. 2019.
48
Example – Image geometric transformations

Intuition
• Train a multi-class model to discriminate between dozens of geometric transformations applied to all the given images

The model
• Self-labeling with compositions of horizontal flipping, translations, and rotations, resulting in 72 distinct transformations
• Training a 72-class deep classification model with a standard cross-entropy loss function
• Using softmax statistics to calculate the normality score, where 𝑇ᵢ is one type of composite geometric transformation; see the sketch below
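A minimal sketch of this scoring, restricted to 8 flip/rotation compositions for brevity (the full method uses 72 transformations including translations); `clf` is an assumed trained classifier that predicts which transformation index was applied:

```python
import torch

def transformations():
    # 8 of the 72 compositions: horizontal flip (or not) x 4 rotations
    ops = []
    for flip in (False, True):
        for k in range(4):  # rotate by k * 90 degrees
            ops.append(lambda x, f=flip, k=k: torch.rot90(
                torch.flip(x, dims=[3]) if f else x, k, dims=[2, 3]))
    return ops

@torch.no_grad()
def normality_score(clf, x):
    # x: a batch of images (B, C, H, W); higher score = more normal
    ops = transformations()
    probs = []
    for i, t in enumerate(ops):
        p = torch.softmax(clf(t(x)), dim=1)[:, i]  # P(label = T_i | T_i(x))
        probs.append(p)
    return torch.stack(probs, dim=1).mean(dim=1)
```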


Golan, Izhak, et al. "Deep anomaly detection using geometric transformations.“ In: NeurIPS. 2018.
49
Section summary

Pros
• Can leverage existing deep autoencoder/GAN/predictability modeling/self-supervised classification models for anomaly detection
• The learned representations are generally more effective than those of approach I

Cons
• Some of the methods are limited to specific types of data
• Methods like GAN/predictability modeling are computationally costly at the training stage
• Most methods are sensitive to anomaly contamination; they cannot work in unsupervised settings

50
Taxonomy: the modeling perspective

Three high-level
categories of
methods and 11
fine-grained
subcategories of
methods

Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 51
Distance-based measure

Learning representations tailored for distance-based measures


• Assumption: Anomalies are distributed far from their closest neighbors while normal
instances are located in dense neighborhoods
Distance-based measure

The general framework


1. Devise a feature mapping function 𝜙 that maps original data onto
a new representation space
2. Optimize the feature representations such that anomalies have larger
distance to some reference instances than normal instances
3. Anomaly scoring using the desired distance measure in the new space

52
Deep random distance-based method - REPEN

Intuition
• Learning representations tailored for the random distance-based measure
What is the random distance-based measure?
𝑠_𝐱 = min_{𝐱′ ∈ 𝒮} ‖𝐱 − 𝐱′‖₂
where 𝒮 is a small random data subset

And why is it used?
• Provably and empirically effective
• But less effective on high-dimensional data

How to learn tailored representations?
• An anomaly query network; a sketch of the underlying measure follows
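A minimal sketch of the random nearest-neighbour distance measure that REPEN builds on (LeSiNN/Sp-style), averaged over several random subsets for stability; the subset size and count are illustrative defaults:

```python
import numpy as np

def random_nn_scores(X, n_subsets=50, subset_size=8, seed=None):
    # score = distance to the nearest point in a small random subset S,
    # averaged over n_subsets independent subsets
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(X))
    for _ in range(n_subsets):
        S = X[rng.choice(len(X), size=subset_size, replace=False)]
        d = np.linalg.norm(X[:, None, :] - S[None, :, :], axis=2)  # (N, |S|)
        scores += d.min(axis=1)
    return scores / n_subsets   # larger = more anomalous
```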
*Sugiyama, M., & Borgwardt, K. Rapid distance-based outlier detection via sampling. In: NeurIPS, 26, 467-475. 2013.
*Pang, Guansong, et al. "LeSiNN: Detecting anomalies by identifying least similar nearest neighbours." In: ICDMW, 2015.
53
REPEN – The model
Goal: the representations of pseudo-normal instances 𝒙′ should have smaller random nearest-neighbour distances 𝑓Θ(𝒙′) than those 𝑓Θ(𝒙) of pseudo-anomalies 𝒙
1. Use off-the-shelf detectors to obtain pseudo labels, where 𝒜 and 𝒩 are the anomaly and normal candidate sets, respectively
2. Optimize an anomaly query network by minimizing a margin-based ranking loss, where 𝑓 returns the nearest-neighbour distance of 𝒙 in a random data subset S in the learned representation space
3. During inference, the same 𝑓 function is used to calculate the random nearest-neighbour distance as the anomaly score

Pang, Guansong, et al. "Learning representations of ultrahigh-dimensional data for random distance-based outlier detection.“ In: KDD. 2018.
54
REPEN – Effectiveness in real-world data

• Significantly better AUC performance


• Up to two orders of magnitude faster in online detection

IMP: Relative improvement of REPEN over ORG; SU: Speed-up of REPEN over ORG
Pang, Guansong, et al. "Learning representations of ultrahigh-dimensional data for random distance-based outlier detection.“ In: KDD. 2018.
55
Taxonomy: the modeling perspective

Three high-level
categories of
methods and 11
fine-grained
subcategories of
methods

Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 56
One-class classification measure

Learning representations tailored for one-class classification


• Assumption: All normal instances come from a single (abstract) class and can be summarized by a
compact model, to which anomalies do not conform
One-class classification
measure
The general framework
1. Devise a feature mapping function 𝜙 that maps original data onto
a new representation space
2. Optimize the feature representations using one-class classification loss
3. Anomaly scoring using the one-class classification model in the new space

57
Example – Deep support vector data
description (Deep SVDD)
Intuition
• To learn feature representations tailored for SVDD-based anomaly detection

The model
• Soft-boundary objective: learn a hypersphere of radius r around a center c, with penalties for points falling outside the boundary
• Hard-boundary (one-class) objective: simply minimize the mean squared distance of all representations to the center c; see the sketch below
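A minimal sketch of the hard-boundary Deep SVDD variant, assuming an embedding network `phi` and a data loader of unlabeled (mostly normal) batches; fixing the center c to the mean of initial representations follows Ruff et al.:

```python
import torch

def fit_deep_svdd(phi, loader, epochs=20, lr=1e-3):
    # center c: mean of the initial representations (kept fixed afterwards)
    with torch.no_grad():
        c = torch.cat([phi(x) for (x,) in loader]).mean(dim=0)
    opt = torch.optim.Adam(phi.parameters(), lr=lr, weight_decay=1e-5)
    for _ in range(epochs):
        for (x,) in loader:
            loss = ((phi(x) - c) ** 2).sum(dim=1).mean()  # mean sq. distance to c
            opt.zero_grad(); loss.backward(); opt.step()
    return c

@torch.no_grad()
def svdd_scores(phi, c, x):
    return ((phi(x) - c) ** 2).sum(dim=1)  # larger distance = more anomalous
```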

Ruff, Lukas, et al. "Deep one-class classification.“ In: ICML. 2018.


58
Example – Deep SVDD

AUC results on MNIST and CIFAR-10

Ruff, Lukas, et al. "Deep one-class classification.“ In: ICML. 2018.


59
Taxonomy: the modeling perspective

Three high-level
categories of
methods and 11
fine-grained
subcategories of
methods

Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 60
Cluster-based measure

Learning representations so that anomalies clearly deviate from the clusters in the newly learned representation space
• Assumption: normal instances have stronger adherence to clusters than anomalies

Cluster-based measure
The general framework
1. Devise a feature mapping function 𝜙 that maps original data onto
a new representation space
2. Optimize the feature representations using clustering-based loss
3. Anomaly scoring using a cluster-based anomaly measure in the new space

61
Example – Deep autoencoding gaussian
mixture model (DAGMM)
Intuition
• Learn low-dimensional representations tailored for a Gaussian mixture model (GMM)

The model
• An autoencoder compression network, whose compressed features are concatenated with reconstruction-error features
• A cluster (mixture) membership estimation network that outputs the mixture probabilities, means, and covariances, from which a sample energy is derived
• Objective function: reconstruction error (RE) + sample energy + a penalty on the diagonal covariance entries


Zong, Bo, et al. "Deep autoencoding gaussian mixture model for unsupervised anomaly detection.“ In: ICLR. 2018.
62
Section summary

Pros
• Strong foundation from traditional anomaly measures (distance-/one-class classification-/cluster-based measures) in the literature
• Working on low-dimensional feature representations that are specifically optimized for the anomaly measures, resulting in more effective detection

Cons
• The performance of anomaly detection is heavily dependent on the specific anomaly measures, inheriting the limitations of those measures
• The clustering process may be biased by contaminated anomalies in the training data, which in turn leads to less effective representations

63
Part 2: Methods
• The modeling perspective
• Deep learning for feature extraction √
• Learning feature representations of normality ←
• End-to-end anomaly score learning
• Break
• The supervision information perspective
• Unsupervised approach
• Weakly-supervised approach
• Semi-supervised approach
• Implementation and Evaluation

64
Part 2: Methods
• The modeling perspective
• Deep learning for feature extraction √
• Learning feature representations of normality √
• End-to-end anomaly score learning ←
• Break
• The supervision information perspective
• Unsupervised approach
• Weakly-supervised approach
• Semi-supervised approach
• Implementation and Evaluation

65
Main approach III – End-to-end anomaly
score learning
Directly learn anomaly scores in an end-to-end fashion
• Has a neural network that directly learns scalar anomaly scores
• Uses (surrogate) loss functions for anomaly ranking/classification
• Generally requires supervision in the form of (synthetic or real) anomaly data
• Not dependent on existing anomaly measures

• Formally, the general formulation is 𝑠_𝐱 = 𝜏(𝐱; Θ), where 𝜏: 𝒳 → ℝ is an end-to-end anomaly scoring network

66
Taxonomy: the modeling perspective

Three high-level
categories of
methods and 11
fine-grained
subcategories of
methods

Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 67
Ranking models

Learn a ranking model that is associated with the absolute/relative ordering relation of instance abnormality
• Assumption: there exists an observable ordinal variable that captures some data abnormality

The general framework
1. Define the (synthetic) ordinal variable
2. Use the variable to define a surrogate loss function for anomaly ranking and train the detection model
3. Given a test instance, the model directly gives its anomaly score

68
Ranking models – Deep ordinal regression (SDOR)

Intuition
• Use self-training to iteratively learn the anomaly scores via deep ordinal regression

The model
1. Use initial anomaly scores to produce pseudo anomaly and normal sets 𝒜 and 𝒩
2. Create the ordinal class labels 𝑐₁ and 𝑐₂ for 𝒜 and 𝒩 (𝑐₁ > 𝑐₂)
3. Learn the scoring model 𝜏 with these labels, where ℓ is an MAE loss function
4. Use 𝜏 to update the initial scores and repeat steps 1-3 for a fixed number of (4-5) iterations; a sketch of the loop follows
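A minimal sketch of this self-training loop, assuming a scoring network `tau` with a single output unit and seed scores from any initial detector; the pseudo-label fraction, the targets c1 = 1 and c2 = 0, and the inner step count are illustrative:

```python
import torch

def self_train(tau, X, init_scores, n_iters=5, top_frac=0.05, lr=1e-3):
    # init_scores: seed anomaly scores from an initial detector, shape (N,)
    scores = init_scores
    opt = torch.optim.Adam(tau.parameters(), lr=lr)
    for _ in range(n_iters):
        k = max(1, int(top_frac * len(X)))
        idx = scores.argsort(descending=True)
        A, N = X[idx[:k]], X[idx[-k:]]              # pseudo anomalies / normals
        xb = torch.cat([A, N])
        y = torch.cat([torch.ones(len(A)), torch.zeros(len(N))])  # c1 > c2
        for _ in range(100):                         # inner optimization steps
            loss = (tau(xb).squeeze(1) - y).abs().mean()  # MAE, as on the slide
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            scores = tau(X).squeeze(1)               # re-score and repeat
    return tau
```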
Pang, Guansong, et al. "Self-trained deep ordinal regression for End-to-End video anomaly detection.“ In: CVPR. 2020.
69
Ranking models – Deep ordinal regression (SDOR)

Human-in-the-loop detection Anomaly explanation

Pang, Guansong, et al. "Self-trained deep ordinal regression for End-to-End video anomaly detection.“ In: CVPR. 2020.
70
Ranking models – Pairwise relation prediction (PReNet)

Intuition
• Learn the anomaly scores by predicting the relation of any instance pair drawn from a few labeled anomalies and unlabeled instances

The model
• Let 𝒜 be the small labeled anomaly set and 𝒰 be the large unlabeled data set
• Create an ordinal class label based on three pairwise relations: a-a, a-u, u-u; see the sketch below
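A minimal sketch of PReNet-style pair construction; the ordinal targets below (2/1/0) are illustrative stand-ins, not necessarily the values used in the paper:

```python
import random

def sample_pairs(A, U, n_pairs):
    # ordinal targets: anomaly-anomaly > anomaly-unlabeled > unlabeled-unlabeled
    pairs = []
    for _ in range(n_pairs):
        r = random.choice(["aa", "au", "uu"])
        if r == "aa":
            pairs.append((random.choice(A), random.choice(A), 2.0))
        elif r == "au":
            pairs.append((random.choice(A), random.choice(U), 1.0))
        else:
            pairs.append((random.choice(U), random.choice(U), 0.0))
    return pairs  # a regression network is then trained to predict the target
                  # for each pair; its output on (x, u) pairs serves as the score
```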

Pang, Guansong, et al. "Deep weakly-supervised anomaly detection." arXiv


preprint:1910.13601 (2019). 71
Taxonomy: the modeling perspective

Three high-level
categories of
methods and 11
fine-grained
subcategories of
methods

Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 72
Prior-driven models

Impose a prior over the anomaly scores to drive the anomaly


score learning
• Assumption: The imposed prior captures the underlying (ab)normality of the dataset

The general framework


1. Impose a prior over the weight parameters of a neural network-based anomaly scoring
measure, or over the expected anomaly scores
2. Optimize the anomaly ranking/classification with the prior
3. Given a test instance, the model directly gives its anomaly score

73
Deviation networks (DevNet)

Intuition
• Learn an end-to-end anomaly scoring function 𝜏 so that the anomaly scores of a few labeled anomalies are larger than those of unlabeled data points

The model
1. Assume anomaly scores follow a Gaussian distribution 𝒩(𝜇, 𝜎²); a set of reference scores drawn from this prior gives 𝜇_ℛ and 𝜎_ℛ
2. Minimize a deviation loss 𝐿(𝜏(𝐱; Θ), 𝜇_ℛ, 𝜎_ℛ) that guarantees: 1) the anomaly scores of unlabeled data points distribute around 𝜇, and 2) the anomaly scores of anomalies have at least an 𝛼·𝜎 deviation from 𝜇, where 𝑦_𝐱 = 1 if 𝐱 is a labeled anomaly and 𝑦_𝐱 = 0 otherwise
3. Anomaly scoring using 𝑠_𝐱 = 𝜏(𝐱; Θ); see the sketch below
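A minimal sketch of the deviation loss following the KDD'19 formulation; the margin a = 5 matches the paper's default, while the reference sample size here is illustrative:

```python
import torch

def deviation_loss(scores, y, a=5.0, n_ref=5000):
    # reference scores drawn from the Gaussian prior N(0, 1)
    ref = torch.randn(n_ref)
    dev = (scores - ref.mean()) / ref.std()          # standardized deviation
    inlier_loss = torch.abs(dev)                     # pull unlabeled data toward mu
    anomaly_loss = torch.clamp(a - dev, min=0)       # push anomalies >= a*sigma above mu
    # y = 1 for labeled anomalies, y = 0 for unlabeled data
    return ((1 - y) * inlier_loss + y * anomaly_loss).mean()
```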
Pang, Guansong, et al. "Deep anomaly detection with deviation networks." In: KDD, pp. 353-362. 2019.
74
Effectiveness in real-world multi-dimensional data sets

AUC comparison against REPEN (Pang et al., KDD'18), DSVDD (Ruff et al., ICML'18), FSNet (Snell et al., NeurIPS'17), and iForest (Liu et al., ICDM'08)
75
Deviation loss on other types of data

Image data
• For screening COVID-19 on chest X-ray images (Zhang, Jianpeng, et al. "Viral Pneumonia Screening on Chest X-rays Using Confidence-Aware Anomaly Detection." IEEE Transactions on Medical Imaging (2020).)

Graph data
• For abnormal node detection (Ding, Kaize, et al. "Few-shot Network Anomaly Detection via Cross-network Meta-learning." In: The Web Conference (2021).)

76
Taxonomy: the modeling perspective

Three high-level
categories of
methods and 11
fine-grained
subcategories of
methods

Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 77
Softmax likelihood models

Learn anomaly scores by maximizing the likelihood of events in


the training data
• Assumption: Anomalies and normal instances are respectively low- and high-probability events

The general framework


1. The probability of an event is modeled using a softmax function

2. The parameters are then learned by a maximum likelihood function

3. Given a test instance, the model directly gives its anomaly score by the event probability

78
Softmax likelihood models – APE

Intuition
• Leverage pairwise feature interactions to estimate the likelihood of events with categorical features, using Noise-Contrastive Estimation (NCE)

The model
• A softmax-based scoring (density) function 𝑝(𝐱; Θ) is learned via a synthetic binary classification task
• NCE 'noise' samples 𝐱′ are generated by univariate extrapolation of 𝐱, with the noise probability 𝑄(𝐱′) estimated based on 𝑝(𝐱; Θ)
Chen, Ting, et al. "Entity embedding-based anomaly detection for heterogeneous categorical events.“ In: IJCAI. 2016.
Fan, Shaohua, et al. "Abnormal event detection via heterogeneous information network embedding.“ In: CIKM. 2018. 79
Taxonomy: the modeling perspective

Three high-level
categories of
methods and 11
fine-grained
subcategories of
methods

Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 80
End-to-end one-class classification

Train a one-class classifier that learns to discriminate whether a given instance is normal or not in an end-to-end fashion
• Assumptions: (i) data instances that approximate anomalies can be effectively synthesized; (ii) all normal instances can be summarized by a discriminative one-class model

The general framework
• Generate artificial outliers
• Train a GAN (either a specifically designed GAN or a generic GAN) to discriminate whether a given instance is normal or an artificial outlier

Image source: Ngo, Phuc Cuong, et al. "Fence GAN: Towards better anomaly detection." In: ICTAI. 2019.
81
End-to-end one-class classification - OCAN

Intuition
• Use the generator of a 'bad' GAN to generate complementary samples, instead of matching the original data distribution; these are then used to train a one-class discriminator that distinguishes normal instances from the generated complementary instances

The model
• The generator in the complementary GAN targets a complementary distribution of the normal data
• The discriminator is trained as in a regular GAN, with the complementary samples as the fake class
Zheng, Panpan, et al. "One-class adversarial nets for fraud detection.“ In: AAAI. 2019.
82
Section summary

Pros
• The anomaly scoring/ranking/classification is optimized in an end-to-end fashion, normally more effective than the other two approaches
• Does not depend on any existing anomaly measures

Cons
• At least some form of labeled/synthetic anomalies is required, which may not be applicable where such labeled anomalies are not available
• Since the models are exclusively fitted to the few labeled anomalies, they may not generalize to unseen anomalies that exhibit different abnormal features from the labeled anomalies

83
Part 2: Methods
• The modeling perspective
• Deep learning for feature extraction √
• Learning feature representations of normality √
• End-to-end anomaly score learning √
• Break ← 10 min
• The supervision information perspective
• Unsupervised approach
• Weakly-supervised approach
• Semi-supervised approach
• Implementation and Evaluation

84
Part 2: Methods
• The modeling perspective
• Deep learning for feature extraction √
• Learning feature representations of normality √
• End-to-end anomaly score learning √
• Break √
• The supervision information perspective
• Unsupervised approach ←
• Weakly-supervised approach
• Semi-supervised approach
• Implementation and Evaluation

85
Unsupervised approach
Training on anomaly-contaminated unlabeled data
Outlier-aware autoencoders
• Robust deep autoencoders (RDA, KDD'17)

One-class models with a soft boundary
• Deep SVDD (ICML'18)

Pseudo labeling
• Deep distance-based method (REPEN, KDD'18)
• Deep ordinal regression (DOR, CVPR'20)

Augmented deep clustering
• DAGMM (ICLR'18)

86
Part 2: Methods
• The modeling perspective
• Deep learning for feature extraction √
• Learning feature representations of normality √
• End-to-end anomaly score learning √
• Break √
• The supervision information perspective
• Unsupervised approach ←
• Weakly-supervised approach
• Semi-supervised approach
• Implementation and Evaluation

87
Part 2: Methods
• The modeling perspective
• Deep learning for feature extraction √
• Learning feature representations of normality √
• End-to-end anomaly score learning √
• Break √
• The supervision information perspective
• Unsupervised approach √
• Weakly-supervised approach ←
• Semi-supervised approach
• Implementation and Evaluation

88
Weakly-supervised approach 1/2
A limited number of partially labeled anomalies and large unlabeled data
Contrastive feature learning
• Deep distance-based method (REPEN, KDD'18)

Prior-driven method
• Deviation network (DevNet, KDD'19)

Surrogate learning
• Pairwise relation prediction (PReNet, arXiv'19)

Reinforcement learning*

A limited number of partially labeled anomalies + large unlabeled data + some labeled normal data
• Deep SAD+, built upon Deep SVDD

*Pang, Guansong, et al. "Deep Reinforcement Learning for Unknown Anomaly Detection." arXiv preprint:2009.06847 (2020).
+Ruff, Lukas, et al. "Deep semi-supervised anomaly detection." arXiv preprint arXiv:1906.02694 (2019).
89
Weakly-supervised approach 2/2
Inexact anomaly labels (coarse-grained labels)
Multiple instance learning
• Problem setting: given a large set of videos with video-level labels of anomaly and normal classes, we aim to learn detection models to identify abnormal video frames
• Anomaly-labeled videos form positive bags; normal videos form negative bags

Tian, Yu, et al. "Weakly-supervised Video Anomaly Detection with Contrastive Learning of Long and Short-range Temporal Features." arXiv preprint:2101.10030 (2021).
Sultani, Waqas, et al. "Real-world anomaly detection in surveillance videos.“ In: CVPR. 2018.
90
Part 2: Methods
• The modeling perspective
• Deep learning for feature extraction √
• Learning feature representations of normality √
• End-to-end anomaly score learning √
• Break √
• The supervision information perspective
• Unsupervised approach √
• Weakly-supervised approach ←
• Semi-supervised approach
• Implementation and Evaluation

91
Part 2: Methods
• The modeling perspective
• Deep learning for feature extraction √
• Learning feature representations of normality √
• End-to-end anomaly score learning √
• Break √
• The supervision information perspective
• Unsupervised approach √
• Weakly-supervised approach √
• Semi-supervised approach ←
• Implementation and Evaluation

92
Semi-supervised approach
Training on a large labeled normal dataset
• All methods in the 'learning feature representations of normality' category
• Many methods in the 'end-to-end anomaly score learning' category

Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 93
Part 2: Methods
• The modeling perspective
• Deep learning for feature extraction √
• Learning feature representations of normality √
• End-to-end anomaly score learning √
• Break √
• The supervision information perspective
• Unsupervised approach √
• Weakly-supervised approach √
• Semi-supervised approach ←
• Implementation and Evaluation

94
Part 2: Methods
• The modeling perspective
• Deep learning for feature extraction √
• Learning feature representations of normality √
• End-to-end anomaly score learning √
• Break √
• The supervision information perspective
• Unsupervised approach √
• Weakly-supervised approach √
• Semi-supervised approach √
• Implementation and Evaluation ←

95
Implementation of representative algorithms

Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 96
Source codes of representative algorithms

Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 97
Publicly available datasets with real anomalies

Collection of continuously updated preprocessed datasets is made available at


https://github.com/GuansongPang/anomaly-detection-datasets
Pang, Guansong, et al. Deep learning for anomaly detection: A review. ACM Computing Survey 54, 2,
Article 38 (March 2021), 38 pages. https://doi.org/10.1145/3439950. arXiv preprint. 98
Part 3: Conclusions and
future opportunities
• Summary of the methods
• Six possible directions for future research

99
Part 3: Conclusions and
future opportunities
• Summary of the methods ←
• Six possible directions for future research

100
Summary of the methods

Challenges tackled (methods against challenges). Column abbreviations follow the taxonomy: FE (feature extraction), AE (autoencoder), GAN, PD (predictability modeling), SC (self-supervised classification), DM (distance-based measure), OCM (one-class classification measure), CM (cluster-based measure), RM (ranking model), PM (prior-driven model), SLM (softmax likelihood model), EOC (end-to-end one-class classification)

ID Challenge FE AE GAN PD SC DM OCM CM RM PM SLM EOC
CH1 Low detection recall √ √ √ √ √ √ √ √ √ √ √ √
CH2 Handling complex data √ √ √ √ √ √ √ √ √ √ √ √
CH3 Data-efficient learning √ √ √ √
CH4 Noise-resilient √ √ √ √ √ √
CH5 Complex anomalies √ √ √
CH6 Anomaly explanation √

101
Part 3: Conclusions and
future opportunities
• Summary of the methods √
• Six possible directions for future research

102
Part 3: Conclusions and
future opportunities
• Summary of the methods √
• Six possible directions for future research ←

103
Direction #1 – Exploring anomaly-
supervisory signals
Unsupervised
• Data reconstruction, generator-discriminator, pseudo class labels, etc.
Self-supervised
• Self-supervised classification, future prediction, etc.
Anomaly measure-driven
• Presuming some distribution of normal/anomalous data, e.g., one-class, cluster, distance, etc.

Are there other more effective sources of supervisory signals?


Domain-driven anomaly detection?
• Application-specific knowledge of anomaly
• Expert rules, etc.

104
Direction #2 – Deep weakly-supervised
anomaly detection
Few-shot anomaly detection or data-efficient anomaly detection
• Leveraging a few anomaly examples to perform anomaly-informed detection
• Data efficiency?
• Overfitting?

Unknown anomaly detection


• To generalize from the limited labeled anomalies to novel classes of anomaly

Learning detection models with coarse-grained anomaly labels


• How to effectively leverage such label information

105
Direction #3 – Large-scale normality
learning
Large-scale unsupervised/self-supervised representation
learning specifically designed for anomaly detection
• Any anomaly contamination in the large-scale data?

• Knowledge transferable across different domains?

• How about different types of datasets?

• How about different types of anomalies?

106
Direction #4 – Deep detection of
complex anomalies
Deep models for conditional/group anomalies
• Capturing complex temporal/spatial dependence
• Learning representations of a set of unordered data points

Multimodal anomaly detection


• Excellent capability in learning feature representations from different types of raw data
• Flexible feature representation fusion

107
Direction #5 – Interpretable and
actionable deep anomaly detection
Interpretable deep anomaly detection
• Deep models with inherent capability (via activation/attention maps) to provide straightforward
anomaly explanation

Actionable deep anomaly detection


• Quantifying the impact of detected anomalies and mitigation actions

108
Direction #6 – Novel applications and
settings
Out-of-distribution (OOD) detection
• Accurate classification while being able to detect any data instances that are drawn far away from the given training distribution, e.g., for safety in autonomous systems

Curiosity learning
• Curiosity-driven exploration: encouraging reinforcement learning agents to explore novel states, e.g., in Montezuma's Revenge

Non-i.i.d. anomaly detection
Detection of adversarial examples
Anti-spoofing in biometric systems
Anomaly detection in scientific data
109
Thank you!

Q&A

110
