SMOTE
Dr. Trilok Nath Pandey
Imbalanced Datasets
You trained a model to predict cancer from image data using a
state-of-the-art hierarchical Siamese CNN with dynamic kernel
activations…
Your model has an accuracy of 99.9%.
An imbalanced classification problem is an example of a classification problem
where the distribution of examples across the known classes is biased or
skewed.
By looking at the confusion matrix you realize that the model
does not detect any of the positive examples.
After plotting your class distribution you see that you have
thousands of negative examples but just a couple of positives.
[Figure: class distribution plot, negatives vs. positives]
Classifiers try to reduce the overall error so they can be biased
towards the majority class.
# Negatives = 998
# Positives = 2
By always predicting the negative class, the accuracy will be 998/1000 = 99.8%.
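
A minimal sketch of this trap, assuming scikit-learn and the hypothetical 998/2 split above:

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 998 negatives, 2 positives.
y_true = np.array([0] * 998 + [1] * 2)
# A "model" that always predicts the negative class.
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))   # 0.998 -> looks excellent
print(confusion_matrix(y_true, y_pred))
# [[998   0]
#  [  2   0]]  -> the predicted-positive column is all zeros
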
Your dataset is imbalanced!!!
Now What???
The Class Imbalance Problem
• Data sets are said to be balanced if there are approximately as
many positive examples of the concept as there are negative
ones.
• There exist many domains that have unbalanced data sets.
• Examples:
a) Helicopter Gearbox Fault Monitoring
b) Discrimination between Earthquakes and Nuclear Explosions
c) Document Filtering
d) Detection of Oil Spills
e) Detection of cancerous cells
f) Detection of Fraudulent Telephone Calls
g) Detection of hotspots in ASIC/FPGA Placements
h) Detection of unrouteable designs in VLSI
i) Fraud and default prediction
j) Mail Spam Detection
The Class Imbalance Problem
• The problem with class imbalances is that standard
learners are often biased towards the majority class.
• That is because these classifiers attempt to reduce
global quantities such as the error rate, not taking
the data distribution into consideration.
• As a result, examples from the overwhelming class
are well-classified whereas examples from the
minority class tend to be misclassified.
The Class Imbalance Problem
For classification problems, we often use accuracy as
the evaluation metric.
It is easy to calculate and intuitive:
• Accuracy = # of correct predictions / # of total predictions
But it is misleading for highly imbalanced datasets!
For example, in credit card fraud detection, we can set
a model to always classify new transactions as legit.
The accuracy would be as high as 99.0% if 99.0% of the
transactions in the dataset are legit.
But, don’t forget that our goal is to detect fraud, so such
a model is useless.
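
A minimal sketch of that point, assuming scikit-learn and a hypothetical 99:1 legit/fraud split:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical transactions: 990 legit (0), 10 fraudulent (1).
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros_like(y_true)          # always classify as legit

print(accuracy_score(y_true, y_pred))   # 0.99 -> misleadingly high
print(recall_score(y_true, y_pred))     # 0.0  -> not a single fraud detected
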
Solutions
Solutions to Imbalanced Learning
• Data level: sampling methods
• Algorithmic level: cost-sensitive methods
• Kernel and active learning methods
Several Common Approaches
At the data Level: Re-Sampling
Oversampling (Random or Directed)
o Add more examples to minority class
Undersampling (Random or Directed)
o Remove samples from majority class
At the Algorithmic Level:
Adjusting the Costs or weights of classes
Adjusting the decision threshold / probabilistic
estimate at the tree leaf
Most machine learning models provide a class-weight parameter for exactly this, as sketched below.
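
A minimal sketch of both algorithmic-level knobs, assuming scikit-learn (the data and the 0.2 threshold are purely illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = np.array([0] * 980 + [1] * 20)
X[y == 1] += 2.0                        # shift the minority class

# Cost adjustment: 'balanced' weighs classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Threshold adjustment: flag positives above 0.2 instead of the default 0.5.
proba = clf.predict_proba(X)[:, 1]
y_pred = (proba >= 0.2).astype(int)
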
Sampling Methods
Create balance through sampling
If the data is imbalanced… modify the data distribution to create a balanced dataset.
• A widely adopted technique for dealing with highly imbalanced
datasets is called resampling.
1. Removing samples from the majority class (under-sampling).
2. Adding more examples to the minority class (over-sampling).
3. Or perform both simultaneously:
• Under-Sample the majority &
• Over-Sample the minority
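
A minimal sketch of the first two options, assuming the imbalanced-learn library (imported as imblearn) and a toy 8:2 dataset:

import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X = np.arange(20).reshape(10, 2)        # 10 toy samples, 2 features
y = np.array([0] * 8 + [1] * 2)         # 8 majority, 2 minority

X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(np.bincount(y_over))    # [8 8] -> minority examples duplicated
print(np.bincount(y_under))   # [2 2] -> majority examples discarded

Chaining the two samplers (over-sample the minority partway, then under-sample the majority) covers the combined option.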
Sampling Methods
Create balance through sampling
Random oversampling may just replicate records already in the dataset, which can cause overfitting!
Random undersampling discards data, which can cause loss of information!
What are the advantages and disadvantages of under-sampling and oversampling?
SMOTE
SMOTE: Resampling Approach
SMOTE stands for:
Synthetic Minority Oversampling Technique
It is a technique designed by Chawla et al. in 2002.
SMOTE is an oversampling method that synthesizes new
plausible examples in the minority class.
SMOTE not only increases the size of the training set,
it also increases the variety!!
SMOTE currently yields the best results as far as re-sampling and modifying the probabilistic estimate techniques go (Chawla, 2003).
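
A minimal usage sketch of imbalanced-learn's SMOTE on synthetic toy data (note that k_neighbors must be smaller than the number of minority samples):

import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 94 + [1] * 6)        # only 6 minority samples

# k_neighbors=5 is valid here because the minority class has 6 samples.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(np.bincount(y_res))               # [94 94] -> classes now balanced
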
SMOTE’s Informed Oversampling Procedure
For each Minority Sample
I. Find its k-nearest minority neighbors
II. Randomly select j of these neighbors
III. Randomly generate synthetic samples along the
lines joining the minority sample and its j selected
neighbors
(j depends on the amount of oversampling desired)
For instance, if it sees two examples (of the same class)
near each other, it creates a third artificial one, in the
middle of the original two.
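
A from-scratch sketch of steps I–III in NumPy (the function name smote_sample is hypothetical, and it draws one neighbor per synthetic point instead of fixing j up front):

import numpy as np

def smote_sample(X_min, k=5, n_new=10, seed=None):
    """Interpolate n_new synthetic samples between minority points
    and their k nearest minority neighbors (assumes k < len(X_min))."""
    rng = np.random.default_rng(seed)
    # Step I: pairwise distances within the minority class only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]        # k nearest neighbors per point
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))         # pick a minority sample
        j = rng.choice(nn[i])                # Step II: pick a neighbor
        lam = rng.random()                   # random position on the segment
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))  # Step III
    return np.array(synthetic)
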
SMOTE
Synthetic Minority Oversampling Technique (SMOTE)
• Find its k-nearest minority neighbors
• Randomly select j of these neighbors
• Randomly generate synthetic samples along the lines joining the minority sample and its j selected neighbors
Example
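
A hypothetical numeric instance of the interpolation step: take a minority sample x_i = (1, 2) and a selected neighbor x_j = (3, 4), and draw a random λ in [0, 1], say λ = 0.5. The synthetic sample is

x_new = x_i + λ (x_j − x_i) = (1, 2) + 0.5 · (2, 2) = (2, 3),

the midpoint of the segment joining the two original minority points.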
SMOTE’s Shortcomings
• Overgeneralization
a) SMOTE’s procedure may blindly generalize the minority
area without regard to the majority class.
b) It may oversample noisy samples
c) It may oversample uninformative samples
• Lack of Flexibility
a) The number of synthetic samples generated by SMOTE is
fixed in advance, thus not allowing for any flexibility in the
re-balancing rate.
b) It would be nice to increase the minority class just to
the right size (i.e., not excessively) to avoid the side
effects of imbalanced datasets.