The document discusses the challenges of imbalanced datasets in classification problems, particularly in predicting cancer from image data. It introduces SMOTE (Synthetic Minority Oversampling Technique) as a resampling method to address class imbalance by generating synthetic examples for the minority class. While SMOTE improves dataset balance and model performance, it also has shortcomings such as overgeneralization and lack of flexibility in the number of synthetic samples generated.


SMOTE

Dr. Trilok Nath Pandey


Imbalanced Datasets



You trained a model to predict cancer from image data using a
state-of-the-art hierarchical Siamese CNN with dynamic kernel
activations…

Your model has an accuracy of 99.9%

An imbalanced classification problem is a classification problem where the
distribution of examples across the known classes is biased or skewed.
By looking at the confusion matrix you realize that the model does not
detect any of the positive examples.

After plotting your class distribution you see that you have thousands of
negative examples but just a couple of positives.

[Figure: class distribution bar chart, negatives vs. positives]
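Both checks are easy to script. Below is a minimal sketch using scikit-learn's
confusion_matrix on hypothetical label arrays with 998 negatives and 2
positives (the counts used on the next slide); the array names are
illustrative and not from the slides.

```python
# Sketch: checking the class distribution and the confusion matrix.
# y_test / y_pred are hypothetical label arrays, not from the slides.
import numpy as np
from sklearn.metrics import confusion_matrix

y_test = np.array([0] * 998 + [1] * 2)   # 998 negatives, 2 positives
y_pred = np.zeros_like(y_test)           # a model that always predicts "negative"

print(np.bincount(y_test))               # class distribution: [998   2]
print(confusion_matrix(y_test, y_pred))  # [[998   0]
                                         #  [  2   0]] -> no positives detected
```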
Classifiers try to reduce the overall error, so they tend to be biased
towards the majority class.

# Negatives = 998
# Positives = 2

By always predicting the negative class, the accuracy will be 99.8%.

Your dataset is imbalanced!!!

Now What???

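The 99.8% figure is plain arithmetic; a tiny sketch with the slide's counts
makes it explicit.

```python
# An "always negative" classifier gets every negative right and every positive wrong.
negatives, positives = 998, 2
accuracy = negatives / (negatives + positives)
print(f"accuracy = {accuracy:.1%}")  # 99.8%, yet 0 of the 2 positives are detected
```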
The Class Imbalance Problem

• Data sets are said to be balanced if there are, approximately, as many
  positive examples of the concept as there are negative ones.
• There exist many domains that have unbalanced data sets.
• Examples:
a) Helicopter Gearbox Fault Monitoring
b) Discrimination between Earthquakes and Nuclear Explosions
c) Document Filtering
d) Detection of Oil Spills
e) Detection of cancerous cells
f) Detection of Fraudulent Telephone Calls
g) Detection of hotspots in ASIC/FPGA Placements
h) Detection of unrouteable designs in VLSI
i) Fraud and default prediction
j) Mail Spam Detection
The Class Imbalance Problem

• The problem with class imbalances is that standard learners are often
  biased towards the majority class.
• That is because these classifiers attempt to reduce global quantities such
  as the error rate, not taking the data distribution into consideration.
• As a result, examples from the majority class are well-classified, whereas
  examples from the minority class tend to be misclassified.
The Class Imbalance Problem
• For classification problems, we often use accuracy as the evaluation metric.
• It is easy to calculate and intuitive:
  Accuracy = # of correct predictions / # of total predictions
• But it is misleading for highly imbalanced datasets!
• For example, in credit card fraud detection, we can set a model to always
  classify new transactions as legit.
• The accuracy could be as high as 99.0% if 99.0% of the transactions in the
  dataset are legit.
• But don't forget that our goal is to detect fraud, so such a model is useless.
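To make the fraud example concrete, here is a hedged sketch (synthetic labels,
metrics chosen by me) showing that an "always legit" model scores 99% accuracy
while catching zero fraud:

```python
# Accuracy vs. recall for an "always legit" fraud model on synthetic labels.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = np.array([0] * 990 + [1] * 10)  # 1% of transactions are fraud (label 1)
y_pred = np.zeros_like(y_true)           # model always predicts "legit"

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.99
print("recall   :", recall_score(y_true, y_pred))                      # 0.0 -> no fraud caught
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
```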
Solutions



Solutions to Imbalanced Learning

• Data level: Sampling methods
• Algorithmic level: Cost-sensitive methods
• Kernel and Active Learning methods
Several Common Approaches
• At the data level: Re-sampling
  - Oversampling (random or directed): add more examples to the minority class
  - Undersampling (random or directed): remove samples from the majority class
• At the algorithmic level:
  - Adjusting the costs or weights of classes
  - Adjusting the decision threshold / probabilistic estimate at the tree leaf

Most machine learning models provide a parameter for class weights, as
sketched below.
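As an illustration of the class-weight idea, here is a minimal sketch with
scikit-learn's LogisticRegression; the "balanced" setting and the explicit
cost dictionary are common options, not something the slides prescribe.

```python
# Sketch: cost-sensitive learning via the class_weight parameter in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy imbalanced data (~1% positives) just to have something to fit.
X, y = make_classification(n_samples=1000, weights=[0.99], random_state=0)

# 'balanced' weights classes inversely to their frequency, so errors on the
# rare positive class cost more during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Alternatively, set explicit per-class costs (the values here are illustrative).
clf_costs = LogisticRegression(class_weight={0: 1, 1: 50}, max_iter=1000).fit(X, y)
```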
Sampling Methods
Create balance through sampling

If data is imbalanced… → modify the data distribution → create a balanced dataset

• A widely adopted technique for dealing with highly unbalanced datasets is
  called resampling (sketched below):
  1. Removing samples from the majority class (under-sampling).
  2. Adding more examples to the minority class (over-sampling).
  3. Or perform both simultaneously: under-sample the majority and
     over-sample the minority.
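As a sketch of options 1 and 2 above, the imbalanced-learn package (my choice
of library; the slide does not name one) provides simple random under- and
over-samplers:

```python
# Sketch: random over-sampling and under-sampling with imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# A toy imbalanced dataset: roughly 99% negatives, 1% positives.
X, y = make_classification(n_samples=1000, weights=[0.99], random_state=0)
print(Counter(y))

# Over-sample the minority class up to the majority size...
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y_over))

# ...or under-sample the majority class down to the minority size.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_under))
```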
Sampling Methods
Create balance through sampling

Random oversampling may just replicate records within the dataset, which can
cause overfitting! Random under-sampling discards records, which can cause
loss of information.

Advantages and disadvantages of under-sampling and oversampling?

SMOTE



SMOTE: Resampling Approach
SMOTE stands for Synthetic Minority Oversampling Technique.
It was introduced by Chawla et al. (Chawla, Bowyer, Hall, and Kegelmeyer) in 2002.
SMOTE is an oversampling method that synthesizes new, plausible examples in
the minority class.

SMOTE not only increases the size of the training set, it also increases the
variety!!

SMOTE currently yields the best results as far as resampling and modifying
the probabilistic estimate techniques go (Chawla, 2003).
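A hedged usage sketch of SMOTE via the imbalanced-learn package (my choice
for illustration; the slides do not prescribe an implementation):

```python
# Sketch: oversampling the minority class with SMOTE (imbalanced-learn).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy data: 95% negatives, 5% positives (flip_y=0 keeps the counts exact).
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], flip_y=0,
                           random_state=0)
print(Counter(y))                        # skewed toward class 0

# k_neighbors is the k used to pick minority neighbours for interpolation.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))                    # classes are now balanced
```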
SMOTE's Informed Oversampling Procedure

For each minority sample:
I.   Find its k-nearest minority neighbors.
II.  Randomly select j of these neighbors.
III. Randomly generate synthetic samples along the lines joining the minority
     sample and its j selected neighbors.
(j depends on the amount of oversampling desired.)

For instance, if it sees two examples (of the same class) near each other, it
creates a third artificial one somewhere along the line joining the original
two.
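Steps I-III can be written in a few lines of NumPy. This is a from-scratch
sketch of the interpolation idea, not the reference implementation; the
function and parameter names are mine.

```python
# Sketch: generating synthetic minority samples along lines to nearest neighbours.
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the line segment joining a
    random minority sample to one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                  # pick a minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all minority points
        neighbours = np.argsort(d)[1:k + 1]           # its k nearest minority neighbours
        j = rng.choice(neighbours)                    # randomly select one of them
        lam = rng.random()                            # random position along the joining line
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.random.default_rng(1).normal(size=(20, 2))  # 20 minority samples in 2-D
print(smote_like_oversample(X_min, n_new=40).shape)    # (40, 2)
```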
SMOTE
Synthetic Minority Oversampling Technique (SMOTE)

• Find its k-nearest minority neighbors.
• Randomly select j of these neighbors.
• Randomly generate synthetic samples along the lines joining the minority
  sample and its j selected neighbors.
Example



SMOTE’s Shortcomings
• Overgeneralization
  a) SMOTE's procedure may blindly generalize the minority area without
     regard to the majority class.
  b) It may oversample noisy samples.
  c) It may oversample uninformative samples.

• Lack of Flexibility
  a) The number of synthetic samples generated by SMOTE is fixed in advance,
     thus not allowing for any flexibility in the re-balancing rate.
  b) It would be nice to increase the minority class just to the right value
     (i.e., not excessively) to avoid the side effects of unbalanced datasets.
