SMOTE
Dr. Trilok Nath Pandey
Imbalanced Datasets
You trained a model to predict cancer from image data using a
state-of-the-art hierarchical Siamese CNN with dynamic kernel
activations…
Your model has an accuracy of 99.9%.
An imbalanced classification problem is an example of a classification problem
where the distribution of examples across the known classes is biased or
skewed.
By looking at the confusion matrix you realize that the model
does not detect any of the positive examples.
After plotting your class distribution you see that you have
thousands of negative examples but just a couple of positives.
[Figure: class distribution plot, negatives vs. positives]
Classifiers try to reduce the overall error so they can be biased
towards the majority class.
# Negatives = 998
# Positives = 2
By always predicting the negative class, the accuracy will be 998/1000 = 99.8%.
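
A minimal sketch of this trap, assuming scikit-learn and the hypothetical 998/2 split above:

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 998 negatives, 2 positives.
y_true = np.array([0] * 998 + [1] * 2)
# A "model" that always predicts the negative class.
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))   # 0.998 -> looks excellent
print(confusion_matrix(y_true, y_pred))
# [[998   0]
#  [  2   0]]  -> the predicted-positive column is all zeros
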
Your dataset is imbalanced!!!
Now What???
The Class Imbalance Problem
• Data sets are said to be balanced if there are approximately as
many positive examples of the concept as there are negative
ones.
• There exist many domains that have unbalanced data sets.
• Examples:
a) Helicopter Gearbox Fault Monitoring
b) Discrimination between Earthquakes and Nuclear Explosions
c) Document Filtering
d) Detection of Oil Spills
e) Detection of cancerous cells
f) Detection of Fraudulent Telephone Calls
g) Detection of hotspots in ASIC/FPGA Placements
h) Detection of unrouteable designs in VLSI
i) Fraud and default prediction
j) Mail Spam Detection
The Class Imbalance Problem
• The problem with class imbalances is that standard
learners are often biased towards the majority class.
• That is because these classifiers attempt to reduce
global quantities such as the error rate, not taking
the data distribution into consideration.
• As a result, examples from the overwhelming class
are well-classified whereas examples from the
minority class tend to be misclassified.
The Class Imbalance Problem
For classification problems, we often use accuracy as
the evaluation metric.
It is easy to calculate and intuitive:
• Accuracy = # of correct predictions / # of total predictions
But it is misleading for highly imbalanced datasets!
For example, in credit card fraud detection, we can set
a model to always classify new transactions as legit.
The accuracy would be as high as 99.0% if 99.0% of the
transactions in the dataset are legit.
But, don’t forget that our goal is to detect fraud, so such
a model is useless.
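
A minimal sketch of that point, assuming scikit-learn and a hypothetical 99:1 legit/fraud split:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical transactions: 990 legit (0), 10 fraudulent (1).
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros_like(y_true)          # always classify as legit

print(accuracy_score(y_true, y_pred))   # 0.99 -> misleadingly high
print(recall_score(y_true, y_pred))     # 0.0  -> not a single fraud detected
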
Solutions
Solutions to Imbalanced Learning
• Data level: sampling methods
• Algorithmic level: cost-sensitive methods
• Kernel and active learning methods
Several Common Approaches
At the data Level: Re-Sampling
Oversampling (Random or Directed)
o Add more examples to minority class
Undersampling (Random or Directed)
o Remove samples from majority class
At the Algorithmic Level:
Adjusting the Costs or weights of classes
Adjusting the decision threshold / probabilistic
estimate at the tree leaf
Most machine learning models provide a class-weight parameter for exactly this, as sketched below.
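
A minimal sketch of both algorithmic-level knobs, assuming scikit-learn (the data and the 0.2 threshold are purely illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = np.array([0] * 980 + [1] * 20)
X[y == 1] += 2.0                        # shift the minority class

# Cost adjustment: 'balanced' weighs classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Threshold adjustment: flag positives above 0.2 instead of the default 0.5.
proba = clf.predict_proba(X)[:, 1]
y_pred = (proba >= 0.2).astype(int)
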
Sampling Methods
Create balance through sampling
If the data is imbalanced… modify the data distribution to create a balanced dataset.
• A widely adopted technique for dealing with highly imbalanced
datasets is called resampling.
1. Removing samples from the majority class (under-sampling).
2. Adding more examples to the minority class (over-sampling).
3. Or perform both simultaneously:
• Under-Sample the majority &
• Over-Sample the minority
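
A minimal sketch of the first two options, assuming the imbalanced-learn library (imported as imblearn) and a toy 8:2 dataset:

import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X = np.arange(20).reshape(10, 2)        # 10 toy samples, 2 features
y = np.array([0] * 8 + [1] * 2)         # 8 majority, 2 minority

X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(np.bincount(y_over))    # [8 8] -> minority examples duplicated
print(np.bincount(y_under))   # [2 2] -> majority examples discarded

Chaining the two samplers (over-sample the minority partway, then under-sample the majority) covers the combined option.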
Sampling Methods
Create balance through sampling
Random oversampling may just replicate records already in the dataset, which can cause overfitting!
Random undersampling discards data, which can cause loss of information!
What are the advantages and disadvantages of under-sampling and oversampling?
SMOTE
SMOTE: Resampling Approach
SMOTE stands for:
Synthetic Minority Oversampling Technique
It is a technique designed by Chawla et al. in 2002.
SMOTE is an oversampling method that synthesizes new
plausible examples in the minority class.
SMOTE not only increases the size of the training set,
it also increases the variety!!
SMOTE currently yields the best results as far as re-sampling and modifying the probabilistic estimate techniques go (Chawla, 2003).
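
A minimal usage sketch of imbalanced-learn's SMOTE on synthetic toy data (note that k_neighbors must be smaller than the number of minority samples):

import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 94 + [1] * 6)        # only 6 minority samples

# k_neighbors=5 is valid here because the minority class has 6 samples.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(np.bincount(y_res))               # [94 94] -> classes now balanced
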
SMOTE’s Informed Oversampling Procedure
For each Minority Sample
I. Find its k-nearest minority neighbors
II. Randomly select j of these neighbors
III. Randomly generate synthetic samples along the
lines joining the minority sample and its j selected
neighbors
(j depends on the amount of oversampling desired)
For instance, if it sees two examples (of the same class)
near each other, it creates a third artificial one, in the
middle of the original two.
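
A from-scratch sketch of steps I–III in NumPy (the function name smote_sample is hypothetical, and it draws one neighbor per synthetic point instead of fixing j up front):

import numpy as np

def smote_sample(X_min, k=5, n_new=10, seed=None):
    """Interpolate n_new synthetic samples between minority points
    and their k nearest minority neighbors (assumes k < len(X_min))."""
    rng = np.random.default_rng(seed)
    # Step I: pairwise distances within the minority class only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]        # k nearest neighbors per point
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))         # pick a minority sample
        j = rng.choice(nn[i])                # Step II: pick a neighbor
        lam = rng.random()                   # random position on the segment
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))  # Step III
    return np.array(synthetic)
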
SMOTE
Synthetic Minority Oversampling Technique (SMOTE)
• Find its k-nearest minority neighbors
• Randomly select j of these neighbors
• Randomly generate synthetic samples along the lines joining the minority sample and its j selected neighbors
Example
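
A hypothetical numeric instance of the interpolation step: take a minority sample x_i = (1, 2) and a selected neighbor x_j = (3, 4), and draw a random λ in [0, 1], say λ = 0.5. The synthetic sample is

x_new = x_i + λ (x_j − x_i) = (1, 2) + 0.5 · (2, 2) = (2, 3),

the midpoint of the segment joining the two original minority points.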
SMOTE’s Shortcomings
• Overgeneralization
a) SMOTE’s procedure may blindly generalize the minority
area without regard to the majority class.
b) It may oversample noisy samples
c) It may oversample uninformative samples
• Lack of Flexibility
a) The number of synthetic samples generated by SMOTE is
fixed in advance, thus not allowing for any flexibility in the
re-balancing rate.
b) It would be nice to increase the minority class just to
the right size (i.e., not excessively) to avoid the side
effects of imbalanced datasets.