
Data Imbalance Problem

The document discusses the data imbalance problem in machine learning, particularly in contexts like fraud detection and healthcare, and questions the adequacy of accuracy as a quality measure for models. It introduces the ROC curve and AUC score as effective tools for evaluating model performance, emphasizing the importance of distinguishing between classes. Additionally, it outlines methods to address data imbalance, including class weighting, oversampling techniques like SMOTE, and various undersampling strategies.

Uploaded by dhruv tiwari

Data Imbalance Problem

Is accuracy the correct way to measure the quality of a model?

Why does this happen?
• Fraud Detection
• Anomaly Detection
• Healthcare
How to measure the quality of a model?
• Confusion Matrix
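
To illustrate why accuracy alone can mislead on imbalanced data, here is a minimal sketch that builds the four confusion-matrix counts by hand. The data (990 legitimate transactions, 10 frauds) and the always-predict-0 "model" are hypothetical, and the helper name confusion_counts is an assumption made for this example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical data: 990 legitimate transactions (0) and 10 frauds (1).
y_true = [0] * 990 + [1] * 10
# A "model" that always predicts the majority class.
y_pred = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)  # share of actual frauds that were caught

print(accuracy)  # 0.99
print(recall)    # 0.0
```

The accuracy looks excellent even though not a single fraud is detected, which is why the confusion matrix is worth inspecting directly.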
ROC Curve and ROC AUC Score
• Receiver Operating Characteristic (ROC) curves are very helpful for understanding the
balance between the true positive rate and the false positive rate. A ROC curve is calculated from three lists:

• thresholds = all unique prediction probabilities, in descending order
• FPR = the false positive rate, FP / (FP + TN), at each threshold
• TPR = the true positive rate, TP / (TP + FN), at each threshold

• The AUC (area under the ROC curve) tells how capable the model is of distinguishing between classes.

• The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
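
The three lists described above can be computed by hand, as in the following sketch. It assumes binary labels with both classes present, treats a score at or above the threshold as a prediction of 1, and integrates with the trapezoidal rule starting from the point (0, 0). The function names and the four-sample data set are illustrative assumptions.

```python
def roc_curve_points(y_true, scores):
    """Return (thresholds, fpr, tpr), one entry per unique score, descending."""
    thresholds = sorted(set(scores), reverse=True)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    fpr, tpr = [], []
    for thr in thresholds:
        # Predict 1 whenever the score reaches the threshold.
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        tpr.append(tp / positives)  # TP / (TP + FN)
        fpr.append(fp / negatives)  # FP / (FP + TN)
    return thresholds, fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule, starting at (0, 0)."""
    xs = [0.0] + fpr
    ys = [0.0] + tpr
    return sum((xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2
               for i in range(1, len(xs)))

# Illustrative labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

thresholds, fpr, tpr = roc_curve_points(y_true, scores)
print(thresholds)     # [0.8, 0.4, 0.35, 0.1]
print(fpr)            # [0.0, 0.5, 0.5, 1.0]
print(tpr)            # [0.5, 0.5, 1.0, 1.0]
print(auc(fpr, tpr))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so 0.75 here means the model ranks a positive above a negative three times out of four.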


Methods to Overcome the Data Imbalance Problem
• Class weight
• Oversampling
  • Random oversampling
  • Synthetic Minority Over-sampling Technique (SMOTE)
  • ADASYN
• Undersampling
  • Random undersampling
  • NearMiss
  • Tomek links
Class weight
• Assign a weight to each class so that training places more emphasis on the
minority classes

w_j = n_samples / (n_classes * n_samples_j)

Here,
• w_j is the weight for class j
• n_samples is the total number of samples (rows) in the dataset
• n_classes is the total number of unique classes in the target
• n_samples_j is the number of rows belonging to class j
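
As a worked example of the formula, the sketch below computes w_j for a hypothetical target with 90 samples of class 0 and 10 of class 1. The function name balanced_class_weights is an assumption for illustration. The same formula is what scikit-learn applies when class_weight="balanced" is set.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * n_samples_j)}."""
    counts = Counter(labels)   # n_samples_j for each class j
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n_j) for cls, n_j in counts.items()}

# Hypothetical imbalanced target: 90 majority rows, 10 minority rows.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

print(weights[1])  # 5.0
print(weights[0])  # about 0.56
```

The minority class ends up with a weight nine times larger than the majority class, which matches the 90:10 imbalance ratio.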
Oversampling
• Oversample the minority classes to increase the number of minority
observations until the dataset is balanced

• Random oversampling
  • Randomly sample the minority classes and simply duplicate the sampled
observations
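
Random oversampling can be sketched in a few lines. This version assumes the goal is to bring every class up to the size of the largest class, and it samples with replacement. The function name, the seed parameter, and the toy data are assumptions for illustration.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # size of the largest class
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        idx = [i for i, label in enumerate(y) if label == cls]
        for _ in range(target - n):
            i = rng.choice(idx)    # sample with replacement
            X_out.append(X[i])     # exact duplicate of an existing row
            y_out.append(cls)
    return X_out, y_out

# Toy data: four majority rows and two minority rows.
X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 0, 1, 1]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Because the new rows are exact copies, no new information is added, which is the motivation for interpolation-based methods such as SMOTE.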
Synthetic Minority Over-sampling Technique (SMOTE)

• It generates new observations by interpolating between observations in the original
dataset
• For a given observation x_i, a new (synthetic) observation is generated by interpolating
between x_i and one of its k-nearest neighbors, x_zi.
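
The interpolation step can be written as x_new = x_i + λ (x_zi − x_i), with λ drawn uniformly from [0, 1). The sketch below applies it to minority samples only. It is a simplified illustration rather than a full SMOTE implementation: the function name, the parameters k, n_new, and seed, and the toy points are all assumptions.

```python
import math
import random

def smote_sample(minority, k=3, n_new=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x_i = minority[i]
        # k nearest minority neighbours of x_i, excluding x_i itself
        others = [x for j, x in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda x: math.dist(x_i, x))[:k]
        x_zi = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(x_i, x_zi)])
    return synthetic

# Toy minority class in two dimensions.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]]
synthetic = smote_sample(minority, k=2, n_new=5)
print(len(synthetic))  # 5
```

Every synthetic point lies on a line segment between two real minority samples, so the new observations stay inside the region the minority class already occupies.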
Undersampling
• Throw away majority-class data to make it easier to learn the characteristics of
the minority classes

• Random undersampling
  • Simply sample the majority class at random until it has a similar number of
observations to the minority classes
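
A minimal sketch of random undersampling follows. It assumes every class is reduced to exactly the size of the smallest class and samples without replacement. The function name, the seed parameter, and the toy data are assumptions for illustration.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())  # size of the smallest class
    keep = []
    for cls in counts:
        idx = [i for i, label in enumerate(y) if label == cls]
        keep.extend(rng.sample(idx, target))  # sample without replacement
    keep.sort()                                # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: eight majority rows and two minority rows.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_small, y_small = random_undersample(X, y)
print(Counter(y_small))
```

The minority rows are all kept, while six of the eight majority rows are discarded at random.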
NearMiss-1
NearMiss-1 selects the samples from the
majority class for which the average
distance to the N closest samples of
the minority class is smallest.
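
The NearMiss-1 rule can be sketched as a ranking of majority samples by their average Euclidean distance to their N closest minority samples. The function name, the parameters n_keep and n_neighbors, and the toy points are assumptions for illustration.

```python
import math

def near_miss_1(majority, minority, n_keep, n_neighbors=3):
    """Keep the n_keep majority samples closest, on average, to their
    n_neighbors nearest minority samples."""
    def avg_dist_to_closest(x):
        dists = sorted(math.dist(x, m) for m in minority)
        closest = dists[:n_neighbors]
        return sum(closest) / len(closest)
    return sorted(majority, key=avg_dist_to_closest)[:n_keep]

# Toy data in two dimensions.
minority = [[0, 0], [1, 0], [0, 1]]
majority = [[0.5, 0.5], [5, 5], [10, 10], [1, 1]]

kept = near_miss_1(majority, minority, n_keep=2)
print(kept)  # [[0.5, 0.5], [1, 1]]
```

The two majority points that sit right next to the minority cluster are kept, and the distant ones are discarded.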
NearMiss-2

NearMiss-2 selects the samples from the majority class for
which the average distance to the N
farthest samples of the minority class is
smallest.
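
NearMiss-2 changes only which minority samples enter the average: the N farthest instead of the N closest. The sketch below assumes Euclidean distance. The function name, the parameters, and the toy points are hypothetical.

```python
import math

def near_miss_2(majority, minority, n_keep, n_neighbors=3):
    """Keep the n_keep majority samples with the smallest average distance
    to their n_neighbors farthest minority samples."""
    def avg_dist_to_farthest(x):
        dists = sorted((math.dist(x, m) for m in minority), reverse=True)
        farthest = dists[:n_neighbors]
        return sum(farthest) / len(farthest)
    return sorted(majority, key=avg_dist_to_farthest)[:n_keep]

# Toy data: two minority points far apart on a line.
minority = [[0, 0], [10, 0]]
majority = [[0, 1], [5, 0], [20, 0]]

kept = near_miss_2(majority, minority, n_keep=1, n_neighbors=1)
print(kept)  # [[5, 0]]
```

With these points, [0, 1] is the majority sample nearest to any single minority point, but [5, 0] is kept because its farthest minority point is only 5 units away. NearMiss-2 therefore favours majority samples that are close to the minority class as a whole.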
