Data Imbalance problem
Is accuracy the correct way to measure the quality of a model?
Why does this happen?
• Fraud Detection
• Anomaly Detection
• Healthcare
Confusion Matrix
How to measure the quality of a model?
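A minimal sketch of why accuracy alone can mislead on imbalanced data. The data set (990 negatives, 10 positives) and the always-predict-0 "model" are made up for illustration, and `confusion_counts` is a hypothetical helper, not a library function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


# Illustrative data: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = [0] * 990 + [1] * 10
# A "model" that always predicts the majority class.
y_pred = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(tp, fp, fn, tn)    # 0 0 10 990
print(accuracy, recall)  # 0.99 0.0
```

Accuracy is 99% even though not a single positive case is found; the confusion matrix makes that visible.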
ROC Curve and ROC AUC Score
• Receiver Operating Characteristic (ROC) curves are very helpful for understanding the
trade-off between the true positive rate and the false positive rate. A curve is computed from 3 lists:
• thresholds = all unique prediction probabilities in descending order
• FPR = the false positive rate (FP / (FP + TN)) for each threshold
• TPR = the true positive rate (TP / (TP + FN)) for each threshold
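• The ROC AUC score is the area under the ROC curve. It tells how well the model can distinguish between the classes.
• The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. An AUC of 1.0 means perfect separation; 0.5 is no better than random guessing.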
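The three lists above can be sketched in plain Python as follows. This is an illustrative version, assuming binary labels in {0, 1} with both classes present; `roc_points` and `auc_trapezoid` are hypothetical helper names, and the four-sample data set is made up.

```python
def roc_points(y_true, y_score):
    """Return (thresholds, fpr, tpr), one entry per unique score."""
    thresholds = sorted(set(y_score), reverse=True)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [], []
    for thr in thresholds:
        # A sample is predicted positive when its score is >= the threshold.
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 0)
        tpr.append(tp / n_pos)   # TP / (TP + FN)
        fpr.append(fp / n_neg)   # FP / (FP + TN)
    return thresholds, fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the curve by the trapezoidal rule, starting at (0, 0)."""
    xs = [0.0] + list(fpr)
    ys = [0.0] + list(tpr)
    return sum((xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2
               for i in range(1, len(xs)))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

thresholds, fpr, tpr = roc_points(y_true, y_score)
auc = auc_trapezoid(fpr, tpr)

print(thresholds)  # [0.8, 0.4, 0.35, 0.1]
print(fpr)         # [0.0, 0.5, 0.5, 1.0]
print(tpr)         # [0.5, 0.5, 1.0, 1.0]
print(auc)         # 0.75
```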
Methods to Overcome the Data Imbalance Problem
• Class weight
• Oversampling
  • Random oversampling
  • Synthetic Minority Over-sampling Technique (SMOTE)
  • ADASYN
• Undersampling
  • Random undersampling
  • NearMiss
  • Tomek links
Class weight
• Provide a weight for each class that places more emphasis on the
minority classes
w_j = n_samples / (n_classes * n_samples_j)
Here,
• w_j is the weight for class j
• n_samples is the total number of samples (rows) in the dataset
• n_classes is the total number of unique classes in the target
• n_samples_j is the number of rows that belong to class j
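A short sketch of the weight formula above, using only the standard library. `balanced_class_weights` is a hypothetical helper name and the 90/10 label list is illustrative. The same heuristic is what scikit-learn applies for `class_weight="balanced"`.

```python
from collections import Counter


def balanced_class_weights(y):
    """w_j = n_samples / (n_classes * n_samples_j) for every class j."""
    counts = Counter(y)          # n_samples_j per class
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n_j)
            for label, n_j in counts.items()}


# Illustrative labels: 90 majority (0) and 10 minority (1) samples.
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)

print(weights[0])  # about 0.556
print(weights[1])  # 5.0
```

The minority class gets a weight nine times larger than the majority class, mirroring the 9:1 imbalance.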
Oversampling
• Oversample the minority classes to increase the number of minority
observations until the dataset is balanced
• Random Oversampling
  • Randomly sample the minority classes and simply duplicate the sampled
  observations
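Random oversampling can be sketched as below, assuming "balanced" means every class is brought up to the size of the largest class. `random_oversample` is a hypothetical helper and the data is made up.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, t in zip(X, y) if t == label]
        # Sample with replacement from this class to fill the gap.
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Illustrative data: 8 majority rows and 2 minority rows.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # both classes now have 8 rows
```

Because rows are only duplicated, no new information is added; the model simply sees the minority rows more often.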
Synthetic Minority Over-sampling Technique (SMOTE)
• It generates new observations by interpolating between observations in the original dataset
• For a given minority observation x_i, a new (synthetic) observation is generated by
interpolating between x_i and one of its k nearest neighbors, x_zi.
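The interpolation step is commonly written as x_new = x_i + λ (x_zi − x_i), with λ drawn uniformly from [0, 1]. Below is a minimal sketch of that step only, not a full SMOTE implementation; `smote_sample`, its parameters, and the four minority points are assumptions for illustration.

```python
import math
import random


def smote_sample(minority, k=3, n_new=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x_i = minority[i]
        # k nearest minority neighbors of x_i (Euclidean), excluding x_i itself.
        others = [x for j, x in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda x: math.dist(x_i, x))[:k]
        x_zi = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(x_i, x_zi)])
    return synthetic


# Illustrative minority-class points in 2D.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
new_points = smote_sample(minority, k=3, n_new=5)
print(len(new_points))  # 5
```

Each synthetic point lies on the line segment between a minority observation and one of its neighbors, so it stays inside the region the minority class already occupies.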
Undersampling
• Throw away majority-class data to make it easier to learn the characteristics of
the minority classes
• Random undersampling
  • Simply sample the majority class at random until it has a similar number of
  observations as the minority classes
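Random undersampling can be sketched as below, assuming every class is cut down to the size of the smallest class. `random_undersample` is a hypothetical helper and the data is made up.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, t in enumerate(y) if t == label]
        # Sample without replacement so no row is kept twice.
        keep.extend(rng.sample(idx, target))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


# Illustrative data: 8 majority rows and 2 minority rows.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_res, y_res = random_undersample(X, y)
print(Counter(y_res))  # both classes now have 2 rows
```

The trade-off is visible in the example: six of the eight majority rows are discarded.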
NearMiss-1
NearMiss-1 selects the samples from the
majority class for which the average
distance to the N closest samples of
the minority class is smallest.
NearMiss-2
NearMiss-2 selects the samples from the majority class for
which the average distance to the N
farthest samples of the minority class is
smallest.
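Both NearMiss variants can be sketched with one function that differs only in which N minority distances are averaged. This is an illustrative version using Euclidean distance; `near_miss`, its parameters, and the points are assumptions, not a library API.

```python
import math


def near_miss(majority, minority, n_keep, n_neighbors=3, version=1):
    """Keep the n_keep majority samples with the smallest average distance
    to their N closest (version 1) or N farthest (version 2) minority samples."""
    def score(x):
        dists = sorted(math.dist(x, m) for m in minority)
        chosen = dists[:n_neighbors] if version == 1 else dists[-n_neighbors:]
        return sum(chosen) / len(chosen)

    return sorted(majority, key=score)[:n_keep]


# Illustrative points: three minority samples near the origin and one far away.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 0.0]]
majority = [[0.3, 0.3], [5.0, 0.0], [20.0, 0.0]]

kept_v1 = near_miss(majority, minority, n_keep=1, n_neighbors=2, version=1)
kept_v2 = near_miss(majority, minority, n_keep=1, n_neighbors=2, version=2)

print(kept_v1)  # [[0.3, 0.3]]
print(kept_v2)  # [[5.0, 0.0]]
```

NearMiss-1 keeps the majority point sitting inside the minority cluster, while NearMiss-2 keeps the point that is reasonably close to all minority samples, including the distant one.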