How to handle an imbalanced dataset?
A dataset is called imbalanced when the number of observations in one class is far greater than the number in the other classes. Class imbalance is a common problem in classification tasks.
Most machine learning algorithms work best when the number of samples in each class is about equal, because most algorithms are designed to maximize accuracy and minimize error. Accuracy, however, is not a good metric for evaluating models on imbalanced datasets, as it can be very misleading. In this case, metrics like the confusion matrix, precision, recall, and F1 score give a truer picture of the model's performance.
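A small toy example (with made-up labels, not the Kaggle data) shows why accuracy misleads: a "model" that always predicts the majority class scores 99% accuracy while catching zero positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: 990 negatives, 10 positives (1% positive class)
y_true = np.array([0] * 990 + [1] * 10)
# A degenerate "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks great
print(recall_score(y_true, y_pred))    # 0.0  -- misses every positive
```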
To better understand this topic, let us consider the credit card fraud detection dataset from Kaggle.
Overview:
1. Dataset description
2. Baseline model
3. Undersampling technique
4. Oversampling technique
5. Class_weight hyperparameter
6. SMOTE technique
7. SMOTETomek technique
1. Dataset description
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions. It has 284,807 rows and 31 columns.
2. Baseline model
We will now try to train a random-forest classifier model on this data.
# import the model
from sklearn.ensemble import RandomForestClassifier
# import the accuracy metric
from sklearn.metrics import accuracy_score

# model training
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# make predictions
y_pred = rf.predict(X_valid)

# check model accuracy against the validation data
print('Model accuracy score: ', accuracy_score(y_valid, y_pred))
We get a very high accuracy of 99% because the model is mostly predicting the majority class (i.e., 0, not fraud).
We will now use other metrics to get a better understanding of our model’s performance.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_valid,y_pred))
from sklearn.metrics import classification_report
print(classification_report(y_valid,y_pred))
3. Undersampling technique
In this resampling technique, observations are removed from the majority class until its size matches that of the minority class. A major disadvantage of this technique is that valuable information may be lost when discarding data from the majority class.
We will use the NearMiss class from imblearn to undersample our majority class (i.e. not fraud).
# import Counter to count observations per class
from collections import Counter
# import the undersampling technique
from imblearn.under_sampling import NearMiss

ns = NearMiss()
# obtain the undersampled data
X_train_ns, y_train_ns = ns.fit_resample(X_train, y_train)
print("The number of classes before fit {}".format(Counter(y_train)))
print("The number of classes after fit {}".format(Counter(y_train_ns)))
Now we will try to train a random forest model on the undersampled data and make predictions.
# model training
rf_ns = RandomForestClassifier()
rf_ns.fit(X_train_ns, y_train_ns)

# make predictions
y_pred_ns = rf_ns.predict(X_valid)
print('Model accuracy score: ', accuracy_score(y_valid, y_pred_ns))
print('Confusion matrix: ')
print(confusion_matrix(y_valid, y_pred_ns))
print(classification_report(y_valid, y_pred_ns))
The model performs poorly because undersampling left too little data to train on.
4. Oversampling technique
In this resampling technique, observations from the minority class are duplicated until the class counts are balanced.
We will use the RandomOverSampler class from imblearn to oversample our minority class (i.e. fraud).
# import Counter to count observations per class
from collections import Counter
# import the oversampling technique
from imblearn.over_sampling import RandomOverSampler

# note: named ros to avoid shadowing the built-in os module
ros = RandomOverSampler()
# obtain the oversampled data
X_train_os, y_train_os = ros.fit_resample(X_train, y_train)
print("The number of classes before fit {}".format(Counter(y_train)))
print("The number of classes after fit {}".format(Counter(y_train_os)))
Now we will try to train a random forest model on the oversampled data and make predictions.
# model training
rf_os = RandomForestClassifier()
rf_os.fit(X_train_os, y_train_os)

# make predictions
y_pred_os = rf_os.predict(X_valid)
print('Model accuracy score: ', accuracy_score(y_valid, y_pred_os))
print('Confusion matrix: ')
print(confusion_matrix(y_valid,y_pred_os))
print(classification_report(y_valid,y_pred_os))
The model with oversampled data performs slightly better than the baseline model.
5. Class_weight hyperparameter
Algorithms that support it (like random forest) expose a 'class_weight' hyperparameter that assigns a weight, or importance, to each class.
Suppose class A has 100 observations and class B has 10 observations. In this case, if we assign a weight of 1 to class A and a weight of 10 to class B then the model will treat 1 sample of class B as equal to 10 samples of class A.
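Instead of hand-picking weights, scikit-learn can also derive them from the class counts via its 'balanced' heuristic, n_samples / (n_classes * class_count). A quick sketch using the made-up class sizes above (100 vs. 10):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 100 samples of class 0, 10 samples of class 1
y = np.array([0] * 100 + [1] * 10)

weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y)
# class 0 -> 110 / (2 * 100) = 0.55, class 1 -> 110 / (2 * 10) = 5.5
class_weight = {c: float(w) for c, w in zip([0, 1], weights)}
print(class_weight)  # {0: 0.55, 1: 5.5}
```

The resulting dict can be passed directly as the class_weight argument, the same way the manual dict is used below.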
# assign a weight of 1 to class 0 and 100 to class 1
class_weight = {0: 1, 1: 100}

# train a model
classifier = RandomForestClassifier(class_weight=class_weight)
classifier.fit(X_train, y_train)

# make predictions
y_pred = classifier.predict(X_valid)
print(confusion_matrix(y_valid, y_pred))
print(accuracy_score(y_valid,y_pred))
print(classification_report(y_valid,y_pred))
6. SMOTE technique
SMOTE is an abbreviation for Synthetic Minority Oversampling Technique.
SMOTE works by randomly picking a point from the minority class and computing its k-nearest minority-class neighbors. Synthetic points are then created at random positions along the line segments joining the chosen point to its neighbors.
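The interpolation step can be sketched in a few lines of NumPy (a simplified illustration of the idea, not imblearn's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_point(x, neighbor):
    """Create one synthetic sample on the line segment between a
    minority sample and one of its k-nearest minority neighbors."""
    gap = rng.random()              # random fraction in [0, 1)
    return x + gap * (neighbor - x)

x = np.array([1.0, 2.0])
neighbor = np.array([3.0, 4.0])
synthetic = smote_point(x, neighbor)
print(synthetic)  # lies between x and neighbor, component-wise
```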
# import library
from imblearn.over_sampling import SMOTE

smote = SMOTE()
# resample the predictors and target variable
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print('Original dataset shape', Counter(y_train))
print('Resampled dataset shape', Counter(y_train_smote))
# model training
rf_smote = RandomForestClassifier()
rf_smote.fit(X_train_smote, y_train_smote)

# make predictions
y_pred_smote = rf_smote.predict(X_valid)
print(confusion_matrix(y_valid, y_pred_smote))
print(accuracy_score(y_valid,y_pred_smote))
print(classification_report(y_valid,y_pred_smote))
7. SMOTETomek technique
Tomek links are pairs of very close instances but of opposite classes. Removing the instances of the majority class of each pair increases the space between the two classes, facilitating the classification process.
SMOTETomek is a hybrid of SMOTE and Tomek links. In this technique, the minority class is first oversampled using SMOTE, and then Tomek links are removed from the resampled data, which cleans the boundary between the classes and helps prevent overfitting.
# import library
from imblearn.combine import SMOTETomek

smo = SMOTETomek()
X_train_smo, y_train_smo = smo.fit_resample(X_train, y_train)
print("The number of classes before fit {}".format(Counter(y_train)))
print("The number of classes after fit {}".format(Counter(y_train_smo)))
# model training
rf_smo = RandomForestClassifier()
rf_smo.fit(X_train_smo, y_train_smo)

# make predictions
y_pred_smo = rf_smo.predict(X_valid)
print(confusion_matrix(y_valid, y_pred_smo))
print(accuracy_score(y_valid,y_pred_smo))
print(classification_report(y_valid,y_pred_smo))
Please find the Jupyter notebook with the complete code at the link below.