
Banking Project

(Capstone Project – Final Report)


DSBA

By:
E. AuroRajashri

-1-
List of Contents
1. Introduction 6
1.1. Defining problem statement
1.2. Need of the study/project

2. Exploratory data analysis 7


2.1 Univariate analysis (distribution and spread for every continuous attribute,
distribution of data in categories for categorical ones)
2.2 Bivariate analysis (relationship between different variables, correlations)

3. Data Cleaning and Pre-processing 14


3.1 Removal of unwanted variables (if applicable)
3.2 Missing Value treatment (if applicable)
3.3 Outlier treatment (if required)
3.4 Variable transformation (if applicable)

4. Model building 17
4.1 Build various models
4.2 Test your predictive model against the test set using various performance
metrics
4.3 Interpretation of the model(s)
4.4 Ensemble modelling, wherever applicable
4.5 Interpretation of the most optimum model and its implication on the business

5. Model validation 36
5.1 Various model Validation measures

6. Final interpretation / recommendation 37

-2-
List of Tables
2.2 Descriptive Statistics

2.3 Data Info

3.3.1 Removed User id variable from the data frame

3.3.2 Removed Name in email variable from the data frame

3.4.1 Percentage of missing value per column

3.4.2 Post dropping off columns with 25% threshold

3.4.3 Post Imputation- Missing values

3.6.1 One-hot encoding

4.2.1 Post scaling treatment

4.2.2 Inertia of various n_clusters

4.2.4 Final dataset post clustering

List of Figures
3.1.1 Histogram of age

3.1.2 Histogram of Time_hours

3.1.3 Number of Defaulters and Non-defaulters

3.1.4 Top 10 Merchant Categories

3.1.5 Top 10 Merchant groups

3.1.6 Histogram of all numerical variables

3.2.1 Average Account Amount Added (12-24m) by Default status

3.2.2 Distribution of Max paid invoice(0-12m) by Default status

3.2.3 Violin plot: Age distribution by default status

3.2.4 Heat Map -Correlation

3.5.1 Outliers using box plot

3.5.2 Post Outliers Treatment

4.2.3 Elbow graph

5.1 Train and test data

5.2.1 Accuracy score – Random Forest

5.2.2 Confusion Matrix – Random Forest

5.2.3 Classification report – Random Forest

5.2.4 ROC Curve – Random Forest

5.2.5 Accuracy score – DTC

-3-
5.2.6 Confusion Matrix – DTC

5.2.7 Classification report – DTC

5.2.8 ROC curve – DTC

5.2.9 Accuracy score – NBC

5.2.10 Confusion Matrix – NBC

5.2.11 Classification report – NBC

5.2.12 ROC Curve – NBC

5.2.13 Accuracy score – SVM

5.2.14 Confusion matrix – SVM

5.2.15 Classification report – SVM

5.2.16 ROC Curve– SVM

6.1.1 Accuracy score – Bagging

6.1.2 Confusion matrix – Bagging

6.1.3 Classification report – Bagging

6.1.4 ROC Curve – Bagging

6.1.5 Accuracy score – Ada boosting

6.1.6 Confusion matrix – Ada boosting

6.1.7 Classification report – Ada boosting

6.1.8 ROC Curve – Ada boosting

6.1.9 Accuracy score – gradient boosting

6.1.10 confusion matrix – gradient boosting

6.1.11 classification report – gradient boosting

6.1.12 ROC Curve – gradient boosting

6.1.13 Performance metrics of models

6.2.1 Accuracy score – Randomized search cv using RFC

6.2.2 Confusion matrix – Randomized search cv using RFC

6.2.3 Classification report – Randomized search cv using RFC

6.2.4 ROC Curve – Randomized search cv using RFC

6.2.5 Accuracy score – Randomized search cv using DTC

6.2.6 Confusion matrix – Randomized search cv using DTC

6.2.7 Classification report – Randomized search cv using DTC

6.2.8 ROC Curve – Randomized search cv using DTC

6.2.9 Accuracy score – Randomized search cv using NB

6.2.10 Confusion matrix – Randomized search cv using NB

6.2.11 Classification report– Randomized search cv using NB

6.2.12 ROC Curve – Randomized search cv using NB

-4-
6.2.13 Accuracy score – Grid search cv using DTC

6.2.14 Confusion matrix – Grid search cv using DTC

6.2.15 Classification report – Grid search cv using DTC

6.2.16 ROC curve – Grid search cv using DTC

6.2.17 Accuracy score – Grid search cv using NB

6.2.18 Confusion matrix– Grid search cv using NB

6.2.19 Classification report – Grid search cv using NB

6.2.20 ROC Curve – Grid search cv using NB

6.2.21 Performance metrics of all models

6.3.1 Top 10 feature importances

-5-
1. Introduction
1.1 Defining problem statement
Problem Statement: This business problem is a supervised learning
example for a credit card company. The objective is to predict the probability
of default (whether the customer will pay the credit card bill or not) based on
the variables provided. There are multiple variables on the credit card
account, purchase and delinquency information which can be used in the
modelling.
PD modelling problems are meant for understanding the riskiness of the
customers and how much credit is at stake in case the customer defaults.
This is an extremely critical part of any organization that lends money (both
secured and unsecured loans).
• The objective of this project is to develop a predictive model that
estimates the probability of default for credit card customers. This
involves using the provided dataset, which contains various variables
related to credit card accounts, purchases, and delinquency
information, to understand the riskiness of customers.
• By accurately predicting the likelihood of default, the credit card
company can better assess the credit risk associated with each
customer and make informed decisions regarding credit limits,
interest rates, and other lending terms. This is crucial for minimizing
potential losses and managing the overall credit risk portfolio of the
organization.

1.2 Need of the study/project


The need for this study or project arises from the critical role that predicting the
probability of default (PD) plays in the financial industry, particularly for credit
card companies and other lending institutions. Here are some key reasons why
this study is essential:
1. Risk Management: Understanding the riskiness of customers is crucial for
managing the overall risk portfolio of a lending institution. By predicting the
likelihood of default, companies can make informed decisions about whom to
lend to and under what terms.
2. Credit Allocation: Accurate PD models help in determining the appropriate
amount of credit to extend to each customer. This ensures that credit is
allocated efficiently, maximizing returns while minimizing risk.
3. Loss Mitigation: By identifying high-risk customers, companies can take
proactive measures to mitigate potential losses. This might include adjusting
credit limits, changing interest rates, or implementing stricter repayment
terms.

-6-
4. Regulatory Compliance: Financial institutions are often required to maintain
certain levels of capital reserves based on the riskiness of their loan portfolios.
Accurate PD models help in meeting these regulatory requirements by
providing a clear picture of potential defaults.
Overall, this study is essential for enhancing the financial stability and
operational efficiency of lending institutions, ultimately contributing to their
long-term success.
Recent example: Yes Bank, once one of India's fastest-growing private sector
banks, faced a severe crisis in 2020 due to its inability to manage credit risk
effectively.
Recent example: In 2019, SBI implemented an AI-powered credit scoring system
to assess loan applications and predict the probability of default. This system has
helped SBI better manage its non-performing assets (NPAs) by more accurately
predicting which borrowers are likely to default.

2. Exploratory data analysis


2.1 Univariate analysis

3.1.1 Histogram of age

This histogram shows the distribution of age in the dataset. We can observe that:
• The age distribution is right-skewed, with most customers falling in the range
of 25-45 years old.
• The peak of the distribution is around 30-35 years old.
• There are fewer customers in the older age ranges (above 60).
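
The right-skew described above can be confirmed numerically with pandas. A minimal sketch, using hypothetical ages rather than the report's actual data:

```python
import pandas as pd

# Hypothetical ages standing in for the dataset's "age" column.
age = pd.Series([22, 25, 28, 30, 31, 33, 35, 38, 42, 45, 52, 60, 71])

# A positive skewness coefficient confirms a right-skewed distribution
# (long tail toward older ages), matching the histogram's shape.
print(round(age.skew(), 2))
```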

-7-
3.1.2 Histogram of Time_hours

• The data shows a higher frequency of occurrences around the 15 to 21-hour
mark, suggesting that most of the recorded times fall within this range.
• The distribution is skewed toward the later hours, with values concentrated
after hour 10 and fewer occurrences in the early hours.

3.1.3 Number of Defaulters and Non-defaulters

The dataset is highly imbalanced in terms of default status:


• 88,688 customers did not default (98.57%)

• 1,288 customers defaulted (1.43%)
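
The imbalance above can be reproduced with a simple frequency check. A sketch assuming the target column is named "default" (the report does not state the exact column name):

```python
import pandas as pd

# Reconstructed counts from the report; the column name "default"
# is an assumption for illustration.
df = pd.DataFrame({"default": [0] * 88688 + [1] * 1288})

counts = df["default"].value_counts()
pct = (df["default"].value_counts(normalize=True) * 100).round(2)
print(counts.to_dict())  # {0: 88688, 1: 1288}
print(pct.to_dict())     # {0: 98.57, 1: 1.43}
```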

3.1.4 Top 10 Merchant Categories

-8-
Key insights:
• Concentration of Transactions: The Direct selling establishments category
dominates with the highest count, nearly 40,000, far exceeding the other
categories. This indicates a large number of transactions or significant activity
in this category.
• Moderate Activity: Categories like Books & Magazines and Youthful Shoes &
Clothing have moderate counts (around 10,000–15,000), showing significant
but not overwhelming activity compared to the leader.
• Low Activity Categories: Categories like Dietary Supplements, Prints &
Photos, and Diversified electronics have much lower counts (under 10,000).
These are niche categories with fewer transactions.
• Category Variety: The top 10 categories represent a broad range of industries,
including electronics, apparel, outdoor gear, books, and general merchandise.
This indicates diverse customer interests.

3.1.5 Top 10 Merchant groups

The bar chart shows the top 10 merchant groups and the count of transactions or
occurrences associated with each group. Here's a breakdown of the insights:
• Entertainment is by far the dominant category, with significantly more counts
(around 50,000) than the other categories. This suggests that consumers
engage with or spend more in this group.
• Clothing & Shoes follows as the second-highest group, though it's much lower
than Entertainment.

-9-
• The groups with the lowest counts are Jewelry & Accessories, Home & Garden,
Intangible Products, and Automotive Products.
• The distribution shows that spending or transaction volume is concentrated
heavily in Entertainment, with other categories having relatively smaller but
still notable volumes.

3.1.6 Histogram of all numerical variables

• Many variables, such as acct_worst_status_0_24m, acct_worst_status_1_24m,
and num_active_rev_tl, show high frequencies at zero or low values with a
steep decline as the values increase. This suggests that most data points fall
in the lower range, with fewer high values.

- 10 -
• Variables like sum_capital_paid_account_0_12m and num_active_tl also
show extreme right-skewness, where the majority of data points are
concentrated at lower values. Many variables, like num_tl_90g_dpd_24m,
num_actv_bc_tl, and max_bal_bc, have a significant concentration of values
near zero, indicating that for these variables, the majority of the data points
reflect minimal activity or involvement (e.g., low number of transactions or
minimal balance).
• In many histograms (e.g., recovery_label,
sum_capital_paid_account_0_12m), there are long tails indicating the
presence of outliers or extreme values. This implies that there are a few cases
where the values are much higher than the rest of the data.

2.2 Bivariate analysis

3.2.1 Average Account Amount Added (12-24m) by Default status

This barplot compares the average account amount added in the last 12-24 months
for customers who defaulted (1) versus those who didn't (0). We can see that:
• Customers who defaulted (1) tend to have a higher average account amount
added compared to those who didn't default (0).
• This could suggest that customers who add larger amounts to their accounts
might be at a higher risk of default, possibly due to overextending their
financial capabilities.

3.2.2 Distribution of Max paid invoice(0-12m) by Default status

This strip plot shows the distribution of the maximum paid invoice in the last 12
months for defaulted and non-defaulted customers. Observations:

- 11 -
• The distribution for non-defaulted customers (0) appears to be more
concentrated in the lower range, with some high-value outliers.

• Non-defaulted accounts (status 0) show a wider and higher distribution of
max paid invoices, while defaulted accounts (status 1) have smaller invoice
amounts. This pattern could be used for risk assessment or to better
understand customer payment behaviour.

3.2.3 Violin plot: Age distribution by default status

This violin plot displays the age distribution for defaulted and non-defaulted
customers.
• The age distributions are fairly similar for both groups.
• Both distributions are slightly right-skewed, with most customers between
25-45 years old.
• There's a slight indication that defaulted customers might be younger on
average, but the difference doesn't appear to be substantial.

3.2.4 Heat Map -Correlation

- 12 -
Key Insights:
1. Highly Correlated Features:
• Features with a correlation coefficient close to 1 or -1 have a very strong
linear relationship, either positively or negatively correlated.
• For example, if max_paid_inv_0_12m and num_active_inv_0_12m show
high positive correlation, it implies that as the number of active invoices
increases, the maximum paid invoice also tends to increase.
• Similarly, features like acct_worst_status_12_24m might be strongly
correlated with acct_worst_status_6_12m, indicating a consistency in
worst account status over different periods.
2. Clusters of Features:
• Features that are highly correlated with each other may form "clusters." For
instance, all account status variables or payment-related features might be
grouped together, showing that they are related aspects of customer
behavior.
• Clustering often reveals related features that can be treated similarly in
model building or analysis, as they provide overlapping information.
3. Negative Correlations:
• Strong negative correlations (close to -1) indicate an inverse relationship.
For example, if default_status has a negative correlation with
max_paid_inv_0_12m, it means that customers with higher max paid
invoices are less likely to default.
• Similarly, a negative correlation between
acct_incoming_debt_vs_paid_0_24m and acct_days_in_rem_12_24m
might show that the more days a person remains in arrears, the less they
manage to reduce their outstanding debt.
4. Redundancy:
• Features that are almost perfectly correlated (near 1) may represent
redundant information. For example, if acct_worst_status_6_12m and
acct_worst_status_3_6m are highly correlated, it may be redundant to
include both in certain analyses. One of these features can potentially be
dropped in a model without losing valuable information.
5. Outliers in Correlation:
• If there are features that stand out with unexpectedly high or low
correlations compared to others, they may warrant deeper investigation.
These outliers could represent key insights into behavior or relationships
between variables that are not immediately obvious.

- 13 -
3. Data Cleaning and Pre-processing
3.1 Removal of unwanted variables
• Removed userid variable

3.3.1 Removed Userid from the data frame

• Removed name in email variable

3.3.2 Removed Name in email variable from the data frame

3.2 Missing Value treatment

• There are 615,512 missing values. The percentage of missing values in each
variable was calculated, and the result is below:

3.4.1 Percentage of missing value per column

• Columns with more than 25% missing values were dropped; the missing values
in the remaining columns are shown below.

- 14 -
3.4.2 Post dropping off columns with 25% threshold

• Using SimpleImputer, missing values were imputed with the median. Below is
the result post imputation.

3.4.3 Post Imputation- Missing values
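
The two steps above (dropping columns past the 25% threshold, then median imputation with SimpleImputer) can be sketched as follows; the toy columns here are hypothetical stand-ins for the dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame; column names are hypothetical stand-ins.
df = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 40.0],
    "max_paid_inv_0_12m": [100.0, np.nan, 300.0, 500.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

# Drop columns with more than 25% missing values.
missing_pct = df.isna().mean() * 100
df = df.drop(columns=missing_pct[missing_pct > 25].index)

# Impute the remaining missing values with the column median.
imputer = SimpleImputer(strategy="median")
df[df.columns] = imputer.fit_transform(df)
print(df.isna().sum().sum())  # 0 missing values remain
```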

3.3 Outlier treatment


• For outlier treatment, the data was separated into object and non-object
columns to visualize the outliers.

- 15 -
3.5.1 Outliers using box plot

• Post outlier treatment, the resulting box plots are shown below.

3.5.2 Post Outliers Treatment
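
The report does not state which outlier rule was applied; a common approach is IQR capping, sketched below on a toy series:

```python
import pandas as pd

# Toy series with one obvious outlier (95).
s = pd.Series([10, 12, 11, 13, 12, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values outside the whiskers instead of dropping the rows.
capped = s.clip(lower=lower, upper=upper)
print(capped.max() <= upper)  # True
```

Capping keeps every row (important when defaulters are rare) while bounding the extreme values seen in the box plots.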

3.4 Variable transformation


• One-hot encoding was done for Merchant Category and Merchant group
(categorical columns).
• Post that, shape of dataset is shown below:

- 16 -
3.6.1 One-hot encoding
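
One-hot encoding of the merchant columns can be sketched with pd.get_dummies (the report's notebook may have used a different encoder); the example values come from the top-10 charts earlier:

```python
import pandas as pd

# Example values taken from the merchant-group chart; column name assumed.
df = pd.DataFrame({
    "merchant_group": ["Entertainment", "Clothing & Shoes", "Entertainment"],
})

# Each category becomes its own indicator column.
encoded = pd.get_dummies(df, columns=["merchant_group"])
print(list(encoded.columns))
print(encoded.shape)  # (3, 2) – one column per distinct category
```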

4. Model building and interpretation


4.1 Build various models
Post EDA, building models is the next step.
• The dataset was split into training and testing sets before building the
models, as shown below:

5.1 Train and test data

Random forest classifier:


• Imported RandomForestClassifier from sklearn.ensemble
• It was fitted to the training data set.
Decision Tree classifier:
• Imported DecisionTreeClassifier library from sklearn.tree
• It was fitted to the training data set.
Naïve Bayes classifier:
• Imported GaussianNB from sklearn.naive_bayes
• It was fitted to the training data set.
Support Vector Machine:

- 17 -
• Imported SVC library from sklearn.svm
• It was fitted to the training data set.
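
The four fits described above can be condensed into one sketch on synthetic imbalanced data (the real features and split live in the report's notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the imbalanced credit dataset.
X, y = make_classification(n_samples=500, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))
```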

4.2 Test your predictive model against the test set using
various appropriate performance metrics
• Imported a few metric utilities: confusion_matrix, precision_score,
recall_score, ConfusionMatrixDisplay, classification_report, and
accuracy_score.
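
Using those utilities, evaluating a fitted model looks roughly like this (synthetic data here; the notebook uses the real train/test split):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print("accuracy:", accuracy_score(y_te, pred))
print(confusion_matrix(y_te, pred))   # rows: actual, columns: predicted
print(classification_report(y_te, pred))
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```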
Random forest classifier:
• The model makes predictions on the test set, and the accuracy score is
calculated. The accuracy achieved is 98.43%.

5.2.1 Accuracy score – Random Forest

• A confusion matrix is plotted using seaborn's heatmap function. This matrix
visualizes the performance of the classification model:
1. The top-left cell (25312) represents true negatives (correctly predicted Class 0)
2. The bottom-right cell (22) represents true positives (correctly predicted Class 1)
3. The top-right (39) and bottom-left (372) cells represent false positives and false
negatives respectively
4. The high number of correct predictions in the diagonal cells and low numbers in
the off-diagonal cells indicate that the model performs very well, which is
consistent with the high accuracy score.

5.2.2 Confusion Matrix – Random Forest

• A few points observed from the classification report:


1. The model performs very well in identifying non-defaulters (high precision,
recall, and F1-score for class 0)

2. However, it struggles with identifying defaulters (low precision, very low
recall, and low F1-score for class 1)
3. The high overall accuracy (98.42%) is misleading due to the class imbalance
4. The large difference between macro and weighted averages further highlights
the impact of class imbalance

5.2.3 Classification report – Random Forest

• Key points from ROC curve:


1. The AUC is 0.80, which suggests that the classifier has good
performance.
2. The curve is above the diagonal line, indicating that the classifier is
better than random guessing.
3. Overall, the Random Forest classifier is performing well, with a good
balance between sensitivity and specificity. An AUC of 0.80 suggests that
the model is effective at distinguishing between the two classes.

5.2.4 ROC Curve – Random Forest

Decision Tree classifier:


• The model makes predictions on the test set, and the accuracy score is
calculated. The accuracy achieved is 97.21%.

- 19 -
5.2.5 Accuracy score – DTC

• A confusion matrix is plotted using seaborn's heatmap function. This
confusion matrix suggests that while the Decision Tree Classifier performs
well for the majority class, it may need improvement in correctly identifying
the minority class.

5.2.6 Confusion Matrix – DTC

• A few points observed from the classification report:


1. The model performs very well in identifying non-defaulters (Class 0) with
high precision, recall, and F1-score (all above 0.98).
2. However, it struggles significantly with identifying defaulters (Class 1),
with low precision, recall, and F1-score.
3. The overall accuracy is high (0.972111), but this is misleading due to class
imbalance. There are far more non-defaulters than defaulters in the
dataset
4. The macro average, which gives equal weight to both classes, shows much
lower overall performance (around 0.55 for all metrics) due to the poor
performance on the minority class.
5. While this classifier is very good at identifying non-defaulters, it
performs poorly in detecting defaulters, which is likely the more
important class in many real-world scenarios.

5.2.7 Classification report – DTC

- 20 -
• Key points from ROC curve:
1. The ROC curve is close to the diagonal line, which represents random
performance. This further confirms that the classifier's performance is
not strong.
2. AUC of 0.57 suggests that the classifier has slightly better performance
than random guessing but is not very effective.
3. Overall, the Decision Tree Classifier in this case has limited
discriminative ability, as indicated by the low AUC score and the
shape of the ROC curve. Improvements might be needed, such as
tuning the model parameters or using a different classification
algorithm.

5.2.8 ROC curve – DTC

Naïve Bayes classifier:


• The model makes predictions on the test set, and the accuracy score is
calculated. The accuracy achieved is 95.98%.

5.2.9 Accuracy score – NBC

• While the Naive Bayes Classifier does reasonably well at identifying Class 0
(with high true negatives), it performs poorly in identifying Class 1
(Defaulters), as seen by the low number of true positives and high false
negatives

- 21 -
5.2.10 Confusion Matrix – NBC

• Based on the classification report, the Naive Bayes classifier is heavily
biased towards predicting "non-defaulters", which leads to very low precision,
recall, and F1-score for "Defaulters".

5.2.11 Classification report – NBC

• Key points of ROC curve:


1. The curve is above the random line: This confirms that the classifier is
better than random guessing.
2. Moderate AUC (0.80): The classifier performs well overall but still has
room for improvement, especially when considering that the classification
report showed poor results for the minority class (Defaulters).
3. A score of 0.80 means that there is an 80% chance that the classifier
will correctly distinguish between a randomly chosen "Defaulter" and a
"Non-Defaulter".

- 22 -
5.2.12 ROC Curve – NBC

Support Vector Machine:


• The model makes predictions on the test set, and the accuracy score is
calculated. The accuracy achieved is 98.47%.

5.2.13 Accuracy score – SVM

• Key points from confusion matrix:


1. The SVM classifier has predicted all instances as "Non-Defaulters" (Class
0). This is why there are no predictions for Class 1 (Defaulters).
2. The confusion matrix indicates that the classifier is highly biased towards
the majority class (Class 0), and it is not able to identify any instances of
the minority class (Class 1). This is often the result of severe class
imbalance, where the classifier is dominated by the large number of "non-
defaulters" and ignores the small number of "Defaulters."
3. Since all the actual "Defaulters" are misclassified as "non-defaulters," the
model has 0 recall for Class 1, which means it’s not useful for identifying
defaulters at all.

5.2.14 Confusion matrix – SVM

• From the classification report, the model performs very well in predicting
non-defaulters but completely fails to detect Defaulters. This could be due
to class imbalance.

- 23 -
5.2.15 Classification report – SVM

• An AUC of 0.5 indicates that the model performs no better than random
guessing, meaning it has no discriminative power to distinguish between
the classes.

5.2.16 ROC Curve – SVM

4.3 Interpretation of the model(s)

• The Random Forest has a good accuracy (98.43%) and a relatively high
AUC (0.80), which indicates it performs well in distinguishing classes.
However, its precision (0.40) and recall (0.08) for the minority class (likely
Defaulters) are quite low, showing that it struggles with class imbalance.
• The Decision Tree model has a lower AUC (0.58), and precision, recall, and
F1-scores are also quite low. It struggles more compared to Random Forest
in separating the classes, and overall performance indicates that it might
need tuning.
• Naive Bayes has a lower accuracy (95.98%), and while its precision is low
(0.08), it has a relatively higher recall (0.16). The AUC score is similar to
Random Forest (0.80), but its low precision indicates that it struggles with
false positives.
• The SVM model has a very high precision (1.00) but a recall of 0, meaning
it does not detect any Defaulters at all. This results in an F1-score of 0 and a
low AUC (0.50), indicating it performs no better than random guessing.
• For models like Decision Tree, using boosting techniques (e.g., Gradient
Boosting, XGBoost) could improve performance by focusing on the
misclassified instances.
• So, ensembling and model tuning are needed for the more effective models.
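
One lever the report does not explicitly apply is class weighting, which re-balances the training loss toward the rare Defaulter class; a hedged sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# class_weight="balanced" up-weights the minority (Defaulter) class
# in proportion to its rarity; results vary by dataset.
plain = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)
weighted = RandomForestClassifier(
    class_weight="balanced", random_state=1).fit(X_tr, y_tr)

print("plain recall:", recall_score(y_te, plain.predict(X_te)))
print("weighted recall:", recall_score(y_te, weighted.predict(X_te)))
```

Comparing minority-class recall before and after weighting is a quick way to judge whether the imbalance, rather than the model family, is the bottleneck.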

- 24 -
4.4 Ensemble modelling:
Bagging Classifier using Decision Tree:
• Imported BaggingClassifier from sklearn.ensemble
• The model makes predictions on the test set, and the accuracy score is
calculated. The accuracy achieved is 98.41%.

6.1.1 Accuracy score – Bagging
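
A minimal sketch of the bagging setup described above, on synthetic data (the hyperparameters are illustrative, not the report's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, weights=[0.9], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

# Bagging trains many trees on bootstrap samples and majority-votes
# their predictions, reducing the variance of a single tree.
bag = BaggingClassifier(DecisionTreeClassifier(random_state=7),
                        n_estimators=50, random_state=7)
bag.fit(X_tr, y_tr)
print(round(bag.score(X_te, y_te), 3))
```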

• Key points on confusion matrix:


1. Class 0 (non-defaulters) is being predicted quite accurately, with
25,308 correct predictions and only 43 false positives. This
suggests that the Bagging Classifier performs well on the majority class.
2. Class 1 (Defaulters) is where the model struggles. Out of 394 true
instances of Defaulters (from the earlier report), it correctly identified
only 27. The remaining 367 Defaulters were misclassified as non-
defaulters, leading to a high false negative rate.

6.1.2 Confusion matrix – Bagging

• Key points on classification report:


1. The classifier is doing well on the majority class (Non-Defaulters),
but it performs poorly on the minority class (Defaulters). This can
be seen in the low precision, recall, and F1-score for Defaulters.
2. The overall accuracy (98.4%) is high because of the class imbalance.
The model is heavily skewed toward predicting non-defaulters
correctly but is failing to capture the Defaulters, which is crucial in
many real-world applications

3. The low recall (8.6%) for Defaulters means the model is missing
most of the Defaulters. This can be dangerous in scenarios where
detecting Defaulters is important.

6.1.3 Classification report – Bagging

• Key points on ROC curve:


1. The AUC score is 0.80, which indicates a good model. A perfect model
would have an AUC of 1, while a random model would have an AUC of
0.5. AUC = 0.80 means that 80% of the time, the model will correctly
distinguish between a Defaulter and a Non-Defaulter.
2. Good Performance: An AUC of 0.80 is a strong indicator that the
Bagging Classifier has a good balance between correctly identifying
Defaulters while minimizing the number of false positives.
3. Although the AUC score is 0.80, which indicates a good model, it’s
essential to balance the trade-off between recall and precision,
especially in contexts where false positives or false negatives can have
significant costs.

6.1.4 ROC Curve – Bagging

Ada Boosting Classifier using Decision Tree:


• The model makes predictions on the test set, and the accuracy score is
calculated. The accuracy achieved is 98.45%.

- 26 -
6.1.5 Accuracy score – Ada boosting

• Based on the confusion matrix, the model performs poorly at identifying
Class 1 (Defaulters), with only 5 true positives and 389 false negatives.
This means the model frequently misclassifies Class 1 as Class 0.

6.1.6 Confusion matrix – Ada boosting

• From the classification report, the model is very good at identifying
Non-Defaulters (Class 0) but performs poorly for Defaulters (Class 1).

6.1.7 Classification report – Ada boosting

• The classifier does a good job overall, with a relatively high AUC score.
• Although the classifier performs well in general, it may still fail to correctly
identify the minority class (Class 1) as shown by its low recall and F1-score
for that class.

- 27 -
6.1.8 ROC Curve – Ada boosting

Gradient Boosting classifier:


• Gradient Boosting primarily uses decision trees as the base model, and
through an iterative process of reducing prediction errors, it builds a
strong overall model from these weaker individual trees.
• With this, it achieves a fairly good accuracy score of 98.45%.

6.1.9 Accuracy score – gradient boosting
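
The iterative error-correcting process described above can be sketched as follows (synthetic data; hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, weights=[0.9], random_state=9)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=9)

# Each boosting stage fits a small tree to the residual errors of the
# previous stages, so many weak trees combine into a strong model.
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=9)
gbc.fit(X_tr, y_tr)
print(round(gbc.score(X_te, y_te), 3))
```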

• The confusion matrix suggests the model is skewed towards predicting
Class 0 more often and may not perform well on the minority Class 1.

6.1.10 Confusion matrix – gradient boosting

• Since accuracy is misleading with imbalanced data, using metrics like F1-
score, precision-recall curve, or ROC-AUC may provide better insight into
model performance.

6.1.11 Classification report – gradient boosting

- 28 -
6.1.12 ROC Curve – gradient boosting

• Post applying the ensembles, below are the results and performance of all
the models.

6.1.13 Performance metrics of models

Other model tuning measures


• Hyperparameter tuning techniques such as grid search and randomized search
were performed on the models.
Randomised Search CV using Random Forest Classifier:
• Performed hyperparameter tuning for a Random Forest Classifier using
RandomizedSearchCV from the sklearn library.
• The best parameters for the Random Forest model are displayed, along with a
best accuracy of 0.99, meaning the model performed very well during
cross-validation.

• The score() method is used to evaluate the model (which was trained
earlier using RandomizedSearchCV) on the test set X_test and y_test.
• It returns the accuracy of the model on the test set, which is stored in
the variable accuracy.

- 29 -
6.2.1 Accuracy score – Randomized search cv using RFC
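
A sketch of the randomized search; the parameter space shown here is illustrative, not the report's exact grid:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=400, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

# Illustrative search space; the report does not list its actual grid.
param_dist = {
    "n_estimators": randint(50, 200),
    "max_depth": [None, 5, 10],
    "min_samples_split": randint(2, 10),
}
# Samples n_iter random combinations instead of trying every one.
search = RandomizedSearchCV(RandomForestClassifier(random_state=3),
                            param_dist, n_iter=5, cv=3, random_state=3)
search.fit(X_tr, y_tr)
print(search.best_params_)
print("test accuracy:", round(search.score(X_te, y_te), 3))
```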

• The model is highly accurate for Class 0 but has difficulty distinguishing
Class 1, possibly due to class imbalance (many more instances of Class 0
compared to Class 1). This type of issue is common when one class dominates
the dataset.

6.2.2 Confusion matrix – Randomized search cv using RFC

6.2.3 Classification report – Randomized search cv using RFC

6.2.4 ROC Curve – Randomized search cv using RFC

Randomised Search CV using Decision Tree Classifier


• Performed hyperparameter tuning for a Decision tree Classifier using
RandomizedSearchCV from the sklearn library.

- 30 -
• The best parameters for the Decision Tree model are displayed, along with a
best accuracy of 0.99, meaning the model performed very well during
cross-validation.

• It returns the accuracy of the model on the test set, which is stored in
the variable accuracy (0.98).

6.2.5 Accuracy score – Randomized search cv using DTC

• The confusion matrix further reinforces the issue identified in the
classification report. The model is highly biased towards the majority class
(Non-Defaulters) and completely ignores the minority class (Defaulters).

6.2.6 Confusion matrix – Randomized search cv using DTC

6.2.7 Classification report – Randomized search cv using DTC

- 31 -
6.2.8 ROC Curve – Randomized search cv using DTC

Randomised Search CV for Naive Bayes (Bernoulli)

6.2.9 Accuracy score – Randomized search cv using NB

6.2.10 Confusion matrix – Randomized search cv using NB

6.2.11 Classification report – Randomized search cv using NB

- 32 -
6.2.12 ROC Curve – Randomized search cv using NB

Grid search in decision tree classifier


• The model now identifies some Defaulters (37), but the number of false
negatives (357) is still significant.
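
A sketch of the grid search over Decision Tree hyperparameters; the grid and the recall-oriented scoring choice here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, weights=[0.9], random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=5)

# Exhaustively tries every combination in the grid; scoring="recall"
# targets the costly error of missing Defaulters.
grid = GridSearchCV(DecisionTreeClassifier(random_state=5),
                    {"max_depth": [3, 5, None],
                     "min_samples_leaf": [1, 5, 10]},
                    cv=3, scoring="recall")
grid.fit(X_tr, y_tr)
print(grid.best_params_)
```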

6.2.13 Accuracy score – Grid search cv using DTC

6.2.14 Confusion matrix – Grid search cv using DTC

6.2.15 Classification report – Grid search cv using DTC

- 33 -
6.2.16 ROC curve – Grid search cv using DTC

Grid search in Bernoulli NB classifier

6.2.17 Accuracy score – Grid search cv using NB

6.2.18 Confusion matrix – Grid search cv using NB

6.2.19 Classification report – Grid search cv using NB

- 34 -
6.2.20 ROC Curve – Grid search cv using NB

6.2.21 Performance metrics of all models

4.5 Interpretation of the most optimum model and its implication on the business
• RandomizedSearchCV for RandomForestClassifier has the highest accuracy (98.50%) and a good balance of precision (0.77) and AUC score (0.88), making it a strong candidate for predicting default probability.
• The reasons for choosing this model are as follows:
• Highest Accuracy: The model achieves the highest accuracy of
98.50% among all the models presented. This means it correctly
predicts the outcome (default or non-default) for 98.50% of the
cases in the dataset. High accuracy is crucial in banking risk
assessment to minimize errors in predicting defaults.
• Strong Precision: With a precision of 0.77, this model has the highest precision among all models (tied with RandomizedSearchCV for the Decision Tree Classifier). Precision measures the proportion of true positive predictions (correctly predicted defaults) out of all positive predictions. A high precision means that when the model predicts a default, it is more likely to be correct, reducing false alarms.
• High AUC Score: The Area Under the Curve (AUC) score of 0.88 is
one of the highest among all models. AUC represents the model's
ability to distinguish between classes (default and non-default). A
score of 0.88 indicates that the model has a strong ability to
separate the two classes, which is crucial for a binary classification
problem like predicting loan defaults.

- 35 -
• Balanced Performance: This model provides a good balance
between different metrics. While some models might excel in one
area but perform poorly in others, this model maintains high
scores across accuracy, precision, and AUC.
• Advantages of Random Forest: The base algorithm (Random Forest) is known for its robustness and ability to handle complex relationships in data. It is an ensemble method that combines multiple decision trees, which helps in reducing overfitting and improving generalization.
• Hyperparameter Optimization: The use of RandomizedSearchCV indicates that the model's hyperparameters have been optimized. This process helps in finding the best configuration of the Random Forest algorithm for this specific dataset, potentially improving its performance over a standard Random Forest.
• Feature Importance Visualization:

2.3.1 Top 10 feature importances

• Financial behaviour: The majority of the important features seem to focus on financial behaviour, particularly payments, investments, and capital added within certain time frames (12 and 24 months).
• Age and time-related metrics: Age and duration within the system ("time_hours") are also influential, likely capturing aspects of experience, reliability, or maturity.

5. Model Validation

To choose the final model, the performance metrics most important for this problem need to be identified:
• Precision is critical to avoid incorrectly labelling non-defaulting customers as defaulters, which can lead to reputational damage and loss of trust.
• Recall is important to capture as many actual defaults as possible, but since the focus is on accurately identifying defaults without mistakenly flagging non-defaulters, precision takes precedence.
• For imbalanced datasets, AUC-ROC is the most important metric, because it evaluates the model's ability to distinguish between classes across all possible thresholds, providing a comprehensive view of performance regardless of class distribution.

- 36 -
So, Randomized Search CV for the Random Forest classifier is chosen as the final model, as it strikes a balance between the performance metrics, especially Precision and AUC-ROC.

6. Final interpretation / recommendations


• Explore Age-based Strategies: The slight tendency of younger customers to
have a higher default risk could be further investigated, and the credit card
company may consider implementing age-specific strategies to manage this
risk.
• Optimize Merchant Category Strategies: The insights on the concentration of
spending in certain merchant categories can guide the credit card company's
marketing and product strategies, allowing them to better cater to the
preferences and needs of their customers.
• Relationship between Merchant Categories and Default Risk: The analysis of the top merchant categories could be extended to investigate whether certain types of purchases are more strongly associated with default risk. This could provide valuable insights for the credit card company in identifying high-risk spending patterns. For example, the merchant category with the highest default rate is Youthful Shoes & Clothing, indicating a potential issue within that category.
• Clustering Insights: The optimal number of customer clusters was identified
as five using K-means clustering, suggesting distinct customer segments based
on financial behavior. This can help in targeted marketing and risk assessment
strategies.
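The cluster count mentioned above can be checked with the elbow method: fit K-means for a range of k and look for where inertia stops dropping sharply. A sketch on synthetic data (the project's k = 5 was determined on its own features, which are assumed here):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer-behaviour features
X, _ = make_blobs(n_samples=600, centers=5, random_state=7)
X = StandardScaler().fit_transform(X)

# Within-cluster sum of squares (inertia) for k = 1..9; the "elbow"
# where the curve flattens suggests a suitable number of clusters
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=7).fit(X).inertia_
    for k in range(1, 10)
}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```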
• Customer Risk Segmentation: Customers who defaulted tend to have
higher average account amounts added in the last 12-24 months, which may
indicate that higher financial activity could be associated with default risk.
Non-defaulting customers generally have higher maximum paid invoices
compared to defaulters, implying better financial behaviour in terms of
invoice payments.
• Leverage Feature Importance Insights: The top feature importances
such as financial behavior, age, and time-related metrics, can provide valuable
insights for the business. These insights can be used to enhance customer risk
profiling, develop targeted intervention strategies, and inform product design
or credit policies.
• Explore Explainable AI Techniques: Given the critical nature of loan
default prediction in the banking industry, it is essential to provide
interpretable and explainable model decisions. Consider incorporating XAI
techniques, such as SHAP values or feature importance analysis, to enhance
the transparency and trustworthiness of the model's predictions.

- 37 -
• External Data Integration: The credit card company could consider
integrating external data sources, such as macroeconomic indicators, industry
trends, or customer financial information from other sources, to enrich the
analysis and gain a more comprehensive understanding of the factors
influencing default risk.
By implementing these recommendations, the banking organization can leverage
the model to improve its loan default prediction capabilities, make more
informed decisions, and ultimately enhance its risk management practices.

- 38 -
