Banking Project Final
By:
E. AuroRajashri
List of Content
1. Introduction
1.1. Defining problem statement
1.2. Need of the study/project
4. Model building
4.1 Build various models
4.2 Test your predictive model against the test set using various performance
metrics
4.3 Interpretation of the model(s)
4.4 Ensemble modelling, wherever applicable
4.5 Interpretation of the most optimum model and its implication on the business
5. Model validation
5.1 Various model Validation measures
List of Tables
2.2 Descriptive Statistics
List of Figures
3.1.1 Histogram of age
5.2.6 Confusion Matrix – DTC
6.2.13 Accuracy score – Grid search cv using DTC
1. Introduction
1.1 Defining problem statement
Problem Statement: This business problem is a supervised learning example for a credit card company. The objective is to predict the probability of default (whether or not the customer will pay the credit card bill) based on the variables provided. The dataset contains multiple variables covering credit card account, purchase, and delinquency information that can be used in the modelling.
Probability of default (PD) modelling problems are meant for understanding the riskiness of customers and how much credit is at stake in case a customer defaults. This is an extremely critical function in any organization that lends money (both secured and unsecured loans).
• The objective of this project is to develop a predictive model that
estimates the probability of default for credit card customers. This
involves using the provided dataset, which contains various variables
related to credit card accounts, purchases, and delinquency
information, to understand the riskiness of customers.
• By accurately predicting the likelihood of default, the credit card
company can better assess the credit risk associated with each
customer and make informed decisions regarding credit limits,
interest rates, and other lending terms. This is crucial for minimizing
potential losses and managing the overall credit risk portfolio of the
organization.
4. Regulatory Compliance: Financial institutions are often required to maintain
certain levels of capital reserves based on the riskiness of their loan portfolios.
Accurate PD models help in meeting these regulatory requirements by
providing a clear picture of potential defaults.
Overall, this study is essential for enhancing the financial stability and
operational efficiency of lending institutions, ultimately contributing to their
long-term success.
Recent example: Yes Bank, once one of India's fastest-growing private sector
banks, faced a severe crisis in 2020 due to its inability to manage credit risk
effectively.
Recent example: In 2019, SBI implemented an AI-powered credit scoring system
to assess loan applications and predict the probability of default. This system has
helped SBI better manage its non-performing assets (NPAs) by more accurately
predicting which borrowers are likely to default.
This histogram shows the distribution of age in the dataset. We can observe that:
• The age distribution is right-skewed, with most customers falling in the range
of 25-45 years old.
• The peak of the distribution is around 30-35 years old.
• There are fewer customers in the older age ranges (above 60).
3.1.2 Histogram of Time_hours
Key insights:
• Concentration of Transactions: The Direct selling establishments category
dominates with the highest count, nearly 40,000, far exceeding the other
categories. This indicates a large number of transactions or significant activity
in this category.
• Moderate Activity: Categories like Books & Magazines and Youthful Shoes &
Clothing have moderate counts (around 10,000–15,000), showing significant
but not overwhelming activity compared to the leader.
• Low Activity Categories: Categories like Dietary Supplements, Prints &
Photos, and Diversified electronics have much lower counts (under 10,000).
These are niche categories with fewer transactions.
• Category Variety: The top 10 categories represent a broad range of industries,
including electronics, apparel, outdoor gear, books, and general merchandise.
This indicates diverse customer interests.
This bar chart shows the top 10 merchant groups and the count of transactions or occurrences associated with each group. Here's a breakdown of the insights:
• Entertainment is by far the dominant category, with significantly more counts
(around 50,000) than the other categories. This suggests that consumers
engage with or spend more in this group.
• Clothing & Shoes follows as the second-highest group, though it's much lower
than Entertainment.
• The groups with the lowest counts are Jewelry & Accessories, Home & Garden,
Intangible Products, and Automotive Products.
• The distribution shows that spending or transaction volume is concentrated
heavily in Entertainment, with other categories having relatively smaller but
still notable volumes.
• Variables like sum_capital_paid_account_0_12m and num_active_tl also
show extreme right-skewness, where the majority of data points are
concentrated at lower values. Many variables, like num_tl_90g_dpd_24m,
num_actv_bc_tl, and max_bal_bc, have a significant concentration of values
near zero, indicating that for these variables, the majority of the data points
reflect minimal activity or involvement (e.g., low number of transactions or
minimal balance).
• In many histograms (e.g., recovery_label,
sum_capital_paid_account_0_12m), there are long tails indicating the
presence of outliers or extreme values. This implies that there are a few cases
where the values are much higher than the rest of the data.
This barplot compares the average account amount added in the last 12-24 months
for customers who defaulted (1) versus those who didn't (0). We can see that:
• Customers who defaulted (1) tend to have a higher average account amount
added compared to those who didn't default (0).
• This could suggest that customers who add larger amounts to their accounts
might be at a higher risk of default, possibly due to overextending their
financial capabilities.
This strip plot shows the distribution of the maximum paid invoice in the last 12
months for defaulted and non-defaulted customers. Observations:
• The distribution for non-defaulted customers (0) appears to be more
concentrated in the lower range, with some high-value outliers.
This violin plot displays the age distribution for defaulted and non-defaulted
customers.
• The age distributions are fairly similar for both groups.
• Both distributions are slightly right-skewed, with most customers between
25-45 years old.
• There's a slight indication that defaulted customers might be younger on
average, but the difference doesn't appear to be substantial.
Key Insights:
1. Highly Correlated Features:
• Features with a correlation coefficient close to 1 or -1 have a very strong
linear relationship, either positively or negatively correlated.
• For example, if max_paid_inv_0_12m and num_active_inv_0_12m show
high positive correlation, it implies that as the number of active invoices
increases, the maximum paid invoice also tends to increase.
• Similarly, features like acct_worst_status_12_24m might be strongly
correlated with acct_worst_status_6_12m, indicating a consistency in
worst account status over different periods.
2. Clusters of Features:
• Features that are highly correlated with each other may form "clusters." For
instance, all account status variables or payment-related features might be
grouped together, showing that they are related aspects of customer
behavior.
• Clustering often reveals related features that can be treated similarly in
model building or analysis, as they provide overlapping information.
3. Negative Correlations:
• Strong negative correlations (close to -1) indicate an inverse relationship.
For example, if default_status has a negative correlation with
max_paid_inv_0_12m, it means that customers with higher max paid
invoices are less likely to default.
• Similarly, a negative correlation between
acct_incoming_debt_vs_paid_0_24m and acct_days_in_rem_12_24m
might show that the more days a person remains in arrears, the less they
manage to reduce their outstanding debt.
4. Redundancy:
• Features that are almost perfectly correlated (near 1) may represent
redundant information. For example, if acct_worst_status_6_12m and
acct_worst_status_3_6m are highly correlated, it may be redundant to
include both in certain analyses. One of these features can potentially be
dropped in a model without losing valuable information.
5. Outliers in Correlation:
• If there are features that stand out with unexpectedly high or low
correlations compared to others, they may warrant deeper investigation.
These outliers could represent key insights into behavior or relationships
between variables that are not immediately obvious.
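As an illustration of how this correlation analysis can be reproduced, the sketch below computes the correlation matrix, draws a heatmap, and flags near-redundant feature pairs. The DataFrame name df, the file name, and the 0.9 threshold are illustrative assumptions, not values taken from the project.

    # Illustrative correlation check; df, the file name and the 0.9 threshold are assumed.
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("credit_default_data.csv")      # hypothetical file name

    corr = df.select_dtypes("number").corr()          # pairwise correlations of numeric features
    sns.heatmap(corr, cmap="coolwarm", center=0)      # visual overview of the correlation structure
    plt.show()

    # Flag near-redundant pairs above an illustrative |r| > 0.9 threshold
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if abs(corr.loc[a, b]) > 0.9:
                print(f"{a} vs {b}: {corr.loc[a, b]:.2f}")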
3. Data Cleaning and Pre-processing
3.1 Removal of unwanted variables
• Removed the userid variable.
• There are 615,512 missing values in total. The percentage of missing values in each variable was calculated, and the result is shown below.
• Columns with more than 25% missing values were dropped; the missing-value counts of the remaining columns are shown below (see the sketch after this list).
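A minimal sketch of this missing-value treatment, assuming the data is held in a pandas DataFrame named df (hypothetical name) and using the 25% threshold mentioned above:

    # Percentage of missing values per column, then drop columns above the 25% threshold
    missing_pct = df.isnull().sum() / len(df) * 100
    print(missing_pct.sort_values(ascending=False))

    cols_to_drop = missing_pct[missing_pct > 25].index
    df = df.drop(columns=cols_to_drop)
    print(df.isnull().sum())                          # missing values in the remaining columns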
3.4.2 After dropping columns at the 25% missing-value threshold
3.5.1 Outliers using box plot
3.6.1 One-hot encoding
• Imported the SVC class from sklearn.svm.
• The model was fitted to the training data set (a minimal sketch follows below).
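A minimal sketch of this step, assuming X_train, y_train and X_test are the train/test splits prepared earlier (the variable names are illustrative):

    from sklearn.svm import SVC

    svm_model = SVC(random_state=42)                  # default RBF kernel; settings are illustrative
    svm_model.fit(X_train, y_train)                   # fit on the training data
    y_pred_svm = svm_model.predict(X_test)            # predictions used in section 4.2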
4.2 Test your predictive model against the test set using
various appropriate performance metrics
• Imported a few evaluation utilities from sklearn.metrics: confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay, classification_report, and accuracy_score.
Random forest classifier:
• The classifier makes predictions on the test set, and the accuracy score is calculated. The accuracy achieved is 98.43% (see the sketch below).
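A hedged sketch of this evaluation flow, combining the metric imports listed in 4.2 with the Random Forest step (variable names such as X_train and y_test are assumed from the earlier split):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import (accuracy_score, classification_report,
                                 confusion_matrix, ConfusionMatrixDisplay)

    rfc = RandomForestClassifier(random_state=42)     # default settings; illustrative
    rfc.fit(X_train, y_train)
    y_pred_rfc = rfc.predict(X_test)

    print("Accuracy:", accuracy_score(y_test, y_pred_rfc))   # ~0.98 as reported above
    print(classification_report(y_test, y_pred_rfc))
    ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_rfc)).plot()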
2. However, it struggles with identifying defaulters (low precision, very low
recall, and low F1-score for class 1)
3. The high overall accuracy (98.42%) is misleading due to the class imbalance
4. The large difference between macro and weighted averages further highlights
the impact of class imbalance
1.2.5 Accuracy score – DTC
• Key points from ROC curve:
1. The ROC curve is close to the diagonal line, which represents random
performance. This further confirms that the classifier's performance is
not strong.
2. AUC of 0.57 suggests that the classifier has slightly better performance
than random guessing but is not very effective.
3. Overall, the Decision Tree Classifier in this case has limited
discriminative ability, as indicated by the low AUC score and the
shape of the ROC curve. Improvements might be needed, such as
tuning the model parameters or using a different classification
algorithm.
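The ROC curve and AUC discussed above can be produced along these lines; dtc is assumed to be the fitted Decision Tree Classifier from 4.1 (hypothetical variable name):

    from sklearn.metrics import roc_curve, roc_auc_score
    import matplotlib.pyplot as plt

    y_prob_dtc = dtc.predict_proba(X_test)[:, 1]      # predicted probability of default (class 1)
    fpr, tpr, _ = roc_curve(y_test, y_prob_dtc)
    print("AUC:", roc_auc_score(y_test, y_prob_dtc))  # reported as roughly 0.57

    plt.plot(fpr, tpr, label="Decision Tree")
    plt.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()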
• While the Naive Bayes Classifier does reasonably well at identifying Class 0
(with high true negatives), it performs poorly in identifying Class 1
(Defaulters), as seen by the low number of true positives and high false
negatives.
1.2.10 Confusion Matrix – NBC
1.2.12 ROC Curve – NBC
• From the classification report, the model performs very well in predicting non-
defaulters but completely fails to detect Defaulters. This could be due to
class imbalance.
1.2.15 Classification report – SVM
• An AUC of 0.5 indicates that the model performs no better than random
guessing, meaning it has no discriminative power to distinguish between
the classes.
• The Random Forest has a good accuracy (98.43%) and a relatively high
AUC (0.80), which indicates it performs well in distinguishing classes.
However, its precision (0.40) and recall (0.08) for the minority class (likely
Defaulters) are quite low, showing that it struggles with class imbalance.
• The Decision Tree model has a lower AUC (0.58), and precision, recall, and
F1-scores are also quite low. It struggles more compared to Random Forest
in separating the classes, and overall performance indicates that it might
need tuning.
• Naive Bayes has a lower accuracy (95.98%), and while its precision is low
(0.08), it has a relatively higher recall (0.16). The AUC score is similar to
Random Forest (0.80), but its low precision indicates that it struggles with
false positives.
• The SVM model has a very high precision (1.00) but a recall of 0, meaning
it does not detect any Defaulters at all. This results in an F1-score of 0 and a
low AUC (0.50), indicating it performs no better than random guessing.
• For models like Decision Tree, using boosting techniques (e.g., Gradient
Boosting, XGBoost) could improve performance by focusing on the
misclassified instances.
• So, ensemble methods and model tuning are needed for the more effective models.
4.4 Ensemble modelling:
Bagging Classifier using Decision Tree:
• Imported BaggingClassifier from sklearn.ensemble
• The ensemble makes predictions on the test set, and the accuracy score is calculated. The accuracy achieved is 98.41% (see the sketch below).
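A sketch of the bagging step under the same assumed variable names; the number of estimators is an illustrative choice, not the project's exact setting:

    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    bag_dtc = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=42)
    bag_dtc.fit(X_train, y_train)
    y_pred_bag = bag_dtc.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred_bag))   # ~0.984 as reported above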
3. The low recall (8.6%) for Defaulters means the model is missing
most of the Defaulters. This can be dangerous in scenarios where
detecting Defaulters is important.
2.1.5 Accuracy score – Ada boosting
• The classifier does a good job overall, with a relatively high AUC score.
• Although the classifier performs well in general, it may still fail to correctly
identify the minority class (Class 1) as shown by its low recall and F1-score
for that class.
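A sketch of the AdaBoost step referred to above (hyperparameters and variable names are illustrative assumptions):

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.metrics import accuracy_score, classification_report

    ada = AdaBoostClassifier(n_estimators=100, random_state=42)
    ada.fit(X_train, y_train)
    y_pred_ada = ada.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred_ada))
    print(classification_report(y_test, y_pred_ada))  # note the low recall/F1 for class 1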
2.1.8 ROC Curve – Ada boosting
• Since accuracy is misleading with imbalanced data, using metrics like F1-
score, precision-recall curve, or ROC-AUC may provide better insight into
model performance.
2.1.12 ROC Curve – gradient boosting
• After applying ensembling, below are the results of all the models and their performance.
• The score() method is used to evaluate the model (which was trained earlier using RandomizedSearchCV) on the test set X_test and y_test.
• It returns the accuracy of the model on the test set, which is stored in the variable accuracy (see the sketch below).
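A sketch of this tuning-and-evaluation flow; the parameter grid and variable names are assumptions rather than the project's exact settings:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    param_dist = {"n_estimators": [100, 200, 300],
                  "max_depth": [None, 5, 10, 20],
                  "min_samples_split": [2, 5, 10]}

    random_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                       param_distributions=param_dist,
                                       n_iter=10, cv=5, random_state=42)
    random_search.fit(X_train, y_train)

    accuracy = random_search.score(X_test, y_test)    # test-set accuracy of the best estimator
    print("Test accuracy:", accuracy)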
2.2.1 Accuracy score – Randomized search cv using RFC
• The best parameters for the Decision Tree model are displayed, along with a best accuracy of 0.99, meaning the model performed very well during cross-validation.
• It returns the accuracy of the model on the test set, which is stored in the variable accuracy; the value is 0.98 (see the sketch below).
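The reported values correspond to attributes of the fitted search object; dt_search below is a hypothetical name for the RandomizedSearchCV fitted on the Decision Tree:

    print("Best parameters:", dt_search.best_params_)          # tuned Decision Tree settings
    print("Best CV accuracy:", dt_search.best_score_)          # ~0.99 as reported above
    accuracy = dt_search.score(X_test, y_test)                 # ~0.98 as reported above
    print("Test accuracy:", accuracy)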
2.2.8 ROC Curve – Randomized search cv using DTC
2.2.12 ROC Curve – Randomized search cv using NB
2.2.16 ROC curve – Grid search cv using DTC
2.2.20 ROC Curve – Grid search cv using NB
• Balanced Performance: This model provides a good balance
between different metrics. While some models might excel in one
area but perform poorly in others, this model maintains high
scores across accuracy, precision, and AUC.
• Advantages of Random Forest: The base algorithm (Random Forest) is known for its robustness and ability to handle complex relationships in data. It is an ensemble method that combines multiple decision trees, which helps in reducing overfitting and improving generalization.
• Hyperparameter Optimization: The use of RandomizedSearchCV indicates that the model's hyperparameters have been optimized. This process helps in finding the best configuration of the Random Forest algorithm for this specific dataset, potentially improving its performance over a standard Random Forest.
• Feature Importance Visualization: see the sketch below for how the feature-importance plot of the tuned Random Forest is produced.
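A sketch of how such a feature-importance plot can be produced, assuming random_search.best_estimator_ is the tuned Random Forest and X_train is a pandas DataFrame (both names are assumptions):

    import pandas as pd
    import matplotlib.pyplot as plt

    importances = pd.Series(random_search.best_estimator_.feature_importances_,
                            index=X_train.columns).sort_values(ascending=False)
    importances.head(15).plot(kind="barh")            # top 15 most influential features
    plt.gca().invert_yaxis()                          # largest importance at the top
    plt.title("Random Forest feature importances")
    plt.show()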
5. Model Validation
To choose the final model, the important performance metrics need to be identified for the respective problem.
• Precision is critical to avoid incorrectly labeling non-defaulting customers as
defaults, which can lead to reputational damage and loss of trust.
• Recall is important to capture as many actual defaults as possible, but since the focus is on ensuring that defaults are accurately identified without mistakenly flagging non-defaults, precision takes precedence.
• For imbalanced datasets, the AUC-ROC is the most important metric. This is
because it evaluates the model's ability to distinguish between classes across
all possible thresholds, providing a comprehensive view of performance
regardless of class distribution.
So, the Randomized Search CV-tuned Random Forest classifier is chosen as the final model, as it offers a balance between the performance metrics, especially precision and AUC-ROC.
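The validation measures discussed above can be computed for the chosen model along these lines (random_search and the split names are assumptions carried over from the tuning step):

    from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

    best_rf = random_search.best_estimator_
    y_pred = best_rf.predict(X_test)
    y_prob = best_rf.predict_proba(X_test)[:, 1]

    print("Precision:", precision_score(y_test, y_pred))
    print("Recall   :", recall_score(y_test, y_pred))
    print("F1-score :", f1_score(y_test, y_pred))
    print("AUC-ROC  :", roc_auc_score(y_test, y_prob))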
• External Data Integration: The credit card company could consider
integrating external data sources, such as macroeconomic indicators, industry
trends, or customer financial information from other sources, to enrich the
analysis and gain a more comprehensive understanding of the factors
influencing default risk.
By implementing these recommendations, the banking organization can leverage
the model to improve its loan default prediction capabilities, make more
informed decisions, and ultimately enhance its risk management practices.