Banking Project Final
By:
E. AuroRajashri
List of Content
1. Introduction
1.1. Defining problem statement
1.2. Need of the study/project
4. Model building
4.1 Build various models
4.2 Test your predictive model against the test set using various performance
metrics
4.3 Interpretation of the model(s)
4.4 Ensemble modelling, wherever applicable
4.5 Interpretation of the most optimum model and its implication on the business
5. Model validation
5.1 Various model Validation measures
List of Tables
2.2 Descriptive Statistics
List of Figures
3.1.1 Histogram of age
5.2.6 Confusion Matrix – DTC
6.2.13 Accuracy score – Grid search cv using DTC
1. Introduction
1.1 Defining problem statement
Problem Statement: This business problem is a supervised learning example for a credit card company. The objective is to predict the probability of default (whether or not the customer will pay the credit card bill) based on the variables provided. The dataset contains multiple variables covering credit card account, purchase, and delinquency information that can be used in the modelling.
Probability of default (PD) modelling problems are meant for understanding the riskiness of customers and how much credit is at stake in case a customer defaults. This is an extremely critical function in any organization that lends money (both secured and unsecured loans).
• The objective of this project is to develop a predictive model that
estimates the probability of default for credit card customers. This
involves using the provided dataset, which contains various variables
related to credit card accounts, purchases, and delinquency
information, to understand the riskiness of customers.
• By accurately predicting the likelihood of default, the credit card
company can better assess the credit risk associated with each
customer and make informed decisions regarding credit limits,
interest rates, and other lending terms. This is crucial for minimizing
potential losses and managing the overall credit risk portfolio of the
organization.
4. Regulatory Compliance: Financial institutions are often required to maintain
certain levels of capital reserves based on the riskiness of their loan portfolios.
Accurate PD models help in meeting these regulatory requirements by
providing a clear picture of potential defaults.
Overall, this study is essential for enhancing the financial stability and
operational efficiency of lending institutions, ultimately contributing to their
long-term success.
Recent example: Yes Bank, once one of India's fastest-growing private sector
banks, faced a severe crisis in 2020 due to its inability to manage credit risk
effectively.
Recent example: In 2019, SBI implemented an AI-powered credit scoring system
to assess loan applications and predict the probability of default. This system has
helped SBI better manage its non-performing assets (NPAs) by more accurately
predicting which borrowers are likely to default.
This histogram shows the distribution of age in the dataset. We can observe that:
• The age distribution is right-skewed, with most customers falling in the range
of 25-45 years old.
• The peak of the distribution is around 30-35 years old.
• There are fewer customers in the older age ranges (above 60).
3.1.2 Histogram of Time_hours
Key insights:
• Concentration of Transactions: The Direct selling establishments category
dominates with the highest count, nearly 40,000, far exceeding the other
categories. This indicates a large number of transactions or significant activity
in this category.
• Moderate Activity: Categories like Books & Magazines and Youthful Shoes &
Clothing have moderate counts (around 10,000–15,000), showing significant
but not overwhelming activity compared to the leader.
• Low Activity Categories: Categories like Dietary Supplements, Prints &
Photos, and Diversified electronics have much lower counts (under 10,000).
These are niche categories with fewer transactions.
• Category Variety: The top 10 categories represent a broad range of industries,
including electronics, apparel, outdoor gear, books, and general merchandise.
This indicates diverse customer interests.
This bar chart shows the top 10 merchant groups and the count of transactions or occurrences associated with each group. Here's a breakdown of the insights:
• Entertainment is by far the dominant category, with significantly more counts
(around 50,000) than the other categories. This suggests that consumers
engage with or spend more in this group.
• Clothing & Shoes follows as the second-highest group, though it's much lower
than Entertainment.
• The groups with the lowest counts are Jewelry & Accessories, Home & Garden,
Intangible Products, and Automotive Products.
• The distribution shows that spending or transaction volume is concentrated
heavily in Entertainment, with other categories having relatively smaller but
still notable volumes.
• Variables like sum_capital_paid_account_0_12m and num_active_tl also
show extreme right-skewness, where the majority of data points are
concentrated at lower values. Many variables, like num_tl_90g_dpd_24m,
num_actv_bc_tl, and max_bal_bc, have a significant concentration of values
near zero, indicating that for these variables, the majority of the data points
reflect minimal activity or involvement (e.g., low number of transactions or
minimal balance).
• In many histograms (e.g., recovery_label,
sum_capital_paid_account_0_12m), there are long tails indicating the
presence of outliers or extreme values. This implies that there are a few cases
where the values are much higher than the rest of the data.
This barplot compares the average account amount added in the last 12-24 months
for customers who defaulted (1) versus those who didn't (0). We can see that:
• Customers who defaulted (1) tend to have a higher average account amount
added compared to those who didn't default (0).
• This could suggest that customers who add larger amounts to their accounts
might be at a higher risk of default, possibly due to overextending their
financial capabilities.
This strip plot shows the distribution of the maximum paid invoice in the last 12
months for defaulted and non-defaulted customers. Observations:
• The distribution for non-defaulted customers (0) appears to be more
concentrated in the lower range, with some high-value outliers.
This violin plot displays the age distribution for defaulted and non-defaulted
customers.
• The age distributions are fairly similar for both groups.
• Both distributions are slightly right-skewed, with most customers between
25-45 years old.
• There's a slight indication that defaulted customers might be younger on
average, but the difference doesn't appear to be substantial.
Key Insights:
1. Highly Correlated Features:
• Features with a correlation coefficient close to 1 or -1 have a very strong
linear relationship, either positively or negatively correlated.
• For example, if max_paid_inv_0_12m and num_active_inv_0_12m show
high positive correlation, it implies that as the number of active invoices
increases, the maximum paid invoice also tends to increase.
• Similarly, features like acct_worst_status_12_24m might be strongly
correlated with acct_worst_status_6_12m, indicating a consistency in
worst account status over different periods.
2. Clusters of Features:
• Features that are highly correlated with each other may form "clusters." For
instance, all account status variables or payment-related features might be
grouped together, showing that they are related aspects of customer
behavior.
• Clustering often reveals related features that can be treated similarly in
model building or analysis, as they provide overlapping information.
3. Negative Correlations:
• Strong negative correlations (close to -1) indicate an inverse relationship.
For example, if default_status has a negative correlation with
max_paid_inv_0_12m, it means that customers with higher max paid
invoices are less likely to default.
• Similarly, a negative correlation between
acct_incoming_debt_vs_paid_0_24m and acct_days_in_rem_12_24m
might show that the more days a person remains in arrears, the less they
manage to reduce their outstanding debt.
4. Redundancy:
• Features that are almost perfectly correlated (near 1) may represent
redundant information. For example, if acct_worst_status_6_12m and
acct_worst_status_3_6m are highly correlated, it may be redundant to
include both in certain analyses. One of these features can potentially be
dropped in a model without losing valuable information.
5. Outliers in Correlation:
• If there are features that stand out with unexpectedly high or low
correlations compared to others, they may warrant deeper investigation.
These outliers could represent key insights into behavior or relationships
between variables that are not immediately obvious.
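As an illustration of how this correlation analysis can be reproduced, the sketch below computes the correlation matrix, draws a heatmap, and flags near-redundant feature pairs. The DataFrame name df, the file name, and the 0.9 threshold are illustrative assumptions, not values taken from the project.

    # Illustrative correlation check; df, the file name and the 0.9 threshold are assumed.
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("credit_default_data.csv")      # hypothetical file name

    corr = df.select_dtypes("number").corr()          # pairwise correlations of numeric features
    sns.heatmap(corr, cmap="coolwarm", center=0)      # visual overview of the correlation structure
    plt.show()

    # Flag near-redundant pairs above an illustrative |r| > 0.9 threshold
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if abs(corr.loc[a, b]) > 0.9:
                print(f"{a} vs {b}: {corr.loc[a, b]:.2f}")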
3. Data Cleaning and Pre-processing
3.1 Removal of unwanted variables
• Removed the userid variable.
• There are 615,512 missing values in total. The percentage of missing values in each variable was calculated, and the result is shown below.
• Columns with more than 25% missing values were dropped; the missing-value counts of the remaining columns are shown below (see the sketch after this list).
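A minimal sketch of this missing-value treatment, assuming the data is held in a pandas DataFrame named df (hypothetical name) and using the 25% threshold mentioned above:

    # Percentage of missing values per column, then drop columns above the 25% threshold
    missing_pct = df.isnull().sum() / len(df) * 100
    print(missing_pct.sort_values(ascending=False))

    cols_to_drop = missing_pct[missing_pct > 25].index
    df = df.drop(columns=cols_to_drop)
    print(df.isnull().sum())                          # missing values in the remaining columns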
3.4.2 After dropping columns at the 25% missing-value threshold
3.5.1 Outliers using box plot
3.6.1 One-hot encoding
• Imported the SVC class from sklearn.svm.
• The model was fitted to the training data set (a minimal sketch follows below).
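A minimal sketch of this step, assuming X_train, y_train and X_test are the train/test splits prepared earlier (the variable names are illustrative):

    from sklearn.svm import SVC

    svm_model = SVC(random_state=42)                  # default RBF kernel; settings are illustrative
    svm_model.fit(X_train, y_train)                   # fit on the training data
    y_pred_svm = svm_model.predict(X_test)            # predictions used in section 4.2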
4.2 Test your predictive model against the test set using
various appropriate performance metrics
• Imported a few evaluation utilities from sklearn.metrics: confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay, classification_report, and accuracy_score.
Random forest classifier:
• The classifier makes predictions on the test set, and the accuracy score is calculated. The accuracy achieved is 98.43% (see the sketch below).
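A hedged sketch of this evaluation flow, combining the metric imports listed in 4.2 with the Random Forest step (variable names such as X_train and y_test are assumed from the earlier split):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import (accuracy_score, classification_report,
                                 confusion_matrix, ConfusionMatrixDisplay)

    rfc = RandomForestClassifier(random_state=42)     # default settings; illustrative
    rfc.fit(X_train, y_train)
    y_pred_rfc = rfc.predict(X_test)

    print("Accuracy:", accuracy_score(y_test, y_pred_rfc))   # ~0.98 as reported above
    print(classification_report(y_test, y_pred_rfc))
    ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_rfc)).plot()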
2. However, it struggles with identifying defaulters (low precision, very low
recall, and low F1-score for class 1)
3. The high overall accuracy (98.42%) is misleading due to the class imbalance
4. The large difference between macro and weighted averages further highlights
the impact of class imbalance
1.2.5 Accuracy score – DTC
• Key points from ROC curve:
1. The ROC curve is close to the diagonal line, which represents random
performance. This further confirms that the classifier's performance is
not strong.
2. AUC of 0.57 suggests that the classifier has slightly better performance
than random guessing but is not very effective.
3. Overall, the Decision Tree Classifier in this case has limited
discriminative ability, as indicated by the low AUC score and the
shape of the ROC curve. Improvements might be needed, such as
tuning the model parameters or using a different classification
algorithm.
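The ROC curve and AUC discussed above can be produced along these lines; dtc is assumed to be the fitted Decision Tree Classifier from 4.1 (hypothetical variable name):

    from sklearn.metrics import roc_curve, roc_auc_score
    import matplotlib.pyplot as plt

    y_prob_dtc = dtc.predict_proba(X_test)[:, 1]      # predicted probability of default (class 1)
    fpr, tpr, _ = roc_curve(y_test, y_prob_dtc)
    print("AUC:", roc_auc_score(y_test, y_prob_dtc))  # reported as roughly 0.57

    plt.plot(fpr, tpr, label="Decision Tree")
    plt.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()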
• While the Naive Bayes Classifier does reasonably well at identifying Class 0
(with high true negatives), it performs poorly in identifying Class 1
(Defaulters), as seen by the low number of true positives and high false
negatives.
1.2.10 Confusion Matrix – NBC
1.2.12 ROC Curve – NBC
• From the classification report, the model performs very well in predicting non-
defaulters but completely fails to detect Defaulters. This could be due to
class imbalance.
1.2.15 Classification report – SVM
• An AUC of 0.5 indicates that the model performs no better than random
guessing, meaning it has no discriminative power to distinguish between
the classes.
• The Random Forest has a good accuracy (98.43%) and a relatively high
AUC (0.80), which indicates it performs well in distinguishing classes.
However, its precision (0.40) and recall (0.08) for the minority class (likely
Defaulters) are quite low, showing that it struggles with class imbalance.
• The Decision Tree model has a lower AUC (0.58), and precision, recall, and
F1-scores are also quite low. It struggles more compared to Random Forest
in separating the classes, and overall performance indicates that it might
need tuning.
• Naive Bayes has a lower accuracy (95.98%), and while its precision is low
(0.08), it has a relatively higher recall (0.16). The AUC score is similar to
Random Forest (0.80), but its low precision indicates that it struggles with
false positives.
• The SVM model has a very high precision (1.00) but a recall of 0, meaning
it does not detect any Defaulters at all. This results in an F1-score of 0 and a
low AUC (0.50), indicating it performs no better than random guessing.
• For models like Decision Tree, using boosting techniques (e.g., Gradient
Boosting, XGBoost) could improve performance by focusing on the
misclassified instances.
• So, ensemble methods and model tuning are needed for the more effective models.
4.4 Ensemble modelling:
Bagging Classifier using Decision Tree:
• Imported BaggingClassifier from sklearn.ensemble
• The ensemble makes predictions on the test set, and the accuracy score is calculated. The accuracy achieved is 98.41% (see the sketch below).
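A sketch of the bagging step under the same assumed variable names; the number of estimators is an illustrative choice, not the project's exact setting:

    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    bag_dtc = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=42)
    bag_dtc.fit(X_train, y_train)
    y_pred_bag = bag_dtc.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred_bag))   # ~0.984 as reported above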
3. The low recall (8.6%) for Defaulters means the model is missing
most of the Defaulters. This can be dangerous in scenarios where
detecting Defaulters is important.
2.1.5 Accuracy score – Ada boosting
• The classifier does a good job overall, with a relatively high AUC score.
• Although the classifier performs well in general, it may still fail to correctly
identify the minority class (Class 1) as shown by its low recall and F1-score
for that class.
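A sketch of the AdaBoost step referred to above (hyperparameters and variable names are illustrative assumptions):

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.metrics import accuracy_score, classification_report

    ada = AdaBoostClassifier(n_estimators=100, random_state=42)
    ada.fit(X_train, y_train)
    y_pred_ada = ada.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred_ada))
    print(classification_report(y_test, y_pred_ada))  # note the low recall/F1 for class 1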
2.1.8 ROC Curve – Ada boosting
• Since accuracy is misleading with imbalanced data, using metrics like F1-
score, precision-recall curve, or ROC-AUC may provide better insight into
model performance.
2.1.12 ROC Curve – gradient boosting
• After applying ensembling, below are the results of all the models and their performance.
• The score() method is used to evaluate the model (which was trained earlier using RandomizedSearchCV) on the test set X_test and y_test.
• It returns the accuracy of the model on the test set, which is stored in the variable accuracy (see the sketch below).
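A sketch of this tuning-and-evaluation flow; the parameter grid and variable names are assumptions rather than the project's exact settings:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    param_dist = {"n_estimators": [100, 200, 300],
                  "max_depth": [None, 5, 10, 20],
                  "min_samples_split": [2, 5, 10]}

    random_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                       param_distributions=param_dist,
                                       n_iter=10, cv=5, random_state=42)
    random_search.fit(X_train, y_train)

    accuracy = random_search.score(X_test, y_test)    # test-set accuracy of the best estimator
    print("Test accuracy:", accuracy)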
2.2.1 Accuracy score – Randomized search cv using RFC
• The best parameters for the Decision Tree model are displayed, along with a best accuracy of 0.99, meaning the model performed very well during cross-validation.
• It returns the accuracy of the model on the test set, which is stored in the variable accuracy; the value is 0.98 (see the sketch below).
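The reported values correspond to attributes of the fitted search object; dt_search below is a hypothetical name for the RandomizedSearchCV fitted on the Decision Tree:

    print("Best parameters:", dt_search.best_params_)          # tuned Decision Tree settings
    print("Best CV accuracy:", dt_search.best_score_)          # ~0.99 as reported above
    accuracy = dt_search.score(X_test, y_test)                 # ~0.98 as reported above
    print("Test accuracy:", accuracy)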
2.2.8 ROC Curve – Randomized search cv using DTC
2.2.12 ROC Curve – Randomized search cv using NB
2.2.16 ROC curve – Grid search cv using DTC
2.2.20 ROC Curve – Grid search cv using NB
• Balanced Performance: This model provides a good balance
between different metrics. While some models might excel in one
area but perform poorly in others, this model maintains high
scores across accuracy, precision, and AUC.
• Advantages of Random Forest: The base algorithm (Random Forest) is known for its robustness and ability to handle complex relationships in data. It is an ensemble method that combines multiple decision trees, which helps in reducing overfitting and improving generalization.
• Hyperparameter Optimization: The use of RandomizedSearchCV indicates that the model's hyperparameters have been optimized. This process helps in finding the best configuration of the Random Forest algorithm for this specific dataset, potentially improving its performance over a standard Random Forest.
• Feature Importance Visualization: see the sketch below for how the feature-importance plot of the tuned Random Forest is produced.
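A sketch of how such a feature-importance plot can be produced, assuming random_search.best_estimator_ is the tuned Random Forest and X_train is a pandas DataFrame (both names are assumptions):

    import pandas as pd
    import matplotlib.pyplot as plt

    importances = pd.Series(random_search.best_estimator_.feature_importances_,
                            index=X_train.columns).sort_values(ascending=False)
    importances.head(15).plot(kind="barh")            # top 15 most influential features
    plt.gca().invert_yaxis()                          # largest importance at the top
    plt.title("Random Forest feature importances")
    plt.show()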
5. Model Validation
To choose the final model, the important performance metrics need to be identified for the respective problem.
• Precision is critical to avoid incorrectly labeling non-defaulting customers as
defaults, which can lead to reputational damage and loss of trust.
• Recall is important to capture as many actual defaults as possible, but since the focus is on ensuring that defaults are accurately identified without mistakenly flagging non-defaults, precision takes precedence.
• For imbalanced datasets, the AUC-ROC is the most important metric. This is
because it evaluates the model's ability to distinguish between classes across
all possible thresholds, providing a comprehensive view of performance
regardless of class distribution.
So, the Randomized Search CV-tuned Random Forest classifier is chosen as the final model, as it offers a balance between the performance metrics, especially precision and AUC-ROC.
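The validation measures discussed above can be computed for the chosen model along these lines (random_search and the split names are assumptions carried over from the tuning step):

    from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

    best_rf = random_search.best_estimator_
    y_pred = best_rf.predict(X_test)
    y_prob = best_rf.predict_proba(X_test)[:, 1]

    print("Precision:", precision_score(y_test, y_pred))
    print("Recall   :", recall_score(y_test, y_pred))
    print("F1-score :", f1_score(y_test, y_pred))
    print("AUC-ROC  :", roc_auc_score(y_test, y_prob))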
• External Data Integration: The credit card company could consider
integrating external data sources, such as macroeconomic indicators, industry
trends, or customer financial information from other sources, to enrich the
analysis and gain a more comprehensive understanding of the factors
influencing default risk.
By implementing these recommendations, the banking organization can leverage
the model to improve its loan default prediction capabilities, make more
informed decisions, and ultimately enhance its risk management practices.