0% found this document useful (0 votes)
68 views21 pages

Bank Loan Case Study

The Bank Loan Case Study demonstrates the application of Exploratory Data Analysis (EDA) in banking risk analytics, focusing on data cleaning, outlier detection, and visualization using MS Excel. The analysis involved two datasets: current and previous loan applications, revealing insights on loan defaults and factors influencing loan approval. Key findings include the impact of income, age, and education on loan applications and defaults, as well as the challenges of handling large datasets with missing values and outliers.

Uploaded by

Nitika
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
68 views21 pages

Bank Loan Case Study

The Bank Loan Case Study demonstrates the application of Exploratory Data Analysis (EDA) in banking risk analytics, focusing on data cleaning, outlier detection, and visualization using MS Excel. The analysis involved two datasets: current and previous loan applications, revealing insights on loan defaults and factors influencing loan approval. Key findings include the impact of income, age, and education on loan applications and defaults, as well as the challenges of handling large datasets with missing values and outliers.

Uploaded by

Nitika
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 21

Trainity Project-6

lOMoARcPSD|30732588

Project-6: BANK LOAN CASE STUDY

Description:
This case study attempts to demonstrate the application of EDA in a real-world
business environment. In this case study, in addition to using the techniques learned
in the EDA module, you will gain a basic grasp of risk analytics in banking and
financial services, as well as how data is utilized to reduce the risk of losing money
when lending to consumers

Approach:
For this project, approach was to analyse the dataset, clean the dataset finding
the blanks and missing values, imputing the missing values with the appropriate
method (mean, median, mode). Then I tried to find the outliers in the dataset, there
are some anomalies such as negative values which need either to be deleted or
standardized. After all these I used pivot tables and basic charts to visualise the data.
Moreover, insights were drawn based on my understandings

Tech-Stack:
• MS Excel (2023)
• Dataset provided (Bank Dataset)
Insights:
The dataset contains 3 files:
1. application_data.csv: contains all the information of the client at the time of
application. The data is about whether a client has payment difficulties
lOMoARcPSD|30732588

2. previous_application.csv: contains information about the client’s previous


loan data. It contains the data whether the previous application had been
Approved, Cancelled, Refused or Unused offer.

3. columns_descrption.csv: It is data dictionary which describes the meaning of


the variables

Both sets of data contained many undesired columns that will not be used for risk
analytics, as well as many blanks. So, I cleaned up the data.
Following the data cleaning procedure, I split columns in the dataset based on
two categories of variables. 1) Categorical variables
2) Numerical variables
Categorical variables (non-numerical variables)- person's occupation, education
status.
Numerical variables - income, credit etc.,
The following are some of the categorical and numerical variables from the
provided data set.

Categorical variables Numeric variables


Gender Age
Name contract type Days employed
Income type Amount Income
Education Amount Annuity
Housing type Amount Credit

I completed full EDA on the present application and then on the previous
application. Then, in this report, I summarized the results of both
applications and provided business insights.
lOMoARcPSD|30732588

Current application.csv

Task 2 (Find Missing Data):

Importing the dataset in excel :


lOMoARcPSD|30732588
lOMoARcPSD|30732588

Imputing the missing values using mean , median and mode


lOMoARcPSD|30732588

3.Outliers can only be identified on Numeric

Box plotted Target column vs

1) Amount credit
2) Amount Income
3) Amount Annuity
lOMoARcPSD|30732588

Data imbalance: Data imbalance occurs when data is disseminated in an


unequal manner. I plotted data imbalance using Pivot charts.

NAME CONTRACT TYPE


lOMoARcPSD|30732588

Task 5 (EDA):
Univariate Analysis:

INFERENCE
Individuals with higher incomes are less likely to apply for loans. The credit
amount of a bank loan is typically in the range of 45000 to 1045000. The majority
of loan applications have come from people between the ages of 35 and 50. Those
with 0 to 8 years of work experience are the most likely to seek for loans.
Individuals who own homes are more likely to apply for loans than others. Those
who are married have taken out more loans. More loans have been requested by
working people. Unaccompanied minors have requested for extra loans.

Amount Income
lOMoARcPSD|30732588

Amount Credit

AGE
lOMoARcPSD|30732588

Name suite type

Bivariate Analysis:
INFERENCE
Customers who live in low-rating areas will have higher defaults.
Individuals with lower incomes are more likely to default. Young people are
more likely to default, and the trend of defaulters declines with age. Ladies
are less inclined than males to have defaults. More defaults are predicted due
to maternity leave and unemployment. Customers with more than five family
members are more likely to default on their bank loan. Customers with fewer
educational qualifications are more likely to fail on a bank loan. Customers
with hardly work experience are more likely to have defaults.

Region Rating Client vs Target Amount Income vs Target


lOMoARcPSD|30732588

Age vs Target Gender vs Target

INCOME TYPE VS TARGET FAMILY MEMBER VS


TARGET

EDUCATION TYPE VS TARGET MONTHS EMPLOYED


VSTARGET
lOMoARcPSD|30732588

Task 6 (Finding top 10 correlations):


Top 10 driving factors in current application.csv

1. Income type
2. Count of Family Members
3. Children count
4. External source
5. Region rating of client

6. Age
7. Months Employed
8. Amount credit
9. Amount Goods Price
10. Amount total income
lOMoARcPSD|30732588

Earlier Application.csv

Task 2 (Data Cleaning):


Column removal: I used the COUNTBLANK function to determine the number of
blanks in a column, and if it exceeded 5%, I eliminated it. I removed a couple
columns that were of no use to the analysis
There are 1670214 rows in the dataset where as Excel has a Max limit of 1048576
rows and as per the project requirement we are supposed to use only Excel for
Analysis. Hence we’d be limited to the use of 1048576 rows

Task 3 (Finding Outliers):


lOMoARcPSD|30732588

Task 4(Data Imbalance):


Below are the columns where data is unevenly distributed
lOMoARcPSD|30732588

Task 5 (EDA):
Univariate Analysis:

Inference

Customers have largely chosen cash and consumer loans. The majority of our
clients are repeat customers.

The majority of current loan applicants are individuals who applied for loans
less than ten months ago. More loans have been requested for consumer
gadgets.
lOMoARcPSD|30732588

Bivariate Analysis:
Inference
Customers who applied for more than Rs. 350,000 will most likely be denied. The
majority of loans sought for through Credit and Cash agencies are cancelled.New
clients are overjoyed because the majority of their loans were approved. Thus far,
car loans have been denied. Loans made to MLM partner clients are likely to be
cancelled. Virtually 80% of the loans were authorised, with a steady stream of
rejections.Consumer loans have nearly no cancellations and the greatest approval
rate. Several loans for the first Selling place area group were cancelled.Clients who
apply for another loan within 10 months of their previous loan are more likely to
have it cancelled. Walk-in loans have a higher refusal rate
lOMoARcPSD|30732588

Task 6 (Finding Correlations): Top ten reasons for loan cancellation and refusal
lOMoARcPSD|30732588

1. Amount Application
2. Cash loan Purpose
3. Goods Category
4. Product Combination
5. Product type
6. Channel type
7. Months Decision
8. Contract type
9. Client type
10. Payment type
lOMoARcPSD|30732588

Task 7 (Combining two sheets): I then ran analysis on the common set of
data by joining the Target column with the previous application table. I used
MySQL to join them. I loaded the data into workbench and ran the following
query.

Query:
SELECT TARGET,
SK_ID_CURR,
NAME_CONTRACT_TYPE,
AMT_APPLICATION,
NAME_CASH_LOAN_PURPOSE,
NAME_CONTRACT_STATUS,
NAME_CLIENT_TYPE, DAYS_DECISION,
CODE_REJECT_REASON,
NAME_SELLER_INDUSTRY,
NAME_PORTFOLIO,
NAME_PRODUCT_TYPE,
CHANNEL_TYPE, SELLERPLACE_AREA,
NAME_YIELD_GROUP,
PRODUCT_COMBINATION
FROM application_data
JOIN previous_application ON SK_ID_CURR;

pivot table analysis


lOMoARcPSD|30732588

Clients who have applied for previous loans have no defaults in current loans

Excel file:
https://docs.google.com/spreadsheets/d/1HqSNv0NdM7yg0Q0uC0IL7x_Ds
Xt6JgLD/edit?usp=drive_link&ouid=115404029938861642621&rtpof=tru
e&sd=true

RESULT:

This project involved extensive use of Excel. The major challenge was working with
such huge data. This project helped me understand how to work with huge datasets.
This helped me understand how 2 datasets are merged to analyze the details. The
dataset involved a lot of missing data and outliers, handling them was a task and this
project helped me understand what to how and why of handling the outliers and Null
values. The project also helped me discover new add-ins such as data analyze.

You might also like