Trainity Project-6
lOMoARcPSD|30732588
Project-6: BANK LOAN CASE STUDY
Description:
This case study attempts to demonstrate the application of EDA in a real-world
business environment. In this case study, in addition to using the techniques learned
in the EDA module, you will gain a basic grasp of risk analytics in banking and
financial services, as well as how data is utilized to reduce the risk of losing money
when lending to consumers
Approach:
For this project, approach was to analyse the dataset, clean the dataset finding
the blanks and missing values, imputing the missing values with the appropriate
method (mean, median, mode). Then I tried to find the outliers in the dataset, there
are some anomalies such as negative values which need either to be deleted or
standardized. After all these I used pivot tables and basic charts to visualise the data.
Moreover, insights were drawn based on my understandings
Tech-Stack:
• MS Excel (2023)
• Dataset provided (Bank Dataset)
Insights:
The dataset contains 3 files:
1. application_data.csv: contains all the information of the client at the time of
application. The data is about whether a client has payment difficulties
lOMoARcPSD|30732588
2. previous_application.csv: contains information about the client’s previous
loan data. It contains the data whether the previous application had been
Approved, Cancelled, Refused or Unused offer.
3. columns_descrption.csv: It is data dictionary which describes the meaning of
the variables
Both sets of data contained many undesired columns that will not be used for risk
analytics, as well as many blanks. So, I cleaned up the data.
Following the data cleaning procedure, I split columns in the dataset based on
two categories of variables. 1) Categorical variables
2) Numerical variables
Categorical variables (non-numerical variables)- person's occupation, education
status.
Numerical variables - income, credit etc.,
The following are some of the categorical and numerical variables from the
provided data set.
Categorical variables Numeric variables
Gender Age
Name contract type Days employed
Income type Amount Income
Education Amount Annuity
Housing type Amount Credit
I completed full EDA on the present application and then on the previous
application. Then, in this report, I summarized the results of both
applications and provided business insights.
lOMoARcPSD|30732588
Current application.csv
Task 2 (Find Missing Data):
Importing the dataset in excel :
lOMoARcPSD|30732588
lOMoARcPSD|30732588
Imputing the missing values using mean , median and mode
lOMoARcPSD|30732588
3.Outliers can only be identified on Numeric
Box plotted Target column vs
1) Amount credit
2) Amount Income
3) Amount Annuity
lOMoARcPSD|30732588
Data imbalance: Data imbalance occurs when data is disseminated in an
unequal manner. I plotted data imbalance using Pivot charts.
NAME CONTRACT TYPE
lOMoARcPSD|30732588
Task 5 (EDA):
Univariate Analysis:
INFERENCE
Individuals with higher incomes are less likely to apply for loans. The credit
amount of a bank loan is typically in the range of 45000 to 1045000. The majority
of loan applications have come from people between the ages of 35 and 50. Those
with 0 to 8 years of work experience are the most likely to seek for loans.
Individuals who own homes are more likely to apply for loans than others. Those
who are married have taken out more loans. More loans have been requested by
working people. Unaccompanied minors have requested for extra loans.
Amount Income
lOMoARcPSD|30732588
Amount Credit
AGE
lOMoARcPSD|30732588
Name suite type
Bivariate Analysis:
INFERENCE
Customers who live in low-rating areas will have higher defaults.
Individuals with lower incomes are more likely to default. Young people are
more likely to default, and the trend of defaulters declines with age. Ladies
are less inclined than males to have defaults. More defaults are predicted due
to maternity leave and unemployment. Customers with more than five family
members are more likely to default on their bank loan. Customers with fewer
educational qualifications are more likely to fail on a bank loan. Customers
with hardly work experience are more likely to have defaults.
Region Rating Client vs Target Amount Income vs Target
lOMoARcPSD|30732588
Age vs Target Gender vs Target
INCOME TYPE VS TARGET FAMILY MEMBER VS
TARGET
EDUCATION TYPE VS TARGET MONTHS EMPLOYED
VSTARGET
lOMoARcPSD|30732588
Task 6 (Finding top 10 correlations):
Top 10 driving factors in current application.csv
1. Income type
2. Count of Family Members
3. Children count
4. External source
5. Region rating of client
6. Age
7. Months Employed
8. Amount credit
9. Amount Goods Price
10. Amount total income
lOMoARcPSD|30732588
Earlier Application.csv
Task 2 (Data Cleaning):
Column removal: I used the COUNTBLANK function to determine the number of
blanks in a column, and if it exceeded 5%, I eliminated it. I removed a couple
columns that were of no use to the analysis
There are 1670214 rows in the dataset where as Excel has a Max limit of 1048576
rows and as per the project requirement we are supposed to use only Excel for
Analysis. Hence we’d be limited to the use of 1048576 rows
Task 3 (Finding Outliers):
lOMoARcPSD|30732588
Task 4(Data Imbalance):
Below are the columns where data is unevenly distributed
lOMoARcPSD|30732588
Task 5 (EDA):
Univariate Analysis:
Inference
Customers have largely chosen cash and consumer loans. The majority of our
clients are repeat customers.
The majority of current loan applicants are individuals who applied for loans
less than ten months ago. More loans have been requested for consumer
gadgets.
lOMoARcPSD|30732588
Bivariate Analysis:
Inference
Customers who applied for more than Rs. 350,000 will most likely be denied. The
majority of loans sought for through Credit and Cash agencies are cancelled.New
clients are overjoyed because the majority of their loans were approved. Thus far,
car loans have been denied. Loans made to MLM partner clients are likely to be
cancelled. Virtually 80% of the loans were authorised, with a steady stream of
rejections.Consumer loans have nearly no cancellations and the greatest approval
rate. Several loans for the first Selling place area group were cancelled.Clients who
apply for another loan within 10 months of their previous loan are more likely to
have it cancelled. Walk-in loans have a higher refusal rate
lOMoARcPSD|30732588
Task 6 (Finding Correlations): Top ten reasons for loan cancellation and refusal
lOMoARcPSD|30732588
1. Amount Application
2. Cash loan Purpose
3. Goods Category
4. Product Combination
5. Product type
6. Channel type
7. Months Decision
8. Contract type
9. Client type
10. Payment type
lOMoARcPSD|30732588
Task 7 (Combining two sheets): I then ran analysis on the common set of
data by joining the Target column with the previous application table. I used
MySQL to join them. I loaded the data into workbench and ran the following
query.
Query:
SELECT TARGET,
SK_ID_CURR,
NAME_CONTRACT_TYPE,
AMT_APPLICATION,
NAME_CASH_LOAN_PURPOSE,
NAME_CONTRACT_STATUS,
NAME_CLIENT_TYPE, DAYS_DECISION,
CODE_REJECT_REASON,
NAME_SELLER_INDUSTRY,
NAME_PORTFOLIO,
NAME_PRODUCT_TYPE,
CHANNEL_TYPE, SELLERPLACE_AREA,
NAME_YIELD_GROUP,
PRODUCT_COMBINATION
FROM application_data
JOIN previous_application ON SK_ID_CURR;
pivot table analysis
lOMoARcPSD|30732588
Clients who have applied for previous loans have no defaults in current loans
Excel file:
https://docs.google.com/spreadsheets/d/1HqSNv0NdM7yg0Q0uC0IL7x_Ds
Xt6JgLD/edit?usp=drive_link&ouid=115404029938861642621&rtpof=tru
e&sd=true
RESULT:
This project involved extensive use of Excel. The major challenge was working with
such huge data. This project helped me understand how to work with huge datasets.
This helped me understand how 2 datasets are merged to analyze the details. The
dataset involved a lot of missing data and outliers, handling them was a task and this
project helped me understand what to how and why of handling the outliers and Null
values. The project also helped me discover new add-ins such as data analyze.