Loan Approval
FIELD REPORT
ON
A STUDY OF CUSTOMER LOAN PREDICTION ANALYSIS
Submitted in Partial Fulfillment of the Requirements for the Degree
OF
Master of Business Administration
Finance
Submitted By
Mr. Tushar Ramnath Sandhan MBA
(Finance)
Under the guidance of
Prof. Wagh
Submitted Through
I hereby declare that this Project Report titled “Customer Loan Prediction
Analysis” submitted by me is based on actual work carried out under the guidance
and supervision of Dr. Aarti More. Any reference to work done by any other person or
institution, or any material obtained from other sources, has been duly cited and
referenced. I further state that this work has not been submitted anywhere else for any
examination.
Date:
Place:
I would like to thank the Almighty for the constant grace showered on me and
the ever-increasing gift of knowledge and strength that has prevailed in
my life throughout the entire project work.
It was an honor and a privilege for me to undertake my field project at Shree
Umiya technopets.pvt.ltd. I would not have completed my project
without their immense help and cooperation.
I offer my sincere thanks to Dr. Aarti More for the guidance that made
this project materialize. Finally, I am also thankful to my parents and friends
for their encouragement and support.
Date:
Place:
Signature of the student: (Tushar Ramnath Sandhan)
Certificate of Organisation
EXECUTIVE SUMMARY
This study explores the implementation and impact of predictive analytics on
strategic decision-making within Shree Umiya technopets.pvt.ltd , a dynamic
technology-driven company based in Nashik. As the organization continues to
scale and diversify its service offerings, decision-makers face increasing
complexity in aligning operations with market demands. Predictive analytics
emerges as a transformative tool, offering data-driven foresight into customer
behavior, operational efficiency, and market trends. The study investigates how
predictive models can enhance strategic planning, risk mitigation, and resource
optimization. To conduct this research, both qualitative and quantitative methods
were employed. Primary data was gathered through structured interviews with key
stakeholders, including managers, analysts, and IT personnel. Secondary data was
sourced from internal company reports, academic journals, and industry white
papers. Data analysis focused on identifying areas within the organization where
predictive analytics could create value—such as customer segmentation, sales
forecasting, supply chain management, and employee productivity. The findings
revealed a strong correlation between data-driven insights and improved decision
quality, agility, and ROI. The study found that Shree Umiya
technopets.pvt.ltd has already adopted certain analytical tools, but lacks a cohesive framework to
fully leverage predictive capabilities across departments. A major recommendation
is the establishment of a centralized data analytics team responsible for model
development, data governance, and continuous learning. Additionally, strategic
investments in AI-driven software platforms and staff training programs would
enable Shree Umiya technopets.pvt.ltd to cultivate a culture of evidence-based
decision-making. Case studies within the report demonstrate the potential for
predictive analytics to reduce operational costs, enhance customer satisfaction, and
support long-term innovation. This study concludes that predictive analytics is not
merely a technical upgrade but a strategic imperative. By integrating predictive
models into core business functions, Shree Umiya technopets.pvt.ltd can
transition from reactive to proactive management. This approach equips
the company to anticipate market shifts, personalize offerings, and maintain a
competitive edge in a rapidly evolving technological landscape.
ABSTRACT
In our banking system, banks have many products to sell, but the main source of income of
any bank is its credit line: banks earn from the interest on the loans they
extend. Loan approval is therefore a very important process for banking organizations. The banking
industry always needs more accurate predictive modeling systems for many issues, and
predicting credit defaulters is a difficult task for the banking industry.
A bank's profit or loss depends to a large extent on loans, i.e. on whether customers are
paying back their loans or defaulting. By predicting loan defaulters, the bank can reduce
its Non-Performing Assets. This makes the study of this phenomenon very important.
Previous research in this area has shown that there are many methods to study the
problem of controlling loan default. But because accurate predictions are very important for the
maximization of profits, it is essential to study the nature of the different methods and
compare them. An important approach in predictive analytics is used here to study the
problem of predicting loan defaulters: the logistic regression model.
Logistic regression models were fitted and different performance measures
were computed. The models are compared on the basis of performance
measures such as sensitivity and specificity. The final results show that the models
produce different results. The model that includes variables other than checking account
information (which shows the wealth of a customer), namely personal attributes of the
customer such as age, purpose, credit history, credit amount and credit duration,
is marginally better, because these variables should
be taken into account to calculate the probability of default on a loan correctly. Therefore, by
using a logistic regression approach, the right customers to be targeted for granting loans
can be identified by evaluating their likelihood of default. The study
concludes that a bank should not only target rich customers for granting loans but
should also assess the other attributes of a customer, which play a very important part in
credit granting decisions and in predicting loan defaulters.
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
1. INTRODUCTION
1.1 INTRODUCTION TO LOANS
1.2 LOAN APPROVAL AUTOMATION PROCESS
1.3 RESEARCH AND SIGNIFICANCE
REFERENCES
APPENDIX
A. PLAGIARISM REPORT
B. JOURNAL PAPER
C. SOURCE CODE
LIST OF FIGURES
1.1 INTRODUCTION TO LOANS
In most developing countries, the banking sector continues to play a leading role in
promoting economic growth, in spite of efforts to boost financial markets,
which have remained underdeveloped. Credit risk is among the most significant risks to which
banks are exposed, and this has led to the proliferation of risk management methods.
Distributing loans is the core business of almost every bank. The main
portion of a bank’s assets comes directly from the profit earned on the loans it
distributes. When credit is granted, regardless of the expected gain, it is exposed to the
uncertainty of default. The credit institution cannot always be sure of recovering its funds and is
exposed to counterparty risk.
The aim of the banking domain is to place its assets in safe hands. Lending money to unsuitable
loan applicants results in credit risk. Today many banks approve loans after a long
verification procedure, yet there is no guarantee that the chosen applicant is the right
one. Estimating the risk involved in a loan application is one of the
most significant concerns of banks seeking to survive in a highly competitive market.
Through our proposed model we can predict whether a specific customer is safe or not.
Loan prediction is very helpful for bank employees as well as for
applicants. The aim of this paper is to provide a quick and easy way
to choose the deserving applicants.
For banks, the problem is not only to decide whether to grant a loan
but also to predict the probability of default of a borrower under a credit
agreement. The task is to anticipate, for a given period, the quality of the borrower (good or
bad) and the borrower's ability to repay the debt.
Credit scoring is a widely used technique that helps financial institutions evaluate
the likelihood that a credit applicant will default on a financial obligation and decide
whether to grant credit. In this study the lowest credit score is coded as ‘zero’ and the highest
credit score as ‘one’. The factor most highly correlated with the loan status of a
customer is the credit score. This is inferred from the feature extraction and correlation
matrix of the customers’ data.
Credit scoring is a statistical method for estimating the probability of default of the
borrower, using historical and statistical data to reach a single indicator that can
distinguish good borrowers from bad borrowers. The score function is based on a
financial analysis of financial ratios, summarized as a single indicator that can distinguish
between healthy and failing companies. We can obtain it by implementing a logistic
regression model.
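As a hedged illustration of how a logistic regression score can be read as a probability, the standard form of the model is written below. The symbols are notation assumed here for illustration and are not taken from the study's data: x_1, ..., x_k stand for borrower attributes, beta_0, ..., beta_k for fitted coefficients, and y for the binary outcome.

```latex
P(y = 1 \mid x_1, \dots, x_k)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
```

The value always lies strictly between 0 and 1, so a cut-off (for example 0.5) can be used to separate the two classes of borrowers.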
The increased number of potential applicants has driven the development of
sophisticated techniques that automate the credit approval procedure and monitor the
financial health of the borrower. The large volume of loan portfolios also implies that
modest improvements in scoring accuracy may result in significant savings for financial
institutions (West, 2000). The goal of a credit scoring model is to classify credit applicants
into two classes: the “good credit” class that is likely to repay
the financial obligation and the “bad credit” class that should be denied credit due to the
high probability of defaulting on the financial obligation. The classification is contingent
on sociodemographic characteristics of the borrower (such as age, education level,
occupation and income).
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets
to summarize their main characteristics, often with visual methods. A statistical model can
be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal
modeling or hypothesis testing task.
Loan approval automation refers to the integration of advanced technologies—such as
artificial intelligence (AI), machine learning (ML), big data analytics, and robotic process
automation (RPA)—into the loan application and evaluation lifecycle. It replaces or
significantly enhances the manual processes traditionally used by financial institutions to
assess a borrower's eligibility and creditworthiness.
In a manual process, loan officers typically review a series of documents including
proof of identity, income statements, credit scores, and employment details. This process is
time-consuming, often inconsistent due to subjective judgment, and vulnerable to human
errors and bias. Moreover, as the volume of applications increases, the traditional method
becomes less scalable and more expensive to manage.
Automation, on the other hand, enables real-time data collection, instant validation,
and quick decision-making. AI models can be trained on historical loan data to predict the
risk of default and the likelihood of repayment, allowing for data-driven decisions that are
faster and more accurate. Machine learning continuously improves the quality of
predictions as more data is fed into the system.
Moreover, automation contributes to a more seamless and transparent customer
experience. Applicants can complete the process online without visiting a branch, and
receive near-instant feedback on their application status. This not only boosts customer
satisfaction but also enhances competitiveness for financial institutions in an increasingly
digital landscape.
In essence, loan approval automation modernizes the lending process by making it
faster, more reliable, scalable, and customer-centric, while helping institutions manage risk
more effectively and comply with regulatory standards.
probabilities and classify the use of different types of data and easily determines the most
effective variables that are used for classification. Regression is the task of forecasting the
value of a continuous variable (e.g. a price, a temperature) given some input
variables. Applicants with the lowest credit scores tend to be offered the highest interest rates on a
loan. The credit score protects lenders from trusting unreliable borrowers and
plays a major role in the customer's information. Before the FICO Score was
created, the money lending process was fairly subjective, and potential
borrowers were often judged by how trustworthy their character seemed.
Past machine learning models have achieved the highest accuracy but with the lowest
precision score. The precision score is the number of true positives divided by the total number of
true positives and false positives. A low precision score means that the model generates a large number of
false positives. Generating more false positives than
false negatives decreases the model's exactness and also increases the risk factor.
geography.
Automation:
Uses data-driven algorithms and AI/ML models that follow consistent rules.
Ensures accurate data handling and objective assessment of creditworthiness.
Logs decisions for auditing and compliance, enhancing transparency.
Benefit: Improves fairness and reliability of the approval process.
Components of an Automated Loan Approval System
1. User Interface
Purpose: The front-end platform that borrowers interact with.
Types:
Web-based online portal
Mobile applications (Android/iOS)
Key Features:
Application form for entering personal, financial, and employment details
Document upload functionality
Real-time feedback (e.g., missing fields, eligibility hints)
Secure login/authentication (e.g., OTP, biometrics)
Progress tracking and status updates
Typical Documents:
Government ID, salary slips, tax returns, bank statements, utility bills
6. Decision Engine
Purpose: Finalizes loan decisions by combining ML model outputs with business logic.
Business Rules Applied Might Include:
Minimum income threshold
Maximum debt-to-income ratio
Manual review trigger conditions (e.g., missing documents, conflicting data)
Actions:
Approve
Reject
Escalate to manual underwriter for review
7. Notification System
Purpose: Communicates status and next steps to applicants.
Channels:
Email
SMS
Push notifications via app
Messages Include:
Application received
Additional documents required
Approved/Rejected decision
Loan disbursal confirmation
Features:
Templates with personalization (e.g., name, application ID)
Event-triggered (e.g., send SMS when status changes)
8. Integration API
Purpose: Connects the loan system with external third-party services.
Typical Integrations:
Credit Bureaus: Fetch credit scores and reports
KYC Providers: Verify identity documents, facial recognition
Income Verification: Connect with payroll APIs, bank aggregators
Fraud Detection: Integrate with fraud scoring tools
Advantages:
Seamless data exchange
Reduced manual verification
Ensures compliance with legal and regulatory requirements
2. Data Validation
Once submitted, the system automatically validates the input data for:
Completeness: All required fields are filled out.
Accuracy: Formats are correct (e.g., valid email address, ID number).
Internal Consistency: Values make sense in relation to one another (e.g., income
level vs. loan amount).
If issues are detected, the system can prompt the applicant to correct them immediately.
Benefit: Ensures clean, structured data for downstream processes.
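The three validation checks described above (completeness, accuracy and internal consistency) can be sketched as follows. The field names, the e-mail pattern and the income multiple are assumptions made for illustration; a real system would take them from its own application form and lending policy.

```python
import re

# Hypothetical field names and limits, chosen only for this sketch.
REQUIRED_FIELDS = ("name", "email", "monthly_income", "loan_amount")
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
MAX_LOAN_TO_ANNUAL_INCOME = 10  # assumed plausibility limit


def validate_application(app):
    """Return a list of validation errors; an empty list means the data is clean."""
    errors = []

    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if app.get(field) in (None, ""):
            errors.append(f"missing field: {field}")

    # Accuracy: the e-mail address must have a valid format.
    email = app.get("email")
    if email and not EMAIL_PATTERN.match(email):
        errors.append("invalid email format")

    # Internal consistency: the loan amount must be plausible for the income.
    income = app.get("monthly_income")
    amount = app.get("loan_amount")
    if isinstance(income, (int, float)) and isinstance(amount, (int, float)):
        if income <= 0:
            errors.append("monthly income must be positive")
        elif amount > MAX_LOAN_TO_ANNUAL_INCOME * 12 * income:
            errors.append("loan amount is implausibly high for the stated income")

    return errors


good = {"name": "A. Kumar", "email": "a.kumar@example.com",
        "monthly_income": 50000, "loan_amount": 300000}
bad = {"name": "", "email": "not-an-email",
       "monthly_income": 1000, "loan_amount": 5000000}

print(validate_application(good))
print(validate_application(bad))
```

Returning the full list of errors, rather than stopping at the first one, lets the system prompt the applicant to correct everything in a single step.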
4. Credit Assessment
The system integrates with external credit bureaus (e.g., Experian, Equifax, TransUnion) to
fetch the applicant's credit report.
Credit history, payment defaults, credit utilization, and number of inquiries are
evaluated.
Credit score is used as a base input for assessing creditworthiness.
Benefit: Provides a standardized view of financial reliability.
5. ML Risk Scoring
A machine learning (ML) model analyzes all collected data to calculate a risk score:
Inputs may include income, age, employment stability, existing debt, spending
behavior, etc.
Models are trained on historical data to predict the likelihood of default or
delinquency.
This risk score is a critical factor in the loan decision.
Benefit: Enables data-driven risk assessment, adapting over time to changing patterns.
6. Decision Making
A rule-based decision engine applies business policies and thresholds to:
Approve low-risk applications automatically.
Reject high-risk ones.
Escalate borderline or complex cases to human underwriters.
Rules might include:
Minimum credit score requirements
Maximum debt-to-income ratios
Loan amount caps based on income
Benefit: Ensures consistent and auditable decisions.
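A rule-based decision step of the kind described above could look like the sketch below. Every threshold, the `Application` fields and the meaning of `risk_score` (assumed to be a model output between 0 and 1, higher meaning riskier) are hypothetical and serve only to show how business rules and a model score can be combined.

```python
from dataclasses import dataclass

# All thresholds below are hypothetical values chosen for illustration.
MIN_CREDIT_SCORE = 650
MAX_DEBT_TO_INCOME = 0.40
MAX_LOAN_TO_MONTHLY_INCOME = 20
AUTO_APPROVE_RISK = 0.20
AUTO_REJECT_RISK = 0.60


@dataclass
class Application:
    credit_score: int
    monthly_income: float
    monthly_debt: float
    loan_amount: float
    risk_score: float  # assumed model output in [0, 1]; higher means riskier


def decide(app):
    """Return (decision, reasons) so that every outcome can be logged and audited."""
    if app.monthly_income <= 0:
        return "REJECT", ["no verifiable income"]

    reasons = []
    if app.credit_score < MIN_CREDIT_SCORE:
        reasons.append("credit score below minimum")
    if app.monthly_debt / app.monthly_income > MAX_DEBT_TO_INCOME:
        reasons.append("debt-to-income ratio too high")
    if app.loan_amount > MAX_LOAN_TO_MONTHLY_INCOME * app.monthly_income:
        reasons.append("loan amount exceeds income-based cap")
    if app.risk_score >= AUTO_REJECT_RISK:
        reasons.append("model risk score too high")

    if reasons:
        return "REJECT", reasons
    if app.risk_score <= AUTO_APPROVE_RISK:
        return "APPROVE", reasons
    # Passed every hard rule but the model is not confident: send to a human.
    return "ESCALATE", ["borderline risk score"]


print(decide(Application(720, 60000, 12000, 500000, 0.10)))
print(decide(Application(720, 60000, 12000, 500000, 0.35)))
print(decide(Application(600, 60000, 12000, 500000, 0.10)))
```

Keeping the reasons alongside the decision supports the audit trail and gives the notification step something concrete to report to a rejected applicant.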
7. Communication
Once a decision is made:
The system automatically sends notifications to the applicant via email, SMS, or app
alerts.
If rejected, the message may include a brief explanation or next steps.
If approved, the applicant may be prompted to review and accept the terms.
Benefit: Keeps the customer informed in real-time, reducing uncertainty.
8. Disbursement
For approved applications:
The system triggers the fund disbursement workflow.
Funds are transferred directly to the applicant’s bank account or digital wallet.
A digital copy of the loan agreement is shared with the applicant.
Benefit: Enables same-day or instant loan funding, which improves customer satisfaction.
1. Faster Turnaround Time for Loan Approvals
Explanation: In a traditional loan approval process, there can be delays due to manual
reviews, document verification, and waiting for inputs from various departments.
With automation, data integration, and AI-driven systems, loan applications can be
processed much faster.
Impact:
o Efficiency Gains: Automated systems can quickly validate data, assess
eligibility, and make decisions without human intervention. For example, AI
can instantly assess a borrower's creditworthiness based on a variety of
real-time data points.
o Reduced Waiting Times: With faster decision-making, applicants get their
loan approval (or rejection) much quicker, which improves their experience.
o Competitive Advantage: A faster approval time can be a strong differentiator
for banks and lending institutions in a crowded market, attracting more
customers who need quick financial solutions.
2. Reduced Operational Costs
Explanation: Manual processes, paper handling, and a high volume of human
intervention contribute significantly to operational costs. By leveraging technology,
financial institutions can streamline operations, reduce errors, and cut back on labor
and administrative costs.
Impact:
o Automation: Tasks like data entry, document processing, and compliance
checks can be automated. This not only saves time but also reduces the need
for large administrative teams.
o Resource Allocation: By automating routine tasks, employees can focus on
higher-value activities, such as customer service or strategic decision-making.
o Cost Reduction: With fewer staff needed for manual tasks and fewer errors to
correct, operational costs are dramatically reduced.
3. Enhanced Decision Accuracy
Explanation: Automated systems, particularly those based on artificial intelligence
and machine learning, rely on data-driven algorithms to make decisions. These
systems can analyze large amounts of data with great accuracy, helping to make
decisions based on objective factors rather than subjective judgment.
Impact:
o Consistency: Automation ensures that every loan is evaluated based on the
same criteria, reducing human bias and subjective decision-making.
o Data-driven Insights: AI models can evaluate complex patterns in a
borrower’s financial behavior, improving the accuracy of predicting their
likelihood of repaying the loan.
o Risk Reduction: More accurate decision-making helps minimize defaults and
financial risks, improving the overall stability of the financial institution.
4. Higher Customer Satisfaction
Explanation: Customers expect a smooth, fast, and transparent process when
applying for loans. A seamless digital experience that delivers fast decisions and
clear communication significantly boosts customer satisfaction.
Impact:
o Convenience: Online loan applications and automated approvals provide
customers with a hassle-free experience, where they can track the status of
their application in real time.
o Transparency: Clear and immediate feedback on loan decisions helps to
manage customer expectations and reduces frustration.
o Personalization: AI-driven systems can offer tailored loan products or flexible
repayment options based on the customer's financial situation, leading to
higher satisfaction and loyalty.
5. Improved Compliance and Audit Readiness
Explanation: Compliance with financial regulations and audit readiness are essential
for avoiding legal issues and maintaining operational integrity. Technology solutions
like automated tracking systems and real-time reporting can ensure that all loan
transactions are in line with regulatory requirements.
Impact:
o Automated Tracking and Reporting: Automated systems can document every
step of the loan process, ensuring that a complete audit trail is available. This
includes tracking compliance with regulations, such as those related to KYC
(Know Your Customer), AML (Anti-Money Laundering), and other lending
standards.
o Real-time Alerts and Monitoring: AI systems can monitor transactions for
suspicious activities and flag them in real time, ensuring quicker responses to
compliance issues.
o Simplified Audits: With automated record-keeping and real-time reports,
audits become much easier and less time-consuming, as auditors can access
all required documents and compliance records at the push of a button.
AIM AND SCOPE
The primary aim of the proposed system, “Customer Loan Prediction Analysis”, is to predict the
approval or rejection of loan applications. The banking sector is growing rapidly, and as a result
many people are applying for bank loans. Identifying the applicants whose loans should be
approved is a difficult process. In this paper, we propose a model that predicts loan
approval or rejection for an applicant using machine learning techniques. This is done by
training the model on the records of people who previously applied for loans.
For this, we first analyze the data from the dataset downloaded from Kaggle,
then preprocess the data using exploratory data analysis, and after that build the model
using the train dataset and apply that model to the test data to predict the output. To do this, we
use logistic regression to determine the loan approval status of the applicants.
2.1 OBJECTIVES
The proposed model, “Customer Loan Prediction Analysis”, is based on
logistic regression, where rather than fitting a regression line, we fit an
"S"-shaped logistic function whose output lies between the two limiting values 0 and 1. It is a significant
algorithm because it can provide probabilities, classify observations using different types of data,
and easily identify the most effective variables for classification.
After that we clean and process the dataset, study the data fields,
and identify the criteria and variables that have a positive
impact on loan approval. Various steps are involved in this process, including
building the model, training it and applying it to the test dataset.
After implementation we obtain an output.csv dataset file containing all the
predicted approval statuses from the model we built, which can be used by
banks to automate the loan approval process quickly and easily.
Why It Matters:
Banks and financial institutions need to automate loan approval decisions to reduce processing
time and maintain consistency, while minimizing the risk of bad loans.
How It Works:
Input: Customer demographic and financial details (e.g., income, employment, credit
history).
Output: Binary prediction – Approved or Not Approved.
Algorithm: Supervised learning classification models (e.g., Logistic Regression, Random
Forest, XGBoost).
Business Impact:
Faster processing of applications.
Objective decision-making.
Better customer experience.
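The input and output framing above can be made concrete with a small logistic regression trained by gradient descent. This is a minimal sketch in plain Python so that it stays self-contained; in practice a library implementation would normally be used. The eight applicant rows, the two features (credit history and an income value scaled to 0 to 1) and the learning-rate settings are all hypothetical.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            score = bias + sum(w * v for w, v in zip(weights, x))
            error = sigmoid(score) - y  # gradient of log-loss w.r.t. the score
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


def predict(weights, bias, x, threshold=0.5):
    """Return 1 (Approved) or 0 (Not Approved) for one applicant."""
    probability = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
    return 1 if probability >= threshold else 0


# Hypothetical training data: [credit_history (0/1), income scaled to 0..1].
rows = [[1, 0.8], [1, 0.6], [1, 0.4], [0, 0.7],
        [0, 0.3], [0, 0.5], [1, 0.9], [0, 0.2]]
labels = [1, 1, 1, 0, 0, 0, 1, 0]

weights, bias = train_logistic_regression(rows, labels)
print(predict(weights, bias, [1, 0.5]), predict(weights, bias, [0, 0.5]))
```

The same training and prediction steps apply to the other classifiers named above; only the model that produces the probability changes.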
2.2 SCOPE
This model focuses on both the accuracy and the cross-validation score of the
model. A higher precision score decreases the lender’s risk. By implementing this,
banks can save time in processing customers' loan applications by
automating the process, which reduces manual work.
To implement this, the model is fed the data that most strongly affect the loan status.
To reduce ambiguity, the credit score factor takes only two unique
values {0,1}, where ‘0’ represents a lower credit score and ‘1’ represents a higher credit score.
In this way we can predict the approval status for customers who applied for a loan
at the respective banks.
The main goal of our model is to allocate credit borrowers to two groups: a “good
credit” group that is likely to repay the financial obligation and a “bad credit” group with a
high risk of default on its financial obligations. Our sample consists of 195 SMEs using
credit files distributed over five sectors of activity, namely “agriculture, industry, services,
tourism and trade.”
The objective of this paper is to present models for predicting the probability of
default of the counterparty based on the score method, which pits discriminant analysis
against logistic regression. Credit scoring is a method that helps the bank to rationalize its
process for credit granting decision. Its principle is to synthesize a set of financial ratios as
one indicator able to distinguish between good and bad customers.
Past machine learning models have achieved the highest accuracy but with the lowest
precision score. The precision score is the number of true positives divided by the
total number of true positives and false positives. A low precision score means that the
model generates a large number of false positives. Generating more
false positives than false negatives decreases the model's exactness and also increases
the risk factor. Credit risk analysis is based on various information about the borrower,
which may be summarized as qualitative and quantitative data such as financial ratios.
Since access to qualitative data is difficult, our study is based only on
quantitative variables; hence it becomes necessary to make a wise choice of financial
ratios.
2.1.1 ADVANTAGES
Benefits:
3. Operational Efficiency
What it means:
Automation of loan screening reduces the burden on loan officers and speeds up
processing time.
Benefits:
Cuts down manual effort in verifying applications.
Enables real-time or near-instant loan decisions.
Scales efficiently as loan volumes grow.
7. Competitive Advantage
What it means:
Early adoption of AI-driven lending gives a company a lead over competitors still
using traditional methods.
Benefits:
Attracts tech-savvy and creditworthy customers.
Enables agile response to market changes.
Facilitates innovation in loan products and services.
EXISTING SYSTEM
Machine learning implementation is a very complex part of data analytics. Working
with data for prediction and writing code to predict future outcomes
for customers is the challenging part. The other existing methods make the process more
complex and do not give an accurate cross-validation score when predicting results.
3.3 MODULES
The Customer Loan Prediction Analysis model consists mainly of three modules:
reading and cleaning the dataset, model building and, finally, testing the dataset. Here the
primary task is to maintain the database required for analyzing the data in the form of .csv
files. It needs a train dataset and a test dataset. The user has access to both datasets to
train and build the testing model.
First, we need to carry out exploratory data analysis (EDA) to spot mistakes in the
provided or collected dataset and to map out the underlying structure of the data. EDA identifies
the most important variables and fields in the dataset by visualizing graphs
and boxplots using the Python libraries NumPy and Pandas. Listing anomalies and outliers
comes at the next stage. Let us look into each module in detail.
For the non-numerical values (e.g., Property Area, Credit History etc.), we can look
at the frequency distribution to understand whether they make sense. We obtain the unique
values of the variable Property Area and their frequencies. The distribution of
numerical variables such as Applicant Income and Loan Amount can be understood using boxplots,
which also reveal the outliers in these fields. We can see that there is no
substantial difference between the mean income of graduates and non-graduates, but there are a
higher number of graduates with very high incomes, which appear as outliers.
After this we are able to find the fields that have a positive impact on loan approval
status and reach a conclusion through analysis.
Loan Amount has missing as well as extreme
values, while Applicant Income has a few extreme values. The next step is to understand the
distribution of categorical variables.
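The boxplot-based outlier check described above can be approximated numerically. The sketch below assumes the common boxplot convention that points more than 1.5 times the interquartile range beyond the quartiles are outliers; the income values are hypothetical, and `None` stands in for a missing entry.

```python
import statistics


def iqr_outliers(values, whisker=1.5):
    """Return the values lying outside the boxplot whiskers (assumed 1.5 * IQR rule)."""
    known = [v for v in values if v is not None]  # ignore missing entries
    q1, _median, q3 = statistics.quantiles(known, n=4)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return [v for v in known if v < lower or v > upper]


# Hypothetical applicant incomes with one missing and one extreme value.
incomes = [2500, 3000, None, 3200, 3500, 3800, 4000, 4200, 4500, 5000, 81000]

print(iqr_outliers(incomes))
```

Whether such extreme values should be removed, capped or transformed is a separate modeling decision; the check only identifies them.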
By making a cross table of the Credit History and Loan Status fields we
see that more loans are approved in the train dataset for applicants with a credit history
equal to 1. We then write a function to find the percentage of applicants with credit history
equal to 1 whose loans are approved, and it shows that more than 79% of them
obtained loans. We now move on to understand outliers
better in the next module in order to build the model.
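The cross table and the percentage function mentioned above can be sketched as follows. The ten records and the "Y"/"N" status codes are hypothetical, so the rates printed here are not the study's figures.

```python
from collections import Counter

# Hypothetical (credit_history, loan_status) pairs; "Y" = approved, "N" = not approved.
records = [(1, "Y"), (1, "Y"), (1, "Y"), (1, "N"), (1, "Y"),
           (0, "N"), (0, "N"), (0, "Y"), (1, "Y"), (0, "N")]


def cross_table(pairs):
    """Count how often each (credit_history, loan_status) combination occurs."""
    return Counter(pairs)


def approval_rate(pairs, credit_history):
    """Share of applicants with the given credit history whose loan was approved."""
    table = cross_table(pairs)
    approved = table[(credit_history, "Y")]
    total = approved + table[(credit_history, "N")]
    return approved / total if total else 0.0


table = cross_table(records)
for history in (0, 1):
    print(history, table[(history, "Y")], table[(history, "N")],
          round(approval_rate(records, history), 3))
```

Comparing the two rates side by side is what makes the influence of credit history on approval visible.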
earlier in this project also comes under EDA.
In this phase, we discuss the insights we obtained in more depth than before, for better
clarification. The feature analysis showed that the
highest counts were for male, married, graduate, not self-employed, and semi-urban
applicants. However, we applied the statistics one feature at a time and were not sure how these
features affect the target variable Loan Status when grouped together.
We were also not sure whether we should remove the outliers in the numerical features.
Because sklearn requires all inputs to be numeric, we should convert all our categorical variables into
numeric form by encoding the categories. Before that we fill all the missing values in the
dataset. In order to prepare and apply a model to this dataset, we first have to break it into
two subsets.
The first is the training set, on which we develop the model. The second is
the test dataset, which we use to test the accuracy of our model. We allocate 75%
of the items to the training set and 25% to the test set. Once the dataset has been split, we
can establish a baseline model for predicting whether a credit application will be approved.
This baseline model is used as a benchmark to determine how effective the
models are. First, we determine the percentage of credit card applications that were approved
in the training set. We can see from the summary output that the Debt variable has missing
values that we have to fill in. We could simply use the mean of all the existing values to do
so.
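The 75/25 split, the baseline benchmark and the mean-based filling of missing values described above can be sketched in plain Python. The helper names, the random seed and the twenty generated rows are assumptions for illustration; the baseline here simply predicts the majority class of the training set.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = statistics.mean(known)
    return [mean if v is None else v for v in values]


def baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the majority class seen in training."""
    majority = 1 if sum(train_labels) * 2 >= len(train_labels) else 0
    return sum(1 for y in test_labels if y == majority) / len(test_labels)


# Hypothetical rows of (debt, approved) used only to exercise the helpers.
rows = [(float(i), 1 if i % 3 else 0) for i in range(20)]
train, test = train_test_split(rows)
train_labels = [label for _, label in train]
test_labels = [label for _, label in test]

print(len(train), len(test))
print(baseline_accuracy(train_labels, test_labels))
print(fill_missing_with_mean([1.0, None, 3.0]))
```

Any fitted model should beat the baseline accuracy on the test set to be considered useful.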
Another method would be to check the relationship among the numeric values and
use a linear regression to fill them in. Regression models are useful for predicting continuous
(numeric) variables. However, the target value in Approved is binary and can only take values
of 1 or 0. The applicant can either be issued a credit card or denied; they cannot receive a
partial credit card.
We could use linear regression to predict the approval decision using threshold and
anything below assigned to 0 and anything above is assigned to 1. Unfortunately, the
predicted values could be well outside of the 0 to 1 expected range. Therefore, linear or
multivariate regression will not be effective for predicting the values. Instead, logistic
regression will be more useful because it will produce probability that the target value is 1.
Probabilities are always between 0 and 1 so the output will more closely match the target
value range than linear regression.
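The range problem described above can be seen in a small numerical sketch. The weight, bias and input values below are arbitrary assumptions chosen only to show the effect; they are not fitted to the loan data.

```python
import math


def linear_score(x, weight=0.4, bias=-0.5):
    """Output of a straight-line model; it is not bounded."""
    return weight * x + bias


def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


inputs = [-5.0, 0.0, 2.0, 10.0]
linear_outputs = [linear_score(x) for x in inputs]              # may be < 0 or > 1
logistic_outputs = [sigmoid(linear_score(x)) for x in inputs]   # always inside (0, 1)
decisions = [1 if p >= 0.5 else 0 for p in logistic_outputs]    # threshold at 0.5

for x, lin, prob, d in zip(inputs, linear_outputs, logistic_outputs, decisions):
    print(f"x={x:5.1f}  linear={lin:5.2f}  probability={prob:.3f}  decision={d}")
```

The linear outputs run from -2.5 to 3.5, which cannot be read as probabilities, while the logistic outputs stay between 0 and 1 and can be thresholded directly.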
Convert all non-numeric values to numbers and create a generic classification function for
assessing performance.
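The conversion of non-numeric values to numbers can be sketched as a simple label encoding, in the same spirit as the LabelEncoder used in the appendix code. The helper names and the sample column below are assumptions made for illustration.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def encode(values, mapping):
    """Replace every category with its integer code."""
    return [mapping[v] for v in values]


# Toy Education column invented for illustration.
education = ["Graduate", "Non-Graduate", "Graduate", "Graduate"]
education_mapping = fit_label_encoder(education)
education_encoded = encode(education, education_mapping)
print(education_mapping)
print(education_encoded)
```

The integer codes carry no ordering meaning; they only give the model a numeric input, which is adequate for two-valued features such as Education or Gender.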
Logistic Regression is a classification algorithm for supervised (labelled) data. It is
similar to linear regression; the difference is that logistic regression uses the logistic
(sigmoid) function, which is an S-shaped curve[5]. The sigmoid function takes any real-valued
number and maps it to a value between 0 and 1, but never exactly 0 or 1.
The chances of getting a loan are higher for applicants with a credit history equal to 1,
with higher applicant and co-applicant income, with higher education, and with properties in
urban areas. So we build our model with ‘Credit_History’, ‘Education’ and ‘Gender’.
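To make the idea concrete, the following is a minimal sketch of a logistic regression fitted by batch gradient descent on the three chosen predictors. It is not the scikit-learn model used in the project: the training routine, learning rate, number of epochs and the eight toy rows are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, target in zip(X, y):
            error = sigmoid(sum(w * x for w, x in zip(weights, row)) + bias) - target
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias


def predict_proba(row, weights, bias):
    """Estimated probability that the loan is approved (target value 1)."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


# Columns: Credit_History, Education (1 = graduate), Gender (1 = male).
# Toy rows invented for illustration only.
X = [[1, 1, 1], [1, 0, 1], [1, 1, 0], [1, 0, 0],
     [0, 1, 1], [0, 0, 1], [0, 1, 0], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0, 1, 0]

weights, bias = fit_logistic(X, y)
p_good_history = predict_proba([1, 1, 1], weights, bias)
p_bad_history = predict_proba([0, 1, 1], weights, bias)
print(f"P(approved | credit history 1) = {p_good_history:.3f}")
print(f"P(approved | credit history 0) = {p_bad_history:.3f}")
```

In the toy data, three of four applicants with a credit history of 1 are approved against one of four without, so the fitted weight on Credit_History is positive and the predicted probability rises accordingly.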
MODEL DEVELOPMENT METHODOLOGY
An effective model was proposed for predicting the right customers who have applied
for a loan. In our proposed model we used logistic regression, which is one of the popular
machine learning algorithms under the supervised learning technique. It is applicable to a
categorical dependent variable predicted from a given set of independent variables. Thus, the
outcome must be a categorical or discrete value. The output can be either Yes or No, 0 or 1,
true or false, etc., but instead of giving the exact value as 0 or 1, it gives probabilistic
values that lie between 0 and 1. Logistic regression is very similar to linear regression
except in how they are used: linear regression is used for solving regression problems,
whereas logistic regression is used for solving classification problems. The input dataset is
the bank dataset of customers who applied for the loan. The dataset is a CSV file. It can be
read into the Python environment using the read_csv() method in pandas; for that, pandas must
first be imported into the Python environment. The features of the
customers dataset are Loan_ID, Gender, Married, Dependents, Education, Self_Employed,
ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History,
Property_Area and Loan_Status.
Missing values are considered noisy data in the dataset. If no information is provided
while collecting the data, those entries in the database are referred to as missing values.
Missing data is represented as either NaN or None in pandas. None is the Python singleton
object that is often used to mark missing data in code. Mean imputation replaces the null
values of a column with the mean value of that column. Median imputation replaces the null
values of a column with the median value of that column.
A value that deviates significantly from the rest of the values in a feature is
referred to as an outlier. Outliers can be caused by measurement or execution errors.
Analyzing the outlying data points is referred to as outlier analysis. Removing or reducing
the outliers is important when that particular feature has a strong effect on the dependent
variable, that is, when the feature is important in producing the output.
Pandas is a Python package providing fast, flexible, and expressive data structures
designed to make working with structured (tabular, multidimensional, potentially
heterogeneous) and time series data both easy and intuitive.
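The mean and median imputation described above can be sketched in plain Python as follows. The function name impute and the sample values are assumptions made for illustration; None stands in for a missing entry.

```python
import statistics


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Toy loan amounts invented for illustration; 700 acts as an extreme value.
loan_amounts = [100, None, 120, 700, None, 80]
mean_filled = impute(loan_amounts, "mean")
median_filled = impute(loan_amounts, "median")
print(mean_filled)
print(median_filled)
```

In this example the single extreme value pulls the mean up to 250, while the median stays at 110, which is why median imputation is often preferred for skewed features such as income or loan amount.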
Based on the data given by the loan applicant, we can predict through a user interface
whether the loan of a particular applicant will be approved. The user interface contains the
input variables with their corresponding fields and a field to display the output. The input
variables are Gender, Marital Status, Dependents, Education, Applicant Income, Loan Amount,
Loan Amount Term, Credit History and Property Area. The applicant needs to give these values
and, based on them, the model predicts whether the loan will be approved or not.
As per our analysis, the Credit History, Income, Education, Gender and Property Area
of the applicant have the most impact, so we created a model using the above steps with these
fields.
Variable             Description
Loan_ID              Unique Loan ID
Gender               Male/ Female
Married              Applicant married (Y/N)
Dependents           Number of dependents
Education            Graduate/ Non-Graduate
Self_Employed        Self-employed (Y/N)
ApplicantIncome      Applicant income
CoapplicantIncome    Coapplicant income
LoanAmount           Loan amount in thousands
Loan_Amount_Term     Term of loan in months
Credit_History       Credit history meets guidelines
Property_Area        Urban/ Semi Urban/ Rural
Loan_Status          Loan approved (Y/N)
mathematical functions to operate on these arrays.
Matplotlib: Matplotlib is a plotting library for the Python programming language and its
numerical mathematics extension NumPy. It provides an object-oriented API for embedding
plots into applications using general-purpose GUI toolkits and has advanced Mathematical
features.
Scikit-learn: Scikit-learn is a free software machine learning library for the Python
programming language. It features various classification, regression and clustering algorithms
including support vector machines.
Logistic Regression: Regression models are useful for predicting continuous (numeric)
variables. However, the target value in Approved is binary and can only take the values 1 or 0.
The applicant can either be issued a credit card or denied; they cannot receive a partial credit
card. We could use linear regression to predict the approval decision by applying a threshold,
with anything below the threshold assigned to 0 and anything above it assigned to 1.
Unfortunately, the predicted values could fall well outside the expected 0 to 1 range.
Therefore, linear or multivariate regression will not be effective for predicting the values.
Instead, logistic regression is more useful because it produces the probability that the
target value is 1. Probabilities are always between 0 and 1, so the output matches the target
value range more closely than that of linear regression.
FIG 4.4 Logistic Regression Model Graph
Accuracy Rate : It is the percentage of correct predictions for a given dataset Cross
RESULTS AND DISCUSSION
Finally, using the logistic regression model, we predict whether the loan is
approved or not. To implement this, various input variables are used to produce the
output. Whenever the program takes the input data, it gives a binary output, i.e.,
either 0 or 1. If the output is 1, then ‘1’ is displayed, indicating that the loan is approved.
If the output is 0, then ‘0’ is displayed, indicating that the loan is not approved. Here, we
have implemented a loan credibility prediction system that helps organizations make the
right decision to approve or reject the loan requests of customers. This should help
the banking industry open up efficient delivery channels. In this model, the logistic
regression algorithm is used for the prediction.
Fig 5.1 Boxplot for Variable Applicant Income of Training Data set
The above Box Plot confirms the presence of a lot of outliers/extreme values. This can be
attributed to the income disparity in the society.
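The points that a box plot draws as outliers are conventionally those lying more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule is given below; the function name iqr_outliers and the income values are assumptions made for illustration, not figures from the training data.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Toy applicant incomes invented for illustration.
incomes = [2500, 3000, 3200, 3500, 4000, 4200, 4500, 5000, 39000, 81000]
extreme_incomes = iqr_outliers(incomes)
print(extreme_incomes)
```

Here the quartiles are 3275 and 4875, so any income above 7275 is flagged, which singles out the two very high incomes.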
Fig 5.2 Box Plot for understanding the distributions and to observe the outliers
Fig 5.2 above shows that more outliers are present among male applicants than among female
applicants.
Fig 5.4 Box plot for Applicant Income before and after Log Transformation
The extreme values are practically possible, i.e. some people might apply for high-value
loans due to specific needs. So instead of treating them as outliers, we apply a log
transformation to reduce their effect. We also combine the Applicant and Co-applicant Income
into a total income for better results.
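The combination of the two incomes followed by a log transformation can be sketched as follows. The income values are invented for illustration; the variable names loosely follow the TotalIncome and TotalIncome_log columns of the appendix code.

```python
import math

# Toy incomes invented for illustration only.
applicant_income = [2500, 4000, 81000]
coapplicant_income = [1500, 0, 0]

# Total income per application, then its natural logarithm.
total_income = [a + c for a, c in zip(applicant_income, coapplicant_income)]
total_income_log = [math.log(t) for t in total_income]

# How far the largest value sits from the smallest, before and after the transform.
raw_ratio = max(total_income) / min(total_income)
log_ratio = max(total_income_log) / min(total_income_log)
print(total_income, [round(v, 3) for v in total_income_log])
print(f"raw ratio = {raw_ratio:.2f}, log ratio = {log_ratio:.2f}")
```

The largest total income is about 20 times the smallest on the raw scale but only about 1.4 times on the log scale, so the extreme value no longer dominates. The transform requires strictly positive totals.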
Fig 5.6 Accuracy Rate and Cross Validation Score
In Fig 5.6 we can see that the Accuracy Rate is 80.945%, with a Cross Validation Score of
80.946%, which we consider a good value when compared with recent models based on other
machine learning algorithms. The closeness of the two values suggests that the accuracy of
the model is stable across different splits of the data.
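The two quantities reported above can be computed as in the following sketch. It is a plain-Python illustration, not the project's classification function: the helper names, the stand-in rule "approve when credit history is 1" and the ten toy rows are assumptions.

```python
def accuracy(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def cross_validation_score(X, y, fit, predict, k=5):
    """Mean accuracy over k folds, refitting the model on each training part."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = [predict(model, X[i]) for i in test_idx]
        scores.append(accuracy([y[i] for i in test_idx], preds))
    return sum(scores) / len(scores)


# Stand-in "model": approve exactly when Credit_History (first column) is 1.
def fit_rule(X_train, y_train):
    return None  # nothing to learn for this fixed rule


def predict_rule(model, row):
    return row[0]


# Toy data invented for illustration.
X = [[1], [1], [0], [1], [0], [1], [1], [0], [1], [1]]
y = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]

overall_accuracy = accuracy(y, [predict_rule(None, row) for row in X])
cv_score = cross_validation_score(X, y, fit_rule, predict_rule, k=5)
print(f"Accuracy: {overall_accuracy:.3f}%  Cross-Validation Score: {cv_score:.3f}%")
```

Accuracy measured on the data used for fitting can be optimistic; the cross-validation score averages accuracy over held-out folds, so the two being close is a sign that the model is not overfitting.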
CONCLUSION AND FUTURE WORK
CONCLUSION
From a proper analysis of the positive points and constraints of the component, it can
be safely concluded that the product is a highly efficient component. The application is
working properly and meets all banker requirements. The component can be easily plugged into
many other systems. There have been a number of cases of computer glitches and errors in
content and, most importantly, the weights of the features are fixed in the automated
prediction system. So in the near future the software could be made more secure and reliable,
with dynamic weight adjustment.
Finally, using the logistic regression model, we predict whether the loan
is approved or not. To implement this, various input variables are used to produce the
output. Whenever the program takes the input data, it gives a binary output, i.e.,
either 0 or 1. If the output is 1, then ‘1’ is displayed, indicating that the loan is approved.
If the output is 0, then ‘0’ is displayed, indicating that the loan is not approved.
In this paper, data preprocessing and transformation techniques are applied and results
are generated by implementing analytical models. The performance is analyzed using the
confusion matrix table. We can also use this model to make detailed testing selections. Any
credit application whose actual outcome differs from the one predicted by the model is a
potential audit exception. The inherent risk is that a credit card was issued to someone who
should have been denied. Such an account is more likely to default than a properly approved
account, which in turn exposes the company to loss.
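The confusion matrix and the selection of audit exceptions mentioned above can be sketched as follows. The function name and the toy outcomes are assumptions made for illustration; 1 stands for an approved application and 0 for a rejected one.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = approved)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["TP"] += 1
        elif actual == 0 and predicted == 0:
            counts["TN"] += 1
        elif actual == 0 and predicted == 1:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts


# Toy outcomes invented for illustration.
actual_decisions = [1, 1, 0, 1, 0, 0, 1, 0]
model_predictions = [1, 1, 1, 1, 0, 0, 0, 0]

matrix = confusion_matrix(actual_decisions, model_predictions)

# Applications where the recorded decision differs from the model's prediction.
audit_exceptions = [i for i, (a, p) in enumerate(zip(actual_decisions, model_predictions))
                    if a != p]
# Approved in practice although the model would have denied them.
approved_against_model = [i for i in audit_exceptions if actual_decisions[i] == 1]
print(matrix, audit_exceptions, approved_against_model)
```

The last list corresponds to the risk singled out in the text: applications that were approved although the model would have denied them.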
FUTURE WORK
Here, we have implemented a loan credibility prediction system that helps
organizations make the right decision to approve or reject the loan requests of
customers. This should help the banking industry open up efficient delivery
channels. In this model, the logistic regression algorithm is used for the prediction.
Other techniques that may outperform the popular data mining models still have to be
implemented and tested for this domain.
Different machine learning models can be implemented and their performance
compared. We can also carry out this automation using Neural Networks, Naive Bayes and
other machine learning algorithms; this work is in progress.
REFERENCES
[1]. Sudhamathy G and Jothi Venkateswaran, “Analytics Using R for Predicting Credit
Defaulters”, IEEE International Conference on Advances in Computer Applications (ICACA),
978-1-5090-3770-4, 2016.
[2]. M. Sudhakar and C.V.K. Reddy, “Two Step Credit Risk Assessment Model For Retail
Bank Loan Applications Using Decision Tree Data Mining Technique”, International Journal
of Advanced Research in Computer Engineering & Technology (IJARCET), vol. 5, no. 3, pp.
705–718, 2016.
[3]. J.H. Aboobyda and M.A. Tarig, “Developing Prediction Model Of Loan Risk In Banks
Using Data Mining”, Machine Learning and Applications: An International Journal (MLAIJ),
vol. 3, no. 1, pp. 1–9, 2016.
[4]. Z. Somayyeh and M. Abdolkarim, “Natural Customer Ranking of Banks in Terms of
Credit Risk by Using Data Mining A Case Study: Branches of Mellat Bank of Iran”, Jurnal
UMP Social Sciences and Technology Management, vol. 3, no. 2, pp. 307–316, 2015.
[5]. A.B. Hussain and F.K.E. Shorouq, “Credit risk assessment model for Jordanian
commercial banks: Neural scoring approach”, Review of Development Finance, Elsevier, vol.
4, pp. 20–28, 2014.
[6]. T. Harris, “Quantitative credit risk assessment using support vector machines: Broad
versus Narrow default definitions”, Expert Systems with Applications, vol. 40,
pp. 4404–4413, 2013.
[7]. Dileep B. Desai and R.V. Kulkarni, “A Review: Application of Data Mining Tools in
CRM for Selected Banks”, International Journal of Computer Science and Information
Technologies (IJCSIT), vol. 4, no. 2, pp. 199–201, 2013.
[8]. Gang Wang and Jian Ma, “Study of corporate credit risk prediction based on integrating
boosting and random subspace”, 2011.
[9]. Hussain Ali Bekhet and Shorouq Fathi Kamel Eletter, “Credit risk assessment model for
Jordanian commercial banks: Neural scoring approach”, Apr. 2014.
[10]. M. Yaghini, T. Zhiyan, and M. Fallahi, “A Prediction Model for Recognition of Bad
Credit Customers in Saman Bank Using Neural Networks”, 2011.
APPENDIX
A. PLAGIARISM REPORT
Customer Loan Approval Prediction using Logistic
Regression
ABSTRACT
In the banking sector, a loan is the lending of a sum of money by a bank to one or more individuals,
organizations, etc. The person who borrows the money from the financier incurs a debt and is responsible
for paying it back, with the interest decided by the bank, within a certain period. Generally, what banks
look into before approving a loan is the credit history, credit loss and income of the applicant. Loans
therefore play a major role in the income of a bank. Due to rapid urban development, the number of people
applying for loans has increased rapidly. As a result, finding the applicants to whom a loan can be approved
has become a complex process. In this paper, we want to automate the loan eligibility process (in real time)
based on customer details. The fields required are Marital Status, Income, Education, Gender, Number of
Dependents, Loan Amount, Credit History and others. To predict the status, we use logistic regression to
spot the customers who are eligible for the loan amount, so that the bank can reach out to them and grant
loans to those who can pay back within the given time.
Variable    Description
Loan_ID     Unique Loan ID Number
Gender      Male/ Female
The above diagram shows the steps involved in building this model.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# df, test, cat_cols and classification_model are assumed to be defined earlier in the listing

# Combine the train and test data so that both receive the same preprocessing
fullData = pd.concat([df, test], axis=0, sort=True)

# Look at the available missing values in the dataset
fullData.isnull().sum()

# Create a new column as Total Income and take its log
fullData['TotalIncome'] = fullData['ApplicantIncome'] + fullData['CoapplicantIncome']
fullData['TotalIncome_log'] = np.log(fullData['TotalIncome'])

# Histogram for Total Income
fullData['TotalIncome_log'].hist(bins=20)

# Create label encoders for categorical features
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))

# Split back into train and test parts (copies avoid chained-assignment warnings)
train_modified = fullData[fullData['Type'] == 'Train'].copy()
test_modified = fullData[fullData['Type'] == 'Test'].copy()
train_modified["Loan_Status"] = number.fit_transform(train_modified["Loan_Status"].astype('str'))

predictors_Logistic = ['Credit_History', 'Education', 'Gender']
x_train = train_modified[list(predictors_Logistic)].values
y_train = train_modified["Loan_Status"].values
x_test = test_modified[list(predictors_Logistic)].values

# Create logistic regression object
model = LogisticRegression()

# Train the model using the training sets
model.fit(x_train, y_train)

# Predict output
predicted = model.predict(x_test)

# Reverse encoding for predicted outcome
predicted = number.inverse_transform(predicted)

# Store it to test dataset
test_modified['Loan_Status'] = predicted
outcome_var = 'Loan_Status'

# Report accuracy and cross-validation score, then save the predictions
classification_model(model, df, predictors_Logistic, outcome_var)
test_modified.to_csv("Logistic_Prediction.csv", columns=['Loan_ID', 'Loan_Status'])
Accuracy: 80.945% Cross-Validation Score: 80.946%