Loan Approval

This field report presents a study on customer loan prediction analysis conducted by Mr. Tushar Ramnath Sandhan for his MBA degree, focusing on the implementation of predictive analytics at Shree Umiya Technopets. The research highlights the importance of predictive models in enhancing decision-making, risk mitigation, and resource optimization in banking, particularly in predicting loan defaults using logistic regression. The study concludes that integrating predictive analytics into core business functions is essential for banks to transition from reactive to proactive management, ultimately improving customer satisfaction and operational efficiency.

A

FIELD REPORT
ON
A STUDY ON CUSTOMER LOAN PREDICTION ANALYSIS
Submitted in Partial Fulfillment of the Requirements for the Degree
OF
Master of Business Administration
Finance
Submitted By
Mr. Tushar Ramnath Sandhan MBA
(Finance)
Under the guidance of
Prof. Wagh
Submitted Through

K.R. SAPKAL COLLEGE OF MANAGEMENT STUDIES, NASHIK


Submitted To
SAVITRIBAI PHULE PUNE UNIVERSITY 2024-25
STUDENT DECLARATION

I hereby declare that this Project Report titled “Customer Loan Prediction
Analysis” submitted by me is based on actual work carried out under the guidance
and supervision of Dr. Aarti More. Any reference to work done by any other person or
institution, or any material obtained from other sources, has been duly cited and
referenced. I further state that this work has not been submitted anywhere else for any
examination.

Date:

Place:

Signature of the student:


(Tushar Ramnath Sandhan)
ACKNOWLEDGMENT

I would like to thank the Almighty for the constant grace showered on me and
the unceasing gift of knowledge and strength that has sustained me
throughout the entire project work.

It was an honor and privilege for me to undertake my field project at Shree
Umiya Technopets Pvt. Ltd. I could not have completed my project
without their immense help and cooperation.

My sincere gratitude goes to all who guided me and constantly counseled me
throughout this undertaking.

I acknowledge my sincere thanks to Dr. Aarti More for her guidance, which made
this project materialize. Finally, I am also thankful to my parents and friends
for their encouragement and support.

Date:

Place:

Signature of the student:

(Tushar Ramnath Sandhan)
Certificate of Organisation

SHREE UMIYA TECHNOPETS


SHOP NO. 3, PLOT NO. X-73, MIDC, MALEGAON,
TAL.

SINNAR, DIST. NASHIK – 422 113.

Mobile No. - Hitesh Patel 9325656111

Pankaj Patel 9822000333

This is to certify that Mr. Tushar Ramnath Sandhan, a student of MBA at

K.R. SAPKAL COLLEGE OF MANAGEMENT STUDIES, has successfully
completed the field work as per the guidelines of Savitribai Phule Pune
University in our organization. During the work, the student was sincere,
hardworking and showed a keen interest to learn. The involvement and
sustained efforts put in by the student are highly appreciable. I
recommend this Field Project for evaluation and
consideration for the award of credits to the student. We wish him all the
best in his future endeavors.

Authorized Signature and Stamp

EXECUTIVE SUMMARY
This study explores the implementation and impact of predictive analytics on
strategic decision-making within Shree Umiya Technopets Pvt. Ltd., a dynamic
technology-driven company based in Nashik. As the organization continues to
scale and diversify its service offerings, decision-makers face increasing
complexity in aligning operations with market demands. Predictive analytics
emerges as a transformative tool, offering data-driven foresight into customer
behavior, operational efficiency, and market trends. The study investigates how
predictive models can enhance strategic planning, risk mitigation, and resource
optimization. To conduct this research, both qualitative and quantitative methods
were employed. Primary data was gathered through structured interviews with key
stakeholders, including managers, analysts, and IT personnel. Secondary data was
sourced from internal company reports, academic journals, and industry white
papers. Data analysis focused on identifying areas within the organization where
predictive analytics could create value—such as customer segmentation, sales
forecasting, supply chain management, and employee productivity. The findings
revealed a strong correlation between data-driven insights and improved decision
quality, agility, and ROI. The study found that Shree Umiya Technopets Pvt. Ltd.
has already adopted certain analytical tools but lacks a cohesive framework to
fully leverage predictive capabilities across departments. A major recommendation
is the establishment of a centralized data analytics team responsible for model
development, data governance, and continuous learning. Additionally, strategic
investments in AI-driven software platforms and staff training programs would
enable Shree Umiya Technopets Pvt. Ltd. to cultivate a culture of evidence-based
decision-making. Case studies within the report demonstrate the potential for
predictive analytics to reduce operational costs, enhance customer satisfaction, and
support long-term innovation. This study concludes that predictive analytics is not
merely a technical upgrade but a strategic imperative. By integrating predictive
models into core business functions, Shree Umiya Technopets Pvt. Ltd.
can transition from reactive to proactive management. This approach equips
the company to anticipate market shifts, personalize offerings, and maintain a
competitive edge in a rapidly evolving technological landscape.
ABSTRACT

In our banking system, banks have many products to sell, but the main source of income
of any bank is its credit line: banks earn from the interest on the loans they
extend. Loan approval is a very important process for banking organizations. The banking
industry always needs a more accurate predictive modeling system for many issues.
Predicting credit defaulters is a difficult task for the banking industry.
A bank's profit or loss depends to a large extent on loans, i.e. whether customers are
paying back the loan or defaulting. By predicting loan defaulters, the bank can reduce
its non-performing assets. This makes the study of this phenomenon very important.
Previous research in this area has shown that there are many methods to study the
problem of controlling loan default. But as right predictions are very important for the
maximization of profits, it is essential to study the nature of the different methods and
to compare them. A very important approach in predictive analytics is used to study the
problem of predicting loan defaulters: the logistic regression model.
Logistic regression models have been fitted and different measures of
performance computed. The models are compared on the basis of performance
measures such as sensitivity and specificity. The final results show that the models
produce different results. One model is marginally better because it includes variables
(personal attributes of the customer such as age, purpose, credit history, credit amount,
credit duration, etc.) other than checking account information (which reflects a
customer's wealth) that should be taken into account to calculate the probability of
default on a loan correctly. Therefore, by using a logistic regression approach, the right
customers to target for granting loans can be easily identified by evaluating their
likelihood of default. The model concludes that a bank should not only target rich
customers for granting loans but should also assess the other attributes of a customer,
which play a very important part in credit-granting decisions and in predicting loan
defaulters.
TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES

CHAPTER No. TITLE

1. INTRODUCTION
1.1 INTRODUCTION TO LOANS
1.2 LOAN APPROVAL AUTOMATION PROCESS
1.3 RESEARCH AND SIGNIFICANCE

2. AIM AND SCOPE


2.1 AIM OF PROJECT
2.2 OBJECTIVES
2.3 SCOPE
2.3.1 ADVANTAGES

3. SYSTEM DESIGN & METHODOLOGY


3.1 EXISTING SYSTEM
3.2 PROPOSED SYSTEM
3.3 SYSTEM CONFIGURATION
3.4 MODULES

4. MODEL DEVELOPMENT METHODOLOGY


4.1 DESCRIPTION OF DIAGRAM
4.2 ARCHITECTURAL DESIGN
4.3 IDENTIFYING THE ACTORS
4.4 SYSTEM IMPLEMENTATION
4.5 MODULAR DESIGN
5. RESULTS AND DISCUSSION

6. CONCLUSION AND FUTURE WORK

REFERENCES

APPENDIX

A. PLAGIARISM REPORT
B. JOURNAL PAPER
C. SOURCE CODE
LIST OF FIGURES

FIGURE No. FIGURE NAME

4.1 BLOCK DIAGRAM

4.2 ARCHITECTURE DIAGRAM


4.3 LOGISTIC REGRESSION GRAPH
5.1 BOX PLOT 1

5.2 BOX PLOT 2

5.3 BOX PLOT 3

5.4 BOX PLOT 4

5.5 MODEL BUILDING

5.6 ACCURACY RATE RESULT


5.7 OUTPUT FILE
INTRODUCTION

1.1 Introduction To Loans

In most developing countries, the banking sector continues to play a leading role in
promoting economic growth, in spite of efforts to boost their financial markets,
which remain underdeveloped. Credit risk is among the most significant risks to which
banks are exposed. This has led to the proliferation of risk management methods.

Distribution of loans is the main business of almost every bank. The main
portion of a bank's assets comes directly from the profit earned on the loans it
distributes. When credit is granted, and regardless of the expected gain, the bank is
exposed to uncertainty of default. The credit institution is never certain of recovering
its funds and is exposed to counterparty risk.

The banking domain aims to invest its assets in safe hands. Lending money to unsuitable
loan applicants results in credit risk. Today many banks approve loans after a long
procedure of verification, yet there is no guarantee that the selected candidate is the
right one. Estimating the risk involved in a loan application is one of the
most significant concerns of banks seeking to survive in a highly competitive market.
Through our proposed model we can predict whether a specific customer is safe or not.

Today many banks and financial companies approve loans after a rigorous process of


verification and validation, but there is still no surety that the chosen applicant is the
most deserving of all applicants. The objective of this paper is to present
models for predicting the probability of default of the counterparty based on the score
method, which pits discriminant analysis against logistic regression. Credit scoring is a
method that helps the bank rationalize its process for credit-granting decisions. Its
principle is to synthesize a set of financial ratios into one indicator able to distinguish
between good and bad customers.

Loan prediction is very helpful for bank employees as well as for applicants.
The aim of this paper is to provide a quick, immediate and easy way
to choose the deserving applicants.

This process is exclusively for the managing authority of the bank or finance company;


the whole prediction process is done privately, and no stakeholder is able to alter the
processing. The result against a particular loan ID can be sent to the various departments
of the bank so that they can take appropriate action on the application.

For banks, the big problem has become not only to decide whether to grant a loan
but also to predict the probability of default of a borrower under a credit
agreement. The task is to anticipate, over a given period, the quality of the borrower
(good or bad) and their ability to repay the debt.

1.2 LOAN APPROVAL AUTOMATION PROCESS

Credit scoring is a widely used technique that helps financial institutions evaluate
the likelihood that a credit applicant will default on a financial obligation and decide
whether to grant credit. The lowest credit score is referred to as ‘zero’ and the highest
as ‘one’. Here, the factor most highly correlated with the loan status of a
customer is the credit score. This fact is inferred from feature extraction and the
correlation matrix of the customer data.
Credit scoring is a statistical method for estimating the probability of default of the
borrower using historical and statistical data to reach a single indicator that can
distinguish good borrowers from bad borrowers. The score function is based on a method
of financial analysis of financial ratios, presented as a single indicator that can
distinguish between healthy and failing companies. We can obtain it by implementing a
logistic regression model.
The increased number of potential applicants impelled the development of
sophisticated techniques that automate the credit approval procedure and supervise the
financial health of the borrower. The large volume of loan portfolios also implies that
modest improvements in scoring accuracy may result in significant savings for financial
institutions (West, 2000). The goal of a credit scoring model is to classify credit
applicants into two classes: the “good credit” class, which is likely to reimburse
the financial obligation, and the “bad credit” class, which should be denied credit due
to the high probability of defaulting on the financial obligation. The classification is
contingent on sociodemographic characteristics of the borrower (such as age, education
level, occupation and income).
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets
to summarize their main characteristics, often with visual methods. A statistical model can
be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal
modeling or hypothesis testing task.
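As a minimal sketch of such exploratory analysis, assuming a pandas DataFrame with hypothetical column names and values (the study's actual dataset is not reproduced here):

```python
import pandas as pd

# Hypothetical loan dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000],
    "LoanAmount":      [130, 128, 66, 120, 141],
    "Loan_Status":     ["Y", "N", "Y", "Y", "N"],
})

# Summary statistics of the numeric features
print(df.describe())

# How many approved vs. rejected loans are in the data?
print(df["Loan_Status"].value_counts())

# Average loan amount, split by approval status
print(df.groupby("Loan_Status")["LoanAmount"].mean())
```

In practice this would be complemented by the visual methods mentioned above, such as box plots of each feature against the loan status.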
Loan approval automation refers to the integration of advanced technologies—such as
artificial intelligence (AI), machine learning (ML), big data analytics, and robotic process
automation (RPA)—into the loan application and evaluation lifecycle. It replaces or
significantly enhances the manual processes traditionally used by financial institutions to
assess a borrower's eligibility and creditworthiness.
In a manual process, loan officers typically review a series of documents including
proof of identity, income statements, credit scores, and employment details. This process is
time-consuming, often inconsistent due to subjective judgment, and vulnerable to human
errors and bias. Moreover, as the volume of applications increases, the traditional method
becomes less scalable and more expensive to manage.
Automation, on the other hand, enables real-time data collection, instant validation,
and quick decision-making. AI models can be trained on historical loan data to predict the
risk of default and the likelihood of repayment, allowing for data-driven decisions that are
faster and more accurate. Machine learning continuously improves the quality of
predictions as more data is fed into the system.
Moreover, automation contributes to a more seamless and transparent customer
experience. Applicants can complete the process online without visiting a branch, and
receive near-instant feedback on their application status. This not only boosts customer
satisfaction but also enhances competitiveness for financial institutions in an increasingly
digital landscape.
In essence, loan approval automation modernizes the lending process by making it
faster, more reliable, scalable, and customer-centric, while helping institutions manage risk
more effectively and comply with regulatory standards.

1.3 RESEARCH AND SIGNIFICANCE


The primary aim of this proposed system, “Customer Loan Prediction
Analysis”, is to determine whether a customer who has applied for a loan is eligible,
by evaluating all the details related to credit score, gender, annual income and place
of residence.
This paper proposes a credit scoring model for consumer loans based on various
analytical models. The rest of this paper is organized as follows. In the next section,
exploratory analysis and data transformation are presented. This is followed by a
description of the data sets and a comparison of the predictive accuracy of the models.
In our proposed model we used logistic regression, one of the most popular
machine learning algorithms, which comes under the supervised learning technique. It is
applicable when the dependent variable is categorical, given a set of independent
variables. Thus, the outcome must be a categorical or discrete value. The output can be
Yes or No, 0 or 1, true or false, etc., but instead of giving the exact value 0 or 1 it
gives probabilistic values lying between 0 and 1. Logistic regression is similar to
linear regression except in how they are used: linear regression is used for solving
regression problems, whereas logistic regression is used for solving classification
problems.
In logistic regression, rather than fitting a regression line, we fit an “S”-shaped
logistic function, which predicts two extreme values (0 or 1). The curve from the
logistic function represents the probability of an event, for example whether a cell is
cancerous or not, or whether a mouse is obese based on its weight. It is a significant
algorithm because it can provide probabilities, classify different types of data, and
easily determine the most effective variables for classification. Regression, by
contrast, is the task of forecasting the value of a continuous variable (e.g. a price, a
temperature) given some input variables. The lowest credit scores also tend to attract
the highest interest rates on a loan. In effect, the credit score saves lenders from
trusting unreliable borrowers, and it plays a major role in the customer's information.
Before the FICO score was created, the money-lending process was fairly subjective, and
potential borrowers were often judged by how trustworthy their character seemed.
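The approach can be sketched with scikit-learn's LogisticRegression on a small synthetic dataset; the features and values below are illustrative assumptions, not the study's actual data:

```python
# Sketch of the logistic regression approach described above, using
# scikit-learn on a tiny synthetic dataset (illustrative values only).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [credit_score (scaled 0-1), annual_income (lakhs), existing_debt (lakhs)]
X = np.array([
    [0.9, 8.0, 0.5],
    [0.2, 2.5, 3.0],
    [0.8, 6.0, 1.0],
    [0.3, 3.0, 2.5],
    [0.7, 5.5, 0.8],
    [0.1, 2.0, 4.0],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = repaid, 0 = defaulted

model = LogisticRegression()
model.fit(X, y)

# Instead of a hard 0/1 answer, the model yields a probability in (0, 1)
applicant = np.array([[0.6, 4.0, 1.5]])
p_repay = model.predict_proba(applicant)[0, 1]
print(f"Estimated probability of repayment: {p_repay:.2f}")
```

The fitted S-shaped function is what turns the weighted sum of the applicant's attributes into a default probability that can then be thresholded for a decision.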
Past machine learning models achieved high accuracy but low precision. Precision is
the number of true positives divided by the total number of true positives and false
positives. A low precision score means the model generates a large number of false
positives. Generating more false positives than false negatives decreases the model's
exactness and also increases the risk factor.
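The definition above amounts to a one-line formula; the confusion-matrix counts below are made up purely for illustration:

```python
# Precision as defined above: true positives over all predicted positives.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

tp, fp = 80, 40            # hypothetical confusion-matrix counts
print(precision(tp, fp))   # 80 / (80 + 40): many false positives drag precision down
```

A model with these counts would flag 120 applicants as defaulters, but only 80 of them actually default, which is exactly the risk the paragraph above warns about.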

Objectives of Loan Approval Automation


1. Streamline the Loan Application and Approval Process
Automation aims to make the entire loan lifecycle—from application to
disbursal—more seamless. Traditional loan processing involves multiple manual steps,
such as data collection, document verification, credit scoring, and final approval.
Automation integrates these steps into a unified digital workflow:
 Online application portals allow customers to submit necessary data and documents
quickly.
 Automated data extraction and integration with credit bureaus help in faster data
gathering.
 Workflow engines route applications to the right decision-making models or
personnel based on rules.
Benefit: Reduces administrative overhead and creates a smoother experience
for both applicants and staff.

2. Reduce Human Error and Bias


Manual loan processing is prone to human errors—data entry mistakes,
misinterpretation of financial data, or overlooking key documentation. Additionally,
human judgment can be influenced by unconscious bias related to age, gender, ethnicity,
or geography.
Automation:
 Uses data-driven algorithms and AI/ML models that follow consistent rules.
 Ensures accurate data handling and objective assessment of creditworthiness.
 Logs decisions for auditing and compliance, enhancing transparency.
Benefit: Improves fairness and reliability of the approval process.

3. Minimize Loan Processing Time


Traditionally, loan approvals can take days or even weeks, depending on the
complexity of the loan and the organization’s efficiency. Automation significantly shortens
this timeframe by:
 Automating document verification and scoring processes.
 Using real-time decision engines to provide instant approvals or rejections.
 Facilitating faster communication through automated updates and notifications.
Benefit: Accelerates time-to-approval, which is especially critical for personal,
small business, or emergency loans.

4. Ensure Consistency in Decision-Making


Manual decision-making can vary based on individual judgment or workload.
Automation ensures that:
 Every application is assessed using the same criteria.
 Pre-defined rules, scoring models, and risk thresholds are applied uniformly.
 Updates to policy are propagated instantly across the entire system.
Benefit: Builds trust in the institution's credibility and reduces the risk of legal
or regulatory inconsistencies.

5. Improve Customer Satisfaction and Operational Efficiency


Automation enhances the customer experience by making the process faster,
clearer, and more accessible. It also helps financial institutions by:
 Reducing the workload on loan officers and back-office staff.
 Lowering operational costs through reduced manual labor and paper handling.
 Allowing staff to focus on complex or high-value cases.
Benefit: Increases customer retention and operational profitability.

Components of an Automated Loan Approval System
1. User Interface
Purpose: The front-end platform that borrowers interact with.
Types:
 Web-based online portal
 Mobile applications (Android/iOS)
Key Features:
 Application form for entering personal, financial, and employment details
 Document upload functionality
 Real-time feedback (e.g., missing fields, eligibility hints)
 Secure login/authentication (e.g., OTP, biometrics)
 Progress tracking and status updates

2. Data Collection Module


Purpose: Gathers structured data from applicants.
Key Data Points Collected:
 Personal Info: Name, date of birth, address, ID numbers
 Financial Info: Income, liabilities, existing loans, bank statements
 Employment Info: Employer details, job title, work history
Functionality:
 Validates inputs (e.g., correct formats, mandatory fields)
 Pre-fills data from past applications (if logged in)
 Sends data to backend systems for processing

3. Document Verification Tool


Purpose: Verifies the authenticity and accuracy of submitted documents.
Technology Used:
 OCR (Optical Character Recognition): Reads text from images/PDFs
 AI & ML Models: Detects document types, checks for tampering or forgery
 Cross-validation: Compares data from documents with inputs (e.g., bank statements
vs. declared salary)

Typical Documents:
 Government ID, salary slips, tax returns, bank statements, utility bills

4. Credit Scoring Engine


Purpose: Assesses an applicant’s creditworthiness.
Data Sources:
 External credit bureaus (e.g., Experian, Equifax, TransUnion)
 Internal repayment history (if the applicant is a returning customer)
Evaluation Criteria:
 Credit score
 Number of open accounts and credit cards
 Payment history and delinquencies
 Debt-to-income ratio
Output: A numerical score or rating (e.g., A, B, C) and risk level (low, medium, high)
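The output stage described above can be sketched as follows; the score thresholds and letter ratings are hypothetical, not an actual bureau's bands:

```python
# Sketch of a scoring engine's output stage: mapping a numeric credit
# score to a letter rating and a risk level. Thresholds are assumptions.
def rate(score: int) -> tuple[str, str]:
    if score >= 750:
        return ("A", "low")
    if score >= 650:
        return ("B", "medium")
    return ("C", "high")

print(rate(780))  # ("A", "low")
print(rate(600))  # ("C", "high")
```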

5. Machine Learning Model


Purpose: Predicts loan eligibility and default risk using historical data.
How it Works:
 Trained on past loan data (features: income, age, credit score, etc.)
 Outputs:
o Approval likelihood
o Default probability
o Recommended loan amount or interest rate
Benefits:
 Dynamic and adaptive (improves as more data is added)
 Uncovers hidden patterns not visible in rule-based systems

6. Decision Engine
Purpose: Finalizes loan decisions by combining ML model outputs with business logic.
Business Rules Applied Might Include:
 Minimum income threshold
 Maximum debt-to-income ratio
 Manual review trigger conditions (e.g., missing documents, conflicting data)
Actions:
 Approve
 Reject
 Escalate to manual underwriter for review
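A minimal sketch of such a decision engine, with illustrative thresholds standing in for real business rules:

```python
# Sketch of a decision engine combining an ML default probability with
# simple business rules; every threshold here is an illustrative assumption.
def decide(default_prob: float, income: float, dti: float,
           docs_complete: bool) -> str:
    if not docs_complete:
        return "escalate"          # manual-review trigger: missing documents
    if income < 20000 or dti > 0.45:
        return "reject"            # hard business-rule failures
    if default_prob < 0.10:
        return "approve"           # clearly low risk
    if default_prob > 0.35:
        return "reject"            # clearly high risk
    return "escalate"              # borderline case: send to an underwriter

print(decide(0.05, 50000, 0.30, True))   # approve
print(decide(0.20, 50000, 0.30, True))   # escalate
```

Ordering matters in this design: rule checks run before the probability thresholds, so a policy violation always overrides a favorable model score.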

7. Notification System
Purpose: Communicates status and next steps to applicants.
Channels:
 Email
 SMS
 Push notifications via app
Messages Include:
 Application received
 Additional documents required
 Approved/Rejected decision
 Loan disbursal confirmation
Features:
 Templates with personalization (e.g., name, application ID)
 Event-triggered (e.g., send SMS when status changes)

8. Integration API
Purpose: Connects the loan system with external third-party services.
Typical Integrations:
 Credit Bureaus: Fetch credit scores and reports
 KYC Providers: Verify identity documents, facial recognition
 Income Verification: Connect with payroll APIs, bank aggregators
 Fraud Detection: Integrate with fraud scoring tools
Advantages:
 Seamless data exchange
 Reduced manual verification
 Ensures compliance with legal and regulatory requirements

Workflow of Loan Approval Automation


1. Application Submission
The loan process begins when a customer submits an application through a digital platform
—typically a web portal or mobile app.
 The system collects essential personal, financial, and employment details.
 User-friendly forms and digital signature support make the process smoother.
 Real-time input validation helps prevent common errors (e.g., missing fields or
incorrect formats).
Benefit: Enhances user convenience and reduces data entry errors early in the process.

2. Data Validation
Once submitted, the system automatically validates the input data for:
 Completeness: All required fields are filled out.
 Accuracy: Formats are correct (e.g., valid email address, ID number).
 Internal Consistency: Values make sense in relation to one another (e.g., income
level vs. loan amount).
If issues are detected, the system can prompt the applicant to correct them immediately.
Benefit: Ensures clean, structured data for downstream processes.
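The three checks above could be sketched as follows; the field names and the income-to-loan consistency rule are illustrative assumptions:

```python
# Sketch of the validation step: completeness, accuracy (format), and
# internal consistency. Field names and limits are assumptions.
import re

def validate(app: dict) -> list[str]:
    errors = []
    # Completeness: required fields present and non-empty
    for field in ("name", "email", "income", "loan_amount"):
        if not app.get(field):
            errors.append(f"missing field: {field}")
    # Accuracy: email format check
    if app.get("email") and not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", app["email"]):
        errors.append("invalid email format")
    # Internal consistency: loan amount vs. income (hypothetical rule)
    if app.get("income") and app.get("loan_amount"):
        if app["loan_amount"] > 10 * app["income"]:
            errors.append("loan amount implausible relative to income")
    return errors

print(validate({"name": "A. Borrower", "email": "a@example.com",
                "income": 40000, "loan_amount": 120000}))  # []
```

Returning a list of errors, rather than failing on the first one, is what lets the system prompt the applicant to fix everything in a single pass.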

3. Document Upload & Verification


Applicants upload supporting documents such as:
 ID proofs, income statements, employment letters, bank statements, etc.
Automation tools (often AI-powered OCR and NLP systems):
 Extract data from documents.
 Compare extracted data with information provided in the application.
 Detect fraud or forgery by checking document authenticity, logos, and formats.
Benefit: Speeds up verification and reduces the need for manual review.
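The cross-validation step can be sketched as a simple tolerance check between the declared figure and the value extracted from a document (the 10% tolerance is an assumed parameter):

```python
# Sketch of cross-validating OCR-extracted data against declared data.
# The 10% tolerance is an illustrative assumption, not a standard.
def salary_matches(declared: float, extracted: float,
                   tolerance: float = 0.10) -> bool:
    # Flag a mismatch if the two figures differ by more than the tolerance
    return abs(declared - extracted) <= tolerance * max(declared, extracted)

print(salary_matches(50000, 48500))  # True: within 10%
print(salary_matches(50000, 30000))  # False: possible misdeclaration
```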

4. Credit Assessment
The system integrates with external credit bureaus (e.g., Experian, Equifax, TransUnion) to
fetch the applicant's credit report.
 Credit history, payment defaults, credit utilization, and number of inquiries are
evaluated.
 Credit score is used as a base input for assessing creditworthiness.

Benefit: Provides a standardized view of financial reliability.

5. ML Risk Scoring
A machine learning (ML) model analyzes all collected data to calculate a risk score:
 Inputs may include income, age, employment stability, existing debt, spending
behavior, etc.
 Models are trained on historical data to predict the likelihood of default or
delinquency.
This risk score is a critical factor in the loan decision.
Benefit: Enables data-driven risk assessment, adapting over time to changing patterns.

6. Decision Making
A rule-based decision engine applies business policies and thresholds to:
 Approve low-risk applications automatically.
 Reject high-risk ones.
 Escalate borderline or complex cases to human underwriters.
Rules might include:
 Minimum credit score requirements
 Maximum debt-to-income ratios
 Loan amount caps based on income
Benefit: Ensures consistent and auditable decisions.

7. Communication
Once a decision is made:
 The system automatically sends notifications to the applicant via email, SMS, or app
alerts.
 If rejected, the message may include a brief explanation or next steps.
 If approved, the applicant may be prompted to review and accept the terms.
Benefit: Keeps the customer informed in real-time, reducing uncertainty.

8. Disbursement
For approved applications:
 The system triggers the fund disbursement workflow.
 Funds are transferred directly to the applicant’s bank account or digital wallet.
 A digital copy of the loan agreement is shared with the applicant.
Benefit: Enables same-day or instant loan funding, which improves customer satisfaction.
1. Faster Turnaround Time for Loan Approvals
 Explanation: In a traditional loan approval process, there can be delays due to manual
reviews, document verification, and waiting for inputs from various departments.
With automation, data integration, and AI-driven systems, loan applications can be
processed much faster.
 Impact:
o Efficiency Gains: Automated systems can quickly validate data, assess
eligibility, and make decisions without human intervention. For example, AI
can instantly assess a borrower's creditworthiness based on a variety of real-
time data points.
o Reduced Waiting Times: With faster decision-making, applicants get their
loan approval (or rejection) much quicker, which improves their experience.
o Competitive Advantage: A faster approval time can be a strong differentiator
for banks and lending institutions in a crowded market, attracting more
customers who need quick financial solutions.
2. Reduced Operational Costs
 Explanation: Manual processes, paper handling, and a high volume of human
intervention contribute significantly to operational costs. By leveraging technology,
financial institutions can streamline operations, reduce errors, and cut back on labor
and administrative costs.
 Impact:
o Automation: Tasks like data entry, document processing, and compliance
checks can be automated. This not only saves time but also reduces the need
for large administrative teams.
o Resource Allocation: By automating routine tasks, employees can focus on
higher-value activities, such as customer service or strategic decision-making.
o Cost Reduction: With fewer staff needed for manual tasks and fewer errors to
correct, operational costs are dramatically reduced.
3. Enhanced Decision Accuracy
 Explanation: Automated systems, particularly those based on artificial intelligence

and machine learning, rely on data-driven algorithms to make decisions. These
systems can analyze large amounts of data with great accuracy, helping to make
decisions based on objective factors rather than subjective judgment.
 Impact:
o Consistency: Automation ensures that every loan is evaluated based on the
same criteria, reducing human bias and subjective decision-making.
o Data-driven Insights: AI models can evaluate complex patterns in a
borrower’s financial behavior, improving the accuracy of predicting their
likelihood of repaying the loan.
o Risk Reduction: More accurate decision-making helps minimize defaults and
financial risks, improving the overall stability of the financial institution.
4. Higher Customer Satisfaction
 Explanation: Customers expect a smooth, fast, and transparent process when
applying for loans. A seamless digital experience that delivers fast decisions and
clear communication significantly boosts customer satisfaction.
 Impact:
o Convenience: Online loan applications and automated approvals provide
customers with a hassle-free experience, where they can track the status of
their application in real time.
o Transparency: Clear and immediate feedback on loan decisions helps to
manage customer expectations and reduces frustration.
o Personalization: AI-driven systems can offer tailored loan products or flexible
repayment options based on the customer's financial situation, leading to
higher satisfaction and loyalty.
5. Improved Compliance and Audit Readiness
 Explanation: Compliance with financial regulations and audit readiness are essential
for avoiding legal issues and maintaining operational integrity. Technology solutions
like automated tracking systems and real-time reporting can ensure that all loan
transactions are in line with regulatory requirements.
 Impact:
o Automated Tracking and Reporting: Automated systems can document every
step of the loan process, ensuring that a complete audit trail is available. This
includes tracking compliance with regulations, such as those related to KYC
(Know Your Customer), AML (Anti-Money Laundering), and other lending
standards.
o Real-time Alerts and Monitoring: AI systems can monitor transactions for
suspicious activities and flag them in real time, ensuring quicker responses to
compliance issues.
o Simplified Audits: With automated record-keeping and real-time reports,
audits become much easier and less time-consuming, as auditors can access
all required documents and compliance records at the push of a button.
Challenges in Loan Approval Automation
1. Data Privacy and Cybersecurity Threats
 Data Privacy Risks: Loan approval systems typically require sensitive personal and
financial data from applicants, including Social Security numbers, credit histories,
and bank account details. This data is valuable to cybercriminals, making it a prime
target for hacking attempts. Any breach of this data could have serious consequences
for both customers and financial institutions.
 Cybersecurity Threats: Automation systems are often interconnected with various
databases, third-party services, and cloud environments, increasing the number of
potential attack surfaces. Hackers can exploit vulnerabilities in the system to access
confidential information, steal identities, or manipulate loan data. Data encryption,
multi-factor authentication, and continuous monitoring are crucial to mitigating these
risks, but there’s no way to guarantee 100% security.
 Regulatory Compliance Issues: Financial institutions are bound by strict regulations
(such as GDPR in the EU or CCPA in California) that govern how personal data
should be collected, stored, and shared. Violating these regulations can lead to hefty
fines and damage to the institution's reputation.
2. Algorithm Bias and Fairness Issues
 Algorithm Bias: Machine learning (ML) algorithms used in loan approval can
unintentionally inherit biases present in historical data. If past lending decisions were
influenced by factors like race, gender, or socioeconomic status, the algorithm may
replicate or even amplify those biases. For example, an algorithm trained on biased
historical data might unfairly deny loans to certain demographic groups, leading to
discrimination.
 Fairness Concerns: There is also the concern that algorithms might disproportionately
favor certain groups based on factors like income, employment history, or credit
score. For instance, individuals from lower-income neighborhoods may be unfairly
disadvantaged, even if their ability to repay a loan is comparable to that of
individuals from wealthier areas.
 Transparency and Explainability: One of the key challenges with algorithmic bias is
the "black-box" nature of many AI models. It can be difficult for lenders to explain
why a loan application was denied, especially when the decision is made by a
complex machine learning model. This lack of transparency can result in distrust
from applicants and regulatory scrutiny.
 Mitigation: Institutions must take proactive steps to identify and mitigate bias in
algorithms. This involves using diverse and representative data sets, auditing
algorithms for fairness, and ensuring that algorithms are transparent and explainable
to both customers and regulators.
3. Integration Complexity with Legacy Systems
 Legacy Systems: Many financial institutions still rely on outdated legacy systems
that were not designed to handle modern automation technologies. Integrating new
loan approval automation systems with these older systems can be technically
challenging and expensive.
 Data Silos: Legacy systems often store data in siloed formats or databases that don’t
easily communicate with newer systems. This can make it difficult to efficiently
access and process the necessary data for automated loan approvals. Integrating
disparate systems might require significant reengineering or the development of
complex APIs to allow seamless data flow.
 Operational Risks: When integrating new systems with old ones, there’s a risk of
system failures or errors that could impact loan approval decisions. These errors can
cause delays, miscommunications, or even incorrect loan decisions, harming both
customers and the financial institution's reputation.
 Cost and Time: Legacy system integration is time-consuming and costly. It requires
significant technical resources and expertise, and there may be challenges in
achieving compatibility with modern automation tools. Some banks may opt for
complete system overhauls, which are both financially and logistically demanding.
4. Regulatory Compliance and Transparency
 Complex Regulations: The financial industry is heavily regulated, and loan approval
automation must comply with various local, national, and international laws.
Regulations such as the Equal Credit Opportunity Act (ECOA) in the U.S. prohibit
discrimination in lending, while others govern aspects like data privacy, consumer
protection, and anti-money laundering.
 Dynamic Regulatory Landscape: Financial regulations evolve frequently, and loan
automation systems must be able to adapt quickly. A change in law may require a
system update or a reconfiguration of algorithms to ensure compliance. Failing to
comply with regulatory changes can result in significant legal and financial penalties.
 Transparency and Auditing: Regulators require transparency in the decision-making
process, especially regarding why certain loans were approved or denied. Automated
loan approval systems must be designed to provide clear, auditable records that
explain their decisions. Without proper transparency, regulators may have difficulty
determining whether the system complies with fairness and anti-discrimination laws.
 Audit Trails and Documentation: Financial institutions must be able to produce audit
trails that demonstrate how loan decisions were made. This includes tracking which
data was used, which algorithms were applied, and how specific decisions were
reached. Building these audit trails requires careful planning and integration into
automated workflows.
5. Customer Trust in Fully Digital Systems
 Lack of Human Interaction: Many customers still prefer speaking to a human when
dealing with important financial decisions, such as loan approval. Automated systems
may feel impersonal or untrustworthy to some applicants, especially if they are not
familiar with the technology. The fear of being denied for reasons they don’t
understand can deter people from using fully digital systems.
 Transparency and Communication: When a loan application is automated, customers
may have difficulty understanding how decisions are made or why they were denied.
Clear communication and transparency are essential in building trust. Financial
institutions must ensure that customers can easily access information about their loan
status and understand the decision-making process.
 Perceived Security Risks: Customers may be hesitant to trust automated systems
because they worry about the security of their data. Without the reassurance that their
personal and financial information is being handled safely, they may resist using
fully digital systems.
 Building Trust: To foster trust, financial institutions must invest in customer
education and clearly explain how their automated systems work. Providing user-friendly interfaces, offering customer support, and ensuring clear, understandable
loan approval criteria are all steps toward building confidence in digital loan approval
systems.
AIM AND SCOPE

2.1 AIM OF PROJECT

The primary aim of the proposed system, “Customer Loan Prediction Analysis”, follows from the rapid growth of the banking sector: more and more people are applying for bank loans, and identifying the applicants whose loans should be approved is a difficult process. In this report, we propose a model that predicts loan approval or rejection for an applicant using machine learning techniques. This is done by training the model on the records of people who previously applied for loans.

For this, we first analyze the dataset downloaded from Kaggle, then preprocess the data using Exploratory Data Analysis, build the model on the training dataset, and apply that model to the test data to predict the output. Logistic Regression is used to determine the loan approval status of the applicants.

2.1 OBJECTIVES

The proposed model, “Customer Loan Prediction Analysis”, is based on Logistic Regression: rather than fitting a regression line, we fit an "S"-shaped logistic function, which predicts values between the two extremes 0 and 1. It is a useful algorithm because it provides probabilities, can classify different types of data, and easily identifies the most effective variables for classification.
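The S-shaped logistic (sigmoid) function described above can be sketched directly. This is a minimal illustration, not code from the report:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Scores far below zero map toward 0 (reject); scores far above zero toward 1 (approve).
for z in (-6, 0, 6):
    print(z, round(sigmoid(z), 4))
```

Note that the output approaches 0 and 1 but never reaches either exactly, which is why the predicted values can be read as probabilities.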

After that, we clean and process the dataset, study the data fields, and identify the criteria and variables that have a positive impact on loan approval. Several steps are involved in this process, including building the model, training it, and applying it to the test dataset.

Then, after implementation, we obtain an output.csv file containing the predicted approval status for each applicant, based on the model we built. The model serves two related prediction objectives:
1. Predict Loan Eligibility (Loan Approval Prediction)


Goal:
To predict whether a customer's loan application should be approved or rejected based on their
profile and financial background.

Why It Matters:
Banks and financial institutions need to automate loan approval decisions to reduce processing
time and maintain consistency, while minimizing the risk of bad loans.
How It Works:
 Input: Customer demographic and financial details (e.g., income, employment, credit
history).
 Output: Binary prediction – Approved or Not Approved.
 Algorithm: Supervised learning classification models (e.g., Logistic Regression, Random
Forest, XGBoost).
Business Impact:
 Faster processing of applications.
 Objective decision-making.
 Better customer experience.
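A minimal sketch of the approval-prediction objective described above, using a toy stand-in for the loan dataset (the column names follow this report's variable description; the values are invented for illustration only):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the Kaggle loan dataset.
df = pd.DataFrame({
    "ApplicantIncome": [5000, 1500, 8000, 2200, 6000, 1200],
    "Credit_History":  [1, 0, 1, 0, 1, 0],
    "Loan_Status":     [1, 0, 1, 0, 1, 0],  # 1 = Approved, 0 = Not Approved
})

X = df[["ApplicantIncome", "Credit_History"]]
y = df["Loan_Status"]
model = LogisticRegression().fit(X, y)

# Score a new applicant: high income and a clean credit history.
new_applicant = pd.DataFrame({"ApplicantIncome": [7000], "Credit_History": [1]})
pred = model.predict(new_applicant)
print("Approved" if pred[0] == 1 else "Not Approved")
```

Random Forest or XGBoost classifiers mentioned above would be swapped in at the `model` line; the fit/predict interface is the same in scikit-learn.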

2. Predict Credit Risk (Loan Default Prediction)


Goal:
To assess the likelihood of a customer defaulting (failing to repay) a loan once it's approved.
Why It Matters:
Lending institutions aim to minimize loan defaults, which can significantly affect profitability
and stability.
How It Works:
 Input: Customer repayment history, current debts, income, loan terms.
 Output: Binary prediction – Default or No Default.
 Algorithm: Risk scoring models or classification techniques.
Business Impact:
 Helps in setting risk-based interest rates.
 Enables preventive actions (like requesting collateral or adjusting loan terms).
 Reduces non-performing assets (NPAs).

This lets banks automate the loan approval process in an easier and faster way.

2.2 SCOPE

This model focuses on both the accuracy and the cross-validation score. A higher precision score decreases the lender's risk. By implementing it, banks can automate the processing of applicants' loan applications, saving time and reducing manual intervention.

A further goal is to improve the accuracy of the estimated logistic regression coefficients when the response variable is binary and there are many highly correlated categorical explanatory variables. For this purpose, we first employ categorical principal component analysis to deal with the multicollinearity among binary explanatory variables, and then use the uncorrelated principal components instead of the original correlated variables.

To implement this, the model is fed the data fields that most influence the loan status. To reduce ambiguity, the credit score factor takes only two unique values {0, 1}, where '0' represents a lower credit score and '1' a higher credit score. With this, we can predict the approval status of customers who applied for a loan at the respective banks.
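The {0, 1} credit-score reduction described above can be expressed as a simple mapping. The raw scores and the cut-off value below are assumptions for illustration; the report itself only specifies the two-value encoding:

```python
import pandas as pd

# Hypothetical raw credit scores; the report reduces the credit score factor
# to {0, 1}: 1 = higher credit score, 0 = lower credit score.
scores = pd.Series([420, 710, 680, 550, 800])
threshold = 650  # assumed cut-off, for illustration only

credit_flag = (scores >= threshold).astype(int)
print(credit_flag.tolist())  # [0, 1, 1, 0, 1]
```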

The main goal of our model is to allocate credit borrowers to two groups: a "good credit" group that is likely to repay its financial obligations, and a "bad credit" group with a high risk of default. Our sample consists of 195 SMEs with credit files distributed over five sectors of activity: agriculture, industry, services, tourism, and trade.

The objective of this paper is to present models for predicting the probability of default of the counterparty based on the scoring method, which pits discriminant analysis against logistic regression. Credit scoring is a method that helps the bank rationalize its credit-granting decision process. Its principle is to synthesize a set of financial ratios into one indicator able to distinguish between good and bad customers.

Past machine learning models have shown high accuracy but low precision. The precision score is the number of true positives divided by the total number of true positives and false positives. A low precision score means the model generates a large number of false positives, and generating more false positives than false negatives reduces the model's exactness and increases the risk factor. Credit risk analysis is based on various information about the borrower, which may be summarized as qualitative and quantitative data such as financial ratios. Since access to qualitative data is difficult, our study is based only on quantitative variables; it therefore becomes necessary to make a wise choice of financial ratios.
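The precision definition above (true positives over true plus false positives) can be computed directly; a short sketch with scikit-learn on invented labels:

```python
from sklearn.metrics import precision_score

# 1 = predicted/actual approval; labels are illustrative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# TP = 3 (positions 0, 2, 6); FP = 1 (position 4) -> precision = 3 / 4
print(precision_score(y_true, y_pred))  # 0.75
```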

2.1.1 ADVANTAGES

 Time saving as it is operated online.

 Automate the Loan approval process.

 This system lets the model predict accurately.

 This model decreases the risk and efforts of a lender.

 This process is based on digitalization.

 Useful for minimizing this risk without much effort from the lender.

 Reduces the Human Involvement in this process for banks

 Achieves a higher accuracy rate compared to other models.


 1. Improved Decision-Making
What it means:
Machine learning models analyze historical data and provide objective decisions,
reducing human bias and inconsistency.

Benefits:

Faster and more accurate loan approvals.

Data-driven risk assessment instead of intuition-based decisions.


Uniform lending standards across branches and teams.

 2. Reduced Default Rates


What it means:
Predictive models help identify high-risk customers before approving loans.
Benefits:
Prevents issuing loans to likely defaulters.
Reduces Non-Performing Assets (NPAs).
Maintains healthier loan portfolios.

 3. Operational Efficiency
What it means:
Automation of loan screening reduces the burden on loan officers and speeds up
processing time.
Benefits:
Cuts down manual effort in verifying applications.
Enables real-time or near-instant loan decisions.
Scales efficiently as loan volumes grow.

 4. Enhanced Customer Experience


What it means:
Quick, transparent, and fair decision-making improves customer trust and
satisfaction.
Benefits:
Reduces turnaround time for approvals.
Minimizes paperwork and back-and-forth.
Builds a positive reputation for the lender.

 5. Risk-Based Pricing and Personalization


What it means:
Lenders can offer personalized loan terms based on predicted risk profiles.
Benefits:
Safer customers may get lower interest rates.
Riskier customers can be offered loans with safeguards (e.g., collateral or higher
interest).
Maximizes profitability per customer segment.

 6. Compliance and Audit Trail


What it means:
Models can keep track of decision-making logic, aiding in regulatory compliance.
Benefits:
Provides documentation for why a loan was accepted or rejected.
Meets requirements for fairness and transparency.
Avoids discrimination or biased lending practices.

 7. Competitive Advantage
What it means:
Early adoption of AI-driven lending gives a company a lead over competitors still
using traditional methods.
Benefits:
Attracts tech-savvy and creditworthy customers.
Enables agile response to market changes.
Facilitates innovation in loan products and services.

 8. Better Portfolio Management


What it means:
Ongoing analysis helps lenders monitor their entire loan book for risks and trends.
Benefits:
Proactively manage emerging credit risks.
Adjust lending strategies based on economic indicators.
Improves financial health and planning.
SYSTEM DESIGN & METHODOLOGY

EXISTING SYSTEM
Machine Learning implementation is a very complex part in terms of Data analytics. Working
on the data which deals with prediction and making the code to predicit the future of out comes
from the customer is challenging part.All the other methods are simply making process way
complexed and not giving accurate Cross Validation score in predicting results.

Disadvantages of the Existing System


• Complexity in analyzing the data.
• Prediction is a challenging task within the model.
• Maintaining multiple methods makes the code complex.
• Library support was limited and unfamiliar.

3.1 PROPOSED SYSTEM


Python is well suited to data analytics and helps us analyze the data with better data science models. Python's libraries make it possible to predict on the loan data while considering all the relevant properties of the customer. Logistic Regression is used to build the model and to predict the results accurately.

Advantages of Proposed System:


• Libraries help to analyze the data.
• Statistics and prediction are easy compared to existing technologies.
• Results are more accurate compared to other methodologies.
3.2 SYSTEM CONFIGURATION

H/W System Configuration


Processor : Pentium 3, Pentium 4 or higher
RAM : 2GB/4GB
Hard disk : 40GB or higher

S/W System Configuration


Operating System : Windows 7, Windows 8 (or higher versions)
Language : Python 3.5
Browser : Mozilla Firefox (or any browser)
IDE : Jupyter Notebook

3.3 MODULES
The Customer Loan Prediction Analysis model consists of three main modules: Reading and Cleaning the Dataset, Model Building, and Testing the Dataset. The primary task here is to maintain the database required for analyzing the data in the form of .csv files: a train dataset and a test dataset. The user has access to both datasets to train and build the testing model.
First, we apply Exploratory Data Analysis (EDA) to spot mistakes in the collected dataset and map them to the underlying structure of the data. EDA also identifies the most important variables and fields in the dataset by visualizing graphs and boxplots using the Python libraries NumPy and Pandas. Listing anomalies and outliers comes next. Let us look at each module in detail.

Reading and Cleaning Dataset


Now we import the Pandas, NumPy, and sklearn libraries and use them to process the data, reading both the training and testing datasets into data frames with Pandas. Using the head() function (e.g. head(10) for the first 10 rows), we get a clear picture of the fields the dataset contains. We then record the number of rows and columns. The crucial part of this module is processing the data as required for the analysis, which means understanding the various features and columns of the dataset. We use the describe() function to get a summary of the numerical fields, which include Applicant income, Co-applicant income, Credit history, and Loan amount.
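The reading and inspection steps above can be sketched as follows. An in-memory CSV stands in for the Kaggle train file, and the column names follow this report's variable description:

```python
import io
import pandas as pd

# In-memory stand-in for train.csv; the values are illustrative.
csv_text = """Loan_ID,Gender,ApplicantIncome,LoanAmount,Credit_History,Loan_Status
LP001,Male,5849,130,1,Y
LP002,Male,4583,128,1,N
LP003,Female,3000,66,1,Y
LP004,Male,2583,120,0,N
"""
train = pd.read_csv(io.StringIO(csv_text))

print(train.head())       # first rows: a quick look at the fields
print(train.shape)        # (rows, columns)
print(train.describe())   # summary statistics of the numeric columns
```

Against the real files, `pd.read_csv("train.csv")` and `pd.read_csv("test.csv")` replace the in-memory buffer.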

For the non-numerical values (e.g., Property Area, Credit History), we can look at the frequency distribution to check whether they make sense, obtaining the unique values and their frequencies for a variable such as Property Area. The distribution of numerical variables like Applicant Income and Loan Amount can be understood using boxplots, which also reveal outliers in the dataset fields. We see no substantial difference between the mean incomes of graduates and non-graduates, but there is a higher number of graduates with very high incomes, which appear as outliers. After this, we can identify the fields that have a positive impact on the loan approval status and draw conclusions from the analysis.
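The boxplot inspection described above can be sketched with Matplotlib on invented income figures (the real plot uses the dataset's ApplicantIncome column grouped by Education):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative incomes: a few graduates with very high incomes will
# show up as outliers above the box's upper whisker.
df = pd.DataFrame({
    "Education": ["Graduate"] * 6 + ["Not Graduate"] * 5,
    "ApplicantIncome": [3000, 3500, 4000, 4200, 25000, 30000,
                        2800, 3200, 3600, 3900, 4100],
})

df.boxplot(column="ApplicantIncome", by="Education")
plt.savefig("income_boxplot.png")
```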
Loan Amount has missing as well as extreme values, while Applicant Income has a few extreme values. Next we examine the distribution of the categorical variables.
By making a cross table of the Credit History and Loan Status fields, we see that loan approvals in the training dataset are more frequent for applicants whose credit history equals 1. We then write a function to find the percentage of applicants with credit history 1 whose loans were approved; it shows that more than 79% of people with a credit history of 1 got loans. Next we examine outliers in more detail in the following module to build the model.
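The cross-tabulation and approval-percentage steps above can be sketched with pandas. The records below are invented; on the real train set the report states the figure exceeds 79%:

```python
import pandas as pd

# Illustrative records of credit history against loan outcome.
df = pd.DataFrame({
    "Credit_History": [1, 1, 1, 1, 0, 0, 1, 0, 1, 1],
    "Loan_Status":    ["Y", "Y", "Y", "N", "N", "N", "Y", "Y", "Y", "N"],
})

# Cross table of credit history against loan status.
ct = pd.crosstab(df["Credit_History"], df["Loan_Status"])
print(ct)

# Share of applicants with Credit_History == 1 whose loans were approved.
with_history = df[df["Credit_History"] == 1]
approval_rate = (with_history["Loan_Status"] == "Y").mean()
print(f"{approval_rate:.0%}")  # 71% on this toy sample
```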

Module 2 – Model Building


A value that deviates significantly from the rest of the values in a feature is referred to as an outlier. Outliers can be caused by measurement or execution errors, and analyzing the outlying data points is referred to as outlier analysis. Removing or reducing outliers is important when the feature in question strongly affects the dependent variable, i.e. when the feature is important for the output. In this project we applied the box plot methodology to the numerical features. Exploratory Data Analysis graphically represents the features in the dataset to discover patterns and anomalies and to derive the insights and assumptions needed for full knowledge of the data before building a model. EDA is all about making sense of the data in hand before getting our hands dirty with it. What we did earlier in this project also comes under EDA.
In this phase, we discuss the insights in more depth for better clarification. Feature analysis showed that the highest counts were for male, married, graduate, not self-employed, and semi-urban applicants. However, we applied the statistics one variable at a time, so we were not sure how these features affect the target variable Loan Status when grouped together, nor whether we should remove the outliers in the numerical features. Since sklearn requires all inputs to be numeric, we must convert all our categorical variables into numeric form by encoding the categories; before that, we fill all the missing values in the dataset. To prepare and apply a model to this dataset, we first have to break it into two subsets.
The first will be the training set on which we will develop the model. The second will
be the test dataset which we will use to test the accuracy of our model. We will allocate 75%
of the items to Training and 25% items to the Test set. Once our dataset has been split, we
can establish a baseline model for predicting whether a credit application will be approved.
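The 75/25 split described above can be done with scikit-learn's `train_test_split`; a minimal sketch on stand-in arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)   # stand-in feature matrix
y = np.array([0, 1] * 10)          # stand-in approval labels

# 75% training, 25% test; random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(len(X_train), len(X_test))   # 15 5
```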
This baseline model will be used as a benchmark to determine how effective the
models are. First, we determine the percentage of credit applications that were approved in the training set. We can see from the summary output that the Debt variable has missing values that we will have to fill in. We could simply use the mean of all the existing values to do so.
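Mean imputation as just described is a one-liner in pandas. The sketch below uses LoanAmount as a stand-in column with invented values:

```python
import numpy as np
import pandas as pd

# A numeric column with missing entries (NaN), standing in for the
# Debt variable mentioned above; values are illustrative.
loans = pd.Series([130.0, np.nan, 66.0, 120.0, np.nan], name="LoanAmount")

# Fill the missing values with the mean of the existing values.
filled = loans.fillna(loans.mean())
print(filled.tolist())  # NaNs replaced by (130 + 66 + 120) / 3
```

Median imputation works the same way with `loans.fillna(loans.median())`, and is less sensitive to extreme values.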
Another method would be to check the relationships among the numeric values and use a linear regression to fill them in. Regression models are useful for predicting continuous (numeric) variables. However, the target value in Approved is binary and can only take the values 1 or 0: the applicant is either approved or denied, and there is no partial approval.
We could use linear regression to predict the approval decision with a threshold, assigning anything below it to 0 and anything above it to 1. Unfortunately, the predicted values could fall well outside the expected 0-to-1 range, so linear or multivariate regression will not be effective here. Instead, logistic regression is more useful because it produces the probability that the target value is 1. Probabilities are always between 0 and 1, so the output will match the target value range more closely than linear regression.
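The probability output described above comes from `predict_proba` in scikit-learn; a toy sketch on one invented feature:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One toy feature; class 1 for large values, class 0 for small ones.
X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# predict_proba returns [P(target == 0), P(target == 1)]; both lie in (0, 1).
proba = model.predict_proba([[9.0]])[0, 1]
print(round(proba, 3))
```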

We convert all non-numeric values to numbers and create a generic classification function for assessing performance.
Logistic regression is a supervised classification algorithm similar to linear regression; the difference is that logistic regression uses the logistic (sigmoid) function, an S-shaped curve [5]. The sigmoid function can take any real-valued number and maps it to a value between 0 and 1, never exactly 0 or 1. The chances of getting a loan are higher for applicants with a credit history of 1, higher applicant and co-applicant incomes, higher education, and property in urban areas. So we build our model with 'Credit History', 'Education', and 'Gender'.
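Because sklearn needs numeric inputs, the categorical fields named above can be label-encoded; a sketch on an invented frame:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two of the categorical fields (values illustrative).
df = pd.DataFrame({
    "Gender":    ["Male", "Female", "Male", "Female"],
    "Education": ["Graduate", "Not Graduate", "Graduate", "Graduate"],
})

# Encode each categorical column into integer codes (alphabetical order).
for col in ["Gender", "Education"]:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)  # Gender: Female=0, Male=1; Education: Graduate=0, Not Graduate=1
```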

Module 3 – Testing Data set with Model


Model building is completed by combining the training and test datasets and applying a log transformation to dampen the effect of outliers. We then use the model on the test dataset to predict the output for the applicants, reverse-encoding the combined data to obtain the outcome. The final output is stored locally as Output.csv, and both the accuracy rate and the cross-validation score come out higher than in other models, at more than 80%. The results in Output.csv can be used by banks for loan approvals and for processing applicants. So finally, we have built a model and used it to predict the output for this loan approval process.
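The cross-validation score mentioned above can be computed with scikit-learn's `cross_val_score`; a sketch on synthetic, linearly separable stand-in data rather than the report's actual features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared loan features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy approval labels

# 5-fold cross-validation accuracy of the logistic regression model.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean())
```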

MODEL DEVELOPMENT METHODOLOGY

4.1 Description of Diagram

Fig 4.1 Block Diagram

An effective model was proposed for predicting the right customers among those who applied for a loan. In our proposed model we used logistic regression, a popular machine learning algorithm that comes under the supervised learning technique. It applies to a categorical dependent variable given a set of independent variables, so the outcome must be a categorical or discrete value. The output can be Yes or No, 0 or 1, true or false, etc., but instead of giving the exact value 0 or 1, it gives probabilistic values that lie between 0 and 1. Logistic regression is much like linear regression except in how they are used: linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems. The input dataset is the bank dataset of customers who applied for the loan, stored as a CSV file. The dataset can be read into the Python environment using the read_csv() method in pandas; for that, we import pandas into the current Python environment. The features of the

Customers dataset are: Loan_ID, Gender, Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History, Property_Area, and Loan_Status. Missing values count as noisy data in the dataset: if no information was provided when the data was collected, the corresponding entries are referred to as missing values. Missing data is represented as NaN or None in pandas; None is Python's singleton object used to denote missing data in code. Mean imputation replaces null values with the mean of the column; median imputation replaces them with the median. A value that deviates significantly from the rest of the values in a feature is referred to as an outlier. Outliers can be caused by measurement or execution errors, and analyzing them is referred to as outlier analysis. Removing or reducing outliers is important when the feature strongly affects the dependent variable, i.e. when the feature is important for the output. Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive.

Steps in Logistic Regression:


For implementing the Logistic Regression using Python, we had done the following steps:
1. Data Pre-processing.
2. Fitting Logistic Regression to the Training set.
3. Predicting the test result.
4. Test accuracy of the result.
5. Visualizing the test set result.
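The five steps above can be strung together in one short sketch. The frame is a toy stand-in for the Kaggle CSV (column names follow the report's variable description); step 5, visualization, is omitted here:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: data pre-processing (toy frame; real data comes from the Kaggle CSV).
df = pd.DataFrame({
    "ApplicantIncome": [5000, 1500, 8000, 2200, 6000, 1200, 7000, 1800],
    "Credit_History":  [1, 0, 1, 0, 1, 0, 1, 0],
    "Loan_Status":     [1, 0, 1, 0, 1, 0, 1, 0],
})
X = df[["ApplicantIncome", "Credit_History"]]
y = df["Loan_Status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 2: fit logistic regression to the training set.
model = LogisticRegression().fit(X_train, y_train)

# Step 3: predict the test result.
y_pred = model.predict(X_test)

# Step 4: test accuracy of the result.
acc = accuracy_score(y_test, y_pred)
print(acc)
```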

Based on the data given by the loan applicant, we can predict whether the loan of a particular applicant is approved or not through a user interface. The user interface contains the input variables with their corresponding fields and a field to display the output. The input variables are Gender, Marital Status, Dependents, Education, Applicant Income, Loan Amount, Loan Amount Term, Credit History, and Property Area. The applicant needs to supply these values, and based on them the model predicts whether the loan will be approved.

As per our analysis, the applicant's Credit History, Income, Education, Gender, and Property Area have the most impact, so we created a model using the above steps with these fields.

4.1 ARCHITECTURE DIAGRAM

Fig 4.2 ARCHITECTURE DIAGRAM

4.2 IDENTIFYING THE ACTORS


The actors in the system are the Loan Approval Authority, the Applicant, and the built system, which maintains and exports data and preprocesses it when building the model. The main goal of the Loan Approval Authority is to obtain the predicted status of granting the loan or not, based on the details given by the applicant, by using the system.

4.3 SYSTEM IMPLEMENTATION


The input dataset is the bank's dataset of customers who applied for a loan. The dataset is a CSV file and can be read into the Python environment using the read_csv() method in pandas; for that, pandas must first be imported into the current Python environment. Let us look at the variable description of each field in the dataset.


Variable            Description
Loan_ID             Unique Loan ID
Gender              Male/Female
Married             Applicant married (Y/N)
Dependents          Number of dependents
Education           Graduate/Non-Graduate
Self_Employed       Self-employed (Y/N)
ApplicantIncome     Applicant income
CoapplicantIncome   Co-applicant income
LoanAmount          Loan amount in thousands
Loan_Amount_Term    Term of loan in months
Credit_History      Credit history meets guidelines
Property_Area       Urban/Semi-urban/Rural
Loan_Status         Loan approved (Y/N)

Various steps are involved in this process. The packages used are Pandas, NumPy, Matplotlib and scikit-learn.
Pandas: Pandas is mainly used for data analysis. It allows importing data from various file formats such as comma-separated values, JSON, SQL and Microsoft Excel, and supports data manipulation operations such as merging, reshaping and selecting, as well as data cleaning and data wrangling.
NumPy: NumPy is a library for the Python programming language, adding support for large,
multi-dimensional arrays and matrices, along with a large collection of high-level

mathematical functions to operate on these arrays.
Matplotlib: Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits, along with advanced mathematical plotting features.
Scikit-learn: Scikit-learn is a free software machine learning library for the Python
programming language. It features various classification, regression and clustering algorithms
including support vector machines.

Logistic Regression: Regression models are useful for predicting continuous (numeric) variables. However, the target value Approved is binary and can only take the values 1 or 0: the application is either approved or denied; there is no partial approval. We could use linear regression to predict the approval decision with a threshold, assigning anything below it to 0 and anything above it to 1. Unfortunately, the predicted values could fall well outside the expected 0-to-1 range, so linear or multivariate regression is not effective for predicting these values. Instead, logistic regression is more useful because it produces the probability that the target value is 1. Probabilities always lie between 0 and 1, so the output matches the target value range more closely than linear regression.

The sigmoid function is a mathematical function used to map predicted values to probabilities. It maps any real value to a value between 0 and 1; the output cannot exceed these limits, which gives the mapping its characteristic "S"-shaped curve. The S-shaped curve is also known as the sigmoid function or the logistic function. In logistic regression, we use a threshold value: probabilities above the threshold tend to 1, and values below the threshold tend to 0.
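The sigmoid mapping and the thresholding described above can be written in a few lines (the 0.5 threshold is the common default, as used here):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Probabilities are then thresholded (commonly at 0.5) to give the 0/1 decision.
threshold = 0.5
for z in (-4.0, 0.0, 4.0):
    p = sigmoid(z)
    print(z, round(p, 3), int(p >= threshold))
```

Large negative inputs map toward 0, large positive inputs toward 1, and z = 0 maps to exactly 0.5.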

FIG 4.4 Logistic Regression Model Graph

4.4 MODULAR DESIGN


To automate this loan approval process, we need to follow the steps mentioned below.

4.4.1 Collecting Data


Step 1: Download the test and train datasets from the Kaggle website.
Step 2: Understand the fields of the dataset using the data description.
Step 3: Import the NumPy and Pandas libraries.
Step 4: Read both datasets.
Step 5: Use the head() function to check that the datasets actually loaded.
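A minimal sketch of steps 4–5. In the project the files would be the downloaded train/test CSVs; here a tiny CSV is simulated in memory so the snippet is self-contained (the column names are assumed from the dataset description):

```python
import io
import pandas as pd

# In practice: train = pd.read_csv("train.csv"); here we simulate the file.
csv_text = """Loan_ID,Gender,ApplicantIncome,Loan_Status
LP001,Male,5849,Y
LP002,Female,4583,N
LP003,Male,3000,Y
"""
train = pd.read_csv(io.StringIO(csv_text))

print(train.head())   # quick check that the data actually loaded
print(train.shape)    # (rows, columns)
```

The same call with the test file path reads the test dataset.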

4.4.2 Understanding the Various Features (Columns) of the Dataset


Step 1: Use the describe() function to get a summary of the training dataset.
Step 2: Get the unique values and their frequencies for the variable Property Area.
Step 3: Understand the distribution of numerical variables such as Applicant Income and Loan Amount by drawing box plots.
Step 4: The box plot is drawn to understand the distribution and find outliers.
Step 5: Find the outliers from the box plots by Education and Gender.
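The box-plot steps above can be sketched as follows, on a synthetic stand-in for the training data (the values are illustrative only):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import pandas as pd

# Synthetic stand-in for the training data.
df = pd.DataFrame({
    "ApplicantIncome": [2500, 3000, 4100, 4500, 5000, 5800, 39000, 51000],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate",
                  "Not Graduate", "Graduate", "Graduate", "Graduate"],
})

# Overall distribution, then the same variable split by a categorical field.
ax1 = df.boxplot(column="ApplicantIncome")
ax2 = df.boxplot(column="ApplicantIncome", by="Education")
```

Points plotted beyond the whiskers on these axes are the outliers discussed in the text.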

4.4.3 Understanding the Distribution of Categorical Variables


Step 1: Build a cross table for Credit History and Loan Status in the train dataset.
Step 2: Write a function to output row-wise percentages in the cross table.
Step 3: Observe that the loan approval rate for customers having a credit history (1) is 79%.
Step 4: Nullify the outliers of Loan Amount and Applicant Income.
Step 5: This can be done with a log transformation after combining both Applicant and Co-applicant Income.
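The cross table and the log transformation can be sketched on a small illustrative frame (values invented, not the project's actual data):

```python
import numpy as np
import pandas as pd

# Illustrative training slice.
df = pd.DataFrame({
    "Credit_History": [1, 1, 1, 1, 0, 0],
    "Loan_Status":    ["Y", "Y", "Y", "N", "N", "N"],
    "ApplicantIncome":   [5849, 4583, 3000, 2583, 6000, 2333],
    "CoapplicantIncome": [0, 1508, 0, 2358, 0, 1516],
})

# Row-wise approval fractions by credit history.
ct = pd.crosstab(df["Credit_History"], df["Loan_Status"], normalize="index")

# Combine incomes, then log-transform to dampen the effect of extreme values.
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
df["TotalIncome_log"] = np.log(df["TotalIncome"])

print(ct)
```

On the real dataset the analogous row for `Credit_History == 1` shows the ~79% approval rate mentioned above.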

4.4.4 Data Preparation for Model Building and Generic Classification


Step 1: sklearn requires all inputs to be numeric, so we convert all our categorical variables into numeric ones.
Step 2: Fill all the missing values in the dataset.
Step 3: Import models from the scikit-learn module.
Step 4: Write a generic function for building a classification model and assessing performance.
Step 5: Perform k-fold cross-validation with 5 folds.
Step 6: Train the algorithm using the predictors and target.
Step 7: Fit the model again so that it can be referred to outside the function.
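Steps 1–2 can be sketched as follows, on an illustrative frame with a missing value (the columns and values are invented for the example):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative frame with a missing value and categorical columns.
df = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male"],
    "Education": ["Graduate", "Not Graduate", "Graduate", "Graduate"],
})

# Fill missing values (here: with the column mode), then encode to numeric,
# since scikit-learn estimators require numeric inputs.
for col in df.columns:
    df[col] = df[col].fillna(df[col].mode()[0])
    df[col] = LabelEncoder().fit_transform(df[col])

print(df.dtypes.tolist())  # every column is now an integer type
```

LabelEncoder assigns integers to the alphabetically sorted categories, so the mapping is deterministic for a given set of values.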

4.4.5 Model Building and Testing


Step 1: We have chosen logistic regression for model building.
Step 2: The chances of getting a loan are higher for:

 Applicants having a credit history (we observed this in exploration).
 Applicants with higher applicant and co-applicant incomes.
 Applicants with a higher education level.
 Properties in urban areas with high growth prospects.

So let's make our model with 'Credit_History', 'Education' & 'Gender'.


Step 3: Create the logistic regression object.
Step 4: Train the model using the training sets.
Step 5: Reverse the encoding for the predicted outcome.
Step 6: Store the output and view the accuracy rate and cross-validation score.

Accuracy Rate: the percentage of correct predictions for a given dataset.

Cross-Validation Score: a statistical method used to estimate the skill of models.
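Both metrics can be computed as shown in this minimal sketch on synthetic encoded data (the features stand in for fields like Credit_History; they are not the report's actual dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic encoded predictors (e.g. Credit_History, Education, Gender).
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(120, 3)).astype(float)
y = X[:, 0].astype(int)  # approval follows the first feature in this toy data

model = LogisticRegression()
model.fit(X, y)

accuracy = model.score(X, y)                          # accuracy rate
cv_score = cross_val_score(model, X, y, cv=5).mean()  # 5-fold CV score

print(round(accuracy, 3), round(cv_score, 3))
```

On the real data, the same two numbers are the 80.945% accuracy rate and 80.946% cross-validation score reported in the results.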

RESULTS AND DISCUSSION

Finally, using the logistic regression model we predict whether the loan is approved or not. To implement this, various input variables were used to produce the output. Whenever the program takes the input data, it gives the output in binary form, i.e. either 0 or 1. If the output is 1, then '1' is displayed, indicating that the loan is approved; if the output is 0, then '0' is displayed, indicating that the loan is not approved. Here, we have implemented a loan credibility prediction system that helps organizations make the right decision to approve or reject the loan requests of customers. This will definitely help the banking industry to open up efficient delivery channels. In this model, the logistic regression algorithm is used for the prediction.

Fig 5.1 Boxplot for Variable Applicant Income of Training Data set
The above Box Plot confirms the presence of a lot of outliers/extreme values. This can be
attributed to the income disparity in the society.

Fig 5.2 Box plot for understanding the distributions and observing the outliers
Fig 5.2 above shows that more outliers are present for male applicants than for female applicants.

Fig 5.3 Box plot for both Applicant Income by Education


In Fig 5.3, we can see that there is no substantial difference between the mean incomes of graduates and non-graduates. However, there is a higher number of graduates with very high incomes, which appear to be the outliers.

Fig 5.4 Box plot for Applicant Income before and after log transformation
The extreme values are practically possible, i.e. some people might apply for high-value loans due to specific needs. So instead of treating them as outliers, we apply a log transformation to nullify their effect. We also combine Applicant and Co-applicant Income for better results.

Fig 5.5 Data Preparation for Model Building


In Fig 5.5, since sklearn requires all inputs to be numeric, we convert all our categorical variables into numeric ones by encoding the categories. Before that, we fill all the missing values in the dataset.

Fig 5.6 Accuracy Rate and Cross Validation Score

In Fig 5.6 we can see that the accuracy rate is 80.945% with a cross-validation score of 80.946%, which compares favourably with recent models based on other machine learning algorithms. These values indicate how accurate this model is.

Fig 5.7 Output.csv File


The generated output is stored on the local disk with the loan status, as shown in Fig 5.7.

CONCLUSION AND FUTURE WORK

CONCLUSION
From a proper analysis of the positive points and constraints of the component, it can be safely concluded that the product is a highly efficient component. This application works properly and meets all the banker's requirements. The component can easily be plugged into many other systems. There have been a number of cases of computer glitches and errors in content, and, most importantly, the weights of the features are fixed in the automated prediction system. So, in the near future, the software could be made more secure and reliable, with dynamic weight adjustment.
Finally, using the logistic regression model we predict whether the loan is approved or not. To implement this, various input variables were used to produce the output. Whenever the program takes the input data, it gives the output in binary form, i.e. either 0 or 1. If the output is 1, then '1' is displayed, indicating that the loan is approved; if the output is 0, then '0' is displayed, indicating that the loan is not approved.
In this paper, data preprocessing and transformation techniques are applied, and results are generated by implementing analytical models. The performance is analyzed using the confusion matrix. We can also use this model to make detailed testing selections. Any credit application that does not have the same outcome as predicted by the model is a potential audit exception. The inherent risk is that a credit card was issued to someone who should have been denied. Such an account is more likely to default than a properly approved account, which, in turn, exposes the company to loss.

FUTURE WORK
Here, we have implemented a loan credibility prediction system that helps organizations make the right decision to approve or reject the loan requests of customers. This will definitely help the banking industry to open up efficient delivery channels. In this model, the logistic regression algorithm is used for the prediction. Other techniques that outperform popular data mining models have yet to be implemented and tested for this domain.
The inherent risk is that a credit card was issued to someone who should have been denied. Such an account is more likely to default than a properly approved account, which, in turn, exposes the company to loss. Different machine learning models can be implemented and their performance compared. We can also carry out this automation using neural networks, Naive Bayes and other machine learning algorithms, which is work in progress.

REFERENCES

[1] Sudhamathy G and Jothi Venkateswaran, "Analytics Using R for Predicting Credit Defaulters", IEEE International Conference on Advances in Computer Applications (ICACA), 978-1-5090-3770-4, 2016.
[2] M. Sudhakar and C.V.K. Reddy, "Two Step Credit Risk Assessment Model for Retail Bank Loan Applications Using Decision Tree Data Mining Technique", International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), vol. 5, no. 3, pp. 705–718, 2016.
[3] J.H. Aboobyda and M.A. Tarig, "Developing Prediction Model of Loan Risk in Banks Using Data Mining", Machine Learning and Applications: An International Journal (MLAIJ), vol. 3, no. 1, pp. 1–9, 2016.
[4] Z. Somayyeh and M. Abdolkarim, "Natural Customer Ranking of Banks in Terms of Credit Risk by Using Data Mining: A Case Study: Branches of Mellat Bank of Iran", Jurnal UMP Social Sciences and Technology Management, vol. 3, no. 2, pp. 307–316, 2015.
[5] A.B. Hussain and F.K.E. Shorouq, "Credit risk assessment model for Jordanian commercial banks: Neural scoring approach", Review of Development Finance, Elsevier, vol. 4, pp. 20–28, 2014.
[6] T. Harris, "Quantitative credit risk assessment using support vector machines: Broad versus narrow default definitions", Expert Systems with Applications, vol. 40, pp. 4404–4413, 2013.
[7] Dileep B. Desai and R.V. Kulkarni, "A Review: Application of Data Mining Tools in CRM for Selected Banks", International Journal of Computer Science and Information Technologies (IJCSIT), vol. 4, no. 2, pp. 199–201, 2013.
[8] Gang Wang and Jian Ma, "Study of corporate credit risk prediction based on integrating boosting and random subspace", 2011.
[9] Hussain Ali Bekhet and Shorouq Fathi Kamel Eletter, "Credit risk assessment model for Jordanian commercial banks: Neural scoring approach", April 2014.
[10] M. Yaghini, T. Zhiyan and M. Fallahi, "A Prediction Model for Recognition of Bad Credit Customers in Saman Bank Using Neural Networks", 2011.

APPENDIX

A. PLAGIARISM REPORT
Customer Loan Approval Prediction using Logistic
Regression

ABSTRACT

In the banking sector, a loan is a process of lending or borrowing a sum of money by one or more individuals, organizations, etc. from banks. The person who borrows that money from the respective financier incurs a debt and is responsible for paying back the money, with the interest decided by the bank, within a certain period. Generally, what banks look into before approving a loan is the credit history, credit loss and income of the applicant. So, loans play a major role in a bank's income. Due to rapid urban development, the number of people applying for loans has increased rapidly. As a result, finding the applicants to whom a loan can be approved has become a complex process. In this paper, we want to automate the loan eligibility process (in real time) based on customer details. The fields required are Marital Status, Income, Education, Gender, Number of Dependents, Loan Amount, Loan Amount Term, Credit History and others. To predict the status, we will use logistic regression to spot the customers who are eligible for the loan amount, so that the bank can reach out and grant loans to those people who can pay back in the given time.

Keywords: Loan, Exploratory Data Analysis, Prediction, Logistic Regression

Introduction

The banking industry plays a significant role at present, mostly in developing countries where money is usually required by everyone, so banks can increase their market capital value by gaining profits. Banks allow their customers to save money in individual accounts and then lend money to business people or others who can utilize it for their capital growth, meet their business requirements, and pay it back to the bank within a specific period of time, including the interest amount. Interest is thus the profit gained by banks by giving loans to people who are in need. But banks are worried about whether the person whose loan got granted will be able to pay back the loan amount or not. In order to predict this, they basically inspect things associated with the applicant, like the credit score and applicant income. The credit score plays the key role in the information given by the customer, and in most scenarios a credit score is required for loan sanction. If the applicant does not pay back his loan amount, then eventually his credit score will automatically decrease. Giving loans to people is one of the main business strategies for almost every bank, and banks get most of their profits from loans in the form of interest. The main goal of bank authorities is to grant loans to trustworthy people, so that they will pay back by the deadline. In recent times, banks approve loans for their customers after a step-by-step procedure, but there is still no guarantee of whether the applicant's loan should be granted or not. To approve a loan, banks have to estimate the risk involved in the application, which is important for them, as they cannot lend money to those who cannot afford to pay back in time, which would affect their economic status in this hugely competitive market. So, we have collected a dataset from Kaggle containing loan applicants' details with various fields like Gender, Applicant Income, Credit History, etc.

After doing an exploratory data analysis on these datasets, we discovered that the probability of getting a loan granted is higher for applicants who have a credit history equal to '1', with greater applicant income, an education level of mostly graduate, and who live in urban areas with properties. So, we need a model based on the fields Credit History, Education and Gender of the applicant. This model is developed using logistic regression. We will use this model to test another dataset, and the results obtained are stored in another file with the predicted approval status as 'Yes' or 'No'. We have chosen logistic regression because it gives an accuracy rate of approximately 80.945%.

1. Literature Review

In Somayyeh. Z et al [1], a model was proposed for identifying and predicting the right customers who have applied for a loan. A decision tree is applied to estimate the traits, where the accuracy rate is not very appreciable.

In Sudhamathy G et al [2], banks hold huge volumes of customer information from which they are unable to judge. Data mining is a promising area of analysis which aims to extract useful knowledge from large datasets. This work aims to develop a decision tree-based classification model which uses the functions available in the R package. Before building the model, the dataset is processed and made ready to provide effective predictions. The final model is used to predict on the test dataset, and the experimental results show the efficiency of the model.

In Dileep. B et al [3], data analysis was done using techniques like Bayes classification, decision trees, boosting, bagging, random forests, etc. Techniques like support vector machines, logistic regression, the k-means algorithm, neural networks and the perceptron model are combined in this model. The accuracy rate for each of these techniques is studied, and the results show that the overall performance is very good in terms of accuracy.

In Allen et al [4], the authors studied the economic effects of small business credit scoring (SBCS), with both high average prices and higher risk levels for small business credits below $100,000. These findings are consistent with a net increase in lending to risky "marginal borrowers" who would not otherwise receive credit, but who pay relatively high prices when they are funded. The authors also find that a) bank-specific and industry-wide learning curves are very important, b) SBCS effects differ for banks that follow "rules", and c) SBCS effects differ for slightly larger credits.

In Altman, E. I. [5], the paper provides some empirical results of a study considering financial ratios as predictors of Japanese corporate failure. A few empirical studies related to corporate bankruptcy in Japan have been taken into the analysis. However, the results of later studies are not generalizable due to the limited data provided. In contrast, the model proposed here is independent of industry zone and size. This study shows that the model can predict bankruptcy with approximately 86.14% accuracy, independent of industry and size.

In J.H. Aboobyda et al [6], a model for classifying loan risk in the banking sector is implemented using data mining algorithms. The model has been prepared using data from the banking sector to predict loan status. Three algorithms have been used to build the proposed model: J48, Naive Bayes and Bayes Net. Using the Weka application, the model has been implemented and tested. The results have been discussed, and a comparison between these three algorithms was conducted; J48 was selected as the best of the three based on accuracy.

In A.B. Hussain et al [7], two data mining models were created for calculating the credit score that helps in the decision making of giving loans at banks for Jordanian citizens. This paper helps to find an accurate way to identify the right customers. By accuracy rate, the regression model is found to perform better than the radial basis function model.

In T. Harris [8], this work tries to use the probability of default as a tool to measure credit risk in a Tunisian bank. A scoring model was built with the traditional technique of logistic regression (LR) and with artificial intelligence techniques, i.e. ANN and SVM. A comparison was then made between these models using performance metrics like the confusion matrix and the area under the ROC curve (AUC) to identify the most efficient model. The results show that the radial basis function kernel SVM was the best-performing method in terms of accuracy, sensitivity and specificity, with the smallest error rates. Thus, in the Tunisian context, this model is worth implementing in banking institutions in order to boost their credit risk management measures to monitor and control credit.

In Charles Kwofie et al [9], this study shows the performance of logistic regression in predicting the probability of default using data from a microfinance company. A logistic multivariate analysis was conducted to predict the default status of loan beneficiaries, using 90 sampled beneficiaries for model building and 40 out-of-sample beneficiaries for prediction. Age, legal status, gender, number of years of education, number of years in business and base capital were used as predictors. The predictors that were significant in the model were legal status, number of years in business and base capital. The variability in the response variable explained by the logistic regression was very weak.
2. Proposed Work

Python is a good fit for data analytics and helps us analyse the data with better models in data science. The libraries in Python make the prediction for the loan data and give results on multiple terms, considering all the properties of the customer in the prediction. Logistic regression is deployed to create the model and used to get the output by predicting results accurately. Credit history is a method that helps the bank rationalize its process for credit-granting decisions. Its principle is to synthesize a group of financial ratios that together are able to distinguish between good and bad customers. For predicting results, we have collected datasets from Kaggle. In order to build this model, we will import some Python packages which will be used to analyse the datasets: the NumPy, Pandas, Matplotlib and sklearn libraries. All these libraries are available in Python.

2.1 Dataset Description

There are two datasets required to build this model: 1) a train dataset and 2) a test dataset. The test dataset contains the list of customers who applied for a loan. Using the train dataset, we can train the model and use it to predict the loan status for the test dataset. The datasets are in CSV format, and in Python, Pandas is used to read them. Refer to the table below for the field variables of the dataset.

Variable             Description
Loan_ID              Unique Loan ID number
Gender               Male/Female
Married              Applicant married (Y/N)
Dependents           Number of dependents
Education            Graduate/Under Graduate
Self_Employed        Self-employed (Y/N)
Applicant_Income     Applicant income
Coapplicant_Income   Co-applicant income
Loan_Amount          Loan amount in thousands
Loan_Amount_Term     Term in months
Credit_History       Credit history meets guidelines
Property_Area        Urban/Rural
Loan_Status          Loan approved (Y/N)

2.2 System Requirements

Software Requirements
 Windows XP, Windows 7 or Windows 10
 Python 3.5, Mozilla Firefox (or any browser)
 Jupyter Notebook IDLE

Hardware Requirements
 Minimum HDD: 20 GB
 Random Access Memory: 512 MB
 i3 processor-based computer or higher

3. Software Methodology

After reading the dataset, we will carry out exploratory data analysis to understand the data and find the outliers in the dataset.

3.1 Block Diagram

[Block diagram] The diagram shows the steps involved in building this model.

3.2 Modules

The customer loan approval prediction model has mainly three modules:
1) Reading and cleaning the dataset.
2) Model building.
3) Testing the dataset with the model.

Let us look into each module.

a) Reading and Cleaning the Dataset: We need to import the Pandas, NumPy and sklearn libraries and use them to process the information, reading both the training and testing datasets using Pandas. Exploratory data analysis (EDA) is then used to analyse the dataset and identify specific outliers. By using the head() function, we are able to see the first rows of the dataset so that we have a clear picture of what fields the dataset contains. After that, we store the number of rows and columns in the dataset. We need to understand the various features and columns of the dataset, so we use the describe() function to get a summary of all the numerical fields, which include applicant income, co-applicant income, credit history and loan amount. For the non-numerical values (e.g., Property Area, Credit History, etc.), we can look at the distribution to understand whether or not they make sense.

Understanding the distribution of numerical variables like Applicant Income and Loan Amount can be done using box plots to find the outliers in the dataset fields. The box plot for applicant income in the train dataset confirms the presence of a lot of outliers/extreme values, which can be attributed to the income disparity in society. The box plot of loan amount by gender shows that more outliers are present for male applicants than for female applicants. The box plot for applicant income by education shows that there is a higher number of graduates with very high incomes, which appear to be the outliers.

Now it is time to understand the distribution of the categorical variables. By making a cross table of the Credit History and Loan Status fields, we can see that approved loans in the train dataset are more common among applicants with a credit history equal to 1. We then write a function to find the proportion of applicants whose loans are approved with a credit history equal to 1, and it shows that more than 79% of such individuals have been granted loans. Outliers have extreme values compared to normal ones, but the extreme values are practically possible, i.e., some people might apply for high-value loans due to specific needs. So, instead of treating them as outliers, we try a log transformation to nullify their effect, and we also combine applicant and co-applicant income for better results. We also need to fill the missing values as data preparation for model building in the next module; we generally fill the missing values using Pandas.

b) Model Building

We convert all our categorical variables into numeric values, as sklearn requires all inputs to be numeric; after this, all the variables in the dataset have a numeric data type. Now we build a generic classification function by importing the scikit-learn module and assessing performance: fit the model, make predictions on the dataset, and perform k-fold cross-validation with 5 folds, training the algorithm using the target and predictors. The model is fitted again so that it can be referred to outside the function, adding the accuracy rate and cross-validation score. We then combine both the train and test datasets and create a flag for each, look at the summary of missing values in both, and fill them for both categorical and numeric variables. Once all the variables are converted to numeric for processing in the model, we create a new column named Total Income by adding the applicant income and co-applicant income and create label encoders. We have used logistic regression to build the model.

Logistic regression is one of the popular algorithms belonging to supervised learning techniques. In this algorithm, the result must lie between 0 and 1, i.e. Yes or No (True/False), so it is mainly used for solving classification problems. It predicts only two values, which makes our model accurate and gives good results as we expected. The chances of getting a loan are higher for:

 Applicants having a credit history (we observed this in exploration).
 Applicants with higher applicant and co-applicant incomes.
 Applicants with a higher education level.
 Properties in urban areas with high growth possibilities.

So, we made our model with 'Credit_History', 'Education' & 'Gender'.

c) Testing the Dataset with the Model

We can see that the accuracy rate is 80.945% with a cross-validation score of 80.946%, which compares favourably with recent models based on other machine learning algorithms.

Accuracy Rate: the percentage of correct predictions for a given dataset.

Cross-Validation Score: a statistical method used to estimate the skill of models.

The output CSV file is stored in local storage with the unique Loan ID and the predicted loan approval status as Yes or No. By using these results, banks can go ahead and complete the loan approval process for the right applicants.

Experimental Results

4. Conclusion and Future Scope

Finally, by using the logistic regression model we can predict whether a loan can be approved or not. To implement this, various input variables were used to get the output. Whenever the program takes the input data, it gives the output in binary form, i.e., either 0 or 1. If the output is 1, then '1' is displayed, indicating that the loan is approved; if the output is 0, then '0' is displayed, indicating that the loan is not approved. This will definitely help the banking system to open up efficient delivery channels by taking the right decision in approving or rejecting loan applicants.
5. References

1. Z. Somayyeh and M. Abdolkarim, "Natural Customer Ranking of Banks in Terms of Credit Risk by Using Data Mining: A Case Study: Branches of Mellat Bank of Iran", Jurnal UMP Social Sciences and Technology Management, vol. 3, no. 2, pp. 307–316, 2015.

2. Sudhamathy G and Jothi Venkateswaran, "Analytics Using R for Predicting Credit Defaulters", IEEE International Conference on Advances in Computer Applications (ICACA), 978-1-5090-3770-4, 2016.

3. Dileep B. Desai and R.V. Kulkarni, "A Review: Application of Data Mining Tools in CRM for Selected Banks", International Journal of Computer Science and Information Technologies (IJCSIT), vol. 4, no. 2, pp. 199–201, 2013.

4. Allen, N., Berger, W., Scott, F., & Nathan, H. M. (2002). Credit Scoring and the Availability, Price, and Risk of Small Business Credit. FRB of Atlanta Working Paper No. 2002-6, FEDS Working Paper No. 2002-26.

5. Altman, E. I. (1968). Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy. The Journal of Finance, 23(4), 589–609.

6. J.H. Aboobyda and M.A. Tarig, "Developing Prediction Model of Loan Risk in Banks Using Data Mining", Machine Learning and Applications: An International Journal (MLAIJ), vol. 3, no. 1, pp. 1–9, 2016.

7. A.B. Hussain and F.K.E. Shorouq, "Credit risk assessment model for Jordanian commercial banks: Neural scoring approach", Review of Development Finance, Elsevier, vol. 4, pp. 20–28, 2014.

8. T. Harris, "Quantitative credit risk assessment using support vector machines: Broad versus narrow default definitions", Expert Systems with Applications, vol. 40, pp. 4404–4413, 2013.

9. Kwofie, Charles, Owusu-Ansah, Caleb and Boadi, Caleb (2015). Predicting the Probability of Loan-Default: An Application of Binary Logistic Regression. Research Journal of Mathematics and Statistics, 7, 46–52. 10.19026/rjms.7.2206.
C. SOURCE CODE
#Import models from the scikit-learn module:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import KFold  # replaces the removed sklearn.cross_validation
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

#Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome):
    #Fit the model:
    model.fit(data[predictors], data[outcome])
    #Make predictions on the training set:
    predictions = model.predict(data[predictors])
    #Print accuracy
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))
    #Perform k-fold cross-validation with 5 folds
    kf = KFold(n_splits=5)
    error = []
    for train, test in kf.split(data):
        #Filter the training data
        train_predictors = data[predictors].iloc[train, :]
        #The target we're using to train the algorithm
        train_target = data[outcome].iloc[train]
        #Train the algorithm using the predictors and target
        model.fit(train_predictors, train_target)
        #Record the score from each cross-validation run
        error.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))
    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))
    #Fit the model again so that it can be referred to outside the function:
    model.fit(data[predictors], data[outcome])

#Combine the train and test datasets
#Create a flag for the Train and Test data sets
df['Type'] = 'Train'
test['Type'] = 'Test'
fullData = pd.concat([df, test], axis=0, sort=True)
#Look at the missing values in the dataset
fullData.isnull().sum()
#Create a new column for total income
fullData['TotalIncome'] = fullData['ApplicantIncome'] + fullData['CoapplicantIncome']
fullData['TotalIncome_log'] = np.log(fullData['TotalIncome'])
#Histogram of total income
fullData['TotalIncome_log'].hist(bins=20)
#Create label encoders for the categorical features
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))
train_modified = fullData[fullData['Type'] == 'Train']
test_modified = fullData[fullData['Type'] == 'Test']
train_modified["Loan_Status"] = number.fit_transform(train_modified["Loan_Status"].astype('str'))
predictors_Logistic = ['Credit_History', 'Education', 'Gender']
x_train = train_modified[list(predictors_Logistic)].values
y_train = train_modified["Loan_Status"].values
x_test = test_modified[list(predictors_Logistic)].values
#Create a logistic regression object
model = LogisticRegression()
#Train the model using the training set
model.fit(x_train, y_train)
#Predict the output
predicted = model.predict(x_test)
#Reverse the encoding of the predicted outcome
predicted = number.inverse_transform(predicted)
#Store it in the test dataset
test_modified['Loan_Status'] = predicted
outcome_var = 'Loan_Status'
classification_model(model, df, predictors_Logistic, outcome_var)
test_modified.to_csv("Logistic_Prediction.csv", columns=['Loan_ID', 'Loan_Status'])

Accuracy: 80.945% Cross-Validation Score: 80.946%