
Loanliness: Predicting Loan Repayment Ability by Using Machine Learning Methods
Yiyun Liang (isaliang@stanford.edu)
Xiaomeng Jin (tracyjxm@stanford.edu)
Zihan Wang (wangzih@stanford.edu)

Abstract— Evaluating and predicting the repayment ability of borrowers is important for banks to minimize the risk of loan payment default. For this reason, banks have built systems to process loan requests based on the borrower's status, such as employment status and credit history. However, the existing evaluation systems might not be appropriate for assessing the repayment ability of some borrowers, such as students or people without credit histories. In order to properly assess the repayment ability of all groups of people, we trained various machine learning models on a Kaggle dataset, Home Credit Default Risk, and evaluated the importance of all the features used. Then, based on the importance scores of the features, we analyze and select the most identifiable features to predict the repayment ability of the borrower.

I. INTRODUCTION

Due to insufficient credit histories, many people struggle to get loans from trustworthy sources, such as banks. These people are often students or unemployed adults, who might not have enough knowledge to judge the credibility of unidentified lenders. Untrustworthy lenders can take advantage of such borrowers by charging high interest rates or including hidden terms in the contract. Instead of evaluating borrowers based only on their credit score, there are many alternative ways to measure or predict their repayment ability. For example, employment can be a major factor affecting a person's repayment ability, since an employed adult has a more stable income and cash flow. Other factors, such as real estate, marital status, and city of residence, might also be useful in the study of repayment ability. Therefore, in our project, we use machine learning algorithms to study the correlations between borrower status and repayment ability.

We use the Home Credit Default Risk dataset from Kaggle.com in this project [1]. This open dataset contains about 308K anonymous clients with 122 unique features. By studying the correlation between these features and the repayment ability of the clients, our algorithm can help lenders evaluate borrowers along more dimensions and can also help borrowers, especially those who do not have sufficient credit histories, find credible lenders, leading to a win-win situation. (Code available at: https://github.com/Yiyun-Liang/loanliness)
II. RELATED WORK AND BACKGROUND

A. Loan Repayment Ability Prediction

In the lending industry, lenders normally evaluate the repayment ability of borrowers and the risk of lending money to them. Based on the repayment ability and risk, lenders, especially banks, can adjust the interest rates of the loans issued to the borrowers [2].

Research on evaluating repayment ability has been conducted for decades. Some of the research focuses on finding useful metrics to quantitatively evaluate a borrower's repayment ability, such as the residual income ratio and credit score [3] [4] [5]. Other work targets the repayment ability of groups of people with similar status, such as students and farmers [6] [7] [8] [9] [10].

Furthermore, the financial crisis of 2008 had an impact on the repayment evaluation process. The term "ability to repay" was used in the Dodd-Frank Wall Street Reform and Consumer Protection Act of 2010 to describe one's financial capacity to make payments toward a debt [11]. It has been a requirement for a mortgage since the mortgage crisis of 2008. Before the financial crisis, the ability to repay was not a hard requirement for lenders to issue loans. Borrowers, especially homebuyers, could get loans from banks even when their monthly income could not cover the monthly mortgage payments [12]. In order to prevent defaults and reduce the default rate on loan payments, the Consumer Financial Protection Bureau (CFPB) introduced a new set of rules and regulations for evaluating a borrower's ability to repay. These rules and regulations consider the borrower's [13]:
• Expected income or assets
• Employment status
• Expected monthly payment
• Monthly payment on simultaneous loans
• Monthly payment of mortgage
• Current debt status
• Residual income
• Credit history

These factors have become the rule of thumb for evaluating a borrower's ability to repay.

However, these ability-to-repay rules might not fit some types of borrowers. For example, university students might not satisfy the rules to get a loan from trustworthy sources, such as banks, since they are not employed and have very limited credit history. Therefore, untrustworthy lenders might take advantage of them. In order to prevent this from happening, the objective of our project is to discover more identifiable and useful features for evaluating the credibility and repayment ability of borrowers. Furthermore, we train and test machine learning models based on these features and find the best model to predict the repayment ability of borrowers.
III. DATASET AND CHALLENGES

A. Problems with existing models

We inspected the dataset and found that many entries contain invalid values such as NaN (not a number). There are also three features, 'EXT_SOURCE_1', 'EXT_SOURCE_2', and 'EXT_SOURCE_3', whose meanings are not documented. Existing models are evaluated using these three features and by simply removing the invalid values; these assumptions do not necessarily make sense.

Another problem with many existing models is that they are not trained on a balanced dataset, so when making predictions, the model tends to achieve high accuracy by predicting all data with the majority label. This high accuracy does not tell us much, since it is simply equal to the proportion of majority-class data in the test set.

B. Challenges with the dataset

One challenge that arises in many finance-related machine learning problems is that the dataset is heavily imbalanced. Our dataset records borrower profiles and binary ground-truth labels: the decision whether the applicant should be accepted as a client by the lender. In reality, since only a small fraction of loan applicants are eventually accepted, our dataset also suffers from this imbalance problem.

The dataset we obtained from Kaggle is relatively large in terms of both the number of features and the amount of data (around 300K records), so the training process is quite slow, especially when we build and apply more sophisticated machine learning models.

The two challenges above make the problem more interesting because they are frequently encountered by researchers and practitioners. A short sketch illustrating the imbalance is given below.
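As a concrete illustration, the following short Python sketch (assuming the Kaggle file application_train.csv is available locally, and reading the TARGET column as the binary repayment label, where 1 marks clients with payment difficulties according to the competition's data description) shows how skewed the labels are and what accuracy a trivial majority-class predictor would already achieve:

import pandas as pd

# Load the main application table from the Kaggle competition data.
df = pd.read_csv("application_train.csv")

# TARGET is the binary label (1 = client had payment difficulties).
counts = df["TARGET"].value_counts(normalize=True)
print(counts)

# Always predicting the majority class already reaches this accuracy, which is
# why raw accuracy alone is not very informative on an imbalanced dataset.
print("majority-class baseline accuracy:", counts.max())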
IV. METHODS

The goal of the project is to predict the repayment ability of borrowers based on factors other than credit history. The task can be framed as a two-class classification problem. In the following sections, we introduce the methods we used to pre-process the data and the machine learning algorithms we use to solve the problem.

A. Data Pre-processing

Due to the complexity of our raw data, we apply several data pre-processing steps to our dataset before it is used for training and testing. A sketch of this pipeline is given at the end of this subsection.

• Feature concatenation: In the original data set, the features come from different sources. A brief summary of the data files is shown in TABLE I. Our first step of data pre-processing is to concatenate all the features together using each borrower's unique ID number. For example, the entries of bureau.csv can be joined with the corresponding rows in application_train.csv using SK_ID_CURR. In this way, we concatenate all the features together to construct the training and testing sets with maximal usage of the given data. After feature concatenation, each data point has 217 features in total.

• Feature Encoding and Normalization: Our features come in a variety of formats, e.g. sentence strings, unbounded integers, floating-point numbers in the range 0 to 1, and boolean values. This poses a challenge because the features cannot be used directly for training. To prevent classification biases towards certain features, we factorize these features using label encoding, that is, we map the string values to categorical values, each represented by an integer. However, for some features the number of categories is too large and label encoding is difficult to apply; for these features we use one-hot encoding to expand the single feature into multiple features, where each expanded feature takes only the values 0 and 1. In the end, we also normalize the feature values so that all features are evaluated on the same scale.

• Invalid/Empty Entry Replacement: Besides feature processing, invalid and empty entries are another problem that prevents us from training the machine learning algorithms properly. The dataset contains a noticeable amount of data with invalid entries (such as implausibly large numbers) or empty entries (e.g. NaN). One strategy we applied is to take the mean of the feature values and fill in invalid entries with this mean value. For features with a large number of invalid values, we remove columns or rows based on the percentage of invalid values present in the corresponding column. We set a threshold value, initially 30% in our case. If the percentage of invalid values in a column is greater than the threshold, we mark the feature as invalid and remove the column from the dataset. Otherwise, we simply remove the rows which contain the invalid values.

• Polynomial feature transformation: To gain the most out of our linear classifiers, we also perform a polynomial transformation on our feature values to include polynomial combinations of the features.

The dataset provides us with a very comprehensive profile of the loan applicants. In total, we expanded the number of features to around 651 per loan applicant, as shown in TABLE II. A large number of features is helpful in training the model but can sometimes slow down our algorithms. We will expand on our experiments with feature reduction in the next section, and we will also try to make sense of the features to ensure that our assumptions about the dataset are valid and that the feature importance outcomes match our expectations.
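The following Python sketch illustrates this pre-processing pipeline under a few simplifying assumptions: it joins only bureau.csv onto application_train.csv (averaging bureau records per client, which is one of several reasonable aggregation choices), label-encodes two-valued string columns and one-hot encodes the rest, and applies the polynomial expansion only to a handful of example columns (AMT_CREDIT, AMT_INCOME_TOTAL, DAYS_BIRTH are names from the Kaggle data description). The choices here are illustrative rather than a faithful reproduction of our implementation.

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler, PolynomialFeatures

# 1. Feature concatenation: join auxiliary tables onto the main table via SK_ID_CURR.
app = pd.read_csv("application_train.csv")
bureau = pd.read_csv("bureau.csv")
# bureau.csv holds several records per client, so average its numeric columns first.
bureau_agg = bureau.groupby("SK_ID_CURR").mean(numeric_only=True).add_prefix("BUREAU_")
df = app.merge(bureau_agg, how="left", left_on="SK_ID_CURR", right_index=True)

# 2. Encoding: label-encode two-valued string columns, one-hot encode the rest.
for col in df.select_dtypes(include="object"):
    if df[col].nunique() <= 2:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
df = pd.get_dummies(df)

# 3. Invalid/empty entries: drop columns above the 30% missing-value threshold,
#    then fill the remaining gaps with the column means.
missing_frac = df.isna().mean()
df = df.drop(columns=missing_frac[missing_frac > 0.30].index)
df = df.fillna(df.mean())

# 4. Polynomial transformation on a few example columns (expanding all 600+
#    columns quadratically would be prohibitively large).
poly = PolynomialFeatures(degree=2, include_bias=False)
extra = poly.fit_transform(df[["AMT_CREDIT", "AMT_INCOME_TOTAL", "DAYS_BIRTH"]])
df = pd.concat([df, pd.DataFrame(extra, index=df.index).add_prefix("POLY_")], axis=1)

# 5. Normalization: put all features on a common scale.
y = df.pop("TARGET").values
X = StandardScaler().fit_transform(df.drop(columns="SK_ID_CURR"))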
TABLE I
SUMMARY OF FILES AND RAW FEATURES

File Name | Description | # of Features
application_train.csv | Information about the loan and the loan applicant at the time of application | 121
bureau.csv | Data on the client's previous loans from other institutions, as reported to the Credit Bureau | 17
bureau_balance.csv | Monthly balances of credits in the Credit Bureau | 3
previous_application.csv | Information about the client's previous loans and the client's information at that time | 37
POS_CASH_balance.csv | Monthly balance of the client's previous loans in Home Credit | 8
installments_payments.csv | Previous payment data related to loans | 8
credit_card_balance.csv | Monthly balance of the client's previous credit card loans | 23

TABLE II
COMPARISON BETWEEN BEFORE AND AFTER INVALID/EMPTY ENTRY REPLACEMENT

Stage | # of Features | # of Datapoints
Before | 217 | 307511
After | 651 | 102244
B. Machine Learning Techniques

In this section, we apply several machine learning models to the task of predicting borrower repayment ability. The algorithms include logistic regression, random forests, Naive Bayes, LightGBM, and neural networks. Some of these algorithms we learned in class, and others we explored online because they were likely to perform well on this dataset. Below we explain why we chose each algorithm; the experimental results are presented in Section V, and a short training sketch follows this list.

• Logistic Regression: Logistic regression is often a great baseline model for machine learning problems because it does not enforce strong assumptions on the distribution of the underlying data. For our classification problem, logistic regression is a natural first step.

• Random Forests: Random forests usually perform well on imbalanced datasets. Another benefit of running a random forest classifier on our dataset is that it provides an intuitive way of looking at our features by listing individual feature importances, which gives us insight into the factors that affect a person's loan repayment ability. Moreover, we want to see whether introducing some degree of randomness into the classification problem helps improve the accuracy of our results [14].

• Naive Bayes: Naive Bayes uses the "naive" assumption that the features are conditionally independent of each other given the class variable. As we learned in class, Naive Bayes is a generative method, unlike the previous two algorithms, so we can compare the performance of the basic discriminative methods against this generative method.

• LightGBM: LightGBM is a highly efficient gradient boosting decision tree algorithm for classification [15]. It is an improved version of the Gradient Boosting Decision Tree (GBDT) algorithm and is over 20 times faster. GBDT is a machine learning algorithm widely used in multi-class classification and click prediction. Since LightGBM is a boosted tree-structured classifier, we can compare it against the random forest classifier to get a sense of the performance difference between standard algorithms and more advanced ones.

• Neural Networks: Neural networks, or multi-layer perceptrons, are one of the most popular methods for classification problems. A neural network is a function approximator that can not only model the distribution of linear data, but can also classify data with non-linear decision boundaries thanks to the non-linearity added by the activation functions. In our project, we fit a multi-layer perceptron by carefully selecting hyper-parameters, such as the number of layers and the number of neurons.
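As a rough illustration of how these five models can be trained and compared, the following sketch uses scikit-learn and the lightgbm package. The stand-in data generated by make_classification is only there so the snippet runs on its own; in our pipeline, X and y would be the pre-processed feature matrix and TARGET labels from Section IV, and the hyper-parameters shown are illustrative defaults rather than the tuned values behind TABLE III.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from lightgbm import LGBMClassifier

# Stand-in imbalanced data; replace with the pre-processed Home Credit features.
X, y = make_classification(n_samples=5000, n_features=30, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Naive Bayes": GaussianNB(),
    "LightGBM": LGBMClassifier(n_estimators=200, learning_rate=0.05),
    "Multi-layer Perceptron": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Per-class precision/recall/F1, matching the two-value columns of TABLE III.
    prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, average=None)
    print(name, accuracy_score(y_test, y_pred), prec, rec, f1)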
V. EXPERIMENTS AND RESULTS

A. Training and Test Data Split

By inspecting the processed dataset, we found that the negative and positive data are imbalanced: the number of data points with negative labels is much larger than the number with positive labels. There are many potential issues and risks in training on an imbalanced dataset. If we do so, the classifier may assign the majority label to all of the minority data, which can still yield a high test accuracy; however, such a classifier is essentially meaningless, since it cannot recognize data in the minority class. Therefore, in order to prevent this problem, we tried several techniques: balancing the dataset with down-sampling and up-sampling, as well as using class weights to penalize predictions of the majority class. A short sketch of these techniques follows the list.

• Down-sampling: Since the dataset is not balanced, we can use down-sampling to reduce the number of negative examples for training. First, we take the minority examples, in our case the data points with positive labels. Then, we randomly select the same number of data points with the majority label. However, it might also be worthwhile to process the dataset in other ways, because down-sampling discards a large number of negative data points, which is a waste of data.

• Up-sampling: Another technique we explored is up-sampling the minority data in the training set. We can up-sample positive data using techniques such as SMOTE [16]. By using up-sampling, we can take advantage of the negative data more effectively.

• Class weights: We also experimented with class weights, which penalize the classifier for predicting the majority class label. The class weight is calculated as the inverse of the class frequency. The weight of class i is given below, where n is the total number of examples and n_i is the number of examples in class i:

    w_i = n / (2 * n_i)    (1)
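A minimal sketch of these three options is given below, assuming NumPy arrays X_train and y_train with labels 0 (majority/negative) and 1 (minority/positive); the SMOTE implementation comes from the separate imbalanced-learn package.

import numpy as np
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE  # provided by the imbalanced-learn package

def downsample(X, y, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    neg_kept = resample(neg, replace=False, n_samples=len(pos), random_state=seed)
    idx = np.concatenate([pos, neg_kept])
    np.random.default_rng(seed).shuffle(idx)
    return X[idx], y[idx]

def class_weights(y):
    """Inverse-frequency weights w_i = n / (2 * n_i) for the two-class case (Eq. 1)."""
    n = len(y)
    return {c: n / (2 * np.sum(y == c)) for c in np.unique(y)}

# Down-sampling:
X_down, y_down = downsample(X_train, y_train)

# Up-sampling the minority class with SMOTE [16]:
X_up, y_up = SMOTE(random_state=0).fit_resample(X_train, y_train)

# Class weights can instead be passed directly to most scikit-learn classifiers,
# e.g. LogisticRegression(class_weight=class_weights(y_train)).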
TABLE III
PERFORMANCE OF MACHINE LEARNING ALGORITHMS

Machine Learning Model | Accuracy | Precision | Recall | F1 Score
Logistic Regression | 69.34% | 0.66/0.75 | 0.81/0.58 | 0.72/0.65
Random Forest | 63.51% | 0.58/1.00 | 1.00/0.27 | 0.73/0.43
Naive Bayes | 52.11% | 0.51/0.71 | 0.97/0.07 | 0.67/0.13
Multi-layer Perceptron | 69.15% | 0.67/0.71 | 0.73/0.65 | 0.70/0.68
LightGBM | 57.47% | 0.54/1.00 | 1.00/0.15 | 0.70/0.26

TABLE IV
PERFORMANCE OF K-MEANS CLUSTERING WITH CLASSIFICATION ALGORITHMS

Machine Learning Model | Accuracy | Precision | Recall | F1 Score
Cluster 1 | 72.24% | 0.63/1.00 | 1.00/0.47 | 0.77/0.64
Cluster 2 | 67.01% | 0.62/1.00 | 1.00/0.28 | 0.77/0.43
Cluster 3 | 82.34% | 0.77/1.00 | 1.00/0.56 | 0.87/0.72
Cluster 4 | 70.78% | 0.56/1.00 | 1.00/0.53 | 0.72/0.69
Overall | 71.57% | 0.63/1.00 | 1.00/0.43 | 0.77/0.59

B. Performance of Implemented Algorithms

The best performance comes from the down-sampled dataset. The performance of each algorithm is shown in TABLE III. As we can see, the logistic regression classifier has the best accuracy, reaching a value of around 0.69, followed by the random forest and MLP algorithms, both of which have an accuracy above 0.6.

We can also look at the precision and recall scores of our classifiers more closely. From TABLE III, we see that all of our models yield a good precision score on the positive class (the second value in the precision column). This precision is calculated as the number of true positives over the sum of true positives and false positives; a high value tells us that the model is confident when it classifies a borrower as trustworthy. This aligns with our goal of providing another criterion for evaluating trustworthy borrowers who may not have sufficient credit history.

The ROC curves and corresponding areas under the ROC curve for each classifier are shown in Figure 1. ROC curves tell us about a classifier's ability to distinguish between the two classes. Based on the figure, the MLP achieves the best area under the curve, followed by the random forest and logistic regression. A short sketch of this evaluation is given below.

Fig. 1. ROC curves and area under the ROC curve for each classifier.
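A minimal sketch of how a figure like Figure 1 can be produced with scikit-learn and matplotlib is shown below; it assumes the fitted models dictionary and the held-out split from the training sketch in Section IV-B.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

plt.figure()
for name, model in models.items():
    # Use the predicted probability of the positive class as the ranking score.
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, scores):.3f})")

plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.savefig("roc_curves.png")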
C. Performance of K-means Clustering and Classification

The goal of the unsupervised learning part of the project is to gain some meaningful insight into the structure of the data and to potentially categorize the various types of loan applicants in our dataset. We wanted to see whether distinct characteristics exist among different groups of borrowers; if so, we could build different prediction models for different groups. The unsupervised learning technique we tried is K-means clustering.

We first performed K-means clustering on the dataset, experimenting with several values of k, and then built a prediction model for each cluster. The results are shown in TABLE IV. The models achieved the best overall performance when k = 4. For each of the four clusters, the LightGBM model performed the best out of all the machine learning models. If we compare the accuracy, precision, and F1 scores in TABLE IV with those of LightGBM in TABLE III, we observe that the model's performance improved significantly, from an accuracy of 57.47% to 71.57%. We can infer from this result that each cluster identified by the k-means algorithm exhibits characteristics that can be picked up by a model trained separately on that cluster, but not by a model trained on the entire dataset. A sketch of this cluster-then-classify setup is given below.
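A rough sketch of the cluster-then-classify setup, under the same assumptions as the earlier training snippets (NumPy arrays X_train, y_train, X_test, y_test and the lightgbm package), might look as follows; the k-means++ initialization [17] is scikit-learn's default.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from lightgbm import LGBMClassifier

k = 4
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
train_clusters = kmeans.fit_predict(X_train)
test_clusters = kmeans.predict(X_test)

# Train one classifier per cluster and evaluate it on the test points assigned
# to the same cluster.
correct = 0
for c in range(k):
    tr, te = train_clusters == c, test_clusters == c
    clf = LGBMClassifier(n_estimators=200).fit(X_train[tr], y_train[tr])
    preds = clf.predict(X_test[te])
    correct += np.sum(preds == y_test[te])
    print(f"cluster {c}: accuracy = {accuracy_score(y_test[te], preds):.4f}")

print("overall accuracy:", correct / len(y_test))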

D. Visualization of high-dimensional data


• Principal Component Analysis (PCA): Since we expanded our feature space to more than 800 features after data processing, despite the large amount of data we have, we do not yet have a good understanding of the relationships between the variables. To answer this question, we look for a dimension reduction technique that can tell us what the important features are. One technique we tried is principal component analysis (PCA). Moreover, dimension reduction also serves as a convenient way to visualize our high-dimensional dataset.
Since it is easiest to perform visualizations on a 2D or 3D plot, in Figure 2 we applied PCA to obtain the first two principal components. As a sanity check, the first two principal components account for around 23% of the variation in our dataset. The scatter plot is based on the top two principal components, with the two classes colored differently. From the plot, we observed four distinct clusters. This aligns with our results in the previous section, where k-means works best when the number of clusters k is set to 4. Another thing we observed is that, despite the four clusters, the data points from the two classes are still largely inseparable: the blue dots, representing the positive class, are mingled within the clouds of red dots. This result is consistent with our experiments, since it is difficult for the classifiers to distinguish between two classes that cannot be separated using the existing attributes, as we see in the visualization.
• t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is another technique for dimension reduction. According to [18], t-SNE differs from PCA in that it "minimizes the divergence between two distributions: a distribution that measures pairwise similarities of the input objects and a distribution that measures pairwise similarities of the corresponding low-dimensional points in the embedding." In other words, t-SNE takes a probabilistic approach to reducing dimensions, rather than one that relies on an eigenvector computation as in PCA.
Similar to PCA, we computed the top two dimensions using t-SNE and plotted them in Figure 2. A similar problem exists in this plot: the two classes cannot be distinguished easily.
Last but not least, t-SNE works well when we first perform dimension reduction using PCA and then run t-SNE on the reduced data. The resulting plot looks slightly better in terms of separating the two classes, but the challenge is still present. This again confirms that our dataset is quite challenging to work with. Despite that, we gained several interesting insights into the dataset and are able to draw reasonable and meaningful conclusions from our results.

Fig. 2. Visualizations of high-dimensional data: 1) PCA with the top two principal components, 2) t-SNE, 3) t-SNE after applying PCA.
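Two of the three panels of Figure 2 can be reproduced along the lines of the following sketch (plain t-SNE is analogous, just without the intermediate PCA step), assuming X is the scaled feature matrix with more than 50 columns and y the binary labels; the 50-component PCA reduction before t-SNE is a common speed/quality trade-off rather than a requirement.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Top two principal components, plus how much variance they capture (~23% here).
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("variance explained by two components:", pca.explained_variance_ratio_.sum())

# t-SNE run on a 50-component PCA reduction of the data, which is much faster
# than running t-SNE directly on all of the original columns.
X_reduced = PCA(n_components=50).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X_reduced)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=2, cmap="coolwarm")
axes[0].set_title("PCA")
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, s=2, cmap="coolwarm")
axes[1].set_title("t-SNE after PCA")
fig.savefig("figure2.png")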
E. Feature Importance Analysis

If we look at the feature importances in Figure 3, we notice that the top features are 'NUM_DAYS_EMPLOYED', 'DAYS_BIRTH', etc. The most important feature is the number of days the person has been employed. This is reasonable: the longer a person has been employed, the more likely they are to have a stable income, which reflects the ability to maintain good credit and pay the loan on time. The second most important feature is the days since birth, that is, how old the person is. The older the person, the more likely they are to keep up with their loan payments and the lower the risk of default. A minimal sketch of how these importances can be extracted is shown below.

Fig. 3. Feature importance extracted from the random forest classifier.
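Assuming the fitted training data from the earlier sketches and a list of column names produced during pre-processing (here called feature_names, a hypothetical variable), the ranking behind Figure 3 can be obtained roughly as follows.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest and rank features by impurity-based importance.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))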
VI. CONCLUSIONS AND FUTURE WORK

In this report, we demonstrated the use of machine learning algorithms on a very challenging dataset to predict loan repayment ability. To achieve the best performance, we showed that data pre-processing, a careful choice of dataset balancing techniques, and the choice of classification algorithm are all very important. Logistic regression and neural networks work quite well on our dataset, and the use of k-means is also effective. In the future, we want to continue exploring more sophisticated learning algorithms and dimension reduction techniques to further improve model performance on this important prediction task.
VII. CONTRIBUTIONS OF TEAM MEMBERS

All three team members contributed roughly equally to the project. Yiyun's main contribution was implementing various machine learning models for the loan repayment prediction and implementing k-means for the clustering. Xiaomeng's main contribution was implementing machine learning models and feature engineering. Zihan's main contribution was implementing the data processing algorithms and helping with the machine learning model implementation. All team members contributed to the write-up of the report.
R EFERENCES
[1] “Home Credit Default Risk.” Kaggle, https://www.kaggle.com/c/home-
credit-default-risk/data.
[2] Gorton, Gary, and James Kahn. ”The design of bank loan contracts.”
The Review of Financial Studies 13, no. 2 (2000): 331-364.
[3] Langrehr, Virginia B., and Frederick W. Langrehr. ”Measuring the
ability to repay: The residual income ratio.” Journal of Consumer
Affairs 23, no. 2 (1989): 393-406.
[4] Kolo, Brian, Thomas Rickett McGraw, and Dathan Gaskill. ”Systems
and methods for using data metrics for credit score analysis.” U.S.
Patent Application 13/456,532, filed November 1, 2012.
[5] Çelik, Şaban. ”Micro credit risk metrics: a comprehensive review.”
Intelligent Systems in Accounting, Finance and Management 20, no.
4 (2013): 233-272.
[6] Olivas, Michael A. ”Paying for a law degree: Trends in student
borrowing and the ability to repay debt.” J. Legal Educ. 49 (1999):
333.
[7] Hesseldenz, Jon, and David Stockham. ”National direct student loan
defaulters: The ability to repay.” Research in Higher Education 17, no.
1 (1982): 3-14.
[8] Flint, Thomas A. ”Predicting student loan defaults.” The Journal of
Higher Education 68, no. 3 (1997): 322-354.
[9] Afolabi, J. A. ”Analysis of loan repayment among small scale farmers
in Oyo State, Nigeria.” Journal of Social Sciences 22, no. 2 (2010):
115-119.
[10] Wongnaa, C. A., and Dadson Awunyo-Vitor. ”Factors affecting loan
repayment performance among yam farmers in the Sene District,
Ghana.” Agris on-line Papers in Economics and Informatics 5, no. 665-
2016-44943 (2013): 111-122.
[11] Murdock, C.W., 2011. The Dodd-Frank Wall Street Reform and
Consumer Protection Act: What Caused the Financial Crisis and Will
Dodd-Frank Prevent Future Crises. SMUL Rev., 64, p.1243.
[12] Ivashina, Victoria, and David Scharfstein. ”Bank lending during the
financial crisis of 2008.” Journal of Financial economics 97, no. 3
(2010): 319-338.
[13] Mierzewski, Michael B., Christopher L. Allen, Jeremy W. Hochberg,
and Kevin Hall. ”CFPB Finalizes Ability-to-Repay and Qualified Mort-
gage Rule.” Banking LJ 130 (2013): 611.
[14] Liaw, Andy, and Matthew Wiener. ”Classification and regression by
randomForest.” R news 2.3 (2002): 18-22.
[15] Ke, Guolin, et al. ”Lightgbm: A highly efficient gradient boosting
decision tree.” Advances in Neural Information Processing Systems.
2017.
[16] Chawla, Nitesh V., et al. ”SMOTE: synthetic minority over-sampling
technique.” Journal of artificial intelligence research 16 (2002): 321-
357.
[17] Arthur, David, and Sergei Vassilvitskii. ”k-means++: The advantages
of careful seeding.” Proceedings of the eighteenth annual ACM-SIAM
symposium on Discrete algorithms. Society for Industrial and Applied
Mathematics, 2007.
[18] Laurens van der Maaten, Geoffrey Hinton. ”Visualizing Data using
t-SNE.” Journal of Machine Learning Research, 2008.
[19] Visualising High-dimensional Datasets Using PCA and t-SNE.
https://towardsdatascience.com/visualising-high-dimensional-datasets-
using-pca-and-t-sne-in-python-
