Research Report

Comparing Loan Default Prediction Accuracy Using Different Machine Learning Techniques

Achyutam Verma, University of Limerick, 24046949@studentmail.ul.ie
Ashutosh Lembhe, University of Limerick, 24088714@studentmail.ul.ie
Dhruv Upadhyay, University of Limerick, 24039497@studentmail.ul.ie
Gorav Sharma, University of Limerick, 24246182@studentmail.ul.ie
ABSTRACT

The purpose of this study is to evaluate how well various machine learning methods predict loan defaults in the financial sector. Traditional techniques for assessing loan applicants are frequently inadequate and result in financial losses, given the rising risks connected with loan lending. The study investigates how well machine learning methods such as Support Vector Machines (SVM), Decision Trees, the XGBoost Classifier, Random Forest, and K-Nearest Neighbours can forecast whether a borrower will default or repay. To help banks and other financial organisations minimise losses and enhance decision-making, the comparative analysis attempts to determine which machine learning model has the best forecast accuracy for loan defaults. The above machine learning models achieved training accuracies from 83% to 99%, while their accuracy on test data dropped slightly, suggesting some need for fine tuning.

KEYWORDS

Support Vector Machine, XGB Classifier, Random Forest Classifier, Decision Tree, K-Nearest Neighbor, ROC, Precision Score, Recall Score

1 INTRODUCTION

In the world of finance and banking, lending has become one of the biggest ways for banks to earn money, be it personal loans, business loans, or financing for education. Lending to an applicant is a risky business: if the applicant defaults on the payments, the bank or financial institution suffers a financial loss. Banks usually verify candidates in an old-fashioned, manual way to evaluate whether they will repay the loan, but this process can lead to wrong decisions based on the wrong data points, so that a loan is granted to a candidate who later defaults on the payments, resulting in losses for the bank. The loan can be of any kind, ranging from credit card debt to loans taken directly from the bank. In today's world machine learning plays a crucial role in predictive analytics, and it can be used in the same way to determine whether a candidate is a likely defaulter or a good candidate for the bank. Using classifiers and regression models, together with some exploratory data analysis, we can estimate how likely an applicant is to default on a loan payment. We will employ algorithms such as support vector machines and XGBoost to obtain accuracy values and carry out a comparative analysis to see which of them gives the best results on clean data. Techniques like this help save time, reduce errors, and increase the accuracy of a complex process such as loan lending. An advantage of machine learning models is that they can be adapted to changing scenarios and still give accurate results.

The dataset we are using is the Loan Defaulter dataset from Kaggle. It contains a wide range of loan data that tells us whether a person is going to default or not, with information such as the type of loan contract, the person's age, whether they own assets, whether they are married, the number of children, and many other factors that help in predicting a default. The dataset has two files: the first holds data on clients' current applications with all of the above information, and the other is a previous-application file containing data on clients who have applied for loans before and whether they defaulted. The combined dataset contains over 1.4 million records across both files and more than 100 columns on which the prediction will take place. We will use CRISP-DM (Cross-Industry Standard Process for Data Mining) for cleaning our data, and we will apply Support Vector Machines (SVM), Decision Trees, the XGBoost Classifier, Random Forest, and K-Nearest Neighbours to the given data.

2 RELATED WORK

Loan default prediction has been a popular research topic, and a great deal of work has been done on it. In the research paper Loan Default Prediction with Machine Learning Techniques [1], the authors discuss five machine learning techniques, how the data were imbalanced, and how the models were fine-tuned for better performance using XGBoost, random forest (RF), AdaBoost, and k-nearest neighbours (kNN); AdaBoost performed best among all the models, while a limitation of the XGBoost and AdaBoost models was their lack of transparency. The research paper Improving Credit Risk Assessment through Deep Learning-based Consumer Loan Default Prediction Model [2] discusses improving credit risk assessment using data extraction, analytical algorithms, and deep learning with Keras and TensorFlow, which gave an accuracy of 95.2%; the limitations are that the data are restricted to 11 banks and that the small data size inflates accuracy in a way that will not hold for larger datasets. The next research paper, Loan Risk Prediction Model based on Random Forest [3], shows the use of a Random Forest classifier to predict loan default with a test accuracy of 85.6%, its limitation being that it evaluates only one model and does not explore others. The research paper Loan Default Prediction: A Complete Revision of Lending Club [4] uses binary classification to identify defaulters and repayers with an accuracy of 90%, but with the limitation that it eliminates a large part of the data for testing, which is not correct.
The research paper Loan Default Prediction Using Spark Machine Learning Algorithms [5] explores the use of the big data platform Apache Spark with different models such as GBT, LSVM, and FM; the models give a good accuracy of 99%, but the dataset is biased in terms of minorities and gender. The research paper Prediction of loan default based on multi-model fusion [6] uses logistic regression, random forest, and gradient boosting algorithms on the Lending Club loan dataset to predict the default rate; however, that study does not compare against deep learning methods. The research paper Loan Prediction by using Machine Learning Models [7] by Pidikiti Supriya et al. suggests checking the creditworthiness of the client applying for a loan by using decision trees, logistic regression, gradient boosting, and classification. It achieves an accuracy of 81%, but with the limitation that gender and marital status are not taken into account, whereas we do include them in our machine learning models. The paper Explainable AI for credit assessment in banks [8] makes predictions using gradient boosting and explains them with SHAP values; the limitation here is that it only uses numeric data. The research paper Comparing classification Algorithm on predicting loan [9] discusses the use of different machine learning methods such as k-nearest neighbours, Random Forest, Decision Tree, Support Vector Machine, and Logistic Regression with high accuracy, but a limitation of the paper is that it uses a smaller dataset, which inflates the accuracy. The next research paper, Credit rating analysis with support vector machines and neural networks: a market comparative study [10] by Huang Zan, discusses the application of machine learning methods, specifically support vector machines (SVM) and backpropagation neural networks (BNN), to forecast credit ratings. The project's goal was to develop a model that could explain more of the data than the conventional statistical approach; it compared how effectively SVM and BNN predicted credit ratings and looked at ways to improve the interpretability of AI-based models in credit risk assessment. The paper Predicting Bank Loan Eligibility through Comparison Analysis and Machine Learning Model [11] uses logistic regression, decision trees, random forests, KNN, LightGBM, XGBoost, and AdaBoost in the development of its model. The performance of each method is evaluated using standard measures such as F1-score, Area Under the Curve (AUC), recall, accuracy, and precision. The highest accuracy, 91.89%, is achieved by LightGBM, followed by Random Forest and AdaBoost, and LightGBM also outperformed the other classifiers according to the AUC analysis. The paper The Evaluation of Consumer Loans Using Support Vector Machine [12] uses a T-test and cross-validation to compare SVM and MLP; the outcome shows that SVM performed better than MLP in terms of visualization and generalization. To sum up, that study offers a thorough method for evaluating consumer loans using SVM to satisfy the changing needs of the financial sector, and its misclassification error analysis offers important information about how to optimize the SVM parameters. The paper A roadmap for future neural networks research in auditing and risk assessment [13] investigates neural networks (NNs) in relation to business risk. NNs are modelled on the nervous system and the brain; the paper emphasizes how nonlinearity and interdependencies between variables can be managed without strict distributional assumptions and highlights how NNs learn from observations and reduce prediction mistakes through iteration.

From the above literature review it can be observed that the machine learning models we have selected align with the results of other papers, but we must be careful when selecting the dataset size, since it should be neither too small nor too large for the models; otherwise the models will suffer from underfitting or overfitting on the given training data.

3 STUDY DESIGN

3.1 Research Goal

Research objectives: The overarching aim of this study is to test and compare several machine learning algorithms for loan default prediction, with the aim of identifying the most accurate model for loan default prediction in the financial sector. This study aims to attain the following objectives by applying techniques such as support vector machine (SVM), decision tree, random forest, XGBoost, and k-nearest neighbour (KNN) to a comprehensive loan dataset:

1. Accuracy Comparison: Check the prediction accuracy of the different machine learning models.

2. Model Performance Assessment: Assess the performance of each model using a large set of metrics including, but not limited to, accuracy, precision, recall, F1 score, AUC, and MSE.

3. Feature Importance and Insights: Analyze the relationship between loan applicant features such as credit history, income, and employment status and loan default risk, and list the important features that help to forecast accurately.

4. Loan Risk Assessment Optimization: Identify the best machine learning algorithms that can be used to enhance financial institutions' loan approval process and decrease the probability of defaults.

5. Better Decisions: Provide a data-driven way to improve bank and lending institution decision-making by pinpointing the most probable loan defaults, which helps to reduce financial losses.

The study provides insight into how machine learning can be used to increase the accuracy and speed of loan default prediction systems.

3.2 Research Questions

"How can different machine learning models be applied to predict loan defaults more accurately, and which model achieves the highest prediction accuracy using the available loan data?"

3.3 Initial search

Methodology for the initial literature and database search:

1. Search goal: The literature search was intended to identify research articles and papers focusing on loan default prediction through machine learning techniques. Special emphasis was placed on studies that provided comparative analyses of various models, evaluated performance metrics, and discussed practical implementations in banking and financial institutions.

2. Databases and resources employed: Reliable academic databases and resources were used, namely: 1. Elsevier (Science Direct) 2. Springer Link 3. ACM Digital Library 4. Kaggle 5. Google Scholar. These databases were selected due to their vast collections of technical reports, conference papers, and peer-reviewed journal articles, and a suitable dataset was discovered on Kaggle.
3. Key terms and phrases: "Loan default forecast," "Machine Learning in Finance," and "Loan Risk Assessment using ML" were among the keywords used in the search, together with "Comparative Analysis of Machine Learning Models," "Credit Risk Prediction Algorithm," and "Predictive analysis in banks."

4. Criteria for inclusion and exclusion: The following inclusion/exclusion criteria were applied in order to guarantee the calibre and applicability of the literature. Requirements for inclusion: studies conducted within the last ten years, in order to capture the most recent advancements in machine learning technology; analyses of the application of deep learning and machine learning to the prediction of credit and loan default; analyses of a financial dataset that compare and assess multiple machine learning models. Excluded: research on non-financial factors, publications written in languages other than English, and studies with incomplete or inaccessible data.

5. Review procedure: Titles and abstracts were first screened to ensure they were pertinent to the study questions. The relevant literature was then examined thoroughly to gain knowledge of the methodologies, datasets, and outcomes, as well as any constraints. A matrix that illustrates and contrasts each paper's findings has been created.

6. Search results: The gathered papers revealed a number of pertinent contributions to the field of loan default forecasting research. Methodologies, data used, and corresponding outcomes were all methodically examined in the review of these studies, and the findings influenced the selection of machine learning models and evaluation indicators for this study.

4 DATA MINING METHODOLOGIES

Now we need to analyze our data from the business perspective and then from the data perspective. Once we have understood both perspectives, we implement CRISP-DM for preprocessing our data; this is a common practice used in machine learning projects across various industries.

Figure 1: CRISP-DM Methodology

4.1 Business Perspective

Using a variety of machine learning algorithms, this phase seeks to analyze the significance of successfully forecasting loan applications. The assessment metric scores of each method will also be compared and assessed. In order to determine the essential company goals and prerequisites for precise loan projection forecasts, this step entails working with bankers and industry specialists. For the banking industry's financial and long-term economic viability, it is essential to understand how loan prediction works. The study will look at the cost of incorrectly identifying a bad loan as a good loan, as well as the financial implications of loan prediction accuracy. The purpose of this study is to investigate the advantages of accurately predicting good loans, including identifying bad loans, predicting excellent loans, and enhancing machine learning algorithms.

4.2 Data Perspective

Now we need to look at the data from a column perspective to understand which columns are important for exploratory data analysis and further exploration. Our current dataset has more than 1.4 million rows and more than 100 columns; we will not use all of the data and will perform data cleaning according to the CRISP-DM methodology. The data contains useful information that we have grouped into different categories by column (not all columns, just an approximation) from both tables.

Table 1: Loan and Borrower Information

Columns | Description
NAME CONTRACT TYPE | Type of loan
AMT CREDIT | Credit amount of the loan
AMT ANNUITY | Loan annuity
AMT INCOME TOTAL | Income of the client
CREDIT INCOME RATIO | Will be derived from AMT CREDIT and AMT INCOME TOTAL

Table 2: Borrower Information

Columns | Description
GENDER | Gender of the borrower
DAYS EMPLOYED | Number of days the borrower has been employed
OCCUPATION TYPE | Type of job
ORGANIZATION TYPE | Employing organisation
CNT CHILDREN | Number of children

Table 3: Borrower Asset Information

Columns | Description
FLAG OWN CAR | Whether the borrower owns a car
FLAG OWN REALTY | Whether the borrower owns a house

Now we will process the data according to our CRISP-DM method: 1. Data Cleaning, 2. Data Integration, 3. Data Transformation, 4. Data Reduction. A small loading sketch is given below.
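Before applying these steps, the two files need to be loaded. The following is a minimal sketch, assuming the Kaggle download provides the two CSV files under the names application_data.csv and previous_application.csv (the actual file names may differ):

import pandas as pd

# Sketch: load the two Loan Defaulter files before the CRISP-DM steps.
# File names are assumptions; adjust them to the actual Kaggle download.
app_df = pd.read_csv("application_data.csv")        # current applications
prev_df = pd.read_csv("previous_application.csv")   # previous applications

print(app_df.shape, prev_df.shape)   # roughly 1.4 million rows combined
print(list(app_df.columns)[:20])     # a first look at the 100+ columns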
4.3 Data Cleaning

In data cleaning our goal is to identify and eliminate outliers, fix inconsistencies, smooth noisy data, and fill in missing values. For that we need to check which columns in our data have missing or null values.
We have identified more than 10 columns with null or missing data, and the total count of missing values is 9,152,465.

Figure 2: Columns with Null Values with python code

Now we fill the missing data with the mean for numerical columns, with zero for binary columns, and with the mode for categorical columns; see the figures for the filling code.

Figure 3: Filling Mean data for numerical column

Figure 4: Filling zero and mode data
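A minimal sketch of this filling strategy, along the lines of the code in Figures 3 and 4; it assumes app_df and prev_df from the loading sketch above, and the rule for detecting binary columns is an assumption:

import pandas as pd

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the mean, 0/1 flag columns with zero,
    and categorical columns with the mode, as described in the text."""
    df = df.copy()
    for col in df.columns:
        if df[col].isna().sum() == 0:
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            if set(df[col].dropna().unique()) <= {0, 1}:   # binary flag column
                df[col] = df[col].fillna(0)
            else:
                df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])
    return df

app_df = fill_missing(app_df)
prev_df = fill_missing(prev_df)
print(app_df.isna().sum().sum())   # remaining null count after filling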
After running the filling code, the total count of null values drops from 9,152,465 to 2. We now need to find the outliers in each column of our data, using the box plot method to see which data points lie at the extreme ends. The shape of our data before removing outliers is (307511, 122).

Figure 5: Boxplot showing outlier Data

We remove the outliers by using Z-score normalization, see Figure 6. Using this technique we remove the rows whose Z-score exceeds 2, which leaves 277,884 rows. We apply the same technique to both the application data file and the previous application data file, which is possible because both have the same set of columns and only differ in the number of rows. By removing nulls and most of the outliers we are able to clean up to 98% of the data in both files.

Figure 6: Outlier removal code
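A sketch of the Z-score filtering described here and in Figure 6, assuming the threshold of 2 from the text and that the filter is applied to the numeric columns only:

import numpy as np

# Sketch: drop rows where any numeric feature has |z-score| > 2.
num_cols = app_df.select_dtypes(include=[np.number]).columns
z = (app_df[num_cols] - app_df[num_cols].mean()) / app_df[num_cols].std()
app_df = app_df[(z.abs() < 2).all(axis=1)]
print(app_df.shape)   # the text reports 277884 rows remaining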
4.4 Data Integration

This step combines data from several sources into a single, bigger data repository. Since we have two CSV files, one with the application data and the other with the previous application data, we merge them: they have the same columns but a different number of rows, and we need combined results for both files. The combined data is helpful for exploratory data analysis and is also used for training our machine learning models and predicting outputs from them.

Figure 7: Merging Data for EDA

Figure 8: Merging Data for Training and Prediction
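A sketch of the kind of combination used in Figures 7 and 8. The text describes stacking the two files, which share the same columns; joining on the client identifier SK_ID_CURR (the key used in the Home Credit schema) is shown as an alternative and is an assumption:

import pandas as pd

# Option described in the text: stack the rows of the two files.
combined_df = pd.concat([app_df, prev_df], ignore_index=True)

# Alternative sketch: join previous applications onto current ones by client id.
merged_df = app_df.merge(prev_df, on="SK_ID_CURR", how="left", suffixes=("", "_PREV"))

print(combined_df.shape, merged_df.shape)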
4.5 Data Transformation

In this section we transform the data: we use the Z-score for removing outliers and convert text data to numerical binary data for use in the machine learning models, since the models only accept numerical input. This also involves converting negative values to positive values; for example, the date-related columns in our data ('DAYS BIRTH', 'DAYS EMPLOYED', 'DAYS REGISTRATION', 'DAYS ID PUBLISH') need to be positive numbers.

Figure 9: Converting from text to binary data

Figure 10: Converting from negative to positive values

Figure 11: Z-score to remove outlier
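A sketch of the conversions referred to here and in Figures 9 and 10. The underscore column names follow the Home Credit naming convention, and the 1/0 mappings are assumptions consistent with the notation used later in the paper:

import pandas as pd

# Binary text columns -> 0/1 (mapping choice is an assumption).
combined_df["FLAG_OWN_CAR"] = combined_df["FLAG_OWN_CAR"].map({"Y": 1, "N": 0})
combined_df["FLAG_OWN_REALTY"] = combined_df["FLAG_OWN_REALTY"].map({"Y": 1, "N": 0})
combined_df["CODE_GENDER"] = combined_df["CODE_GENDER"].map({"M": 1, "F": 0})

# Date-related columns are stored as negative day offsets; make them positive.
for col in ["DAYS_BIRTH", "DAYS_EMPLOYED", "DAYS_REGISTRATION", "DAYS_ID_PUBLISH"]:
    combined_df[col] = combined_df[col].abs()

# Remaining text columns can be one-hot encoded so that everything is numeric.
combined_df = pd.get_dummies(combined_df, drop_first=True)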
4.6 Data Reduction

In this section our goal is to reduce redundant data row-wise and to remove any columns that are unrelated to our target column. The target column tells us whether the client/borrower is going to pay back the amount or not. To find the correlation between the target column and the other columns we create a heatmap, which shows the correlation between the different columns. We show the correlation for the first 20 columns only, since with more than 100 columns it is not possible to show everything at once. Once we know the correlations, we drop the columns with little to no correlation. We create this heatmap for both the application data file and the previous application data file in order to remove unnecessary data from both of them.

Figure 12: Heatmap

From the heatmap we know which columns to drop from our data. We use the drop() functionality of pandas to drop those columns in both the application data and the previous application data files.

Figure 13: Dropping Data
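A sketch of the correlation check and the column dropping described above (cf. Figures 12 and 13). The target column name TARGET, the 0.05 correlation threshold, and the use of seaborn are assumptions:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap for the first 20 columns.
corr = combined_df.corr(numeric_only=True)
sns.heatmap(corr.iloc[:20, :20], cmap="coolwarm")
plt.show()

# Drop columns with little to no correlation to the target (threshold assumed).
weak = corr["TARGET"].abs() < 0.05
combined_df = combined_df.drop(columns=corr.index[weak].difference(["TARGET"]))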
This concludes the data cleaning, i.e. the CRISP-DM preprocessing of our data. Now, to extract more insight from the data, we need to perform exploratory data analysis.

5 EXPLORATORY DATA ANALYSIS

We now need to draw more insights from our data in order to understand it. The data contains details of the borrowers, such as whether the borrower owns a car or a house and the borrower's gender, as well as the proportion of repayers and defaulters; these details help us understand the borrowers. Since the data lets us identify defaulters and repayers, we first check their percentages in the data that we combined from the application and previous application CSV files during preprocessing. For our purposes we label defaulters as 1 and repayers of the borrowed money as 0. We also look at the gender distribution of the borrowers for further insight, and we use pie charts to visualize this data.

Figure 14: Percentage of defaulters and Gender in our data.

One useful insight is that more than 90% of the borrowers have paid back their loan, and only 8.9% of them have defaulted.
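A sketch of the pie charts behind Figure 14, assuming the target column is named TARGET (1 = defaulter, 0 = repayer) and the gender column is CODE_GENDER:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
combined_df["TARGET"].value_counts().sort_index().plot.pie(
    labels=["Repaid (0)", "Defaulted (1)"], autopct="%1.1f%%", ax=axes[0])
combined_df["CODE_GENDER"].value_counts().plot.pie(autopct="%1.1f%%", ax=axes[1])
axes[0].set_title("Defaulters vs repayers")
axes[1].set_title("Gender distribution")
plt.show()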
Next we check what types of loans people take. We look at the combined output of both files, which tells us which loans the same people applied for in the past and the present. We can infer that borrowers prefer cash loans, which account for 53.4% of the loans, and prefer revolving loans the least, at 7%.

Figure 15: Percentage of Contract Types.

We also need to check whether the borrowers own a car or a house, since borrowers with physical assets are more likely to repay the loan amount. We again split this by gender.
Here we use a notation where 1 = Male and 0 = Female for gender, and where 1 = they own a house or car and 0 = they do not own any assets.

Figure 16: Percentage of people owning a car by gender.

Figure 17: Percentage of people owning a house by gender.

From the pie charts we can infer that only 34.6% of people own a car; of those, around 57% are male and around 42% are female. For houses, a good share of people, around 69%, own a house, of which 66% are owned by females and 33% by males.

We also need to check the percentage of repayers in each of the above categories, so we count the repayers in the pie charts.

Figure 18: Percentage of people repaying in each category.

We can see that people are most likely to repay cash loans, and that females tend to repay more often than males. One more insight is that people who own a house are less likely to pay back, whereas people who own a car are more likely to pay back.

Figure 19: Percentage of people repaying in each category.

Age is also one of the main factors that help us predict whether a person will default or repay if money is lent to them. We use a histogram to analyse this hypothesis. The histogram tells us that older people are more likely to repay than younger people, possibly because slightly older people tend to have higher-income jobs while younger people have lower-paying jobs.

The last insight we need is at what loan annuity (AMT ANNUITY) and credit amount (AMT CREDIT) people start defaulting. We create a scatter plot to check the correlation; a sketch is given below, and the result is shown in Figure 20.
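A sketch of the age histogram and the credit-versus-annuity scatter plot discussed here; the conversion of DAYS_BIRTH from days to years and the column names are assumptions:

import matplotlib.pyplot as plt

# Age histogram split by outcome.
age_years = combined_df["DAYS_BIRTH"] / 365
plt.hist([age_years[combined_df["TARGET"] == 0],
          age_years[combined_df["TARGET"] == 1]],
         bins=30, label=["Repaid", "Defaulted"])
plt.xlabel("Age (years)")
plt.legend()
plt.show()

# Credit amount vs. annuity, coloured by default status (cf. Figure 20).
plt.scatter(combined_df["AMT_CREDIT"], combined_df["AMT_ANNUITY"],
            c=combined_df["TARGET"], s=2, cmap="coolwarm")
plt.xlabel("AMT_CREDIT")
plt.ylabel("AMT_ANNUITY")
plt.show()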
Figure 20: Scatter plot between Amount Credit and Annuity Amount.
One noticeable insight from the scatter plot is that people with a higher credit amount and a higher annuity to pay are more likely to repay the money, whereas people with smaller amounts on both axes are less likely to repay. This exploratory data analysis sets the background: when a loan defaults, one of the above factors, along with other factors from the dataset, can be responsible for it. We now need to train our machine learning models to predict the default and repayment rates, but before that we need to perform feature scaling on our data.
6 FEATURE SCALING AND DATA TRAINING

In this section we implement feature scaling using StandardScaler from the sklearn library. StandardScaler standardizes the data so that each feature has a mean of 0 and a standard deviation of 1. We use it because algorithms such as KNN and SVM are sensitive to features on different scales, and this difference can affect the accuracy and output of our machine learning models. Once we have imported StandardScaler and created an instance in a variable, we can call scaler.fit_transform(X_train), which fits the scaler to the training data we have passed and transforms it; the same scaler is then applied to X_test. It calculates the mean and standard deviation for all of the features we have selected.

Figure 21: Feature Scaling our Train and Test data.

Figure 24: Assigning the Scaled Data for training.

Figure 25: Assigning the Scaled Data for training.

After scaling our data we need to split it into training and testing sets and start training. The usual industry practice is to use 70 to 80% of the data for training and the remaining 20 to 30% for testing. For this we put the feature variables in X and the response variable in y.

Figure 22: Assigning X and Y variable for testing and training.

We now import the train_test_split function from the sklearn.model_selection module and use it to split the data into training and test sets.

Figure 23: train test split function.

We will use the resulting training and testing variables in our machine learning models.
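A minimal sketch of the splitting and scaling steps shown in Figures 21 to 25, assuming the cleaned dataframe combined_df with target column TARGET; the 80/20 split, random_state, and stratify settings are assumptions:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = combined_df.drop(columns=["TARGET"])   # feature variables
y = combined_df["TARGET"]                  # response variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)    # fit on the training data only
X_test = scaler.transform(X_test)          # reuse the same mean/std on the test data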
7 MACHINE LEARNING MODELS AND EVALUATION METHODS

For our experiment we are using five machine learning models, which are:

1. Support Vector Machine: This is a supervised model used for classification and regression. The support vector machine method works well with large data; in our case it helps classify the defaulters, since the classifier divides the data into two classes. We run this model with two kernels, a linear kernel and a sigmoid kernel.

Figure 26: Assigning our SVM with linear kernel.

Figure 27: Assigning our SVM with Sigmoid kernel.
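A sketch of the two SVM configurations referred to in Figures 26 and 27; the remaining hyperparameters are left at their defaults, which is an assumption:

from sklearn.svm import SVC

svm_linear = SVC(kernel="linear")     # SVM with a linear kernel
svm_sigmoid = SVC(kernel="sigmoid")   # SVM with a sigmoid kernel

svm_linear.fit(X_train, y_train)
print("Linear-kernel SVM test accuracy:", svm_linear.score(X_test, y_test))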
2. Decision Tree: Decision trees use attributes as internal nodes and classes as leaf nodes to solve problems. Their ease of use, interpretability, and capacity to handle both numerical and categorical data are among their benefits. However, decision trees are vulnerable to overfitting, which occurs when the tree is very deep and there is a lot of noise in the dataset; the purpose of pruning is to improve a tree's performance and reduce its depth. Because it offers both prediction accuracy and interpretability, two crucial factors in predicting loan applications, the decision tree is an important machine learning technique employed in this investigation, and decision trees generally produce good results for classification problems. For our decision tree we use entropy as the criterion for splitting the data, limit the tree depth to 5, and require a minimum of 5 samples per leaf node.

Figure 28: Initializing Decision Tree Classifier Model.
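A sketch matching the stated settings (entropy criterion, maximum depth 5, at least 5 samples per leaf), approximating the code in Figure 28; the random_state is an assumption:

from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier(criterion="entropy",
                                  max_depth=5,
                                  min_samples_leaf=5,
                                  random_state=42)
dt_model.fit(X_train, y_train)
print("Decision tree test accuracy:", dt_model.score(X_test, y_test))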
3. XGB Classifier: This library uses a gradient boosting framework and is a supervised model. The algorithm builds trees in parallel for ensemble learning and constructs a predictive model from the dataset; during training it minimizes a predefined loss function using gradient descent optimization. The XGBoost method handles complex data well and also helps control overfitting. We have given it certain parameters to avoid overfitting, such as the number of boosting rounds (trees in the ensemble), the maximum depth of each tree (to control overfitting), and the objective function for binary classification.

Figure 29: Initializing XGB Classifier Model.
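A sketch of an XGBoost classifier with the kinds of parameters mentioned above, approximating Figure 29; the specific values are assumptions:

from xgboost import XGBClassifier

xgb_model = XGBClassifier(n_estimators=200,            # number of boosting rounds (trees)
                          max_depth=5,                 # maximum depth of each tree
                          objective="binary:logistic", # objective for binary classification
                          eval_metric="logloss")
xgb_model.fit(X_train, y_train)
print("XGBoost test accuracy:", xgb_model.score(X_test, y_test))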
4. Random Forest: The Random Forest algorithm is a useful machine learning technique in which decision trees are produced during the training phase. A random subset of the dataset is used to construct each tree, and a subset of the features is considered at each split. This randomness improves overall prediction performance and reduces the likelihood of overfitting by adding variation among the individual trees. Random Forest can produce either regression or classification outputs. The hyperparameters are max_depth, max_features, min_samples_leaf, n_estimators, and so on; these parameters help us stabilize the model, handle the imbalanced dataset, and reduce overfitting while training the model.

Figure 30: Initializing Random Forest Classifier.
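A sketch of a random forest initialized with the hyperparameters named above (cf. Figure 30); the particular values and the class_weight choice are assumptions:

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=200,
                                  max_depth=10,
                                  max_features="sqrt",
                                  min_samples_leaf=5,
                                  class_weight="balanced",  # helps with the imbalanced target
                                  random_state=42)
rf_model.fit(X_train, y_train)
print("Random forest test accuracy:", rf_model.score(X_test, y_test))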
5. K-Nearest Neighbour: KNN is a supervised machine learning technique for classifying categorical values. Once the model has been trained on certain characteristics of the dataset, the target variables are predicted. KNN uses a categorical target variable: in essence, it is used to predict yes/no, true/false, and similar outcomes, and it is not used for predictions such as house or vehicle prices. Since it is a non-parametric method it makes no assumptions about the data, unlike, for example, logistic regression, which presupposes that the data points are linearly separable. KNN classifies a data point using the classes of its neighbouring points. We provide additional hyperparameters such as the number of nearest neighbours, the weighting scheme, the distance metric to use (e.g. Euclidean), and automatic selection of the underlying algorithm.

Figure 31: Initializing K-Nearest Neighbour.
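A sketch of the KNN configuration described above, approximating Figure 31; the value of n_neighbors and the distance weighting are assumptions:

from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=5,
                                 weights="distance",
                                 metric="euclidean",
                                 algorithm="auto")
knn_model.fit(X_train, y_train)
print("KNN test accuracy:", knn_model.score(X_test, y_test))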
After we have initialized our machine learning models with these hyperparameters, we need to train and test them to obtain their accuracy. We also need to compute the evaluation metrics for each model: accuracy, precision, recall, the ROC curve, and the confusion matrix.

1. Accuracy: This metric tells us how many correct predictions or classifications our model has made on the given dataset. We compute it on both the training and the test data for all our machine learning models.

Figure 32: Accuracy formula.

2. Confusion Matrix: This method is used with classification algorithms and summarizes the performance of the classifier on our dataset. It reports the counts of true negatives, false positives, false negatives, and true positives, which gives a clear picture of how often the classifier has classified the data correctly or incorrectly.

Figure 33: Confusion Matrix.

3. Precision: Precision measures how many of the predicted positives are actually positive (TP = true positives, FP = false positives). We want a model with high precision, which tells us that its predictions of defaulters are reliable.

Figure 34: Precision.

4. ROC Curve: This method evaluates how well the model is able to classify or separate the different classes of data points, and it is used for classification models (FP = false positives, TN = true negatives).

For all the above evaluation metrics we will use the built-in libraries provided by Python to obtain accurate results.
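The formulas behind Figures 32 and 34 are the standard ones, Accuracy = (TP + TN) / (TP + TN + FP + FN) and Precision = TP / (TP + FP). A sketch of computing these metrics with scikit-learn's built-in functions, applied here to the random forest from the earlier sketch (any of the fitted models could be substituted):

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

y_pred = rf_model.predict(X_test)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

# The ROC AUC uses the predicted probability of the positive (default) class.
y_prob = rf_model.predict_proba(X_test)[:, 1]
print("ROC AUC  :", roc_auc_score(y_test, y_prob))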
8 RESULTS