
4DATA: Data Scientist

M1 - Project (2020-2021)

PREAMBLE
This mini-project is to be solved in groups of no more than three. Any form of plagiarism, even partial, is strictly prohibited and will be
punished. You must submit a single notebook containing all of your commented Python scripts.

Context
Dream Housing Finance company deals in all home loans and has a presence across urban, semi-urban and rural areas. A customer
first applies for a home loan, after which the company validates the customer's eligibility. The company wants to automate the loan
eligibility process (in real time) based on the customer details provided in the online application form: Gender, Marital Status,
Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, they have posed the
problem of identifying the customer segments that are eligible for a loan, so that these customers can be targeted specifically.

1- Import the useful libraries


In [ ]:
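One reasonable set of imports for this assignment (a suggestion, not a required list) covers data handling, plotting and the scikit-learn components used in the pipelines below:

```python
# Typical imports for this project: pandas/numpy for data handling,
# matplotlib for plots, scikit-learn for the ML pipelines.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
```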

2- Data
For this practice problem, we have been given 2 CSV files: train and test.

The train file will be used to train the model, i.e. our model will learn from this file; it contains all the independent variables and the
target variable. The test file contains all the independent variables but not the target variable; we will apply the model to predict the
target variable for the test data.

Reading data

In [ ]:
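Reading the files comes down to `pd.read_csv`. The sketch below uses a tiny inline CSV as a stand-in so it is self-contained; the column names (`Loan_ID`, `Gender`, `ApplicantIncome`, `Loan_Status`) are assumptions about the real files:

```python
import io
import pandas as pd

# In the notebook you would read the provided files directly:
#   train = pd.read_csv("train.csv"); test = pd.read_csv("test.csv")
# Here a tiny inline CSV stands in for train.csv so the sketch runs on its own.
train_csv = (
    "Loan_ID,Gender,ApplicantIncome,Loan_Status\n"
    "LP001,Male,5849,Y\n"
    "LP002,Female,4583,N\n"
)
train = pd.read_csv(io.StringIO(train_csv))
print(train.head())  # inspect the first rows
```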

Let’s make a copy of the train and test data, so that even if we have to make changes to these datasets we do not lose the original
datasets.

In [ ]:
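The key point is that `DataFrame.copy()` is a deep copy by default, so later edits to the working DataFrame do not propagate back to the saved original:

```python
import pandas as pd

train = pd.DataFrame({"Loan_Status": ["Y", "N"]})

# copy() is deep by default: the original stays intact.
train_original = train.copy()
train.loc[0, "Loan_Status"] = "N"  # edit the working copy only
```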

In this section, we will look at the structure of the train and test datasets. First, we will check the features present in our data, and then
we will look at their data types.

In [ ]:

Print the data type of each variable.

How many rows and columns do we have in the train and test datasets?

In [ ]:
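Both questions are answered by `dtypes` and `shape`, sketched here on a miniature stand-in DataFrame (hypothetical columns):

```python
import pandas as pd

# Miniature stand-in for the train DataFrame.
train = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "ApplicantIncome": [5849, 4583],
    "Loan_Status": ["Y", "N"],
})

print(train.columns.tolist())  # features present in the data
print(train.dtypes)            # one dtype per column: object, int64, float64, ...
print(train.shape)             # (number of rows, number of columns)
```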

Univariate Analysis

We will first look at the target variable, i.e., Loan_Status. As it is a categorical variable, let us look at its frequency table, percentage
distribution and bar plot.

The frequency table of a variable gives us the count of each category in that variable.

In [ ]:
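In pandas, the frequency table and percentage distribution both come from `value_counts`; a sketch on stand-in data:

```python
import pandas as pd

train = pd.DataFrame({"Loan_Status": ["Y", "Y", "N", "Y"]})

counts = train["Loan_Status"].value_counts()                # frequency table
shares = train["Loan_Status"].value_counts(normalize=True)  # percentage distribution
# In the notebook, counts.plot.bar() would draw the bar plot.
print(counts)
print(shares * 100)
```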

Now let’s visualize each variable separately. The different variable types are categorical, ordinal and numerical.

Categorical features: features that take values in a set of categories
Ordinal features: categorical features whose categories have a natural order
Numerical features: features that take numerical values

Let’s visualize the categorical and ordinal features first.

In [ ]:

Statistical analyses for categorical features:


Give the percentage of male applicants in the dataset
Give the percentage of married applicants in the dataset
Give the percentage of self-employed applicants in the dataset
Give the percentage of applicants who have repaid their debts in the dataset
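Each of these percentages reduces to the mean of a boolean mask. The column names and encodings below (`Gender`, `Married`, `Self_Employed`, and `Credit_History == 1.0` meaning "debts repaid") are assumptions to be checked against the real train.csv:

```python
import pandas as pd

# Hypothetical column names and encodings -- adapt them to the real data.
train = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male"],
    "Married": ["Yes", "No", "Yes", "Yes"],
    "Self_Employed": ["No", "No", "Yes", "No"],
    "Credit_History": [1.0, 1.0, 0.0, 1.0],
})

# The mean of a boolean mask is the fraction of True values.
pct_male = (train["Gender"] == "Male").mean() * 100
pct_married = (train["Married"] == "Yes").mean() * 100
pct_self_employed = (train["Self_Employed"] == "Yes").mean() * 100
pct_repaid = (train["Credit_History"] == 1.0).mean() * 100
```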

Now let’s visualize the ordinal variables.

In [ ]:

Give your statistical analyses for the ordinal features:

In [ ]:

Let’s visualize the numerical data.

Plot the distribution of all numerical features:

In [ ]:
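One way to plot all numeric distributions at once is `DataFrame.hist`, sketched on stand-in columns (hypothetical names and values):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in the notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in numerical columns.
train = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 5417],
    "LoanAmount": [130.0, 128.0, 66.0, 120.0, 141.0, 267.0],
})

axes = train.hist(bins=5, figsize=(8, 3))  # one histogram per numeric column
plt.tight_layout()
plt.savefig("numeric_distributions.png")
```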

Give some statistics for the numerical features:

In [ ]:
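`describe()` is the usual one-liner here; it summarises each numeric column with count, mean, std, min, quartiles and max:

```python
import pandas as pd

# Stand-in numerical columns (hypothetical names).
train = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 6000],
    "LoanAmount": [128.0, 66.0, 120.0, 141.0],
})

stats = train.describe()  # count, mean, std, min, 25%, 50%, 75%, max
print(stats)
```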

After exploring all the variables in our data, we can now build our ML pipelines.

ML-pipeline 1 (baseline) = drop all NaN values -> drop all non-numeric features -> StandardScaler ->
LogisticRegression (LogReg)

In [ ]:
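A minimal sketch of this baseline, on synthetic stand-in data (the column names are assumptions; in the notebook the real train DataFrame goes in):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Tiny synthetic stand-in for train.csv (hypothetical columns).
train = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", None, "Female", "Male"],
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 5417],
    "LoanAmount": [130.0, 128.0, 66.0, 120.0, 141.0, 267.0],
    "Loan_Status": ["Y", "N", "Y", "Y", "Y", "N"],
})

clean = train.dropna()                     # step 1: drop rows with NaN
y = clean["Loan_Status"]
X = clean.drop(columns="Loan_Status")
X_num = X.select_dtypes(include="number")  # step 2: keep numeric features only

pipe1 = Pipeline([
    ("scale", StandardScaler()),           # step 3: standardize
    ("logreg", LogisticRegression()),      # step 4: baseline classifier
])
pipe1.fit(X_num, y)
```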

ML-pipeline 2 = drop all NaN values -> encode all non-numeric features (using sklearn.preprocessing.OneHotEncoder) ->
StandardScaler -> LogReg

In [ ]:
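One idiomatic way to combine the one-hot encoding of categorical columns with scaling of numeric ones is a `ColumnTransformer` inside the pipeline; again the columns are stand-ins:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical columns).
train = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 5417],
    "Loan_Status": ["Y", "N", "Y", "N", "Y", "N"],
}).dropna()

y = train["Loan_Status"]
X = train.drop(columns="Loan_Status")

cat_cols = X.select_dtypes(exclude="number").columns.tolist()
num_cols = X.select_dtypes(include="number").columns.tolist()

# One-hot encode categorical columns, scale numeric ones, then fit LogReg.
pre = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("scale", StandardScaler(), num_cols),
])
pipe2 = Pipeline([("prep", pre), ("logreg", LogisticRegression())])
pipe2.fit(X, y)
```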

ML-pipeline 3 = drop all NaN values -> encode all non-numeric features (using sklearn.preprocessing.OneHotEncoder) ->
StandardScaler -> KNN

In [ ]:

ML-pipeline 4 = drop all NaN values -> encode all non-numeric features -> DecisionTree

Which pipeline is the best?

In [ ]:
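Since pipelines 2-4 differ mainly in the final estimator, one way to pick the best is to cross-validate each candidate and compare mean scores. A sketch on synthetic stand-in data (real train.csv goes in here; `cv=3` and `n_neighbors=3` are chosen only to fit the tiny example):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical columns).
train = pd.DataFrame({
    "Gender": ["Male", "Female"] * 6,
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 5417,
                        2333, 3036, 4006, 12841, 3200, 2500],
    "Loan_Status": ["Y", "N"] * 6,
}).dropna()
y = train["Loan_Status"]
X = train.drop(columns="Loan_Status")

pre = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"),
     X.select_dtypes(exclude="number").columns.tolist()),
    ("scale", StandardScaler(),
     X.select_dtypes(include="number").columns.tolist()),
])

pipes = {
    "logreg": Pipeline([("prep", pre), ("clf", LogisticRegression())]),
    "knn":    Pipeline([("prep", pre), ("clf", KNeighborsClassifier(n_neighbors=3))]),
    "tree":   Pipeline([("prep", pre), ("clf", DecisionTreeClassifier(random_state=0))]),
}

# Mean cross-validated accuracy per pipeline; the highest wins.
scores = {name: cross_val_score(p, X, y, cv=3).mean() for name, p in pipes.items()}
best_name = max(scores, key=scores.get)
print(scores, "->", best_name)
```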

Apply the best pipeline to the test dataset.

In [ ]:
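The last step is to refit the winning pipeline on the full train data and predict the unlabeled test set. A compact sketch (numeric-only stand-in columns, a baseline-style pipeline standing in for whichever one won the comparison):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Stand-ins for the real train/test DataFrames (hypothetical columns).
train = pd.DataFrame({"ApplicantIncome": [5849, 4583, 3000, 6000],
                      "LoanAmount": [130.0, 128.0, 66.0, 141.0],
                      "Loan_Status": ["Y", "N", "Y", "N"]})
test = pd.DataFrame({"ApplicantIncome": [2583, 5417],
                     "LoanAmount": [120.0, 267.0]})

# Refit the selected pipeline on all labeled data...
best_pipe = Pipeline([("scale", StandardScaler()),
                      ("clf", LogisticRegression())])
best_pipe.fit(train.drop(columns="Loan_Status"), train["Loan_Status"])

# ...then predict loan eligibility for the test rows.
test["Loan_Status"] = best_pipe.predict(test)
```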

In [ ]:
