
4DATA: Data Scientist

M1 - Project (2020-2021)

PREAMBLE
This mini-project is to be solved in groups of no more than three. Any form of plagiarism, even partial, is strictly prohibited and will be
punished. You must submit a single notebook containing all of your commented Python scripts.

Context
Dream Housing Finance company deals in all home loans and has a presence across urban, semi-urban and rural areas. A customer
first applies for a home loan, after which the company validates the customer's eligibility. The company wants to automate the loan
eligibility process (in real time) based on the customer details provided in the online application form: Gender, Marital Status,
Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, they have posed the
problem of identifying the customer segments that are eligible for a loan, so that these customers can be targeted specifically.

1- Import the useful libraries


In [ ]:
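One reasonable set of imports for this assignment (a suggestion, not a required list) covers data handling, plotting and the scikit-learn components used in the pipelines below:

```python
# Typical imports for this project: pandas/numpy for data handling,
# matplotlib for plots, scikit-learn for the ML pipelines.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
```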

2- Data
For this practice problem, we have been given 2 CSV files: train and test.

The train file will be used to train the model, i.e. our model will learn from this file; it contains all the independent variables and the
target variable. The test file contains all the independent variables but not the target variable; we will apply the model to predict the
target variable for the test data.

Reading data

In [ ]:
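Reading the files comes down to `pd.read_csv`. The sketch below uses a tiny inline CSV as a stand-in so it is self-contained; the column names (`Loan_ID`, `Gender`, `ApplicantIncome`, `Loan_Status`) are assumptions about the real files:

```python
import io
import pandas as pd

# In the notebook you would read the provided files directly:
#   train = pd.read_csv("train.csv"); test = pd.read_csv("test.csv")
# Here a tiny inline CSV stands in for train.csv so the sketch runs on its own.
train_csv = (
    "Loan_ID,Gender,ApplicantIncome,Loan_Status\n"
    "LP001,Male,5849,Y\n"
    "LP002,Female,4583,N\n"
)
train = pd.read_csv(io.StringIO(train_csv))
print(train.head())  # inspect the first rows
```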

Let’s make a copy of the train and test data, so that even if we have to make changes to these datasets we do not lose the original
datasets.

In [ ]:
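The key point is that `DataFrame.copy()` is a deep copy by default, so later edits to the working DataFrame do not propagate back to the saved original:

```python
import pandas as pd

train = pd.DataFrame({"Loan_Status": ["Y", "N"]})

# copy() is deep by default: the original stays intact.
train_original = train.copy()
train.loc[0, "Loan_Status"] = "N"  # edit the working copy only
```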

In this section, we will look at the structure of the train and test datasets. First, we will check the features present in our data, and then
we will look at their data types.

In [ ]:

Print the data type of each variable.

How many rows and columns do we have in the train and test datasets?

In [ ]:
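Both questions are answered by `dtypes` and `shape`, sketched here on a miniature stand-in DataFrame (hypothetical columns):

```python
import pandas as pd

# Miniature stand-in for the train DataFrame.
train = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "ApplicantIncome": [5849, 4583],
    "Loan_Status": ["Y", "N"],
})

print(train.columns.tolist())  # features present in the data
print(train.dtypes)            # one dtype per column: object, int64, float64, ...
print(train.shape)             # (number of rows, number of columns)
```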

Univariate Analysis

We will first look at the target variable, i.e., Loan_Status. As it is a categorical variable, let us look at its frequency table, percentage
distribution and bar plot.

The frequency table of a variable gives us the count of each category in that variable.

In [ ]:
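In pandas, the frequency table and percentage distribution both come from `value_counts`; a sketch on stand-in data:

```python
import pandas as pd

train = pd.DataFrame({"Loan_Status": ["Y", "Y", "N", "Y"]})

counts = train["Loan_Status"].value_counts()                # frequency table
shares = train["Loan_Status"].value_counts(normalize=True)  # percentage distribution
# In the notebook, counts.plot.bar() would draw the bar plot.
print(counts)
print(shares * 100)
```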

Now let’s visualize each variable separately. The different variable types are categorical, ordinal and numerical.

Categorical features: features that take values in a set of categories
Ordinal features: categorical features whose categories have a natural order
Numerical features: features that take numerical values

Let’s visualize the categorical and ordinal features first.

In [ ]:

Statistical analyses for categorical features:


Give the percentage of male applicants in the dataset
Give the percentage of married applicants in the dataset
Give the percentage of self-employed applicants in the dataset
Give the percentage of applicants who have repaid their debts in the dataset
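Each of these percentages reduces to the mean of a boolean mask. The column names and encodings below (`Gender`, `Married`, `Self_Employed`, and `Credit_History == 1.0` meaning "debts repaid") are assumptions to be checked against the real train.csv:

```python
import pandas as pd

# Hypothetical column names and encodings -- adapt them to the real data.
train = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Male"],
    "Married": ["Yes", "No", "Yes", "Yes"],
    "Self_Employed": ["No", "No", "Yes", "No"],
    "Credit_History": [1.0, 1.0, 0.0, 1.0],
})

# The mean of a boolean mask is the fraction of True values.
pct_male = (train["Gender"] == "Male").mean() * 100
pct_married = (train["Married"] == "Yes").mean() * 100
pct_self_employed = (train["Self_Employed"] == "Yes").mean() * 100
pct_repaid = (train["Credit_History"] == 1.0).mean() * 100
```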

Now let’s visualize the ordinal variables.

In [ ]:

Give your statistical analyses for the ordinal features:

In [ ]:

Let’s visualize the numerical data.

Plot the distribution of all numerical features:

In [ ]:
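One way to plot all numeric distributions at once is `DataFrame.hist`, sketched on stand-in columns (hypothetical names and values):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in the notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in numerical columns.
train = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 5417],
    "LoanAmount": [130.0, 128.0, 66.0, 120.0, 141.0, 267.0],
})

axes = train.hist(bins=5, figsize=(8, 3))  # one histogram per numeric column
plt.tight_layout()
plt.savefig("numeric_distributions.png")
```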

Give some statistics for the numerical features:

In [ ]:
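`describe()` is the usual one-liner here; it summarises each numeric column with count, mean, std, min, quartiles and max:

```python
import pandas as pd

# Stand-in numerical columns (hypothetical names).
train = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 6000],
    "LoanAmount": [128.0, 66.0, 120.0, 141.0],
})

stats = train.describe()  # count, mean, std, min, 25%, 50%, 75%, max
print(stats)
```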

After exploring all the variables in our data, we can now build our ML pipelines.

ML-pipeline 1 (baseline) = drop all NaN values -> drop all non-numeric features -> StandardScaler ->
LogisticRegression (LogReg)

In [ ]:
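A minimal sketch of this baseline, on synthetic stand-in data (the column names are assumptions; in the notebook the real train DataFrame goes in):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Tiny synthetic stand-in for train.csv (hypothetical columns).
train = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", None, "Female", "Male"],
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 5417],
    "LoanAmount": [130.0, 128.0, 66.0, 120.0, 141.0, 267.0],
    "Loan_Status": ["Y", "N", "Y", "Y", "Y", "N"],
})

clean = train.dropna()                     # step 1: drop rows with NaN
y = clean["Loan_Status"]
X = clean.drop(columns="Loan_Status")
X_num = X.select_dtypes(include="number")  # step 2: keep numeric features only

pipe1 = Pipeline([
    ("scale", StandardScaler()),           # step 3: standardize
    ("logreg", LogisticRegression()),      # step 4: baseline classifier
])
pipe1.fit(X_num, y)
```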

ML-pipeline 2 = drop all NaN values -> encode all non-numeric features (using sklearn.preprocessing.OneHotEncoder) ->
StandardScaler -> LogReg

In [ ]:
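One idiomatic way to combine the one-hot encoding of categorical columns with scaling of numeric ones is a `ColumnTransformer` inside the pipeline; again the columns are stand-ins:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical columns).
train = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 5417],
    "Loan_Status": ["Y", "N", "Y", "N", "Y", "N"],
}).dropna()

y = train["Loan_Status"]
X = train.drop(columns="Loan_Status")

cat_cols = X.select_dtypes(exclude="number").columns.tolist()
num_cols = X.select_dtypes(include="number").columns.tolist()

# One-hot encode categorical columns, scale numeric ones, then fit LogReg.
pre = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("scale", StandardScaler(), num_cols),
])
pipe2 = Pipeline([("prep", pre), ("logreg", LogisticRegression())])
pipe2.fit(X, y)
```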

ML-pipeline 3 = drop all NaN values -> encode all non-numeric features (using sklearn.preprocessing.OneHotEncoder) ->
StandardScaler -> KNN

In [ ]:

ML-pipeline 4 = drop all NaN values -> encode all non-numeric features -> DecisionTree

Which pipeline is the best?

In [ ]:
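Since pipelines 2-4 differ mainly in the final estimator, one way to pick the best is to cross-validate each candidate and compare mean scores. A sketch on synthetic stand-in data (real train.csv goes in here; `cv=3` and `n_neighbors=3` are chosen only to fit the tiny example):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical columns).
train = pd.DataFrame({
    "Gender": ["Male", "Female"] * 6,
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 5417,
                        2333, 3036, 4006, 12841, 3200, 2500],
    "Loan_Status": ["Y", "N"] * 6,
}).dropna()
y = train["Loan_Status"]
X = train.drop(columns="Loan_Status")

pre = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"),
     X.select_dtypes(exclude="number").columns.tolist()),
    ("scale", StandardScaler(),
     X.select_dtypes(include="number").columns.tolist()),
])

pipes = {
    "logreg": Pipeline([("prep", pre), ("clf", LogisticRegression())]),
    "knn":    Pipeline([("prep", pre), ("clf", KNeighborsClassifier(n_neighbors=3))]),
    "tree":   Pipeline([("prep", pre), ("clf", DecisionTreeClassifier(random_state=0))]),
}

# Mean cross-validated accuracy per pipeline; the highest wins.
scores = {name: cross_val_score(p, X, y, cv=3).mean() for name, p in pipes.items()}
best_name = max(scores, key=scores.get)
print(scores, "->", best_name)
```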

Apply the best pipeline to the test dataset.

In [ ]:
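The last step is to refit the winning pipeline on the full train data and predict the unlabeled test set. A compact sketch (numeric-only stand-in columns, a baseline-style pipeline standing in for whichever one won the comparison):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Stand-ins for the real train/test DataFrames (hypothetical columns).
train = pd.DataFrame({"ApplicantIncome": [5849, 4583, 3000, 6000],
                      "LoanAmount": [130.0, 128.0, 66.0, 141.0],
                      "Loan_Status": ["Y", "N", "Y", "N"]})
test = pd.DataFrame({"ApplicantIncome": [2583, 5417],
                     "LoanAmount": [120.0, 267.0]})

# Refit the selected pipeline on all labeled data...
best_pipe = Pipeline([("scale", StandardScaler()),
                      ("clf", LogisticRegression())])
best_pipe.fit(train.drop(columns="Loan_Status"), train["Loan_Status"])

# ...then predict loan eligibility for the test rows.
test["Loan_Status"] = best_pipe.predict(test)
```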

In [ ]:
