Attrition Project Mangal

This project report outlines the development of an Employee Attrition Prediction Model using machine learning techniques to identify factors contributing to employee turnover. The model aims to provide actionable insights for HR teams to improve retention strategies by analyzing historical employee data. Key steps include data collection, preprocessing, model selection, and evaluation, ultimately demonstrating the model's effectiveness in predicting attrition and informing HR decision-making.

Institute of Technology and Management, Meerut
(Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Lucknow)
College Code 285
Session 2024-25

Department of Computer Science

Project Report
On
Employee Attrition Prediction Model
Using Machine Learning

Submitted By: Mangleshwar Pratap
B.Tech (CSE) 4th Year
Roll no. 2102850100008

Submitted To: Dr. P K Vashistha

Acknowledgement

I would like to express my sincere gratitude to everyone
who contributed to the successful completion of this
project, "Employee Attrition Prediction using Machine
Learning."
I am particularly thankful for the support of my friends,
whose encouragement and motivation kept me focused
throughout this journey.
Additionally, I am grateful to the open-source community
and online platforms for providing invaluable resources,
datasets, and documentation that served as the
foundation for this work.
This project is the result of dedication and effort, and I
truly appreciate all the assistance and resources that
made it possible.
Project Description

Overview:
This project aims to leverage machine learning techniques
to predict employee attrition and identify the factors
contributing to it. Employee attrition is a significant
challenge for organizations, leading to increased costs for
recruitment, onboarding, and training, as well as
disruptions to workflow and morale. By analyzing historical
employee data, the project seeks to provide actionable
insights to reduce turnover rates and improve employee
retention strategies.
Objectives:
1. To develop a machine learning model capable of
accurately predicting whether an employee is likely to
leave the organization.
2. To identify key factors influencing attrition, such as job
satisfaction, work-life balance, compensation, and
career growth opportunities.
3. To provide insights that can guide human resource
teams in making data-driven decisions to improve
employee engagement and retention.
Problem Statement
Employee attrition is a critical challenge faced by
organizations across industries. High attrition rates lead to
increased costs associated with recruitment, onboarding,
and training, as well as disruptions in workflow, team
dynamics, and overall productivity. Identifying the
underlying reasons for employee turnover and predicting
potential attrition are essential for designing effective
retention strategies.
Despite the availability of HR data, many organizations
struggle to leverage it effectively for proactive decision-
making. Traditional methods of analyzing attrition are often
time-consuming, lack accuracy, and fail to identify subtle
patterns that contribute to employee dissatisfaction and
eventual resignation.
The goal of this project is to develop a machine learning-
based solution that can:
1. Accurately predict whether an employee is likely to
leave the organization.
2. Identify and rank the factors that contribute to
attrition.
3. Provide actionable insights for human resource
teams to implement targeted interventions and
reduce turnover rates.
Dataset Info
Possible Sources:
• Kaggle Dataset
Key Features of the Dataset
1. Age: Employee's age.
2. Attrition: Target variable indicating whether the employee left the
organization.
3. BusinessTravel: Frequency of business travel (e.g., Rarely,
Frequently).
4. DailyRate / MonthlyIncome: Financial metrics related to employee
salary.
5. Department: Department of the employee (e.g., Sales, HR, R&D).
6. DistanceFromHome: Commute distance from home to work.
7. Education & EducationField: Employee’s education level and field of
study.
8. EnvironmentSatisfaction: Satisfaction with the work environment (1–
4 scale).
9. Gender: Gender of the employee.
10. JobRole & JobLevel: Role and position level in the organization.
11. JobSatisfaction: Satisfaction with the job itself (1–4 scale).
12. MaritalStatus: Employee’s marital status.
13. OverTime: Whether the employee works overtime.
14. WorkLifeBalance: Perception of work-life balance (1–4 scale).
15. YearsAtCompany / TotalWorkingYears: Employee’s tenure and
total work experience.
16. YearsSinceLastPromotion: Time since the last promotion.
17. TrainingTimesLastYear: Number of training sessions attended in
the past year.

Steps to Build the Model


1. Problem Understanding
• Objective: Predict whether an employee will leave the company or not based
on historical data.
• Business Goal: Minimize attrition and optimize retention strategies by
understanding key predictors.
2. Data Collection
• Gather data that can include:
o Employee demographics (age, gender, marital status)
o Job role and department
o Work environment factors (satisfaction, work-life balance)
o Compensation and benefits
o Performance data
o Historical data on employees who left vs. stayed
• Typically, datasets such as the "IBM HR Analytics Employee Attrition &
Performance" dataset can be useful.
3. Data Exploration and Preprocessing
• Exploratory Data Analysis (EDA): Visualize distributions, correlations, and
basic statistics to understand the data.
• Handle Missing Values: Fill or drop missing data depending on the situation
(e.g., using mean imputation or dropping rows with too many missing
values).
• Feature Engineering:
o Convert categorical variables into numerical values using encoding
methods like one-hot encoding or label encoding.
o Create new features if necessary (e.g., creating a "tenure category"
based on years of service).
• Scaling and Normalization: Scale numerical features using methods like
MinMaxScaler or StandardScaler if necessary, especially for algorithms
sensitive to feature scaling like SVM or k-NN.
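The encoding and scaling described in step 3 could be sketched as follows. This is a minimal illustration, not the report's actual preprocessing code: the five rows are a small subset copied from the `df.head()` output in the notebook section, and the use of `ColumnTransformer` with `OneHotEncoder` and `StandardScaler` is an assumption about how the step might be wired together.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative subset: four columns of the first five dataset rows.
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "DailyRate": [1102, 279, 1373, 1392, 591],
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Travel_Rarely",
                       "Travel_Frequently", "Travel_Rarely"],
    "Department": ["Sales", "Research & Development", "Research & Development",
                   "Research & Development", "Research & Development"],
})

numeric_cols = ["Age", "DailyRate"]
categorical_cols = ["BusinessTravel", "Department"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),                             # zero mean, unit variance
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),   # one column per category
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(toy)
print(X.shape)  # 2 scaled numeric columns + 2 + 2 one-hot columns
```

Fitting the transformer on the training split only, and then applying it to the test split, keeps test-set statistics from leaking into the scaler.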
4. Feature Selection
• Use correlation matrices, feature importance (via tree-based models), or
recursive feature elimination to identify the most relevant features.
• Remove irrelevant or highly correlated features that may lead to overfitting.
5. Model Selection
• Train-test Split: Split your dataset into training and test sets (usually 70%-80%
for training, 20%-30% for testing).
• Model Selection: Start with a few machine learning algorithms:
o Logistic Regression (for binary classification)
o Decision Trees / Random Forests (to capture non-linear relationships)
o Support Vector Machine (SVM)
o Gradient Boosting Methods (XGBoost, LightGBM)
o Neural Networks (if you have large data)
• For classification problems, consider using cross-validation for robust
performance estimation.
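A stratified split followed by cross-validation, as suggested in step 5, might look like the sketch below. The data here is synthetic (generated with `make_classification` to mimic a roughly 16% positive class); the 80/20 split, five folds, and ROC AUC scoring are illustrative choices rather than settings taken from the report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the HR data: 1,000 rows, imbalanced binary target.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.84, 0.16], random_state=42)

# Stratify so both splits keep approximately the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold stratified cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_train, y_train, cv=cv, scoring="roc_auc")

print(len(X_train), len(X_test))
print(scores.mean())
```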
6. Model Training
• Train your selected models on the training dataset.
• For models like Random Forests or XGBoost, tune hyperparameters using
GridSearchCV or RandomizedSearchCV to optimize model performance.
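The hyperparameter search mentioned in step 6 can be sketched with `GridSearchCV`. The parameter grid, the F1 scoring choice, and the synthetic data are assumptions made for this example; the report does not state which grid was searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the search runs quickly.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid: 2 x 2 = 4 candidate configurations.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="f1")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

`RandomizedSearchCV` takes the same estimator and a parameter distribution, and is usually the cheaper option once the grid grows beyond a few dozen combinations.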
7. Model Evaluation
• Confusion Matrix: To evaluate accuracy, precision, recall, and F1-score.
• ROC Curve & AUC: To measure the performance in terms of true positive rate
vs. false positive rate.
• Cross-Validation: Validate the model using K-fold cross-validation to ensure
robustness and minimize overfitting.
• Feature Importance: Analyze which features are driving the model’s
predictions.
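To make the relationship between the confusion matrix and the metrics in step 7 concrete, the sketch below derives accuracy, precision, recall, and F1 from the four cell counts. The counts passed in are hypothetical and are not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive common metrics from confusion-matrix counts (positive class = attrition)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for a 300-employee test set.
m = classification_metrics(tp=30, fp=10, fn=20, tn=240)
print(m)
```

With these made-up counts, accuracy is 0.90 while recall is only 0.60, which shows how a model can look accurate overall and still miss many of the employees who leave.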
8. Model Interpretation
• Use model interpretation techniques like SHAP (Shapley Additive
Explanations) to understand how different features influence model
predictions.
• This can be valuable for business stakeholders to explain model decisions.
9. Model Deployment
• Once the model performs well on the test set, you can deploy it into a
production environment where it can predict attrition for new employees.
• Integrate the model into an HR system or dashboard where the business can
monitor predictions and take action.
10. Monitoring and Maintenance
• Continuously monitor the model’s performance. As new data is added, the
model’s accuracy might change, and it may require retraining.
• Set up a retraining pipeline if the data distribution changes over time.
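One simple way to notice that the data distribution has changed (step 10) is to compare the histogram of a feature at training time with the histogram seen in production. The sketch below uses the Population Stability Index for this; PSI is one common choice and is an assumption here, since the report does not name a drift metric. The training counts are the OverTime counts reported in the profile output; the "new" counts are hypothetical.

```python
import math

def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between two histograms defined over the same bins."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Clamp shares away from zero so the logarithm is always defined.
        e_share = max(e / e_total, eps)
        a_share = max(a / a_total, eps)
        psi += (a_share - e_share) * math.log(a_share / e_share)
    return psi

train_overtime = [1054, 416]  # OverTime No / Yes counts from the training data
new_overtime = [600, 400]     # hypothetical counts from a later period

psi = population_stability_index(train_overtime, new_overtime)
print(round(psi, 4))
```

A PSI of zero means the two distributions match; how large a value should trigger retraining is a judgment call for the team operating the model.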
Conclusion
In this project, we developed a machine learning model to
predict employee attrition, offering valuable insights into
the factors that influence an employee's decision to leave
the organization. By analyzing historical employee data, we
identified key predictors such as job satisfaction,
compensation, tenure, and work-life balance, which
significantly impact attrition rates.
The process involved data collection, preprocessing, feature
engineering, and model selection. Several machine learning
algorithms, including Logistic Regression, Decision Trees,
and Random Forests, were tested to determine the most
effective model. After fine-tuning hyperparameters and
evaluating the models using metrics like accuracy, precision,
recall, and AUC, the best-performing model was identified.
The model's performance showed promising results,
accurately predicting which employees are at risk of leaving.
Feature importance analysis highlighted the role of job
satisfaction, salary, and tenure in the predictions, providing
actionable insights for HR teams. These insights can help
companies identify at-risk employees early and implement
targeted retention strategies.
Key takeaways from the project include:
• Accurate Prediction: The machine learning model
successfully predicted employee attrition with a high
degree of accuracy, providing HR with a useful tool to
forecast turnover.
• Business Application: The model’s insights can help HR
departments develop strategies to improve employee
retention, such as addressing dissatisfaction or offering
career advancement opportunities.
• Model Interpretability: Using techniques like SHAP, we
ensured the model’s predictions could be explained,
making it easier for HR professionals to understand and
trust the results.
Ultimately, this project demonstrates how machine learning
can be leveraged to predict employee attrition and improve
HR decision-making. With further refinements and regular
updates, the model can continue to provide valuable
support to organizations in reducing turnover and
enhancing employee engagement.
Employee Attrition Prediction Using
Machine Learning
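In [1]: import math, time, random, datetime

# data analysis and wrangling
import pandas as pd
import numpy as np
# Note: pandas_profiling has since been renamed ydata-profiling; on newer
# installs the equivalent import is `from ydata_profiling import ProfileReport`.
from pandas_profiling import ProfileReport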

In [2]: # visualization
import seaborn as sns
import matplotlib.pyplot as plt
# Note: 'seaborn-whitegrid' was renamed 'seaborn-v0_8-whitegrid' in Matplotlib 3.6;
# use the newer name if this line raises an error.
plt.style.use('seaborn-whitegrid')

# imports for interactive plotting
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
%matplotlib inline

In [3]: # Preprocessing
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, label_binarize, StandardScaler

In [4]: # machine learning


from sklearn import model_selection, tree, preprocessing, metrics, linear_model
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron,SGDClassifier,LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split,StratifiedKFold, GridSearchCV, learning_curve, cross_val_score
from catboost import CatBoostClassifier, Pool, cv

In [5]: # ignore Warnings


import warnings
warnings.filterwarnings('ignore')

Import and Inspect Data


In [6]: df = pd.read_csv("Data/WA_Fn-UseC_-HR-Employee-Attrition.csv")

In [7]: df.head()

Out[7]:
   Age  Attrition  BusinessTravel     DailyRate  Department              DistanceFromHome  Education  EducationField  EmployeeCount  Employe
0  41   Yes        Travel_Rarely      1102       Sales                   1                 2          Life Sciences   1
1  49   No         Travel_Frequently  279        Research & Development  8                 1          Life Sciences   1
2  37   Yes        Travel_Rarely      1373       Research & Development  2                 2          Other           1
3  33   No         Travel_Frequently  1392       Research & Development  3                 4          Life Sciences   1
4  27   No         Travel_Rarely      591        Research & Development  2                 1          Medical         1

5 rows × 35 columns

In [8]: df.shape

Out[8]: (1470, 35)


Exploratory Data Analysis
Job level is strongly correlated with total working years.
Monthly income is strongly correlated with job level.
Monthly income is strongly correlated with total working years.
Age is strongly correlated with monthly income.
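Observations like these can be checked by listing every pair of numeric columns whose Pearson correlation exceeds a threshold. The helper below is a hypothetical sketch (the name `strong_pairs` and the 0.9 cutoff are assumptions), and the toy values are made up for illustration; on the real data it would be called on the loaded DataFrame `df`.

```python
import pandas as pd

def strong_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for numeric column pairs with |Pearson r| >= threshold."""
    corr = frame.select_dtypes("number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:          # upper triangle only, so each pair appears once
            r = float(corr.loc[a, b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Made-up values for illustration only; these are not rows from the dataset.
toy = pd.DataFrame({
    "JobLevel": [1, 2, 3, 4, 5],
    "MonthlyIncome": [2000, 4000, 6500, 9000, 15000],
    "DistanceFromHome": [5, 1, 9, 2, 7],
})

pairs = strong_pairs(toy)
print(pairs)
```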

In [9]: ProfileReport(df)
Out[9]:

Overview

Dataset info

Number of variables 35
Number of observations 1470
Total Missing (%) 0.0%
Total size in memory 402.1 KiB
Average record size in memory 280.1 B
Variables types

Numeric 22
Categorical 8
Boolean 1
Date 0
Text (Unique) 0
Rejected 4
Unsupported 0

Warnings

EmployeeCount has constant value 1 [Rejected]
MonthlyIncome is highly correlated with JobLevel (ρ = 0.9503) [Rejected]
NumCompaniesWorked has 197 / 13.4% zeros [Zeros]
Over18 has constant value Y [Rejected]
StandardHours has constant value 80 [Rejected]
StockOptionLevel has 631 / 42.9% zeros [Zeros]
TrainingTimesLastYear has 54 / 3.7% zeros [Zeros]
YearsAtCompany has 44 / 3.0% zeros [Zeros]
YearsInCurrentRole has 244 / 16.6% zeros [Zeros]
YearsSinceLastPromotion has 581 / 39.5% zeros [Zeros]
YearsWithCurrManager has 263 / 17.9% zeros [Zeros]
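The constant-value warnings above suggest dropping those columns before modelling, since a column with a single value carries no information. A minimal sketch, assuming a pandas DataFrame; the three-row frame below is a toy stand-in that reuses the constant values reported in the warnings.

```python
import pandas as pd

# Toy frame: Age varies, the other three columns are constant as in the warnings.
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
})

# A column is constant when it has exactly one distinct value.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
cleaned = toy.drop(columns=constant_cols)

print(constant_cols)
print(list(cleaned.columns))
```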

Variables

Age
Numeric

Distinct count 43
Unique (%) 2.9%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 36.924
Minimum 18
Maximum 60
Zeros (%) 0.0%


Quantile statistics

Minimum 18
5-th percentile 24
Q1 30
Median 36
Q3 43
95-th percentile 54
Maximum 60
Range 42
Interquartile range 13
Descriptive statistics

Standard deviation 9.1354


Coef of variation 0.24741
Kurtosis -0.40415
Mean 36.924
MAD 7.4098
Skewness 0.41329
Sum 54278
Variance 83.455
Memory size 11.6 KiB

Value Count Frequency (%)
35 78 5.3%
34 77 5.2%
31 69 4.7%
36 69 4.7%
29 68 4.6%
32 61 4.1%
30 60 4.1%
33 58 3.9%
38 58 3.9%
40 57 3.9%
Other values (33) 815 55.4%

Minimum 5 values

Value Count Frequency (%)
18 8 0.5%
19 9 0.6%
20 11 0.7%
21 13 0.9%
22 16 1.1%

Maximum 5 values

Value Count Frequency (%)
56 14 1.0%
57 4 0.3%
58 14 1.0%
59 10 0.7%
60 5 0.3%
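The Age summary statistics can be cross-checked against one another with a few lines of arithmetic. The sketch below uses only the figures printed in the profile; small differences are expected because the report rounds its values.

```python
# Figures copied from the Age section of the profile report.
age = {"min": 18, "q1": 30, "q3": 43, "max": 60,
       "mean": 36.924, "std": 9.1354, "n": 1470}

value_range = age["max"] - age["min"]        # reported Range: 42
iqr = age["q3"] - age["q1"]                  # reported Interquartile range: 13
coef_variation = age["std"] / age["mean"]    # reported Coef of variation: 0.24741
variance = age["std"] ** 2                   # reported Variance: 83.455
total = age["mean"] * age["n"]               # reported Sum: 54278

print(value_range, iqr)
print(round(coef_variation, 5), round(variance, 2), round(total))
```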

Attrition
Categorical
Distinct count 2
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0

No 1233
Yes 237

Value Count Frequency (%)
No 1233 83.9%
Yes 237 16.1%
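These counts show that the target is imbalanced, which matters when reading accuracy figures. As a quick worked check (plain arithmetic on the counts above, not part of the original notebook), a model that always predicts "No" would already be right for every employee who stayed:

```python
stayed, left = 1233, 237  # "No" and "Yes" counts from the Attrition summary
total = stayed + left

attrition_rate = left / total
majority_baseline_accuracy = stayed / total  # accuracy of always predicting "No"

print(total)                                 # 1470
print(f"{attrition_rate:.1%}")               # 16.1%
print(f"{majority_baseline_accuracy:.1%}")   # 83.9%
```

Any reported accuracy should therefore be compared against this 83.9% baseline, alongside precision, recall, and AUC for the "Yes" class.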

BusinessTravel
Categorical

Distinct count 3
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0

Travel_Rarely 1043
Travel_Frequently 277
Non-Travel 150

Value Count Frequency (%)
Travel_Rarely 1043 71.0%
Travel_Frequently 277 18.8%
Non-Travel 150 10.2%

DailyRate
Numeric

Distinct count 886


Unique (%) 60.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 802.49
Minimum 102
Maximum 1499
Zeros (%) 0.0%


Quantile statistics

Minimum 102
5-th percentile 165.35
Q1 465
Median 802
Q3 1157
95-th percentile 1424.1
Maximum 1499
Range 1397
Interquartile range 692
Descriptive statistics

Standard deviation 403.51


Coef of variation 0.50282
Kurtosis -1.2038
Mean 802.49
MAD 350.25
Skewness -0.0035186
Sum 1179654
Variance 162820
Memory size 11.6 KiB

Value Count Frequency (%)
691 6 0.4%
1082 5 0.3%
329 5 0.3%
1329 5 0.3%
530 5 0.3%
408 5 0.3%
715 4 0.3%
589 4 0.3%
906 4 0.3%
350 4 0.3%
Other values (876) 1423 96.8%

Minimum 5 values

Value Count Frequency (%)
102 1 0.1%
103 1 0.1%
104 1 0.1%
105 1 0.1%
106 1 0.1%

Maximum 5 values

Value Count Frequency (%)
1492 1 0.1%
1495 3 0.2%
1496 2 0.1%
1498 1 0.1%
1499 1 0.1%

Department
Categorical

Distinct count 3
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
Research & Development 961
Sales 446
Human Resources 63

Value Count Frequency (%)
Research & Development 961 65.4%
Sales 446 30.3%
Human Resources 63 4.3%

DistanceFromHome
Numeric

Distinct count 29
Unique (%) 2.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 9.1925
Minimum 1
Maximum 29
Zeros (%) 0.0%


Quantile statistics

Minimum 1
5-th percentile 1
Q1 2
Median 7
Q3 14
95-th percentile 26
Maximum 29
Range 28
Interquartile range 12

Descriptive statistics

Standard deviation 8.1069


Coef of variation 0.8819
Kurtosis -0.22483
Mean 9.1925
MAD 6.5727
Skewness 0.95812
Sum 13513
Variance 65.721
Memory size 11.6 KiB

Value Count Frequency (%)
2 211 14.4%
1 208 14.1%
10 86 5.9%
9 85 5.8%
3 84 5.7%
7 84 5.7%
8 80 5.4%
5 65 4.4%
4 64 4.4%
6 59 4.0%
Other values (19) 444 30.2%

Minimum 5 values

Value Count Frequency (%)
1 208 14.1%
2 211 14.4%
3 84 5.7%
4 64 4.4%
5 65 4.4%

Maximum 5 values

Value Count Frequency (%)
25 25 1.7%
26 25 1.7%
27 12 0.8%
28 23 1.6%
29 27 1.8%

Education
Numeric

Distinct count 5
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 2.9129
Minimum 1
Maximum 5
Zeros (%) 0.0%


Quantile statistics

Minimum 1
5-th percentile 1
Q1 2
Median 3
Q3 4
95-th percentile 4
Maximum 5
Range 4
Interquartile range 2

Descriptive statistics

Standard deviation 1.0242


Coef of variation 0.35159
Kurtosis -0.55911
Mean 2.9129
MAD 0.79271
Skewness -0.28968
Sum 4282
Variance 1.0489
Memory size 11.6 KiB

Value Count Frequency (%)
3 572 38.9%
4 398 27.1%
2 282 19.2%
1 170 11.6%
5 48 3.3%

Minimum 5 values

Value Count Frequency (%)
1 170 11.6%
2 282 19.2%
3 572 38.9%
4 398 27.1%
5 48 3.3%

Maximum 5 values

Value Count Frequency (%)
1 170 11.6%
2 282 19.2%
3 572 38.9%
4 398 27.1%
5 48 3.3%
EducationField
Categorical

Distinct count 6
Unique (%) 0.4%
Missing (%) 0.0%
Missing (n) 0

Life Sciences 606


Medical 464
Marketing 159
Other values (3) 241

Value Count Frequency (%)
Life Sciences 606 41.2%
Medical 464 31.6%
Marketing 159 10.8%
Technical Degree 132 9.0%
Other 82 5.6%
Human Resources 27 1.8%

EmployeeCount
Constant

This variable is constant and should be ignored for analysis

Constant value 1

EmployeeNumber
Numeric

Distinct count 1470


Unique (%) 100.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 1024.9
Minimum 1
Maximum 2068
Zeros (%) 0.0%


Quantile statistics

Minimum 1
5-th percentile 96.45
Q1 491.25
Median 1020.5
Q3 1555.8
95-th percentile 1967.5
Maximum 2068
Range 2067
Interquartile range 1064.5
Descriptive statistics

Standard deviation 602.02


Coef of variation 0.58742
Kurtosis -1.2232
Mean 1024.9
MAD 522.41
Skewness 0.016574
Sum 1506552
Variance 362430
Memory size 11.6 KiB

Value Count Frequency (%)
2046 1 0.1%
641 1 0.1%
644 1 0.1%
645 1 0.1%
647 1 0.1%
648 1 0.1%
649 1 0.1%
650 1 0.1%
652 1 0.1%
653 1 0.1%
Other values (1460) 1460 99.3%

Minimum 5 values

Value Count Frequency (%)
1 1 0.1%
2 1 0.1%
4 1 0.1%
5 1 0.1%
7 1 0.1%

Maximum 5 values

Value Count Frequency (%)
2061 1 0.1%
2062 1 0.1%
2064 1 0.1%
2065 1 0.1%
2068 1 0.1%

EnvironmentSatisfaction
Numeric

Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 2.7218
Minimum 1
Maximum 4
Zeros (%) 0.0%


Quantile statistics

Minimum 1
5-th percentile 1
Q1 2
Median 3
Q3 4
95-th percentile 4
Maximum 4
Range 3
Interquartile range 2

Descriptive statistics

Standard deviation 1.0931


Coef of variation 0.40161
Kurtosis -1.2025
Mean 2.7218
MAD 0.94712
Skewness -0.32165
Sum 4001
Variance 1.1948
Memory size 11.6 KiB

Value Count Frequency (%)
3 453 30.8%
4 446 30.3%
2 287 19.5%
1 284 19.3%

Minimum 5 values

Value Count Frequency (%)
1 284 19.3%
2 287 19.5%
3 453 30.8%
4 446 30.3%

Maximum 5 values

Value Count Frequency (%)
1 284 19.3%
2 287 19.5%
3 453 30.8%
4 446 30.3%

Gender
Categorical

Distinct count 2
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0

Male 882
Female 588

Value Count Frequency (%)
Male 882 60.0%
Female 588 40.0%

HourlyRate
Numeric

Distinct count 71
Unique (%) 4.8%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 65.891
Minimum 30
Maximum 100
Zeros (%) 0.0%


Quantile statistics

Minimum 30
5-th percentile 33
Q1 48
Median 66
Q3 83.75
95-th percentile 97
Maximum 100
Range 70
Interquartile range 35.75

Descriptive statistics

Standard deviation 20.329


Coef of variation 0.30853
Kurtosis -1.1964
Mean 65.891
MAD 17.649
Skewness -0.032311
Sum 96860
Variance 413.29
Memory size 11.6 KiB

Value Count Frequency (%)
66 29 2.0%
42 28 1.9%
98 28 1.9%
48 28 1.9%
84 28 1.9%
79 27 1.8%
96 27 1.8%
57 27 1.8%
52 26 1.8%
87 26 1.8%
Other values (61) 1196 81.4%

Minimum 5 values

Value Count Frequency (%)
30 19 1.3%
31 15 1.0%
32 24 1.6%
33 19 1.3%
34 12 0.8%

Maximum 5 values

Value Count Frequency (%)
96 27 1.8%
97 21 1.4%
98 28 1.9%
99 20 1.4%
100 19 1.3%

JobInvolvement
Numeric

Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 2.7299
Minimum 1
Maximum 4
Zeros (%) 0.0%


Quantile statistics

Minimum 1
5-th percentile 1
Q1 2
Median 3
Q3 3
95-th percentile 4
Maximum 4
Range 3
Interquartile range 1

Descriptive statistics

Standard deviation 0.71156


Coef of variation 0.26065
Kurtosis 0.271
Mean 2.7299
MAD 0.56777
Skewness -0.49842
Sum 4013
Variance 0.50632
Memory size 11.6 KiB

Value Count Frequency (%)
3 868 59.0%
2 375 25.5%
4 144 9.8%
1 83 5.6%

Minimum 5 values

Value Count Frequency (%)
1 83 5.6%
2 375 25.5%
3 868 59.0%
4 144 9.8%

Maximum 5 values

Value Count Frequency (%)
1 83 5.6%
2 375 25.5%
3 868 59.0%
4 144 9.8%
JobLevel
Numeric

Distinct count 5
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 2.0639
Minimum 1
Maximum 5
Zeros (%) 0.0%


Quantile statistics

Minimum 1
5-th percentile 1
Q1 1
Median 2
Q3 3
95-th percentile 4
Maximum 5
Range 4
Interquartile range 2

Descriptive statistics

Standard deviation 1.1069


Coef of variation 0.53632
Kurtosis 0.39915
Mean 2.0639
MAD 0.83248
Skewness 1.0254
Sum 3034
Variance 1.2253
Memory size 11.6 KiB

Value Count Frequency (%)
1 543 36.9%
2 534 36.3%
3 218 14.8%
4 106 7.2%
5 69 4.7%
Minimum 5 values

Value Count Frequency (%)
1 543 36.9%
2 534 36.3%
3 218 14.8%
4 106 7.2%
5 69 4.7%

Maximum 5 values

Value Count Frequency (%)
1 543 36.9%
2 534 36.3%
3 218 14.8%
4 106 7.2%
5 69 4.7%

JobRole
Categorical

Distinct count 9
Unique (%) 0.6%
Missing (%) 0.0%
Missing (n) 0

Sales Executive 326


Research Scientist 292
Laboratory Technician 259
Other values (6) 593

Value Count Frequency (%)
Sales Executive 326 22.2%
Research Scientist 292 19.9%
Laboratory Technician 259 17.6%
Manufacturing Director 145 9.9%
Healthcare Representative 131 8.9%
Manager 102 6.9%
Sales Representative 83 5.6%
Research Director 80 5.4%
Human Resources 52 3.5%

JobSatisfaction
Numeric

Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 2.7286
Minimum 1
Maximum 4
Zeros (%) 0.0%


Quantile statistics

Minimum 1
5-th percentile 1
Q1 2
Median 3
Q3 4
95-th percentile 4
Maximum 4
Range 3
Interquartile range 2

Descriptive statistics

Standard deviation 1.1028


Coef of variation 0.40418
Kurtosis -1.2222
Mean 2.7286
MAD 0.95722
Skewness -0.32967
Sum 4011
Variance 1.2163
Memory size 11.6 KiB

Value Count Frequency (%)
4 459 31.2%
3 442 30.1%
1 289 19.7%
2 280 19.0%

Minimum 5 values

Value Count Frequency (%)
1 289 19.7%
2 280 19.0%
3 442 30.1%
4 459 31.2%

Maximum 5 values

Value Count Frequency (%)
1 289 19.7%
2 280 19.0%
3 442 30.1%
4 459 31.2%

MaritalStatus
Categorical

Distinct count 3
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
Married 673
Single 470
Divorced 327

Value Count Frequency (%)
Married 673 45.8%
Single 470 32.0%
Divorced 327 22.2%

MonthlyIncome
Highly correlated

This variable is highly correlated with JobLevel and should be ignored for analysis

Correlation 0.9503

MonthlyRate
Numeric

Distinct count 1427


Unique (%) 97.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 14313
Minimum 2094
Maximum 26999
Zeros (%) 0.0%


Quantile statistics

Minimum 2094
5-th percentile 3384.6
Q1 8047
Median 14236
Q3 20462
95-th percentile 25432
Maximum 26999
Range 24905
Interquartile range 12414

Descriptive statistics

Standard deviation 7117.8


Coef of variation 0.49729
Kurtosis -1.215
Mean 14313
MAD 6188.1
Skewness 0.018578
Sum 21040262
Variance 50663000
Memory size 11.6 KiB

Value Count Frequency (%)
4223 3 0.2%
9150 3 0.2%
6670 2 0.1%
7324 2 0.1%
4658 2 0.1%
21534 2 0.1%
16154 2 0.1%
13008 2 0.1%
12355 2 0.1%
6069 2 0.1%
Other values (1417) 1448 98.5%

Minimum 5 values

Value Count Frequency (%)
2094 1 0.1%
2097 1 0.1%
2104 1 0.1%
2112 1 0.1%
2122 1 0.1%

Maximum 5 values

Value Count Frequency (%)
26956 1 0.1%
26959 1 0.1%
26968 1 0.1%
26997 1 0.1%
26999 1 0.1%

NumCompaniesWorked
Numeric

Distinct count 10
Unique (%) 0.7%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 2.6932
Minimum 0
Maximum 9
Zeros (%) 13.4%


Quantile statistics

Minimum 0
5-th percentile 0
Q1 1
Median 2
Q3 4
95-th percentile 8
Maximum 9
Range 9
Interquartile range 3

Descriptive statistics

Standard deviation 2.498


Coef of variation 0.92753
Kurtosis 0.010214
Mean 2.6932
MAD 2.0598
Skewness 1.0265
Sum 3959
Variance 6.24
Memory size 11.6 KiB

Value Count Frequency (%)
1 521 35.4%
0 197 13.4%
3 159 10.8%
2 146 9.9%
4 139 9.5%
7 74 5.0%
6 70 4.8%
5 63 4.3%
9 52 3.5%
8 49 3.3%

Minimum 5 values

Value Count Frequency (%)
0 197 13.4%
1 521 35.4%
2 146 9.9%
3 159 10.8%
4 139 9.5%

Maximum 5 values

Value Count Frequency (%)
5 63 4.3%
6 70 4.8%
7 74 5.0%
8 49 3.3%
9 52 3.5%

Over18
Constant

This variable is constant and should be ignored for analysis

Constant value Y

OverTime
Categorical

Distinct count 2
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0

No 1054
Yes 416

Value Count Frequency (%)
No 1054 71.7%
Yes 416 28.3%

PercentSalaryHike
Numeric

Distinct count 15
Unique (%) 1.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 15.21
Minimum 11
Maximum 25
Zeros (%) 0.0%


Quantile statistics

Minimum 11
5-th percentile 11
Q1 12
Median 14
Q3 18
95-th percentile 22
Maximum 25
Range 14
Interquartile range 6

Descriptive statistics

Standard deviation 3.6599


Coef of variation 0.24063
Kurtosis -0.3006
Mean 15.21
MAD 3.0552
Skewness 0.82113
Sum 22358
Variance 13.395
Memory size 11.6 KiB

Value Count Frequency (%)
11 210 14.3%
13 209 14.2%
14 201 13.7%
12 198 13.5%
15 101 6.9%
18 89 6.1%
17 82 5.6%
16 78 5.3%
19 76 5.2%
22 56 3.8%
Other values (5) 170 11.6%

Minimum 5 values

Value Count Frequency (%)
11 210 14.3%
12 198 13.5%
13 209 14.2%
14 201 13.7%
15 101 6.9%

Maximum 5 values

Value Count Frequency (%)
21 48 3.3%
22 56 3.8%
23 28 1.9%
24 21 1.4%
25 18 1.2%

PerformanceRating
Boolean

Distinct count 2
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0

Mean 3.1537

3 1244
4 226

Value Count Frequency (%)
3 1244 84.6%
4 226 15.4%

RelationshipSatisfaction
Numeric

Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 2.7122
Minimum 1
Maximum 4
Zeros (%) 0.0%


Quantile statistics

Minimum 1
5-th percentile 1
Q1 2
Median 3
Q3 4
95-th percentile 4
Maximum 4
Range 3
Interquartile range 2

Descriptive statistics

Standard deviation 1.0812


Coef of variation 0.39864
Kurtosis -1.1848
Mean 2.7122
MAD 0.93658
Skewness -0.30283
Sum 3987
Variance 1.169
Memory size 11.6 KiB

Value Count Frequency (%)
3 459 31.2%
4 432 29.4%
2 303 20.6%
1 276 18.8%

Minimum 5 values

Value Count Frequency (%)
1 276 18.8%
2 303 20.6%
3 459 31.2%
4 432 29.4%

Maximum 5 values

Value Count Frequency (%)
1 276 18.8%
2 303 20.6%
3 459 31.2%
4 432 29.4%

StandardHours
Constant

This variable is constant and should be ignored for analysis

Constant value 80

StockOptionLevel
Numeric

Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 0.79388
Minimum 0
Maximum 3
Zeros (%) 42.9%


Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 1
Q3 1
95-th percentile 3
Maximum 3
Range 3
Interquartile range 1

Descriptive statistics

Standard deviation 0.85208


Coef of variation 1.0733
Kurtosis 0.36463
Mean 0.79388
MAD 0.68155
Skewness 0.96898
Sum 1167
Variance 0.72603
Memory size 11.6 KiB

Value Count Frequency (%)
0 631 42.9%
1 596 40.5%
2 158 10.7%
3 85 5.8%

Minimum 5 values

Value Count Frequency (%)
0 631 42.9%
1 596 40.5%
2 158 10.7%
3 85 5.8%

Maximum 5 values

Value Count Frequency (%)
0 631 42.9%
1 596 40.5%
2 158 10.7%
3 85 5.8%

TotalWorkingYears
Numeric
Distinct count 40
Unique (%) 2.7%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 11.28
Minimum 0
Maximum 40
Zeros (%) 0.7%


Quantile statistics

Minimum 0
5-th percentile 1
Q1 6
Median 10
Q3 15
95-th percentile 28
Maximum 40
Range 40
Interquartile range 9

Descriptive statistics

Standard deviation 7.7808


Coef of variation 0.68981
Kurtosis 0.91827
Mean 11.28
MAD 6.0342
Skewness 1.1172
Sum 16581
Variance 60.541
Memory size 11.6 KiB

Value Count Frequency (%)
10 202 13.7%
6 125 8.5%
8 103 7.0%
9 96 6.5%
5 88 6.0%
1 81 5.5%
7 81 5.5%
4 63 4.3%
12 48 3.3%
3 42 2.9%
Other values (30) 541 36.8%
Minimum 5 values

Value Count Frequency (%)
0 11 0.7%
1 81 5.5%
2 31 2.1%
3 42 2.9%
4 63 4.3%

Maximum 5 values

Value Count Frequency (%)
35 3 0.2%
36 6 0.4%
37 4 0.3%
38 1 0.1%
40 2 0.1%

TrainingTimesLastYear
Numeric

Distinct count 7
Unique (%) 0.5%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 2.7993
Minimum 0
Maximum 6
Zeros (%) 3.7%


Quantile statistics

Minimum 0
5-th percentile 1
Q1 2
Median 3
Q3 3
95-th percentile 5
Maximum 6
Range 6
Interquartile range 1

Descriptive statistics

Standard deviation 1.2893


Coef of variation 0.46057
Kurtosis 0.49499
Mean 2.7993
MAD 0.97434
Skewness 0.55312
Sum 4115
Variance 1.6622
Memory size 11.6 KiB

Value Count Frequency (%)
2 547 37.2%
3 491 33.4%
4 123 8.4%
5 119 8.1%
1 71 4.8%
6 65 4.4%
0 54 3.7%

Minimum 5 values

ValueCountFrequency (%)
0 54 3.7%
1 71 4.8%
2 547 37.2%
3 491 33.4%
4 123 8.4%

Maximum 5 values

ValueCountFrequency (%)
2 547 37.2%
3 491 33.4%
4 123 8.4%
5 119 8.1%
6 65 4.4%

WorkLifeBalance
Numeric

Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 2.7612
Minimum 1
Maximum 4
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 1
Q1 2
Median 3
Q3 3
95-th percentile 4
Maximum 4
Range 3
Interquartile range 1

Descriptive statistics

Standard deviation 0.70648


Coef of variation 0.25586
Kurtosis 0.41946
Mean 2.7612
MAD 0.54797
Skewness -0.55248
Sum 4059
Variance 0.49911
Memory size 11.6 KiB

ValueCountFrequency (%)
3 893 60.7%
2 344 23.4%
4 153 10.4%
1 80 5.4%

Minimum 5 values

ValueCountFrequency (%)
1 80 5.4%
2 344 23.4%
3 893 60.7%
4 153 10.4%

Maximum 5 values

ValueCountFrequency (%)
1 80 5.4%
2 344 23.4%
3 893 60.7%
4 153 10.4%
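The descriptive statistics in each profile follow directly from the value counts. As a minimal sketch (assuming NumPy is available; the variable names are illustrative), the WorkLifeBalance frequency table above can be expanded back into 1470 observations to recompute the reported mean, variance, standard deviation, coefficient of variation and MAD:

```python
import numpy as np

# Value counts copied from the WorkLifeBalance profile above.
values = np.array([1, 2, 3, 4])
counts = np.array([80, 344, 893, 153])

# Expand the frequency table back into one observation per employee.
x = np.repeat(values, counts)

mean = x.mean()
variance = x.var(ddof=1)            # sample variance (divides by n - 1)
std = np.sqrt(variance)
coef_of_variation = std / mean      # standard deviation relative to the mean
mad = np.abs(x - mean).mean()       # mean absolute deviation around the mean

print(f"n={x.size}, sum={x.sum()}, mean={mean:.4f}")
print(f"variance={variance:.5f}, std={std:.5f}")
print(f"coef of variation={coef_of_variation:.5f}, MAD={mad:.5f}")
```

The agreement with the profile suggests that the report's "MAD" is the mean absolute deviation around the mean rather than the median absolute deviation.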

YearsAtCompany
Numeric

Distinct count 37
Unique (%) 2.5%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 7.0082
Minimum 0
Maximum 40
Zeros (%) 3.0%


Quantile statistics

Minimum 0
5-th percentile 1
Q1 3
Median 5
Q3 9
95-th percentile 20
Maximum 40
Range 40
Interquartile range 6

Descriptive statistics

Standard deviation 6.1265


Coef of variation 0.8742
Kurtosis 3.9355
Mean 7.0082
MAD 4.4717
Skewness 1.7645
Sum 10302
Variance 37.534
Memory size 11.6 KiB

ValueCountFrequency (%)
5 196 13.3%
1 171 11.6%
3 128 8.7%
2 127 8.6%
10 120 8.2%
4 110 7.5%
7 90 6.1%
9 82 5.6%
8 80 5.4%
6 76 5.2%
Other values (27) 290 19.7%

Minimum 5 values
ValueCountFrequency (%)
0 44 3.0%
1 171 11.6%
2 127 8.6%
3 128 8.7%
4 110 7.5%

Maximum 5 values

ValueCountFrequency (%)
33 5 0.3%
34 1 0.1%
36 2 0.1%
37 1 0.1%
40 1 0.1%

YearsInCurrentRole
Numeric

Distinct count 19
Unique (%) 1.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 4.2293
Minimum 0
Maximum 18
Zeros (%) 16.6%


Quantile statistics

Minimum 0
5-th percentile 0
Q1 2
Median 3
Q3 7
95-th percentile 11
Maximum 18
Range 18
Interquartile range 5

Descriptive statistics

Standard deviation 3.6231


Coef of variation 0.85669
Kurtosis 0.47742
Mean 4.2293
MAD 3.0409
Skewness 0.91736
Sum 6217
Variance 13.127
Memory size 11.6 KiB
ValueCountFrequency (%)
2 372 25.3%
0 244 16.6%
7 222 15.1%
3 135 9.2%
4 104 7.1%
8 89 6.1%
9 67 4.6%
1 57 3.9%
6 37 2.5%
5 36 2.4%
Other values (9) 107 7.3%

Minimum 5 values

ValueCountFrequency (%)
0 244 16.6%
1 57 3.9%
2 372 25.3%
3 135 9.2%
4 104 7.1%

Maximum 5 values

ValueCountFrequency (%)
14 11 0.7%
15 8 0.5%
16 7 0.5%
17 4 0.3%
18 2 0.1%

YearsSinceLastPromotion
Numeric

Distinct count 16
Unique (%) 1.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 2.1878
Minimum 0
Maximum 15
Zeros (%) 39.5%


Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 1
Q3 3
95-th percentile 9
Maximum 15
Range 15
Interquartile range 3

Descriptive statistics

Standard deviation 3.2224


Coef of variation 1.4729
Kurtosis 3.6127
Mean 2.1878
MAD 2.3469
Skewness 1.9843
Sum 3216
Variance 10.384
Memory size 11.6 KiB

ValueCountFrequency (%)
0 581 39.5%
1 357 24.3%
2 159 10.8%
7 76 5.2%
4 61 4.1%
3 52 3.5%
5 45 3.1%
6 32 2.2%
11 24 1.6%
8 18 1.2%
Other values (6) 65 4.4%

Minimum 5 values

ValueCountFrequency (%)
0 581 39.5%
1 357 24.3%
2 159 10.8%
3 52 3.5%
4 61 4.1%

Maximum 5 values

ValueCountFrequency (%)
11 24 1.6%
12 10 0.7%
13 10 0.7%
14 9 0.6%
15 13 0.9%

YearsWithCurrManager
Numeric

Distinct count 18
Unique (%) 1.2%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0

Mean 4.1231
Minimum 0
Maximum 17
Zeros (%) 17.9%


Quantile statistics

Minimum 0
5-th percentile 0
Q1 2
Median 3
Q3 7
95-th percentile 10
Maximum 17
Range 17
Interquartile range 5

Descriptive statistics

Standard deviation 3.5681


Coef of variation 0.8654
Kurtosis 0.17106
Mean 4.1231
MAD 3.0254
Skewness 0.83345
Sum 6061
Variance 12.732
Memory size 11.6 KiB
ValueCountFrequency (%)
2 344 23.4%
0 263 17.9%
7 216 14.7%
3 142 9.7%
8 107 7.3%
4 98 6.7%
1 76 5.2%
9 64 4.4%
5 31 2.1%
6 29 2.0%
Other values (8) 100 6.8%

Minimum 5 values

ValueCountFrequency (%)
0 263 17.9%
1 76 5.2%
2 344 23.4%
3 142 9.7%
4 98 6.7%

Maximum 5 values

ValueCountFrequency (%)
13 14 1.0%
14 5 0.3%
15 5 0.3%
16 2 0.1%
17 7 0.5%

Correlations
Sample

Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField Em

0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences

1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences

2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other

3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences

4 27 No Travel_Rarely 591 Research & Development 2 1 Medical

In [10]: # drop the unnecessary columns


df.drop(['EmployeeNumber','Over18','StandardHours','EmployeeCount'],axis=1,inplace=True)

In [11]: df['Attrition'] = df['Attrition'].apply(lambda x:1 if x == "Yes" else 0 )


df['OverTime'] = df['OverTime'].apply(lambda x:1 if x =="Yes" else 0 )

In [12]: attrition = df[df['Attrition'] == 1]


no_attrition = df[df['Attrition']==0]

Visualization of Categorical Features


In [13]: def categorical_column_viz(col_name):
    """Plot the value counts of a column and its breakdown by Attrition."""
    f, ax = plt.subplots(1, 2, figsize=(10, 6))

    # Count plot (left panel)
    df[col_name].value_counts().plot.bar(cmap='Set2', ax=ax[0])
    ax[0].set_title(f'Number of Employees by {col_name}')
    ax[0].set_ylabel('Count')
    ax[0].set_xlabel(f'{col_name}')

    # Attrition count per category (right panel)
    sns.countplot(x=col_name, hue='Attrition', data=df, ax=ax[1], palette='Set2')
    ax[1].set_title(f'Attrition by {col_name}')
    ax[1].set_xlabel(f'{col_name}')
    ax[1].set_ylabel('Count')

In [14]: categorical_column_viz('BusinessTravel')

In [15]: categorical_column_viz('Department')
In [16]: categorical_column_viz('EducationField')

In [17]: categorical_column_viz('Education')
In [18]: categorical_column_viz('EnvironmentSatisfaction')

In [19]: categorical_column_viz('Gender')
In [20]: categorical_column_viz('JobRole')

In [21]: categorical_column_viz('JobInvolvement')
In [22]: categorical_column_viz('MaritalStatus')

In [23]: categorical_column_viz('NumCompaniesWorked')
In [24]: categorical_column_viz('OverTime')

In [25]: categorical_column_viz('StockOptionLevel')
In [26]: categorical_column_viz('TrainingTimesLastYear')

In [27]: categorical_column_viz('YearsWithCurrManager')
Visualization of Numerical Features
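In [28]: def numerical_column_viz(col_name):
    """Compare the distribution of a numerical column for leavers and stayers."""
    f, ax = plt.subplots(1, 2, figsize=(18, 6))

    # Density plots (fill=True replaces the deprecated shade=True in seaborn >= 0.11)
    sns.kdeplot(attrition[col_name], label='Employee who left', ax=ax[0], fill=True, color='palegreen')
    sns.kdeplot(no_attrition[col_name], label='Employee who stayed', ax=ax[0], fill=True, color='salmon')
    ax[0].legend()

    # Box plot of the column split by Attrition
    sns.boxplot(y=col_name, x='Attrition', data=df, palette='Set3', ax=ax[1])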

In [29]: numerical_column_viz("Age")

In [30]: numerical_column_viz("Age")
In [31]: numerical_column_viz("DailyRate")

In [32]: numerical_column_viz("DistanceFromHome")

In [33]: numerical_column_viz("MonthlyIncome")

In [34]: numerical_column_viz("HourlyRate")
In [35]: numerical_column_viz("JobInvolvement")

In [36]: numerical_column_viz("PercentSalaryHike")

In [37]: numerical_column_viz("Age")
In [38]: numerical_column_viz("DailyRate")

In [39]: numerical_column_viz("TotalWorkingYears")

In [40]: numerical_column_viz("YearsAtCompany")

In [41]: numerical_column_viz("YearsInCurrentRole")
In [42]: numerical_column_viz("YearsSinceLastPromotion")

In [43]: numerical_column_viz("YearsWithCurrManager")

Visualization of Categorical vs Numericals Features


In [44]: def categorical_numerical(numerical_col, categorical_col1, categorical_col2):
    """Swarm plots of a numerical column against two categorical columns, split by Attrition."""
    f, ax = plt.subplots(1, 2, figsize=(20, 8))

    sns.swarmplot(x=categorical_col1, y=numerical_col, hue='Attrition', data=df,
                  dodge=True, ax=ax[0], palette='Set2')
    ax[0].set_title(f'{numerical_col} vs {categorical_col1} separated by Attrition')
    ax[0].tick_params(axis='x', rotation=90)

    sns.swarmplot(x=categorical_col2, y=numerical_col, hue='Attrition', data=df,
                  dodge=True, ax=ax[1], palette='Set2')
    ax[1].set_title(f'{numerical_col} vs {categorical_col2} separated by Attrition')
    ax[1].tick_params(axis='x', rotation=90)

In [45]: categorical_numerical('Age','Gender','MaritalStatus')
In [46]: categorical_numerical('Age','JobRole','EducationField')

In [47]: categorical_numerical('MonthlyIncome','Gender','MaritalStatus')

Feature Engineering
In [48]: # 'EnvironmentSatisfaction', 'JobInvolvement', 'JobSatisfaction', 'RelationshipSatisfaction', 'WorkLifeBalance' can be cl

df['Total_Satisfaction'] = (df['EnvironmentSatisfaction'] +
df['JobInvolvement'] +
df['JobSatisfaction'] +
df['RelationshipSatisfaction'] +
df['WorkLifeBalance']) /5

# Drop Columns
df.drop(['EnvironmentSatisfaction','JobInvolvement','JobSatisfaction','RelationshipSatisfaction','WorkLifeBalance'], axis=

In [49]: categorical_column_viz('Total_Satisfaction')

In [50]: df.Total_Satisfaction.describe()

Out[50]: count 1470.000000


mean 2.730748
std 0.428551
min 1.200000
25% 2.400000
50% 2.800000
75% 3.000000
max 4.000000
Name: Total_Satisfaction, dtype: float64

In [51]: # Convert Total satisfaction into boolean


# median = 2.8
# x = 1 if x >= 2.8

df['Total_Satisfaction_bool'] = df['Total_Satisfaction'].apply(lambda x:1 if x>=2.8 else 0 )


df.drop('Total_Satisfaction', axis=1, inplace=True)
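The 2.8 cut-off used in In [51] is the median reported in Out[50]. As a minimal sketch, assuming a small hypothetical Series in place of the report's `Total_Satisfaction` column, the threshold could be derived from the data instead of being hard-coded, and the flag computed with a vectorised comparison:

```python
import pandas as pd

# Hypothetical toy scores; in the notebook this would be df['Total_Satisfaction'].
scores = pd.Series([1.2, 2.4, 2.8, 3.0, 4.0], name="Total_Satisfaction")

# Derive the cut-off from the data rather than typing it in by hand.
threshold = scores.median()

# Vectorised equivalent of .apply(lambda x: 1 if x >= threshold else 0)
satisfied = (scores >= threshold).astype(int)

print(threshold)
print(satisfied.tolist())
```

Computing the median this way keeps the flag consistent if the underlying data changes.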

In [52]: # It can be observed that the rate of attrition of employees below age of 35 is high

df['Age_bool'] = df['Age'].apply(lambda x:1 if x<35 else 0)


df.drop('Age', axis=1, inplace=True)

In [53]: # Employees appear more likely to leave when DailyRate is below 800

df['DailyRate_bool'] = df['DailyRate'].apply(lambda x:1 if x<800 else 0)


df.drop('DailyRate', axis=1, inplace=True)

In [54]: # Employees working at R&D Department have higher attrition rate

df['Department_bool'] = df['Department'].apply(lambda x:1 if x=='Research & Development' else 0)


df.drop('Department', axis=1, inplace=True)

In [55]: # Rate of attrition of employees is high if DistanceFromHome > 10


df['DistanceFromHome_bool'] = df['DistanceFromHome'].apply(lambda x:1 if x>10 else 0)
df.drop('DistanceFromHome', axis=1, inplace=True)

In [56]: # Employees are more likely to leave if they work as a Laboratory Technician

df['JobRole_bool'] = df['JobRole'].apply(lambda x:1 if x=='Laboratory Technician' else 0)


df.drop('JobRole', axis=1, inplace=True)

In [57]: # Employees are more likely to leave if their HourlyRate is below 65

df['HourlyRate_bool'] = df['HourlyRate'].apply(lambda x:1 if x<65 else 0)


df.drop('HourlyRate', axis=1, inplace=True)

In [58]: # Employees are more likely to leave if their MonthlyIncome is below 4000

df['MonthlyIncome_bool'] = df['MonthlyIncome'].apply(lambda x:1 if x<4000 else 0)


df.drop('MonthlyIncome', axis=1, inplace=True)

In [59]: # Rate of attrition of employees is high if NumCompaniesWorked > 3 (matches the flag below)

df['NumCompaniesWorked_bool'] = df['NumCompaniesWorked'].apply(lambda x:1 if x>3 else 0)


df.drop('NumCompaniesWorked', axis=1, inplace=True)

In [60]: # Employees are more likely to leave if their TotalWorkingYears is below 8

df['TotalWorkingYears_bool'] = df['TotalWorkingYears'].apply(lambda x:1 if x<8 else 0)


df.drop('TotalWorkingYears', axis=1, inplace=True)

In [61]: # Employees are more likely to leave if their YearsAtCompany is below 3

df['YearsAtCompany_bool'] = df['YearsAtCompany'].apply(lambda x:1 if x<3 else 0)


df.drop('YearsAtCompany', axis=1, inplace=True)

In [62]: # Employees are more likely to leave if their YearsInCurrentRole is below 3

df['YearsInCurrentRole_bool'] = df['YearsInCurrentRole'].apply(lambda x:1 if x<3 else 0)


df.drop('YearsInCurrentRole', axis=1, inplace=True)

In [63]: # Employees are more likely to leave if their YearsSinceLastPromotion is below 1

df['YearsSinceLastPromotion_bool'] = df['YearsSinceLastPromotion'].apply(lambda x:1 if x<1 else 0)


df.drop('YearsSinceLastPromotion', axis=1, inplace=True)

In [64]: # Employees are more likely to leave if their YearsWithCurrManager is below 1

df['YearsWithCurrManager_bool'] = df['YearsWithCurrManager'].apply(lambda x:1 if x<1 else 0)


df.drop('YearsWithCurrManager', axis=1, inplace=True)
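Each threshold above is stated as an observation from the plots. One way to check such a cut-off numerically is to compare the attrition rate on either side of it. The sketch below uses a small hypothetical DataFrame (the values are made up, not taken from the report) and `np.where` as a vectorised alternative to `.apply` with a lambda:

```python
import numpy as np
import pandas as pd

# Hypothetical toy records, not the report's data.
toy = pd.DataFrame({
    "MonthlyIncome": [2500, 3200, 3900, 4500, 6000, 8000, 12000, 3000],
    "Attrition":     [1,    0,    1,    0,    0,    0,    0,     1],
})

# Vectorised equivalent of .apply(lambda x: 1 if x < 4000 else 0)
toy["MonthlyIncome_bool"] = np.where(toy["MonthlyIncome"] < 4000, 1, 0)

# Attrition rate and group size on each side of the cut-off
rates = toy.groupby("MonthlyIncome_bool")["Attrition"].agg(["mean", "size"])
print(rates)
```

A clear gap between the two group means, with a reasonable number of rows in each group, would support keeping the binary flag; a small gap suggests the original numeric column carries more information.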

In [65]: df['Gender'] = df['Gender'].apply(lambda x:1 if x=='Female' else 0)

In [66]: df.drop('MonthlyRate', axis=1, inplace=True)


df.drop('PercentSalaryHike', axis=1, inplace=True)

In [67]: convert_category = ['BusinessTravel','Education','EducationField','MaritalStatus','StockOptionLevel','OverTime','Gender','


for col in convert_category:
    df[col] = df[col].astype('category')

In [69]: # Separate the categorical and numerical data

X_categorical = df.select_dtypes(include=['category'])
# Drop the target while selecting, instead of in place on a slice,
# which avoids pandas' SettingWithCopyWarning
X_numerical = df.select_dtypes(include=['int64']).drop('Attrition', axis=1)

In [70]: y = df['Attrition']

In [68]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 25 columns):
Attrition 1470 non-null int64
BusinessTravel 1470 non-null category
Education 1470 non-null category
EducationField 1470 non-null category
Gender 1470 non-null category
JobLevel 1470 non-null int64
MaritalStatus 1470 non-null category
OverTime 1470 non-null category
PerformanceRating 1470 non-null int64
StockOptionLevel 1470 non-null category
TrainingTimesLastYear 1470 non-null category
Total_Satisfaction_bool 1470 non-null int64
Age_bool 1470 non-null int64
DailyRate_bool 1470 non-null int64
Department_bool 1470 non-null int64
DistanceFromHome_bool 1470 non-null int64
JobRole_bool 1470 non-null int64
HourlyRate_bool 1470 non-null int64
MonthlyIncome_bool 1470 non-null int64
NumCompaniesWorked_bool 1470 non-null int64
TotalWorkingYears_bool 1470 non-null int64
YearsAtCompany_bool 1470 non-null int64
YearsInCurrentRole_bool 1470 non-null int64
YearsSinceLastPromotion_bool 1470 non-null int64
YearsWithCurrManager_bool 1470 non-null int64
dtypes: category(8), int64(17)
memory usage: 208.2 KB

In [72]: #concat the categorical and numerical values

X_all = pd.concat([X_categorical, X_numerical], axis=1)


X_all.head()

Out[72]: 0 1 2 3 4 5 6 7 8 9 ... DistanceFromHome_bool JobRole_bool HourlyRate_bool MonthlyIncome_bool N

0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0 0 0 0

1 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0 0 1 0

2 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0 1 0 1

3 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0 0 1 1

4 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 1 1 1

5 rows × 48 columns

In [71]: # One-hot encode the categorical features

onehotencoder = OneHotEncoder()

X_categorical = onehotencoder.fit_transform(X_categorical).toarray()
X_categorical = pd.DataFrame(X_categorical)
X_categorical
Out[71]: 0 1 2 3 4 5 6 7 8 9 ... 22 23 24 25 26 27 28 29 30 31

0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0

1 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

2 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

3 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

4 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

5 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

6 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

7 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

8 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

9 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

10 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

11 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

12 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0

13 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

14 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

15 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0

16 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

17 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

18 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

19 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

20 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

21 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

22 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

23 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

24 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

25 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

26 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

27 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

28 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

29 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

1440 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

1441 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

1442 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

1443 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

1444 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

1445 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

1446 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

1447 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

1448 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

1449 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

1450 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

1451 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0

1452 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

1453 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

1454 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

1455 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

1456 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

1457 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

1458 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

1459 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

1460 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

1461 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

1462 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

1463 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

1464 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

1465 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

1466 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

1467 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0

1468 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

1469 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

1470 rows × 32 columns

In [73]: X_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 48 columns):
0 1470 non-null float64
1 1470 non-null float64
2 1470 non-null float64
3 1470 non-null float64
4 1470 non-null float64
5 1470 non-null float64
6 1470 non-null float64
7 1470 non-null float64
8 1470 non-null float64
9 1470 non-null float64
10 1470 non-null float64
11 1470 non-null float64
12 1470 non-null float64
13 1470 non-null float64
14 1470 non-null float64
15 1470 non-null float64
16 1470 non-null float64
17 1470 non-null float64
18 1470 non-null float64
19 1470 non-null float64
20 1470 non-null float64
21 1470 non-null float64
22 1470 non-null float64
23 1470 non-null float64
24 1470 non-null float64
25 1470 non-null float64
26 1470 non-null float64
27 1470 non-null float64
28 1470 non-null float64
29 1470 non-null float64
30 1470 non-null float64
31 1470 non-null float64
JobLevel 1470 non-null int64
PerformanceRating 1470 non-null int64
Total_Satisfaction_bool 1470 non-null int64
Age_bool 1470 non-null int64
DailyRate_bool 1470 non-null int64
Department_bool 1470 non-null int64
DistanceFromHome_bool 1470 non-null int64
JobRole_bool 1470 non-null int64
HourlyRate_bool 1470 non-null int64
MonthlyIncome_bool 1470 non-null int64
NumCompaniesWorked_bool 1470 non-null int64
TotalWorkingYears_bool 1470 non-null int64
YearsAtCompany_bool 1470 non-null int64
YearsInCurrentRole_bool 1470 non-null int64
YearsSinceLastPromotion_bool 1470 non-null int64
YearsWithCurrManager_bool 1470 non-null int64
dtypes: float64(32), int64(16)
memory usage: 551.4 KB
Split Data
In [74]: X_train,X_test, y_train, y_test = train_test_split(X_all,y, test_size=0.30)

In [75]: print(f"Train data shape: {X_train.shape}, Test Data Shape {X_test.shape}")

Train data shape: (1029, 48), Test Data Shape (441, 48)
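The split in In [74] sets neither `random_state` nor `stratify`, so the partition changes on every run and the share of leavers in the test set can drift. A minimal sketch of a reproducible, stratified split, using hypothetical imbalanced labels rather than the report's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 160 leavers out of 1000 (16%)
y_toy = np.array([1] * 160 + [0] * 840)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify keeps the class ratio equal in both parts; random_state fixes the shuffle
Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.30, stratify=y_toy, random_state=42
)

print(f"train positives: {ytr.mean():.3f}, test positives: {yte.mean():.3f}")
```

With a minority class as small as the attrition class here, stratification makes test-set metrics such as recall less sensitive to the luck of the draw.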

In [76]: X_train.head()

Out[76]: 0 1 2 3 4 5 6 7 8 9 ... DistanceFromHome_bool JobRole_bool HourlyRate_bool MonthlyIncome_boo

772 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0 0 1 1

1403 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 1 0 0 0

9 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 1 0 0 0

662 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0 0 1 1

1387 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 0 0 0 0

5 rows × 48 columns

Train Data
In [77]: # Function that runs the requested algorithm and returns the accuracy metrics
def fit_ml_algo(algo, X_train, y_train, cv):

    # One pass: fit on the full training set and score on the same data
    model = algo.fit(X_train, y_train)
    acc = round(model.score(X_train, y_train) * 100, 2)

    # Cross-validation: out-of-fold predictions for every training row
    train_pred = model_selection.cross_val_predict(algo, X_train, y_train, cv=cv, n_jobs=-1)

    # Cross-validation accuracy metric
    acc_cv = round(metrics.accuracy_score(y_train, train_pred) * 100, 2)

    return train_pred, acc, acc_cv

Logistic Regression
In [78]: # Logistic Regression
start_time = time.time()
train_pred_log, acc_log, acc_cv_log = fit_ml_algo(LogisticRegression(), X_train,y_train, 10)
log_time = (time.time() - start_time)
print("Accuracy: %s" % acc_log)
print("Accuracy CV 10-Fold: %s" % acc_cv_log)
print("Running Time: %s" % datetime.timedelta(seconds=log_time))

Accuracy: 89.89
Accuracy CV 10-Fold: 87.66
Running Time: 0:00:02.534987

Support Vector Machine


In [79]: # SVC
start_time = time.time()
train_pred_svc, acc_svc, acc_cv_svc = fit_ml_algo(SVC(),X_train,y_train,10)
svc_time = (time.time() - start_time)
print("Accuracy: %s" % acc_svc)
print("Accuracy CV 10-Fold: %s" % acc_cv_svc)
print("Running Time: %s" % datetime.timedelta(seconds=svc_time))

Accuracy: 87.76
Accuracy CV 10-Fold: 86.1
Running Time: 0:00:00.207994

Linear Support Vector Machines


In [80]: # Linear SVC
start_time = time.time()
train_pred_linear_svc, acc_linear_svc, acc_cv_linear_svc = fit_ml_algo(LinearSVC(),X_train, y_train,10)
linear_svc_time = (time.time() - start_time)
print("Accuracy: %s" % acc_linear_svc)
print("Accuracy CV 10-Fold: %s" % acc_cv_linear_svc)
print("Running Time: %s" % datetime.timedelta(seconds=linear_svc_time))
Accuracy: 89.5
Accuracy CV 10-Fold: 87.27
Running Time: 0:00:00.269995

K Nearest Neighbour
In [81]: # K Nearest Neighbour
start_time = time.time()
train_pred_knn, acc_knn, acc_cv_knn = fit_ml_algo(KNeighborsClassifier(n_neighbors = 3),X_train,y_train,10)
knn_time = (time.time() - start_time)
print("Accuracy: %s" % acc_knn)
print("Accuracy CV 10-Fold: %s" % acc_cv_knn)
print("Running Time: %s" % datetime.timedelta(seconds=knn_time))

Accuracy: 89.21
Accuracy CV 10-Fold: 83.28
Running Time: 0:00:00.239998

Gaussian Naive Bayes


In [82]: # Gaussian Naive Bayes
start_time = time.time()
train_pred_gaussian, acc_gaussian, acc_cv_gaussian = fit_ml_algo(GaussianNB(),X_train,y_train,10)
gaussian_time = (time.time() - start_time)
print("Accuracy: %s" % acc_gaussian)
print("Accuracy CV 10-Fold: %s" % acc_cv_gaussian)
print("Running Time: %s" % datetime.timedelta(seconds=gaussian_time))

Accuracy: 80.17
Accuracy CV 10-Fold: 77.45
Running Time: 0:00:00.064000

Perceptron
In [83]: # Perceptron
start_time = time.time()
train_pred_perceptron, acc_perceptron, acc_cv_perceptron = fit_ml_algo(Perceptron(),X_train,y_train,10)
perceptron_time = (time.time() - start_time)
print("Accuracy: %s" % acc_perceptron)
print("Accuracy CV 10-Fold: %s" % acc_cv_perceptron)
print("Running Time: %s" % datetime.timedelta(seconds=perceptron_time))

Accuracy: 87.27
Accuracy CV 10-Fold: 82.12
Running Time: 0:00:00.073985

Stochastic Gradient Descent


In [84]: # Stochastic Gradient Descent
start_time = time.time()
train_pred_sgd, acc_sgd, acc_cv_sgd = fit_ml_algo(SGDClassifier(),X_train, y_train,10)
sgd_time = (time.time() - start_time)
print("Accuracy: %s" % acc_sgd)
print("Accuracy CV 10-Fold: %s" % acc_cv_sgd)
print("Running Time: %s" % datetime.timedelta(seconds=sgd_time))

Accuracy: 86.78
Accuracy CV 10-Fold: 83.97
Running Time: 0:00:00.096004

Decision Tree
In [85]: # Decision Tree
start_time = time.time()
train_pred_dt, acc_dt, acc_cv_dt = fit_ml_algo(DecisionTreeClassifier(),X_train, y_train,10)
dt_time = (time.time() - start_time)
print("Accuracy: %s" % acc_dt)
print("Accuracy CV 10-Fold: %s" % acc_cv_dt)
print("Running Time: %s" % datetime.timedelta(seconds=dt_time))

Accuracy: 100.0
Accuracy CV 10-Fold: 78.62
Running Time: 0:00:00.098000
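The gap between 100.0 on the training data and 78.62 under 10-fold cross-validation is typical of an unpruned decision tree, which can memorise its training rows. The sketch below reproduces the effect on synthetic data (generated with `make_classification`, not the report's dataset), so the exact numbers will differ:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, used only for illustration
X_syn, y_syn = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the same rows the tree was fitted on
train_acc = tree.fit(X_syn, y_syn).score(X_syn, y_syn)

# Mean accuracy on held-out folds
cv_acc = cross_val_score(tree, X_syn, y_syn, cv=10).mean()

print(f"training accuracy: {train_acc:.3f}, 10-fold CV accuracy: {cv_acc:.3f}")
```

For this reason the cross-validated score, not the training score, is the one to use when comparing models.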

Gradient Boosting Trees


In [86]: # Gradient Boosting Trees
start_time = time.time()
train_pred_gbt, acc_gbt, acc_cv_gbt = fit_ml_algo(GradientBoostingClassifier(),X_train, y_train,10)
gbt_time = (time.time() - start_time)
print("Accuracy: %s" % acc_gbt)
print("Accuracy CV 10-Fold: %s" % acc_cv_gbt)
print("Running Time: %s" % datetime.timedelta(seconds=gbt_time))

Accuracy: 93.0
Accuracy CV 10-Fold: 87.17
Running Time: 0:00:00.702999

Random Forest
In [87]: # Random Forest
start_time = time.time()
train_pred_rf, acc_rf, acc_cv_rf = fit_ml_algo(RandomForestClassifier(n_estimators=100),X_train, y_train,10)
rf_time = (time.time() - start_time)
print("Accuracy: %s" % acc_rf)
print("Accuracy CV 10-Fold: %s" % acc_cv_rf)
print("Running Time: %s" % datetime.timedelta(seconds=rf_time))

Accuracy: 100.0
Accuracy CV 10-Fold: 85.33
Running Time: 0:00:00.789033

CatBoost Classifier
In [88]: # Define the categorical features for the CatBoost model
# np.float was removed in NumPy 1.24; np.float64 selects the same columns
cat_features = np.where(X_train.dtypes != np.float64)[0]
cat_features

Out[88]: array([32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47],
dtype=int64)

In [89]: # pool training data and categorical feature labels together


train_pool = Pool(X_train, y_train,cat_features)

In [90]: # CatBoost
catboost_model = CatBoostClassifier(iterations=1000,custom_loss=['Accuracy'],loss_function='Logloss')

# Fit CatBoost model


catboost_model.fit(train_pool,plot=True)

# CatBoost accuracy
acc_catboost = round(catboost_model.score(X_train, y_train) * 100, 2)

990: learn: 0.0269810 test: 0.4227576 best: 0.3451640 (189) total: 8m 16s remaining: 4.51s
991: learn: 0.0269395 test: 0.4228075 best: 0.3451640 (189) total: 8m 16s remaining: 4.01s
992: learn: 0.0269058 test: 0.4228815 best: 0.3451640 (189) total: 8m 17s remaining: 3.51s
993: learn: 0.0268638 test: 0.4230957 best: 0.3451640 (189) total: 8m 18s remaining: 3.01s
994: learn: 0.0268283 test: 0.4231735 best: 0.3451640 (189) total: 8m 18s remaining: 2.5s
995: learn: 0.0267771 test: 0.4232267 best: 0.3451640 (189) total: 8m 19s remaining: 2s
996: learn: 0.0267233 test: 0.4233080 best: 0.3451640 (189) total: 8m 19s remaining: 1.5s
997: learn: 0.0266916 test: 0.4234219 best: 0.3451640 (189) total: 8m 20s remaining: 1s
998: learn: 0.0266528 test: 0.4236120 best: 0.3451640 (189) total: 8m 20s remaining: 501ms
999: learn: 0.0266080 test: 0.4235866 best: 0.3451640 (189) total: 8m 21s remaining: 0us
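In the log above, the learn loss keeps falling (to 0.0266) while the test loss ends at 0.4236, well above its best value of 0.3452 at iteration 189. That pattern indicates overfitting after roughly 190 iterations. The evaluation set that produced the `test` column is not shown in the cell, so the following is only a library-independent sketch of patience-based early stopping, with a hypothetical helper name and a made-up loss curve:

```python
def best_iteration(val_losses, patience):
    """Return (index, loss) of the best validation loss, stopping the scan
    once `patience` iterations have passed without an improvement."""
    best_idx, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss
        elif i - best_idx >= patience:
            break  # no improvement for `patience` iterations
    return best_idx, best_loss


# Hypothetical validation curve: improves, then degrades
losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.39, 0.42]
print(best_iteration(losses, patience=3))
```

Applied to the run above, stopping near iteration 189 would likely have saved most of the 8 minutes of training and kept the model with the lowest test loss.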

Training Model Results


In [93]: # Accuracy on the training set (not cross-validated)
models = pd.DataFrame({
    'Model': ['Logistic Regression', 'SVM', 'Linear SVC', 'KNN', 'Naive Bayes', 'Perceptron',
              'Stochastic Gradient Descent', 'Decision Tree', 'Gradient Boosting Trees', 'Random Forest',
              'CatBoost'],
    'Score': [
        acc_log,
        acc_svc,
        acc_linear_svc,
        acc_knn,
        acc_gaussian,
        acc_perceptron,
        acc_sgd,
        acc_dt,
        acc_gbt,
        acc_rf,
        acc_catboost
    ]})
models.sort_values(by='Score', ascending=False)

Out[93]:
      Model                          Score
 7    Decision Tree                 100.00
 9    Random Forest                 100.00
10    CatBoost                       95.82
 8    Gradient Boosting Trees        93.00
 0    Logistic Regression            89.89
 2    Linear SVC                     89.50
 3    KNN                            89.21
 1    SVM                            87.76
 5    Perceptron                     87.27
 6    Stochastic Gradient Descent    86.78
 4    Naive Bayes                    80.17

In [94]: # 10-fold cross-validation accuracy on the training set
cv_models = pd.DataFrame({
    'Model': ['Logistic Regression', 'SVM', 'Linear SVC', 'KNN', 'Naive Bayes', 'Perceptron',
              'Stochastic Gradient Descent', 'Decision Tree', 'Gradient Boosting Trees', 'Random Forest',
              'CatBoost'],
    'Score': [
        acc_cv_log,
        acc_cv_svc,
        acc_cv_linear_svc,
        acc_cv_knn,
        acc_cv_gaussian,
        acc_cv_perceptron,
        acc_cv_sgd,
        acc_cv_dt,
        acc_cv_gbt,
        acc_cv_rf,
        acc_cv_catboost
    ]})
cv_models.sort_values(by='Score', ascending=False)

Out[94]:
      Model                          Score
 0    Logistic Regression            87.66
 2    Linear SVC                     87.27
 8    Gradient Boosting Trees        87.17
10    CatBoost                       86.49
 1    SVM                            86.10
 9    Random Forest                  85.33
 6    Stochastic Gradient Descent    83.97
 3    KNN                            83.28
 5    Perceptron                     82.12
 7    Decision Tree                  78.62
 4    Naive Bayes                    77.45

Predict Data using Logistic Regression


In [95]: model = LogisticRegression().fit(X_train, y_train)

In [96]: predictions = model.predict(X_test)

In [97]: pred_df = pd.DataFrame(index=X_test.index)

In [98]: pred_df['Attrition'] = predictions


pred_df.head()

Out[98]: Attrition

71 0

464 0

294 0

1230 0

1181 0

In [99]: # Accuracy on the held-out test set


score = round(metrics.accuracy_score(y_test, predictions) * 100, 2)

In [100… print("Accuracy: %s" % score)

Accuracy: 87.76

In [101… print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.90      0.96      0.93       375
           1       0.64      0.42      0.51        66

    accuracy                           0.88       441
   macro avg       0.77      0.69      0.72       441
weighted avg       0.86      0.88      0.87       441
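The headline accuracy of 87.76% needs to be read against the class imbalance in the test set (375 stayers, 66 leavers). The short calculation below uses only the figures printed in the report above; the variable names are illustrative:

```python
# Figures copied from the classification report above
support_stay, support_leave = 375, 66
precision_leave, recall_leave = 0.64, 0.42

# F1 is the harmonic mean of precision and recall
f1_leave = 2 * precision_leave * recall_leave / (precision_leave + recall_leave)

# Accuracy of a trivial model that always predicts "stays" (class 0)
baseline_acc = support_stay / (support_stay + support_leave)

# Approximate number of actual leavers the model identified
# (approximate because the report rounds recall to two decimals)
leavers_caught = recall_leave * support_leave

print(f"F1 (class 1): {f1_leave:.2f}")
print(f"Always-stay baseline accuracy: {baseline_acc * 100:.2f}%")
print(f"Leavers caught: about {leavers_caught:.0f} of {support_leave}")
```

Because a model that never predicts attrition would already be about 85% accurate, recall and F1 for class 1 are more informative than accuracy for this task.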

In [102… # Get the logistic regression coefficients (one signed weight per feature)

importance = model.coef_[0]
# Summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))
# Plot feature importance
plt.bar(range(len(importance)), importance)
plt.show()
Feature: 0, Score: -0.83977
Feature: 1, Score: 0.78917
Feature: 2, Score: 0.05069
Feature: 3, Score: -0.02039
Feature: 4, Score: 0.11335
Feature: 5, Score: 0.02325
Feature: 6, Score: 0.01159
Feature: 7, Score: -0.12771
Feature: 8, Score: 0.20678
Feature: 9, Score: -0.28549
Feature: 10, Score: 0.12482
Feature: 11, Score: -0.28421
Feature: 12, Score: -0.39745
Feature: 13, Score: 0.63564
Feature: 14, Score: 0.13060
Feature: 15, Score: -0.13052
Feature: 16, Score: -0.29474
Feature: 17, Score: 0.11663
Feature: 18, Score: 0.17819
Feature: 19, Score: -0.92836
Feature: 20, Score: 0.92844
Feature: 21, Score: 0.70871
Feature: 22, Score: -0.35084
Feature: 23, Score: -0.43281
Feature: 24, Score: 0.07503
Feature: 25, Score: 0.89239
Feature: 26, Score: -0.12934
Feature: 27, Score: 0.04080
Feature: 28, Score: -0.24705
Feature: 29, Score: 0.27531
Feature: 30, Score: -0.46624
Feature: 31, Score: -0.36578
Feature: 32, Score: 0.05549
Feature: 33, Score: -0.27119
Feature: 34, Score: -1.16726
Feature: 35, Score: 0.83722
Feature: 36, Score: 0.26956
Feature: 37, Score: -0.91975
Feature: 38, Score: 0.81994
Feature: 39, Score: 0.84985
Feature: 40, Score: 0.27492
Feature: 41, Score: 0.83162
Feature: 42, Score: 0.80548
Feature: 43, Score: 0.36834
Feature: 44, Score: 0.42694
Feature: 45, Score: -0.01816
Feature: 46, Score: -0.43036
Feature: 47, Score: 0.92481
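
The listing above identifies features by position only, which makes it hard to read. One possible way to make it interpretable is to pair each coefficient with its column name and sort by magnitude, as sketched below on synthetic data. The feature names here are hypothetical placeholders, and in the notebook they would come from X_train.columns. A positive coefficient raises the log-odds of class 1 (attrition) and a negative one lowers it, and magnitudes are comparable across features only when the features are on similar scales.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in with hypothetical column names (assumption).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])

model = LogisticRegression(max_iter=1000).fit(X_train, y)

coef_table = pd.DataFrame({
    "feature": X_train.columns,
    "coefficient": model.coef_[0],
})
# exp(coefficient) is the multiplicative change in the odds per unit increase.
coef_table["odds_ratio"] = np.exp(coef_table["coefficient"])
# Order rows by absolute coefficient, largest first.
order = coef_table["coefficient"].abs().sort_values(ascending=False).index
coef_table = coef_table.reindex(order)

print(coef_table.to_string(index=False))
```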

Employee Attrition Prediction


Age : 18-80

BusinessTravel : Rarely, Frequently, No Travel

Daily Rate : 100-1600

Department : Research & Development, Human Resources, Sales

Distance From Home : 1-29

Education : 1-5

Education Field : Life Sciences, Medical, Marketing, Technical Degree, Human Resources, Other

Environment Satisfaction : 1-4

Gender : Male, Female

Hourly Rate : 30-100

Job Involvement : 1-4

Job Level : 1-5

Job Role : Sales Executive, Research Scientist, Laboratory Technician, Manufacturing Director, Healthcare Representative, Manager, Sales Representative, Research Director, Human Resources

Job Satisfaction : 1-4

Marital Status : Married, Single, Divorced

Monthly Income : 1000-2000 (1000-20000)

Number of Companies Worked in : 0-9

Over Time : Yes, No

Performance Rating : 1-4

Relationship Satisfaction : 1-4

Stock Option Level : 0-3

Total Working Years : 0-40

Training Times Last Year : 0-6

Work Life Balance : 1-4

Years At Company : 0-40

Years In Current Role : 0-18

Years Since Last Promotion : 0-15

Years With Curr Manager : 0-17

Predict
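
The form lists an allowed range or option set for every field, so submitted values can be checked before they reach the model. The sketch below covers a subset of the fields. The dictionary keys follow the dataset's column naming, and both the keys and the helper name validate_form are assumptions, since the report does not show the form-handling code.

```python
# Ranges and options copied from the form above; keys and helper are hypothetical.
NUMERIC_RANGES = {
    "Age": (18, 80),
    "DailyRate": (100, 1600),
    "DistanceFromHome": (1, 29),
    "Education": (1, 5),
    "JobLevel": (1, 5),
    "StockOptionLevel": (0, 3),
}
CATEGORY_OPTIONS = {
    "Gender": {"Male", "Female"},
    "OverTime": {"Yes", "No"},
    "MaritalStatus": {"Married", "Single", "Divorced"},
}

def validate_form(values):
    """Return a list of problems; an empty list means the input is valid."""
    problems = []
    for field, (low, high) in NUMERIC_RANGES.items():
        value = values.get(field)
        if not isinstance(value, (int, float)) or not low <= value <= high:
            problems.append(f"{field} must be between {low} and {high}")
    for field, options in CATEGORY_OPTIONS.items():
        if values.get(field) not in options:
            problems.append(f"{field} must be one of {sorted(options)}")
    return problems

sample = {
    "Age": 41, "DailyRate": 1102, "DistanceFromHome": 1, "Education": 2,
    "JobLevel": 2, "StockOptionLevel": 0,
    "Gender": "Female", "OverTime": "Yes", "MaritalStatus": "Single",
}
print(validate_form(sample))                  # []
print(validate_form(dict(sample, Age=17)))    # ['Age must be between 18 and 80']
```

Validated values would still need the same encoding and column order used during training before being passed to model.predict.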
Age  Attrition  BusinessTravel     DailyRate  Department              DistanceFromHome  Education  EducationField    EmployeeCount
41   Yes        Travel_Rarely      1102       Sales                   1                 2          Life Sciences     1
49   No         Travel_Frequently  279        Research & Development  8                 1          Life Sciences     1
37   Yes        Travel_Rarely      1373       Research & Development  2                 2          Other             1
33   No         Travel_Frequently  1392       Research & Development  3                 4          Life Sciences     1
27   No         Travel_Rarely      591        Research & Development  2                 1          Medical           1
32   No         Travel_Frequently  1005       Research & Development  2                 2          Life Sciences     1
59   No         Travel_Rarely      1324       Research & Development  3                 3          Medical           1
30   No         Travel_Rarely      1358       Research & Development  24                1          Life Sciences     1
38   No         Travel_Frequently  216        Research & Development  23                3          Life Sciences     1
36   No         Travel_Rarely      1299       Research & Development  27                3          Medical           1
35   No         Travel_Rarely      809        Research & Development  16                3          Medical           1
29   No         Travel_Rarely      153        Research & Development  15                2          Life Sciences     1
31   No         Travel_Rarely      670        Research & Development  26                1          Life Sciences     1
34   No         Travel_Rarely      1346       Research & Development  19                2          Medical           1
28   Yes        Travel_Rarely      103        Research & Development  24                3          Life Sciences     1
29   No         Travel_Rarely      1389       Research & Development  21                4          Life Sciences     1
32   No         Travel_Rarely      334        Research & Development  5                 2          Life Sciences     1
22   No         Non-Travel         1123       Research & Development  16                2          Medical           1
53   No         Travel_Rarely      1219       Sales                   2                 4          Life Sciences     1
38   No         Travel_Rarely      371        Research & Development  2                 3          Life Sciences     1
24   No         Non-Travel         673        Research & Development  11                2          Other             1
36   Yes        Travel_Rarely      1218       Sales                   9                 4          Life Sciences     1
34   No         Travel_Rarely      419        Research & Development  7                 4          Life Sciences     1
21   No         Travel_Rarely      391        Research & Development  15                2          Life Sciences     1
34   Yes        Travel_Rarely      699        Research & Development  6                 1          Medical           1
53   No         Travel_Rarely      1282       Research & Development  5                 3          Other             1
32   Yes        Travel_Frequently  1125       Research & Development  16                1          Life Sciences     1
42   No         Travel_Rarely      691        Sales                   8                 4          Marketing         1
44   No         Travel_Rarely      477        Research & Development  7                 4          Medical           1
46   No         Travel_Rarely      705        Sales                   2                 4          Marketing         1
33   No         Travel_Rarely      924        Research & Development  2                 3          Medical           1
44   No         Travel_Rarely      1459       Research & Development  10                4          Other             1
30   No         Travel_Rarely      125        Research & Development  9                 2          Medical           1
39   Yes        Travel_Rarely      895        Sales                   5                 3          Technical Degree  1
24   Yes        Travel_Rarely      813        Research & Development  1                 3          Medical           1
43   No         Travel_Rarely      1273       Research & Development  2                 2          Medical           1
50   Yes        Travel_Rarely      869        Sales                   3                 2          Marketing         1
35   No         Travel_Rarely      890        Sales                   2                 3          Marketing         1
36   No         Travel_Rarely      852        Research & Development  5                 4          Life Sciences     1
33   No         Travel_Frequently  1141       Sales                   1                 3          Life Sciences     1
35   No         Travel_Rarely      464        Research & Development  4                 2          Other             1
27   No         Travel_Rarely      1240       Research & Development  2                 4          Life Sciences     1
26   Yes        Travel_Rarely      1357       Research & Development  25                3          Life Sciences     1
27   No         Travel_Frequently  994        Sales                   8                 3          Life Sciences     1
30   No         Travel_Frequently  721        Research & Development  1                 2          Medical           1
41   Yes        Travel_Rarely      1360       Research & Development  12                3          Technical Degree  1
34   No         Non-Travel         1065       Sales                   23                4          Marketing         1
37   No         Travel_Rarely      408        Research & Development  19                2          Life Sciences     1
46   No         Travel_Frequently  1211       Sales                   5                 4          Marketing         1
35   No         Travel_Rarely      1229       Research & Development  8                 1          Life Sciences     1
48   Yes        Travel_Rarely      626        Research & Development  1                 2          Life Sciences     1
28   Yes        Travel_Rarely      1434       Research & Development  5                 4          Technical Degree  1
44   No         Travel_Rarely      1488       Sales                   1                 5          Marketing         1
35   No         Non-Travel         1097       Research & Development  11                2          Medical           1
26   No         Travel_Rarely      1443       Sales                   23                3          Marketing         1
33   No         Travel_Frequently  515        Research & Development  1                 2          Life Sciences     1
35   No         Travel_Frequently  853        Sales                   18                5          Life Sciences     1
35   No         Travel_Rarely      1142       Research & Development  23                4          Medical           1
31   No         Travel_Rarely      655        Research & Development  7                 4          Life Sciences     1
37   No         Travel_Rarely      1115       Research & Development  1                 4          Life Sciences     1
32   No         Travel_Rarely      427        Research & Development  1                 3          Medical           1
38   No         Travel_Frequently  653        Research & Development  29                5          Life Sciences     1
50   No         Travel_Rarely      989        Research & Development  7                 2          Medical           1
59   No         Travel_Rarely      1435       Sales                   25                3          Life Sciences     1
36   No         Travel_Rarely      1223       Research & Development  8                 3          Technical Degree  1
55   No         Travel_Rarely      836        Research & Development  8                 3          Medical           1
36   No         Travel_Frequently  1195       Research & Development  11                3          Life Sciences     1
