Attrition Project Mangal
Management, Meerut
(Affiliated to Dr. A.P.J. Abdul Kalam Technical University,
Lucknow) College Code 285
Session 2024-25
Project Report
On
Employee Attrition Prediction Model
Using Machine Learning
Overview:
This project aims to leverage machine learning techniques
to predict employee attrition and identify the factors
contributing to it. Employee attrition is a significant
challenge for organizations, leading to increased costs for
recruitment, onboarding, and training, as well as
disruptions to workflow and morale. By analyzing historical
employee data, the project seeks to provide actionable
insights to reduce turnover rates and improve employee
retention strategies.
Objectives:
1. To develop a machine learning model capable of
accurately predicting whether an employee is likely to
leave the organization.
2. To identify key factors influencing attrition, such as job
satisfaction, work-life balance, compensation, and
career growth opportunities.
3. To provide insights that can guide human resource
teams in making data-driven decisions to improve
employee engagement and retention.
Problem Statement
Employee attrition is a critical challenge faced by
organizations across industries. High attrition rates lead to
increased costs associated with recruitment, onboarding,
and training, as well as disruptions in workflow, team
dynamics, and overall productivity. Identifying the
underlying reasons for employee turnover and predicting
potential attrition are essential for designing effective
retention strategies.
Despite the availability of HR data, many organizations
struggle to leverage it effectively for proactive decision-
making. Traditional methods of analyzing attrition are often
time-consuming, lack accuracy, and fail to identify subtle
patterns that contribute to employee dissatisfaction and
eventual resignation.
The goal of this project is to develop a machine learning-
based solution that can:
1. Accurately predict whether an employee is likely to
leave the organization.
2. Identify and rank the factors that contribute to
attrition.
3. Provide actionable insights for human
resource teams to implement targeted interventions
and reduce turnover rates.
Dataset Info
Possible Sources:
• Kaggle Dataset
Key Features of the Dataset
1. Age: Employee's age.
2. Attrition: Target variable indicating whether the employee left the
organization.
3. BusinessTravel: Frequency of business travel (e.g., Rarely,
Frequently).
4. DailyRate / MonthlyIncome: Financial metrics related to employee
salary.
5. Department: Department of the employee (e.g., Sales, HR, R&D).
6. DistanceFromHome: Commute distance from home to work.
7. Education & EducationField: Employee’s education level and field of
study.
8. EnvironmentSatisfaction: Satisfaction with the work environment (1–
4 scale).
9. Gender: Gender of the employee.
10. JobRole & JobLevel: Role and position level in the organization.
11. JobSatisfaction: Satisfaction with the job itself (1–4 scale).
12. MaritalStatus: Employee’s marital status.
13. OverTime: Whether the employee works overtime.
14. WorkLifeBalance: Perception of work-life balance (1–4 scale).
15. YearsAtCompany / TotalWorkingYears: Employee’s tenure and
total work experience.
16. YearsSinceLastPromotion: Time since the last promotion.
17. TrainingTimesLastYear: Number of training sessions attended in
the past year.
In [2]: # visualization
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'
In [3]: # Preprocessing
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, label_binarize, StandardScaler
In [7]: df.head()
Out[7]:
   Age  Attrition  BusinessTravel     DailyRate  Department              DistanceFromHome  Education  EducationField  EmployeeCount  Employe
1  49   No         Travel_Frequently  279        Research & Development  8                 1          Life Sciences   1
2  37   Yes        Travel_Rarely      1373       Research & Development  2                 2          Other           1
3  33   No         Travel_Frequently  1392       Research & Development  3                 4          Life Sciences   1
4  27   No         Travel_Rarely      591        Research & Development  2                 1          Medical         1
5 rows × 35 columns
In [8]: df.shape
In [9]: ProfileReport(df)
Out[9]:
Overview
Dataset info
Number of variables 35
Number of observations 1470
Total Missing (%) 0.0%
Total size in memory 402.1 KiB
Average record size in memory 280.1 B
Variables types
Numeric 22
Categorical 8
Boolean 1
Date 0
Text (Unique) 0
Rejected 4
Unsupported 0
Warnings
Variables
Age
Numeric
Distinct count 43
Unique (%) 2.9%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 36.924
Minimum 18
Maximum 60
Zeros (%) 0.0%
Quantile statistics
Minimum 18
5-th percentile 24
Q1 30
Median 36
Q3 43
95-th percentile 54
Maximum 60
Range 42
Interquartile range 13
Descriptive statistics
ValueCountFrequency (%)
35 78 5.3%
34 77 5.2%
31 69 4.7%
36 69 4.7%
29 68 4.6%
32 61 4.1%
30 60 4.1%
33 58 3.9%
38 58 3.9%
40 57 3.9%
Other values (33) 815 55.4%
Minimum 5 values
ValueCountFrequency (%)
18 8 0.5%
19 9 0.6%
20 11 0.7%
21 13 0.9%
22 16 1.1%
Maximum 5 values
ValueCountFrequency (%)
56 14 1.0%
57 4 0.3%
58 14 1.0%
59 10 0.7%
60 5 0.3%
Attrition
Categorical
Distinct count 2
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0
No 1233
Yes 237
ValueCountFrequency (%)
No 1233 83.9%
Yes 237 16.1%
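The counts above imply an imbalanced target. The short check below reproduces the reported percentages from the two counts alone and shows the accuracy a trivial "always No" model would reach, which is a useful floor when judging any later classifier. It is only an arithmetic sketch based on the figures in the profile report.

```python
# Class balance implied by the profile report counts (1233 "No", 237 "Yes").
counts = {"No": 1233, "Yes": 237}
total = sum(counts.values())

# Share of each class in percent, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(counts.values()) / total

print(total)
print(shares)
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Because roughly 84% accuracy is available without learning anything, metrics such as recall or precision on the "Yes" class are likely to be more informative than accuracy alone.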
BusinessTravel
Categorical
Distinct count 3
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
Travel_Rarely 1043
Travel_Frequently 277
Non-Travel 150
ValueCountFrequency (%)
Travel_Rarely 1043 71.0%
Travel_Frequently 277 18.8%
Non-Travel 150 10.2%
DailyRate
Numeric
Mean 802.49
Minimum 102
Maximum 1499
Zeros (%) 0.0%
Quantile statistics
Minimum 102
5-th percentile 165.35
Q1 465
Median 802
Q3 1157
95-th percentile 1424.1
Maximum 1499
Range 1397
Interquartile range 692
Descriptive statistics
ValueCountFrequency (%)
691 6 0.4%
1082 5 0.3%
329 5 0.3%
1329 5 0.3%
530 5 0.3%
408 5 0.3%
715 4 0.3%
589 4 0.3%
906 4 0.3%
350 4 0.3%
Other values (876) 1423 96.8%
Minimum 5 values
ValueCountFrequency (%)
102 1 0.1%
103 1 0.1%
104 1 0.1%
105 1 0.1%
106 1 0.1%
Maximum 5 values
ValueCountFrequency (%)
1492 1 0.1%
1495 3 0.2%
1496 2 0.1%
1498 1 0.1%
1499 1 0.1%
Department
Categorical
Distinct count 3
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
Research & Development 961
Sales 446
Human Resources 63
ValueCountFrequency (%)
Research & Development 961 65.4%
Sales 446 30.3%
Human Resources 63 4.3%
DistanceFromHome
Numeric
Distinct count 29
Unique (%) 2.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 9.1925
Minimum 1
Maximum 29
Zeros (%) 0.0%
Quantile statistics
Minimum 1
5-th percentile 1
Q1 2
Median 7
Q3 14
95-th percentile 26
Maximum 29
Range 28
Interquartile range 12
Descriptive statistics
Minimum 5 values
ValueCountFrequency (%)
1 208 14.1%
2 211 14.4%
3 84 5.7%
4 64 4.4%
5 65 4.4%
Maximum 5 values
ValueCountFrequency (%)
25 25 1.7%
26 25 1.7%
27 12 0.8%
28 23 1.6%
29 27 1.8%
Education
Numeric
Distinct count 5
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.9129
Minimum 1
Maximum 5
Zeros (%) 0.0%
Quantile statistics
Minimum 1
5-th percentile 1
Q1 2
Median 3
Q3 4
95-th percentile 4
Maximum 5
Range 4
Interquartile range 2
Descriptive statistics
ValueCountFrequency (%)
3 572 38.9%
4 398 27.1%
2 282 19.2%
1 170 11.6%
5 48 3.3%
Minimum 5 values
ValueCountFrequency (%)
1 170 11.6%
2 282 19.2%
3 572 38.9%
4 398 27.1%
5 48 3.3%
Maximum 5 values
ValueCountFrequency (%)
1 170 11.6%
2 282 19.2%
3 572 38.9%
4 398 27.1%
5 48 3.3%
EducationField
Categorical
Distinct count 6
Unique (%) 0.4%
Missing (%) 0.0%
Missing (n) 0
ValueCountFrequency (%)
Life Sciences 606 41.2%
Medical 464 31.6%
Marketing 159 10.8%
Technical Degree 132 9.0%
Other 82 5.6%
Human Resources 27 1.8%
EmployeeCount
Constant
Constant value 1
EmployeeNumber
Numeric
Mean 1024.9
Minimum 1
Maximum 2068
Zeros (%) 0.0%
Quantile statistics
Minimum 1
5-th percentile 96.45
Q1 491.25
Median 1020.5
Q3 1555.8
95-th percentile 1967.5
Maximum 2068
Range 2067
Interquartile range 1064.5
Descriptive statistics
ValueCountFrequency (%)
2046 1 0.1%
641 1 0.1%
644 1 0.1%
645 1 0.1%
647 1 0.1%
648 1 0.1%
649 1 0.1%
650 1 0.1%
652 1 0.1%
653 1 0.1%
Other values (1460) 1460 99.3%
Minimum 5 values
ValueCountFrequency (%)
1 1 0.1%
2 1 0.1%
4 1 0.1%
5 1 0.1%
7 1 0.1%
Maximum 5 values
ValueCountFrequency (%)
2061 1 0.1%
2062 1 0.1%
2064 1 0.1%
2065 1 0.1%
2068 1 0.1%
EnvironmentSatisfaction
Numeric
Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.7218
Minimum 1
Maximum 4
Zeros (%) 0.0%
Quantile statistics
Minimum 1
5-th percentile 1
Q1 2
Median 3
Q3 4
95-th percentile 4
Maximum 4
Range 3
Interquartile range 2
Descriptive statistics
ValueCountFrequency (%)
3 453 30.8%
4 446 30.3%
2 287 19.5%
1 284 19.3%
Minimum 5 values
ValueCountFrequency (%)
1 284 19.3%
2 287 19.5%
3 453 30.8%
4 446 30.3%
Maximum 5 values
ValueCountFrequency (%)
1 284 19.3%
2 287 19.5%
3 453 30.8%
4 446 30.3%
Gender
Categorical
Distinct count 2
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0
Male 882
Female 588
ValueCountFrequency (%)
Male 882 60.0%
Female 588 40.0%
HourlyRate
Numeric
Distinct count 71
Unique (%) 4.8%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 65.891
Minimum 30
Maximum 100
Zeros (%) 0.0%
Quantile statistics
Minimum 30
5-th percentile 33
Q1 48
Median 66
Q3 83.75
95-th percentile 97
Maximum 100
Range 70
Interquartile range 35.75
Descriptive statistics
ValueCountFrequency (%)
66 29 2.0%
42 28 1.9%
98 28 1.9%
48 28 1.9%
84 28 1.9%
79 27 1.8%
96 27 1.8%
57 27 1.8%
52 26 1.8%
87 26 1.8%
Other values (61) 1196 81.4%
Minimum 5 values
ValueCountFrequency (%)
30 19 1.3%
31 15 1.0%
32 24 1.6%
33 19 1.3%
34 12 0.8%
Maximum 5 values
ValueCountFrequency (%)
96 27 1.8%
97 21 1.4%
98 28 1.9%
99 20 1.4%
100 19 1.3%
JobInvolvement
Numeric
Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.7299
Minimum 1
Maximum 4
Zeros (%) 0.0%
Quantile statistics
Minimum 1
5-th percentile 1
Q1 2
Median 3
Q3 3
95-th percentile 4
Maximum 4
Range 3
Interquartile range 1
Descriptive statistics
ValueCountFrequency (%)
3 868 59.0%
2 375 25.5%
4 144 9.8%
1 83 5.6%
Minimum 5 values
ValueCountFrequency (%)
1 83 5.6%
2 375 25.5%
3 868 59.0%
4 144 9.8%
Maximum 5 values
ValueCountFrequency (%)
1 83 5.6%
2 375 25.5%
3 868 59.0%
4 144 9.8%
JobLevel
Numeric
Distinct count 5
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.0639
Minimum 1
Maximum 5
Zeros (%) 0.0%
Quantile statistics
Minimum 1
5-th percentile 1
Q1 1
Median 2
Q3 3
95-th percentile 4
Maximum 5
Range 4
Interquartile range 2
Descriptive statistics
ValueCountFrequency (%)
1 543 36.9%
2 534 36.3%
3 218 14.8%
4 106 7.2%
5 69 4.7%
Minimum 5 values
ValueCountFrequency (%)
1 543 36.9%
2 534 36.3%
3 218 14.8%
4 106 7.2%
5 69 4.7%
Maximum 5 values
ValueCountFrequency (%)
1 543 36.9%
2 534 36.3%
3 218 14.8%
4 106 7.2%
5 69 4.7%
JobRole
Categorical
Distinct count 9
Unique (%) 0.6%
Missing (%) 0.0%
Missing (n) 0
ValueCountFrequency (%)
Sales Executive 326 22.2%
Research Scientist 292 19.9%
Laboratory Technician 259 17.6%
Manufacturing Director 145 9.9%
Healthcare Representative 131 8.9%
Manager 102 6.9%
Sales Representative 83 5.6%
Research Director 80 5.4%
Human Resources 52 3.5%
JobSatisfaction
Numeric
Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.7286
Minimum 1
Maximum 4
Zeros (%) 0.0%
Quantile statistics
Minimum 1
5-th percentile 1
Q1 2
Median 3
Q3 4
95-th percentile 4
Maximum 4
Range 3
Interquartile range 2
Descriptive statistics
ValueCountFrequency (%)
4 459 31.2%
3 442 30.1%
1 289 19.7%
2 280 19.0%
Minimum 5 values
ValueCountFrequency (%)
1 289 19.7%
2 280 19.0%
3 442 30.1%
4 459 31.2%
Maximum 5 values
ValueCountFrequency (%)
1 289 19.7%
2 280 19.0%
3 442 30.1%
4 459 31.2%
MaritalStatus
Categorical
Distinct count 3
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
Married 673
Single 470
Divorced 327
ValueCountFrequency (%)
Married 673 45.8%
Single 470 32.0%
Divorced 327 22.2%
MonthlyIncome
Highly correlated
This variable is highly correlated with JobLevel and should be ignored for analysis
Correlation 0.9503
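The profile flags MonthlyIncome because of its 0.9503 correlation with JobLevel. A minimal sketch of how such pairs could be found directly with pandas is given below. The helper name `highly_correlated_pairs`, the 0.9 threshold, and the toy data are illustrative assumptions, not taken from the notebook or from the profiling library's internals.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for numeric column pairs with |r| above threshold."""
    corr = frame.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, float(r)))
    return pairs


# Toy data (hypothetical values, not the real dataset): income tracks job level closely.
rng = np.random.default_rng(0)
job_level = rng.integers(1, 6, size=200)
toy = pd.DataFrame({
    "JobLevel": job_level,
    "MonthlyIncome": 4000 * job_level + rng.normal(0, 500, size=200),
    "DistanceFromHome": rng.integers(1, 30, size=200),
})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

When two features carry nearly the same information, keeping only one of them is a common choice; which one to keep is a modelling decision rather than something the correlation itself settles.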
MonthlyRate
Numeric
Mean 14313
Minimum 2094
Maximum 26999
Zeros (%) 0.0%
Quantile statistics
Minimum 2094
5-th percentile 3384.6
Q1 8047
Median 14236
Q3 20462
95-th percentile 25432
Maximum 26999
Range 24905
Interquartile range 12414
Descriptive statistics
Minimum 5 values
ValueCountFrequency (%)
2094 1 0.1%
2097 1 0.1%
2104 1 0.1%
2112 1 0.1%
2122 1 0.1%
Maximum 5 values
ValueCountFrequency (%)
26956 1 0.1%
26959 1 0.1%
26968 1 0.1%
26997 1 0.1%
26999 1 0.1%
NumCompaniesWorked
Numeric
Distinct count 10
Unique (%) 0.7%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.6932
Minimum 0
Maximum 9
Zeros (%) 13.4%
Quantile statistics
Minimum 0
5-th percentile 0
Q1 1
Median 2
Q3 4
95-th percentile 8
Maximum 9
Range 9
Interquartile range 3
Descriptive statistics
ValueCountFrequency (%)
1 521 35.4%
0 197 13.4%
3 159 10.8%
2 146 9.9%
4 139 9.5%
7 74 5.0%
6 70 4.8%
5 63 4.3%
9 52 3.5%
8 49 3.3%
Minimum 5 values
ValueCountFrequency (%)
0 197 13.4%
1 521 35.4%
2 146 9.9%
3 159 10.8%
4 139 9.5%
Maximum 5 values
ValueCountFrequency (%)
5 63 4.3%
6 70 4.8%
7 74 5.0%
8 49 3.3%
9 52 3.5%
Over18
Constant
Constant value Y
OverTime
Categorical
Distinct count 2
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0
No 1054
Yes 416
ValueCountFrequency (%)
No 1054 71.7%
Yes 416 28.3%
PercentSalaryHike
Numeric
Distinct count 15
Unique (%) 1.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 15.21
Minimum 11
Maximum 25
Zeros (%) 0.0%
Quantile statistics
Minimum 11
5-th percentile 11
Q1 12
Median 14
Q3 18
95-th percentile 22
Maximum 25
Range 14
Interquartile range 6
Descriptive statistics
ValueCountFrequency (%)
11 210 14.3%
13 209 14.2%
14 201 13.7%
12 198 13.5%
15 101 6.9%
18 89 6.1%
17 82 5.6%
16 78 5.3%
19 76 5.2%
22 56 3.8%
Other values (5) 170 11.6%
Minimum 5 values
ValueCountFrequency (%)
11 210 14.3%
12 198 13.5%
13 209 14.2%
14 201 13.7%
15 101 6.9%
Maximum 5 values
ValueCountFrequency (%)
21 48 3.3%
22 56 3.8%
23 28 1.9%
24 21 1.4%
25 18 1.2%
PerformanceRating
Boolean
Distinct count 2
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0
Mean 3.1537
3 1244
4 226
ValueCountFrequency (%)
3 1244 84.6%
4 226 15.4%
RelationshipSatisfaction
Numeric
Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.7122
Minimum 1
Maximum 4
Zeros (%) 0.0%
Quantile statistics
Minimum 1
5-th percentile 1
Q1 2
Median 3
Q3 4
95-th percentile 4
Maximum 4
Range 3
Interquartile range 2
Descriptive statistics
Minimum 5 values
ValueCountFrequency (%)
1 276 18.8%
2 303 20.6%
3 459 31.2%
4 432 29.4%
Maximum 5 values
ValueCountFrequency (%)
1 276 18.8%
2 303 20.6%
3 459 31.2%
4 432 29.4%
StandardHours
Constant
Constant value 80
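EmployeeCount, Over18 and StandardHours are reported as constant, so they cannot help separate leavers from stayers. One possible way to detect and drop such columns is sketched below; the helper `constant_columns` and the toy frame are hypothetical and only mirror the column names from the report.

```python
import pandas as pd


def constant_columns(frame):
    """Names of columns that hold a single distinct value."""
    return [col for col in frame.columns if frame[col].nunique(dropna=False) <= 1]


# Toy frame with one informative column and three constant ones.
toy = pd.DataFrame({
    "Age": [49, 37, 33, 27],
    "EmployeeCount": [1, 1, 1, 1],
    "Over18": ["Y", "Y", "Y", "Y"],
    "StandardHours": [80, 80, 80, 80],
})

to_drop = constant_columns(toy)
cleaned = toy.drop(columns=to_drop)
print(to_drop)
print(list(cleaned.columns))
```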
StockOptionLevel
Numeric
Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.79388
Minimum 0
Maximum 3
Zeros (%) 42.9%
Quantile statistics
Minimum 0
5-th percentile 0
Q1 0
Median 1
Q3 1
95-th percentile 3
Maximum 3
Range 3
Interquartile range 1
Descriptive statistics
ValueCountFrequency (%)
0 631 42.9%
1 596 40.5%
2 158 10.7%
3 85 5.8%
Minimum 5 values
ValueCountFrequency (%)
0 631 42.9%
1 596 40.5%
2 158 10.7%
3 85 5.8%
Maximum 5 values
ValueCountFrequency (%)
0 631 42.9%
1 596 40.5%
2 158 10.7%
3 85 5.8%
TotalWorkingYears
Numeric
Distinct count 40
Unique (%) 2.7%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 11.28
Minimum 0
Maximum 40
Zeros (%) 0.7%
Quantile statistics
Minimum 0
5-th percentile 1
Q1 6
Median 10
Q3 15
95-th percentile 28
Maximum 40
Range 40
Interquartile range 9
Descriptive statistics
ValueCountFrequency (%)
10 202 13.7%
6 125 8.5%
8 103 7.0%
9 96 6.5%
5 88 6.0%
1 81 5.5%
7 81 5.5%
4 63 4.3%
12 48 3.3%
3 42 2.9%
Other values (30) 541 36.8%
Minimum 5 values
ValueCountFrequency (%)
0 11 0.7%
1 81 5.5%
2 31 2.1%
3 42 2.9%
4 63 4.3%
Maximum 5 values
ValueCountFrequency (%)
35 3 0.2%
36 6 0.4%
37 4 0.3%
38 1 0.1%
40 2 0.1%
TrainingTimesLastYear
Numeric
Distinct count 7
Unique (%) 0.5%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.7993
Minimum 0
Maximum 6
Zeros (%) 3.7%
Quantile statistics
Minimum 0
5-th percentile 1
Q1 2
Median 3
Q3 3
95-th percentile 5
Maximum 6
Range 6
Interquartile range 1
Descriptive statistics
Minimum 5 values
ValueCountFrequency (%)
0 54 3.7%
1 71 4.8%
2 547 37.2%
3 491 33.4%
4 123 8.4%
Maximum 5 values
ValueCountFrequency (%)
2 547 37.2%
3 491 33.4%
4 123 8.4%
5 119 8.1%
6 65 4.4%
WorkLifeBalance
Numeric
Distinct count 4
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.7612
Minimum 1
Maximum 4
Zeros (%) 0.0%
Quantile statistics
Minimum 1
5-th percentile 1
Q1 2
Median 3
Q3 3
95-th percentile 4
Maximum 4
Range 3
Interquartile range 1
Descriptive statistics
ValueCountFrequency (%)
3 893 60.7%
2 344 23.4%
4 153 10.4%
1 80 5.4%
Minimum 5 values
ValueCountFrequency (%)
1 80 5.4%
2 344 23.4%
3 893 60.7%
4 153 10.4%
Maximum 5 values
ValueCountFrequency (%)
1 80 5.4%
2 344 23.4%
3 893 60.7%
4 153 10.4%
YearsAtCompany
Numeric
Distinct count 37
Unique (%) 2.5%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 7.0082
Minimum 0
Maximum 40
Zeros (%) 3.0%
Quantile statistics
Minimum 0
5-th percentile 1
Q1 3
Median 5
Q3 9
95-th percentile 20
Maximum 40
Range 40
Interquartile range 6
Descriptive statistics
ValueCountFrequency (%)
5 196 13.3%
1 171 11.6%
3 128 8.7%
2 127 8.6%
10 120 8.2%
4 110 7.5%
7 90 6.1%
9 82 5.6%
8 80 5.4%
6 76 5.2%
Other values (27) 290 19.7%
Minimum 5 values
ValueCountFrequency (%)
0 44 3.0%
1 171 11.6%
2 127 8.6%
3 128 8.7%
4 110 7.5%
Maximum 5 values
ValueCountFrequency (%)
33 5 0.3%
34 1 0.1%
36 2 0.1%
37 1 0.1%
40 1 0.1%
YearsInCurrentRole
Numeric
Distinct count 19
Unique (%) 1.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 4.2293
Minimum 0
Maximum 18
Zeros (%) 16.6%
Quantile statistics
Minimum 0
5-th percentile 0
Q1 2
Median 3
Q3 7
95-th percentile 11
Maximum 18
Range 18
Interquartile range 5
Descriptive statistics
Minimum 5 values
ValueCountFrequency (%)
0 244 16.6%
1 57 3.9%
2 372 25.3%
3 135 9.2%
4 104 7.1%
Maximum 5 values
ValueCountFrequency (%)
14 11 0.7%
15 8 0.5%
16 7 0.5%
17 4 0.3%
18 2 0.1%
YearsSinceLastPromotion
Numeric
Distinct count 16
Unique (%) 1.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.1878
Minimum 0
Maximum 15
Zeros (%) 39.5%
Quantile statistics
Minimum 0
5-th percentile 0
Q1 0
Median 1
Q3 3
95-th percentile 9
Maximum 15
Range 15
Interquartile range 3
Descriptive statistics
ValueCountFrequency (%)
0 581 39.5%
1 357 24.3%
2 159 10.8%
7 76 5.2%
4 61 4.1%
3 52 3.5%
5 45 3.1%
6 32 2.2%
11 24 1.6%
8 18 1.2%
Other values (6) 65 4.4%
Minimum 5 values
ValueCountFrequency (%)
0 581 39.5%
1 357 24.3%
2 159 10.8%
3 52 3.5%
4 61 4.1%
Maximum 5 values
ValueCountFrequency (%)
11 24 1.6%
12 10 0.7%
13 10 0.7%
14 9 0.6%
15 13 0.9%
YearsWithCurrManager
Numeric
Distinct count 18
Unique (%) 1.2%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 4.1231
Minimum 0
Maximum 17
Zeros (%) 17.9%
Quantile statistics
Minimum 0
5-th percentile 0
Q1 2
Median 3
Q3 7
95-th percentile 10
Maximum 17
Range 17
Interquartile range 5
Descriptive statistics
Minimum 5 values
ValueCountFrequency (%)
0 263 17.9%
1 76 5.2%
2 344 23.4%
3 142 9.7%
4 98 6.7%
Maximum 5 values
ValueCountFrequency (%)
13 14 1.0%
14 5 0.3%
15 5 0.3%
16 2 0.1%
17 7 0.5%
Correlations
Sample
# Count Plot
df[col_name].value_counts().plot.bar(cmap='Set2',ax=ax[0])
ax[1].set_title(f'Number of Employee by {col_name}')
ax[1].set_ylabel('Count')
ax[1].set_xlabel(f'{col_name}')
In [14]: categorical_column_viz('BusinessTravel')
In [15]: categorical_column_viz('Department')
In [16]: categorical_column_viz('EducationField')
In [17]: categorical_column_viz('Education')
In [18]: categorical_column_viz('EnvironmentSatisfaction')
In [19]: categorical_column_viz('Gender')
In [20]: categorical_column_viz('JobRole')
In [21]: categorical_column_viz('JobInvolvement')
In [22]: categorical_column_viz('MaritalStatus')
In [23]: categorical_column_viz('NumCompaniesWorked')
In [24]: categorical_column_viz('OverTime')
In [25]: categorical_column_viz('StockOptionLevel')
In [26]: categorical_column_viz('TrainingTimesLastYear')
In [27]: categorical_column_viz('YearsWithCurrManager')
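Only a fragment of the categorical plotting helper survives in this export, so the following is a hedged sketch of what a comparable two-panel helper might look like, not a reconstruction of the original. The function name `categorical_column_viz_sketch`, its `df` and `target` parameters, and the choice of an attrition-rate panel on the right are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def categorical_column_viz_sketch(df, col_name, target="Attrition"):
    """Hypothetical two-panel view: category counts and share of leavers."""
    fig, ax = plt.subplots(1, 2, figsize=(18, 6))

    # Left panel: number of employees in each category.
    counts = df[col_name].value_counts()
    counts.plot.bar(ax=ax[0])
    ax[0].set_title(f"Number of Employees by {col_name}")
    ax[0].set_ylabel("Count")
    ax[0].set_xlabel(col_name)

    # Right panel: fraction of each category whose target value is "Yes".
    rate = (df[target] == "Yes").groupby(df[col_name]).mean()
    rate.plot.bar(ax=ax[1])
    ax[1].set_title(f"Attrition Rate by {col_name}")
    ax[1].set_ylabel("Share who left")
    ax[1].set_xlabel(col_name)
    return fig, counts, rate


# Toy data with made-up values, only to exercise the helper.
toy = pd.DataFrame({
    "OverTime": ["Yes", "No", "No", "Yes", "No", "No"],
    "Attrition": ["Yes", "No", "No", "No", "No", "Yes"],
})
fig, counts, rate = categorical_column_viz_sketch(toy, "OverTime")
plt.close(fig)
```

Showing a rate next to the raw counts helps with small categories such as Human Resources, where a low count can hide a high share of leavers.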
Visualization of Numerical Features
In [28]: def numerical_column_viz(col_name):
    f, ax = plt.subplots(1, 2, figsize=(18, 6))
    # fill=True replaces shade=True, which newer seaborn releases deprecate
    sns.kdeplot(attrition[col_name], label='Employee who left', ax=ax[0], fill=True, color='palegreen')
    sns.kdeplot(no_attrition[col_name], label='Employee who stayed', ax=ax[0], fill=True, color='salmon')
In [29]: numerical_column_viz("Age")
In [30]: numerical_column_viz("Age")
In [31]: numerical_column_viz("DailyRate")
In [32]: numerical_column_viz("DistanceFromHome")
In [33]: numerical_column_viz("MonthlyIncome")
In [34]: numerical_column_viz("HourlyRate")
In [35]: numerical_column_viz("JobInvolvement")
In [36]: numerical_column_viz("PercentSalaryHike")
In [37]: numerical_column_viz("Age")
In [38]: numerical_column_viz("DailyRate")
In [39]: numerical_column_viz("TotalWorkingYears")
In [40]: numerical_column_viz("YearsAtCompany")
In [41]: numerical_column_viz("YearsInCurrentRole")
In [42]: numerical_column_viz("YearsSinceLastPromotion")
In [43]: numerical_column_viz("YearsWithCurrManager")
In [45]: categorical_numerical('Age','Gender','MaritalStatus')
In [46]: categorical_numerical('Age','JobRole','EducationField')
In [47]: categorical_numerical('MonthlyIncome','Gender','MaritalStatus')
Feature Engineering
In [48]: # 'EnvironmentSatisfaction', 'JobInvolvement', 'JobSatisfaction', 'RelationshipSatisfaction', 'WorkLifeBalance' can be cl
df['Total_Satisfaction'] = (df['EnvironmentSatisfaction'] +
df['JobInvolvement'] +
df['JobSatisfaction'] +
df['RelationshipSatisfaction'] +
df['WorkLifeBalance']) /5
# Drop Columns
df.drop(['EnvironmentSatisfaction','JobInvolvement','JobSatisfaction','RelationshipSatisfaction','WorkLifeBalance'], axis=
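The cell above averages five 1–4 ratings into a single score and then drops the source columns. A self-contained sketch of the same idea on toy data follows. The ratings are made up, and because the drop call is cut off in this export, the sketch uses `drop(columns=...)` as one possible form rather than the notebook's exact call.

```python
import pandas as pd

SATISFACTION_COLS = [
    "EnvironmentSatisfaction", "JobInvolvement", "JobSatisfaction",
    "RelationshipSatisfaction", "WorkLifeBalance",
]

# Toy rows with hypothetical ratings on the 1-4 scale.
toy = pd.DataFrame({
    "EnvironmentSatisfaction": [2, 4],
    "JobInvolvement": [3, 4],
    "JobSatisfaction": [4, 4],
    "RelationshipSatisfaction": [1, 4],
    "WorkLifeBalance": [1, 4],
})

# The row-wise mean equals the sum of the five columns divided by 5.
toy["Total_Satisfaction"] = toy[SATISFACTION_COLS].mean(axis=1)

# Drop the source columns once the combined score exists.
toy = toy.drop(columns=SATISFACTION_COLS)
print(toy)
```

An unweighted mean treats all five ratings as equally important, which is a simplification; a model trained on the separate columns could weight them differently.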
In [49]: categorical_column_viz('Total_Satisfaction')
In [50]: df.Total_Satisfaction.describe()
In [52]: # The attrition rate is high among employees below the age of 35
In [53]: # Employees are more likely to leave the job if DailyRate is less than 800
In [56]: # Employees are more likely to leave the job if they work as a Laboratory Technician
In [57]: # Employees are more likely to leave the job if HourlyRate < 65
In [58]: # Employees are more likely to leave the job if MonthlyIncome < 4000
In [60]: # Employees are more likely to leave the job if TotalWorkingYears < 8
In [61]: # Employees are more likely to leave the job if YearsAtCompany < 3
In [62]: # Employees are more likely to leave the job if YearsInCurrentRole < 3
In [63]: # Employees are more likely to leave the job if YearsSinceLastPromotion < 1
In [64]: # Employees are more likely to leave the job if YearsWithCurrManager < 1
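The cells that turn these observations into features are not visible in this export, but the `_bool` columns in the later `df.info()` output suggest each threshold became a 0/1 flag. The sketch below shows one way that could be done. The helper `add_threshold_flags`, the direction of each comparison, and the toy values are assumptions; rules for columns such as Department_bool or JobRole_bool are not shown in the source and are left out.

```python
import pandas as pd

# Cut-offs copied from the observations above; a flag of 1 marks the range
# described as having more attrition.
THRESHOLDS = {
    "Age": 35,
    "DailyRate": 800,
    "HourlyRate": 65,
    "MonthlyIncome": 4000,
    "TotalWorkingYears": 8,
    "YearsAtCompany": 3,
    "YearsInCurrentRole": 3,
    "YearsSinceLastPromotion": 1,
    "YearsWithCurrManager": 1,
}


def add_threshold_flags(frame, thresholds=THRESHOLDS):
    """Add a <column>_bool flag for every thresholded column present in frame."""
    out = frame.copy()
    for col, cutoff in thresholds.items():
        if col in out.columns:
            out[f"{col}_bool"] = (out[col] < cutoff).astype(int)
    return out


# Toy rows with hypothetical values.
toy = pd.DataFrame({"Age": [49, 27], "MonthlyIncome": [5130, 3468]})
flagged = add_threshold_flags(toy)
print(flagged)
```

Binarizing discards information within each side of the cut-off, so it may be worth comparing against a model that keeps the raw numeric values.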
In [70]: y = df['Attrition']
In [68]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 25 columns):
Attrition 1470 non-null int64
BusinessTravel 1470 non-null category
Education 1470 non-null category
EducationField 1470 non-null category
Gender 1470 non-null category
JobLevel 1470 non-null int64
MaritalStatus 1470 non-null category
OverTime 1470 non-null category
PerformanceRating 1470 non-null int64
StockOptionLevel 1470 non-null category
TrainingTimesLastYear 1470 non-null category
Total_Satisfaction_bool 1470 non-null int64
Age_bool 1470 non-null int64
DailyRate_bool 1470 non-null int64
Department_bool 1470 non-null int64
DistanceFromHome_bool 1470 non-null int64
JobRole_bool 1470 non-null int64
HourlyRate_bool 1470 non-null int64
MonthlyIncome_bool 1470 non-null int64
NumCompaniesWorked_bool 1470 non-null int64
TotalWorkingYears_bool 1470 non-null int64
YearsAtCompany_bool 1470 non-null int64
YearsInCurrentRole_bool 1470 non-null int64
YearsSinceLastPromotion_bool 1470 non-null int64
YearsWithCurrManager_bool 1470 non-null int64
dtypes: category(8), int64(17)
memory usage: 208.2 KB
0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0 0 0 0
1 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0 0 1 0
2 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0 1 0 1
3 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0 0 1 1
4 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 1 1 1
5 rows × 48 columns
onehotencoder = OneHotEncoder()
X_categorical = onehotencoder.fit_transform(X_categorical).toarray()
X_categorical = pd.DataFrame(X_categorical)
X_categorical
Out[71]: 0 1 2 3 4 5 6 7 8 9 ... 22 23 24 25 26 27 28 29 30 31
0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
2 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
4 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
5 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
6 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
7 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
8 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
9 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
10 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
11 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
12 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
13 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
14 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
15 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
16 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
17 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
18 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
19 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
20 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
21 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
22 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
23 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
24 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
25 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
26 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
27 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
28 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
29 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1440 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
1441 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1442 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
1443 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1444 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1445 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
1446 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1447 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1448 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1449 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1450 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1451 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
1452 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
1453 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1454 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
1455 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1456 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1457 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1458 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1459 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1460 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
1461 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
1462 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1463 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1464 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1465 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
1466 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1467 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
1468 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
1469 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
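The encoded frame has 32 columns labelled 0 to 31. Assuming `X_categorical` holds the eight category-typed columns listed by `df.info()`, that width matches the sum of their distinct counts in the profile report. The sketch below checks the arithmetic and shows one way to keep readable column names by using the encoder's fitted `categories_` attribute; the toy frame is illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Distinct-value counts reported by the profile for the eight category columns.
distinct = {
    "BusinessTravel": 3, "Education": 5, "EducationField": 6, "Gender": 2,
    "MaritalStatus": 3, "OverTime": 2, "StockOptionLevel": 4,
    "TrainingTimesLastYear": 7,
}
expected_width = sum(distinct.values())
print(expected_width)

# Toy frame with two of the columns, to show the naming step.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "OverTime": ["No", "Yes", "No"],
})
encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy).toarray()

# Build names such as "Gender_Female" from the fitted categories
# instead of relying on positional labels 0..n-1.
names = [
    f"{col}_{cat}"
    for col, cats in zip(toy.columns, encoder.categories_)
    for cat in cats
]
encoded_df = pd.DataFrame(encoded, columns=names, index=toy.index)
print(encoded_df)
```

Named columns make later feature-importance output far easier to read than integer positions.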
In [73]: X_all.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 48 columns):
0 1470 non-null float64
1 1470 non-null float64
2 1470 non-null float64
3 1470 non-null float64
4 1470 non-null float64
5 1470 non-null float64
6 1470 non-null float64
7 1470 non-null float64
8 1470 non-null float64
9 1470 non-null float64
10 1470 non-null float64
11 1470 non-null float64
12 1470 non-null float64
13 1470 non-null float64
14 1470 non-null float64
15 1470 non-null float64
16 1470 non-null float64
17 1470 non-null float64
18 1470 non-null float64
19 1470 non-null float64
20 1470 non-null float64
21 1470 non-null float64
22 1470 non-null float64
23 1470 non-null float64
24 1470 non-null float64
25 1470 non-null float64
26 1470 non-null float64
27 1470 non-null float64
28 1470 non-null float64
29 1470 non-null float64
30 1470 non-null float64
31 1470 non-null float64
JobLevel 1470 non-null int64
PerformanceRating 1470 non-null int64
Total_Satisfaction_bool 1470 non-null int64
Age_bool 1470 non-null int64
DailyRate_bool 1470 non-null int64
Department_bool 1470 non-null int64
DistanceFromHome_bool 1470 non-null int64
JobRole_bool 1470 non-null int64
HourlyRate_bool 1470 non-null int64
MonthlyIncome_bool 1470 non-null int64
NumCompaniesWorked_bool 1470 non-null int64
TotalWorkingYears_bool 1470 non-null int64
YearsAtCompany_bool 1470 non-null int64
YearsInCurrentRole_bool 1470 non-null int64
YearsSinceLastPromotion_bool 1470 non-null int64
YearsWithCurrManager_bool 1470 non-null int64
dtypes: float64(32), int64(16)
memory usage: 551.4 KB
Split Data
In [74]: X_train,X_test, y_train, y_test = train_test_split(X_all,y, test_size=0.30)
Train data shape: (1029, 48), Test Data Shape (441, 48)
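The split above is neither seeded nor stratified, so the shapes are stable but the rows selected, and the share of leavers in each part, can change from run to run. A minimal sketch of a seeded, stratified split follows. The synthetic data, the 16% positive rate and `random_state=42` are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_all / y (assumption: 1470 rows, 48 features,
# roughly 16% positive labels; the notebook's real data is not reproduced).
rng = np.random.default_rng(0)
X_demo = rng.random((1470, 48))
y_demo = (rng.random(1470) < 0.16).astype(int)

# stratify keeps the class ratio the same in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

With stratification the two printed class rates should be nearly identical, which makes train and test scores easier to compare.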
In [76]: X_train.head()
772 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0 0 1 1
1403 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 1 0 0 0
9 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 1 0 0 0
662 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0 0 1 1
1387 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 0 0 0 0
5 rows × 48 columns
Train Data
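In [77]: # Function that runs the requested algorithm and returns the accuracy metrics
# (relies on `from sklearn import model_selection, metrics` earlier in the notebook)
def fit_ml_algo(algo, X_train, y_train, cv):
    # One pass: fit on the full training set and score on the same data
    model = algo.fit(X_train, y_train)
    acc = round(model.score(X_train, y_train) * 100, 2)
    # Cross validation: out-of-fold predictions for every training row
    train_pred = model_selection.cross_val_predict(algo, X_train, y_train, cv=cv, n_jobs=-1)
    # Cross-validated accuracy (assumed completion; the original cell is cut off here,
    # but every call site unpacks three return values)
    acc_cv = round(metrics.accuracy_score(y_train, train_pred) * 100, 2)
    return train_pred, acc, acc_cv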
Logistic Regression
In [78]: # Logistic Regression
start_time = time.time()
train_pred_log, acc_log, acc_cv_log = fit_ml_algo(LogisticRegression(), X_train,y_train, 10)
log_time = (time.time() - start_time)
print("Accuracy: %s" % acc_log)
print("Accuracy CV 10-Fold: %s" % acc_cv_log)
print("Running Time: %s" % datetime.timedelta(seconds=log_time))
Accuracy: 89.89
Accuracy CV 10-Fold: 87.66
Running Time: 0:00:02.534987
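Accuracy alone can flatter a classifier when one class dominates, as is common for attrition labels. The sketch below uses synthetic imbalanced data (an assumption; it does not reproduce the notebook's results) to compare cross-validated predictions against a majority-class baseline and to report recall for the minority class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import cross_val_predict

# Synthetic, imbalanced stand-in for the training data (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.84, 0.16], random_state=0
)

# Out-of-fold predictions, the same idea as the cross-validation step in fit_ml_algo.
pred = cross_val_predict(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=10)

# Accuracy obtained by always predicting the majority class.
baseline = max(np.mean(y_demo), 1 - np.mean(y_demo))

print("Baseline accuracy:", round(baseline * 100, 2))
print("CV accuracy:", round(accuracy_score(y_demo, pred) * 100, 2))
print("Recall (class 1):", round(recall_score(y_demo, pred) * 100, 2))
print(confusion_matrix(y_demo, pred))
```

A model is only useful for retention work if its cross-validated accuracy clearly beats the baseline and its recall on the leaving class is acceptable.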
Support Vector Machine
Accuracy: 87.76
Accuracy CV 10-Fold: 86.1
Running Time: 0:00:00.207994
K Nearest Neighbour
In [81]: # K Nearest Neighbour
start_time = time.time()
train_pred_knn, acc_knn, acc_cv_knn = fit_ml_algo(KNeighborsClassifier(n_neighbors = 3),X_train,y_train,10)
knn_time = (time.time() - start_time)
print("Accuracy: %s" % acc_knn)
print("Accuracy CV 10-Fold: %s" % acc_cv_knn)
print("Running Time: %s" % datetime.timedelta(seconds=knn_time))
Accuracy: 89.21
Accuracy CV 10-Fold: 83.28
Running Time: 0:00:00.239998
Accuracy: 80.17
Accuracy CV 10-Fold: 77.45
Running Time: 0:00:00.064000
Perceptron
In [83]: # Perceptron
start_time = time.time()
train_pred_perceptron, acc_perceptron, acc_cv_perceptron = fit_ml_algo(Perceptron(), X_train, y_train, 10)
perceptron_time = (time.time() - start_time)
print("Accuracy: %s" % acc_perceptron)
print("Accuracy CV 10-Fold: %s" % acc_cv_perceptron)
print("Running Time: %s" % datetime.timedelta(seconds=perceptron_time))
Accuracy: 87.27
Accuracy CV 10-Fold: 82.12
Running Time: 0:00:00.073985
Accuracy: 86.78
Accuracy CV 10-Fold: 83.97
Running Time: 0:00:00.096004
Decision Tree
In [85]: # Decision Tree
start_time = time.time()
train_pred_dt, acc_dt, acc_cv_dt = fit_ml_algo(DecisionTreeClassifier(),X_train, y_train,10)
dt_time = (time.time() - start_time)
print("Accuracy: %s" % acc_dt)
print("Accuracy CV 10-Fold: %s" % acc_cv_dt)
print("Running Time: %s" % datetime.timedelta(seconds=dt_time))
Accuracy: 100.0
Accuracy CV 10-Fold: 78.62
Running Time: 0:00:00.098000
Accuracy: 93.0
Accuracy CV 10-Fold: 87.17
Running Time: 0:00:00.702999
Random Forest
In [87]: # Random Forest
start_time = time.time()
train_pred_rf, acc_rf, acc_cv_rf = fit_ml_algo(RandomForestClassifier(n_estimators=100), X_train, y_train, 10)
rf_time = (time.time() - start_time)
print("Accuracy: %s" % acc_rf)
print("Accuracy CV 10-Fold: %s" % acc_cv_rf)
print("Running Time: %s" % datetime.timedelta(seconds=rf_time))
Accuracy: 100.0
Accuracy CV 10-Fold: 85.33
Running Time: 0:00:00.789033
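Both tree-based models reach 100% training accuracy while their 10-fold accuracy is much lower (78.62 for the decision tree, 85.33 for the random forest), which is the usual signature of overfitting. The sketch below shows one common remedy, limiting tree depth, on synthetic data. `max_depth=4` is an arbitrary illustrative value, not a tuned setting from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, flip_y=0.1, random_state=0
)

results = {}
for name, depth in [("unconstrained", None), ("max_depth=4", 4)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Training accuracy: fit and score on the same data.
    train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)
    # Cross-validated accuracy: mean over 10 held-out folds.
    cv_acc = cross_val_score(tree, X_demo, y_demo, cv=10).mean()
    results[name] = (train_acc, cv_acc)
    print(f"{name}: train={train_acc:.3f}  cv={cv_acc:.3f}")
```

Comparing the two rows shows how much of the training score is memorisation; the same comparison could be run on the notebook's `X_train` and `y_train`.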
CatBoost Classifier
In [88]: # Define the categorical features for the CatBoost model
cat_features = np.where(X_train.dtypes != np.float64)[0]  # np.float was removed in NumPy 1.24
cat_features
Out[88]: array([32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47],
dtype=int64)
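The index array lists the positions of the 16 integer-typed columns (32 to 47), which are the ones passed to CatBoost as categorical. A small self-contained illustration of the same selection on a toy frame follows; the values are made up for the example. Because the `np.float` alias no longer exists in NumPy 1.24 and later, the sketch compares against `np.float64`.

```python
import numpy as np
import pandas as pd

# Toy frame: two float (one-hot style) columns and two integer columns.
toy = pd.DataFrame({
    0: [0.0, 1.0],
    1: [1.0, 0.0],
    "JobLevel": [2, 1],
    "Age_bool": [0, 1],
})

# Positions of every column whose dtype is not float64.
cat_idx = np.where(toy.dtypes != np.float64)[0]

print(cat_idx)                     # positions of the integer columns
print(list(toy.columns[cat_idx]))  # their names
```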
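In [90]: # CatBoost
catboost_model = CatBoostClassifier(iterations=1000, custom_loss=['Accuracy'], loss_function='Logloss')
# Fit the model before scoring it. The fit call is absent from this cell as extracted;
# eval_set=(X_test, y_test) is an assumption inferred from the "test:" column in the log below.
catboost_model.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test), plot=True)
# CatBoost accuracy
acc_catboost = round(catboost_model.score(X_train, y_train) * 100, 2)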
MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))
990: learn: 0.0269810 test: 0.4227576 best: 0.3451640 (189) total: 8m 16s remaining: 4.51s
991: learn: 0.0269395 test: 0.4228075 best: 0.3451640 (189) total: 8m 16s remaining: 4.01s
992: learn: 0.0269058 test: 0.4228815 best: 0.3451640 (189) total: 8m 17s remaining: 3.51s
993: learn: 0.0268638 test: 0.4230957 best: 0.3451640 (189) total: 8m 18s remaining: 3.01s
994: learn: 0.0268283 test: 0.4231735 best: 0.3451640 (189) total: 8m 18s remaining: 2.5s
995: learn: 0.0267771 test: 0.4232267 best: 0.3451640 (189) total: 8m 19s remaining: 2s
996: learn: 0.0267233 test: 0.4233080 best: 0.3451640 (189) total: 8m 19s remaining: 1.5s
997: learn: 0.0266916 test: 0.4234219 best: 0.3451640 (189) total: 8m 20s remaining: 1s
998: learn: 0.0266528 test: 0.4236120 best: 0.3451640 (189) total: 8m 20s remaining: 501ms
999: learn: 0.0266080 test: 0.4235866 best: 0.3451640 (189) total: 8m 21s remaining: 0us
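In this log, `best: 0.3451640 (189)` is the lowest evaluation loss and the iteration at which it occurred, while the evaluation loss at the final iteration is 0.4235866. The evaluation loss has therefore been rising for roughly 800 iterations even as the training loss keeps falling. The helper below is a hypothetical convenience (`parse_catboost_line` is not part of CatBoost) that pulls those numbers out of a log line to quantify the gap.

```python
import re

# Hypothetical helper, not a CatBoost API: parse one line of the training log.
LOG_PATTERN = re.compile(
    r"(?P<iter>\d+):\s+learn:\s+(?P<learn>[\d.]+)\s+test:\s+(?P<test>[\d.]+)"
    r"\s+best:\s+(?P<best>[\d.]+)\s+\((?P<best_iter>\d+)\)"
)

def parse_catboost_line(line):
    """Return iteration, learn/test loss, best test loss and best iteration."""
    m = LOG_PATTERN.search(line)
    if m is None:
        raise ValueError("line does not look like a CatBoost log line")
    return {
        "iter": int(m["iter"]),
        "learn": float(m["learn"]),
        "test": float(m["test"]),
        "best": float(m["best"]),
        "best_iter": int(m["best_iter"]),
    }

# Final line of the log shown above.
last = parse_catboost_line(
    "999: learn: 0.0266080 test: 0.4235866 best: 0.3451640 (189) "
    "total: 8m 21s remaining: 0us"
)
print(last["best_iter"], round(last["test"] - last["best"], 4))
```

CatBoost's `use_best_model` and `early_stopping_rounds` options are the usual way to act on such a gap, provided the evaluation set is a separate validation split rather than the final test set.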
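Training accuracy (%), ranked:
    Model        Accuracy
10  CatBoost     95.82
3   KNN          89.21
1   SVM          87.76
5   Perceptron   87.27

10-fold cross-validation accuracy (%), ranked:
    Model        Accuracy
10  CatBoost     86.49
1   SVM          86.10
3   KNN          83.28
5   Perceptron   82.12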
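The rankings above can be assembled from the scores each cell printed. The sketch below collects the training and 10-fold CV accuracies reported in this section (the SVM and CatBoost values come from the ranking tables) and sorts by CV accuracy. The column names and the `gap` column are additions for illustration.

```python
import pandas as pd

# Scores as printed in this section (percent).
scores = pd.DataFrame({
    "Model": ["Logistic Regression", "SVM", "KNN", "Perceptron",
              "Decision Tree", "Random Forest", "CatBoost"],
    "train_acc": [89.89, 87.76, 89.21, 87.27, 100.0, 100.0, 95.82],
    "cv_acc": [87.66, 86.10, 83.28, 82.12, 78.62, 85.33, 86.49],
})

# Gap between training and cross-validated accuracy: a rough overfitting signal.
scores["gap"] = (scores["train_acc"] - scores["cv_acc"]).round(2)

# Rank by the cross-validated score rather than the training score.
ranked = scores.sort_values("cv_acc", ascending=False).reset_index(drop=True)
print(ranked.to_string(index=False))
```

On these numbers, logistic regression has the highest cross-validated accuracy and the decision tree shows the largest gap between training and cross-validated accuracy.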
Out[98]: Attrition
71 0
464 0
294 0
1230 0
1181 0
Accuracy: 87.76
[Figure: "Employee Attrition Prediction" web form, with input fields such as Education (1-5) and a Predict button]
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount
41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1
49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1
37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1
33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1
27 No Travel_Rarely 591 Research & Development 2 1 Medical 1
32 No Travel_Frequently 1005 Research & Development 2 2 Life Sciences 1
59 No Travel_Rarely 1324 Research & Development 3 3 Medical 1
30 No Travel_Rarely 1358 Research & Development 24 1 Life Sciences 1
38 No Travel_Frequently 216 Research & Development 23 3 Life Sciences 1
36 No Travel_Rarely 1299 Research & Development 27 3 Medical 1
35 No Travel_Rarely 809 Research & Development 16 3 Medical 1
29 No Travel_Rarely 153 Research & Development 15 2 Life Sciences 1
31 No Travel_Rarely 670 Research & Development 26 1 Life Sciences 1
34 No Travel_Rarely 1346 Research & Development 19 2 Medical 1
28 Yes Travel_Rarely 103 Research & Development 24 3 Life Sciences 1
29 No Travel_Rarely 1389 Research & Development 21 4 Life Sciences 1
32 No Travel_Rarely 334 Research & Development 5 2 Life Sciences 1
22 No Non-Travel 1123 Research & Development 16 2 Medical 1
53 No Travel_Rarely 1219 Sales 2 4 Life Sciences 1
38 No Travel_Rarely 371 Research & Development 2 3 Life Sciences 1
24 No Non-Travel 673 Research & Development 11 2 Other 1
36 Yes Travel_Rarely 1218 Sales 9 4 Life Sciences 1
34 No Travel_Rarely 419 Research & Development 7 4 Life Sciences 1
21 No Travel_Rarely 391 Research & Development 15 2 Life Sciences 1
34 Yes Travel_Rarely 699 Research & Development 6 1 Medical 1
53 No Travel_Rarely 1282 Research & Development 5 3 Other 1
32 Yes Travel_Frequently 1125 Research & Development 16 1 Life Sciences 1
42 No Travel_Rarely 691 Sales 8 4 Marketing 1
44 No Travel_Rarely 477 Research & Development 7 4 Medical 1
46 No Travel_Rarely 705 Sales 2 4 Marketing 1
33 No Travel_Rarely 924 Research & Development 2 3 Medical 1
44 No Travel_Rarely 1459 Research & Development 10 4 Other 1
30 No Travel_Rarely 125 Research & Development 9 2 Medical 1
39 Yes Travel_Rarely 895 Sales 5 3 Technical Degree 1
24 Yes Travel_Rarely 813 Research & Development 1 3 Medical 1
43 No Travel_Rarely 1273 Research & Development 2 2 Medical 1
50 Yes Travel_Rarely 869 Sales 3 2 Marketing 1
35 No Travel_Rarely 890 Sales 2 3 Marketing 1
36 No Travel_Rarely 852 Research & Development 5 4 Life Sciences 1
33 No Travel_Frequently 1141 Sales 1 3 Life Sciences 1
35 No Travel_Rarely 464 Research & Development 4 2 Other 1
27 No Travel_Rarely 1240 Research & Development 2 4 Life Sciences 1
26 Yes Travel_Rarely 1357 Research & Development 25 3 Life Sciences 1
27 No Travel_Frequently 994 Sales 8 3 Life Sciences 1
30 No Travel_Frequently 721 Research & Development 1 2 Medical 1
41 Yes Travel_Rarely 1360 Research & Development 12 3 Technical Degree 1
34 No Non-Travel 1065 Sales 23 4 Marketing 1
37 No Travel_Rarely 408 Research & Development 19 2 Life Sciences 1
46 No Travel_Frequently 1211 Sales 5 4 Marketing 1
35 No Travel_Rarely 1229 Research & Development 8 1 Life Sciences 1
48 Yes Travel_Rarely 626 Research & Development 1 2 Life Sciences 1
28 Yes Travel_Rarely 1434 Research & Development 5 4 Technical Degree 1
44 No Travel_Rarely 1488 Sales 1 5 Marketing 1
35 No Non-Travel 1097 Research & Development 11 2 Medical 1
26 No Travel_Rarely 1443 Sales 23 3 Marketing 1
33 No Travel_Frequently 515 Research & Development 1 2 Life Sciences 1
35 No Travel_Frequently 853 Sales 18 5 Life Sciences 1
35 No Travel_Rarely 1142 Research & Development 23 4 Medical 1
31 No Travel_Rarely 655 Research & Development 7 4 Life Sciences 1
37 No Travel_Rarely 1115 Research & Development 1 4 Life Sciences 1
32 No Travel_Rarely 427 Research & Development 1 3 Medical 1
38 No Travel_Frequently 653 Research & Development 29 5 Life Sciences 1
50 No Travel_Rarely 989 Research & Development 7 2 Medical 1
59 No Travel_Rarely 1435 Sales 25 3 Life Sciences 1
36 No Travel_Rarely 1223 Research & Development 8 3 Technical Degree 1
55 No Travel_Rarely 836 Research & Development 8 3 Medical 1
36 No Travel_Frequently 1195 Research & Development 11 3 Life Sciences 1