0% found this document useful (0 votes)

49 views28 pages

Salary Prediction

Uploaded by

skat3r63

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

49 views28 pages

Salary Prediction

Uploaded by

skat3r63

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 28

8/9/23, 10:20 AM Salary Prediction

Salary Prediction based on Country and

Race
The aim of this project to predict the salary of individuals from varied countires and races
based on their demographics such as occupation, age, gender, experience, education,
etc. The dataset is taken from Kaggle. The dataset has 32561 rows and 15 columns. The
dataset has 8 independent variables and 1 target variable i.e Salary

Aboout the dataset

The dataset consists of a comprehensive collection of salary and demographic
information with additional details on years of experience. It offers a valuable resource
for studying the relationship between income and various socio-demographic factors.
The demographic attributes include age, gender, education, country, and race, providing
a diverse range of variables for analysis. Researchers can explore patterns and trends in
income distribution across different demographic categories, allowing for insights into
potential disparities or variations in earning potential. Moreover, the dataset
incorporates the crucial dimension of years of experience, enabling investigations into
the impact of professional tenure on salary levels. This aspect adds a dynamic aspect to
the analysis, enabling researchers to examine how income varies based on both
demographic characteristics and accumulated work experience. The dataset presents a
rich opportunity for conducting comprehensive studies on income diversity and
understanding the multifaceted factors influencing earning potential in today's
workforce.

Data Dictionary
Column Description

Unnamed: 0 Index

Age Age of the employee

Education Level Education level of the employee

Job Title Job title of the employee

Years of Experience Years of experience of the employee

Salary Salary of the employee

Country Country of the employee

Race Race of the employee

In [ ]: #importing the libraries

import numpy as np
import pandas as pd

file:///E:/Repo Page/Salary Prediction.html 1/28

8/9/23, 10:20 AM Salary Prediction

import matplotlib.pyplot as plt

import seaborn as sns

In [ ]: #loading the data

df = pd.read_csv('Salary_Data_Based_country_and_race.csv')
df.head()

Out[ ]: Unnamed: Education Years of

Age Gender Job Title Salary Country Ra
0 Level Experience

Software
0 0 32.0 Male Bachelor's 5.0 90000.0 UK Wh
Engineer

Data
1 1 28.0 Female Master's 3.0 65000.0 USA Hispa
Analyst

Senior
2 2 45.0 Male PhD 15.0 150000.0 Canada Wh
Manager

Sales
3 3 36.0 Female Bachelor's 7.0 60000.0 USA Hispa
Associate

4 4 52.0 Male Master's Director 20.0 200000.0 USA As

Data Preprocessing
In [ ]: #checking the shape of the data
df.shape

Out[ ]: (6704, 9)

In [ ]: #checking for null/missing values

df.isnull().sum()

Out[ ]: Unnamed: 0 0
Age 2
Gender 2
Education Level 3
Job Title 2
Years of Experience 3
Salary 5
Country 0
Race 0
dtype: int64

Since the number of rows with null/missing value is very less as compared to the total
number of rows, I will be dropping these rows.

In [ ]: df.dropna(axis=0, inplace=True)

In [ ]: #checking for null values

df.isnull().sum()

file:///E:/Repo Page/Salary Prediction.html 2/28

8/9/23, 10:20 AM Salary Prediction

Out[ ]: Unnamed: 0 0
Age 0
Gender 0
Education Level 0
Job Title 0
Years of Experience 0
Salary 0
Country 0
Race 0
dtype: int64

Dropping Unnamed Column beacuse it is just an index column

In [ ]: #dropping column
df.drop(columns = 'Unnamed: 0',axis=1,inplace=True)

Checking data type of each column

In [ ]: df.dtypes

Out[ ]: Age float64

Gender object
Education Level object
Job Title object
Years of Experience float64
Salary float64
Country object
Race object
dtype: object

Checking for unique values in each column

In [ ]: #unique values in each column

df.nunique()

Out[ ]: Age 41
Gender 3
Education Level 7
Job Title 191
Years of Experience 37
Salary 444
Country 5
Race 10
dtype: int64

The job title column has 191 different values. It will be very difficult to analyze so many
job titles. So, I will group the job titles under similar job domains.

Grouping Job Titles

In [ ]: df['Job Title'].unique()

file:///E:/Repo Page/Salary Prediction.html 3/28

8/9/23, 10:20 AM Salary Prediction

Out[ ]: array(['Software Engineer', 'Data Analyst', 'Senior Manager',

'Sales Associate', 'Director', 'Marketing Analyst',
'Product Manager', 'Sales Manager', 'Marketing Coordinator',
'Senior Scientist', 'Software Developer', 'HR Manager',
'Financial Analyst', 'Project Manager', 'Customer Service Rep',
'Operations Manager', 'Marketing Manager', 'Senior Engineer',
'Data Entry Clerk', 'Sales Director', 'Business Analyst',
'VP of Operations', 'IT Support', 'Recruiter', 'Financial Manager',
'Social Media Specialist', 'Software Manager', 'Junior Developer',
'Senior Consultant', 'Product Designer', 'CEO', 'Accountant',
'Data Scientist', 'Marketing Specialist', 'Technical Writer',
'HR Generalist', 'Project Engineer', 'Customer Success Rep',
'Sales Executive', 'UX Designer', 'Operations Director',
'Network Engineer', 'Administrative Assistant',
'Strategy Consultant', 'Copywriter', 'Account Manager',
'Director of Marketing', 'Help Desk Analyst',
'Customer Service Manager', 'Business Intelligence Analyst',
'Event Coordinator', 'VP of Finance', 'Graphic Designer',
'UX Researcher', 'Social Media Manager', 'Director of Operations',
'Senior Data Scientist', 'Junior Accountant',
'Digital Marketing Manager', 'IT Manager',
'Customer Service Representative', 'Business Development Manager',
'Senior Financial Analyst', 'Web Developer', 'Research Director',
'Technical Support Specialist', 'Creative Director',
'Senior Software Engineer', 'Human Resources Director',
'Content Marketing Manager', 'Technical Recruiter',
'Sales Representative', 'Chief Technology Officer',
'Junior Designer', 'Financial Advisor', 'Junior Account Manager',
'Senior Project Manager', 'Principal Scientist',
'Supply Chain Manager', 'Senior Marketing Manager',
'Training Specialist', 'Research Scientist',
'Junior Software Developer', 'Public Relations Manager',
'Operations Analyst', 'Product Marketing Manager',
'Senior HR Manager', 'Junior Web Developer',
'Senior Project Coordinator', 'Chief Data Officer',
'Digital Content Producer', 'IT Support Specialist',
'Senior Marketing Analyst', 'Customer Success Manager',
'Senior Graphic Designer', 'Software Project Manager',
'Supply Chain Analyst', 'Senior Business Analyst',
'Junior Marketing Analyst', 'Office Manager', 'Principal Engineer',
'Junior HR Generalist', 'Senior Product Manager',
'Junior Operations Analyst', 'Senior HR Generalist',
'Sales Operations Manager', 'Senior Software Developer',
'Junior Web Designer', 'Senior Training Specialist',
'Senior Research Scientist', 'Junior Sales Representative',
'Junior Marketing Manager', 'Junior Data Analyst',
'Senior Product Marketing Manager', 'Junior Business Analyst',
'Senior Sales Manager', 'Junior Marketing Specialist',
'Junior Project Manager', 'Senior Accountant', 'Director of Sales',
'Junior Recruiter', 'Senior Business Development Manager',
'Senior Product Designer', 'Junior Customer Support Specialist',
'Senior IT Support Specialist', 'Junior Financial Analyst',
'Senior Operations Manager', 'Director of Human Resources',
'Junior Software Engineer', 'Senior Sales Representative',
'Director of Product Management', 'Junior Copywriter',
'Senior Marketing Coordinator', 'Senior Human Resources Manager',
'Junior Business Development Associate', 'Senior Account Manager',
'Senior Researcher', 'Junior HR Coordinator',
'Director of Finance', 'Junior Marketing Coordinator',
'Junior Data Scientist', 'Senior Operations Analyst',

file:///E:/Repo Page/Salary Prediction.html 4/28

8/9/23, 10:20 AM Salary Prediction

'Senior Human Resources Coordinator', 'Senior UX Designer',

'Junior Product Manager', 'Senior Marketing Specialist',
'Senior IT Project Manager', 'Senior Quality Assurance Analyst',
'Director of Sales and Marketing', 'Senior Account Executive',
'Director of Business Development', 'Junior Social Media Manager',
'Senior Human Resources Specialist', 'Senior Data Analyst',
'Director of Human Capital', 'Junior Advertising Coordinator',
'Junior UX Designer', 'Senior Marketing Director',
'Senior IT Consultant', 'Senior Financial Advisor',
'Junior Business Operations Analyst',
'Junior Social Media Specialist',
'Senior Product Development Manager', 'Junior Operations Manager',
'Senior Software Architect', 'Junior Research Scientist',
'Senior Financial Manager', 'Senior HR Specialist',
'Senior Data Engineer', 'Junior Operations Coordinator',
'Director of HR', 'Senior Operations Coordinator',
'Junior Financial Advisor', 'Director of Engineering',
'Software Engineer Manager', 'Back end Developer',
'Senior Project Engineer', 'Full Stack Engineer',
'Front end Developer', 'Front End Developer',
'Director of Data Science', 'Human Resources Coordinator',
'Junior Sales Associate', 'Human Resources Manager',
'Juniour HR Generalist', 'Juniour HR Coordinator',
'Digital Marketing Specialist', 'Receptionist',
'Marketing Director', 'Social Media Man', 'Delivery Driver'],
dtype=object)

In [ ]: def categorize_job_title(job_title):
job_title = str(job_title).lower()
if 'software' in job_title or 'developer' in job_title:
return 'Software/Developer'
elif 'data' in job_title or 'analyst' in job_title or 'scientist' in job_tit
return 'Data Analyst/Scientist'
elif 'manager' in job_title or 'director' in job_title or 'vp' in job_title:
return 'Manager/Director/VP'
elif 'sales' in job_title or 'representative' in job_title:
return 'Sales'
elif 'marketing' in job_title or 'social media' in job_title:
return 'Marketing/Social Media'
elif 'product' in job_title or 'designer' in job_title:
return 'Product/Designer'
elif 'hr' in job_title or 'human resources' in job_title:
return 'HR/Human Resources'
elif 'financial' in job_title or 'accountant' in job_title:
return 'Financial/Accountant'
elif 'project manager' in job_title:
return 'Project Manager'
elif 'it' in job_title or 'support' in job_title:
return 'IT/Technical Support'
elif 'operations' in job_title or 'supply chain' in job_title:
return 'Operations/Supply Chain'
elif 'customer service' in job_title or 'receptionist' in job_title:
return 'Customer Service/Receptionist'
else:
return 'Other'

df['Job Title'] = df['Job Title'].apply(categorize_job_title)

In [ ]: df['Education Level'].unique()

file:///E:/Repo Page/Salary Prediction.html 5/28

8/9/23, 10:20 AM Salary Prediction

Out[ ]: array(["Bachelor's", "Master's", 'PhD', "Bachelor's Degree",

"Master's Degree", 'High School', 'phD'], dtype=object)

In the dataset the education level is represented in two different ways : Bachelor and
Bachelor degree, which means same. So I will be grouping it with Bachelor

Grouping Education Level

In [ ]: def group_education(Educaton):

Educaton = str(Educaton).lower()
if 'high school' in Educaton:
return 'High School'
elif 'bachelor\'s' in Educaton:
return 'Bachelors'
elif 'master\'s' in Educaton:
return 'Masters'
elif 'phd' in Educaton:
return 'PhD'

df['Education Level'] = df['Education Level'].apply(group_education)

Descriptive Statistics

In [ ]: #descriptive statistics
df.describe()

Out[ ]: Age Years of Experience Salary

count 6698.000000 6698.000000 6698.000000

mean 33.623022 8.095178 115329.253061

std 7.615784 6.060291 52789.792507

min 21.000000 0.000000 350.000000

25% 28.000000 3.000000 70000.000000

50% 32.000000 7.000000 115000.000000

75% 38.000000 12.000000 160000.000000

max 62.000000 34.000000 250000.000000

In [ ]: df.head()

file:///E:/Repo Page/Salary Prediction.html 6/28

8/9/23, 10:20 AM Salary Prediction

Out[ ]: Education Years of

Age Gender Job Title Salary Country Rac
Level Experience

0 32.0 Male Bachelors Software/Developer 5.0 90000.0 UK Whi

Data
1 28.0 Female Masters 3.0 65000.0 USA Hispan
Analyst/Scientist

2 45.0 Male PhD Manager/Director/VP 15.0 150000.0 Canada Whi

3 36.0 Female Bachelors Sales 7.0 60000.0 USA Hispan

4 52.0 Male Masters Manager/Director/VP 20.0 200000.0 USA Asia

Exploratory Data Analysis

In the exploratory data analysis, I will be looking at the data and try to understand the
data. I will begin by looking at the distribution of data across the datset, followed by
visualizing the data to understand the relationship between the features and the target
variable.

Pie chart for Gender

In [ ]: #pie chart
plt.figure(figsize=(10,6))
plt.pie(df['Gender'].value_counts(), labels=['Male','Female', 'Other'], autopct=
plt.title('Gender Distribution')
plt.show()

file:///E:/Repo Page/Salary Prediction.html 7/28

8/9/23, 10:20 AM Salary Prediction

The pie chart shows that majority of the employees are male with 54.8 % on the dataset,
followed by females with 45% and 0.2% employees belong to other gender.

Age Distribution
In [ ]: sns.histplot(data=df, x='Age', bins=20, kde=True)
plt.title('Age Distribution')
plt.show()

file:///E:/Repo Page/Salary Prediction.html 8/28

8/9/23, 10:20 AM Salary Prediction

Majority of the employees are in the range of 25 - 35 years of age, which means majority
of the employees are young and energetic. There is only minimal number of old
employees in the dataset having age more than 55 years.

Education Level
In [ ]: sns.countplot(x = 'Education Level', data = df, palette='Set1')
plt.xticks(rotation=90)

Out[ ]: (array([0, 1, 2, 3]),

[Text(0, 0, 'Bachelors'),
Text(1, 0, 'Masters'),
Text(2, 0, 'PhD'),
Text(3, 0, 'High School')])

file:///E:/Repo Page/Salary Prediction.html 9/28

8/9/23, 10:20 AM Salary Prediction

Most of the employees have a Bachelor's degree followed by Master's degree and
Doctoral degree. The least number of employees have a High School education. From
the graph it is clear that most of the employees started working after graduation, few of
them started working after post graduation and very few of them have gone for
doctorate. The least number of employees have started working after high school
education.

Job Title
In [ ]: sns.countplot(x='Job Title', data = df)
plt.xticks(rotation=90)

Out[ ]: (array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]),

[Text(0, 0, 'Software/Developer'),
Text(1, 0, 'Data Analyst/Scientist'),
Text(2, 0, 'Manager/Director/VP'),
Text(3, 0, 'Sales'),
Text(4, 0, 'Marketing/Social Media'),
Text(5, 0, 'Customer Service/Receptionist'),
Text(6, 0, 'Other'),
Text(7, 0, 'IT/Technical Support'),
Text(8, 0, 'Product/Designer'),
Text(9, 0, 'Financial/Accountant'),
Text(10, 0, 'HR/Human Resources'),
Text(11, 0, 'Operations/Supply Chain')])

file:///E:/Repo Page/Salary Prediction.html 10/28

8/9/23, 10:20 AM Salary Prediction

This graph helps us to breakdown the data of job title in a simpler form. From the graph,
it is clear that majority of the employees have job titles - Software Developer, Data
Analyst/Scientist or Manager/Director/Vp. Few amount of employees have job titles such
as sales, marketing/social media, HR, Product Designer and Customer Service. Very few
of the eomployees work as a Financial/accountant or operation/supply management.

From this I build a hypothesis that the job titles such as Software Developer, Data
Analyst/Scientist and Manager/Director are in more demand as compared to other job
titles. It also means that job titles like Financial/accountant or operation/supply
management and Customer Service are in less demand and paid comparatively less.

Years of Experience
In [ ]: sns.histplot(x = 'Years of Experience', data = df,kde=True)

Out[ ]: <Axes: xlabel='Years of Experience', ylabel='Count'>

file:///E:/Repo Page/Salary Prediction.html 11/28

8/9/23, 10:20 AM Salary Prediction

Most of the employees in the dataset havr experience of 0-7 years in the respective
domains in which particularly majority of them have experience between less than 5
years. Moreover the number of employees in the dataset decreases with increasing
number of years of experience.

Country
In [ ]: sns.countplot(x='Country', data=df)
plt.xticks(rotation=90)

Out[ ]: (array([0, 1, 2, 3, 4]),

[Text(0, 0, 'UK'),
Text(1, 0, 'USA'),
Text(2, 0, 'Canada'),
Text(3, 0, 'China'),
Text(4, 0, 'Australia')])

file:///E:/Repo Page/Salary Prediction.html 12/28

8/9/23, 10:20 AM Salary Prediction

The number of employees from the above 5 countries is nearly same, with a little more in
USA.

Racial Distribution
In [ ]: sns.countplot(x='Race', data=df)
plt.xticks(rotation=90)

Out[ ]: (array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),

[Text(0, 0, 'White'),
Text(1, 0, 'Hispanic'),
Text(2, 0, 'Asian'),
Text(3, 0, 'Korean'),
Text(4, 0, 'Chinese'),
Text(5, 0, 'Australian'),
Text(6, 0, 'Welsh'),
Text(7, 0, 'African American'),
Text(8, 0, 'Mixed'),
Text(9, 0, 'Black')])

file:///E:/Repo Page/Salary Prediction.html 13/28

8/9/23, 10:20 AM Salary Prediction

This graph help us to know about the racial distribution in the dataset. From the graph, it
is clear that most of the employees are either White or Asian, followed by Korean,
Chinese, Australian and Black. Number of employees from Welsh, African American,
Mixed and Hispanic race are less as compared to other groups.

From all the above plots and graphs, we can a understanding about the data we are
dealing with, its distribution and quantity as well. Now I am gonna explore the realtion of
these independent variables with the target Variable i.e. Salary.

Age and Salary

In [ ]: sns.scatterplot(x = 'Age', y='Salary', data=df)
plt.title('Age vs Salary')

Out[ ]: Text(0.5, 1.0, 'Age vs Salary')

file:///E:/Repo Page/Salary Prediction.html 14/28

8/9/23, 10:20 AM Salary Prediction

In this scatter plot we see a trend that the salary of the person increases with increse in
the age, which is obvious because of promotion and apprisals. However upon closer
observation we can find that similar age have multiple salaries, which means there are
other factors which decides the salary.

Gender and Salary

In [ ]: fig, ax = plt.subplots(1,2, figsize = (15, 5))
sns.boxplot(x = 'Gender', y='Salary', data = df, ax =ax[0]).set_title('Gender vs
sns.violinplot(x = 'Gender', y='Salary', data = df, ax =ax[1]).set_title('Gender

Out[ ]: Text(0.5, 1.0, 'Gender vs Salary')

The boxplot and violinplot describes the salary distribution among the three genders. In
the boxplot the employees from Other gender has quite high salary as compared to
Makes and Females. The other gender employees have a median salary above 150000,

file:///E:/Repo Page/Salary Prediction.html 15/28

8/9/23, 10:20 AM Salary Prediction

followed by males with median salary near 107500 and females with median salary near
100000. The voilin plot visualizes the distribution of salary with respect to the gender,
where most of the Other gender employees have salary above 150000. In makes this
distribution is concentrated between 50000 and 10000 as well as near 200000. In case of
females, there salary distribution is quite spread as compared to other genders with
most near 50000.

Education Level and Salary

In [ ]: fig,ax = plt.subplots(1,2,figsize=(15,6))
sns.boxplot(x = 'Education Level', y = 'Salary', data = df, ax=ax[0]).set_title(
sns.violinplot(x = 'Education Level', y = 'Salary', data = df, ax=ax[1]).set_tit

Out[ ]: Text(0.5, 1.0, 'Education Level vs Salary')

The boxplot and violinplot shows the distribution of salary based on the employees
education level. The median salary for the Phd holders is highest followed by Masters
and bachelors degreee holders, with employees with no degree having the lowest
median salary. In the violinplot the phd scholars have distribution near 200000, whereas
Masters degree holders have a very sleak distribution where the salary distribution is
spread from 100k to 150k, The Bachelors degree holders have a salary distribution near
50000 whereas the employees with no degree have a salary distribution near 40k-45k.

From these graph, I assume that the employees with higher education level have higher
salary than the employees with lower education level.

Job Title and Salary

In [ ]: sns.barplot(x = 'Job Title', y = 'Salary', data = df, palette = 'Set2')
plt.xticks(rotation = 90)

file:///E:/Repo Page/Salary Prediction.html 16/28

8/9/23, 10:20 AM Salary Prediction

Out[ ]: (array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]),

This graph falsifies my previous hypothesis regarding the demand and paywith respect
to job titles. In this graph, 'Other' category job titles have higher salary than those titles
which assumed to be in high demand and pay. In contrast to previous Job title graph,
this graph shows that there is no relation between the job title distribution and salary.
The job titles which gave high salary are found to be less in number.

file:///E:/Repo Page/Salary Prediction.html 17/28

8/9/23, 10:20 AM Salary Prediction

However the hypothesis is true about the Job titles such as Software Developer, Data
analyst/scuentust and Manager/Director/VP. These job titles are found to be in high
demand and pay. But in contrast to that the job titles such as Operation/Supply chain,
HR, Financial/Accountant and Marketing/Social Media are found have much more salary
as assumed.

Experience and Salary

In [ ]: sns.scatterplot(x= 'Years of Experience', y = 'Salary', data = df).set_title('Y

Out[ ]: Text(0.5, 1.0, 'Years of Experience vs Salary')

From this scaaterplot, it is clear that on the whole, the salary of the employees is
increasing with the years of experience. However, on closer look we can see that similar
experience have different salaries. This is because the salary is also dependent on other
factors like job title, age, gender education level as discussed earlier.

Country and Salary

In [ ]: fig,ax = plt.subplots(1,2,figsize=(15,6))
sns.boxplot(x = 'Country', y = 'Salary', data = df, ax=ax[0])
sns.violinplot(x = 'Country', y = 'Salary', data = df, ax=ax[1])

Out[ ]: <Axes: xlabel='Country', ylabel='Salary'>

file:///E:/Repo Page/Salary Prediction.html 18/28

8/9/23, 10:20 AM Salary Prediction

Both the boxplot and violinplot shows very similar insight about the salary across all the
countiries even in the violinplot distribution. However, there is very small variation in
median salary in USA, which is slighlty less as compared to other countries.

Since, the we cannot get much information about the salary with respect to the
countries. So, I will plot the job title vs salary graph for each country, so that we can get a
overview of job title vs salary for each country.

In [ ]: fig,ax = plt.subplots(3,2,figsize=(20,20))
plt.subplots_adjust(hspace=0.5)
sns.boxplot(x = 'Job Title', y = 'Salary', data = df[df['Country'] == 'USA'], ax
ax[0,0].tick_params(axis='x', rotation=90)
sns.boxplot(x = 'Job Title', y = 'Salary', data = df[df['Country'] == 'UK'], ax
ax[0,1].tick_params(axis='x', rotation=90)
sns.boxplot(x = 'Job Title', y = 'Salary', data = df[df['Country'] == 'Canada'],
ax[1,0].tick_params(axis='x', rotation=90)
sns.boxplot(x = 'Job Title', y = 'Salary', data = df[df['Country'] == 'Australia
ax[1,1].tick_params(axis='x', rotation=90)
sns.boxplot(x = 'Job Title', y = 'Salary', data = df[df['Country'] == 'China'],
ax[2,0].tick_params(axis='x', rotation=90)
sns.boxplot(x = 'Job Title', y = 'Salary', data = df, ax = ax[2,1]).set_title('A
ax[2,1].tick_params(axis='x', rotation=90)

file:///E:/Repo Page/Salary Prediction.html 19/28

8/9/23, 10:20 AM Salary Prediction

After observing all these plots, I conclude that the Job Titles such as Softwarre Developer,
Manager/Director/VP and Data Analyst/Scientist hare in high demand as well as receive
much higer salary than other job titles, excluding the Job Titles that come under 'Other'
category. The job titles such as Operation/Supply Chain, Customer Service/Receptionist,
Product Designer and sales are in low demand and have low salary.

Race and Salary

In [ ]: fig,ax = plt.subplots(1,2,figsize=(15,6))
sns.boxplot(x = 'Race', y = 'Salary', data = df, ax = ax[0])
ax[0].tick_params(axis='x', rotation=90)
sns.violinplot(x = 'Race', y ='Salary', data = df, ax = ax[1])
ax[1].tick_params(axis='x', rotation=90)

file:///E:/Repo Page/Salary Prediction.html 20/28

8/9/23, 10:20 AM Salary Prediction

The employees from the races - Australian, Mixed, Blacks and White have the highest
median salary, followed by Asian, Korean and Chinese with lowest median salary in
employees from hispanic race. Looking at the violinplot the salary distribution is more
concentrated after 150k in white, australian, black and mixed race. Whereas the hispanic
has more concentration near 75k

Data Preprocessing 2

Label encoding to categorical features

In [ ]: from sklearn.preprocessing import LabelEncoder
features = ['Gender','Country','Education Level','Job Title', 'Race']
le = LabelEncoder()
for feature in features:
le.fit(df[feature].unique())
df[feature] = le.transform(df[feature])
print(feature, df[feature].unique())

Gender [1 0 2]
Country [3 4 1 2 0]
Education Level [0 2 3 1]
Job Title [11 1 5 10 6 0 8 4 9 2 3 7]
Race [9 5 1 6 4 2 8 0 7 3]

Normalization
In [ ]: #normalizing the continuous variables
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Age', 'Years of Experience', 'Salary']] = scaler.fit_transform(df[['Age', '

In [ ]: df.head()

file:///E:/Repo Page/Salary Prediction.html 21/28

8/9/23, 10:20 AM Salary Prediction

Out[ ]: Education Job Years of

Age Gender Salary Country Race
Level Title Experience

0 -0.213129 1 0 11 -0.510769 -0.479849 3 9

1 -0.738393 0 2 1 -0.840811 -0.953461 4 5

2 1.493980 1 3 5 1.139440 0.656819 1 9

3 0.312135 0 0 10 -0.180727 -1.048183 4 5

4 2.413192 1 2 5 1.964544 1.604042 4 1

Coorelation Matrix Heatmap

In [ ]: #coorelation heatmap
plt.figure(figsize=(12,8))
sns.heatmap(df.corr(),annot=True, cmap='coolwarm')

Out[ ]: <Axes: >

In this coorelation matrix, there are three major coorealtions.

Salary and Age

Salary and Years of Experience
Years of Experience and Age

The coorelation salary with age and years of experience is already explored in the above
plots. The coorelation between the years of experience and age is obvious as the person

file:///E:/Repo Page/Salary Prediction.html 22/28

8/9/23, 10:20 AM Salary Prediction

ages the experience will be more.

Train Test Split

In [ ]: from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('Salary', axis=1), d

Salary Prediction
I will be using the following models:

Decision Tree Regressor

Random Forest Regressor

Decision Tree Regressor

In [ ]: from sklearn.tree import DecisionTreeRegressor

#createing the decision tree gressor object

dtree = DecisionTreeRegressor()

Hypertuning the model

In [ ]: from sklearn.model_selection import GridSearchCV

#defining the parameters for the grid search

parameters = {'max_depth' :[2,4,6,8,10],
'min_samples_split' :[2,4,6,8],
'min_samples_leaf' :[2,4,6,8],
'max_features' :['auto','sqrt','log2'],
'random_state' :[0,42]}
#creating the grid search object
grid_search = GridSearchCV(dtree,parameters,cv=5,scoring='neg_mean_squared_error

#fit the grid search object to the training data

grid_search.fit(X_train,y_train)

#print the best parameters

print(grid_search.best_params_)

{'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_spl

it': 8, 'random_state': 42}

Building the model on best parameters

In [ ]: dtree = DecisionTreeRegressor(max_depth = 10, max_features = 'auto', min_samples

dtree

file:///E:/Repo Page/Salary Prediction.html 23/28

8/9/23, 10:20 AM Salary Prediction

Out[ ]: ▾ DecisionTreeRegressor

DecisionTreeRegressor(max_depth=10, max_features='auto', min_samples_le

af=2,
min_samples_split=8, random_state=42)

In [ ]: #fitting the training data

dtree.fit(X_train,y_train)

Out[ ]: ▾ DecisionTreeRegressor

DecisionTreeRegressor(max_depth=10, max_features='auto', min_samples_le

af=2,
min_samples_split=8, random_state=42)

In [ ]: #training accuracy
dtree.score(X_train, y_train)

Out[ ]: 0.9656459784687974

In [ ]: #predicting the salary of an employee

d_pred = dtree.predict(X_test)

Evaluating the Decision Tree Regressor Model

In [ ]: dft = pd.DataFrame({'Actual': y_test, 'Predicted': d_pred})
dft.reset_index(drop=True, inplace=True)
dft.head(10)

Out[ ]: Actual Predicted

0 0.656819 0.678470

1 -0.745659 -0.688434

2 -0.290405 -0.290405

3 -1.048183 -1.036343

4 -0.669294 -0.610093

5 1.414598 1.494747

6 -0.820850 -0.715794

7 -1.142906 -1.122777

8 1.509320 1.554189

9 0.277930 0.287811

In [ ]: ax = sns.distplot(dft['Actual'], color = 'blue', hist = False, kde = True, kde_k

sns.distplot( dft['Predicted'], color = 'red', ax=ax, hist = False, kde = True,

Out[ ]: <Axes: xlabel='Predicted', ylabel='Density'>

file:///E:/Repo Page/Salary Prediction.html 24/28

8/9/23, 10:20 AM Salary Prediction

The blue shows the distribution count for actual values and the red line shows the
distribution count for predicted values. The predicted values are close to the actual
values and ther curve coincides with the actual values curve. This shows that the model is
a good fit.

In [ ]: from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

print("R2 Score: ", r2_score(y_test, d_pred))
print("Mean Squared Error: ", mean_squared_error(y_test, d_pred))
print("Mean Absolute Error: ", mean_absolute_error(y_test, d_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, d_pred)))

R2 Score: 0.9323013355107719
Mean Squared Error: 0.06928069008068977
Mean Absolute Error: 0.13812719621413622
RMSE: 0.2632122529075912

Random Forest Regressor

In [ ]: from sklearn.ensemble import RandomForestRegressor
#creating random forest regressor object
rfg = RandomForestRegressor()

In [ ]: #trainig the model

rfg.fit(X_train, y_train)

Out[ ]: ▾ RandomForestRegressor

RandomForestRegressor()

file:///E:/Repo Page/Salary Prediction.html 25/28

8/9/23, 10:20 AM Salary Prediction

In [ ]: #training accuracy
rfg.score(X_train, y_train)

Out[ ]: 0.9881489086015691

In [ ]: #predicitng salary of the employee

r_pred = rfg.predict(X_test)

Evaluating Random Forest Regressor Model

In [ ]: dfr = pd.DataFrame({'Actual': y_test, 'Predicted': r_pred})
dfr.reset_index(drop=True, inplace=True)
dfr.head(10)

Out[ ]: Actual Predicted

0 0.656819 0.648206

1 -0.745659 -0.716941

2 -0.290405 -0.288510

3 -1.048183 -1.049699

4 -0.669294 -0.637562

5 1.414598 1.501506

6 -0.820850 -0.813651

7 -1.142906 -1.113062

8 1.509320 1.541334

9 0.277930 0.306604

In [ ]: ax = sns.distplot(dft['Actual'], color = 'blue', hist = False, kde = True, kde_k

sns.distplot( dft['Predicted'], color = 'red', ax=ax, hist = False, kde = True,

Out[ ]: <Axes: xlabel='Predicted', ylabel='Density'>

file:///E:/Repo Page/Salary Prediction.html 26/28

8/9/23, 10:20 AM Salary Prediction

In [ ]: print("R2 Score: ", r2_score(y_test, r_pred))

print("Mean Squared Error: ", mean_squared_error(y_test, r_pred))
print("Mean Absolute Error: ", mean_absolute_error(y_test, r_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, r_pred)))

R2 Score: 0.946740751192265
Mean Squared Error: 0.05450384491951317
Mean Absolute Error: 0.11418652633630026
RMSE: 0.23346058536616662

Conclusion
From the exploratory data analysis, I have concluded that the salary of the employees is
dependent upon the following factors:

1. Years of Experience
2. Job Title
3. Education Level

Employees with greater years of experience, having job title such as Data
analyst/scientist, Software Developer or Director/Manager/VP and having a Master's or
Doctoral degree are more likely to have a higher salary.

file:///E:/Repo Page/Salary Prediction.html 27/28

8/9/23, 10:20 AM Salary Prediction

Coming to the machine learning models, I have used regressor models - Decision Tree
Regressor and Random Forest Regressor for predicting the salary. The Random Forest
Regressor has performed well with the accuracy of 94.6%

file:///E:/Repo Page/Salary Prediction.html 28/28

Data Science Jobs & Salaries Report
No ratings yet
Data Science Jobs & Salaries Report
8 pages
AMCAT Data Analysis
No ratings yet
AMCAT Data Analysis
18 pages
Salary Data Analysis - Phase 1
No ratings yet
Salary Data Analysis - Phase 1
5 pages
Chapter 1
No ratings yet
Chapter 1
19 pages
Data Scientist
No ratings yet
Data Scientist
10 pages
AI Assignment 6 - Employee Performance Analysis - Jupyter Notebook
No ratings yet
AI Assignment 6 - Employee Performance Analysis - Jupyter Notebook
9 pages
Ejemplo de DataTables Exportación de Archivos
No ratings yet
Ejemplo de DataTables Exportación de Archivos
2 pages
Synopsis Group 6 Final
No ratings yet
Synopsis Group 6 Final
6 pages
FoAI - ASM2 - Mai Ngo
100% (1)
FoAI - ASM2 - Mai Ngo
15 pages
2 Workbooks
No ratings yet
2 Workbooks
588 pages
HR & Job Dashboard
No ratings yet
HR & Job Dashboard
41 pages
A Model To Predict Pay Scale Fixation in Job Marke
No ratings yet
A Model To Predict Pay Scale Fixation in Job Marke
6 pages
Group 4 Ins1053 Ins105301
No ratings yet
Group 4 Ins1053 Ins105301
57 pages
Employee Data Analysis 2 (1) AKSHAYA.M
No ratings yet
Employee Data Analysis 2 (1) AKSHAYA.M
9 pages
DWDM PR
No ratings yet
DWDM PR
29 pages
Data Science Job Salaries & Goals
No ratings yet
Data Science Job Salaries & Goals
2,732 pages
Pps Ui22cs57lab 10
No ratings yet
Pps Ui22cs57lab 10
17 pages
ML Project 2
No ratings yet
ML Project 2
19 pages
Basic of Statistics
No ratings yet
Basic of Statistics
4 pages
Mini Project Report
No ratings yet
Mini Project Report
10 pages
Assignment03 21979258
No ratings yet
Assignment03 21979258
3 pages
Capstone Project Assignment
No ratings yet
Capstone Project Assignment
3 pages
1624106057@g.us
No ratings yet
1624106057@g.us
13 pages
Examplefile: Name Position Office Age Start Date Salary
No ratings yet
Examplefile: Name Position Office Age Start Date Salary
1 page
InfoPro Solutions
No ratings yet
InfoPro Solutions
20 pages
HRDF IndSF DATASCIENCE
No ratings yet
HRDF IndSF DATASCIENCE
3 pages
2020-2021 Data Jobs Overview
No ratings yet
2020-2021 Data Jobs Overview
40 pages
Data Analytics Basic Package
No ratings yet
Data Analytics Basic Package
40 pages
DA Course PDF
No ratings yet
DA Course PDF
40 pages
Assignment INF
No ratings yet
Assignment INF
4 pages
Assignment 4
No ratings yet
Assignment 4
17 pages
Examplefile: Name Position Office Age Start - Date Salary
No ratings yet
Examplefile: Name Position Office Age Start - Date Salary
1 page
Day 8
No ratings yet
Day 8
2,060 pages
Salaries For San Francisco Employee - ML - FA - DA Projects
No ratings yet
Salaries For San Francisco Employee - ML - FA - DA Projects
33 pages
Data Job Market Insights
No ratings yet
Data Job Market Insights
23 pages
Kushal Kadayat
No ratings yet
Kushal Kadayat
33 pages
Datatables Example - Icons: Name Position Office Age Start Date Salary
No ratings yet
Datatables Example - Icons: Name Position Office Age Start Date Salary
2 pages
Data Viz Case Study
No ratings yet
Data Viz Case Study
3 pages
Job Postings 07-2024
No ratings yet
Job Postings 07-2024
5,390 pages
Group 3
No ratings yet
Group 3
56 pages
Employee Analysis
No ratings yet
Employee Analysis
19 pages
Datatables Example - Export Options - Row Selector
No ratings yet
Datatables Example - Export Options - Row Selector
3 pages
Entry Level Data Analyst Resume Guide
100% (2)
Entry Level Data Analyst Resume Guide
8 pages
Empid Name
No ratings yet
Empid Name
2 pages
Glassdoor Insights for Job Seekers
No ratings yet
Glassdoor Insights for Job Seekers
15 pages
5 Statistical Functions
No ratings yet
5 Statistical Functions
2,048 pages
Assignment Ds Midterm
No ratings yet
Assignment Ds Midterm
2 pages
2022ucd2164 1 2
No ratings yet
2022ucd2164 1 2
35 pages
Data Preprocessing & Visualization1
No ratings yet
Data Preprocessing & Visualization1
2 pages
PFDA
No ratings yet
PFDA
23 pages
Ca2 MGN100M Solution
No ratings yet
Ca2 MGN100M Solution
11 pages
Assignment3: 1) Identify Percentage of Missing Values in Each Column and Display The Same
No ratings yet
Assignment3: 1) Identify Percentage of Missing Values in Each Column and Display The Same
30 pages
Data Analyst vs. Data Scientist Detailed Career Path
No ratings yet
Data Analyst vs. Data Scientist Detailed Career Path
6 pages
PySpark Slides
No ratings yet
PySpark Slides
30 pages
Data Science and Engineering: Avg. Salary Hike
No ratings yet
Data Science and Engineering: Avg. Salary Hike
13 pages
Capstone Project - Employee Attrition Rate
No ratings yet
Capstone Project - Employee Attrition Rate
66 pages
Data Science Bootcamp Syllabus - HyperionDev
No ratings yet
Data Science Bootcamp Syllabus - HyperionDev
18 pages
Ganpati IRP Agreement
No ratings yet
Ganpati IRP Agreement
12 pages
Forklift Bucket, Hildale, UT, United States - Used Other Attachments and Components - Mascus USA
No ratings yet
Forklift Bucket, Hildale, UT, United States - Used Other Attachments and Components - Mascus USA
2 pages
Stages of The Employment Cycle.x
No ratings yet
Stages of The Employment Cycle.x
8 pages
Resume - Shashi Parmar - Infor
No ratings yet
Resume - Shashi Parmar - Infor
3 pages
TPSR - REQUEST ON 27/12/2022 NO Name New Ic
No ratings yet
TPSR - REQUEST ON 27/12/2022 NO Name New Ic
12 pages
Autotuner ECU Compatibility-20230305
No ratings yet
Autotuner ECU Compatibility-20230305
68 pages
Annual Report MRT Jakarta 2020
No ratings yet
Annual Report MRT Jakarta 2020
684 pages
SWOT Analysis of Starbucks
No ratings yet
SWOT Analysis of Starbucks
7 pages
Sandbag 2024 Steel CCSU Report
No ratings yet
Sandbag 2024 Steel CCSU Report
42 pages
Nabard
No ratings yet
Nabard
11 pages
HA+Super Trend
No ratings yet
HA+Super Trend
3 pages
Swot of Tupperware
No ratings yet
Swot of Tupperware
7 pages
Mortgage Project
No ratings yet
Mortgage Project
13 pages
Work Engagement Plan For Grayson Roofing and Construction
No ratings yet
Work Engagement Plan For Grayson Roofing and Construction
2 pages
Rawanda
No ratings yet
Rawanda
2 pages
Journalizing Basic Accounting Tolentino John Carlo BSMA 2A
No ratings yet
Journalizing Basic Accounting Tolentino John Carlo BSMA 2A
7 pages
Tender Documents - 2022 06 25 17 03 54
No ratings yet
Tender Documents - 2022 06 25 17 03 54
37 pages
Health Checkup Bill Format
No ratings yet
Health Checkup Bill Format
1 page
MBA - PGDPM - IKM - Webinar - 2
No ratings yet
MBA - PGDPM - IKM - Webinar - 2
78 pages
Hossam Musallam (New)
No ratings yet
Hossam Musallam (New)
1 page
AIMARKETING1 BookChapter
No ratings yet
AIMARKETING1 BookChapter
9 pages
BMP Report Final
No ratings yet
BMP Report Final
23 pages
The Impact of Visual Content On Consumer Engagement and Buying Behavior in Herat Province: A Social Media Perspective
No ratings yet
The Impact of Visual Content On Consumer Engagement and Buying Behavior in Herat Province: A Social Media Perspective
8 pages
Xiaomi Casestudy PDF
No ratings yet
Xiaomi Casestudy PDF
3 pages
Matematik Keputusan: Decision Mathematics
No ratings yet
Matematik Keputusan: Decision Mathematics
3 pages
Ai Application To Segmented Marketing
No ratings yet
Ai Application To Segmented Marketing
4 pages
Module 2 - Understanding The Marketing Environment, Ethical Behavior, Social Responsibility
No ratings yet
Module 2 - Understanding The Marketing Environment, Ethical Behavior, Social Responsibility
18 pages
Ielts Trainer 2 - Test 3 - Correcting Papers
No ratings yet
Ielts Trainer 2 - Test 3 - Correcting Papers
3 pages
Alano & Sons Coop Recognition Award
No ratings yet
Alano & Sons Coop Recognition Award
4 pages
Emabb Ebook - Share 150523
No ratings yet
Emabb Ebook - Share 150523
8 pages