Salary Prediction
Data Dictionary
Column Description
Unnamed: 0 Index
   Unnamed: 0  Age   Gender  Education Level  Job Title          Years of Experience  Salary    Country  Race
0  0           32.0  Male    Bachelor's       Software Engineer  5.0                  90000.0   UK       Wh
1  1           28.0  Female  Master's         Data Analyst       3.0                  65000.0   USA      Hispa
2  2           45.0  Male    PhD              Senior Manager     15.0                 150000.0  Canada   Wh
3  3           36.0  Female  Bachelor's       Sales Associate    7.0                  60000.0   USA      Hispa
Data Preprocessing
In [ ]: #checking the shape of the data
df.shape
Out[ ]: (6704, 9)
Out[ ]: Unnamed: 0 0
Age 2
Gender 2
Education Level 3
Job Title 2
Years of Experience 3
Salary 5
Country 0
Race 0
dtype: int64
Since the number of rows with null/missing values is very small compared to the total
number of rows, I will drop these rows.
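Before dropping, it can help to check how large that share really is. The sketch below does this on a made-up toy frame (the values and the name `toy` are only for illustration, not the real data):

```python
import pandas as pd

# Toy frame standing in for the salary data (values are made up).
toy = pd.DataFrame({
    "Age": [32.0, None, 45.0, 36.0],
    "Salary": [90000.0, 65000.0, None, 60000.0],
    "Country": ["UK", "USA", "Canada", "USA"],
})

# Count rows that contain at least one missing value.
rows_with_nulls = int(toy.isnull().any(axis=1).sum())
share = rows_with_nulls / len(toy)
print(f"{rows_with_nulls} of {len(toy)} rows ({share:.1%}) have a missing value")

# Dropping those rows leaves only the complete ones.
cleaned = toy.dropna(axis=0)
print(len(cleaned))
```

On the real data the same two lines can be run on `df` to confirm that only a handful of the 6704 rows are affected.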
In [ ]: df.dropna(axis=0, inplace=True)
Out[ ]: Unnamed: 0 0
Age 0
Gender 0
Education Level 0
Job Title 0
Years of Experience 0
Salary 0
Country 0
Race 0
dtype: int64
In [ ]: #dropping column
df.drop(columns='Unnamed: 0', inplace=True)
In [ ]: df.dtypes
In [ ]: #number of unique values in each column
df.nunique()
Out[ ]: Age 41
Gender 3
Education Level 7
Job Title 191
Years of Experience 37
Salary 444
Country 5
Race 10
dtype: int64
The job title column has 191 different values. It will be very difficult to analyze so many
job titles. So, I will group the job titles under similar job domains.
In [ ]: df['Job Title'].unique()
In [ ]: def categorize_job_title(job_title):
    #map a raw job title to a broad job domain; the first matching rule wins
    job_title = str(job_title).lower()
    if 'software' in job_title or 'developer' in job_title:
        return 'Software/Developer'
    elif 'data' in job_title or 'analyst' in job_title or 'scientist' in job_title:
        return 'Data Analyst/Scientist'
    elif 'manager' in job_title or 'director' in job_title or 'vp' in job_title:
        return 'Manager/Director/VP'
    elif 'sales' in job_title or 'representative' in job_title:
        return 'Sales'
    elif 'marketing' in job_title or 'social media' in job_title:
        return 'Marketing/Social Media'
    elif 'product' in job_title or 'designer' in job_title:
        return 'Product/Designer'
    elif 'hr' in job_title or 'human resources' in job_title:
        return 'HR/Human Resources'
    elif 'financial' in job_title or 'accountant' in job_title:
        return 'Financial/Accountant'
    elif 'project manager' in job_title:
        #never reached: 'manager' is already matched by the rule above
        return 'Project Manager'
    elif 'it' in job_title or 'support' in job_title:
        #'it' is a plain substring test, so it also matches inside longer words
        return 'IT/Technical Support'
    elif 'operations' in job_title or 'supply chain' in job_title:
        return 'Operations/Supply Chain'
    elif 'customer service' in job_title or 'receptionist' in job_title:
        return 'Customer Service/Receptionist'
    else:
        return 'Other'

#apply the grouping to the Job Title column
df['Job Title'] = df['Job Title'].apply(categorize_job_title)
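Two things are worth knowing about keyword rules like these: the order of the checks decides the result (a 'project manager' check placed after a generic 'manager' check can never fire), and a plain `in` test matches inside longer words. The sketch below is a hypothetical alternative, not what this notebook does: it uses an ordered rule list with whole-word matching. The name `RULES` and the example titles are made up for illustration, and switching to it would change the grouping used in the rest of the analysis.

```python
import re

# Hypothetical rule list: (keywords, label), checked top to bottom.
# 'project manager' comes before the generic 'manager' rule so it can match.
RULES = [
    (("project manager",), "Project Manager"),
    (("manager", "director", "vp"), "Manager/Director/VP"),
    (("it", "support"), "IT/Technical Support"),
]

def categorize(job_title):
    """Return the label of the first rule whose keyword appears as a whole word."""
    title = str(job_title).lower()
    for keywords, label in RULES:
        for kw in keywords:
            # \b anchors make 'it' match only as a separate word.
            if re.search(r"\b" + re.escape(kw) + r"\b", title):
                return label
    return "Other"

print(categorize("Project Manager"))
print(categorize("IT Support"))
print(categorize("Recruiter"))
# A plain substring test would have matched here:
print("it" in "recruiter")
```

The trade-off is that whole-word matching no longer catches plurals such as "managers" unless they are listed explicitly.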
In [ ]: df['Education Level'].unique()
In the dataset, the same education level is written in two different ways, e.g. Bachelor's
and Bachelor's Degree, which mean the same thing. So I will group such values under a single label.
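In [ ]: def group_education(education):
    #map the different spellings of an education level to one label
    #(values that match none of the rules are returned as None)
    education = str(education).lower()
    if 'high school' in education:
        return 'High School'
    elif "bachelor's" in education:
        return 'Bachelors'
    elif "master's" in education:
        return 'Masters'
    elif 'phd' in education:
        return 'PhD'

#apply the grouping to the Education Level column
df['Education Level'] = df['Education Level'].apply(group_education)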
Descriptive Statistics
In [ ]: #descriptive statistics
df.describe()
In [ ]: df.head()
   Age   Gender  Education Level  Job Title               Years of Experience  Salary   Country  Race
1  28.0  Female  Masters          Data Analyst/Scientist  3.0                  65000.0  USA      Hispan
The pie chart shows that the majority of the employees in the dataset are male (54.8%),
followed by females (45%), while 0.2% of the employees belong to the Other gender.
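The plotting cell for the pie chart is not part of this export. The percentages themselves can be computed with `value_counts(normalize=True)`, as in this small sketch on a made-up gender column (on the real data the series would be `df['Gender']`):

```python
import pandas as pd

# Made-up gender column, for illustration only.
gender = pd.Series(["Male", "Male", "Female", "Other", "Female", "Male"])

# normalize=True returns fractions instead of raw counts.
shares = gender.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

The resulting series can be passed to a pie chart or simply read as a table.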
Age Distribution
In [ ]: sns.histplot(data=df, x='Age', bins=20, kde=True)
plt.title('Age Distribution')
plt.show()
The majority of the employees are between 25 and 35 years old, so the dataset is
dominated by relatively young employees. Only a small number of employees are
older than 55 years.
Education Level
In [ ]: sns.countplot(x = 'Education Level', data = df, palette='Set1')
plt.xticks(rotation=90)
Most of the employees have a Bachelor's degree, followed by a Master's degree and a
Doctoral degree. The smallest group has only a High School education. The graph
suggests that most employees started working after graduation, fewer after
post-graduation, and very few went on to a doctorate.
Job Title
In [ ]: sns.countplot(x='Job Title', data = df)
plt.xticks(rotation=90)
This graph helps us break down the job title data into a simpler form. From the graph,
it is clear that the majority of the employees have the job titles Software/Developer, Data
Analyst/Scientist or Manager/Director/VP. A smaller number of employees have job titles such
as Sales, Marketing/Social Media, HR, Product/Designer and Customer Service. Very few
of the employees work in Financial/Accountant or Operations/Supply Chain roles.
From this I build the hypothesis that job titles such as Software/Developer, Data
Analyst/Scientist and Manager/Director/VP are in higher demand than other job
titles. It also suggests that job titles like Financial/Accountant, Operations/Supply
Chain and Customer Service are in lower demand and paid comparatively less.
Years of Experience
In [ ]: sns.histplot(x = 'Years of Experience', data = df,kde=True)
Most of the employees in the dataset have 0-7 years of experience in their respective
domains, and the majority of these have less than 5 years. The number of employees
decreases as the number of years of experience increases.
Country
In [ ]: sns.countplot(x='Country', data=df)
plt.xticks(rotation=90)
The number of employees from each of the above 5 countries is nearly the same, with slightly more in the
USA.
Racial Distribution
In [ ]: sns.countplot(x='Race', data=df)
plt.xticks(rotation=90)
This graph helps us understand the racial distribution in the dataset. From the graph, it
is clear that most of the employees are either White or Asian, followed by Korean,
Chinese, Australian and Black. The number of employees from the Welsh, African American,
Mixed and Hispanic groups is smaller than for the other groups.
From all the above plots and graphs, we get an understanding of the data we are
dealing with, including its distribution and quantity. Next, I will explore the relation of
these independent variables with the target variable, i.e. Salary.
In this scatter plot we see a trend that salary increases with
age, which is expected because of promotions and appraisals. However, on closer
observation we find that employees of similar age have very different salaries, which means there are
other factors that decide the salary.
The boxplot and violin plot describe the salary distribution among the three genders. In
the boxplot, employees of the Other gender have a noticeably higher salary than
males and females: their median salary is above 150000,
followed by males with a median salary near 107500 and females with a median salary near
100000. Note that the Other group is only about 0.2% of the data, so its median rests on
very few employees. The violin plot shows the shape of the salary distribution for each gender.
Most of the Other gender employees have a salary above 150000. For males, the
distribution is concentrated between 50000 and 100000 as well as near 200000. For
females, the salary distribution is more spread out than for the other genders, with
most salaries near 50000.
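The medians and quartiles that a boxplot draws can also be read off as numbers with a `groupby`. A minimal sketch on made-up salaries (the frame `toy` is an assumption for illustration; the real call would use `df`):

```python
import pandas as pd

# Made-up salaries for two groups, for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Salary": [50000, 110000, 200000, 45000, 100000, 160000],
})

# Group size, quartiles and median per gender: the numbers behind a boxplot.
summary = toy.groupby("Gender")["Salary"].describe()[["count", "25%", "50%", "75%"]]
medians = toy.groupby("Gender")["Salary"].median()
print(summary)
```

Keeping the `count` column next to the median makes it easy to spot groups that are too small to support a conclusion.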
The boxplot and violin plot show the distribution of salary by education
level. The median salary is highest for PhD holders, followed by Master's
and Bachelor's degree holders, with employees who have only a High School education having the lowest
median salary. In the violin plot, the salaries of PhD holders are concentrated near 200000, whereas
Master's degree holders have a narrow distribution that is
spread from 100k to 150k. Bachelor's degree holders have a salary distribution concentrated near
50000, and employees with only a High School education near 40k-45k.
From these graphs, I infer that employees with a higher education level have a higher
salary than employees with a lower education level.
This graph falsifies my previous hypothesis regarding demand and pay with respect
to job titles. In this graph, the 'Other' category has a higher salary than the titles
I assumed to be in high demand and pay. In contrast to the earlier job title count plot,
this graph shows that there is no clear relation between how common a job title is and its salary:
the job titles with high salaries are fewer in number.
However, the hypothesis holds for job titles such as Software/Developer, Data
Analyst/Scientist and Manager/Director/VP. These job titles are both common
and well paid. On the other hand, job titles such as Operations/Supply Chain,
HR, Financial/Accountant and Marketing/Social Media turn out to have much higher salaries
than I assumed.
From this scatter plot, it is clear that, on the whole, salary
increases with years of experience. However, on a closer look, employees with similar
experience have different salaries. This is because salary also depends on other
factors such as job title, age, gender and education level, as discussed earlier.
Both the boxplot and the violin plot show very similar salary distributions across all the
countries. There is only a small variation in the
median salary: it is slightly lower in the USA than in the other countries.
Since we cannot get much information about salary from the
country alone, I will plot job title vs salary for each country to get an
overview of how salary varies by job title within each country.
In [ ]: fig,ax = plt.subplots(3,2,figsize=(20,20))
plt.subplots_adjust(hspace=0.5)
sns.boxplot(x = 'Job Title', y = 'Salary', data = df[df['Country'] == 'USA'], ax
ax[0,0].tick_params(axis='x', rotation=90)
sns.boxplot(x = 'Job Title', y = 'Salary', data = df[df['Country'] == 'UK'], ax
ax[0,1].tick_params(axis='x', rotation=90)
sns.boxplot(x = 'Job Title', y = 'Salary', data = df[df['Country'] == 'Canada'],
ax[1,0].tick_params(axis='x', rotation=90)
sns.boxplot(x = 'Job Title', y = 'Salary', data = df[df['Country'] == 'Australia
ax[1,1].tick_params(axis='x', rotation=90)
sns.boxplot(x = 'Job Title', y = 'Salary', data = df[df['Country'] == 'China'],
ax[2,0].tick_params(axis='x', rotation=90)
sns.boxplot(x = 'Job Title', y = 'Salary', data = df, ax = ax[2,1]).set_title('A
ax[2,1].tick_params(axis='x', rotation=90)
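The six panels can also be summarised as one table of median salaries, which is easier to compare across countries than reading boxplots side by side. This is a sketch on made-up rows (the frame `toy` is hypothetical), assuming the same column names as `df`:

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "Country": ["USA", "USA", "UK", "UK", "UK"],
    "Job Title": ["Sales", "Sales", "Sales", "Other", "Other"],
    "Salary": [60000, 80000, 65000, 150000, 170000],
})

# One row per job title, one column per country, median salary in each cell.
table = toy.pivot_table(index="Job Title", columns="Country",
                        values="Salary", aggfunc="median")
print(table)
```

Cells with no employees for a given country and job title show up as NaN.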
After observing all these plots, I conclude that job titles such as Software/Developer,
Manager/Director/VP and Data Analyst/Scientist are in high demand and receive
much higher salaries than other job titles, excluding the job titles that fall under the 'Other'
category. Job titles such as Operations/Supply Chain, Customer Service/Receptionist,
Product/Designer and Sales are in low demand and have low salaries.
Employees in the Australian, Mixed, Black and White groups have the highest
median salary, followed by Asian, Korean and Chinese, with the lowest median salary among
Hispanic employees. In the violin plot, the salary distribution is more
concentrated above 150k for the White, Australian, Black and Mixed groups, whereas the Hispanic
group is more concentrated near 75k.
Data Preprocessing 2
Gender [1 0 2]
Country [3 4 1 2 0]
Education Level [0 2 3 1]
Job Title [11 1 5 10 6 0 8 4 9 2 3 7]
Race [9 5 1 6 4 2 8 0 7 3]
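The encoding cell itself is missing from this export. The integer codes above are consistent with label encoding in alphabetical order (Female = 0, Male = 1, Other = 2), so the sketch below assumes scikit-learn's `LabelEncoder` was used; that is an assumption, and the short `genders` list is made up.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values, in the order they might appear in the column.
genders = ["Male", "Female", "Other", "Male"]

le = LabelEncoder()
codes = le.fit_transform(genders)

# Classes are stored in sorted order; the code is the index into classes_.
print(list(le.classes_))
print(codes.tolist())
```

Label encoding gives each category an arbitrary integer. Tree-based models can work with that, but models that treat the numbers as quantities usually need one-hot encoding instead.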
Normalization
In [ ]: #normalizing the continuous variables
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Age', 'Years of Experience', 'Salary']] = scaler.fit_transform(df[['Age', 'Years of Experience', 'Salary']])
In [ ]: df.head()
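Because Salary is standardized together with the features, every later prediction and error metric is expressed in standard deviations of salary, not in currency. The sketch below reproduces what `StandardScaler` computes, z = (x - mean) / std with the population standard deviation, using only the standard library. The four salaries are the ones visible in the first rows of the data; the resulting `sigma` is therefore only illustrative and not the standard deviation of the full dataset.

```python
from statistics import fmean, pstdev

# Salaries from the first four rows shown at the top of the notebook.
salaries = [90000.0, 65000.0, 150000.0, 60000.0]

mu = fmean(salaries)
sigma = pstdev(salaries)  # population std (ddof=0), as StandardScaler uses

# Standardize, then undo the transformation.
z = [(s - mu) / sigma for s in salaries]
restored = [v * sigma + mu for v in z]

# An error measured in standardized units is converted back by multiplying by sigma.
error_in_salary_units = 0.26 * sigma
print(round(mu, 2), round(sigma, 2), round(error_in_salary_units, 2))
```

With a fitted scikit-learn scaler, `scaler.inverse_transform` does the same conversion for all scaled columns at once.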
The correlation of salary with age and years of experience is already explored in the above
plots. The correlation between years of experience and age is obvious as the person
Salary Prediction
I will be using the following models:
1. Decision Tree Regressor
2. Random Forest Regressor
Out[ ]: ▾ DecisionTreeRegressor
In [ ]: #training R2 score (score() returns R2 for regressors, not accuracy)
dtree.score(X_train, y_train)
Out[ ]: 0.9656459784687974
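A high training score on its own does not say much for a decision tree, since an unconstrained tree can almost memorise the training data. The split and fitting cells are not part of this export, so the sketch below uses synthetic data and assumed settings (`test_size=0.2`, the random seeds, and the toy target are all illustrative) to show how the training and test R2 are compared:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one feature (think years of experience) and a noisy target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(500, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

dtree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# score() returns the coefficient of determination R^2 for regressors.
train_r2 = dtree.score(X_train, y_train)
test_r2 = dtree.score(X_test, y_test)
print(f"train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

The gap between the two numbers, rather than the training score alone, indicates how much the tree is overfitting.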
   Actual     Predicted
0  0.656819   0.678470
1  -0.745659  -0.688434
2  -0.290405  -0.290405
3  -1.048183  -1.036343
4  -0.669294  -0.610093
5  1.414598   1.494747
6  -0.820850  -0.715794
7  -1.142906  -1.122777
8  1.509320   1.554189
9  0.277930   0.287811
The blue curve shows the distribution of the actual values and the red curve shows the
distribution of the predicted values. The predicted values are close to the actual
values and their curve largely coincides with the actual one, which suggests that the model is
a good fit.
R2 Score: 0.9323013355107719
Mean Squared Error: 0.06928069008068977
Mean Absolute Error: 0.13812719621413622
RMSE: 0.2632122529075912
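For reference, the four metrics can be computed by hand from actual and predicted values. The sketch below applies the standard formulas to the ten actual/predicted pairs listed above. Since that is only a ten-row sample, the numbers it prints will not match the full test-set metrics. The helper name `regression_metrics` is my own, and all values are in standardized salary units.

```python
import math

def regression_metrics(actual, predicted):
    """Return R2, MSE, MAE and RMSE for two equally long sequences."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    mse = ss_res / n
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    return {"r2": 1 - ss_res / ss_tot, "mse": mse, "mae": mae,
            "rmse": math.sqrt(mse)}

# The ten actual/predicted pairs from the decision tree output.
actual = [0.656819, -0.745659, -0.290405, -1.048183, -0.669294,
          1.414598, -0.820850, -1.142906, 1.509320, 0.277930]
predicted = [0.678470, -0.688434, -0.290405, -1.036343, -0.610093,
             1.494747, -0.715794, -1.122777, 1.554189, 0.287811]

m = regression_metrics(actual, predicted)
print(m)

# RMSE is the square root of MSE, which also holds for the reported values.
print(math.isclose(math.sqrt(0.06928069008068977), 0.2632122529075912,
                   rel_tol=1e-9))
```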
Out[ ]: ▾ RandomForestRegressor
RandomForestRegressor()
In [ ]: #training R2 score (score() returns R2 for regressors, not accuracy)
rfg.score(X_train, y_train)
Out[ ]: 0.9881489086015691
   Actual     Predicted
0  0.656819   0.648206
1  -0.745659  -0.716941
2  -0.290405  -0.288510
3  -1.048183  -1.049699
4  -0.669294  -0.637562
5  1.414598   1.501506
6  -0.820850  -0.813651
7  -1.142906  -1.113062
8  1.509320   1.541334
9  0.277930   0.306604
The blue curve shows the distribution of the actual values and the red curve shows the
distribution of the predicted values. The predicted values are close to the actual
values and their curve largely coincides with the actual one, which suggests that the model is
a good fit.
R2 Score: 0.946740751192265
Mean Squared Error: 0.05450384491951317
Mean Absolute Error: 0.11418652633630026
RMSE: 0.23346058536616662
Conclusion
From the exploratory data analysis, I have concluded that the salary of the employees is
dependent upon the following factors:
1. Years of Experience
2. Job Title
3. Education Level
Employees with more years of experience, a job title such as Data
Analyst/Scientist, Software/Developer or Manager/Director/VP, and a Master's or
Doctoral degree are more likely to have a higher salary.
For the machine learning part, I used two regression models, Decision Tree
Regressor and Random Forest Regressor, to predict the salary. The Random Forest
Regressor performed better, with an R2 score of about 0.947 against 0.932 for the Decision Tree Regressor.