Salary Prediction
# Import the required libraries for data preprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
Data Preprocessing
* Load the data using the Pandas read function.
* Print the dataset information.
* Print the statistics about the data using the describe function.
* Visualize the correlation map to understand the correlations between the columns.
* Check for null values, and if the data contains any, remove them.
* Additionally, inspect for duplicate values and remove them if present (a condensed sketch of these steps follows this list).
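The cells below walk through each of these steps one at a time; purely as a compact reference, the whole flow can be sketched in a few lines (a sketch only, reusing the same file path as the loading cell below):

# Condensed preprocessing sketch: load, inspect, correlate, and clean in one pass
data = pd.read_csv('C://Users//vinod//Downloads//salary.csv')
data.info()                                              # dtypes and non-null counts
data.describe()                                          # summary statistics
sns.heatmap(data.corr(numeric_only=True), annot=True)    # correlations between numeric columns
plt.show()
data = data.dropna().drop_duplicates()                   # drop null and duplicate rows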
# Load the data set and print the top 5 rows
data=pd.read_csv('C://Users//vinod//Downloads//salary.csv')
data.head()
(output: the first five rows of the dataset, showing age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, and sex)
# About the data set
data.info()
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
 0   age             32561 non-null  int64
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64
 11  capital-loss    32561 non-null  int64
 12  hours-per-week  32561 non-null  int64
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
# Understanding statistics in the data set
data.describe().style.background_gradient(cmap='tab20c')
(output: summary statistics for age, fnlwgt, education-num, capital-gain, capital-loss, and hours-per-week)
# Checking the correlation matrix
plt.figure(figsize=(10,4))
sns.heatmap(data.corr(), annot=True, cmap='winter_r', fmt='.2f', linewidths=1)
plt.title("Correlation map")
plt.show()
(figure: correlation heatmap of age, fnlwgt, education-num, capital-gain, capital-loss, and hours-per-week)
Data Cleaning Process
In [6]: # Checking the null values in the data set
data.isna().sum()/len(data)*100
age               0.0
workclass         0.0
fnlwgt            0.0
education         0.0
education-num     0.0
marital-status    0.0
occupation        0.0
relationship      0.0
race              0.0
sex               0.0
capital-gain      0.0
capital-loss      0.0
hours-per-week    0.0
native-country    0.0
salary            0.0
dtype: float64
In [7]: # Checking the percentage of null values in the dataset
null_values=data.isna().sum()
total_shells=np.product(data.shape)
total_missing_values=null_values.sum()
percentage_missing_values=(total_missing_values/total_shells)*100
print(f'The data set contains {percentage_missing_values}% missing values')

The data set contains 0.0% missing values
In [8]: # Checking the duplicate values in the dataset
duplicate=data.duplicated().sum()
print(f'There are {duplicate} duplicate rows in the data set; we remove them')

There are 24 duplicate rows in the data set; we remove them
In [9]: # Remove the duplicate values and store the data set in the data variable
data=data.drop_duplicates()
after_remove_duplicates=data.duplicated().sum()
print(f'There are {after_remove_duplicates} duplicate rows in the data set')
There are 0 duplicate rows in the data set
Exploratory Data Analysis Process
Questions asked from the data:
* We use a for loop to print count plots for numerical columns to understand the most repeated values.
* We also visualize a histogram for the number of hours employees work.
* Additionally, we visualize the output percentages using a pie chart.
* Furthermore, we perform data cleaning by replacing unwanted names in the dataset with "others".
* We create pie charts to understand the distribution of the categorical columns.
* We generate a separate dataset for the USA to analyze the most demanded education and jobs.
* We explore the education levels with the most hours worked in the USA.
* Using a for loop, we visualize selected columns in the USA dataset using bar charts.
* Additionally, we create box plots of age for each categorical column using a for loop.
* Furthermore, we create a pivot table for better data understanding.
* Lastly, we visualize the most demanded jobs with work hours ranging from 20 to 40 hours.
# Create a count plot to visualize some numerical columns in the data
numerical=['education-num', 'capital-gain', 'hours-per-week']
for i in numerical:
    plt.figure(figsize=(10,5))
    sns.countplot(data=data, x=i)
    plt.title([i])
    plt.xticks(rotation=90)
    plt.show()
(figures: count plots of education-num, capital-gain, and hours-per-week)
plt.figure(figsize=(10,5))
sns.histplot(data=data, x='hours-per-week')  # bin count truncated in the source
plt.title("Distribution of the hours-per-week")
plt.xlabel("Hours")
plt.ylabel("Count of values")
plt.show()
(figure: histogram of hours-per-week)
# Let's find the percentage of the genders in the data set using a pie chart
data['sex'].value_counts().plot(kind='pie',
                                explode=[0, 0.1],
                                labels=['Male', 'Female'],
                                colors=['blue', 'gray'],
                                autopct='%1.2f%%',
                                shadow=True,
                                )
plt.title("Visualize the gender percentage in the data")
plt.show()
(figure: pie chart of the gender percentage)
In [13]: # Distribution of the age column with the gender
plt.figure(figsize=(10,6))
sns.countplot(data=data, x='age', hue='sex', palette='deep')
plt.xticks(rotation=90)
plt.show()
(figure: count plot of age split by gender)
In [10]: plt.figure(figsize=(10,5))
data['salary'].value_counts().sort_values(ascending=False).plot(kind='bar',
                                                                color=['#A9E2F3'])  # second colour code truncated in the source
plt.title("Visualize the salary values in the data")
plt.xlabel("Salary")
plt.ylabel("Count of the values")
plt.show()
(figure: bar chart of salary value counts)
# The data contains unwanted information we would like to remove and replace with new values
data['native-country'].value_counts().head(3)

United-States    29153
Mexico             638
?                  582
Name: native-country, dtype: int64
data['native-country']=data['native-country'].str.replace('?', 'others')
data['workclass']=data['workclass'].str.replace('?', 'others')
data['occupation']=data['occupation'].str.replace('?', 'others')
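The same cleanup can be written once for all three columns; this is only a sketch (not the author's code), with regex=False added so the '?' is treated as a literal character regardless of the pandas version:

# Replace the '?' placeholder with 'others' in every affected column
for col in ['native-country', 'workclass', 'occupation']:
    data[col] = data[col].str.replace('?', 'others', regex=False)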
# Visualize the top 20 countries in the dataset
data['native-country'].value_counts().nlargest(20)\
    .plot(kind='bar', title="Top 20 countries in the data set", figsize=(10,5))  # hatch and colour values truncated in the source
plt.xlabel("Country name")
plt.ylabel("Count of values")
plt.show()
(figure: bar chart of the top 20 native countries by count)
In [18]: # Create a pie chart to understand the relationship, race, and sex percentages
another_list=['relationship', 'race', 'sex']
num_of_columns=len(another_list)
plt.figure(figsize=(25,8))
for i, col in enumerate(another_list):
    plt.subplot(1, num_of_columns, i+1)
    data[col].value_counts().plot(kind='pie', autopct='%1.1f%%', startangle=90)
    plt.title([col])
plt.tight_layout()
plt.show()
In [19]: # Create a pie chart to understand the percentages of workclass, education, and marital-status
work_place=['workclass', 'education', 'marital-status']
num_column=len(work_place)
plt.figure(figsize=(25,8))
for i, col in enumerate(work_place):
    plt.subplot(1, num_column, i+1)
    data[col].value_counts().plot(kind='pie', autopct='%1.1f%%', startangle=90)
    plt.title([col])
plt.tight_layout()
plt.show()
In [20]: # Apply a condition to the data to create a separate data frame for the United States
usa=data[data['native-country']=='United-States']
# Find which education is most in demand in the United States
usa['education'].value_counts().sort_values(ascending=False)\
    .plot(kind='bar', figsize=(10,5), hatch='*', color=['#81F781', '#FACC2E', '#E3CEF6'])  # remaining colour codes truncated in the source
plt.title("Most demanded education in the United States")
plt.xlabel("Degree")
plt.ylabel("Count of the values")
plt.show()
(figure: bar chart of education counts in the USA subset)
In [21]: # Find the total working hours for each education level using a bar chart
usa.groupby('education')['hours-per-week'].sum().sort_values(ascending=False)\
    .plot(kind='bar', figsize=(10,5), hatch='//', color=['#D8F781', '#0080FF', '#086A87'])  # remaining colour codes truncated in the source
plt.title("Total working hours by degree")
plt.xlabel("Degree")
plt.ylabel("Count of values")
plt.show()
(figure: bar chart of total hours-per-week by education in the USA subset)
# Create count plots to understand information about the United States
for i in ['workclass', 'education', 'race', 'relationship', 'sex']:
    plt.figure(figsize=(13,6))
    sns.countplot(data=usa, x=i, hue='salary', palette='viridis')
    plt.title(f'Information about the {i} column with salary')
    plt.xlabel([i])
    plt.ylabel("Count of the values")
    plt.xticks(rotation=90)
    plt.show()
(figures: count plots of workclass, education, race, relationship, and sex in the USA subset, each split by salary)
In [23]: # Create a box plot of age for each categorical column
for i in ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']:  # column list truncated in the source
    plt.figure(figsize=(16,5))
    sns.boxplot(data=data, x=i, y='age', palette='hls')
    plt.title([i])
    plt.xlabel([i])
    plt.xticks(rotation=90)
    plt.ylabel('Age')
    plt.show()
localhost 8888inbconverthiml’Salary Precctonipynb 7download=falso 15132Salary Presicton
1810812023, 20:48
[workclass']
anuomsonoN
Aedanounim
Sur-dwsyias
sroqo
064j2307
nob-erepay
penis
punou-duiesies,
pob.ares
(rmorkclass'}
[education’)
16192
wer
roowpsaus
wapast
‘nor
wnsans
oops Jota
s1e10p00
wane
aan-2085y
education’)
upre-s0ssy
abajosawos,
false
wwe
srorsen.
wnt
pests
siojsupe3
localhost 8888inbconverthiml'Salary Prediction pynb?downloadSalary Presicton
1810812023, 20:48
Cmarital-status']
20
20
0
60
so
0
20
20
pamopin
asnods-sy-pauien,
payesedas
juasqe-asnods-pauen
pon10nia
‘asnods-n-pauuen
poueuianen
[maritatstatus']
Coccupation']
nies-2snoy.nue
seniou-pauuiy
ies-aniay0ig
sieqno
poddns-upes
pdsurdo-auyen
Surysy-Suyuures
Suinow-uodsuent
edaryers
sare
aaunvasse120
Aepadssord
srouee)>-si9jpuely
reuadeuewoaea
yeoue;>upy
occupation)
vise
false
localhost 8888inbconverthiml'Salary Prediction pynb?download1810812023, 20:48 Salary Presicton
relationship’)
.
80
70
60
20
a
wite
own-chi
| fj -
a
relationship‘)
frace')
20 ’ ‘
‘
20 | ’
10
60
20
white
Black
3S & 8
|
LL
# Find the jobs in the 20-to-40-hours range by workclass
filtered_jobs=data[(data['hours-per-week']>=20) & (data['hours-per-week']<=40)]
filtered_jobs['workclass'].value_counts().plot(kind='bar', figsize=(10,5))  # colour codes truncated in the source
plt.title("Top most working jobs in the data")
plt.xlabel("Jobs")
plt.ylabel("Count of values")
plt.show()
(figure: bar chart of workclass counts for jobs worked 20 to 40 hours per week)
In [25]: # Using groupby, find some interesting facts
data.groupby('education')['salary'].value_counts()\
    .unstack().style.background_gradient(cmap='gist_heat_r')
localhost 8888inbconverthiml’Salary Precctonipynb 7download=falso 19132salary <=50K >50K
education
oy ees
‘10 BEE xoxo
1st-4th — 160,000000 6.000000
| eee
-802,000000
HS-grad
Masters 763.000000
Preschool 0.000000
Prof-schoo! _ 153.000000
Some-college
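A related view (a sketch, not part of the original notebook): normalising the same group-by gives the share of each salary class within every education level, which is often easier to compare than raw counts.

# Share of <=50K and >50K within each education level
share = data.groupby('education')['salary'].value_counts(normalize=True).unstack()
share.style.background_gradient(cmap='gist_heat_r')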
# Create a pivot table (the values/aggfunc arguments are truncated in the source)
pivot_table=data.pivot_table(columns='workclass', index='education')
pivot_table.style.background_gradient(cmap='cividis_r')
(output: pivot table of education versus workclass)
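Because the values/aggfunc arguments are cut off in the source, the exact aggregation is unknown; a hypothetical reconstruction (an assumption, not the author's actual call) in the spirit of the earlier hours-per-week analysis would be:

# Total weekly hours for every education/workclass pair (assumed aggregation)
pivot_table = data.pivot_table(index='education', columns='workclass',
                               values='hours-per-week', aggfunc='sum')
pivot_table.style.background_gradient(cmap='cividis_r')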
# Some interesting questions asked from the data
print('The most demanding education is', data['education'].value_counts().idxmax())
print('\nThe least demanding education is', data['education'].value_counts().idxmin())
print('\nThe most common working hours in the data is', data['hours-per-week'].value_counts().idxmax())
print('\nThe least common working hours in the data is', data['hours-per-week'].value_counts().idxmin())
print('\nThe most dominant race is', data['race'].value_counts().idxmax())
print('\nThe least dominant race is', data['race'].value_counts().idxmin())
print('\nThe most dominant occupation is', data['occupation'].value_counts().idxmax())
print('\nThe least dominant occupation is', data['occupation'].value_counts().idxmin())
The most demanding education is HS-grad

The least demanding education is Preschool

The most common working hours in the data is 40

The least common working hours in the data is 82

The most dominant race is White

The least dominant race is Other

The most dominant occupation is Prof-specialty

The least dominant occupation is Armed-Forces
In [28]: # Find the average working hours by occupation where employees work 50 hours or more
long_hours_jobs=data[(data['hours-per-week']>=50)]
long_hours_jobs.groupby('occupation')['hours-per-week'].mean().sort_values(ascending=False)\
    .plot(kind='bar', figsize=(10,5), color=['#D0F5A9', '#F5BCA9', '#0040FF', '#F781BE'])
plt.title("Average hours per week for different occupations")
plt.xlabel("Occupation")
plt.ylabel("Average hours per week")
plt.show()
(figure: bar chart of average hours-per-week by occupation for employees working 50+ hours)
# Find the average age of the bachelors degree holders
bachelors=data[(data['education']=='Bachelors')]
find_the_average_age=bachelors.groupby('sex')['age'].mean().sort_values(ascending=False)
plt.figure(figsize=(7,5))
plt.bar(find_the_average_age.index, find_the_average_age.values, color=['#FA5882', '#F6CEEC'])
plt.title("Average age of the bachelors degree holders")
plt.xlabel("Gender")
plt.ylabel("Average age")
plt.show()
(figure: bar chart of the average age of bachelors degree holders by gender)
Observations:
* We observed that the majority of working hours per week fall within the 30 to 40 range.
* The pie chart illustrates a higher percentage of males in the dataset.
* The USA has the highest number of records in the dataset.
* A significant portion of employees earned a salary of less than 50K.
* Within the USA data, the most demanded degree is high school (HS-grad), which also corresponds to the highest total working hours.
* The pie charts provide insights into various aspects of the output.
* Employees in the "private house service" sector work for more than 50 hours per week.
* Working hours for employees in private companies typically range between 20 and 40 hours.
Machine Learning Modeling
# Import all the required libraries for the machine learning modeling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
# Label-encode every categorical (object) column
for col in data.select_dtypes(include='object'):
    labelencoder=LabelEncoder()
    labelencoder.fit(data[col].unique())
    data[col]=labelencoder.transform(data[col])
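An alternative to the loop above (a sketch, not the author's code): keep one fitted encoder per column in a dictionary, so the integer codes can later be mapped back to the original category names if needed.

# Keep a fitted LabelEncoder per column for later inverse transforms
encoders = {}
for col in data.select_dtypes(include='object'):
    encoders[col] = LabelEncoder().fit(data[col])
    data[col] = encoders[col].transform(data[col])
# e.g. encoders['education'].inverse_transform([0, 1]) recovers the original labels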
# Split the data into independent and dependent variables
X=data.drop(['salary'], axis=1)
y=data['salary']
# Normalize the data using the StandardScaler
standard=StandardScaler()
X=standard.fit_transform(X)
# Split the data into train and test data
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.20, random_state=220)
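Because far more rows fall in the <=50K class than in >50K, a stratified split keeps the class ratio identical in the train and test sets; this is only a suggested variant (an assumption, not part of the original notebook):

# Stratified variant of the same split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=220, stratify=y)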
# Create a function for machine learning modeling
def machine_learning_model(model, X_train, X_test, y_train, y_test):
    """
    In this function we write the code for the machine learning model:
    first we fit the training data to the model,
    then predict values for the test data and store them in a variable,
    and finally print the accuracy score along with the classification report and confusion matrix.
    """
    print(f'The {model}')
    model.fit(X_train, y_train)
    y_pred=model.predict(X_test)
    model_score=accuracy_score(y_test, y_pred)
    print(f"\nThe accuracy score of the {model} is {model_score*100:.2f}")
    print(f"\n{classification_report(y_test, y_pred)}")
    print(f"\n{confusion_matrix(y_test, y_pred)}")
    matrix=confusion_matrix(y_test, y_pred)
    sns.heatmap(matrix, annot=True, cmap='Reds', fmt='.2f', linewidths=1)
    plt.show()
    print("="*30)
models={
    'logistic': LogisticRegression(penalty='l2'),
    'decision': DecisionTreeClassifier(criterion='gini', splitter='best'),
    'Random': RandomForestClassifier(n_estimators=50, criterion='gini'),
    'knn': KNeighborsClassifier(),
    'xg': XGBClassifier(),
    'catboost': CatBoostClassifier(iterations=1)
}
for i in range(len(models)):
    model_names=list(models.values())[i]
    names=list(models.keys())[i]
    # Apply the machine learning function to each model
    machine_learning_model(model_names, X_train, X_test, y_train, y_test)
The LogisticRegression()

The accuracy score of the LogisticRegression() is 82.62

              precision    recall  f1-score   support

           0       0.84      0.95      0.89      4941
           1       0.73      0.45      0.55      1567

    accuracy                           0.83      6508
   macro avg       0.78      0.70      0.72      6508
weighted avg       0.82      0.83      0.81      6508

[[4677  264]
 [ 867  700]]

(figure: confusion-matrix heatmap)
The DecisionTreeClassifier()

The accuracy score of the DecisionTreeClassifier() is 80.82

              precision    recall  f1-score   support

           0       0.88      0.87      0.87      4941
           1       0.60      0.61      0.61      1567

    accuracy                           0.81      6508
   macro avg       0.74      0.74      0.74      6508
weighted avg       0.81      0.81      0.81      6508

[[4304  637]
 [ 611  956]]

(figure: confusion-matrix heatmap)
The RandomForestClassifier(n_estimators=50)

The accuracy score of the RandomForestClassifier(n_estimators=50) is 85.59

              precision    recall  f1-score   support

           0       0.88      0.93      0.91      4941
           1       0.74      0.62      0.67      1567

    accuracy                           0.86      6508
   macro avg       0.81      0.77      0.79      6508
weighted avg       0.85      0.86      0.85      6508

[[4604  337]
 [ 601  966]]

(figure: confusion-matrix heatmap)
The KNeighborsClassifier()

The accuracy score of the KNeighborsClassifier() is 83.37

[[4520  421]
 [ 661  906]]

(classification report and confusion-matrix heatmap shown in the original output)
The XGBClassifier()  (full default-parameter repr truncated)

The accuracy score of the XGBClassifier() is 86.85

              precision    recall  f1-score   support

           0       0.89      0.94      0.92      4941
           1       0.77      0.65      0.70      1567

    accuracy                           0.87      6508
   macro avg       0.83      0.79      0.81      6508
weighted avg       0.86      0.87      0.86      6508

[[4638  303]
 [ 553 1014]]

(figure: confusion-matrix heatmap)
The <catboost.core.CatBoostClassifier object>

Learning rate set to 0.5
0:	learn: 0.4868985	total: 165ms	remaining: 0us

The accuracy score of the <catboost.core.CatBoostClassifier object> is 84.53

              precision    recall  f1-score   support

           0       0.86      0.95      0.90      4941
           1       0.77      0.51      0.61      1567

    accuracy                           0.85      6508
   macro avg       0.82      0.73      0.76      6508
weighted avg       0.84      0.85      0.83      6508

[[4706  235]
 [ 772  795]]

(figure: confusion-matrix heatmap)
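The scores above are scattered across long printouts; as a convenience (a sketch, not part of the original notebook), the same models dictionary can be reused to collect each classifier's test accuracy into one small comparison table:

# Fit every model once and tabulate its test accuracy
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
comparison = pd.DataFrame.from_dict(scores, orient='index', columns=['accuracy'])
print(comparison.sort_values('accuracy', ascending=False))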
Random=RandomForestClassifier()
machine_learning_model(Random, X_train, X_test, y_train, y_test)
The RandomForestClassifier()

The accuracy score of the RandomForestClassifier() is 85.93

[[4616  325]
 [ 591  976]]

(classification report and confusion-matrix heatmap shown in the original output)
# Let's dump the model
import pickle
pickle.dump(Random, open('RandomForest.pkl', 'wb'))
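A quick check that the serialized model can be restored (a sketch, not part of the original notebook; it reuses the file name from the dump call above):

# Reload the pickled model and confirm it still scores the test set
import pickle

with open('RandomForest.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
print(loaded_model.score(X_test, y_test))  # should match the accuracy reported above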