9/21/2018 Random forest
RANDOM FOREST
An ensemble of multiple decision trees.
Ensemble learning technique: many individual models are combined into one stronger model (the whole is more than the sum of
its individual parts). Random forest is a bagging technique: each tree is trained independently on its own sample of the
data, unlike boosting, where trees are built one after another. Forest --> collection of trees. Many decision trees (each
tree = one model trained on a different subset of the data and different combinations of features) -> combine the output of
all these trees -> predict each sample as the class that occurs most often among the outputs of the different DT models.
This makes up for the deficiencies of the individual models.
The term ensemble is used when more than one machine learning model/algorithm is bundled to give a combined result: the
outcomes of all models are collected and the majority vote (classification) or the average (regression) is taken as the
decision.
Random forest has a classifier for classification and a regressor for regression:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
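The idea of training many trees on different subsets and taking a majority vote can be sketched by hand. This is only an illustration under stated assumptions: the data is synthetic (make_classification), and the names trees, all_votes and majority are made up for this example. scikit-learn's own RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so this is not its exact implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, used only for illustration (not the Titanic data)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

rng = np.random.RandomState(0)
trees = []
for _ in range(25):
    # Bootstrap sample: draw row indices with replacement
    rows = rng.randint(0, len(X), size=len(X))
    # max_features="sqrt" makes every split consider a random subset of the features
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.randint(0, 10**6)))
    trees.append(tree.fit(X[rows], y[rows]))

# One row of predictions per tree, then a majority vote per sample
all_votes = np.array([t.predict(X) for t in trees])      # shape (25, 300)
majority = (all_votes.mean(axis=0) > 0.5).astype(int)    # labels are 0/1; 25 trees, so no ties

print("ensemble accuracy on the training data:", (majority == y).mean())
```

An odd number of trees is used so that a two-class vote can never tie.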
In [6]: # Raw string, so the backslashes in the Windows path are not treated as escape sequences
datafile = r"D:\komal\SIMPLILEARN\MY COURSES\IN PROGRESS\MACHINE LEARNING RECORDINGS\Jul 28 Sat - Aug 25 Sat\Drive downloads\Machine Learning _ Jul 28 - Aug 25 _ Sayan\Decision Trees/titanicdata.htm"
In [7]: #BeautifulSoup is the library used for web scraping
from bs4 import BeautifulSoup
with open(datafile,"r",encoding="Latin-1") as f:
    soup = BeautifulSoup(f,"html.parser")
In [8]: table = soup.find('table')
In [9]: import pandas as pd
# Encoding to ASCII with errors='replace' turns every non-ASCII character into "?"
data = pd.read_html(str(table).encode('ascii', errors='replace'), flavor='bs4')[0]
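The "?" characters that show up in the parsed table come from the ASCII encoding step. A minimal sketch of that behaviour, using a made-up sample string (not a value taken from the data file):

```python
# Hypothetical sample text containing two non-ASCII characters,
# written with escapes: \u0101 is "a with macron", \u00a3 is the pound sign
text = "Q\u0101sim paid \u00a37"

# errors="replace" substitutes one "?" for every character ASCII cannot represent
ascii_bytes = text.encode("ascii", errors="replace")
print(ascii_bytes)  # b'Q?sim paid ?7'

# Replacing "?" with a space afterwards removes the placeholders
cleaned = ascii_bytes.decode("ascii").replace("?", " ")
print(cleaned)      # Q sim paid  7
```

The original characters are lost at the encoding step, so the later cleanup can only blank out the placeholders, not restore them.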
file:///D:/KOMAL/SIMPLILEARN/MY%20COURSES/IN%20PROGRESS/My%20Codes_ML_DS/pdf%20conversion/htmls/komal_RF1_sayan_Titanic.html 1/9
In [10]: data.head()
Out[10]:
Name Age Class/Dept Ticket Joined Job Boat [Body] Unnamed: 7
0 AB??-AL-MUN??, Mr N??s??f Q??sim 27 3rd Class Passenger 2699?18 15s 9d Cherbourg ? 15? NaN
1 ABBING, Mr Anthony 42 3rd Class Passenger 5547?7 11s Southampton Blacksmith ? ?? NaN
2 ABBOTT, Mrs Rhoda Mary 'Rosa' 39 3rd Class Passenger CA2673? 20 5s Southampton ? A? NaN
3 ABBOTT, Mr Rossmore Edward 16 3rd Class Passenger CA2673? 20 5s Southampton Jeweller ? ?[190] NaN
4 ABBOTT, Mr Eugene Joseph 13 3rd Class Passenger CA2673? 20 5s Southampton Scholar ? ?? NaN
In [11]: def cleanup(value):
    # Blank out the "?" placeholders left by the ASCII encoding
    return value.replace("?", " ")
data['Name']= data['Name'].apply(cleanup)
data['Boat [Body]']= data['Boat [Body]'].apply(cleanup)
# Non-numeric ages become NaN
data['Age'] = data['Age'].apply(pd.to_numeric, errors='coerce')
data = data[["Name","Age","Class/Dept","Boat [Body]"]]
data.head()
Out[11]:
Name Age Class/Dept Boat [Body]
0 AB -AL-MUN , Mr N s f Q sim 27.0 3rd Class Passenger 15
1 ABBING, Mr Anthony 42.0 3rd Class Passenger
2 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A
3 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190]
4 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger
In [12]: def checkPass(class_type):
    if "Passenger" in class_type:
        return "Passenger"
    else:
        return "Crew"
data["Crew/Pass"]=data["Class/Dept"].apply(checkPass)
data.head()
Out[12]:
Name Age Class/Dept Boat [Body] Crew/Pass
0 AB -AL-MUN , Mr N s f Q sim 27.0 3rd Class Passenger 15 Passenger
1 ABBING, Mr Anthony 42.0 3rd Class Passenger Passenger
2 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A Passenger
3 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190] Passenger
4 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger Passenger
In [13]: def checkClass(class_type):
    if "Passenger" in class_type:
        # "3rd Class Passenger" -> "3rd"
        return class_type.split(" ")[0]
    else:
        return "Crew"
data["Class"]=data["Class/Dept"].apply(checkClass)
data.head()
Out[13]:
Name Age Class/Dept Boat [Body] Crew/Pass Class
0 AB -AL-MUN , Mr N s f Q sim 27.0 3rd Class Passenger 15 Passenger 3rd
1 ABBING, Mr Anthony 42.0 3rd Class Passenger Passenger 3rd
2 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A Passenger 3rd
3 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190] Passenger 3rd
4 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger Passenger 3rd
In [14]: def checkAdult(age):
    # A missing age (NaN) fails the comparison, so it is labelled "Child"
    if age>=18:
        return "Adult"
    else:
        return "Child"
data["Adult/Child"]=data["Age"].apply(checkAdult)
data.head()
Out[14]:
Name Age Class/Dept Boat [Body] Crew/Pass Class Adult/Child
0 AB -AL-MUN , Mr N s f Q sim 27.0 3rd Class Passenger 15 Passenger 3rd Adult
1 ABBING, Mr Anthony 42.0 3rd Class Passenger Passenger 3rd Adult
2 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A Passenger 3rd Adult
3 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190] Passenger 3rd Child
4 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger Passenger 3rd Child
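In [15]: def checkGender(name):
    # Names look like "SURNAME, Title First Middle": take the word after the comma
    firstname = name[name.index(",")+2:]
    salutation = firstname.split(" ")[0]
    if salutation in ["Mr","Master"]:
        return "Male"
    else:
        # Every other title (Mrs, Miss, but also e.g. Dr or Rev) ends up here
        return "Female"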
In [16]: data["Gender"]=data["Name"].apply(checkGender)
data.head()
Out[16]:
Name Age Class/Dept Boat [Body] Crew/Pass Class Adult/Child Gender
0 AB -AL-MUN , Mr N s f Q sim 27.0 3rd Class Passenger 15 Passenger 3rd Adult Male
1 ABBING, Mr Anthony 42.0 3rd Class Passenger Passenger 3rd Adult Male
2 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A Passenger 3rd Adult Female
3 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190] Passenger 3rd Child Male
4 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger Passenger 3rd Child Male
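In [17]: def checkSurvival(boat):
    # No lifeboat recorded, or a bracketed body number, is taken as "did not survive".
    # The comparison is against "" because strip() never returns " ". The outputs recorded
    # in this notebook were produced with the original check boat.strip()==" ", which never
    # matched, so rows with a blank boat entry were counted as survivors.
    if boat.strip()=="" or "[" in boat:
        return 0
    else:
        return 1
data["Survival"]=data["Boat [Body]"].apply(checkSurvival)
data.head()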
Out[17]:
Name Age Class/Dept Boat [Body] Crew/Pass Class Adult/Child Gender Survival
0 AB -AL-MUN , Mr N s f Q sim 27.0 3rd Class Passenger 15 Passenger 3rd Adult Male 1
1 ABBING, Mr Anthony 42.0 3rd Class Passenger Passenger 3rd Adult Male 1
2 ABBOTT, Mrs Rhoda Mary 'Rosa' 39.0 3rd Class Passenger A Passenger 3rd Adult Female 1
3 ABBOTT, Mr Rossmore Edward 16.0 3rd Class Passenger [190] Passenger 3rd Child Male 0
4 ABBOTT, Mr Eugene Joseph 13.0 3rd Class Passenger Passenger 3rd Child Male 1
In [18]: data.groupby(['Crew/Pass'])['Survival'].sum()*100/data.groupby(['Crew/Pass'])['Survival'].count()
Out[18]: Crew/Pass
Crew 90.217391
Passenger 90.310651
Name: Survival, dtype: float64
In [19]: def compare(group,data):
    # Percentage of rows with Survival == 1 within each value of the given column
    return data.groupby([group])['Survival'].sum()*100/data.groupby([group])['Survival'].count()
compare("Class",data)
Out[19]: Class
1st 89.714286
2nd 88.395904
3rd 91.396333
Crew 90.217391
Name: Survival, dtype: float64
In [20]: compare("Gender",data)
Out[20]: Gender
Female 95.840555
Male 88.557743
Name: Survival, dtype: float64
In [21]: compare("Adult/Child",data)
Out[21]: Adult/Child
Adult 89.699955
Child 95.964126
Name: Survival, dtype: float64
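Because Survival only holds 0 and 1, the sum divided by the count is simply the mean, so the percentage per group can also be written with a single groupby. A small sketch on a made-up five-row frame (the toy values are assumptions for illustration, not rows from the Titanic data):

```python
import pandas as pd

# Toy data, invented for this example
toy = pd.DataFrame({
    "Gender":   ["Male", "Male", "Female", "Female", "Male"],
    "Survival": [1, 0, 1, 1, 0],
})

# Same formula as the compare() helper: sum * 100 / count
by_sum_count = (toy.groupby("Gender")["Survival"].sum() * 100
                / toy.groupby("Gender")["Survival"].count())

# Equivalent shorter form: the mean of a 0/1 column is the share of ones
by_mean = toy.groupby("Gender")["Survival"].mean() * 100

print(by_mean)
```

Both series hold the same numbers; the mean version groups the data only once.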
In [22]: trainingData=data[["Age","Crew/Pass","Class","Adult/Child","Gender","Survival"]]
trainingData.head()
Out[22]:
Age Crew/Pass Class Adult/Child Gender Survival
0 27.0 Passenger 3rd Adult Male 1
1 42.0 Passenger 3rd Adult Male 1
2 39.0 Passenger 3rd Adult Female 1
3 16.0 Passenger 3rd Child Male 0
4 13.0 Passenger 3rd Child Male 1
In [23]: def catToNum(series):
    # Encode a categorical column as integer codes (categories are sorted alphabetically)
    series = series.astype('category')
    return series.cat.codes
# Take an explicit copy so that the assignment below modifies an independent DataFrame
# rather than a slice of data (assigning to the slice raises SettingWithCopyWarning)
trainingData = trainingData.copy()
catData=trainingData[["Crew/Pass","Class","Adult/Child","Gender"]].apply(catToNum)
trainingData[["Crew/Pass","Class","Adult/Child","Gender"]]=catData
trainingData.head()
Out[23]:
Age Crew/Pass Class Adult/Child Gender Survival
0 27.0 1 2 0 1 1
1 42.0 1 2 0 1 1
2 39.0 1 2 0 0 1
3 16.0 1 2 1 1 0
4 13.0 1 2 1 1 1
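To read the encoded table it helps to know which label each integer stands for. With astype('category') on strings, pandas sorts the categories and numbers them from 0. A short sketch on a made-up Series (the variable names classes and mapping are assumptions for this example):

```python
import pandas as pd

# Example labels in the same style as the Class column
classes = pd.Series(["3rd", "1st", "Crew", "2nd", "3rd"]).astype("category")

# Categories are sorted; a label's position in this list is its code
print(list(classes.cat.categories))   # ['1st', '2nd', '3rd', 'Crew']

# Code -> label lookup, useful when interpreting the encoded columns
mapping = dict(enumerate(classes.cat.categories))
print(mapping)                        # {0: '1st', 1: '2nd', 2: '3rd', 3: 'Crew'}

codes = classes.cat.codes.tolist()
print(codes)                          # [2, 0, 3, 1, 2]
```

This matches the table above, where "3rd" appears as 2 in the Class column.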
In [24]: trainingData = trainingData.dropna()
len(trainingData)
Out[24]: 2426
In [25]: from sklearn.model_selection import train_test_split
train, test = train_test_split(trainingData, test_size = 0.2)
In [26]: test.head()
Out[26]:
Age Crew/Pass Class Adult/Child Gender Survival
1990 30.0 0 3 0 1 1
485 32.0 0 3 0 1 1
1591 17.0 1 1 1 1 1
1704 31.0 0 3 0 1 1
2318 34.0 0 3 0 1 1
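train_test_split shuffles the rows randomly, so the split above differs on every run. If a repeatable split is wanted, random_state fixes the shuffle, and stratify keeps the class proportions the same in both parts. A sketch on a made-up frame (the toy data and the value 42 are arbitrary choices for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 50 rows, 40 with Survival == 1 and 10 with Survival == 0
toy = pd.DataFrame({"Age": range(20, 70), "Survival": [1] * 40 + [0] * 10})

# random_state makes the shuffle repeatable; stratify preserves the 80/20 class ratio
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Survival"]
)

print(len(train_toy), len(test_toy))
print(test_toy["Survival"].value_counts())
```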
In [27]: #n_estimators specifies the number of trees to have
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=1000,max_leaf_nodes=15)
In [28]: clf
Out[28]: RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=15,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
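Two of the parameters printed above are worth a closer look. With bootstrap=True every tree sees only a sample of the rows, so setting oob_score=True lets the forest score itself on the rows each tree did not see. After fitting, feature_importances_ reports how much each feature contributed to the splits. A sketch on synthetic data (make_classification, not the Titanic data; n_estimators=200 is an arbitrary choice):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 5 features, used only for illustration
X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, max_leaf_nodes=15,
                            oob_score=True, random_state=0)
rf.fit(X, y)

# Accuracy estimated on the out-of-bag rows, without a separate test set
print("OOB score:", rf.oob_score_)

# One value per feature; the values sum to 1
print("feature importances:", rf.feature_importances_)
```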
In [31]: from sklearn.metrics import accuracy_score
In [32]: def checkAccuracy(clf):
    # Fit on the training split, then score the predictions on the held-out test split
    features = ["Age","Crew/Pass","Class","Adult/Child","Gender"]
    clf=clf.fit(train[features],train["Survival"])
    predictions = clf.predict(test[features])
    return accuracy_score(test["Survival"], predictions)
In [33]: checkAccuracy(clf)
Out[33]: 0.8930041152263375
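An accuracy figure is easier to judge next to a baseline. The survival percentages recorded earlier are close to 90% in every group, so a model that always predicts the majority class would already score about that much. The sketch below shows how such a baseline can be measured with scikit-learn's DummyClassifier; the labels are synthetic (about 90% ones) and only stand in for an imbalanced target.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic, imbalanced labels: roughly 90% ones (illustration only)
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = (rng.rand(1000) < 0.9).astype(int)

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))

print("share of the majority class:", y_toy.mean())
print("baseline accuracy:", baseline_acc)
```

A classifier is only adding value to the extent that it beats this number.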
In [34]: #There are known issues while installing xgboost on Windows. Hence, the code below is commented out
In [35]: #from xgboost.sklearn import XGBClassifier
In [36]: #clf = XGBClassifier(n_estimators=1000)
In [37]: #checkAccuracy(clf)
In [38]: #clf
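If xgboost cannot be installed, a boosting model can still be tried with scikit-learn alone. GradientBoostingClassifier is a different implementation from XGBClassifier, so results will not be identical, but it follows the same idea of building trees one after another, each correcting the previous ones. A sketch on synthetic data (the data and parameter values are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration (not the Titanic data)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Trees are added sequentially; n_estimators is the number of boosting stages
gb = GradientBoostingClassifier(n_estimators=100, random_state=0)
gb.fit(X_train, y_train)

gb_acc = accuracy_score(y_test, gb.predict(X_test))
print("gradient boosting accuracy:", gb_acc)
```

Since it exposes the same fit/predict interface, such a classifier could be passed to a helper like checkAccuracy in place of the random forest.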