
Random Forest

Uploaded by

nilashish sarkar
Random Forest

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
This file is meant for personal use by nilashishsarkar70@gmail.com only.
Sharing or publishing the contents in part or full is liable for legal action.
Basic steps - Classification algorithms

Profiling -> Differentiation -> Classification

Should I invest in a company – ask the experts

• Employee of XYZ: knows the internal functionality and has insider information; lacks a broader perspective on competitors; has been right 70% of the time.
• Financial Advisor of XYZ: has a perspective on companies vs competition; lacks a view on internal policies; has been right 75% of the time.
• Stock Market Trader: has observed the company's stock price over the past 3 years; knows seasonality trends and market performance; has been right 70% of the time.
• Employee of a competitor: knows the internal functionality of the competitor firms; lacks sight of the company in focus and the external factors; has been right 60% of the time.
• Market Research team: analyzes the customer preference for XYZ's product; unaware of the changes XYZ will bring; has been right 75% of the time.
• Social Media Expert: understands product positioning and changes in customer sentiment over time; unaware of details beyond digital marketing; has been right 65% of the time.

Scenario 1 - Combine all the info – informed decision

• All 6 experts/teams verify that it's a good decision.
• Assumption – decisions are not correlated: all the predictions are assumed to be independent of each other.
• Each person thinks from a different perspective to take the decision and is not influenced by the others.
• The combined accuracy rate improves when we take the voting principle (complementing principle).
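The effect of combining independent opinions can be checked with a little arithmetic. The sketch below takes the six accuracy figures from the experts slide and assumes, as this slide does, that the experts are right or wrong independently of one another. The strict-majority rule (at least 4 of 6 right, ties not counted) is an assumption made here for illustration; the slides do not say how ties are handled.

```python
from itertools import product
from math import prod

# Past accuracy of the six experts, taken from the previous slide.
accuracies = [0.70, 0.75, 0.70, 0.60, 0.75, 0.65]

def vote_distribution(accs):
    """Return P(exactly k experts are right) for k = 0..n, assuming independence."""
    dist = [0.0] * (len(accs) + 1)
    # Enumerate every right/wrong combination across the experts.
    for outcome in product([True, False], repeat=len(accs)):
        p = prod(a if right else 1 - a for a, right in zip(accs, outcome))
        dist[sum(outcome)] += p
    return dist

dist = vote_distribution(accuracies)
p_all_wrong = dist[0]                 # every expert is wrong at the same time
p_at_least_one_right = 1 - p_all_wrong
p_strict_majority = sum(dist[4:])     # assumed rule: at least 4 of 6 are right

print(f"P(all six wrong)       = {p_all_wrong:.7f}")
print(f"P(at least one right)  = {p_at_least_one_right:.7f}")
print(f"P(strict majority right) = {p_strict_majority:.4f}")
```

Under independence the chance that all six are wrong together is the product of the individual error rates, which is why unanimous agreement among independent experts is such strong evidence.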

Scenario 2 – info from similar sources

• 6 experts, all of them employees of XYZ working in the same division.
• Everyone has a propensity of 70% to advocate correctly.
• What if we combine their advice into a single prediction based on voting?
• All the predictions are based on a very similar set of information.
• Similar predictions – the accuracy of voting goes down compared with Scenario 1: when the experts tend to be right or wrong together, the vote adds little over a single expert.
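One way to see why similar sources help less is a toy model of correlation. In the sketch below, `shared` is a hypothetical parameter, not something from the slides: with probability `shared` all six experts give the same answer (because they rely on the same information), and otherwise they vote independently. A 3-3 tie is counted as a coin flip, which is also an assumption.

```python
from math import comb

def majority_accuracy(n, p):
    """P(majority of n independent voters is right); a tie counts as a coin flip."""
    total = 0.0
    for k in range(n + 1):
        prob_k = comb(n, k) * p**k * (1 - p) ** (n - k)
        if 2 * k > n:
            total += prob_k
        elif 2 * k == n:
            total += 0.5 * prob_k
    return total

def mixed_accuracy(n, p, shared):
    """Voting accuracy when, with probability `shared`, everyone gives the same answer."""
    # If all voters agree, the vote is exactly as accurate as one voter (p).
    return shared * p + (1 - shared) * majority_accuracy(n, p)

for shared in (0.0, 0.5, 1.0):
    print(f"shared={shared:.1f} -> voting accuracy {mixed_accuracy(6, 0.70, shared):.4f}")
```

At `shared = 0` this is Scenario 1 with six equally good experts; at `shared = 1` the six votes carry no more information than one 70% expert.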

Ensemble learning
• Machine learning technique that combines several base models in order to produce a single, stronger predictive model.
• Weak classifiers
• Different set of variables for each classifier
• Combine into a single prediction
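The bullets above can be sketched with three deliberately weak classifiers, each looking at a different variable, combined by a vote. The rows are the four-variable toy table that appears later in the deck; the thresholds are hand-picked for illustration (not learned) so that each classifier errs on a different row.

```python
from collections import Counter

# Toy rows from the deck's four-variable table: (X1, X2, X3, X4) and label Y.
rows = [(432, 29, 313, 6), (529, 34, 379, 2), (125, 67, 317, 4), (144, 29, 103, 8)]
labels = ["Yes", "Yes", "No", "No"]

# Three weak classifiers, each using a different variable.
# Thresholds are hypothetical, chosen by hand for this sketch.
weak_classifiers = [
    lambda r: "Yes" if r[1] < 50 else "No",    # uses X2 only
    lambda r: "Yes" if r[2] > 200 else "No",   # uses X3 only
    lambda r: "Yes" if r[0] > 500 else "No",   # uses X1 only
]

def ensemble(r):
    """Combine the weak classifiers into a single prediction by majority vote."""
    votes = [clf(r) for clf in weak_classifiers]
    return Counter(votes).most_common(1)[0][0]

def accuracy(predict):
    """Share of toy rows that `predict` labels correctly."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

for i, clf in enumerate(weak_classifiers, start=1):
    print(f"weak classifier {i}: accuracy {accuracy(clf):.2f}")
print(f"ensemble vote:     accuracy {accuracy(ensemble):.2f}")
```

Because the three classifiers make their mistakes on different rows, the vote corrects each individual error. With only four hand-picked rows this is an illustration of the idea, not evidence of how well it works in general.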

What is a bootstrapped dataset

Original dataset:
Sno  X1   X2  Y
1    432  29  Yes
2    529  34  Yes
3    125  67  No
4    144  29  No

Random sample rows with replacement:

Bootstrapped dataset 1:
Sno  X1   X2  Y
4    144  29  No
2    529  34  Yes
3    125  67  No

Bootstrapped dataset 2:
Sno  X1   X2  Y
3    125  67  No
4    144  29  No
4    144  29  No

Bootstrapped dataset 3:
Sno  X1   X2  Y
3    125  67  No
2    529  34  Yes
3    125  67  No
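Sampling rows with replacement takes only a few lines. The sketch below uses the slide's toy dataset; the function name `bootstrap_sample` and the fixed seed are illustrative choices. The slide draws 3 rows from 4, whereas a bootstrap sample is usually the same size as the original data, so the sample size is left as a parameter.

```python
import random

# Toy dataset from the slide: (Sno, X1, X2, Y)
rows = [
    (1, 432, 29, "Yes"),
    (2, 529, 34, "Yes"),
    (3, 125, 67, "No"),
    (4, 144, 29, "No"),
]

def bootstrap_sample(data, k=None, rng=random):
    """Draw k rows uniformly at random WITH replacement (k defaults to len(data))."""
    k = len(data) if k is None else k
    return rng.choices(data, k=k)

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
for i in range(3):
    sample = bootstrap_sample(rows, k=3, rng=rng)
    print(f"bootstrapped dataset {i + 1}: rows {[r[0] for r in sample]}")
```

Because the draw is with replacement, the same row can appear more than once in a sample while other rows are left out entirely.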

Using a random set of variables every time

Original dataset:
Sno  X1   X2  X3   X4  Y
1    432  29  313  6   Yes
2    529  34  379  2   Yes
3    125  67  317  4   No
4    144  29  103  8   No

Random sample rows with replacement, and a random subset of X variables:

Bootstrapped dataset 1 (X1, X2):
Sno  X1   X2  Y
4    144  29  No
2    529  34  Yes
3    125  67  No

Bootstrapped dataset 2 (X3, X4):
Sno  X3   X4  Y
3    317  4   No
4    103  8   No
4    103  8   No

Bootstrapped dataset 3 (X1, X3):
Sno  X1   X3   Y
3    125  317  No
2    529  379  Yes
3    125  317  No
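The two sources of randomness on this slide, rows sampled with replacement and a random subset of X variables, can be combined in one helper. This is a minimal sketch on the slide's data; the function name and its parameters are hypothetical.

```python
import random

feature_names = ["X1", "X2", "X3", "X4"]
data = [
    {"Sno": 1, "X1": 432, "X2": 29, "X3": 313, "X4": 6, "Y": "Yes"},
    {"Sno": 2, "X1": 529, "X2": 34, "X3": 379, "X4": 2, "Y": "Yes"},
    {"Sno": 3, "X1": 125, "X2": 67, "X3": 317, "X4": 4, "Y": "No"},
    {"Sno": 4, "X1": 144, "X2": 29, "X3": 103, "X4": 8, "Y": "No"},
]

def bootstrap_with_feature_subset(data, feature_names, n_rows, n_features, rng):
    """Sample rows with replacement and keep only a random subset of the X variables."""
    chosen = sorted(rng.sample(feature_names, n_features))  # without replacement
    sampled = rng.choices(data, k=n_rows)                   # with replacement
    keep = ["Sno", *chosen, "Y"]
    return chosen, [{key: row[key] for key in keep} for row in sampled]

rng = random.Random(7)  # fixed seed only so the sketch is repeatable
for i in range(3):
    chosen, sample = bootstrap_with_feature_subset(data, feature_names, 3, 2, rng)
    print(f"dataset {i + 1}: variables {chosen}, rows {[r['Sno'] for r in sample]}")
```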

Basic idea of random forest

• Draw multiple random samples, with replacement, from the data (this sampling approach is called the bootstrap).
• Using a random subset of predictors at each stage, fit a classification (or regression) tree to each sample (and thus obtain a “forest”).
• Combine the predictions/classifications from the individual trees to obtain improved predictions.
• Use voting for classification and averaging for regression (numeric prediction).
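The last step, combining the trees' outputs, differs only in the aggregation rule. A minimal sketch with hypothetical per-tree outputs (the values below are made up for illustration):

```python
from collections import Counter
from statistics import mean

def combine_classification(tree_predictions):
    """Voting: the class predicted by the most trees wins."""
    return Counter(tree_predictions).most_common(1)[0][0]

def combine_regression(tree_predictions):
    """Averaging: the forest's output is the mean of the trees' numeric outputs."""
    return mean(tree_predictions)

# Hypothetical outputs of three trees for a single observation.
print(combine_classification(["Yes", "No", "Yes"]))
print(combine_regression([10.0, 12.0, 14.0]))
```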

Steps in random forest algorithm

Step1 – create a bootstrapped dataset
Step2 – create a decision tree using the bootstrapped dataset. But only use a random subset of variables at each step
Step3 – repeat the same and create multiple trees
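The three steps can be sketched end to end in plain Python. To stay short, each "tree" here is a one-split decision stump chosen by Gini impurity, so the random variable subset is drawn once per tree; a full random forest grows deeper trees and redraws the subset at every split. All function names, `n_trees`, `n_features` and the seed are illustrative assumptions.

```python
import random
from collections import Counter

# Toy data from the slides: (X1, X2, X3, X4) -> Y
X = [(432, 29, 313, 6), (529, 34, 379, 2), (125, 67, 317, 4), (144, 29, 103, 8)]
y = ["Yes", "Yes", "No", "No"]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, features):
    """Find the best single split (feature, threshold) among the allowed features."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # not a real split
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    if best is None:  # no valid split: always predict the majority label
        return (None, None, majority(labels), majority(labels))
    return best[1:]

def predict_stump(stump, row):
    f, t, left_label, right_label = stump
    if f is None:
        return left_label
    return left_label if row[f] <= t else right_label

def fit_forest(X, y, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):                              # Step 3: repeat
        idx = [rng.randrange(len(X)) for _ in X]          # Step 1: bootstrapped dataset
        feats = rng.sample(range(len(X[0])), n_features)  # Step 2: random subset of variables
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Each tree votes; the class with the most votes is the forest's prediction."""
    return majority([predict_stump(s, row) for s in forest])

forest = fit_forest(X, y)
for row, truth in zip(X, y):
    print(row, "->", predict_forest(forest, row), "(actual:", truth + ")")
```

With four rows the output mainly shows the mechanics; the individual predictions depend on the seed.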

Out of bag data points
• When we create a bootstrapped dataset, ~1/3 of the original data does not end up in the bootstrapped dataset
• This is called the out-of-bag dataset

[Figure: the original 4-row dataset (Sno, X1, X2, X3, X4, Y) and the three bootstrapped datasets from the "Using a random set of variables every time" slide]
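The "~1/3" figure follows from the sampling itself: when n rows are drawn with replacement from n rows, a given row is missed on every draw with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The sketch below checks this numerically and lists the out-of-bag rows of one bootstrap draw; it assumes the bootstrap sample has the same size as the original data, and the function names are illustrative.

```python
import math
import random

def oob_fraction(n):
    """Probability that a given row is never picked in n draws with replacement."""
    return (1 - 1 / n) ** n

def oob_indices(n, rng):
    """Row indices (0-based) that one bootstrap draw of size n leaves out."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return sorted(set(range(n)) - drawn)

for n in (4, 10, 100, 1000):
    print(f"n={n:5d}: expected out-of-bag fraction {oob_fraction(n):.4f}")
print(f"limit 1/e = {math.exp(-1):.4f}")

rng = random.Random(3)  # fixed seed only so the sketch is repeatable
print("out-of-bag rows for one draw from 10 rows:", oob_indices(10, rng))
```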

How to calculate accuracy
• OOB samples are used to measure how accurate our random forest is; each sample is predicted only by the trees whose bootstrapped dataset did not contain it
• Accuracy is the proportion of out-of-bag samples correctly classified by the random forest model
• Proportion of OOB samples incorrectly classified – out-of-bag error
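The out-of-bag error can be computed once we know which rows each tree was trained on. In the sketch below the bags are the three bootstrapped datasets from the earlier slides (as 0-based row indices), while the per-tree predictions are hypothetical values made up for illustration.

```python
from collections import Counter

def oob_error(y, bags, predictions):
    """Out-of-bag error.

    y              : true labels, one per row
    bags[t]        : set of row indices tree t was trained on
    predictions[t] : tree t's predicted label for every row
    """
    wrong = scored = 0
    for i, truth in enumerate(y):
        # Only trees that never saw row i are allowed to vote on it.
        votes = [predictions[t][i] for t in range(len(bags)) if i not in bags[t]]
        if not votes:
            continue  # row was in every bag, so it has no OOB prediction
        scored += 1
        if Counter(votes).most_common(1)[0][0] != truth:
            wrong += 1
    return wrong / scored if scored else float("nan")

y = ["Yes", "Yes", "No", "No"]
bags = [{3, 1, 2}, {2, 3}, {2, 1}]     # rows (4,2,3), (3,4,4), (3,2,3) as 0-based indices
predictions = [                        # hypothetical per-tree predictions
    ["Yes", "Yes", "No", "No"],
    ["No", "Yes", "No", "No"],
    ["Yes", "Yes", "No", "Yes"],
]

error = oob_error(y, bags, predictions)
print(f"OOB error = {error:.3f}, OOB accuracy = {1 - error:.3f}")
```

Row 3 (index 2) appears in every bag, so it contributes nothing to the estimate; with many trees this becomes rare.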

How to decide on how many variables to use per step?

• Compare the OOB error for using 2 variables per step, 3 variables and so on
• Choose the number of variables that gives the lowest OOB error
• Typically we start by using the square root of the number of variables
• Then try a few settings above and below that value
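The search described above is small enough to write out. The sketch below builds the candidate settings around the square root of the number of variables and picks the one with the lowest OOB error. The `spread` parameter and the OOB error values are hypothetical; in practice each error would come from fitting a forest with that setting.

```python
import math

def candidate_mtry(n_variables, spread=2):
    """Settings to try: sqrt(number of variables), plus a few above and below."""
    start = max(1, round(math.sqrt(n_variables)))
    return [m for m in range(start - spread, start + spread + 1) if 1 <= m <= n_variables]

def pick_mtry(oob_error_by_mtry):
    """Return the number of variables per step with the lowest OOB error."""
    return min(oob_error_by_mtry, key=oob_error_by_mtry.get)

print(candidate_mtry(16))

# Hypothetical OOB errors, for illustration only.
oob_errors = {2: 0.21, 3: 0.18, 4: 0.19, 5: 0.20, 6: 0.22}
print("chosen number of variables per step:", pick_mtry(oob_errors))
```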

Summary of Random forest

• Consists of a large number of individual decision trees that operate as an ensemble
• Each tree in the random forest spits out a class prediction
• The class with the most votes becomes the model’s prediction
• Fundamental concept - wisdom of crowds
• A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual models.

Overall flow of the RF classification process

• Read csv file
• Feature engineering – convert relevant variables to categorical
• Find baseline Y class % to check class imbalance
• EDA – univariate: boxplot for numeric variables, barplot for categorical variables
• EDA – bivariate: boxplot for numeric X vs categorical Y, stacked bar for categorical X vs Y
• Split into training and test sets
• Build a random forest model
• Tune ntree & mtry
• Predict for train & test
• Model performance: accuracy, sensitivity, specificity, AUC
• Variable importance
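Two of the steps in this flow, the baseline class percentage and the model performance measures, reduce to simple counting. The sketch below uses made-up labels and predictions, treats "Yes" as the positive class (an assumption), and leaves out AUC, which needs predicted probabilities rather than class labels.

```python
from collections import Counter

def class_percentages(labels):
    """Baseline share of each Y class, to check for class imbalance."""
    counts = Counter(labels)
    return {label: 100.0 * n / len(labels) for label, n in counts.items()}

def binary_metrics(y_true, y_pred, positive="Yes"):
    """Accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }

# Hypothetical test-set labels and model predictions.
y_true = ["Yes", "Yes", "No", "No", "No", "Yes"]
y_pred = ["Yes", "No", "No", "No", "Yes", "Yes"]

print(class_percentages(y_true))
print(binary_metrics(y_true, y_pred))
```

Comparing accuracy against the baseline class percentage shows whether the model does better than always predicting the most common class.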
