Random Forest
Applied Multivariate Statistics Spring 2012
Overview
Intuition of Random Forest
The Random Forest Algorithm
De-correlation gives better accuracy
Out-of-bag error (OOB-error)
Variable importance
Intuition of Random Forest
[Figure: three decision trees grown from the same data, splitting on age (young/old), sex (male/female), work status (retired/working), and height (tall/short); each leaf predicts healthy or diseased.]
New sample: old, retired, male, short
Tree predictions: diseased, healthy, diseased
Majority rule: diseased
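To make the majority rule concrete, a two-line R sketch (the three predictions are the ones from the figure):

    # three tree predictions for the new sample
    preds <- c("diseased", "healthy", "diseased")
    # majority rule: the most frequent prediction wins
    names(which.max(table(preds)))   # "diseased"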
The Random Forest Algorithm
Differences to a standard tree:
Train each tree on a bootstrap resample of the data
(Bootstrap resample of a data set with N samples: make a new data set by drawing N samples with replacement; i.e., some samples will probably occur multiple times in the new data set)
For each split, consider only m randomly selected variables
Don't prune
Fit B trees in this way and aggregate the results by averaging (regression) or majority voting (classification); see the sketch below
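A minimal sketch of these settings with the randomForest package (iris is used here only as a stand-in data set): ntree plays the role of B and mtry the role of m.

    library(randomForest)

    set.seed(1)
    # B = 500 unpruned trees, each grown on a bootstrap resample,
    # considering m = 2 randomly selected variables at each split
    rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
    print(rf)   # aggregated majority-vote results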
Why Random Forest works 1/2
Mean Squared Error = Variance + Bias²
If trees are sufficiently deep, they have very small bias
How could we improve the variance over that of a single
tree?
Why Random Forest works 2/2
The variance of the average of B identically distributed trees, each with variance \sigma^2 and pairwise correlation \rho, is

  \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2

The first term decreases if \rho decreases, i.e., if m decreases:
de-correlation gives better accuracy
The second term decreases if the number of trees B increases (irrespective of \rho)
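A small simulation sketch of this decomposition (all numbers are made-up illustration values): B equi-correlated "tree predictions" are averaged, and the empirical variance is compared with the formula above.

    set.seed(1)
    B <- 50; rho <- 0.3; sigma <- 1; nsim <- 10000

    # equi-correlated predictions: a component shared by all trees gives
    # pairwise correlation rho, an independent component gives the rest
    z0    <- rnorm(nsim, sd = sigma)                       # shared
    zi    <- matrix(rnorm(nsim * B, sd = sigma), nsim, B)  # per tree
    trees <- sqrt(rho) * z0 + sqrt(1 - rho) * zi

    var(rowMeans(trees))                      # empirical ensemble variance
    rho * sigma^2 + (1 - rho) * sigma^2 / B   # theoretical value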
Estimating generalization error:
Out-of-bag (OOB) error
Similar to leave-one-out cross-validation, but almost without any additional computational burden
The OOB error is a random number, since it is based on random resamples of the data
Data, resampled data, and out-of-bag samples:
[Figure: a data set of (age, height, health status) samples is resampled with replacement; rows that are never drawn form the out-of-bag set. The tree grown on the resampled data is used to predict the out-of-bag samples.]
Out-of-bag (OOB) error rate: 0.25
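A sketch of reading off the OOB error from a fitted forest (iris again as a stand-in data set):

    library(randomForest)

    set.seed(1)
    rf <- randomForest(Species ~ ., data = iris, ntree = 500)
    # each tree is evaluated on the samples left out of its bootstrap
    # resample; the aggregated OOB error comes for free with the fit
    rf$err.rate[rf$ntree, "OOB"]   # OOB error rate after all 500 trees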
Variable Importance for variable i using Permutations

From the data, draw resampled data sets 1, ..., m and grow Tree 1, ..., Tree m. Tree j is evaluated on its out-of-bag data (OOB Data j), giving OOB error e_j. Then the values of variable i are permuted in OOB data set j and the tree is evaluated again, giving OOB error p_j. With the differences d_j = e_j - p_j:

  \bar{d} = \frac{1}{m} \sum_{j=1}^{m} d_j, \qquad
  s_d^2 = \frac{1}{m-1} \sum_{j=1}^{m} (d_j - \bar{d})^2, \qquad
  v_i = \frac{\bar{d}}{s_d}
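A sketch with the randomForest package: importance = TRUE stores the permutation-based measure, which by default is scaled by its standard error, similar in spirit to v_i above (iris as a stand-in data set).

    library(randomForest)

    set.seed(1)
    rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
    importance(rf, type = 1)  # type 1: permutation (mean decrease in accuracy)
    varImpPlot(rf)            # plots the importance of each variable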
Trees vs. Random Forest

Trees:
+ Yield insight into decision rules
+ Rather fast
+ Easy to tune parameters
- Predictions tend to have a high variance

Random Forest:
+ Has a smaller prediction variance and therefore usually a better general performance
+ Easy to tune parameters
- Rather slow
- Black box: rather hard to get insight into the decision rules
Comparing runtime (just for illustration)
[Figure: runtime of Random Forest vs. a single tree. RF handles up to thousands of variables, but is problematic if there are categorical predictors with many levels (max: 32 levels); in this illustration, the first predictor was cut into 15 levels for RF.]
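A rough timing sketch in the same spirit (simulated data; using rpart for the single tree is my choice for illustration):

    library(randomForest)
    library(rpart)

    set.seed(1)
    n <- 5000; p <- 20
    d <- as.data.frame(matrix(rnorm(n * p), n, p))
    d$y <- factor(d$V1 + rnorm(n) > 0)

    system.time(rpart(y ~ ., data = d))         # one tree: fast
    system.time(randomForest(y ~ ., data = d))  # 500 trees: much slower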
RF vs. LDA

RF:
+ Can model nonlinear class boundaries
+ OOB error for free (no CV needed)
+ Works on continuous and categorical responses (regression / classification)
+ Gives variable importance
+ Very good performance
- Black box
- Slow

LDA:
+ Very fast
+ Discriminants for visualizing group separation
+ Can read off the decision rule
- Can model only linear class boundaries
- Mediocre performance
- No variable selection
- Only for categorical responses (classification)
- Needs CV for estimating the prediction error
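A side-by-side sketch on iris: the forest's OOB error against LDA's leave-one-out CV error (MASS::lda with CV = TRUE):

    library(randomForest)
    library(MASS)

    set.seed(1)
    rf <- randomForest(Species ~ ., data = iris)
    rf$err.rate[rf$ntree, "OOB"]                    # OOB error, no CV needed

    ld <- lda(Species ~ ., data = iris, CV = TRUE)  # leave-one-out CV
    mean(ld$class != iris$Species)                  # CV error estimate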
Concepts to know
Idea of Random Forest and how it reduces the prediction
variance of trees
OOB error
Variable Importance based on Permutation
R functions to know
Functions randomForest and varImpPlot from package randomForest