Lecture 20: Bagging, Random Forests, Boosting
Reading: Chapter 8
STATS 202: Data mining and analysis
November 13, 2017
Classification and Regression trees, in a nutshell
▶ Grow the tree by recursively splitting the samples in the leaf Ri according to Xj > s, such that (Ri, Xj, s) maximizes the drop in RSS.
→ Greedy algorithm.
▶ Create a sequence of subtrees T0, T1, ..., Tm using a pruning algorithm.
▶ Select the best tree Ti (or the best α) by cross-validation (see the R sketch below).
→ Why might it be better to choose α instead of the tree Ti by cross-validation?
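A minimal R sketch of this grow / prune / cross-validate workflow, using the tree package; the data frame df and its numeric response y are hypothetical placeholders.

# Grow, prune, and cross-validate a regression tree.
# `df` and its response `y` are hypothetical placeholders.
library(tree)

fit <- tree(y ~ ., data = df)             # grow by greedy binary splitting
cv  <- cv.tree(fit)                       # cost-complexity cross-validation over alpha
best.size <- cv$size[which.min(cv$dev)]   # subtree size with the lowest CV error
pruned <- prune.tree(fit, best = best.size)
plot(pruned); text(pruned, pretty = 0)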
Example. Heart dataset.
How do we deal with categorical predictors?
[Figure: classification tree fit to the Heart data. Splits involve Thal, Ca, Slope, Oldpeak, MaxHR, ChestPain, Age, RestECG, RestBP, Chol, and Sex; the leaves are labeled Yes/No.]
Categorical predictors
▶ If there are only 2 categories, then the split is obvious. We don’t have to choose the splitting point s, as for a numerical variable.
▶ If there are more than 2 categories:
  ▶ Order the categories according to the average of the response:
    ChestPain:a > ChestPain:c > ChestPain:b
  ▶ Treat it as a numerical variable with this ordering, and choose a splitting point s (sketched below).
  ▶ One can show that this is the optimal way of partitioning.
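A small R sketch of the ordering trick; y and chest.pain are simulated stand-ins for a numeric response and a three-level categorical predictor.

# Order the levels by the mean response, then scan splits as if numeric.
set.seed(1)
y <- rnorm(300)
chest.pain <- factor(sample(c("a", "b", "c"), 300, replace = TRUE))

lvl.means <- tapply(y, chest.pain, mean)          # average response per level
ordered.lvls <- names(sort(lvl.means))            # e.g. b < c < a
x <- as.numeric(factor(chest.pain, levels = ordered.lvls))

candidate.s <- seq_along(ordered.lvls)[-1] - 0.5  # split points between ordered levels
rss <- sapply(candidate.s, function(s) {
  left <- x < s
  sum((y[left] - mean(y[left]))^2) + sum((y[!left] - mean(y[!left]))^2)
})
candidate.s[which.min(rss)]                       # best split on the ordered scale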
Missing data
▶ Suppose we can assign every sample to a leaf Ri despite the missing data.
▶ When choosing a new split with variable Xj (growing the tree):
  ▶ Only consider the samples which have the variable Xj.
  ▶ In addition to choosing the best split, choose a second-best split using a different variable, and a third best, ...
▶ To propagate a sample down the tree, if it is missing the variable needed to make a decision, try the second-best decision, or the third best, ... (these back-up splits are sketched below).
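These back-up splits are what CART implementations such as rpart call surrogate splits; a minimal sketch, assuming a hypothetical data frame df with missing values in some predictors and response y.

# Surrogate splits in rpart; `df` and `y` are hypothetical placeholders.
library(rpart)

fit <- rpart(y ~ ., data = df,
             control = rpart.control(maxsurrogate = 3,   # keep up to 3 back-up splits
                                     usesurrogate = 2))  # use them to route samples with NAs
summary(fit)   # lists the primary split and its surrogates at each node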
Bagging
▶ Bagging = Bootstrap Aggregating.
▶ In the Bootstrap, we replicate our dataset by sampling with replacement:
  ▶ Original dataset: x = c(x1, x2, ..., x100)
  ▶ Bootstrap samples:
    boot1 = sample(x, 100, replace = TRUE), ...,
    bootB = sample(x, 100, replace = TRUE).
▶ We used these samples to get the standard error of a parameter estimate:
$$\mathrm{SE}(\hat{\beta}_1) \approx \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\Big(\hat{\beta}_1^{(b)} - \frac{1}{B}\sum_{k=1}^{B}\hat{\beta}_1^{(k)}\Big)^2}$$
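A short R sketch of this bootstrap standard-error estimate for the slope of a simple linear regression; the data are simulated here purely for illustration.

# Bootstrap SE of a regression slope, following the formula above.
set.seed(1)
n <- 100
x <- rnorm(n); y <- 2 * x + rnorm(n)
B <- 1000

beta.hat <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)    # one bootstrap sample
  coef(lm(y[idx] ~ x[idx]))[2]           # its slope estimate beta.hat^(b)
})
se.boot <- sqrt(sum((beta.hat - mean(beta.hat))^2) / (B - 1))
se.boot                                  # compare with the SE from summary(lm(y ~ x))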
Bagging
▶ In Bagging we average the predictions of a model fit to many Bootstrap samples.
Example. Bagging the Lasso
▶ Let ŷ^{L,b} be the prediction of the Lasso applied to the bth bootstrap sample.
▶ Bagging prediction:
$$\hat{y}^{\text{boot}} = \frac{1}{B}\sum_{b=1}^{B} \hat{y}^{L,b}.$$
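A minimal sketch of bagging the Lasso with glmnet; X and X.test are hypothetical predictor matrices, y the response, and the penalty is fixed in advance for simplicity (in practice one might run cv.glmnet inside each bootstrap fit).

# Bagging the Lasso; `X`, `y`, and `X.test` are hypothetical.
library(glmnet)

B <- 100
preds <- sapply(1:B, function(b) {
  idx <- sample(nrow(X), nrow(X), replace = TRUE)           # bootstrap sample
  fit <- glmnet(X[idx, ], y[idx], alpha = 1, lambda = 0.1)  # Lasso fit on that sample
  as.numeric(predict(fit, newx = X.test))                   # y.hat^{L,b}
})
y.hat.boot <- rowMeans(preds)   # average the B predictions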
When does Bagging make sense?
When a regression method or a classifier has a tendency to overfit,
Bagging reduces the variance of the prediction.
▶ When n is large, the empirical distribution is similar to the true distribution of the samples.
▶ Bootstrap samples are like independent realizations of the data.
▶ Bagging amounts to averaging the fits from many independent datasets, which would reduce the variance by a factor of 1/B.
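A small, purely illustrative simulation of this variance-reduction effect: the variance (across simulated datasets) of a bagged prediction at a single point is compared with that of a single deep tree. The reduction is smaller than the idealized 1/B because trees grown on bootstrap samples of the same data are correlated.

# Variance of a single deep tree's prediction at x = 0 vs. a bagged average.
library(rpart)
set.seed(1)

one.dataset <- function(n = 200) {
  x <- runif(n, -1, 1)
  data.frame(x = x, y = sin(3 * x) + rnorm(n, sd = 0.5))
}
tree.pred <- function(df) {            # prediction of one deep tree at x = 0
  fit <- rpart(y ~ x, data = df, control = rpart.control(cp = 0, minsplit = 5))
  predict(fit, newdata = data.frame(x = 0))
}
bagged.pred <- function(df, B = 25) {  # bagged prediction at x = 0
  mean(replicate(B, tree.pred(df[sample(nrow(df), replace = TRUE), ])))
}

single <- replicate(200, tree.pred(one.dataset()))
bagged <- replicate(200, bagged.pred(one.dataset()))
c(var.single = var(single), var.bagged = var(bagged))  # bagged variance is smaller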
Bagging decision trees
▶ Disadvantage: Every time we fit a decision tree to a Bootstrap sample, we get a different tree T^b.
→ Loss of interpretability.
▶ For each predictor, add up the total amount by which the RSS (or Gini index) decreases every time we use the predictor in T^b.
▶ Average this total over the Bootstrap estimates T^1, ..., T^B.
[Figure: variable-importance plot for the Heart data, showing Fbs, RestECG, ExAng, Sex, Slope, Chol, Age, RestBP, MaxHR, Oldpeak, ChestPain, Ca, and Thal against Variable Importance (0 to 100).]
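A sketch of how such a variable-importance plot can be produced with the randomForest package; the Heart.csv file name and its column layout (response AHD) are assumed rather than taken from the slides.

# Variable importance from bagged trees on the (assumed) Heart data.
library(randomForest)

heart <- na.omit(read.csv("Heart.csv", stringsAsFactors = TRUE))
heart$X <- NULL                       # drop a row-index column, if present
p <- ncol(heart) - 1                  # number of predictors (response is AHD)

set.seed(1)
bag <- randomForest(AHD ~ ., data = heart, mtry = p,   # mtry = p: plain bagging
                    ntree = 500, importance = TRUE)
importance(bag)                       # mean decrease in accuracy / Gini per predictor
varImpPlot(bag)                       # bar plot like the one above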
Out-of-bag (OOB) error
▶ To estimate the test error of a bagging estimate, we could use cross-validation.
▶ Each time we draw a Bootstrap sample, we only use about 63% of the observations.
▶ Idea: use the rest of the observations as a test set.
▶ OOB error:
  ▶ For each sample xi, find the predictions ŷ_i^b for all bootstrap samples b which do not contain xi. There should be around 0.37B of them. Average these predictions to obtain ŷ_i^oob.
  ▶ Compute the error (yi − ŷ_i^oob)^2.
  ▶ Average the errors over all observations i = 1, ..., n.
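randomForest reports the OOB error automatically; a short sketch, reusing the hypothetical heart data frame from the previous sketch.

# OOB predictions and OOB error from a bagged classifier.
library(randomForest)

set.seed(1)
bag <- randomForest(AHD ~ ., data = heart, mtry = ncol(heart) - 1, ntree = 300)
head(bag$predicted)                   # OOB prediction y.hat_i^oob for each observation
mean(bag$predicted != heart$AHD)      # OOB misclassification error
plot(bag)                             # OOB error as a function of the number of trees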
Out-of-bag (OOB) error
[Figure: test error and OOB error as a function of the number of trees (0 to 300), for Bagging and RandomForest; the error axis runs from about 0.10 to 0.30.]
The test error decreases as we increase B
(dashed line is the error for a plain decision tree).
Random Forests
Bagging has a problem:
→ The trees produced by different Bootstrap samples can be very similar.
Random Forests:
▶ We fit a decision tree to different Bootstrap samples.
▶ When growing the tree, we select a random sample of m < p predictors to consider at each step.
▶ This leads to very different (or “uncorrelated”) trees from each sample.
▶ Finally, average the predictions of the trees.
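A sketch comparing plain bagging (mtry = p) with a random forest (mtry around sqrt(p)) via their OOB errors, again on the hypothetical heart data frame.

# Bagging vs. random forest: the only change is mtry.
library(randomForest)

p <- ncol(heart) - 1
set.seed(1)
bag <- randomForest(AHD ~ ., data = heart, mtry = p, ntree = 300)
rf  <- randomForest(AHD ~ ., data = heart, mtry = floor(sqrt(p)), ntree = 300)

tail(bag$err.rate[, "OOB"], 1)   # OOB error of bagging
tail(rf$err.rate[, "OOB"], 1)    # OOB error of the random forest (typically lower)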
Random Forests vs. Bagging
[Figure: test error and OOB error vs. number of trees (0 to 300), comparing Bagging and RandomForest.]
Random Forests, choosing m
[Figure: test classification error vs. number of trees (0 to 500) for m = p, m = p/2, and m = √p.]
The optimal m is usually around √p, but m can also be treated as a tuning parameter.
Boosting
1. Set $\hat{f}(x) = 0$, and $r_i = y_i$ for $i = 1, \dots, n$.
2. For $b = 1, \dots, B$, iterate:
   2.1 Fit a decision tree $\hat{f}^b$ with $d$ splits to the response $r_1, \dots, r_n$.
   2.2 Update the prediction: $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)$.
   2.3 Update the residuals: $r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)$.
3. Output the final model:
$$\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x).$$
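A direct R translation of this algorithm, using rpart trees of depth d as the base learner (so depth 1 corresponds to d = 1 split); x and y are simulated purely for illustration.

# Boosting regression trees by hand, following steps 1-3 above.
library(rpart)
set.seed(1)
n <- 200
x <- runif(n, -1, 1)
y <- sin(3 * x) + rnorm(n, sd = 0.3)
df <- data.frame(x = x)

B <- 500; lambda <- 0.01; d <- 1
f.hat <- rep(0, n)                       # 1. f.hat(x) = 0
r <- y                                   #    residuals start at y
for (b in 1:B) {                         # 2. iterate
  fit.b <- rpart(r ~ x, data = data.frame(df, r = r),           # 2.1 fit f.hat^b
                 control = rpart.control(maxdepth = d, cp = 0)) #     to the residuals
  step <- lambda * predict(fit.b, newdata = df)
  f.hat <- f.hat + step                  # 2.2 update the prediction
  r <- r - step                          # 2.3 update the residuals
}
plot(x, y); points(x, f.hat, col = "red", pch = 20)  # 3. final fit at the training x's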
Boosting, intuitively
Boosting learns slowly:
We first use the samples that are easiest to predict, then slowly down-weight these cases, moving on to harder samples.
Boosting vs. random forests
[Figure: test classification error vs. number of trees (0 to 5000) for boosting with depth 1, boosting with depth 2, and a random forest with m = √p.]
The parameter λ = 0.01 in each case.
We can tune the model by CV using λ, d, B.
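A sketch of this tuning with the gbm package; the hypothetical heart data frame from the earlier sketches is reused, with the response recoded to {0, 1} for the Bernoulli loss.

# Boosted classification trees with lambda, d, and B chosen by CV.
library(gbm)

heart$AHD01 <- as.numeric(heart$AHD == "Yes")
set.seed(1)
boost <- gbm(AHD01 ~ . - AHD, data = heart,
             distribution = "bernoulli",
             n.trees = 5000,            # B
             interaction.depth = 2,     # d
             shrinkage = 0.01,          # lambda
             cv.folds = 5)
best.B <- gbm.perf(boost, method = "cv")   # CV-optimal number of trees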