Lecture 20: Bagging, Random Forests, Boosting
Reading: Chapter 8
STATS 202: Data mining and analysis
November 13, 2017
Classification and Regression trees, in a nutshell
▶ Grow the tree by recursively splitting the samples in the leaf Ri according to Xj > s, such that (Ri, Xj, s) maximizes the drop in RSS.
→ Greedy algorithm.
▶ Create a sequence of subtrees T0, T1, ..., Tm using a pruning algorithm.
▶ Select the best tree Ti (or the best α) by cross-validation (see the R sketch below).
→ Why might it be better to choose α instead of the tree Ti by cross-validation?
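A minimal R sketch of this grow / prune / cross-validate workflow, using the tree package; the data frame df and its numeric response y are hypothetical placeholders.

# Grow, prune, and cross-validate a regression tree.
# `df` and its response `y` are hypothetical placeholders.
library(tree)

fit <- tree(y ~ ., data = df)             # grow by greedy binary splitting
cv  <- cv.tree(fit)                       # cost-complexity cross-validation over alpha
best.size <- cv$size[which.min(cv$dev)]   # subtree size with the lowest CV error
pruned <- prune.tree(fit, best = best.size)
plot(pruned); text(pruned, pretty = 0)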
Example. Heart dataset.
How do we deal with categorical predictors?
[Figure: classification tree fit to the Heart data. Splits involve Thal, Ca, Slope, Oldpeak, MaxHR, ChestPain, Age, RestECG, RestBP, Chol, and Sex; the leaves are labeled Yes/No.]
Categorical predictors
▶ If there are only 2 categories, then the split is obvious. We don’t have to choose the splitting point s, as for a numerical variable.
▶ If there are more than 2 categories:
  ▶ Order the categories according to the average of the response:
    ChestPain:a > ChestPain:c > ChestPain:b
  ▶ Treat it as a numerical variable with this ordering, and choose a splitting point s (sketched below).
  ▶ One can show that this is the optimal way of partitioning.
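A small R sketch of the ordering trick; y and chest.pain are simulated stand-ins for a numeric response and a three-level categorical predictor.

# Order the levels by the mean response, then scan splits as if numeric.
set.seed(1)
y <- rnorm(300)
chest.pain <- factor(sample(c("a", "b", "c"), 300, replace = TRUE))

lvl.means <- tapply(y, chest.pain, mean)          # average response per level
ordered.lvls <- names(sort(lvl.means))            # e.g. b < c < a
x <- as.numeric(factor(chest.pain, levels = ordered.lvls))

candidate.s <- seq_along(ordered.lvls)[-1] - 0.5  # split points between ordered levels
rss <- sapply(candidate.s, function(s) {
  left <- x < s
  sum((y[left] - mean(y[left]))^2) + sum((y[!left] - mean(y[!left]))^2)
})
candidate.s[which.min(rss)]                       # best split on the ordered scale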
Missing data
▶ Suppose we can assign every sample to a leaf Ri despite the missing data.
▶ When choosing a new split with variable Xj (growing the tree):
  ▶ Only consider the samples which have the variable Xj.
  ▶ In addition to choosing the best split, choose a second-best split using a different variable, and a third best, ...
▶ To propagate a sample down the tree, if it is missing the variable needed to make a decision, try the second-best decision, or the third best, ... (these back-up splits are sketched below).
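These back-up splits are what CART implementations such as rpart call surrogate splits; a minimal sketch, assuming a hypothetical data frame df with missing values in some predictors and response y.

# Surrogate splits in rpart; `df` and `y` are hypothetical placeholders.
library(rpart)

fit <- rpart(y ~ ., data = df,
             control = rpart.control(maxsurrogate = 3,   # keep up to 3 back-up splits
                                     usesurrogate = 2))  # use them to route samples with NAs
summary(fit)   # lists the primary split and its surrogates at each node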
Bagging
▶ Bagging = Bootstrap Aggregating.
▶ In the Bootstrap, we replicate our dataset by sampling with replacement:
  ▶ Original dataset: x = c(x1, x2, ..., x100)
  ▶ Bootstrap samples:
    boot1 = sample(x, 100, replace = TRUE), ...,
    bootB = sample(x, 100, replace = TRUE).
▶ We used these samples to get the standard error of a parameter estimate:
$$\mathrm{SE}(\hat{\beta}_1) \approx \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\Big(\hat{\beta}_1^{(b)} - \frac{1}{B}\sum_{k=1}^{B}\hat{\beta}_1^{(k)}\Big)^2}$$
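A short R sketch of this bootstrap standard-error estimate for the slope of a simple linear regression; the data are simulated here purely for illustration.

# Bootstrap SE of a regression slope, following the formula above.
set.seed(1)
n <- 100
x <- rnorm(n); y <- 2 * x + rnorm(n)
B <- 1000

beta.hat <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)    # one bootstrap sample
  coef(lm(y[idx] ~ x[idx]))[2]           # its slope estimate beta.hat^(b)
})
se.boot <- sqrt(sum((beta.hat - mean(beta.hat))^2) / (B - 1))
se.boot                                  # compare with the SE from summary(lm(y ~ x))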
Bagging
▶ In Bagging we average the predictions of a model fit to many Bootstrap samples.
Example. Bagging the Lasso
▶ Let ŷ^{L,b} be the prediction of the Lasso applied to the bth bootstrap sample.
▶ Bagging prediction:
$$\hat{y}^{\text{boot}} = \frac{1}{B}\sum_{b=1}^{B} \hat{y}^{L,b}.$$
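A minimal sketch of bagging the Lasso with glmnet; X and X.test are hypothetical predictor matrices, y the response, and the penalty is fixed in advance for simplicity (in practice one might run cv.glmnet inside each bootstrap fit).

# Bagging the Lasso; `X`, `y`, and `X.test` are hypothetical.
library(glmnet)

B <- 100
preds <- sapply(1:B, function(b) {
  idx <- sample(nrow(X), nrow(X), replace = TRUE)           # bootstrap sample
  fit <- glmnet(X[idx, ], y[idx], alpha = 1, lambda = 0.1)  # Lasso fit on that sample
  as.numeric(predict(fit, newx = X.test))                   # y.hat^{L,b}
})
y.hat.boot <- rowMeans(preds)   # average the B predictions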
When does Bagging make sense?
When a regression method or a classifier has a tendency to overfit,
Bagging reduces the variance of the prediction.
▶ When n is large, the empirical distribution is similar to the true distribution of the samples.
▶ Bootstrap samples are like independent realizations of the data.
▶ Bagging amounts to averaging the fits from many independent datasets, which would reduce the variance by a factor of 1/B.
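A small, purely illustrative simulation of this variance-reduction effect: the variance (across simulated datasets) of a bagged prediction at a single point is compared with that of a single deep tree. The reduction is smaller than the idealized 1/B because trees grown on bootstrap samples of the same data are correlated.

# Variance of a single deep tree's prediction at x = 0 vs. a bagged average.
library(rpart)
set.seed(1)

one.dataset <- function(n = 200) {
  x <- runif(n, -1, 1)
  data.frame(x = x, y = sin(3 * x) + rnorm(n, sd = 0.5))
}
tree.pred <- function(df) {            # prediction of one deep tree at x = 0
  fit <- rpart(y ~ x, data = df, control = rpart.control(cp = 0, minsplit = 5))
  predict(fit, newdata = data.frame(x = 0))
}
bagged.pred <- function(df, B = 25) {  # bagged prediction at x = 0
  mean(replicate(B, tree.pred(df[sample(nrow(df), replace = TRUE), ])))
}

single <- replicate(200, tree.pred(one.dataset()))
bagged <- replicate(200, bagged.pred(one.dataset()))
c(var.single = var(single), var.bagged = var(bagged))  # bagged variance is smaller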
Bagging decision trees
▶ Disadvantage: Every time we fit a decision tree to a Bootstrap sample, we get a different tree T^b.
→ Loss of interpretability.
▶ For each predictor, add up the total amount by which the RSS (or Gini index) decreases every time we use the predictor in T^b.
▶ Average this total over the Bootstrap estimates T^1, ..., T^B.
[Figure: variable-importance plot for the Heart data, showing Fbs, RestECG, ExAng, Sex, Slope, Chol, Age, RestBP, MaxHR, Oldpeak, ChestPain, Ca, and Thal against Variable Importance (0 to 100).]
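A sketch of how such a variable-importance plot can be produced with the randomForest package; the Heart.csv file name and its column layout (response AHD) are assumed rather than taken from the slides.

# Variable importance from bagged trees on the (assumed) Heart data.
library(randomForest)

heart <- na.omit(read.csv("Heart.csv", stringsAsFactors = TRUE))
heart$X <- NULL                       # drop a row-index column, if present
p <- ncol(heart) - 1                  # number of predictors (response is AHD)

set.seed(1)
bag <- randomForest(AHD ~ ., data = heart, mtry = p,   # mtry = p: plain bagging
                    ntree = 500, importance = TRUE)
importance(bag)                       # mean decrease in accuracy / Gini per predictor
varImpPlot(bag)                       # bar plot like the one above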
Out-of-bag (OOB) error
▶ To estimate the test error of a bagging estimate, we could use cross-validation.
▶ Each time we draw a Bootstrap sample, we only use about 63% of the observations.
▶ Idea: use the rest of the observations as a test set.
▶ OOB error:
  ▶ For each sample xi, find the predictions ŷ_i^b for all bootstrap samples b which do not contain xi. There should be around 0.37B of them. Average these predictions to obtain ŷ_i^oob.
  ▶ Compute the error (yi − ŷ_i^oob)^2.
  ▶ Average the errors over all observations i = 1, ..., n.
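randomForest reports the OOB error automatically; a short sketch, reusing the hypothetical heart data frame from the previous sketch.

# OOB predictions and OOB error from a bagged classifier.
library(randomForest)

set.seed(1)
bag <- randomForest(AHD ~ ., data = heart, mtry = ncol(heart) - 1, ntree = 300)
head(bag$predicted)                   # OOB prediction y.hat_i^oob for each observation
mean(bag$predicted != heart$AHD)      # OOB misclassification error
plot(bag)                             # OOB error as a function of the number of trees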
Out-of-bag (OOB) error
[Figure: test error and OOB error as a function of the number of trees (0 to 300), for Bagging and RandomForest; the error axis runs from about 0.10 to 0.30.]
The test error decreases as we increase B
(dashed line is the error for a plain decision tree).
Random Forests
Bagging has a problem:
→ The trees produced by different Bootstrap samples can be very similar.
Random Forests:
▶ We fit a decision tree to different Bootstrap samples.
▶ When growing the tree, we select a random sample of m < p predictors to consider at each step.
▶ This leads to very different (or “uncorrelated”) trees from each sample.
▶ Finally, average the predictions of the trees.
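A sketch comparing plain bagging (mtry = p) with a random forest (mtry around sqrt(p)) via their OOB errors, again on the hypothetical heart data frame.

# Bagging vs. random forest: the only change is mtry.
library(randomForest)

p <- ncol(heart) - 1
set.seed(1)
bag <- randomForest(AHD ~ ., data = heart, mtry = p, ntree = 300)
rf  <- randomForest(AHD ~ ., data = heart, mtry = floor(sqrt(p)), ntree = 300)

tail(bag$err.rate[, "OOB"], 1)   # OOB error of bagging
tail(rf$err.rate[, "OOB"], 1)    # OOB error of the random forest (typically lower)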
Random Forests vs. Bagging
[Figure: test error and OOB error vs. number of trees (0 to 300), comparing Bagging and RandomForest.]
Random Forests, choosing m
[Figure: test classification error vs. number of trees (0 to 500) for m = p, m = p/2, and m = √p.]
The optimal m is usually around √p, but m can also be treated as a tuning parameter.
Boosting
1. Set $\hat{f}(x) = 0$, and $r_i = y_i$ for $i = 1, \dots, n$.
2. For $b = 1, \dots, B$, iterate:
   2.1 Fit a decision tree $\hat{f}^b$ with $d$ splits to the response $r_1, \dots, r_n$.
   2.2 Update the prediction: $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)$.
   2.3 Update the residuals: $r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)$.
3. Output the final model:
$$\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x).$$
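A direct R translation of this algorithm, using rpart trees of depth d as the base learner (so depth 1 corresponds to d = 1 split); x and y are simulated purely for illustration.

# Boosting regression trees by hand, following steps 1-3 above.
library(rpart)
set.seed(1)
n <- 200
x <- runif(n, -1, 1)
y <- sin(3 * x) + rnorm(n, sd = 0.3)
df <- data.frame(x = x)

B <- 500; lambda <- 0.01; d <- 1
f.hat <- rep(0, n)                       # 1. f.hat(x) = 0
r <- y                                   #    residuals start at y
for (b in 1:B) {                         # 2. iterate
  fit.b <- rpart(r ~ x, data = data.frame(df, r = r),           # 2.1 fit f.hat^b
                 control = rpart.control(maxdepth = d, cp = 0)) #     to the residuals
  step <- lambda * predict(fit.b, newdata = df)
  f.hat <- f.hat + step                  # 2.2 update the prediction
  r <- r - step                          # 2.3 update the residuals
}
plot(x, y); points(x, f.hat, col = "red", pch = 20)  # 3. final fit at the training x's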
Boosting, intuitively
Boosting learns slowly:
We first use the samples that are easiest to predict, then slowly down-weight these cases, moving on to harder samples.
Boosting vs. random forests
[Figure: test classification error vs. number of trees (0 to 5000) for boosting with depth 1, boosting with depth 2, and a random forest with m = √p.]
The parameter λ = 0.01 in each case.
We can tune the model by CV using λ, d, B.
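A sketch of this tuning with the gbm package; the hypothetical heart data frame from the earlier sketches is reused, with the response recoded to {0, 1} for the Bernoulli loss.

# Boosted classification trees with lambda, d, and B chosen by CV.
library(gbm)

heart$AHD01 <- as.numeric(heart$AHD == "Yes")
set.seed(1)
boost <- gbm(AHD01 ~ . - AHD, data = heart,
             distribution = "bernoulli",
             n.trees = 5000,            # B
             interaction.depth = 2,     # d
             shrinkage = 0.01,          # lambda
             cv.folds = 5)
best.B <- gbm.perf(boost, method = "cv")   # CV-optimal number of trees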