Ensemble Techniques
Ensemble Methods
Construct a set of base classifiers learned from
the training data
Predict class label of test records by combining
the predictions made by multiple classifiers (e.g.,
by taking majority vote)
Example: Why Do Ensemble Methods Work?
Necessary Conditions for Ensemble Methods
Ensemble Methods work better than a single base classifier if:
1. All base classifiers are independent of each other
2. All base classifiers perform better than random guessing
(error rate < 0.5 for binary classification)
[Figure: classification error for an ensemble of 25 base classifiers, assuming their errors are uncorrelated]
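Under these conditions the ensemble errs only when more than half of the base classifiers err, so the ensemble error is a binomial tail sum. A minimal Python sketch, assuming 25 independent base classifiers with a common error rate:

```python
from math import comb

def ensemble_error(eps, n=25):
    """Probability that a majority of n independent base classifiers,
    each with error rate eps, are wrong (majority-vote ensemble error)."""
    k = n // 2 + 1  # minimum number of wrong votes for the ensemble to err
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(k, n + 1))

print(ensemble_error(0.35))  # ~0.06: far below the base error rate of 0.35
print(ensemble_error(0.5))   # 0.5: no gain when base classifiers guess randomly
```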
Rationale for Ensemble Learning
Ensemble Methods work best with unstable
base classifiers
– Classifiers that are sensitive to minor perturbations in
training set, due to high model complexity
– Examples: Unpruned decision trees, ANNs, …
Bias-Variance Decomposition
Analogous problem of reaching a target y by firing
projectiles from x (regression problem)
For classification, the generalization error of model 𝑚 can
be given by:
gen. error(m) = c1 + bias(m) + c2 × variance(m)
In general:
– Bias is reflected in the training error; a complex model has low bias.
– Variance is reflected in the error on future (unseen) data; a complex model has high variance.
– Bagging reduces the variance of the base classifiers.
Bias-Variance Trade-off and Overfitting
[Figure: bias–variance trade-off, with the underfitting and overfitting regimes of model complexity]
Ensemble methods try to reduce the variance of complex
models (with low bias) by aggregating responses of
multiple base classifiers
General Approach of Ensemble Learning
Combine the base classifiers' predictions using majority vote or weighted majority vote (weighted according to each classifier's accuracy or relevance)
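A small sketch of this combination step; weighted_majority_vote is an illustrative helper, and the weights are assumed to reflect each base classifier's accuracy or relevance:

```python
from collections import defaultdict

def weighted_majority_vote(predictions, weights=None):
    """Combine base-classifier predictions; with no weights this is a plain majority vote."""
    if weights is None:
        weights = [1.0] * len(predictions)
    score = defaultdict(float)
    for label, w in zip(predictions, weights):
        score[label] += w          # accumulate each classifier's (weighted) vote
    return max(score, key=score.get)

print(weighted_majority_vote([1, 1, -1]))                           # 1
print(weighted_majority_vote([1, 1, -1], weights=[0.2, 0.3, 2.0]))  # -1
```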
Constructing Ensemble Classifiers
By manipulating training set
– Example: bagging, boosting, random forests
By manipulating input features
– Example: random forests
By manipulating class labels
– Example: error-correcting output coding
By manipulating learning algorithm
– Example: injecting randomness in the initial weights of ANN
Bagging (Bootstrap AGGregating)
Bootstrap sampling: sampling with replacement
Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7
Build classifier on each bootstrap sample
Probability of a training instance being selected in
a bootstrap sample is:
➢ 1 − (1 − 1/n)^n (n: number of training instances)
➢ ≈ 0.632 when n is large
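A quick numerical check of this probability (a sketch; the exact limit is 1 − 1/e ≈ 0.632):

```python
def p_selected(n):
    """Probability that a given instance appears at least once in a bootstrap sample of size n."""
    return 1 - (1 - 1.0 / n) ** n

for n in (10, 100, 1000, 10**6):
    print(n, round(p_selected(n), 4))   # approaches 1 - 1/e ≈ 0.632 as n grows
```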
Bagging Algorithm
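A compact sketch of the bagging procedure for ±1 class labels, assuming X and y are NumPy arrays and scikit-learn decision trees serve as the (unstable) base classifiers; the function names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_rounds=10, seed=0):
    """Train one base classifier per bootstrap sample (sampling with replacement)."""
    n = len(X)
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_rounds):
        idx = rng.integers(0, n, size=n)              # bootstrap sample of size n
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote for ±1 labels: the sign of the summed predictions."""
    votes = np.sum([m.predict(X) for m in models], axis=0)
    return np.sign(votes)
```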
Bagging Example
Consider 1-dimensional data set:
Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Classifier is a decision stump (decision tree of size 1)
– Decision rule: x ≤ k versus x > k
– Split point k is chosen based on entropy
[Diagram: decision stump: if x ≤ k predict y_left, else predict y_right]
Bagging Example
Bagging Round 1:
x: 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y: 1 1 1 1 -1 -1 -1 -1 1 1
Learned stump: x ≤ 0.35 → y = 1; x > 0.35 → y = -1
Bagging Round 2:
x: 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1
y: 1 1 1 -1 -1 -1 1 1 1 1
Learned stump: x ≤ 0.7 → y = 1; x > 0.7 → y = 1
Bagging Round 3:
x: 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y: 1 1 1 -1 -1 -1 -1 -1 1 1
Learned stump: x ≤ 0.35 → y = 1; x > 0.35 → y = -1
Bagging Round 4:
x: 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y: 1 1 1 -1 -1 -1 -1 -1 1 1
Learned stump: x ≤ 0.3 → y = 1; x > 0.3 → y = -1
Bagging Round 5:
x: 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1
y: 1 1 1 -1 -1 -1 -1 1 1 1
Learned stump: x ≤ 0.35 → y = 1; x > 0.35 → y = -1
Bagging Example
Bagging Round 6:
x: 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1
y: 1 -1 -1 -1 -1 -1 -1 1 1 1
Learned stump: x ≤ 0.75 → y = -1; x > 0.75 → y = 1
Bagging Round 7:
x: 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1
y: 1 -1 -1 -1 -1 1 1 1 1 1
Learned stump: x ≤ 0.75 → y = -1; x > 0.75 → y = 1
Bagging Round 8:
x: 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1
y: 1 1 -1 -1 -1 -1 -1 1 1 1
Learned stump: x ≤ 0.75 → y = -1; x > 0.75 → y = 1
Bagging Round 9:
x: 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1 1
y: 1 1 -1 -1 -1 -1 -1 1 1 1
Learned stump: x ≤ 0.75 → y = -1; x > 0.75 → y = 1
Bagging Round 10:
x: 0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9
y: 1 1 1 1 1 1 1 1 1 1
Learned stump: x ≤ 0.05 → y = 1; x > 0.05 → y = 1
Bagging Example
Summary of Trained Decision Stumps:
Round Split Point Left Class Right Class
1 0.35 1 -1
2 0.7 1 1
3 0.35 1 -1
4 0.3 1 -1
5 0.35 1 -1
6 0.75 -1 1
7 0.75 -1 1
8 0.75 -1 1
9 0.75 -1 1
10 0.05 1 1
Bagging Example
Use majority vote (sign of sum of predictions) to
determine class of ensemble classifier
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 1 1 1 -1 -1 -1 -1 -1 -1 -1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
4 1 1 1 -1 -1 -1 -1 -1 -1 -1
5 1 1 1 -1 -1 -1 -1 -1 -1 -1
6 -1 -1 -1 -1 -1 -1 -1 1 1 1
7 -1 -1 -1 -1 -1 -1 -1 1 1 1
8 -1 -1 -1 -1 -1 -1 -1 1 1 1
9 -1 -1 -1 -1 -1 -1 -1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Sum 2 2 2 -6 -6 -6 -6 2 2 2
Predicted class (sign) 1 1 1 -1 -1 -1 -1 1 1 1
Bagging can also increase the complexity (representation
capacity) of simple classifiers such as decision stumps
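The ensemble prediction above can be reproduced directly from the summary of the ten trained stumps; a short sketch using the split points and class labels from the table:

```python
import numpy as np

# (split point k, left class, right class) for the 10 bagging rounds
stumps = [(0.35, 1, -1), (0.7, 1, 1), (0.35, 1, -1), (0.3, 1, -1), (0.35, 1, -1),
          (0.75, -1, 1), (0.75, -1, 1), (0.75, -1, 1), (0.75, -1, 1), (0.05, 1, 1)]

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
votes = sum(np.where(x <= k, left, right) for k, left, right in stumps)
print(votes)           # [ 2  2  2 -6 -6 -6 -6  2  2  2]
print(np.sign(votes))  # [ 1  1  1 -1 -1 -1 -1  1  1  1] -- matches the original labels y
```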
Boosting
An iterative procedure to adaptively change
distribution of training data by focusing more on
previously misclassified records
– Initially, all N records are assigned equal
weights
– Unlike bagging, weights may change at the
end of a boosting round
Boosting
Records that are wrongly classified will have
their weights increased
Records that are classified correctly will have
their weights decreased
Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4
• Example 4 is hard to classify
• Its weight is increased, therefore it is more likely
to be chosen again in subsequent rounds
Boosting
Equal weights (1/N) are assigned to each training instance in the first round
After a classifier Ci is learned, the weights are
adjusted to allow the subsequent classifier
Ci+1 to “pay more attention” to data that were
misclassified by Ci.
Final boosted classifier C* combines the
votes of each individual classifier
– Weight of each classifier’s vote is a function of its
accuracy
AdaBoost – a popular boosting algorithm
AdaBoost (Adaptive Boosting)
Input:
– Training set D containing N instances
– T rounds
– A classification learning scheme
Output:
– A composite model
AdaBoost: Training Phase
Training data D contain N labeled data (X1,y1),
(X2,y2 ), (X3,y3),….(XN,yN)
Initially, assign equal weight 1/N to each training instance
To generate T base classifiers, we need T
rounds or iterations
In round i, data from D are sampled with replacement to form Di (of size N)
Each instance's chance of being selected in subsequent rounds depends on its weight
– Each new sample is generated directly from the training data D with sampling probabilities set according to the weights; these weights are never zero
AdaBoost: Training Phase
A base classifier Ci is derived from the training data in Di
The error of Ci is then measured using Di
Weights of the training data are adjusted depending on how they were classified:
– Correctly classified: decrease weight
– Incorrectly classified: increase weight
An instance's weight indicates how hard it is to classify (the harder the instance, the higher the weight)
AdaBoost: Testing Phase
The lower a classifier's error rate, the more accurate it is, and therefore the higher its weight for voting should be
The weight of classifier Ci's vote is:
α_i = (1/2) · ln((1 − ε_i) / ε_i)
Testing:
– For each class c, sum the weights of each classifier that
assigned class c to X (unseen data)
– The class with the highest sum is the WINNER!
C*(x_test) = argmax_y Σ_{i=1}^{T} α_i · δ(C_i(x_test) = y)
where δ(·) equals 1 if its argument is true and 0 otherwise
Example: Error and Classifier Weight in
AdaBoost
Base classifiers: C1, C2, …, CT
Error rate (i = index of the classifier, j = index of the instance):
ε_i = (1/N) · Σ_{j=1}^{N} w_j · δ(C_i(x_j) ≠ y_j)
Importance of a classifier:
α_i = (1/2) · ln((1 − ε_i) / ε_i)
Example: Data Instance Weight
in AdaBoost
Assume: N training data in D, T rounds, (x_j, y_j) are the training data, and C_i, α_i are the classifier and classifier weight of the i-th round, respectively.
Weight update on all training data in D:
w_j^(i+1) = (w_j^(i) / Z_i) × exp(−α_i)  if C_i(x_j) = y_j
w_j^(i+1) = (w_j^(i) / Z_i) × exp(+α_i)  if C_i(x_j) ≠ y_j
where Z_i is the normalization factor (so that the updated weights sum to 1)
C*(x_test) = argmax_y Σ_{i=1}^{T} α_i · δ(C_i(x_test) = y)
AdaBoost Algorithm
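A minimal sketch of the training and testing phases described above, for labels in {−1, +1}, with X and y as NumPy arrays; build_stump(X, y, w) is an assumed, caller-supplied helper that trains a base classifier on the weighted data and exposes a predict method:

```python
import numpy as np

def adaboost_fit(X, y, build_stump, T=10):
    """AdaBoost sketch: X, y are NumPy arrays, y contains -1/+1 labels."""
    n = len(X)
    w = np.full(n, 1.0 / n)                  # round 1: equal weights 1/N
    classifiers, alphas = [], []
    for _ in range(T):
        clf = build_stump(X, y, w)
        pred = clf.predict(X)
        eps = np.sum(w[pred != y])           # weighted error of this round's classifier
        if eps >= 0.5:                       # no better than random guessing: reset weights
            w = np.full(n, 1.0 / n)
            continue
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        w = w * np.exp(-alpha * y * pred)    # decrease if correct, increase if wrong
        w = w / w.sum()                      # normalize (the Z_i factor)
        classifiers.append(clf)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Weighted vote: sign of the alpha-weighted sum of base predictions."""
    scores = sum(a * c.predict(X) for c, a in zip(classifiers, alphas))
    return np.sign(scores)
```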
AdaBoost Example
Consider 1-dimensional data set:
Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Classifier is a decision stump
– Decision rule: x ≤ k versus x > k
– Split point k is chosen based on entropy
[Diagram: decision stump: if x ≤ k predict y_left, else predict y_right]
AdaBoost Example
Training sets for the first 3 boosting rounds:
Boosting Round 1:
x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1
y 1 -1 -1 -1 -1 -1 -1 -1 1 1
Boosting Round 2:
x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1
Boosting Round 3:
x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1
Summary:
Round Split Point Left Class Right Class alpha
1 0.75 -1 1 1.738
2 0.05 1 1 2.7784
3 0.3 1 -1 4.1195
AdaBoost Example
Weights of the training instances in each boosting round:
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
2 0.311 0.311 0.311 0.01 0.01 0.01 0.01 0.01 0.01 0.01
3 0.029 0.029 0.029 0.228 0.228 0.228 0.228 0.009 0.009 0.009

Predictions of each round's classifier and the weighted vote:
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 -1 -1 -1 -1 -1 -1 -1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
Sum 5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
Predicted class (sign) 1 1 1 -1 -1 -1 -1 1 1 1
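The Sum row can be checked by weighting each round's prediction with its α from the summary table; for example, at x = 0.4 and x = 0.8:

```python
alphas = [1.738, 2.7784, 4.1195]   # classifier weights (alpha) for rounds 1-3
pred_at_04 = [-1, 1, -1]           # round 1-3 predictions at x = 0.4
pred_at_08 = [1, 1, -1]            # round 1-3 predictions at x = 0.8

print(sum(a * p for a, p in zip(alphas, pred_at_04)))  # -3.0791 -> sign -1
print(sum(a * p for a, p in zip(alphas, pred_at_08)))  #  0.3969 -> sign +1
```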
Illustrating AdaBoost
[Figure: illustration of AdaBoost on the example data. Every data point starts with weight 0.1. After each boosting round, correctly classified points receive smaller weights and misclassified points receive larger weights (e.g., 0.0094 versus 0.4623 after round 1), and each round's classifier receives a voting weight α (1.9459, 2.9323, and 3.8744 for rounds 1, 2, and 3). The final panel shows the combined prediction of the boosted ensemble.]
Random Forests
• Random Forest is a modified form of bagging that
creates ensembles of independent decision trees.
• Random Forests essentially perform bagging over the
predictor space and build a collection of less correlated
trees.
• Choose a random subset of predictors when defining
trees.
• It increases the stability of the algorithm and tackles
correlation problems that arise by a greedy search of the
best split at each node.
• It reduces variance of total estimator at the cost of an
equal or higher bias.
Random Forest
• Each classifier in the ensemble is a decision tree
classifier and is generated using a random selection
of attributes at each node to determine the split
• During classification, each tree votes and the most
popular class is returned.
Algorithm
– Choose T—number of trees to grow
– Choose m<M (M is the number of total features) —number
of features used to calculate the best split at each node
– For each tree:
◆ Build a training set by sampling N times with replacement from the original training set (N is the number of training examples)
◆ At each node, randomly choose m features and calculate the best split
◆ Grow the tree fully; do not prune
– Use majority voting among all the trees
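A minimal sketch of this algorithm, assuming X and y are NumPy arrays and scikit-learn's DecisionTreeClassifier as the tree learner (its max_features parameter restricts each split to m randomly chosen features, and trees are grown fully by default); the function names are illustrative:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, T=100, m=None, seed=0):
    """Grow T unpruned trees, each on a bootstrap sample, with m random features per split."""
    n, M = X.shape
    m = m or max(1, int(np.sqrt(M)))         # a common default choice for m
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)     # N draws with replacement
        trees.append(DecisionTreeClassifier(max_features=m).fit(X[idx], y[idx]))
    return trees

def random_forest_predict(trees, X):
    """Majority vote among all the trees for each test instance."""
    all_preds = [t.predict(X) for t in trees]
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*all_preds)]
```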
Tuning Random Forests
Multiple hyper-parameters to tune:
1. the number of predictors to randomly select at each split
2. the total number of trees in the ensemble
3. the minimum leaf node size
In theory, each tree in the random forest is grown to full size, but in practice this can be computationally expensive, so imposing a minimum node size is not unusual.
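One common way to tune these hyper-parameters is a cross-validated grid search; a sketch using scikit-learn, where X_train, y_train and the grid values are placeholders:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_features": [2, 4, 8],        # predictors randomly selected at each split
    "n_estimators": [100, 300],       # total number of trees in the ensemble
    "min_samples_leaf": [1, 5, 20],   # minimum leaf node size
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
# search.fit(X_train, y_train); print(search.best_params_)
```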
Methods to construct Random Forest
Two main methods to construct Random Forest:
– Forest-RI (random input selection): Randomly select,
at each node, F attributes as candidates for the split at
the node. The CART methodology is used to grow the
trees to maximum size
– Forest-RC (random linear combinations): Creates
new attributes (or features) that are a linear combination
of the existing attributes (reduces the correlation
between individual classifiers)
Random Forest
Comparable in accuracy to AdaBoost, but more robust to errors and outliers
Faster than bagging or boosting, since only a small random subset of attributes is considered at each split; accuracy is insensitive to the number of attributes selected for consideration at each split