8 Ensembles
Ensemble Methods (bagging and boosting)
Ensemble Methods
• An ensemble is a set of classifiers/regressors (different algorithms, different settings of the same algorithm, or the same algorithm trained on different samples of the dataset) that learn a target function; their individual predictions are combined to classify new examples.
• Ensembles generally improve the generalization performance
of a set of individual models on a domain.
• Based on the following idea: A large number of relatively
uncorrelated models operating as a committee will
outperform any of the individual constituent models.
Example: Weather forecast
(Figure: five predictors each make some wrong daily forecasts, marked X against the ground truth, but their combined majority vote is 100% correct.)
Why use Ensemble Methods?
• Statistical reasons:
o A set of classifiers with similar training performances may
have different generalization performances
o Combining outputs of several classifiers reduces the risk
of selecting a poorly performing (weak) classifier.
• Very large volumes of data:
o If the amount of data to be analyzed is too large for a single classifier to handle, train different classifiers on different partitions of the data.
• Too little data:
o Ensemble systems can also be used when there is too little data, by using resampling techniques.
Why use Ensemble Methods?
● The error of a learning algorithm (we will discuss evaluation later) has three components: the noise, the bias, and the variance:

Error(x) = Bias(x)² + Variance(x) + Noise(x)

➢ The noise is the irreducible error (random errors in the data that cannot be eliminated, for example due to corrupted input or data-entry errors)
➢ The bias is the systematic error that the learning algorithm is expected to make due to, e.g., architectural choices (e.g., using a Perceptron for data that are not linearly separable) or to insufficient/unrepresentative training data
➢ The variance measures the sensitivity of the algorithm to the specific training set and/or hyper-parameters used (algorithms can be more or less robust to such variations)

Ensembles may help reduce both bias and variance (but not the noise, which is the irreducible error)
The relation between error, bias and
variance
• X, Y, Ŷ are random variables describing the distribution of the instances x, of their ground-truth values f(x), and of their predicted values h(x)
• h(X) is an estimator (hypothesis) of the true (unknown) function f(X), generated by some model M

Y = f(X) + ε    (the observed values y ∈ Y are generated by the unknown f(X) plus «some» random error ε)

(Figure: data points scattered around the true function f(X) with errors εᵢ, together with the fitted hypothesis h(X).)
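For intuition, a standard sketch of how the three components arise under this model (expectations taken over training sets and the noise ε):

E[(Y − h(X))²] = (E[h(X)] − f(X))² + E[(h(X) − E[h(X)])²] + σε²
              =       Bias(X)²     +      Variance(X)      + Noise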
Bias and variance, the intuition
Why use Ensemble Methods?
• When combining multiple independent and diverse decisions, each of which is at least more accurate than random guessing, the variance is reduced and so is the error rate. For m independent hypotheses, the variance of their average is

Var(Ensemble(x)) = (1/m²) Σᵢ₌₁..ₘ Var(hᵢ(x, Dᵢ))

which is roughly 1/m of the variance of a single hypothesis when the individual variances are comparable.
• Here hᵢ(x, Dᵢ) is the hypothesis generated by the i-th learner of the ensemble on sample Dᵢ.
• In practice, the idea is the following: if we use “simple” models trained on smaller samples, the individual variance of each hypothesis can be higher than for a more complex learner, but in most cases the ensemble variance is still lower than the variance of one single complex learner (bias will be discussed later).
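A minimal numerical sketch of this effect (hypothetical numbers, purely illustrative): averaging m independent noisy estimators of the same quantity reduces the variance roughly by a factor m.

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 25, 10_000

# Each of the m "simple" learners predicts the true value 1.0 plus independent noise.
single = rng.normal(loc=1.0, scale=0.5, size=(trials, m))
ensemble = single.mean(axis=1)   # committee prediction = average of the m learners

print("variance of a single learner:", single[:, 0].var())  # ~0.25
print("variance of the ensemble   :", ensemble.var())       # ~0.25 / 25 = 0.01
```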
Example
• Suppose there are 25 “simple” classifiers
➢Each classifier has an average error rate, 𝜀 = 0.35 (which is a mid-high rate)
➢Assume classifiers-generated hypotheses hi(x) are statistically independent,
and final class is predicted with majority voting
➢The probability that the ensemble classifier makes a wrong prediction (it is
wrong if at least 13 out of 25 make the wrong prediction):
All possible ways, given a set of 25
σ𝑖=13..25 25 𝜀 𝑖 1 − 𝜀 25-i=0,6 classifiers, of having i (with i>12)
classifiers that make an error, and
𝑖 25-i producing a correct prediction
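A quick sanity check of this number (a sketch that simply evaluates the binomial sum above):

```python
from math import comb

eps, m = 0.35, 25
p_wrong = sum(comb(m, i) * eps**i * (1 - eps)**(m - i) for i in range(13, m + 1))
print(round(p_wrong, 3))   # ~0.06, far below the individual error rate of 0.35
```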
Homogeneous Ensembles
• Use a single, arbitrary learning algorithm but manipulate
training data to make it learn multiple models h1(x) h2(x).. hm(x)
• Data1 ≠ Data2 ≠ … ≠ Data m
• Learner1 = Learner2 = … = Learner m
(Figure: the same learner trained on different random samples of the dataset produces models h1(x)..hm(x).)
The basic idea of bagging with bootstrap
Bagging: Bootstrapping
Bootstrapping is a method that helps decrease the variance of the classifier and reduce overfitting by resampling data from the training set. The resulting ensemble model should be less overfitted than a single individual model.
➢ How: each individual classifier randomly extracts a sample of m instances from the training set of n instances with replacement (instances are put back in the urn, so they can be sampled more than once). When the sample has the same cardinality as the original set (m = n), it is called «the 0.632 bootstrap».
➢ When: an effective method in limited-data contexts (high variance, risk of overfitting).
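A minimal sketch of one bootstrap round (function and variable names are illustrative): sample n instances with replacement and keep the never-sampled, “out-of-bag” instances for testing.

```python
import numpy as np

def bootstrap_round(n, rng):
    """Return (bootstrap sample indices, out-of-bag indices) for a dataset of n instances."""
    sample_idx = rng.integers(0, n, size=n)           # drawn with replacement
    oob_idx = np.setdiff1d(np.arange(n), sample_idx)  # never drawn: ~36.8% of instances on average
    return sample_idx, oob_idx

rng = np.random.default_rng(42)
sample_idx, oob_idx = bootstrap_round(10, rng)
print(sample_idx, oob_idx)
```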
Example (n=m)
Each instance has a probability p=1/n of being extracted out of n
instances at each extraction. Since extraction is “with replacement” (the
instance is put back in the urn after having been extracted) the
probability is always the same at each extraction.
(Figure, steps 1–7: repeated extractions with replacement from the urn; the sampled instances form the training set, and the instances never sampled form the test set.)
The 0.632 bootstrap
• Each example in the original dataset has a selection probability of 1/n at each of the n draws
• If n = m, on average 36.8% of the data points are left unselected and can be used for testing
Why?
The 0.632 bootstrap
This method is also called the 0.632 bootstrap
➢ If we have n instances, each instance has a probability 1/n of being picked and (1 − 1/n) of not being picked at each extraction
➢ If m = n, its probability of ending up in the test data (= not being selected in any of the n extractions) is

(1 − 1/n)ⁿ ≈ e⁻¹ ≈ 0.368

so about 36.8% of the instances are expected to be left out, while the remaining ~63.2% appear in the bootstrap sample.
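The limit can be verified directly (a tiny sketch):

```python
import math

for n in (10, 100, 1000, 10_000):
    print(n, (1 - 1 / n) ** n)    # approaches e^-1 as n grows
print("e^-1 =", math.exp(-1))      # ~0.368: expected fraction of instances left out when m = n
```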
Example of Bagging
We aim to learn a classifier C(x) in ℝ¹ (x is a continuous variable). Assume that the “real” (unknown) classification function f(x) is:

(Figure: the x axis with thresholds 0.3 and 0.8; f(x) = +1 for x ≤ 0.3, −1 for 0.3 < x ≤ 0.8, and +1 for x > 0.8.)

• In this example the data is not linearly separable; a classifier for this data must learn a decision region, e.g.:
IF T1 <= x <= T2 then C else not(C)
➢ In our example, the “true” (unknown) values for the decision boundaries are T1 = 0.3 and T2 = 0.8.
Goal: find a collection of 10 simple thresholding (= linear) classifiers that can collectively classify the data correctly, i.e., each classifier Ci learns a single threshold Ti such that:
IF x <= Ti then C else not(C)
Training set
This is the training set: we have 10 pairs (x, C(x)).
Remember: each sample is one training set (bagging round) of size 10. This means that 10 times (once per bagging round) we sample 10 instances from the original dataset with replacement. The instances extracted in round i are used to train the i-th learner hi, and the non-extracted instances are used for testing.
Note: in each round the same training example can be extracted more than once, and some examples are not extracted at all.
Note also that each classifier misclassifies some of the training data! E.g., classifier 1 is wrong on the last two items of its “sampled” training set: it predicts c(0.9) = −1, while the true label is +1.
Bagging: Aggregation
1. Majority voting
2. Combining (averaging) the classification functions
Majority voting
Majority voting (hard voting): the ensemble outputs the class predicted by the majority of the classifiers.
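A minimal sketch of hard voting over the outputs of m already-trained classifiers (names and labels are illustrative only):

```python
from collections import Counter

def hard_vote(predictions):
    """predictions: one class label per base classifier; return the most voted label."""
    return Counter(predictions).most_common(1)[0][0]

print(hard_vote([+1, -1, +1, +1, -1]))   # -> +1 (three votes against two)
```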
Bagging with weighted Aggregation
Weighted voting (soft voting): every classifier i outputs, for each instance x, class predictions (probabilities) based on its learned model hi(x); these predictions are multiplied by each classifier's weight (relevance) and then averaged over the classifiers. The final class label is the class with the highest average probability.
Tip: In reality, weights are hard to find using intuition. To counter this
subjective process, a linear optimization equation or a neural network
could be used to learn the optimal weighting for each of the models to
optimize the accuracy of the ensemble.
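A minimal sketch of weighted (soft) voting: each classifier outputs class probabilities, which are weighted, averaged, and the most probable class wins. The weights and probabilities below are made up purely for illustration.

```python
import numpy as np

def soft_vote(probas, weights, classes):
    """probas: (m, k) per-classifier class probabilities; weights: (m,) classifier relevances."""
    weights = np.asarray(weights, dtype=float)
    avg = (weights[:, None] * np.asarray(probas)).sum(axis=0) / weights.sum()
    return classes[int(np.argmax(avg))], avg

probas = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]   # three classifiers, two classes
classes = np.array(["spam", "ham"])
label, avg = soft_vote(probas, weights=[0.5, 0.2, 0.3], classes=classes)
print(label, avg)   # "spam" wins with weighted average ~[0.74, 0.26]
```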
Bagging with aggregation by averaging
different learned functions
(Figures: the averaged classification function over successive bagging rounds 2–6.)

Bagging: Summary
• Robust to overfitting
• Main goal is to reduce variance of combined models
• Improves ability to ignore irrelevant features
• Works well if all instances have roughly the same probability p of being classified correctly or wrongly (i.e., Pr(f(x) ≠ h(x)) = p for all x in X); this means that, more or less, all instances have the same complexity
Bagging:
Issues
Example 1:
Handwriting Recognition
Example 2:
Face Recognition
Boosting Example
Boosting trains classifiers sequentially, re-weighting the training instances after each round so that later classifiers focus on the examples misclassified so far.
• Suppose example 4 is hard to classify (the round-1 classifier is wrong on example 4)
• Its weight is increased, so it is more likely to be extracted again in the subsequent weighted sampling rounds (round 2)
• Therefore, in the subsequent round the learner is forced to pay more attention to the misclassified example (in the attempt to reduce the error on the learning set)
• If the round-2 classifier is again wrong on example 4, its probability of being extracted increases again
Boosting flow diagram
AdaBoost (Adaptive Boosting)
• Input:
➢Training set D containing n instances
➢A classification algorithm (e.g., NN or a Tree-based
algorithm)
➢T iterations (i.e. rounds) (i=1,…,T). A classifier Ci is learned
at each round.
• Output:
➢A composite classifier C*
Adaboost:
Training Phase
• Training data D contains n labeled examples:
➢ (x1,y1), (x2,y2), (x3,y3), …, (xn,yn)
➢ where the yi are the correct classifications
• Initially, assign equal weight 1/n to each example
• To generate T “base” classifiers, we need T rounds
• Round i:
➢ instances from D are sampled with replacement to generate the dataset Di (|Di| = n)
➢ each instance's chance of being selected in the next rounds depends on its weight: the new sample is drawn directly from the training data D with sampling probabilities proportional to the current weights
AdaBoost:
Testing Phase
● Testing occurs on individual classifiers Ci at the end of
each round.
● The performance of each classifier on training set is
used to assess the “importance” or authority of Ci
● Final testing is performed on unseen data (i.e. a
validation set). To combine individual classifications by
each Ci, the decision of each classifier is taken into
consideration proportionally to its importance
AdaBoost:
Testing Phase
Training phase of Ci depends on previous testing phase
on Ci-1
• Base classifier Ci, is learned from training data of set
Di
• The error of Ci is always estimated on Di (here the test set coincides with the training set)
• Weights of training data are adjusted depending on
how they were classified
AdaBoost:
Testing Phases for individual classifier
• “Base” learned classifiers: C1, C2, …, CT
• Loss function (error rate of Ci on sample Di): in the standard AdaBoost formulation this is the weighted error

εi = Σj wj · δ(Ci(xj) ≠ yj)

where wj are the current instance weights and δ(·) is 1 when its argument is true; the importance of Ci is then αi = ½ ln((1 − εi)/εi).
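As a sketch, the weighted error and importance for one round can be computed as follows (the αi formula above is the standard AdaBoost choice, consistent with the α = 1.9459 shown in the illustration later in this section):

```python
import numpy as np

def round_error_and_alpha(w, y_true, y_pred):
    """w: instance weights summing to 1; return the weighted error and the classifier importance."""
    eps = float(np.sum(w * (y_pred != y_true)))   # weighted fraction of misclassified instances
    alpha = 0.5 * np.log((1 - eps) / eps)         # importance alpha_i of this round's classifier
    return eps, alpha

w = np.full(10, 0.1)
y_true = np.array([+1] * 5 + [-1] * 5)
y_pred = y_true.copy()
y_pred[0] = -1                                    # one mistake out of ten -> eps = 0.1
print(round_error_and_alpha(w, y_true, y_pred))   # (0.1, ~1.099)
```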
AdaBoost:
Final Testing Phase
• The lower a classifier Ci's error estimate, the more accurate it is and, therefore, the higher its importance 𝜶𝒊 when the final voting is performed
• Final testing (on unseen data, a validation set):
• For each class value yj, sum the importances of the classifiers that assigned class yj to the instance xtest
• The class with the highest sum is the WINNER, the one the model will predict:

C*(xtest) = argmax_y Σi αi · δ(Ci(xtest) = y)

• where δ(C(x) = y) = 1 if C(x) = y. For any possible classification value y of xtest, the weighted sum of the classifiers with output y is computed; the y value that maximises this sum is taken as the predicted value.
Example:
• 3 class values: A, B and C
• 6 classifiers C1..C6
• C1, C2 and C6 classify xtest as A; C3 and C4 classify xtest as B; C5 classifies xtest as C
• weights (importances) of the classifiers: C1, C2, C3 = 0.1; C4 = 0.35; C5 = 0.15; C6 = 0.2 (the total is 1)
• 𝛿(∗) = 1 iff ∗ is true, else 0
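The same computation in a few lines of code (the votes and weights are those of the example above; running it shows that B wins with a weighted score of 0.45):

```python
votes = [("A", 0.10), ("A", 0.10), ("B", 0.10), ("B", 0.35), ("C", 0.15), ("A", 0.20)]

totals = {}
for label, alpha in votes:                 # sum the importance of the classifiers voting each class
    totals[label] = totals.get(label, 0.0) + alpha

print(totals)                              # {'A': 0.4, 'B': 0.45, 'C': 0.15}
print(max(totals, key=totals.get))         # 'B' is the predicted class
```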
Illustrating AdaBoost
(Figure: 10 data points, each with initial weight 0.1. Boosting round 1 produces classifier B1, which separates the points into three “+” and seven “−”; its importance is α = 1.9459, and the updated instance weights shown in the figure include 0.0094 and 0.4623.)
AdaBoost (another example):
(Figures: example rounds 1–3 and the final hypothesis.)
Random Forest
Random Forests
• Ensemble method designed for decision/regression tree
classifiers:
• Combines predictions made by many unpruned d-trees.
• Each tree is generated based on a bootstrap sample of training
data and a vector of randomly chosen attributes (i.e. random
vector)
• The random vectors are generated from a fixed probability
distribution.
• Final classification is chosen with a voting method like Majority
Voting (Forest chooses the classification result having the majority of
votes over all the trees in the forest)
Random Forests
Introduce two sources of randomness:
➢ Bagging method: each tree is grown using a bootstrap sample of the training data (as in bagging and boosting)
➢ Random vector method: at each decision node, the best feature to test is chosen from a random sample of m attributes out of the total number of features d, rather than from all the features
➢ So, if we have a dataset of dimension |D| × d = n × d, we randomly choose a subset Dk of D for training and, at each split, a subset of m of the d features
Random Forests:
Random Vector Method
The random vector method adds the following rule to the decision-tree training algorithm (the same for both regression and categorical trees).
For each node of the tree:
a. Choose m features randomly (out of d) on which to base the decision at that node.
b. Calculate the best split based on these m variables in the training set (e.g., test on the best attribute among the m based on InfoGain).
Why random vectors? If one or a few features are very strong predictors of the output class (or value, for regressors), these features will be selected in many nodes of the trees, causing the trees to become correlated. Instead, the idea is to generate models that are as independent as possible; accordingly, the set of features from which the best split is chosen is selected randomly at each node.
Random Forests:
Algorithm
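As a sketch, the two sources of randomness described above correspond to the bootstrap=True and max_features="sqrt" settings of scikit-learn's RandomForestClassifier (assuming scikit-learn is available; the dataset here is just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees (unpruned by default)
    bootstrap=True,        # each tree is grown on a bootstrap sample of the training data
    max_features="sqrt",   # at each split, only sqrt(d) randomly chosen features are considered
    random_state=0,
)
forest.fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))   # final class = majority vote of the trees
```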
Why use Random Vectors?
▪ Because in an ensemble we want independent base learners
▪ Averaging over trees built from different training samples reduces the dependence of the predictions on a particular training sample (variance)
▪ Using random vectors of features further reduces this dependency. Typically, for a classification problem with d features, m = √d (rounded down) features are considered at each split
▪ Increasing the number of trees does not increase the risk of overfitting the data
▪ In fact, a main advantage of RF is that they are very robust to overfitting
Advantage of Random Forest
• Since each tree only handles a subset of features, this
can be considered a good choice when instances are
described by very many features
• It is also considered a good “dimensionality reduction”
method (since it concentrates on subsets of features)
Other Popular (and more recent) Ensembles
GRADIENT BOOSTING MACHINES
A hybrid between AdaBoost and Random Forests
• For regression, the loss considered here is the squared loss L = Σᵢ (yᵢ − h(xᵢ))²/2. To fit a model using this loss, we must optimize h(x) by changing its parameters so that h(x) “moves” in the opposite direction of the loss gradient:

h(xᵢ) ← h(xᵢ) − ∂L/∂h(xᵢ)

• But

∂L/∂h(xᵢ) = ∂(Σⱼ (yⱼ − h(xⱼ))²/2)/∂h(xᵢ) = −(yᵢ − h(xᵢ)) = −(yᵢ − ŷᵢ) = −rᵢ = g(xᵢ)

so the negative gradient −g(xᵢ) is exactly the residual rᵢ, and the gradient step amounts to adding the residual to the current prediction.
• We train on the current residuals (those of the previous model) rather than on the «ground truth» output!
• The current model learns to output a quantity r̂ which «compensates» the residual of the previous model
• Let's say that model «0» predicts h0(x) = 1 for some x in D, but the ground truth is 1.5; we would like model «1» to output a quantity that, when summed with the prediction of model «0», produces the exact value 1.5 (or at least a value «closer» to the ground truth 1.5)!
• What does it mean to «fit a regression tree to the negative gradients»?
• Remember: error = variance + bias² + noise (see the bias–variance discussion above)
• A basic assumption of regression is that the sum of its residuals should ideally be 0, i.e., the residuals should be spread randomly around 0 (meaning that the residual error is only “noise”, the random error ε that cannot be eliminated)
• If, instead, we can see some “pattern” in the residuals around 0, we can leverage that pattern to fit a model (a pattern in the residuals means that the errors are somehow correlated, i.e., they are not random)
• GBM therefore reduces both variance and bias!
The intuition behind GBM
Each successive model is the «sum» of the previous trees plus the current one.
Gradient Boosting with Regression
Algorithm (link):

Start with an initial weak learner, e.g. h0(x) = (Σⱼ yⱼ) / n (the output is the average of the observed values in the learning set).

Iterate until convergence:
1. Calculate the negative gradients −gⱼⁱ (the residuals rⱼⁱ) of the current model hᵢ(x)
2. Fit a regression tree hrⁱ(x) to the negative gradients −gⱼⁱ of hᵢ(x) (i.e., train on the pairs (xⱼ, rⱼⁱ))
3. hᵢ₊₁(x) = hᵢ(x) + hrⁱ(x)

Note that the role of hrⁱ(x) is to COMPENSATE the shortcomings of hᵢ(x).
The final prediction is ŷ = h0(x) + Σᵢ hrⁱ(x) = ŷ⁰ + r̂ₓ¹ + r̂ₓ² + ⋯ + r̂ₓᵏ, since all the predictors hrⁱ(x) after h0 are trained to predict the residuals of their previous model.
In some implementations of the «basic» algorithm the residuals are weighted: hᵢ₊₁(x) = hᵢ(x) + γ·hrⁱ(x).
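A minimal from-scratch sketch of this loop for the squared loss, using shallow regression trees as base learners and a constant weight γ (illustrative only, not the exact implementation of any particular library):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_rounds=50, gamma=0.1, max_depth=2):
    """Gradient boosting for squared loss: each tree is fit to the current residuals."""
    f0 = y.mean()                        # h0(x): the average of the observed values
    pred = np.full_like(y, f0, dtype=float)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred             # negative gradients of the squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += gamma * tree.predict(X)  # h_{i+1}(x) = h_i(x) + gamma * h_r_i(x)
        trees.append(tree)
    return f0, trees

def gbm_predict(X, f0, trees, gamma=0.1):
    return f0 + gamma * sum(t.predict(X) for t in trees)

# Toy usage: fit a noisy sine curve.
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)
f0, trees = gbm_fit(X, y)
print("train MSE:", np.mean((gbm_predict(X, f0, trees) - y) ** 2))
```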
Flowchart of GB
(Figure: flowchart of gradient boosting.)

Example: first iteration
(Figures from https://blog.mlreview.com/gradient-boosting-from-scratch-1e317ae4587d)

Two more iterations / After several iterations
(Figures: this visually shows what happens during gradient descent; see https://morioh.com/p/e108a4521555)
Popular variants of GBM: LightGBM
• Remember: the «base» model in GBM is a regression tree (see the lesson on regression trees and the splitting method)
• LightGBM: rather than sorting all possible splits as in classic regression trees, LightGBM uses a histogram-based algorithm that buckets continuous features into discrete bins to construct feature histograms during training
• First, continuous data are discretized into k bins (e.g., 256); next, statistics are accumulated for each bin (a frequency histogram of the bins)
• The best split is found by exploiting the k bin statistics
• This gives an obvious reduction in memory consumption; LightGBM is advantageous with large datasets
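A small usage sketch (assuming the lightgbm Python package is installed; the data here is synthetic): max_bin controls the number of histogram bins discussed above.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=1000)

model = lgb.LGBMRegressor(
    n_estimators=200,
    max_bin=255,          # continuous features are bucketed into at most 255 histogram bins
    learning_rate=0.05,
)
model.fit(X, y)
print(model.predict(X[:5]))
```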
Other variants of GBM
• XGBoost (XGB): link
• CatBoost (for categorical features): link
Other Suggested Readings
• General overview: LINK and LINK
• Boosting and bagging: LINK, LINK
• Random Trees and Extra Trees: LINK