8 Ensembles

Ensemble methods in machine learning combine multiple classifiers or regressors to improve generalization performance by leveraging the strengths of diverse models. Techniques like bagging and boosting help reduce bias and variance, enhancing prediction accuracy, especially in scenarios with limited or excessive data. Bagging utilizes bootstrapping to create multiple training sets, while boosting adjusts weights to focus on misclassified instances, both aiming to create a more robust ensemble model.

Machine Learning:

Ensemble Methods
(bagging and boosting)

1
Ensemble Methods
• An ensemble is a set of classifiers/regressors (either different
algorithms or different settings of the same algorithm, or the
same algorithm on different samples of the dataset) that learn
a target function, and their individual predictions are
combined to classify new examples.
• Ensembles generally improve the generalization performance
of a set of individual models on a domain.
• Based on the following idea: A large number of relatively
uncorrelated models operating as a committee will
outperform any of the individual constituent models.
Example:
Weather Forecast
(Figure: five individual predictors, each wrong on a few days, marked X, compared against the ground truth. Combining their forecasts by majority vote yields a prediction that is 100% correct.)
Why Use
Ensemble Methods?
• Statistical reasons:
o A set of classifiers with similar training performances may
have different generalization performances
o Combining the outputs of several classifiers reduces the risk
of selecting a poorly performing (weak) classifier.
• Very large volumes of data:
o If the amount of data to be analyzed is too large, a single
classifier may not be able to handle it; train different
classifiers on different partitions of the data.
• Too little data:
o Ensemble systems can also be used when there is too
little data, by means of resampling techniques.

4
Why Use
Ensemble Methods?
● The error of a learning algorithm (we will see evaluation later) has three components: the
noise, the bias, and the variance:

Error(x) = Bias(x)² + Variance(x) + Noise(x)

➢ The noise is the irreducible error (random errors in the data that cannot be eliminated, for
example due to corrupted input or data-entry errors)
➢ The bias is the systematic error that the learning algorithm is expected to make due to,
e.g., architectural choices (e.g. using a Perceptron for data that are not linearly
separable) or to insufficient/unrepresentative training data
➢ The variance measures the sensitivity of the algorithm to the specific training set
and/or hyper-parameters used (algorithms can be more or less robust to such
variations).
Ensembles may help reduce both bias and
variance (but not the noise, which is the irreducible error)
5
The relation between error, bias and
variance
• X, Y, Ŷ are random variables describing the distribution of the instances x, their ground-truth
values f(x) and their predicted values h(x)
• h(X) is an estimator (hypothesis) of the true (unknown) function f(X), generated by some
model M

$Y = f(X) + \varepsilon$ : the observed values $y$ are generated by the (unknown) $f(X)$ plus «some» random error $\varepsilon$

$MSE = E\big[(h(X) - f(X))^2\big] = E\big[(\hat{Y} - f(X))^2\big]$ : the mean square error is the expected (mean)
value of the squared error over the entire distribution

$Bias^2(h,f) = \big(E[h(X)] - f(X)\big)^2 = E[h(X)]^2 + f(X)^2 - 2\,E[h(X)]\,f(X)$ : the bias is the difference between the
mean predicted value of our model and the correct value we are trying to predict

$Var(h) = E\big[(h(X) - E[h(X)])^2\big]$ : the variance is the expected variability of the model around its mean

Decomposition (adding and subtracting $E[h(X)]$ inside the square):
$MSE = E\big[(h(X) - f(X))^2\big] = E\big[(h(X) - E[h(X)] + E[h(X)] - f(X))^2\big] = E\big[(h(X) - E[h(X)])^2\big] + \big(E[h(X)] - f(X)\big)^2$
and, accounting for the noise term of $Y = f(X) + \varepsilon$:
$E\big[(\hat{Y} - Y)^2\big] = Bias^2 + Variance + \text{irreducible error}$
6
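A minimal numerical sketch of this decomposition (my own illustration, not part of the slides): fit a deliberately simple model on many independently drawn training sets and check that, at a test point, the mean squared error is approximately bias² + variance + noise. The true function, noise level, and model degree below are arbitrary choices.

```python
# Monte-Carlo check of MSE ≈ Bias^2 + Variance + Noise (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)      # true (unknown) function f(X)
sigma = 0.1                               # std of the irreducible noise eps
x_test = 0.3                              # evaluate the decomposition at one point

preds = []
for _ in range(2000):                     # many independent training sets
    x = rng.uniform(0, 1, 30)
    y = f(x) + rng.normal(0, sigma, 30)   # Y = f(X) + eps
    coef = np.polyfit(x, y, deg=1)        # a deliberately simple (biased) model h
    preds.append(np.polyval(coef, x_test))
preds = np.array(preds)

bias2 = (preds.mean() - f(x_test)) ** 2   # (E[h(X)] - f(X))^2
variance = preds.var()                    # E[(h(X) - E[h(X)])^2]
mse = ((preds - (f(x_test) + rng.normal(0, sigma, preds.size))) ** 2).mean()
print(bias2, variance, sigma**2, bias2 + variance + sigma**2, mse)
# the last two numbers should be approximately equal
```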
True function, model hypothesis, random
error
● Y = f(X) + ε
(Figure: training points scattered around the true function f(X); h(X) is the model's hypothesis and ε_i is the random error on each observation.)

7
Bias and variance, the intuition

Bias: how much the «mean» model E[h(X)]
differs from the true function, i.e. |f(x_i) − E[h(x_i)]|

Variance: how much the model varies, e.g.
across different runs (hyperparameters)
or data samples, i.e. Σ_i (h_i(x_k) − E[h(x_k)])²

(Figure: several hypotheses h_i plotted around their mean E[h(X)] and compared with the true function f.)

8
Why Use
Ensemble Methods?
• When combining multiple independent and diverse decisions, each of which is
at least more accurate than random guessing, the variance is reduced and so
is the error rate. For independent hypotheses averaged into an ensemble:

$Var\Big(\tfrac{1}{m}\sum_{i=1}^{m} h_i(x, D_i)\Big) = \frac{\sum_{i=1}^{m} Var(h_i(x, D_i))}{m^2} = \frac{\overline{Var}(h_i)}{m}$

• where hi(x, Di) is the hypothesis generated by the i-th learner of the ensemble.
• In practice, the idea is the following:
if we use “simple” models trained on smaller samples, the individual variance
of each hypothesis can be higher than for a more complex learner, but in
most cases the ensemble variance is still lower than if we used one single complex
learner (we will discuss bias later)

9
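A quick numerical check of the formula above (an assumed toy setup, not from the slides): averaging m independent predictors with equal variance divides the variance by roughly m.

```python
# Variance reduction by averaging m independent predictors (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
m, trials = 10, 100_000
# simulate m independent predictors, each an unbiased estimate with variance 1
single = rng.normal(loc=0.0, scale=1.0, size=(trials, m))
ensemble = single.mean(axis=1)           # average their predictions

print(single.var(axis=0).mean())         # ~1.0  (average individual variance)
print(ensemble.var())                    # ~0.1  (≈ 1.0 / m)
```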
Example
• Suppose there are 25 “simple” classifiers
➢ Each classifier has an average error rate ε = 0.35 (a mid-high rate)
➢ Assume the classifier-generated hypotheses hi(x) are statistically independent,
and the final class is predicted by majority voting
➢ The probability that the ensemble classifier makes a wrong prediction (it is
wrong only if at least 13 out of 25 classifiers make the wrong prediction) is:

$\sum_{i=13}^{25} \binom{25}{i}\, \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06$

(the sum counts all the possible ways, given a set of 25 classifiers, of having i of them,
with i > 12, make an error, which produces a wrong ensemble prediction)

If the classifiers are independent, the probability
that the ensemble makes an error is very low!
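The slide's figure can be recomputed with a few lines of Python using only the standard library (the binomial sum below is exactly the formula above):

```python
# Probability that at least 13 of 25 independent classifiers (error rate 0.35)
# are wrong at the same time.
from math import comb

eps, n = 0.35, 25
p_ensemble_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
                       for i in range(13, n + 1))
print(round(p_ensemble_wrong, 3))   # ≈ 0.06, much lower than 0.35
```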
Learning Ensembles
• Learn multiple alternative models using different training data or
different learning algorithms.
• Combine the decisions of the multiple models, e.g. using weighted voting.

Training Data

Data1 Data2 ⋅⋅⋅⋅⋅⋅⋅⋅ Data m

Learner1 Learner2 ⋅⋅⋅⋅⋅⋅⋅⋅ Learner m

Model1 Model2 ⋅⋅⋅⋅⋅⋅⋅⋅ Model m

Model Combiner Final Model


11
Methods for
Constructing Ensembles
• By manipulating the training set: Create multiple training sets by
resampling the original data according to some sampling distribution.
• By manipulating the input features: Choose a subset of input features
to form each training set.
• By manipulating the class labels: Transform the training data into a
binary class problem by randomly partitioning the class labels into two
disjoint subsets (e.g. for 4 labels: (AB) (CD)).
• By manipulating the learning algorithm: Manipulate the learning
algorithm to generate different models (e.g. different
hyperparameters)

12
Homogeneous Ensembles
• Use a single, arbitrary learning algorithm but manipulate
training data to make it learn multiple models h1(x) h2(x).. hm(x)
• Data1 ≠ Data2 ≠ … ≠ Data m
• Learner1 = Learner2 = … = Learner m

• Different methods for changing training data:


1. Bagging: Resample data (Useful to reduce the variance)
2. Boosting: Re-weight data (Useful to reduce the bias)
Bagging
A high variance is not good for a model: it means its performance is
sensitive to the particular training data provided (low generalization power). So,
even if more training data is provided, the model may still perform poorly,
and more data may not even reduce its variance.

➢ Solution: BAGGING, a shorthand for the combination of
Bootstrapping and AGGregatING

(Figure: random samples of the dataset are drawn, one model is trained on each, and their outputs are aggregated.)

14
The basic idea of bagging with bootstrap

15
Bagging: Bootstrapping
Bootstrapping is a method that helps decrease the variance of the classifier
and reduce overfitting by resampling data from the training set. The
ensemble model created should be less overfitted than a single individual
model.
➢ How: Each individual classifier randomly extracts a sample of m
instances from the training set of n instances with replacement
(instances are put back in the urn, so they can be sampled more
than once). If the sample has the same cardinality as the original
set (m = n), it is called «the 0.632 bootstrap».
➢ When: An effective method in limited-data contexts (high variance, risk of
overfitting).

16
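A minimal sketch of one bootstrap round (the helper name is mine, not from the lecture material): sample m indices with replacement and keep the never-drawn instances as out-of-bag test data.

```python
# One bagging round: sample with replacement, keep the rest for testing.
import numpy as np

def bootstrap_split(n, m, rng):
    """Return (train_idx, test_idx) for one bagging round."""
    train_idx = rng.integers(0, n, size=m)            # sampling WITH replacement
    test_idx = np.setdiff1d(np.arange(n), train_idx)  # instances never drawn
    return train_idx, test_idx

rng = np.random.default_rng(42)
train_idx, test_idx = bootstrap_split(n=12, m=5, rng=rng)
print(train_idx)   # duplicates are possible
print(test_idx)    # out-of-bag instances, usable for testing
```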
Bootstrapping

Example with n=12 and m=5 (note the effect of replacement)


17
Bagging: Bootstrapping
➢ Instances that are not extracted during bagging are used for
testing
➢ In each bagging round, each instance x out of n has probability
1 − (1 − 1/n)^m of being selected at least once in a training sample of
m draws, and (1 − 1/n)^m of being left out as test data.

(Figure: a table of data IDs showing, for each bagging round, which instances form the training data.
The original training dataset has n = 10 instances; n = m in this example.)

18
Example (n=m)
Each instance has a probability p=1/n of being extracted out of n
instances at each extraction. Since extraction is “with replacement” (the
instance is put back in the urn after having been extracted) the
probability is always the same at each extraction.

(Figure sequence, one slide per extraction step: the instances extracted over the rounds form the training set, and the instances never extracted form the test set.)
The 0.632 bootstrap
• Each example in the original dataset has a selection probability of
1/n at each of the n draws
• If n = m, on average 36.8% of the data points are left unselected
and can be used for testing

Why?

26
The 0.632 bootstrap
This method is also called the 0.632 bootstrap
➢ If we have n instances, each instance has a probability 1/n
of being picked and (1 − 1/n) of not being picked at each
extraction
➢ If m = n, its probability of ending up in the test data (= not
being selected in any of the n extractions) is:

$(1 - \tfrac{1}{n})^{n} \approx e^{-1} = 0.368$

➢ This means the training data will contain approximately
63.2% of the distinct instances

27
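A small check of the 0.632 figure (my own sketch): the analytical limit (1 − 1/n)^n → e⁻¹ and the fraction of distinct instances in one empirical bootstrap sample.

```python
# Analytical and empirical check of the 0.632 bootstrap.
import math
import numpy as np

for n in (10, 100, 1000, 10_000):
    print(n, (1 - 1/n)**n)            # 0.3487, 0.3660, 0.3677, 0.3679
print(math.exp(-1))                   # 0.3679

rng = np.random.default_rng(0)
n = 10_000
sample = rng.integers(0, n, size=n)   # one bootstrap round with m = n
print(np.unique(sample).size / n)     # ≈ 0.632 distinct instances
```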
Example of Bagging
We aim to learn a classifier C(x) in ℝ¹ (x is a continuous variable).
Assume that the “real” (unknown) classification function f(x) is:
+1 for x ≤ 0.3,  −1 for 0.3 < x < 0.8,  +1 for x ≥ 0.8
• In this example the data are not linearly separable; a classifier for this data
must learn a decision region, e.g.:
IF T1 ≤ x ≤ T2 THEN class = −1
ELSE class = +1
➢ In our example, the “true” (unknown) values of the decision
boundaries are T1 = 0.3 and T2 = 0.8.
Goal: find a collection of 10 simple thresholding (= linear) classifiers
that can collectively classify the data correctly,
i.e., each classifier Ci learns a single threshold Ti such that:
IF x ≤ Ti THEN class = +1
ELSE class = −1 (or the opposite polarity)
28
Training set
This is the training set: we have 10 pairs (x, C(x))

x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1


Y=C(x) 1 1 1 0 0 0 0 1 1 1

We now create 10 samples of this data with bootstrapping, and


on any sample, we train a “simple” threshold classifier.

Remember: Each sample is one training set (bagging round) of size 10.
This means that for 10 times (bagging rounds) we sample 10 instances
from the original dataset with replacement. The extracted instances in a
round(i) are used to train the i-th learner hi, and the non-extracted
instances are used for testing.

29
Note: In each round the
same training example
can be extracted more
than once, and some
examples are not
extracted.
Note also that each classifier
misclassifies some of the
training data!
E.g. classifier 1 is wrong on
the last two items of the
“sampled” training set:
it predicts c(0.9) = −1, while the true label is +1.

For each bagging round,
the figure shows the
threshold learned
by the corresponding classifier.
30
Bagging: Aggregation
• In the previous example, given an initial training set of 10 examples, we
bag the data 10 times and learn 10 threshold classifiers Ci (i = 1,..., 10);
each hypothesis hi(x) has an expected error rate E[hi(x) − f(x)] = εi
• We then need to combine these results (ensemble method)
• There exist several methods to determine the final score, e.g.:

1. Majority voting
2. Combining (averaging) the classification functions
Bagging: Aggregation
Majority voting
Majority voting (hard voting): we just need a majority of the
classifiers to agree on a class to determine the result.

A simple version of majority voting for binary classifiers (with
values +1, −1):
IF sign(Σi Ci(xj)) = +, THEN C(xj) = +1
➢ This means: if the majority says “+1” (i.e. the number of +1 votes
exceeds the number of −1 votes), the predicted class is +1 (the sign
of the sum of the predictions is positive)
Bagging with Majority Voting
If applied to the training data, the
accuracy of the ensemble classifier is 100% ☺

33
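A compact sketch of the whole example (bootstrapping plus hard majority voting with single-threshold classifiers) on the toy dataset of the previous slides; the stump-fitting code is my own simplification, so the exact thresholds and the 100% figure may vary with the random seed.

```python
# Bagging single-threshold ("stump") classifiers on the toy 10-point dataset.
import numpy as np

X = np.arange(0.1, 1.01, 0.1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])   # slide labels 1/0 mapped to +1/-1

def fit_stump(x, y):
    """Pick threshold t and polarity s minimizing training error of the rule
    'predict s if x <= t else -s'."""
    best = None
    for t in np.unique(x) + 0.05:           # candidate thresholds between points
        for s in (+1, -1):
            pred = np.where(x <= t, s, -s)
            err = np.mean(pred != y)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best[1], best[2]

rng = np.random.default_rng(3)
stumps = []
for _ in range(10):                          # 10 bagging rounds
    idx = rng.integers(0, len(X), len(X))    # bootstrap sample with replacement
    stumps.append(fit_stump(X[idx], y[idx]))

votes = np.array([np.where(X <= t, s, -s) for t, s in stumps])
ensemble_pred = np.sign(votes.sum(axis=0))   # hard majority voting
print((ensemble_pred == y).mean())           # often 1.0, i.e. 100% on training data
```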
Bagging with Weighted
Aggregation
Weighted voting (soft voting): every classifier i outputs a class
probability based on its learned model hi(x) for each instance x; these
predictions are multiplied by each classifier’s weight
(relevance) and finally averaged over the number of classifiers.
The final class label is then the class label with the
highest weighted average probability.

Tip: In practice, weights are hard to set by intuition. To avoid this
subjective process, a linear optimization procedure or a neural network
can be used to learn the optimal weight of each model so as to
maximize the accuracy of the ensemble.
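A hedged scikit-learn sketch of weighted soft voting (the base models and the weights below are illustrative choices, not taken from the slides):

```python
# Soft voting: average class probabilities, weighted by classifier relevance.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(max_depth=3))],
    voting="soft",            # average predicted class probabilities
    weights=[2, 1, 1],        # per-classifier relevance (illustrative values)
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))   # class with highest weighted average probability
```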
Bagging with aggregation by averaging
different learned functions

Example: creating a non-linear classifier out of many
linear classifiers (e.g. perceptrons)

35
(Figure sequence: six linear classifiers are learned, one per bagging round.)

By averaging all the lines
we may obtain a perfect
classifier
40
Bagging:
Notable Benefits

• Robust to overfitting
• Main goal is to reduce variance of combined models
• Improves ability to ignore irrelevant features
• Works well if all instances have an equal probability p of being
classified correctly or wrongly (means: Pr(f(x)≠h(x))=p for all x in
X) (this means that, more or less, all instances have the same
complexity)

41
Bagging:
Issues

• Does not focus on any particular instance of the


training data - assumption is that all instances
have the same probability of misclassification
(often wrong assumption, e.g. image recognition)

• What if we want to focus on a particular instance of


training data?
➢ E.g. some instance can be more difficult to
classify than others (and on these difficult
instances most “simple” classifiers may be all
wrong, so majority voting won’t work)

42
Example 1:
Handwriting Recognition

Some signs are more difficult than others,


even for humans

43
Example 2:
Face Recognition

Not all faces are easily recognized here..


44
Boosting

Idea: The main idea of boosting is to add models to the overall


ensemble sequentially. At each iteration, a new model is
created and the new base-learner model is forced to «pay
attention» to the errors of the previous learners.

As for bagging, the algorithm creates multiple weak models


whose output is finally combined to get an overall prediction.
➢ The main goal of boosting is to reduce the bias (errors due
to the model itself, not to its sensitivity to data variations)
45
46
Boosting
How: An iterative procedure to adaptively change the
distribution of training data by focusing (in each iteration) on
previously misclassified examples.
1. Get a dataset
2. Take a bootstrap (as for bagging), and train a model on it
3. See on which examples the model is wrong
4. Upweight those ‘hard’ examples, downweight the ‘easy’
ones
5. Go back to step 2, but with a weighted bootstrap, so that
in the next iteration the harder examples have a higher probability
of being extracted
Each new member of the ensemble «focuses» on
the instances that the previous ones got wrong!
47
Boosting
• Instances xi in the training set D are sampled with a probability
that depends on their weight $w_i^{(j)}$ (P(X = xi) = $w_i^{(j)}$ in iteration j).
Initially, instances are equally weighted (P(X = xi) = 1/n).
• In iteration (round) j, instances that are wrongly classified by the
current learner will have their weights increased (so that their
probability of being sampled in the next boosting round will grow)
• Those that are classified correctly will have their weights
decreased
The workflow of boosting algorithms

49
Boosting Example
• Suppose example 4 is hard to classify (the round-1 classifier is wrong on
example 4)
• Its weight is increased, so it is more likely to be extracted again in the
subsequent weighted sampling round (round 2)
• Therefore, in the subsequent round, the learner is forced to pay more
attention to the misclassified example (in an attempt to reduce the
error on the learning set)
• If the round-2 classifier is again wrong on example 4, its probability of being
extracted increases again
Boosting flow diagram

51
Adaboost
Adaptive Boost
• Input:
➢Training set D containing n instances
➢A classification algorithm (e.g., NN or a Tree-based
algorithm)
➢T iterations (i.e. rounds) (i=1,…,T). A classifier Ci is learned
at each round.
• Output:
➢A composite classifier C*

53
Adaboost:
Training Phase
• Training data D contains n labeled data
➢ (x1,y1), (x2,y2 ), (x3,y3),….(xn,yn)
➢ where yi are the correct classifications
• Initially assign equal weight 1/n to each example
• To generate T “base” classifiers, we need T rounds
• Round i:
➢ instances from D are sampled with replacement, to
generate the dataset Di (|Di| = n)
• Each instance’s chance of being selected in the next rounds
depends on its weight
➢the new sample is generated directly from the training
data D with different sampling probability according to
the weights;

54
AdaBoost:
Testing Phase
● Testing occurs on individual classifiers Ci at the end of
each round.
● The performance of each classifier on training set is
used to assess the “importance” or authority of Ci
● Final testing is performed on unseen data (i.e. a
validation set). To combine individual classifications by
each Ci, the decision of each classifier is taken into
consideration proportionally to its importance

55
AdaBoost:
Testing Phase
Training phase of Ci depends on previous testing phase
on Ci-1
• The base classifier Ci is learned from the training data in set
Di
• The error of Ci is always estimated on Di itself (test set = training
set)
• The weights of the training data are adjusted depending on
how they were classified

56
AdaBoost:
Testing Phases for individual classifiers
• “Base” learned classifiers: C1, C2, …, CT
• Loss function (error rate of Ci on sample Di):

$\varepsilon_i = \sum_{j=1}^{n} w_j \,\delta\big(C_i(x_j) \neq y_j\big)$

• i = index of the boosting round
• j = index of the instance in the training data
• δ(c) is 1 if the condition c is true, else it is 0
• wj = current weight of xj (1/n in round 1)
• Importance of a classifier: the weight of classifier Ci in the
final vote is:

$\alpha_i = \frac{1}{2}\,\ln\frac{1-\varepsilon_i}{\varepsilon_i}$
AdaBoost:
Weight updating rule (before the (i+1)-th round)
➢ Weight updating rule on all training data:

$w_j^{(i+1)} = \frac{w_j^{(i)}}{Z_i} \times \begin{cases} e^{-\alpha_i} & \text{if } C_i(x_j) = y_j \\ e^{\alpha_i} & \text{otherwise} \end{cases}$

αi is the “importance” of classifier Ci, as previously computed

Zi is a normalization factor ensuring that the weights wj remain sampling probabilities for the xj

If the classification of xj is correct, its weight decreases (divide by exp αi),
otherwise it increases (multiply by exp αi)
58
AdaBoost:
Final Testing Phase
• The lower a classifier’s Ci error estimate is, the more accurate it is, and
therefore, the higher its importance 𝜶𝒊 when final voting is performed
• Final Testing (on unseen data, validation set):
• For each class value yj, sum the weights of each classifier that assigned class yj to
the instance xtest.
• The class with the highest sum is the WINNER, the one the model will predict

• δ(x) = 1 if C(x) = y. For every possible class value y of xtest, the weighted sum of the
classifiers whose output is y is computed. The y value that maximizes this sum is taken as
the predicted value.

59
Example

• 3 class values: A, B and C
• 6 classifiers C1..C6
• C1, C2 and C6 classify xtest = A; C3 and C4 classify xtest = B; C5 classifies xtest = C
• weights of the classifiers: C1, C2, C3 = 0.1; C4 = 0.35; C5 = 0.15; C6 = 0.2 (total is 1)
• δ(∗) = 1 iff ∗ is true, else it is 0

C(xtest) = A ➔ 0.1 + 0.1 + 0.2 = 0.4
C(xtest) = B ➔ 0.1 + 0.35 = 0.45
C(xtest) = C ➔ 0.15
Argmax(A, B, C) = B: the ensemble predicts B (even though three of the six classifiers voted for A)
60
AdaBoost:
Pseudo-code
Given D: <xj, yj>, |D| = n
1. Set weights wj = 1/n
2. For i = 1..T
   a. Bootstrap Di from D using P(X = xj) = wj, and train Ci
   b. Test Ci and compute its error rate εi on Di
   c. IF εi > 1/2 THEN set T = i − 1 and abort the loop
   d. Compute αi
   e. Update the weights wj
3. Test: for any unseen xtest, output the class with the highest weighted vote over C1..CT
61
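A compact Python sketch of the pseudo-code above. For brevity it passes the weights directly to the base learner as sample weights instead of drawing a weighted bootstrap sample Di, which is a common variant; all names are my own, not from the lecture material.

```python
# Minimal AdaBoost sketch with decision stumps as base learners.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=10):
    """y must be in {-1, +1}. Returns the list of (alpha_i, C_i)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # step 1: uniform weights
    ensemble = []
    for _ in range(T):                         # step 2: T boosting rounds
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)       # weights instead of weighted bootstrap
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y))          # weighted error rate
        if eps > 0.5:                          # abort, as in step 2c
            break
        eps = max(eps, 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)  # importance of this classifier
        w *= np.exp(-alpha * y * pred)         # decrease if correct, increase if wrong
        w /= w.sum()                           # normalize (the Z_i term)
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    scores = sum(alpha * clf.predict(X) for alpha, clf in ensemble)
    return np.sign(scores)                     # weighted vote of all C_i
```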
Illustrating AdaBoost
Initial weights for each of the 10 sampled data points: 0.1 each

Original
Data: + + + − − − − − + +
Train C1 on these data points

B1 (Boosting Round 1): C1 predicts + + + − − − − − − −, with importance α = 1.9459

C1 is wrong on the last 2 examples, hence
their weight is increased: after the update, the two misclassified
points have weight 0.4623 and the correctly classified ones 0.0094

62
AdaBoost
(Figure: the classifier importance α.)

63
AdaBoost (another example):
Example Round 1

64
AdaBoost:
Example Round 2

65
AdaBoost:
Example Round 3

66
AdaBoost:
Example (continued)
● The 3 learned classifiers

67
AdaBoost:
The Final Hypothesis

68
Random Forest

69
Random Forests
• Ensemble method designed for decision/regression tree
classifiers:
• Combines predictions made by many unpruned d-trees.
• Each tree is generated based on a bootstrap sample of training
data and a vector of randomly chosen attributes (i.e. random
vector)
• The random vectors are generated from a fixed probability
distribution.
• Final classification is chosen with a voting method like Majority
Voting (Forest chooses the classification result having the majority of
votes over all the trees in the forest)

70
Random Forests
Introduce two sources of randomness:
➢ Bagging method: each tree is grown using a
bootstrap sample of training data (as in Bagging and
Boosting)
➢ Random vector method: at each decision node,
the best feature to test is chosen from a random
sample of m attributes out of the total number of
features d, rather than from all the features
➢ So, if we have a dataset of dimension |D|×d = n×d, we
randomly choose a subset Dk of D for training and,
at each split, a subset of m of the d features
Random Forests:
Random Vector Method
The random vector method adds the following rule to the decision-tree
training algorithm (the same for both regression and categorical trees).
For each node of the tree:
a. Choose m features randomly (out of d) on which to
base the decision at that node.
b. Calculate the best split based on these m variables in
the training set (e.g., test on the best attribute among the m based
on information gain).
Why random vectors? If one or a few features are very strong predictors of the output class (or value,
for regressors), these features will be selected in many nodes of many trees, causing the trees to become
correlated. Instead, the idea is to generate models that are as independent as possible. Accordingly, the
set of features from which to find the best split is selected randomly at each split.
72
Random Forests:
Algorithm

1. Divide the training examples into multiple training sets Dk using bootstrap
sampling (i.e. choosing p times with replacement from the n available training
cases)
2. For each training set Dk:
   a. Train a decision tree with the random vector method
   b. Estimate the error of the decision tree using the remaining (out-of-bag) examples
3. Aggregate the predictions of the trees to make classification decisions
   using a voting method (usually majority voting), as in the sketch below
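A hedged scikit-learn sketch of the two sources of randomness described above (a bootstrap sample per tree, a random subset of √d features per split), with the error estimated on the out-of-bag instances as in step 2b; the dataset and hyper-parameters are illustrative choices.

```python
# Random Forest: bagging + random feature subsets at each split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=200,          # number of trees (more trees do not overfit more)
    max_features="sqrt",       # random vector: m = sqrt(d) features tried per split
    bootstrap=True,            # each tree is grown on a bootstrap sample
    oob_score=True,            # error estimated on the out-of-bag instances
    random_state=0,
)
forest.fit(X, y)
print(forest.oob_score_)       # accuracy estimated without a separate test set
```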
Random Forests:
Visual Explanation

74
Why use Random Vectors?
▪ Because in an ensemble, we want independent base learners
▪ Averaging over trees grown on different training samples reduces
the dependence of the predictions on a particular training
sample (variance).
▪ Using random vectors of features further reduces this
dependence. Typically, for a classification problem with d
features, m = √d (rounded down) features are used at each split.
▪ Increasing the number of trees does not increase the risk of
overfitting the data.
▪ In fact, the main advantage of RF is that they are very robust to
overfitting

75
Advantage of Random Forest
• Since each tree only handles a subset of features, this
can be considered a good choice when instances are
described by very many features
• It is also considered a good “dimensionality reduction”
method (since it concentrates on subsets of features)

76
Other Popular (and
more recent)
Ensembles
GRADIENT BOOSTING MACHINES
A hybrid between
AdaBoost and
Random Forests

• Every decision tree DTi
(either a categorical or a
regression tree) tries to
improve over the previous
model by “paying more
attention” to the instances
where the previous model
fails

How do we
«pay more
attention» to
errors?
• AdaBoost does this by increasing the weight
of misclassified instances
• GBM applies to regression trees, and does so
using gradient descent
• How?
Gradient boosting (1)
• Let’s say we start with a very simple model h0(x)
• We have a set of training instances D: (x1,y1)..(xn,yn) and
for any xj we define the residual as follows:
• $r_j^0 = y_j - h_0(x_j)$ (= the error of h0 on instance xj)
• We want to improve h0(x) (reduce its residuals) by
adding a new model h1(x) such that:
h0(x) + h1(x) = y for any (x,y) in D
• Or equivalently:
h1(x) = y − h0(x)
Two questions:
1. How does this relate to gradient descent?
2. What if h1(x) does not achieve its task perfectly?
How do residuals relate to the gradient?

• Let’s consider regressors. As we have seen, a common loss function is the
Square Loss function (note: MSE is the mean of the square losses), defined as:

$L = \sum_i L(x_i, h(x_i)) = \sum_i (y_i - h(x_i))^2/2 = \sum_i (y_i - \hat{y}_i)^2/2$

(also called the squared sum of errors)

• To fit a model using this loss, we must optimize h(x) by changing its
parameters in a way that h(x) “moves” in the opposite direction of the gradient of the loss:

$h(x_i) \leftarrow h(x_i) - \frac{\partial L}{\partial h(x_i)}$

But  $\frac{\partial L}{\partial h(x_i)} = \frac{\partial\left(\sum_i (y_i - h(x_i))^2/2\right)}{\partial h(x_i)} = h(x_i) - y_i = \hat{y}_i - y_i = -r_i = g(x_i)$

The gradient is the opposite of the residual of h(x) on xi!

Residuals can be interpreted as negative gradients!
• Back to GBM: let h0(x) be an initial weak model (a regression tree) in the
pipeline
• The residuals $r_j^0$ of h0(x) on the dataset D are: (y1 − h0(x1)) .. (yn − h0(xn))
• Our aim is to «augment» h0(x) with h1(x), where h1(x) is a new model
«fitted» on the residuals of the previous model.
• The role of h1(x) is to «compensate» the shortcomings of the previous
model (to generate an output that, added to the previous function h0(x), cancels
or at least reduces the difference between the ground-truth output and the
generated output)
• If the new model (h0(x) + h1(x)) still performs poorly (= has high residuals),
we can add another model h2(x) fitted on the residuals of (h0(x) + h1(x)), and iterate
How do we «fit a new model» hi+1(x) on the
residuals of the previous model hi(x)?
• Just grow a model hi+1(x) (e.g., a tree) with the following training set:
• D': $\{(x_j,\; y_j - h_i(x_j))\} = \{(x_j,\; r_j^i)\}$

• We train with the current residuals (those of the previous model) rather than with
the «ground truth» output!
• The current model learns to output a quantity $\hat{r}$ which «compensates» the
residual of the previous model
• Let’s say that model «0» predicts h0(x) = 1 for some x in D, but the ground truth
is 1.5; we would like model «1» to output a quantity that, when summed with
the prediction of model «0», produces the exact value 1.5 (or at least a value
«closer» to the ground truth 1.5)!
• What does it mean to «fit a regression tree on negative gradients»?
• Remember: error = variance + bias² + noise (slide 5)
• A basic assumption of regression is that the sum of its residuals should
ideally be 0, i.e. the residuals should be spread randomly (which means
that the residual error is only “noise”, the random error ε that cannot be
eliminated)
• If, instead, we are able to see some “pattern” in the residuals around 0, we
can leverage that pattern to fit a model (a pattern in the residuals means
that the errors are somehow correlated, i.e. they are not random)
• GBM therefore reduces both variance and bias!
The intuition
behind GBM

Each tree is the «sum» of the previous trees with the current one
Gradient Boosting with Regression

Algorithm (link):
Start with an initial weak learner, say: $h_0(x) = \frac{\sum_j y_j}{n}$
(the output is the average of the observed values in the learning set)
Iterate until convergence:
1. Calculate the negative gradients $-g_j^i$ (the residuals $r_j^i$) of the current model $h_i(x)$
2. Fit a regression tree $h_r^i(x)$ to the negative gradients $-g_j^i$ of $h_i(x)$ (i.e., train on the pairs $(x_j, r_j^i)$)
3. $h_{i+1}(x) = h_i(x) + h_r^i(x)$
Note that the role of $h_r^i(x)$ is to COMPENSATE the shortcomings of $h_i(x)$.
The final prediction is: $\hat{y} = h_0(x) + \sum_i h_r^i(x) = \hat{y}^0 + \hat{r}_x^1 + \hat{r}_x^2 + \dots + \hat{r}_x^k$, since all the predictors $h_r^i(x)$ after $h_0$ are trained
to predict the residuals of the previous model.
The residuals can be weighted in some implementations of the «basic» algorithm: $h_{i+1}(x) = h_i(x) + \gamma\, h_r^i(x)$.
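A minimal from-scratch sketch of this algorithm for squared loss, with regression trees as base learners and a weighting factor γ as in the last line (function names and the toy data are my own):

```python
# Gradient boosting for regression: each tree is fitted on the current residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_rounds=50, gamma=0.1, max_depth=2):
    h0 = y.mean()                               # initial weak learner: the average
    pred = np.full_like(y, h0, dtype=float)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                    # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                  # fit the next tree on the residuals
        pred += gamma * tree.predict(X)         # h_{i+1}(x) = h_i(x) + gamma * h_r(x)
        trees.append(tree)
    return h0, trees

def gbm_predict(model, X, gamma=0.1):          # gamma must match the one used in fit
    h0, trees = model
    return h0 + gamma * sum(t.predict(X) for t in trees)

# Toy usage
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 1))
y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.1, 200)
model = gbm_fit(X, y)
print(np.mean((gbm_predict(model, X) - y) ** 2))   # training MSE shrinks with rounds
```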
Flowchart of GB

Example
(1)
Example:
first
iteration
https://blog.mlreview.com/gradient-boosting-from-scratch-
1e317ae4587d
Two more
iterations
After several
iterations
This visually
shows what
happens
during
gradient
descent

https://morioh.com/p/e108a4521555
Popular variants
of GBM: LGBM
• Remember: the «base» model in GBM is a
regression tree (see the lesson on regression
trees and the splitting method)
• LightGBM: rather than sorting all possible
splits as in classic regression trees, LGBM
uses a histogram-based algorithm which
buckets continuous features into discrete
bins to construct feature histograms
during training
• First, continuous data are discretized into
k bins (e.g., 256); next, statistics are
accumulated for each bin (frequency
histogram of the bins)
• The best split is then found using the k-bin statistics
• This gives an obvious reduction in memory
consumption; LGBM is advantageous with
large datasets
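A rough numpy illustration of the histogram idea (my own sketch, not LightGBM's actual implementation): bucket a continuous feature into k bins and accumulate per-bin statistics, so that split candidates are evaluated on k bins instead of on every sorted value.

```python
# Histogram-based split search: k bins instead of sorting all values.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)                      # one continuous feature
grad = rng.normal(size=100_000)                   # per-instance gradients (placeholder)

k = 256
edges = np.quantile(x, np.linspace(0, 1, k + 1))  # discretize into k bins
bins = np.clip(np.searchsorted(edges, x) - 1, 0, k - 1)

grad_sum = np.bincount(bins, weights=grad, minlength=k)   # accumulated statistics
count = np.bincount(bins, minlength=k)

# A split candidate is now just a bin boundary: cumulative sums give the
# left/right gradient totals for all k-1 candidate splits at once.
left_grad = np.cumsum(grad_sum)[:-1]
left_count = np.cumsum(count)[:-1]
print(left_grad.shape, left_count.shape)          # (255,) (255,)
```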
Other variants of GBM
• XGB link
• CATBoost (for categorical features) link
Other Suggested Readings
• General Overview: LINK and LINK
• Boosting and Bagging: LINK, LINK
• Random Trees and Extra Trees: LINK

• And what about heterogeneous ensembles?

• Stacking and Blending: LINK

95
