Ensemble Learning
Bagging and Boosting
Manu Madhavan
Outline
● Techniques to improve classification accuracy
● We focus on ensemble methods
● Bagging
● Boosting
○ Adaptive Boosting (AdaBoost)
● Random Forest
Ensemble Learning
● No solution is perfect!!
● Different algorithms may produce different solutions because they impose
different structures on the data
● No single algorithm is optimal
● Combine multiple models built on the given data into a single model of
better quality
What is an ensemble method?
● Ensemble is a Machine Learning concept in which the idea is to train
multiple models using the same learning algorithm
● An ensemble for classification is a composite model, made up of a
combination of classifiers
● The individual classifiers vote, and a class label prediction is returned by
the ensemble based on the collection of votes
● Ensembles tend to be more accurate than their component classifiers
Why?
Traditional learning models assume that the data
classes are well distributed.
In many real-world data domains, however, the
data are class-imbalanced, where the main class of
interest is represented by only a few tuples.
Ensemble learning is one way to improve
classification on class-imbalanced data.
Ensemble models also decrease the variance of a single
estimate, as they combine several estimates from
different models, so the result may be a model with
higher stability.
Ensemble Methods
● An ensemble combines a series of k learned models (or base classifiers),
M1, M2, . . . , Mk, with the aim of creating an improved composite
classification model, M∗
● A given data set, D, is used to create k training sets, D1, D2, . . . , Dk,
where Di (1 ≤ i ≤ k) is used to generate classifier Mi
● Given a new data tuple to classify, the base classifiers each vote by
returning a class prediction. The ensemble returns a class prediction
based on the votes of the base classifiers.
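As a rough illustration of this voting scheme (a minimal sketch, not code from these slides; the synthetic dataset and the choice of base classifiers are arbitrary assumptions), the combination step might look like this in Python:

# Majority-vote ensemble sketch: k base classifiers vote, M* returns the most-voted class.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

base_classifiers = [DecisionTreeClassifier(random_state=0),
                    LogisticRegression(max_iter=1000),
                    GaussianNB()]
for M in base_classifiers:
    M.fit(X, y)                                   # train each base classifier Mi

votes = np.array([M.predict(X) for M in base_classifiers])   # one row of class predictions per Mi
ensemble_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)   # M*: most votes wins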
Bagging and Boosting
Bagging: a homogeneous weak learners’ model in which the learners are
trained independently of each other, in parallel, and then combined to
determine the model average.
Boosting: also a homogeneous weak learners’ model, but it works differently
from bagging: the learners are trained sequentially and adaptively, each one
trying to improve on the predictions of the previous ones.
Bagging - intuition
Suppose that you are a patient and would like to
have a diagnosis made based on your symptoms.
Instead of asking one doctor, you may choose to
ask several. If a certain diagnosis occurs more than
any other, you may choose this as the final or best
diagnosis.
That is, the final diagnosis is made based on a
majority vote, where each doctor gets an equal
vote.
Bagging
● Bootstrap aggregating
● Bootstrap is a sampling technique: random sampling with
replacement
● Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2,
. . . , k), a training set, Di , of d tuples is sampled with replacement from
the original set of tuples, D.
● In bagging, each training set is a bootstrap sample
Bagging
● Because sampling with replacement is used, some of the original tuples of
D may not be included in Di , whereas others may occur more than once.
● A classifier model, Mi , is learned for each training set, Di .
● To classify an unknown tuple, X, each classifier, Mi , returns its class
prediction, which counts as one vote.
● The bagged classifier, M∗, counts the votes and assigns the class with the
most votes to X.
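A minimal sketch of this procedure, assuming decision trees as the base classifiers and a synthetic dataset (illustrative choices, not from the slides):

# Bagging sketch: k bootstrap samples of size d, one classifier per sample, majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
d, k = len(X), 10
rng = np.random.default_rng(0)

models = []
for i in range(k):
    idx = rng.integers(0, d, size=d)              # Di: d tuples sampled with replacement from D
    models.append(DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx]))   # Mi learned on Di

votes = np.array([Mi.predict(X) for Mi in models])            # each Mi casts one vote per tuple
bagged_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)  # M*: most votes wins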
Random Forest
● Another ensemble method
● The “forest” it builds is an ensemble of decision trees, usually trained
with the “bagging” method.
● The general idea of the bagging method is that a combination of learning
models improves the overall result.
● Random forest builds multiple decision trees and merges them
together to get a more accurate and stable prediction.
Decision trees are the building blocks of the Random Forest algorithm
Decision Tree: Recap
● How to classify 1s and 0s
○ Color
○ Underline
● What feature will allow me to split
the observations at hand in a way
that the resulting groups are as
different from each other as
possible (and the members of each
resulting subgroup are as similar to
each other as possible)?
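As a reminder of how that question is usually answered numerically, here is a small sketch using the Gini impurity as the split criterion (one common choice; the boolean feature stands in for “colour” or “underlined” and is purely illustrative):

# Split-quality sketch: a good split produces child groups that are as pure as possible.
import numpy as np

def gini(labels):
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)                   # 0 means a perfectly pure group

def split_impurity(feature, labels):
    # weighted Gini impurity of the two groups produced by a binary feature
    left, right = labels[feature], labels[~feature]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

labels = np.array([1, 1, 1, 0, 0, 0])             # the 1s and 0s to be classified
is_blue = np.array([True, True, True, False, True, False])   # hypothetical "colour" feature
print(split_impurity(is_blue, labels))            # lower impurity = better split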
Random Forest
● Consists of a large number of
individual decision trees that operate
as an ensemble.
● Each individual tree in the random
forest spits out a class prediction and
the class with the most votes
becomes our model’s prediction
https://towardsdatascience.com/understanding-random-forest-58381e0602d2
Random Forest
https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/
Random Forest
● Imagine that each of the classifiers in the ensemble is a decision tree
classifier so that the collection of classifiers is a “forest”.
● The individual decision trees are generated using a random selection of
attributes at each node to determine the split
● Each tree depends on the values of a random vector sampled
independently and with the same distribution for all trees in the forest.
● During classification, each tree votes and the most popular class is
returned.
Random Forest - working
● There are two stages in Random Forest algorithm
○ Random forest creation
○ Make a prediction from the random forest classifier
● Decision trees are very sensitive to the data they are trained on:
small changes to the training set can result in significantly different tree
structures.
● Random forest takes advantage of this by allowing each individual tree to
randomly sample from the dataset with replacement
○ Bagging
Random Forest - working
Random Forest creation
1. Randomly select “k” features from the total “m” features, where k << m
2. Among the “k” features, compute the best split point and create a node “d”
3. Split the node into daughter nodes using the best split
4. Repeat steps 1-3 until “l” number of nodes has been reached
5. Build the forest by repeating steps 1-4 “n” times to create “n” trees
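A rough Python sketch of these steps, leaning on scikit-learn's DecisionTreeClassifier for the per-node splitting; max_features approximates the random selection of k features at each split, and the numbers are illustrative assumptions:

# Forest-creation sketch: n trees, each grown on a bootstrap sample,
# each split considering only a random subset of k of the m features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
m = X.shape[1]
k = int(np.sqrt(m))                               # k << m; sqrt(m) is a common default
n_trees, max_nodes = 25, 31
rng = np.random.default_rng(0)

forest = []
for t in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))    # bootstrap sample of the training data (bagging)
    tree = DecisionTreeClassifier(max_features=k,           # steps 1-2: k random features per split
                                  max_leaf_nodes=max_nodes, # step 4: stop growing at l nodes
                                  random_state=t)
    forest.append(tree.fit(X[idx], y[idx]))       # steps 3 and 5: grow the tree, add it to the forest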
Random Forest - working
Random Forest prediction
1. Take the test features, use the rules of each randomly created
decision tree to predict the outcome, and store each predicted outcome
(target)
2. Calculate the votes for each predicted target
3. Take the most-voted predicted target as the final prediction from the
random forest algorithm
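Continuing the creation sketch above, prediction then reduces to counting votes over the trees (again an illustrative sketch; forest and X come from the previous snippet):

# Forest-prediction sketch: each tree votes, the most-voted class is the forest's answer.
import numpy as np

votes = np.array([tree.predict(X) for tree in forest])        # step 1: one prediction row per tree
final_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)   # steps 2-3: tally and pick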
Random Forest - Advantages
1. For classification problems, the Random Forest algorithm reduces the risk of
overfitting compared with a single decision tree
2. The same random forest algorithm can be used for both classification and
regression tasks
3. The Random Forest algorithm can be used to identify the most
important features in the training dataset, i.e., for feature selection
Random Forest
Python example
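A minimal example along these lines, using scikit-learn's RandomForestClassifier on the Iris dataset (the dataset and parameter values are illustrative assumptions, not taken from the original slide):

# Random Forest with scikit-learn: train, predict, and inspect feature importances.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=100,     # number of trees in the forest
                            max_features="sqrt",  # random subset of features per split
                            random_state=42)
rf.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, rf.predict(X_test)))
print("Feature importances:", rf.feature_importances_)   # ties in with advantage 3 above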
Boosting
● Ensemble learning technique
Assume this situation:
● If a data point is incorrectly predicted by the first model, and then by the
next (and probably by all of them), will combining the predictions provide better
results?
● Such situations are taken care of by boosting.
● Boosting combines weak learners into a strong learner.
Boosting - intuition
Suppose that as a patient, you have
certain symptoms. Instead of
consulting one doctor, you choose to
consult several. Suppose you assign
weights to the value or worth of each
doctor’s diagnosis, based on the
accuracies of previous diagnoses they
have made. The final diagnosis is then
a combination of the weighted
diagnoses.
Boosting
● Boosting is a sequential process, where each subsequent model
attempts to correct the errors of the previous model.
https://medium.com/swlh/boosting-and-bagging-explained-with-examples-5353a36eb78d
Boosting
● Train model A on the whole training set
● Train model B with the data exaggerated in the regions where A
performs poorly (i.e., “pay more attention” to the training tuples that A
misclassified); each such model is a weak classifier
● The final boosted classifier, M∗, combines the votes of each individual
classifier into a strong classifier
Adaboost
Adaptive Boosting
1. We are given D, a data set of d class-labeled tuples, (X1 , y1 ), (X2 , y2 ), . . .
, (Xd , yd )
2. Initialise the weights: assign an equal weight of 1/d to each data
point
3. Provide this as input to the model and identify the wrongly classified data
points
4. Increase the weights of the wrongly classified data points
5. Repeat steps 3-4 until the required results are obtained
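scikit-learn packages this loop as AdaBoostClassifier; a minimal usage sketch (the dataset and parameters are illustrative assumptions):

# AdaBoost with scikit-learn: the default weak learner is a depth-1 decision tree (a "stump").
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=50,         # number of boosting rounds / weak learners
                         random_state=0)
ada.fit(X_train, y_train)                         # tuple weights are re-adjusted internally each round
print("Test accuracy:", ada.score(X_test, y_test))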
Adaboost
B1 - We have 10 points: 5 plus (+) and 5 minus (−).
Each one has been assigned an equal weight
initially.
The first model tries to classify the data
points and generates a vertical separator
line, but it wrongly classifies 3 pluses (+) as
minuses (−).
Adaboost
B2 consists of the 10 data points from the
previous model, in which the 3 wrongly
classified pluses (+) are weighted more, so that
the current model tries harder to classify
these pluses (+) correctly.
This model generates a vertical separator
line which correctly classifies the previously
misclassified pluses (+), but in this
attempt it wrongly classifies three
minuses (−).
Adaboost
B3 consists of the 10 data points from the
previous model, in which the 3 wrongly
classified minuses (−) are weighted more, so
that the current model tries harder to
classify these minuses (−) correctly.
This model generates a horizontal
separator line which correctly classifies the
previously misclassified minuses (−).
Adaboost
B4 combines together B1, B2 and B3 in
order to build a strong prediction model
which is much better than any individual
model used.
(Example from Geeksforgeeks)
Adaboost - how to find weak learners?
To compute the error rate of model Mi, we sum the weights of the tuples in Di that it misclassifies.
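In the standard formulation this is
error(Mi) = Σj wj · err(Xj), summed over the d tuples of Di,
where err(Xj) = 1 if Mi misclassifies tuple Xj and 0 otherwise, and wj is the weight of Xj in round i. If error(Mi) exceeds 0.5, Mi is abandoned and a new training set Di is sampled to derive a new Mi.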
Adaboost
If a tuple in round i was correctly classified, its weight is multiplied by
error(Mi)/(1 − error(Mi)).
Once the weights of all the correctly classified tuples are updated, the
weights for all tuples (including the misclassified ones) are normalized so
that their sum remains the same as it was before.
To normalize a weight, we multiply it by the sum of the old weights, divided
by the sum of the new weights. As a result, the weights of misclassified
tuples are increased and the weights of correctly classified tuples are
decreased
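A short sketch of this reweighting step (a minimal illustration, not code from the slides):

# AdaBoost reweighting sketch: shrink the weights of correctly classified tuples, then renormalize.
import numpy as np

def update_weights(w, correct, error):
    # w: tuple weights for this round; correct: boolean mask of correctly classified tuples
    # error: weighted error rate error(Mi) of this round's classifier, assumed to be < 0.5
    old_sum = w.sum()
    w_new = w.copy()
    w_new[correct] *= error / (1.0 - error)       # multiply correct tuples' weights by error/(1 - error)
    return w_new * (old_sum / w_new.sum())        # renormalize so the total weight stays the same

d = 10
w = np.full(d, 1.0 / d)                           # every tuple starts with weight 1/d
correct = np.array([True] * 7 + [False] * 3)      # suppose 3 tuples were misclassified this round
error = w[~correct].sum()                         # error(Mi): sum of the misclassified tuples' weights
print(update_weights(w, correct, error))          # misclassified tuples now carry relatively more weight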
Adaboost - ensemble prediction
Adaboost assigns a weight to each classifier’s vote, based on how well the
classifier performed.
The lower a classifier’s error rate, the more accurate it is, and therefore, the
higher its weight for voting should be.
For each class, c, we sum the weights of each classifier that assigned class c to
X. The class with the highest sum is the “winner” and is returned as the class
prediction for tuple X.
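Concretely, in the standard formulation the weight of classifier Mi's vote is
weight(Mi) = log((1 − error(Mi)) / error(Mi))
so the lower error(Mi) is, the larger Mi's contribution to the weighted vote for its predicted class.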