Bagging and Boosting
Roy Ian MSc(UOM), BSc. Eng (Hons.)(UOM), AMIESL, CCNP
Traditional modeling
• A single model is selected and trained with the dataset
• The full dataset is used to train the model
• All the features, or the features selected from the feature extraction process, will be used
• But,
• There can be overfitting issues - variance
• There can be underfitting issues - bias
• Final performance might not be as expected
Traditional modeling
• So why don't we combine multiple models?
General idea
[Diagram: the training data S is resampled into multiple data sets S1, S2, …, Sn; a classifier C1, C2, …, Cn is trained on each; the individual classifiers are combined into a single classifier H.]
Building ensemble classifiers
• Basic idea:
Build different “experts”, and let them vote
• Advantages:
• Improve predictive performance
• Other types of classifiers can be directly included
• Easy to implement
• Not much parameter tuning
• Disadvantages:
• The combined classifier is not so transparent (black box)
• Not a compact representation
Combining multiple models
1. Simple (unweighted) votes
• Standard choice
2. Weighted votes
• e.g., weight by tuning-set accuracy
3. Train a combining function
1. Prone to overfitting?
2. “Stacked generalization” (Wolpert)
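As a concrete illustration of options 1 and 2 above, here is a minimal sketch using scikit-learn's VotingClassifier; the dataset, base classifiers, and vote weights are illustrative assumptions, not part of the original slides:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

base = [("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier(max_depth=3))]

# 1. Simple (unweighted) majority vote
simple_vote = VotingClassifier(estimators=base, voting="hard")

# 2. Weighted vote, e.g. weights derived from tuning-set accuracy (values here are made up)
weighted_vote = VotingClassifier(estimators=base, voting="hard", weights=[2, 1, 3])

for name, clf in [("simple", simple_vote), ("weighted", weighted_vote)]:
    clf.fit(X, y)
    print(name, "training accuracy:", clf.score(X, y))
```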
Why do they work?
• Suppose there are 25 base classifiers
• Each classifier has error rate ε = 0.35
• Assume independence among classifiers
• Probability that the ensemble classifier makes a wrong prediction (at least 13 of the 25 classifiers err):
$$ \sum_{i=13}^{25} \binom{25}{i}\,\varepsilon^{i}\,(1-\varepsilon)^{25-i} = 0.06 $$
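A quick check of this number, using only Python's standard library:

```python
from math import comb

eps = 0.35
# Majority vote of 25 classifiers is wrong only if 13 or more individual classifiers are wrong
p_wrong = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i) for i in range(13, 26))
print(round(p_wrong, 3))  # ~0.06
```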
Common ensemble techniques
1. Bagging
2. Boosting
3. Stacking
Bootstrapping
Bagging : Bootstrap aggregating
• Training
o Given a dataset S, at each iteration i, a training set Si is sampled with
replacement from S (i.e. bootstrapping)
o A classifier Ci is learned for each Si
• Classification: given an unseen sample X,
o Each classifier Ci returns its class prediction
o Each model in the ensemble votes with equal weight
o The bagged classifier H counts the votes and assigns the class with the most
votes to X
• Regression: bagging can be applied to the prediction of continuous values by taking
the average of the individual predictions.
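A minimal sketch of this train/vote procedure with scikit-learn's BaggingClassifier; the dataset and parameter values are illustrative (the default base classifier is a decision tree):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Training: each of the 50 base classifiers C_i is fit on a bootstrap sample S_i drawn
# with replacement from the training set S (the default base estimator is a decision tree)
bagged = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bagged.fit(X_train, y_train)

# Classification: each C_i votes with equal weight; predict()/score() use the majority class
print("test accuracy:", bagged.score(X_test, y_test))
```

For regression, BaggingRegressor follows the same scheme but averages the individual predictions.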
Bagging
• Bagging works because it reduces variance by voting/averaging
o In some pathological hypothetical situations the overall error might
increase
o Usually, the more classifiers the better
• Problem: we only have one dataset.
• Solution: generate new datasets of size n by bootstrapping, i.e. sampling the
original dataset with replacement
• Can help a lot if data is noisy
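The bootstrapping step itself is just sampling row indices with replacement. A minimal sketch with NumPy (the arrays are illustrative stand-ins for a real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))    # stand-in feature matrix
y = rng.integers(0, 2, size=1000)  # stand-in labels

n = len(X)
# Draw one bootstrap replicate: n row indices sampled with replacement
idx = rng.integers(0, n, size=n)
X_boot, y_boot = X[idx], y[idx]

# Roughly 63% of the original rows appear in each replicate; the rest are "out of bag"
oob_mask = np.ones(n, dtype=bool)
oob_mask[idx] = False
print("out-of-bag fraction:", round(oob_mask.mean(), 2))
```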
When to use Bagging
• Learning algorithm is unstable: if small changes to the training set cause
large changes in the learned classifier.
• If the learning algorithm is unstable, then Bagging almost always
improves performance
• Some candidates:
1. Decision tree
2. Decision stump
3. Regression tree
4. Linear regression
5. SVMs
Random Forest
• Decision trees are individual learners that are combined. They are one of
the most popular learning methods commonly used for data exploration.
• One type of decision tree is called CART: the classification and regression
tree.
• CART uses greedy, top-down, binary recursive partitioning that divides the
feature space into sets of disjoint rectangular regions.
• Regions should be pure with respect to the response variable
• A simple model is fit in each region: majority vote for classification, a constant value
for regression.
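scikit-learn's DecisionTreeClassifier implements this CART-style partitioning. A short sketch (the dataset and depth limit are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Greedy, top-down, binary recursive partitioning into rectangular regions;
# each leaf predicts the majority class of the training points that fall in it
cart = DecisionTreeClassifier(max_depth=3, random_state=0)
cart.fit(X, y)

print(export_text(cart))  # the learned axis-aligned splits
```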
Random Forest
• Random forest (or random forests) is an ensemble classifier that consists
of many decision trees and outputs the class that is the mode of the
classes output by the individual trees.
• The term came from random decision forests, first proposed by
Tin Kam Ho of Bell Labs in 1995.
• The method combines Breiman's "bagging" idea and the random
selection of features.
Random Forest (Algorithm)
• Each tree is constructed using the following algorithm:
1. Let the number of training cases be N, and the number of variables in the classifier be M.
2. We are told the number m of input variables to be used to determine the decision at a node of
the tree; m should be much less than M.
3. Choose a training set for this tree by sampling N times with replacement from all N available
training cases (i.e. take a bootstrap sample). Use the remaining cases to estimate the error of the
tree by predicting their classes.
4. For each node of the tree, randomly choose m variables on which to base the decision at that
node. Calculate the best split based on these m variables in the training set.
5. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
• For prediction, a new sample is pushed down the tree.
• It is assigned the label of the training samples in the terminal node it ends up in.
• This procedure is iterated over all trees in the ensemble, and the average vote of all
trees is reported as the random forest prediction.
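A minimal sketch of the algorithm above with scikit-learn's RandomForestClassifier; the dataset and hyperparameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,     # number of bootstrapped, fully grown trees
    max_features="sqrt",  # m variables considered at each node (m << M)
    random_state=0,
)
forest.fit(X_train, y_train)

# Prediction: each sample is pushed down every tree and the votes are aggregated
print("test accuracy:", forest.score(X_test, y_test))
```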
Features and Advantages
The advantages of random forest are:
• It is one of the most accurate learning algorithms available. For many data
sets, it produces a highly accurate classifier.
• It runs efficiently on large databases.
• It can handle thousands of input variables without variable deletion.
• It gives estimates of what variables are important in the classification.
• It generates an internal unbiased estimate of the generalization error as
the forest building progresses.
• It has an effective method for estimating missing data and maintains
accuracy when a large proportion of the data are missing.
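Two of these properties are exposed directly by scikit-learn's RandomForestClassifier: the internal (out-of-bag) estimate of generalization error and the variable-importance scores. A brief sketch on an illustrative synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
forest.fit(X, y)

# Generalization estimate computed from the out-of-bag samples as the forest is built
print("OOB accuracy:", round(forest.oob_score_, 3))

# Estimates of which variables are important in the classification
for i, imp in enumerate(forest.feature_importances_):
    print(f"feature {i}: importance {imp:.3f}")
```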
Features and Advantages
• It has methods for balancing error in data sets with unbalanced class
populations.
• Generated forests can be saved for future use on other data.
• Prototypes are computed that give information about the relation between
the variables and the classification.
• It computes proximities between pairs of cases that can be used in
clustering, locating outliers, or (by scaling) give interesting views of the
data.
• The capabilities of the above can be extended to unlabeled data, leading to
unsupervised clustering, data views and outlier detection.
• It offers an experimental method for detecting variable interactions.
Disadvantages
• Random forests have been observed to overfit for some datasets with
noisy classification/regression tasks.
• For data including categorical variables with different numbers of
levels, random forests are biased in favor of those attributes with
more levels. Therefore, the variable importance scores from random
forests are not reliable for this type of data.
Tuning Random Forests
• Random forest models have multiple hyperparameters to tune:
1. The sampling scheme: number of predictors to randomly select at each split : max_features
{“auto”, “sqrt”, “log2”}, int or float, default=”auto”.
2. The total number of trees in the forest: n_estimators, int, default=100
3. The complexity of each tree: stop when a leaf has <= min_samples_leaf samples or when we
reach a certain max_depth.
4. In theory, each tree in the random forest is fully grown, but in practice this can be
computationally expensive (and add redundancy to the model); thus, imposing a minimum node
size is not unusual.
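A minimal tuning sketch over these hyperparameters with GridSearchCV; the grid values are illustrative, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "max_features": ["sqrt", "log2", 0.5],  # 1. predictors sampled at each split
    "n_estimators": [100, 300],             # 2. number of trees in the forest
    "min_samples_leaf": [1, 5],             # 3. complexity of each tree
    "max_depth": [None, 10],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)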
Tuning Random Forests
• When the number of predictors is large, but the number of relevant
predictors is small, you need to set max_features to a larger number.
Why?
If the number of chosen features is small, then at each split the chance of selecting a relevant
predictor will be low, and hence most trees in the ensemble will be weak models.
Boosting
• Boosting is considered to be one of the most significant developments
in machine learning
• Finding many weak rules of thumb is easier than finding a single, highly
accurate prediction rule
• The key is in combining the weak rules
• Incremental
• Build new models that try to do better on the previous models'
misclassifications
• Can get better accuracy
• Tends to overfit
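A minimal boosting sketch with scikit-learn's AdaBoostClassifier, whose default weak learner is a decision stump; the dataset and parameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new weak learner concentrates on the examples the previous ones misclassified
boost = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
boost.fit(X_train, y_train)
print("test accuracy:", boost.score(X_test, y_test))
```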
Boosting
[Diagram: the training sample yields a classifier C(0)(x); misclassified events are re-weighted and the next classifier C(1)(x) is trained on the re-weighted sample; re-weighting and training repeat for C(2)(x), C(3)(x), …, C(m)(x).]
The final classifier is the weighted sum over the N_Classifier members:
$$ y(x) = \sum_{i}^{N_\text{Classifier}} w_i \, C^{(i)}(x) $$
AdaBoost (Algorithm)
• AdaBoost re-weights events misclassified by the previous classifier by the factor
$$ \frac{1 - f_\text{err}}{f_\text{err}}, \qquad \text{with } f_\text{err} = \frac{\text{misclassified events}}{\text{all events}} $$
• AdaBoost also weights the classifiers using the error rate of the individual classifier, according to
$$ y(x) = \sum_{i}^{N_\text{Classifier}} \log\!\left(\frac{1 - f_\text{err}^{(i)}}{f_\text{err}^{(i)}}\right) C^{(i)}(x) $$
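A from-scratch sketch of these two update rules, using decision stumps as the weak classifiers. It mirrors the weighting on the slide rather than any particular library implementation, and it assumes labels in {-1, +1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """y must be in {-1, +1}; returns the weak classifiers and their log-odds weights."""
    n = len(y)
    w = np.full(n, 1.0 / n)  # start with uniform event weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        f_err = w[miss].sum() / w.sum()  # f_err = misclassified events / all events
        if f_err == 0 or f_err >= 0.5:
            break
        w[miss] *= (1 - f_err) / f_err   # re-weight misclassified events by (1 - f_err)/f_err
        w /= w.sum()
        stumps.append(stump)
        alphas.append(np.log((1 - f_err) / f_err))  # classifier weight log((1 - f_err)/f_err)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # y(x) = sum_i log((1 - f_err_i)/f_err_i) * C_i(x), then take the sign
    score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(score)
```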
AdaBoost (Algorithm)
[Figures: a worked AdaBoost example over rounds 1, 2, and 3, showing the re-weighting after each round.]
Popular boosting algorithms
• XGBoost
• LightGBM
• CatBoost
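These libraries all expose a scikit-learn-style interface. A brief sketch, assuming the xgboost and lightgbm packages are installed (parameter values are illustrative):

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in [XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4),
              LGBMClassifier(n_estimators=200, learning_rate=0.1)]:
    model.fit(X_train, y_train)
    print(type(model).__name__, "test accuracy:", model.score(X_test, y_test))
```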
Comparison
[Figure: comparison of the methods using 5-fold cross validation.]
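A sketch of such a comparison with 5-fold cross-validation; the models and dataset are illustrative stand-ins for those compared in the original figure:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(n_estimators=100, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross validation
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```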