Ensemble Learning
Prof. Navneet Goyal
Ensemble Learning
• Introduction & Motivation
• Construction of Ensemble Classifiers
   – Boosting (AdaBoost)
   – Bagging
   – Random Forests
• Empirical Comparison
Introduction & Motivation
• Suppose that you are a patient with a set of symptoms
• Instead of taking the opinion of just one doctor (classifier),
  you decide to take the opinions of a few doctors!
• Is this a good idea? Indeed it is.
• Consult many doctors and then, based on their individual
  diagnoses, arrive at a fairly accurate final diagnosis.
• Majority voting - ‘bagging’
• More weightage to the opinion of some ‘good’
  (accurate) doctors - ‘boosting’
• In bagging, you give equal weightage to all classifiers,
  whereas in boosting you give weightage according to
  the accuracy of the classifier.
Introduction & Motivation
• Suppose you are preparing for an exam
• As part of the preparation, you have to solve 50
  problems
• You, as an individual, can solve only 15 problems
• But, one of your wingies can solve 20 problems, some of
  which are the same as your 15
• Another wingie can solve a set of 12 problems
• When you put them all together, you find that you guys
  have solutions to 32 problems
• 32 is much better than what you could solve
  individually!!
• Collaborative Learning
Introduction & Motivation
• Why do you think I ask you to form groups for
  assignments?
   – I am against individual student doing an assignment
• Maybe coz I want to reduce my evaluation work!!
   – NOT TRUE!!!
• Group members complement each other and do a
  better job
   – Provided it is not just one student who actually does the
     assignment and rest piggyback!!
• If all students in a group contribute properly, it’s a
  WIN-WIN situation for both students and faculty!!
Ensemble Methods
• Construct a set of classifiers from the training data
• Predict class label of previously unseen records by
  aggregating predictions made by multiple classifiers
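• A minimal sketch of the aggregation step (Python; the stand-in
  classifiers below are illustrative decision rules, not any particular
  algorithm from these slides):

```python
from collections import Counter

def majority_vote(classifiers, record):
    """Aggregate the base classifiers' predictions for one record by majority vote."""
    votes = [clf(record) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Stand-in base classifiers: simple decision rules on a 1-D record x
c1 = lambda x: 1 if x <= 0.35 else -1
c2 = lambda x: 1 if x > 0.75 else -1
c3 = lambda x: 1 if (x <= 0.35 or x > 0.75) else -1

print(majority_vote([c1, c2, c3], 0.2))   # -> 1
print(majority_vote([c1, c2, c3], 0.5))   # -> -1
```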
General Idea
[Figure: the original training data D is resampled into multiple
 data sets D1, D2, ..., Dt (Step 1: create multiple data sets);
 a classifier Ci is built from each Di (Step 2: build multiple
 classifiers); the classifiers are then combined into a single
 ensemble classifier C* (Step 3: combine classifiers).]
Figure taken from Tan et al. book “Introduction to Data Mining”
Ensemble Classifiers (EC)
• An ensemble classifier constructs a set of ‘base
  classifiers’ from the training data
• Methods for constructing an EC
   •   Manipulating training set
   •   Manipulating input features
   •   Manipulating class labels
   •   Manipulating learning algorithms
 Ensemble Classifiers (EC)
• Manipulating training set
   • Multiple training sets are created by resampling the data
     according to some sampling distribution
   • Sampling distribution determines how likely it is that an
     example will be selected for training – may vary from one trial
     to another
   • A classifier is built from each training set using a particular
     learning algorithm
   • Examples: Bagging & Boosting
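   • A minimal sketch of this resampling step (Python/NumPy, not from
     the slides) – each trial draws a training set of size d with
     replacement, and the sampling distribution may change from one
     trial to the next:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
examples = np.arange(1, d + 1)        # example IDs 1..10

# Trial 1: uniform sampling distribution (bagging-style)
p_uniform = np.full(d, 1.0 / d)
D1 = rng.choice(examples, size=d, replace=True, p=p_uniform)

# Trial 2: non-uniform distribution, e.g. example 4 is "harder",
# so it gets more probability mass (boosting-style)
p_boost = np.full(d, 1.0 / d)
p_boost[3] *= 3.0                     # index 3 is example 4
p_boost /= p_boost.sum()              # renormalize to a valid distribution
D2 = rng.choice(examples, size=d, replace=True, p=p_boost)

print(D1)
print(D2)
```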
 Ensemble Classifiers (EC)
• Manipulating input features
   • Subset of input features chosen to form each training set
   • Subset can be chosen randomly or based on inputs given by
     Domain Experts
   • Good for data that has redundant features
   • Random Forest is an example which uses DTs as its base
     classifiers
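   • A quick sketch of choosing a random feature subset for each base
     classifier (Python/NumPy; the function name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_feature_subset(X, n_sub):
    """Pick n_sub randomly chosen input features (columns) for one base classifier."""
    cols = rng.choice(X.shape[1], size=n_sub, replace=False)
    return X[:, cols], cols

X = rng.normal(size=(100, 8))               # 100 records, 8 possibly redundant features
X_sub, chosen = random_feature_subset(X, n_sub=3)
print("features used by this base classifier:", chosen)
```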
 Ensemble Classifiers (EC)
• Manipulating class labels
   • Used when the number of classes is sufficiently large
   • Training data is transformed into a binary class problem by
     randomly partitioning the class labels into 2 disjoint subsets,
     A0 & A1
   • Re-labelled examples are used to train a base classifier
   • By repeating the class relabelling and model building steps
     several times, an ensemble of base classifiers is obtained
   • How is a new tuple classified?
   • Example – error-correcting output coding (p. 307)
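   • A minimal sketch of the relabelling step (Python/NumPy, assumed
     names; combining the votes over many such partitions, as in
     error-correcting output coding, is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)

def relabel(y, classes):
    """Randomly partition the class labels into disjoint subsets A0 and A1
    and map every tuple's label to 0 (in A0) or 1 (in A1)."""
    shuffled = rng.permutation(classes)
    A0 = set(shuffled[: len(classes) // 2])
    return np.array([0 if label in A0 else 1 for label in y]), A0

classes = ["a", "b", "c", "d", "e", "f"]
y = np.array(["a", "c", "f", "b", "e", "a"])
y_binary, A0 = relabel(y, classes)
print("A0 =", A0)
print("relabelled y =", y_binary)
```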
 Ensemble Classifiers (EC)
• Manipulating learning algorithm
  • Learning algorithms can be manipulated in such a way that
    applying the algorithm several times on the same training
    data may result in different models
  • Example – ANN can produce different models by changing
    network topology or the initial weights of links between
    neurons
  • Example – an ensemble of DTs can be constructed by
    introducing randomness into the tree-growing procedure –
    instead of choosing the best split attribute at each node, we
    randomly choose one of the top k attributes
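  • A small sketch of the randomized split choice (Python/NumPy; the
    gain values are hypothetical and the rest of the tree-growing code
    is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def pick_split_attribute(gains, k=3):
    """Instead of the single best attribute, choose randomly among the top-k by gain."""
    top_k = np.argsort(gains)[::-1][:k]     # indices of the k highest-gain attributes
    return int(rng.choice(top_k))

# Hypothetical information gains computed at one tree node for 6 attributes
gains = np.array([0.02, 0.31, 0.11, 0.27, 0.05, 0.19])
print("attribute chosen at this node:", pick_split_attribute(gains, k=3))
```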
 Ensemble Classifiers (EC)
• First 3 approaches are generic – can be applied to any
  classifier
• Fourth approach depends on the type of classifier used
• Base classifiers can be generated sequentially or in
  parallel
 Ensemble Classifiers
• Ensemble methods work better with ‘unstable
  classifiers’
• Classifiers that are sensitive to minor perturbations in
  the training set
• Examples:
   – Decision trees
   – Rule-based
   – Artificial neural networks
Why does it work?
• Suppose there are 25 base classifiers
    – Each classifier has error rate, e = 0.35
    – Assume classifiers are independent
    – Probability that the ensemble classifier makes a wrong
      prediction:
          $$ \sum_{i=13}^{25} \binom{25}{i} \, e^{i} (1 - e)^{25-i} = 0.06 $$
    – CHK out yourself if it is correct!!
Example taken from Tan et al. book “Introduction to Data Mining”
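• A quick way to check the figure yourself (Python):

```python
from math import comb

e = 0.35     # error rate of each of the 25 independent base classifiers
# The ensemble errs when 13 or more of the 25 classifiers are wrong
ensemble_error = sum(comb(25, i) * e**i * (1 - e)**(25 - i) for i in range(13, 26))
print(round(ensemble_error, 4))   # ~0.06, as claimed
```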
   Examples of Ensemble Methods
• How to generate an ensemble of classifiers?
   – Bagging
   – Boosting
   – Random Forests
    Bagging
• Also known as bootstrap aggregation
     Original Data       1    2    3     4    5    6     7    8    9    10
     Bagging (Round 1)   7    8    10    8    2    5    10   10    5     9
     Bagging (Round 2)   1    4    9     1    2    3     2    7    3     2
     Bagging (Round 3)   1    8    5    10    5    5     9    6    3     7
• Sampling uniformly with replacement
• Build classifier on each bootstrap sample
• 0.632 bootstrap
• Each bootstrap sample Di contains approx. 63.2% of
  the original training data
• Remaining (36.8%) are used as test set
     Example taken from Tan et al. book “Introduction to Data Mining”
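• A quick empirical check of the 63.2% figure (Python/NumPy, not from
  the slides); analytically, a record is missed in all n draws with
  probability (1 - 1/n)^n ≈ e^{-1} ≈ 0.368:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rounds = 10_000, 20

fractions = []
for _ in range(rounds):
    sample = rng.integers(0, n, size=n)            # uniform sampling with replacement
    fractions.append(np.unique(sample).size / n)   # fraction of distinct original records in Di

print(round(float(np.mean(fractions)), 3))         # ~0.632; the remaining ~36.8% can serve as a test set
```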
Bagging
• Accuracy of bagging:
      $$ Acc(M) = \sum_{i=1}^{k} \left( 0.632 \cdot Acc(M_i)_{test\_set} + 0.368 \cdot Acc(M_i)_{train\_set} \right) $$
• Works well for small data sets
• Example:
         X   0.1    0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
         y   1      1     1     -1    -1    -1    -1    1     1     1
   (actual class labels)
 Example taken from Tan et al. book “Introduction to Data Mining”
   Bagging
• Decision stump: a single-level binary decision tree
• The split chosen by entropy is either x <= 0.35 or x <= 0.75
• A single stump has an accuracy of at most 70% on this data
[Figure: the ten data points with their actual class labels]
   Example taken from Tan et al. book “Introduction to Data Mining”
Bagging
                     Accuracy of the ensemble classifier: 100%
Example taken from Tan et al. book “Introduction to Data Mining”
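• A sketch of the whole example in Python/NumPy (not the book's exact
  rounds): decision stumps are trained on bootstrap samples and combined
  by majority vote; depending on the random samples, the ensemble usually
  reaches 100% on this data, even though a single stump cannot exceed 70%.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data from the slides
x = np.arange(0.1, 1.05, 0.1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

def train_stump(xs, ys):
    """Fit a one-level tree: pick the threshold and leaf labels with the fewest errors."""
    best = None
    for t in np.arange(0.05, 1.05, 0.1):            # candidate split points between the xs
        for left in (1, -1):
            for right in (1, -1):
                err = np.mean(np.where(xs <= t, left, right) != ys)
                if best is None or err < best[0]:
                    best = (err, t, left, right)
    return best[1:]                                  # (threshold, left label, right label)

def stump_predict(stump, xs):
    t, left, right = stump
    return np.where(xs <= t, left, right)

# Bagging: one stump per bootstrap sample, then a majority vote
k = 10
stumps = []
for _ in range(k):
    idx = rng.integers(0, len(x), size=len(x))       # bootstrap sample (with replacement)
    stumps.append(train_stump(x[idx], y[idx]))

votes = np.sum([stump_predict(s, x) for s in stumps], axis=0)
ensemble_pred = np.where(votes >= 0, 1, -1)
print("ensemble accuracy:", np.mean(ensemble_pred == y))   # often 1.0 on this data
```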
 Bagging- Final Points
• Works well if the base classifiers are unstable
• Increased accuracy because it reduces the variance
  of the individual classifier
• Does not focus on any particular instance of the
  training data
• Therefore, less susceptible to model over-fitting
  when applied to noisy data
• What if we want to focus on particular instances of the
  training data?
Boosting
• An iterative procedure to adaptively change
  distribution of training data by focusing more on
  previously misclassified records
   – Initially, all N records are assigned equal weights
   – Unlike bagging, weights may change at the end of a
     boosting round
   Boosting
• Records that are wrongly classified will have their
  weights increased
• Records that are classified correctly will have their
  weights decreased
   Original Data        1     2     3       4      5      6      7      8      9        10
   Boosting (Round 1)   7     3     2       8      7      9      4     10      6         3
   Boosting (Round 2)   5     4     9       4      2      5      1      7      4         2
   Boosting (Round 3)   4     4     8      10      4      5      4      6      3         4
                               • Example 4 is hard to classify
                               • Its weight is increased, therefore it is more likely
                               to be chosen again in subsequent rounds
     Example taken from Tan et al. book “Introduction to Data Mining”
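• A small sketch of this weighted re-sampling (Python/NumPy; the 2.0 and
  0.8 factors are arbitrary for illustration – AdaBoost's actual update
  rule comes later):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
weights = np.full(n, 1.0 / n)                      # round 1: equal weights

# Suppose example 4 (index 3) was misclassified in round 1:
# raise its weight, lower the others, then renormalize.
misclassified = np.zeros(n, dtype=bool)
misclassified[3] = True
weights[misclassified] *= 2.0
weights[~misclassified] *= 0.8
weights /= weights.sum()

# Round 2 training set: examples drawn in proportion to their weights
round2 = rng.choice(np.arange(1, n + 1), size=n, replace=True, p=weights)
print(round2)                                      # example 4 now tends to appear more often
```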
Boosting
• Equal weights are assigned to each training tuple
  (1/d for round 1)
• After a classifier Mi is learned, the weights are
  adjusted to allow the subsequent classifier Mi+1 to
  “pay more attention” to tuples that were
  misclassified by Mi.
• Final boosted classifier M* combines the votes of
  each individual classifier
• Weight of each classifier’s vote is a function of its
  accuracy
• Adaboost – popular boosting algorithm
Adaboost
• Input:
   – Training set D containing d tuples
   – k rounds
   – A classification learning scheme
• Output:
   – A composite model
Adaboost
• Data set D containing d class-labeled tuples (X1,y1),
  (X2,y2), (X3,y3),….(Xd,yd)
• Initially assign equal weight 1/d to each tuple
• To generate k base classifiers, we need k rounds or
  iterations
• In round i, tuples from D are sampled with replacement to
  form Di (of size d)
• Each tuple’s chance of being selected depends on its
  weight
Adaboost
• Base classifier Mi is derived from the training tuples of
  Di
• Error of Mi is tested using Di
• Weights of training tuples are adjusted depending
  on how they were classified
   – Correctly classified: Decrease weight
   – Incorrectly classified: Increase weight
• Weight of a tuple indicates how hard it is to classify
  it (directly proportional)
Adaboost
• Some classifiers may be better at classifying some
  “hard” tuples than others
• We finally have a series of classifiers that
  complement each other!
• Error rate of model Mi:
      $$ error(M_i) = \sum_{j=1}^{d} w_j \cdot err(X_j) $$
  where err(Xj) is the misclassification error of tuple Xj:
  it is 1 if Xj is misclassified and 0 otherwise
• If classifier error exceeds 0.5, we abandon it
• Try again with a new Di and a new Mi derived from it
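• A tiny sketch of this check (Python/NumPy, illustrative values):

```python
import numpy as np

def weighted_error(weights, predictions, labels):
    """error(Mi) = sum of the weights of the misclassified tuples (err(Xj) = 1 iff misclassified)."""
    return float(np.sum(weights * (predictions != labels)))

w = np.full(5, 1 / 5)
preds  = np.array([ 1, -1, 1,  1, -1])
labels = np.array([ 1,  1, 1, -1, -1])
err = weighted_error(w, preds, labels)
print(err, "-> abandon and resample" if err > 0.5 else "-> keep this classifier")
```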
Adaboost
• error (Mi) affects how the weights of training tuples are
  updated
• If a tuple is correctly classified in round i, its weight is
  multiplied by  $ error(M_i) / (1 - error(M_i)) $
• Adjust the weights of all correctly classified tuples this way
• Now the weights of all tuples (including the misclassified
  tuples) are normalized
• Normalization factor = $ \dfrac{sum\_of\_old\_weights}{sum\_of\_new\_weights} $
• The weight of classifier Mi's vote is
      $$ \log \frac{1 - error(M_i)}{error(M_i)} $$
Adaboost
• The lower a classifier's error rate, the more accurate it is, and
  therefore, the higher its weight for voting should be
• Weight of classifier Mi's vote is
      $$ \log \frac{1 - error(M_i)}{error(M_i)} $$
• For each class c, sum the weights of each classifier that
  assigned class c to X (unseen tuple)
• The class with the highest sum is the WINNER!
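• One round of this update and the resulting vote weight, as a sketch in
  Python/NumPy (natural log is used for the classifier weight):

```python
import numpy as np

def adaboost_round_update(weights, correct, error):
    """Apply the weight update described above for one round."""
    new_w = weights.copy()
    new_w[correct] *= error / (1.0 - error)            # shrink weights of correctly classified tuples
    new_w *= weights.sum() / new_w.sum()               # normalization factor = old weight sum / new weight sum
    vote_weight = np.log((1.0 - error) / error)        # weight of this classifier's vote
    return new_w, vote_weight

w = np.full(10, 0.1)
correct = np.array([True] * 8 + [False] * 2)           # suppose 2 of the 10 tuples were misclassified
err = float(np.sum(w[~correct]))                       # weighted error = 0.2
new_w, vote = adaboost_round_update(w, correct, err)
print(new_w.round(4))                                  # misclassified tuples end up at 0.25, the rest drop to 0.0625
print(round(vote, 3))                                  # log(0.8/0.2) ~ 1.386
```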
      Example: AdaBoost
  • Base classifiers: C1, C2, …, CT
   • Error rate:
         $$ \epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\big(C_i(x_j) \neq y_j\big) $$
   • Importance of a classifier:
         $$ \alpha_i = \frac{1}{2} \ln\!\left( \frac{1 - \epsilon_i}{\epsilon_i} \right) $$
   Example: AdaBoost
• Weight update:
      $$ w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times
         \begin{cases} \exp(-\alpha_j) & \text{if } C_j(x_i) = y_i \\
                       \exp(\alpha_j)  & \text{if } C_j(x_i) \neq y_i \end{cases} $$
  where $Z_j$ is the normalization factor
• If any intermediate round produces an error rate higher
  than 50%, the weights are reverted back to 1/N and the
  re-sampling procedure is repeated
• Classification:
      $$ C^*(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \, \delta\big(C_j(x) = y\big) $$
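• The whole procedure on the earlier toy data, as a hedged sketch in
  Python/NumPy (decision stumps as base classifiers, sampling by weight
  as described above; the weights are kept normalized to sum to 1, so the
  weighted error is taken as the sum of the weights of the misclassified
  points):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data reused from the bagging example
x = np.arange(0.1, 1.05, 0.1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])
N, T = len(x), 5

def train_stump(xs, ys):
    """Best single-split stump on (xs, ys): threshold plus left/right leaf labels."""
    best = None
    for t in np.arange(0.05, 1.05, 0.1):
        for left in (1, -1):
            for right in (1, -1):
                err = np.mean(np.where(xs <= t, left, right) != ys)
                if best is None or err < best[0]:
                    best = (err, t, left, right)
    return best[1:]

def predict(stump, xs):
    t, left, right = stump
    return np.where(xs <= t, left, right)

w = np.full(N, 1.0 / N)
stumps, alphas = [], []
while len(stumps) < T:
    idx = rng.choice(N, size=N, replace=True, p=w)     # sample the round's training set by weight
    stump = train_stump(x[idx], y[idx])
    miss = predict(stump, x) != y
    eps = float(np.sum(w[miss]))                       # weighted error on the full training set
    if eps > 0.5:                                      # revert the weights and resample, as above
        w = np.full(N, 1.0 / N)
        continue
    alpha = 0.5 * np.log((1 - eps) / (eps + 1e-10))    # importance of this classifier
    w = w * np.exp(np.where(miss, alpha, -alpha))      # raise misclassified weights, lower the rest
    w /= w.sum()                                       # Z_j: renormalize so the weights sum to 1
    stumps.append(stump)
    alphas.append(alpha)

# Classification: C*(x) = argmax_y sum_j alpha_j * delta(C_j(x) = y)
scores = np.sum([a * predict(s, x) for s, a in zip(stumps, alphas)], axis=0)
print("ensemble accuracy:", np.mean(np.where(scores >= 0, 1, -1) == y))
```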
   Illustrating AdaBoost
[Figure: a one-dimensional training set of + and - points, each starting
 with weight 0.1. After boosting round 1 (classifier B1) the weights of the
 hard-to-classify points rise to 0.4623 while the others drop to 0.0094,
 with alpha = 1.9459; round 2 (B2) gives alpha = 2.9323 and round 3 (B3)
 gives alpha = 3.8744, with the weights re-adjusted after each round.
 The overall weighted vote of B1, B2 and B3 classifies every point
 correctly.]