BOOSTING (ADABOOST ALGORITHM)
Eric Emer
Consider Horse-Racing Gambler
Rules of thumb for determining win/loss:
- Most favored odds
- Fastest recorded lap time
- Most wins recently, say, in the past month
It is hard to determine how the gambler combines analysis of the feature set into a single bet.
Consider MIT Admissions
- 2-class system (Admit/Deny)
- Both quantitative and qualitative data
- We consider (Y/N) answers to be quantitative (-1, +1); region, for instance, is qualitative.
Rules of Thumb, Weak Classifiers
- Easy to come up with rules of thumb that correctly classify the training data at better than chance.
  E.g., IF GoodAtMath == Y THEN predict Admit.
- Difficult to find a single, highly accurate prediction rule.
- This is where boosting, via the AdaBoost algorithm, helps us: it combines many weak rules of thumb into a single accurate classifier.
What is a Weak Learner?
- For any distribution, with high probability, given polynomially many examples and polynomial time, we can find a classifier with generalization error better than random guessing:
  generalization error $\epsilon < \frac{1}{2}$, also written as $\frac{1}{2} - \gamma$ for some edge $\gamma > 0$.
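As a concrete illustration, a decision stump (a threshold rule on a single feature) is a common choice of weak classifier. The sketch below, in Python with NumPy, finds the stump with the lowest weighted error; the function name and interface are assumptions for illustration, not part of the original slides.

import numpy as np

def train_stump(X, y, weights):
    """Find the single-feature threshold rule with the lowest weighted error.

    X: (n_samples, n_features) array, y: labels in {-1, +1},
    weights: distribution over training examples (sums to 1).
    Returns ((feature index, threshold, sign), weighted error).
    """
    n_samples, n_features = X.shape
    best, best_err = None, np.inf
    for j in range(n_features):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                # predict +sign above the threshold, -sign below it
                pred = sign * np.where(X[:, j] > thresh, 1, -1)
                err = np.sum(weights[pred != y])
                if err < best_err:
                    best_err, best = err, (j, thresh, sign)
    return best, best_err

Under the weak learning assumption, the returned weighted error should be below 1/2, i.e. the stump has some edge $\gamma$ over random guessing.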
Weak Learning Assumption
- We assume that our weak learning algorithm (weak learner) can consistently find weak classifiers: rules of thumb which classify the data correctly at better than 50%.
- Given this assumption, we can use boosting to generate a single weighted classifier which correctly classifies our training data at 99%-100%.
AdaBoost Specifics
- How does AdaBoost weight training examples optimally? It focuses on the difficult data points: those that have been misclassified most by the previous weak classifiers.
- How does AdaBoost combine these weak classifiers into a comprehensive prediction? It uses an optimally weighted majority vote of the weak classifiers.
AdaBoost Technical Description
Missing details: How do we generate the distribution $D_t$? How do we get a single classifier?
Constructing $D_t$

$D_1(i) = \frac{1}{m}$

Given $D_t$ and $h_t$:

$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases} = \frac{D_t(i)}{Z_t}\, e^{-\alpha_t y_i h_t(x_i)}$

where $Z_t$ is a normalization constant and

$\alpha_t = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) > 0$
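A minimal sketch of this update in Python/NumPy, assuming `y` holds the ±1 labels and `h_pred` holds the weak classifier's ±1 predictions on the training set (the names are assumptions for illustration):

import numpy as np

def update_distribution(D, y, h_pred):
    """One AdaBoost round: compute epsilon_t, alpha_t, and the new distribution D_{t+1}."""
    eps = np.sum(D[h_pred != y])                # weighted error of h_t under D_t
    alpha = 0.5 * np.log((1 - eps) / eps)       # alpha_t = 1/2 ln((1 - eps_t) / eps_t)
    D_new = D * np.exp(-alpha * y * h_pred)     # down-weight correct, up-weight incorrect examples
    D_new /= D_new.sum()                        # divide by Z_t so D_{t+1} is a distribution
    return D_new, alpha, eps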
Getting a Single Classifier
$H_{final}(x) = \text{sign}\!\left(\sum_t \alpha_t h_t(x)\right)$
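Putting the pieces together, a sketch of the full training loop and the weighted-majority vote. It assumes a `train_weak_learner(X, y, D)` callable returning an object with a `.predict(X)` method that outputs ±1 labels; that interface is an assumption for illustration, not part of the original slides.

import numpy as np

def adaboost(X, y, train_weak_learner, T=50):
    """Train T weak classifiers and return a weighted-vote predictor H_final."""
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    hypotheses, alphas = [], []
    for t in range(T):
        h = train_weak_learner(X, y, D)          # weak learner trained on distribution D_t
        pred = h.predict(X)                      # predictions in {-1, +1}
        eps = np.sum(D[pred != y])               # weighted training error eps_t
        alpha = 0.5 * np.log((1 - eps) / eps)    # vote weight alpha_t
        D = D * np.exp(-alpha * y * pred)        # re-weight examples
        D /= D.sum()                             # normalize by Z_t
        hypotheses.append(h)
        alphas.append(alpha)

    def H_final(X_new):
        # sign of the alpha-weighted sum of weak classifier outputs
        votes = sum(a * h.predict(X_new) for a, h in zip(alphas, hypotheses))
        return np.sign(votes)

    return H_final

Note that a weak classifier with lower weighted error $\epsilon_t$ receives a larger vote weight $\alpha_t$ in the final vote.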
Mini-Problem
Training Error Analysis
Claim: if we write the weighted error of each round as $\epsilon_t = \frac{1}{2} - \gamma_t$, then

$\text{training error}(H_{final}) \;\le\; \prod_t Z_t \;=\; \prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)} \;=\; \prod_t \sqrt{1 - 4\gamma_t^2} \;\le\; \exp\!\left(-2\sum_t \gamma_t^2\right)$

Proof outline:
Step 1: Unwrap the recurrence for $D_{t+1}$.
Step 2: Show $\text{training error}(H_{final}) \le \prod_t Z_t$.
Step 3: Show $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$.
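Filling in the three steps with the standard argument (a sketch using the definitions of $D_t$, $Z_t$, and $\alpha_t$ from the earlier slides):

Step 1 (unwrap the recurrence): with $f(x) = \sum_t \alpha_t h_t(x)$,
$D_{T+1}(i) = \frac{1}{m} \cdot \frac{\prod_t e^{-\alpha_t y_i h_t(x_i)}}{\prod_t Z_t} = \frac{1}{m} \cdot \frac{e^{-y_i f(x_i)}}{\prod_t Z_t}.$

Step 2 (bound by $\prod_t Z_t$): if $H_{final}(x_i) \neq y_i$ then $y_i f(x_i) \le 0$, so $\mathbf{1}[H_{final}(x_i) \neq y_i] \le e^{-y_i f(x_i)}$; averaging over $i$,
$\text{training error}(H_{final}) \le \frac{1}{m}\sum_i e^{-y_i f(x_i)} = \left(\prod_t Z_t\right) \sum_i D_{T+1}(i) = \prod_t Z_t.$

Step 3 (evaluate $Z_t$): splitting the sum over correctly and incorrectly classified examples and plugging in $\alpha_t$,
$Z_t = \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = (1-\epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t} = 2\sqrt{\epsilon_t(1-\epsilon_t)} = \sqrt{1 - 4\gamma_t^2} \le e^{-2\gamma_t^2}.$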
How might test error react to AdaBoost?
We expect to encounter overfitting: by Occam's razor, test error should eventually increase as $H_{final}$ grows more complex with additional rounds.
Empirical results of test error
- Test error does not increase, even after 1000 rounds.
- Test error continues to drop even after training error reaches zero.
Difference from Expectation: The Margins Explanation
- Our training error only measures the correctness of classifications; it neglects the confidence of classifications.
- How can we measure the confidence of classifications?

$H_{final}(x) = \text{sign}(f(x))$, where $f(x) = \frac{\sum_t \alpha_t h_t(x)}{\sum_t \alpha_t} \in [-1, 1]$, and $\text{margin}(x, y) = y\, f(x)$.

- Margin(x, y) close to +1: high confidence, correct.
- Margin(x, y) close to -1: high confidence, incorrect.
- Margin(x, y) close to 0: low confidence.
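A small sketch of computing these normalized margins in Python/NumPy, assuming `alphas` (the vote weights) and `preds` (a T x n_samples array of ±1 weak classifier outputs on the training set) are available from training; those names are assumptions:

import numpy as np

def margins(alphas, preds, y):
    """Normalized margin y * f(x) in [-1, 1] for each training example.

    alphas: (T,) vote weights, preds: (T, n_samples) array of +/-1 weak
    classifier outputs, y: (n_samples,) true labels in {-1, +1}.
    """
    alphas = np.asarray(alphas)
    f = alphas @ preds / alphas.sum()   # f(x) = sum_t alpha_t h_t(x) / sum_t alpha_t
    return y * f                        # close to +1: confident and correct; close to -1: confident and wrong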
Empirical Evidence Supporting Margins Explanation
[Figure: cumulative distribution of margins on training examples]
Pros/Cons of AdaBoost
Pros
- Fast
- Simple and easy to program
- No parameters to tune (except T, the number of rounds)
- No prior knowledge needed about the weak learner
- Provably effective given the Weak Learning Assumption
- Versatile
Cons
- Weak classifiers that are too complex lead to overfitting.
- Weak classifiers that are too weak can lead to low margins, and can also lead to overfitting.
- From empirical evidence, AdaBoost is particularly vulnerable to uniform noise.
Predicting College Football Results
- Training data: 2009 NCAAF season
- Test data: 2010 NCAAF season