Boosting and Mirror Descent
George Lan
A. Russell Chandler III Chair Professor
H. Milton Stewart School of Industrial & Systems Engineering
Rationale: Combination of methods
There is no single algorithm that is always the most accurate.
We can select simple "weak" classification or regression methods and combine them into a single "strong" method.
Different learners may use different algorithms, hyperparameters, representations, training sets, or subproblems.
Previously: Ensemble
[Figure: ensemble diagram — multiple learners applied to the problem(s) and combined]
Example of Weak Learners: Decision stump
Let $y \in \{\pm 1\}$.
Decision stump: $h(x; \theta_k) = \mathrm{sign}(w_k x_k + b_k)$.
Each decision stump pays attention to only a single dimension $x_k$ of the input vector.
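As a concrete illustration (not from the slides), a minimal decision stump in Python; the feature index k, sign w, and offset b play the role of the stump parameters $\theta_k$:

```python
import numpy as np

def stump_predict(x, k, w, b):
    """Decision stump h(x; theta_k) = sign(w * x_k + b): it looks only at
    the k-th coordinate of the input vector x."""
    return np.sign(w * x[..., k] + b)

# Hypothetical example: a stump that thresholds the 0th feature at 2.5.
X = np.array([[1.0, 4.0], [3.0, 0.5]])
print(stump_predict(X, k=0, w=1.0, b=-2.5))   # [-1.  1.]
```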
Boosting
Boosting: a general method for converting rough rules of thumb (weak classifiers) into a highly accurate prediction rule.
A family of methods which produce a sequence of classifiers.
Each classifier depends on the previous one and focuses on the previous one's errors.
Examples that are incorrectly predicted by the previous classifiers are chosen more often or weighted more heavily when estimating a new classifier.
Questions:
How to choose the "hardest" examples?
How to combine these classifiers?
Boosting Setup
Given a set of base classifiers $H = \{h_1, \ldots, h_n\}$, where $h_j : X \to \{1, -1\}$.
Training data (examples): $(x_i, y_i)$, $i = 1, \ldots, m$, where $x_i \in X$ and $y_i \in \{1, -1\}$.
Goal: construct
a sequence of distributions $\{w_k\}$ over the examples, and
a sequence $\{H_k\}$ of nonnegative combinations of base classifiers, such that
$H_k$ performs significantly better than any single base classifier in $H$.
Matrix notation
For notational convenience, define the feature matrix
$A_{ij} := y_i h_j(x_i)$.
Intuitively, the $(i,j)$-th entry of the matrix $A$ represents the effectiveness of the base classifier $h_j$, $j = 1, \ldots, n$, applied to the example $(x_i, y_i)$: it equals $+1$ if $h_j$ classifies the example correctly and $-1$ otherwise.
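A minimal sketch of how the feature matrix could be assembled, assuming (for illustration) that each base classifier is available as a Python callable mapping the data matrix to a vector of $\pm 1$ labels:

```python
import numpy as np

def feature_matrix(X, y, classifiers):
    """A[i, j] = y_i * h_j(x_i): +1 if h_j classifies example i correctly, -1 otherwise."""
    H = np.column_stack([h(X) for h in classifiers])  # m x n matrix of predictions
    return y[:, None] * H                             # multiply row i by the label y_i
```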
Formulation
Our goal is to choose the nonnegative weights 𝜆 to form an
improved classifier that maximizes the worst-case margin.
Margin for example i:
𝐴𝜆 # = 3(𝜆" 𝐴#" ) = 3[𝜆" 𝑦# ℎ" (𝑥# )] = 𝑦# 3[𝜆" ℎ" (𝑥# )]
" " "
Worst case margin:
𝑝 𝜆 ≔ 𝑚𝑖𝑛&'#,…,) (𝐴𝜆)& = min{𝑤 * 𝐴𝜆: 𝑤 ∈ Δ) }.
Δ) = {𝑤| ∑) &'# 𝑤& = 1, 𝑤& ≥ 0}
The optimization problem is defined as
𝑚𝑎𝑥+,- 𝑝 𝜆
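Given the matrix $A$, the per-example margins and the worst-case margin $p(\lambda)$ are direct to compute; a small sketch:

```python
import numpy as np

def worst_case_margin(A, lam):
    """p(lambda) = min_i (A lambda)_i: the smallest margin over all examples."""
    margins = A @ lam   # margin of example i: y_i * sum_j lambda_j h_j(x_i)
    return margins.min()
```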
Optimization problem
$\max_{\lambda \ge 0} p(\lambda)$, where $p(\lambda) := \min_{i=1,\ldots,m} (A\lambda)_i = \min\{w^T A\lambda : w \in \Delta_m\}$.
Note that $p(\lambda)$ is positively homogeneous, i.e., $p(a\lambda) = a\, p(\lambda)$ for $a \ge 0$.
It therefore makes sense to normalize $\lambda$ so that $\lambda \in \Delta_n = \{\lambda \mid \sum_{j=1}^n \lambda_j = 1,\; \lambda_j \ge 0\}$.
This is called a bilinear matrix game problem in optimization:
$\max_{\lambda \in \Delta_n} \{p(\lambda) := \min_{w \in \Delta_m} w^T A\lambda\}$
Saddle point formulation and duality
Original problem:
$\max_{\lambda \in \Delta_n} \{p(\lambda) := \min_{w \in \Delta_m} w^T A\lambda\}$
By duality:
$\min_{w \in \Delta_m} \{f(w) := \max_{\lambda \in \Delta_n} w^T A\lambda\}$
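As a small sanity check (not in the slides), note that $\max_{\lambda \in \Delta_n} w^T A\lambda$ is simply the largest entry of $A^T w$, so the dual problem is a linear program that can be solved directly for small instances; a sketch assuming scipy is available:

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(A):
    """Solve min_{w in Delta_m} max_j (A^T w)_j as an LP in the variables (w, t)."""
    m, n = A.shape
    c = np.r_[np.zeros(m), 1.0]                 # minimize t
    A_ub = np.c_[A.T, -np.ones(n)]              # (A^T w)_j - t <= 0 for every j
    b_ub = np.zeros(n)
    A_eq = np.r_[np.ones(m), 0.0][None, :]      # sum_i w_i = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]   # w >= 0, t free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]                 # optimal distribution w and game value
```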
How to solve the problem?
Subgradient descent
Mirror descent => Boosting
Subgradient computation --- Weak Learner
$\min_{w \in \Delta_m} \{f(w) := \max_{\lambda \in \Delta_n} w^T A\lambda\}$
Subgradient of $f$: $g(w) = A_{\cdot j^*}$ (the $j^*$-th column of $A$), where
$j^* = \mathrm{argmax}_{j \in \{1,\ldots,n\}} \sum_{i=1}^m w_i A_{ij} = \mathrm{argmax}_{j \in \{1,\ldots,n\}} \sum_{i=1}^m w_i\, y_i\, h_j(x_i)$
$\lambda(w) = e_{j^*}$, i.e., the $j^*$-th entry is 1 and the others are zero.
Weak learner W:
For any distribution $w$ on the examples,
return an index $j^*$ of the base classifier $h_{j^*}$ that does best on the weighted examples determined by $w$:
$j^* = \mathrm{argmax}_{j \in \{1,\ldots,n\}} \sum_{i=1}^m w_i\, y_i\, h_j(x_i)$
(see the sketch below).
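A minimal sketch of this weak-learner oracle when the finite family of base classifiers is encoded in the matrix $A$ (the function name is illustrative):

```python
import numpy as np

def weak_learner(A, w):
    """Return the index j* of the base classifier with the largest weighted edge
    (A^T w)_j = sum_i w_i y_i h_j(x_i), together with the subgradient column."""
    edges = A.T @ w              # weighted edge of each base classifier
    j_star = int(np.argmax(edges))
    g = A[:, j_star]             # subgradient of f at w: the j*-th column of A
    return j_star, g
```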
Subgradient descent
Subgradient descent algorithm
For $k = 1, 2, \ldots$:
$w_{k+1} = \mathrm{argmin}_{w \in \Delta_m} \|w_k - \gamma_k g(w_k) - w\|^2$
$\lambda_{k+1} = \frac{\sum_{t=1}^{k} \gamma_t \lambda(w_t)}{\sum_{t=1}^{k} \gamma_t}$
Subgradient $g(w) = (y_i h_{j^*}(x_i))_{i=1}^m$: if example $i$ is misclassified, the corresponding entry of $g(w)$ is negative, resulting in a weight increase in the next iteration.
Note: the subproblem of computing $w_{k+1}$ (a Euclidean projection onto the simplex) is relatively easy, but we do not have an explicit solution.
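A sketch of the resulting projected subgradient method. The Euclidean projection onto the simplex has no explicit formula, so the standard sort-based routine is used below (an assumption for illustration; the constant step size is also illustrative):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def subgradient_descent(A, n_iters=100, step=0.1):
    m, n = A.shape
    w = np.full(m, 1.0 / m)                 # uniform distribution over examples
    lam_sum, gamma_sum = np.zeros(n), 0.0
    for k in range(n_iters):
        j_star = int(np.argmax(A.T @ w))    # weak learner: best weighted edge
        g = A[:, j_star]                    # subgradient g(w_k)
        w = project_simplex(w - step * g)   # w_{k+1} = Proj_{Delta_m}(w_k - gamma_k g(w_k))
        lam_sum[j_star] += step             # running sum of gamma_k * lambda(w_k)
        gamma_sum += step
    return lam_sum / gamma_sum              # averaged lambda_{k+1}
```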
Mirror descent
$w_{k+1} = \mathrm{argmin}_{w \in \Delta_m} \tfrac{1}{2}\|w_k - \gamma_k g(w_k) - w\|^2 = \mathrm{argmin}_{w \in \Delta_m} \gamma_k g(w_k)^T w + \tfrac{1}{2}\|w - w_k\|^2$
We can replace the term $\tfrac{1}{2}\|w - w_k\|^2$ by a more general "distance-like" function.
In particular, since $\Delta_m$ is a simplex, we replace it by
$V(w_k, w) := v(w) - v(w_k) - v'(w_k)^T (w - w_k)$, with
$v(w) = \sum_{i=1}^m w_i \ln w_i$
Mirror Descent => Boosting
The Mirror Descent (or Boosting) Algorithm
For $k = 1, 2, \ldots$:
$w_{k+1} = \mathrm{argmin}_{w \in \Delta_m} \gamma_k g(w_k)^T w + V(w_k, w)$
$\lambda_{k+1} = \frac{\sum_{t=1}^{k} \gamma_t \lambda(w_t)}{\sum_{t=1}^{k} \gamma_t}$
How should we solve the subproblem? Up to constants it can be rewritten as
$\mathrm{argmin}_{w \in \Delta_m} \langle \gamma_k g(w_k) - v'(w_k),\, w \rangle + v(w)$, where
$v'(w_k) = [1, \ldots, 1]^T + [\ln w_{k,1}, \ldots, \ln w_{k,m}]^T$
A previous homework problem
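The entropy subproblem above has the explicit multiplicative-weights solution $w_{k+1,i} \propto w_{k,i}\exp(-\gamma_k g_i(w_k))$; this closed form is standard and is presumably what the referenced homework derives. A minimal sketch of the resulting loop, with illustrative constant step size and iteration count:

```python
import numpy as np

def mirror_descent_boosting(A, n_iters=100, step=0.5):
    """Entropy mirror descent on min_{w in Delta_m} max_j (A^T w)_j."""
    m, n = A.shape
    w = np.full(m, 1.0 / m)                    # start from the uniform distribution
    lam_sum, gamma_sum = np.zeros(n), 0.0
    for k in range(n_iters):
        j_star = int(np.argmax(A.T @ w))       # weak learner at w_k
        g = A[:, j_star]                       # subgradient: +1 correct, -1 misclassified
        w = w * np.exp(-step * g)              # misclassified examples get their weight boosted
        w /= w.sum()                           # renormalize onto the simplex
        lam_sum[j_star] += step
        gamma_sum += step
    return lam_sum / gamma_sum                 # weights of the combined classifier
```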
Base classifiers with continuous parameters
We had assumed that the base classifiers $h_j$ are given and fixed. What if they are parameterized by some continuous parameter $\theta_j$, i.e., $h_j(x_i, \theta_j)$?
It is difficult to list all possible values of $\theta_j$.
Replace the subgradient computation step by
$(j^*, \theta_{j^*}) = \mathrm{argmax}_{j \in \{1,\ldots,n\},\,\theta_j} \sum_{i=1}^m w_i\, y_i\, h_j(x_i, \theta_j)$
$g(w) = (y_i h_{j^*}(x_i, \theta_{j^*}))_{i=1}^m$
$h(w) = h_{j^*}(\cdot\,, \theta_{j^*})$
If example $i$ is misclassified, the corresponding entry of $g(w)$ is negative, resulting in a weight increase in the next iteration.
The process of finding $h(w)$ is sometimes called training the weak learner; a sketch follows below.
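For decision stumps, the sketch below trains the weak learner by scanning candidate thresholds for every feature (the midpoint enumeration is an illustrative choice, not specified in the slides):

```python
import numpy as np

def train_stump(X, y, w):
    """Find the decision stump (feature k, sign s, threshold b) maximizing the
    weighted edge sum_i w_i * y_i * h(x_i). Candidate thresholds are midpoints
    of consecutive observed feature values."""
    best = (-np.inf, None)
    for k in range(X.shape[1]):
        vals = np.unique(X[:, k])
        thresholds = (vals[:-1] + vals[1:]) / 2.0
        for b in thresholds:
            for s in (+1.0, -1.0):
                preds = s * np.sign(X[:, k] - b)
                edge = np.sum(w * y * preds)
                if edge > best[0]:
                    best = (edge, (k, s, b))
    return best[1]
```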
Adaptive Boosting (AdaBoost)
For $k = 1, 2, \ldots$:
Train a weak learner at $w_k$ to obtain $g(w_k)$ and $h(w_k)$.
$w_{k+1} = \mathrm{argmin}_{w \in \Delta_m} \gamma_k g(w_k)^T w + V(w_k, w)$
$H_{k+1} = \frac{\sum_{t=1}^{k} \gamma_t h(w_t)}{\sum_{t=1}^{k} \gamma_t}$
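Putting the pieces together, a sketch of an AdaBoost-style loop built on the train_stump helper above. The step size $\gamma_k = \tfrac12 \ln\frac{1-\epsilon_k}{\epsilon_k}$ is the standard AdaBoost choice (the slides leave $\gamma_k$ unspecified):

```python
import numpy as np

def adaboost(X, y, n_rounds=3):
    """AdaBoost with decision stumps, following the mirror-descent view:
    entropy step on w, nonnegative combination of the trained weak learners."""
    m = X.shape[0]
    w = np.full(m, 1.0 / m)                          # uniform initial distribution
    learners, gammas = [], []
    for k in range(n_rounds):
        kf, s, b = train_stump(X, y, w)              # train the weak learner at w_k
        preds = s * np.sign(X[:, kf] - b)
        eps = np.sum(w * (preds != y))               # weighted error of the weak learner
        gamma = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # standard AdaBoost step size
        w = w * np.exp(-gamma * y * preds)           # entropy mirror-descent step
        w /= w.sum()
        learners.append((kf, s, b))
        gammas.append(gamma)

    def H(Xnew):
        """Combined classifier: sign of the gamma-weighted sum of weak learners."""
        scores = sum(g * s * np.sign(Xnew[:, kf] - b)
                     for g, (kf, s, b) in zip(gammas, learners))
        return np.sign(scores)
    return H
```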
Toy Example
Weak classifiers: vertical or horizontal half-planes (decision stumps), each parameterized by one threshold value.
Let $y \in \{\pm 1\}$.
Uniform weights $w_1$ on all examples.
[Figure: the toy data set under the initial uniform distribution $w_1$]
Boosting round 1
Choose a decision stump (weak classifier) $h(w_1)$.
Some data points obtain higher weights under $w_2$ because they are classified incorrectly.
$\gamma_1 = 0.42$
[Figure: the chosen stump $h(w_1)$ and the reweighted distribution $w_2$]
Boosting round 2
Choose a new decision stump $h(w_2)$.
Reweight again: the weights of incorrectly classified examples increase in $w_3$.
$\gamma_2 = 0.65$
[Figure: the second stump $h(w_2)$ and the reweighted distribution $w_3$]
Boosting round 3
Repeat the same process.
Now we have 3 classifiers.
$\gamma_3 = 0.92$
[Figure: the third stump $h(w_3)$]
Boosting aggregate classifier
The final classifier is the weighted combination of the weak classifiers:
$H_{\mathrm{final}}(x) = \mathrm{sign}\big(0.42\, h(w_1)(x) + 0.65\, h(w_2)(x) + 0.92\, h(w_3)(x)\big)$