
Boosting and Mirror Descent

George Lan

A. Russell Chandler III Chair Professor


H. Milton Stewart School of Industrial & Systems Engineering
Rationale: Combination of methods
There is no algorithm that is always the most accurate

We can select simple “weak” classification or regression methods and combine them into a single “strong” method

Different learners use different algorithms, hyperparameters, representations, training sets, subproblems

Previously: Ensemble
[Diagram: several learners, each applied to a problem or subproblem, combined into a single ensemble]
Example of Weak Learners: Decision stump
Let y ∈ {±1}

Decision stump:
h(x; θ_k) = sign(w_k x_k + b_k)

Each decision stump pays attention to only a single dimension x_k of the input vector; b_k sets the threshold in that dimension.
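A minimal sketch of such a stump in Python/NumPy (not from the slides; the coordinate index k, orientation w_k, and threshold b_k below are assumed illustrative parameters):

```python
import numpy as np

def stump_predict(X, k, w_k, b_k):
    """Decision stump h(x; theta_k) = sign(w_k * x_k + b_k).

    Looks only at the k-th coordinate of each input vector;
    w_k flips the orientation and b_k shifts the threshold.
    """
    return np.sign(w_k * X[..., k] + b_k)

# Example: classify 2-D points by their first coordinate against threshold 0.5
X = np.array([[0.2, 1.0], [0.8, -1.0]])
print(stump_predict(X, k=0, w_k=1.0, b_k=-0.5))   # -> [-1.  1.]
```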
Boosting
Boosting: a general method for converting rough rules of thumb (weak classifiers) into a highly accurate prediction rule

A family of methods which produce a sequence of classifiers

Each classifier depends on the previous one and focuses on the previous one’s errors
Examples that are incorrectly predicted by the previous classifiers are chosen more often or weighted more heavily when estimating a new classifier.

Questions:
How to choose “hardest” examples?
How to combine these classifiers?

4
Boosting Setup
Given a set of base classifiers H = {h_1, …, h_n}, where h_j : X → {1, −1}.
Training data (examples): (x_i, y_i), i = 1, …, m. Here x_i ∈ X and y_i ∈ {1, −1}.

To construct:
a sequence of distributions {w_k}
a sequence {H_k} of nonnegative combinations of base classifiers
H_k performs significantly better than any base classifier in H

5
Matrix notation
For notational convenience, let us define the feature matrix
A_{ij} := y_i h_j(x_i).
Intuitively, the (i, j)-th entry of the matrix A represents the effectiveness of the base classifier h_j, j = 1, …, n, applied to the example (x_i, y_i): it equals +1 if h_j classifies (x_i, y_i) correctly and −1 otherwise.

6
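As a quick illustration of the matrix A, a sketch that builds it from a few hypothetical stumps (the data, labels, and thresholds below are made up):

```python
import numpy as np

def build_feature_matrix(X, y, classifiers):
    """A[i, j] = y_i * h_j(x_i): +1 if classifier j is correct on example i, -1 otherwise."""
    A = np.empty((len(y), len(classifiers)))
    for j, h in enumerate(classifiers):
        A[:, j] = y * h(X)
    return A

X = np.array([[0.2], [0.6], [0.9]])
y = np.array([-1.0, 1.0, 1.0])
classifiers = [
    lambda X: np.sign(X[:, 0] - 0.4),   # stump with threshold 0.4
    lambda X: np.sign(X[:, 0] - 0.7),   # stump with threshold 0.7
]
A = build_feature_matrix(X, y, classifiers)
print(A)   # columns record which examples each stump gets right (+1) or wrong (-1)
```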
Formulation
Our goal is to choose the nonnegative weights λ to form an improved classifier that maximizes the worst-case margin.
Margin for example i:
(Aλ)_i = Σ_j λ_j A_{ij} = Σ_j λ_j y_i h_j(x_i) = y_i Σ_j λ_j h_j(x_i)
Worst-case margin:
p(λ) := min_{i=1,…,m} (Aλ)_i = min{w^T Aλ : w ∈ Δ_m},   where   Δ_m = {w | Σ_{i=1}^m w_i = 1, w_i ≥ 0}
The optimization problem is defined as
max_{λ ≥ 0} p(λ)

7
Optimization problem
max_{λ ≥ 0} p(λ),   p(λ) := min_{i=1,…,m} (Aλ)_i = min{w^T Aλ : w ∈ Δ_m}

Note that p(λ) is positively homogeneous, i.e.,
p(aλ) = a p(λ), a ≥ 0

It makes sense to normalize λ so that
λ ∈ Δ_n = {λ | Σ_{j=1}^n λ_j = 1, λ_j ≥ 0}

This is called a bilinear matrix game problem in optimization:
max_{λ∈Δ_n} {p(λ) := min_{w∈Δ_m} w^T Aλ}

8
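Since the inner minimum over the simplex is attained at a vertex, p(λ) is simply the smallest entry of Aλ. A tiny check of this and of positive homogeneity (the matrix and weights are arbitrary illustrative values):

```python
import numpy as np

def worst_case_margin(A, lam):
    """p(lambda) = min_i (A @ lambda)_i = min over w in Delta_m of w^T A lambda."""
    return np.min(A @ lam)

A = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
lam = np.array([1.0, 0.0])                  # lambda in the simplex Delta_n
print(worst_case_margin(A, lam))            # -> 1.0
print(worst_case_margin(A, 2.0 * lam))      # -> 2.0, i.e. p(a*lambda) = a*p(lambda)
```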
Saddle point formulation and duality
Original Problem
max_{λ∈Δ_n} {p(λ) := min_{w∈Δ_m} w^T Aλ}

By duality
min_{w∈Δ_m} {f(w) := max_{λ∈Δ_n} w^T Aλ}

How to solve the problem?


Subgradient descent
Mirror descent => Boosting

9
Subgradient computation --- Weak Learner
min_{w∈Δ_m} {f(w) := max_{λ∈Δ_n} w^T Aλ}
Subgradient of f: g(w) = A_{·j*}, the j*-th column of A, where
j* = argmax_{j∈{1,…,n}} Σ_{i=1}^m w_i A_{ij}
   = argmax_{j∈{1,…,n}} Σ_{i=1}^m w_i y_i h_j(x_i)
λ(w) = e_{j*}, i.e., the j*-th entry is 1, and the others are zero
Weak learner W:
For any distribution w on the examples,
return an index j* of the base classifier h_{j*} that does best on the examples weighted by w:
j* = argmax_{j∈{1,…,n}} Σ_{i=1}^m w_i y_i h_j(x_i)

10
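In matrix form, calling the weak learner is a single argmax over the columns of A; a minimal sketch (the matrix A is the toy one built earlier):

```python
import numpy as np

def weak_learner(w, A):
    """Return j* maximizing sum_i w_i A[i, j] = sum_i w_i y_i h_j(x_i),
    the subgradient g(w) = A[:, j*], and lambda(w) = e_{j*}."""
    j_star = int(np.argmax(w @ A))      # weighted "edge" of every base classifier
    g = A[:, j_star]
    lam = np.zeros(A.shape[1])
    lam[j_star] = 1.0
    return j_star, g, lam

A = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
w = np.ones(3) / 3                      # uniform distribution over the examples
print(weak_learner(w, A))               # column 0 wins: it is correct on every example
```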
Subgradient descent
Subgradient descent algorithm
For k = 1, 2, …:
w_{k+1} = argmin_{w∈Δ_m} ||(w_k − γ_k g(w_k)) − w||²
λ_{k+1} = Σ_{t=1}^k γ_t λ(w_t) / Σ_{t=1}^k γ_t

Subgradient: g(w) = (y_i h_{j*}(x_i))_{i=1,…,m}. If example i is misclassified, the corresponding entry of g(w) is negative, resulting in a weight increase in the next iteration.

Note: the subproblem of computing w_{k+1} (a Euclidean projection onto the simplex Δ_m) is relatively easy, but we do not have an explicit solution.

11
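A sketch of one subgradient descent step; the sort-based simplex projection below is a standard routine, not something given on the slides:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex Delta_m (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def subgradient_step(w, A, gamma):
    """w_{k+1} = argmin_{w in Delta_m} ||(w_k - gamma_k g(w_k)) - w||^2."""
    j_star = int(np.argmax(w @ A))          # weak learner at w_k
    g = A[:, j_star]                        # g(w_k)_i = y_i h_{j*}(x_i)
    return project_simplex(w - gamma * g), j_star

A = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
print(subgradient_step(np.ones(3) / 3, A, gamma=0.5))
# w stays uniform here: column 0 is correct on every example
```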
Mirror descent
w_{k+1} = argmin_{w∈Δ_m} (1/2)||(w_k − γ_k g(w_k)) − w||²
        = argmin_{w∈Δ_m} γ_k g(w_k)^T w + (1/2)||w − w_k||²

We can replace the term (1/2)||w − w_k||² by a more general “distance-like” function.

In particular, since Δ_m is a simplex, we replace it by
V(w_k, w) := v(w) − v(w_k) − v'(w_k)^T (w − w_k),   with   v(w) = Σ_{i=1}^m w_i ln w_i

12
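For this choice of v, a short calculation shows that on the simplex V(w_k, w) reduces to the Kullback–Leibler divergence Σ_i w_i ln(w_i / w_{k,i}); a numerical check, assuming strictly positive weights:

```python
import numpy as np

def bregman_entropy(w_k, w):
    """V(w_k, w) = v(w) - v(w_k) - <v'(w_k), w - w_k> with v(w) = sum_i w_i ln w_i."""
    v = lambda u: np.sum(u * np.log(u))
    grad_v = 1.0 + np.log(w_k)              # v'(w_k)_i = 1 + ln w_{k,i}
    return v(w) - v(w_k) - grad_v @ (w - w_k)

w_k = np.array([0.5, 0.3, 0.2])
w = np.array([0.2, 0.5, 0.3])
print(bregman_entropy(w_k, w), np.sum(w * np.log(w / w_k)))   # the two values agree
```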
Mirror Descent èBoosting
The Mirror Descent (or Boosting) Algorithm
For k = 1, 2, …:
w_{k+1} = argmin_{w∈Δ_m} γ_k g(w_k)^T w + V(w_k, w)
λ_{k+1} = Σ_{t=1}^k γ_t λ(w_t) / Σ_{t=1}^k γ_t

How should we solve the subproblem?

argmin_{w∈Δ_m} [γ_k g(w_k) − v'(w_k)]^T w + v(w),   where   v'(w_k) = [1, …, 1] + [ln w_{k,1}, …, ln w_{k,m}]

13
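Writing out the optimality conditions of this subproblem over the simplex gives a closed-form multiplicative update, w_{k+1,i} ∝ w_{k,i} exp(−γ_k g(w_k)_i). A sketch of the resulting loop (the constant stepsize is an arbitrary illustrative choice):

```python
import numpy as np

def mirror_descent_boosting(A, gammas):
    """Entropic mirror descent on f(w) = max_j (w^T A)_j over the simplex Delta_m.

    Returns the averaged combination weights lambda over the n base classifiers."""
    m, n = A.shape
    w = np.ones(m) / m                       # w_1: uniform distribution on the examples
    lam_sum = np.zeros(n)
    for gamma in gammas:
        j_star = int(np.argmax(w @ A))       # weak learner at w_k
        g = A[:, j_star]                     # subgradient g(w_k)
        w = w * np.exp(-gamma * g)           # closed-form solution of the prox subproblem
        w /= w.sum()                         # renormalize onto the simplex
        lam_sum[j_star] += gamma             # accumulate gamma_k * lambda(w_k) = gamma_k * e_{j*}
    return lam_sum / lam_sum.sum()           # lambda = sum_t gamma_t lambda(w_t) / sum_t gamma_t

A = np.array([[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]])   # two complementary base classifiers
print(mirror_descent_boosting(A, gammas=[0.5] * 20))   # weight is spread over both classifiers
```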
A previous homework problem

14
Base classifiers with continuous parameters
We had assumed that the base classifiers h_j are given and fixed. What if they are parameterized by some continuous parameter θ_j, i.e., h_j(x_i, θ_j)?
It is difficult to list all possible values of θ_j.
Replace the step for computation of the subgradient by
(j*, θ_{j*}) = argmax_{j∈{1,…,n}, θ_j} Σ_{i=1}^m w_i y_i h_j(x_i, θ_j)
g(w) = (y_i h_{j*}(x_i, θ_{j*}))_{i=1,…,m}
h(w) = h_{j*}(·, θ_{j*})
If example i is misclassified, the corresponding entry of g(w) is negative, resulting in a weight increase in the next iteration.
The process of finding h(w) is sometimes called training the weak learner.
15
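For decision stumps, training the weak learner can be done by scanning candidate thresholds on each coordinate; a sketch (the midpoint grid of observed values is an assumed discretization, not from the slides):

```python
import numpy as np

def train_stump(X, y, w):
    """Find the stump (coordinate k, orientation s, threshold b) maximizing sum_i w_i y_i h(x_i)."""
    best, best_params = -np.inf, None
    for k in range(X.shape[1]):
        vals = np.unique(X[:, k])
        mids = (vals[:-1] + vals[1:]) / 2
        thresholds = np.concatenate(([vals[0] - 1.0], mids, [vals[-1] + 1.0]))
        for b in thresholds:
            for s in (+1.0, -1.0):
                score = np.sum(w * y * (s * np.sign(X[:, k] - b)))   # weighted edge
                if score > best:
                    best, best_params = score, (k, s, b)
    return best_params, best

X = np.array([[0.2], [0.6], [0.9]])
y = np.array([-1.0, 1.0, 1.0])
print(train_stump(X, y, w=np.ones(3) / 3))   # picks the threshold 0.4 separating the classes
```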
Adaptive Boosting (AdaBoost)
For k = 1, 2, …:
Train a weak learner at w_k to obtain g(w_k) and h(w_k)
w_{k+1} = argmin_{w∈Δ_m} γ_k g(w_k)^T w + V(w_k, w)
H_{k+1} = Σ_{t=1}^k γ_t h(w_t) / Σ_{t=1}^k γ_t

16
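The slides leave the stepsize γ_k unspecified; a common choice (the classical AdaBoost stepsize) is γ_k = ½ ln((1 + edge_k)/(1 − edge_k)) with edge_k = Σ_i w_{k,i} y_i h(w_k)(x_i), which equals ½ ln((1 − ε_k)/ε_k) in terms of the weighted error ε_k. A hedged sketch with that choice, using a simple stump trainer as the weak learner (data and number of rounds are illustrative):

```python
import numpy as np

def best_stump(X, y, w):
    """Return the prediction function of the best weighted stump (threshold scan on each coordinate)."""
    best, best_fn = -np.inf, None
    for k in range(X.shape[1]):
        for b in np.unique(X[:, k]) - 1e-9:
            for s in (+1.0, -1.0):
                fn = lambda Z, k=k, b=b, s=s: s * np.sign(Z[:, k] - b)
                score = np.sum(w * y * fn(X))
                if score > best:
                    best, best_fn = score, fn
    return best_fn

def adaboost(X, y, rounds):
    w = np.ones(len(y)) / len(y)                        # w_1: uniform weights
    ensemble = []                                       # pairs (gamma_k, h(w_k))
    for _ in range(rounds):
        h = best_stump(X, y, w)                         # train the weak learner at w_k
        g = y * h(X)                                    # g(w_k)_i = y_i h(w_k)(x_i)
        edge = np.clip(np.sum(w * g), -0.999, 0.999)    # clip to avoid an infinite stepsize
        gamma = 0.5 * np.log((1 + edge) / (1 - edge))   # classical AdaBoost stepsize (assumed)
        ensemble.append((gamma, h))
        w = w * np.exp(-gamma * g)                      # misclassified examples get larger weight
        w /= w.sum()
    total = sum(gamma for gamma, _ in ensemble)
    return lambda Z: np.sign(sum(gamma * h(Z) for gamma, h in ensemble) / total)

X = np.array([[0.2, 0.7], [0.6, 0.1], [0.9, 0.9], [0.4, 0.4]])
y = np.array([-1.0, 1.0, 1.0, -1.0])
H = adaboost(X, y, rounds=5)
print(H(X))   # reproduces y on this tiny training set
```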
Toy Example
Weak classifiers: vertical or horizontal half-planes (i.e., decision stumps), each parameterized by one threshold value
Let y ∈ {±1}
Start with uniform weights w_1 on all examples
[Figure: training points with uniform weights w_1]

17
Boosting round 1
Choose a decision stump (weak classifier) h(w_1)
Some data points obtain higher weights (giving w_2) because they are classified incorrectly
[Figure: stump h(w_1) and reweighted examples w_2]
γ_1 = 0.42

18
Boosting round 2
Choose a new decision stump h(w_2)
Reweight again: the weights of incorrectly classified examples increase (giving w_3)
[Figure: stump h(w_2) and reweighted examples w_3]
γ_2 = 0.65

19
Boosting round 3
Repeat the same process
Now we have 3 classifiers
[Figure: stump h(w_3)]
γ_3 = 0.92

20
Boosting aggregate classifier
The final classifier is a weighted combination of the weak classifiers:
H_final = (0.42 h(w_1) + 0.65 h(w_2) + 0.92 h(w_3)) / (0.42 + 0.65 + 0.92), with predicted label sign(H_final(x))

21
