Lec 06 SVM

Support Vector Machine (SVM) is a supervised learning classification model that aims to find the global optimum for separating classes in feature space. It identifies the optimal hyperplane by maximizing the margin between classes, with training samples on the margin lines referred to as support vectors. The optimization problem involves minimizing a function subject to constraints that ensure correct classification of data points.


Introduction to Support Vector Machine (SVM)
Introduction
Support Vector Machine (SVM):
• is a classification model.
• belongs to supervised learning.
• finds the global optimum, not a local optimum.
SVM is one of the most effective solutions for classification problems.
Introduction
Binary classification can be viewed as the task of
separating classes in feature space.
[Figure: data points in the (x1, x2) plane separated by the line w·x + b = 0, where w = (w1, w2) and x = (x1, x2); the region w·x + b > 0 lies on one side and w·x + b < 0 on the other.]

The classifier is f(x) = sign(w·x + b).
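As a concrete illustration, here is a minimal sketch (not from the slides) of the classifier f(x) = sign(w·x + b) in Python, using made-up values for w and b:

```python
import numpy as np

def linear_classifier(x, w, b):
    """Return +1 or -1 depending on which side of the hyperplane w.x + b = 0 the point x lies."""
    return np.sign(np.dot(w, x) + b)

# Made-up parameters w = (w1, w2) and b, for illustration only.
w = np.array([2.0, -1.0])
b = 0.5
print(linear_classifier(np.array([1.0, 1.0]), w, b))   # +1.0
print(linear_classifier(np.array([-1.0, 2.0]), w, b))  # -1.0
```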
Introduction
There are infinitely many linear separators; which one is optimal?

[Figure: the same data with one candidate separating line w·x + b = 0, w = (w1, w2), x = (x1, x2), and the classifier f(x) = sign(w·x + b).]
Introduction
There are infinitely many linear separators; which one is optimal?

[Figure: the optimal separator w·x + b = 0 together with the two margin lines w·x + b = +1 and w·x + b = −1; points satisfy w·x + b > +1 on the positive side and w·x + b < −1 on the negative side, and f(x) = sign(w·x + b).]

Training samples that lie on the margin lines are called support vectors; they are the points that shape the optimal hyperplane.
Introduction
[Figure: the decision boundary f(x) = 0 with margin lines f(x) = +1 and f(x) = −1, a positive support vector x+, a negative support vector x−, two points p1 and p2 marked on the boundary, and the weight vector w drawn perpendicular to the boundary; the half-spaces are labeled Region +1 and Region −1.]

The vector w is perpendicular to the decision boundary:

Pick p1 and p2 on the decision boundary such that f(p1) = 0 and f(p2) = 0.
We have w·p1 + b = 0 and w·p2 + b = 0, hence w·(p1 − p2) = 0.
Therefore w ⊥ (p1 − p2), i.e., w is perpendicular to the decision boundary.
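A quick numerical sanity check of this fact (the hyperplane parameters below are assumed for illustration; any two points on the boundary would do):

```python
import numpy as np

# Assumed hyperplane parameters (not from the slides).
w = np.array([3.0, 4.0])
b = -2.0

# Two arbitrary points p1, p2 on the line w.x + b = 0 (solve for x2 given x1).
def point_on_boundary(x1):
    return np.array([x1, (-b - w[0] * x1) / w[1]])

p1, p2 = point_on_boundary(0.0), point_on_boundary(5.0)
print(np.dot(w, p1 - p2))  # ~0.0, so w is perpendicular to the boundary direction p1 - p2
```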
Introduction
[Figure: the margin of width M between the lines f(x) = −1 and f(x) = +1, with support vectors x− and x+, the weight vector w, and Region +1 / Region −1 on either side.]

Width of the separation margin M:

Pick x− on a margin line such that f(x−) = −1, and let x+ be the closest point to x− such that f(x+) = +1.
We have w·x+ + b = +1 and w·x− + b = −1, hence w·(x+ − x−) = 2.
Since x+ − x− is parallel to w and has length M, w·(x+ − x−) = ||w||·M = 2, so M = 2 / ||w||.
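For a trained linear SVM, this margin width can be read off the learned weight vector. A minimal sketch, assuming scikit-learn is available, with a small made-up dataset and a very large C to approximate a hard margin:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data.
X = np.array([[0, 0], [0, 1], [2, 2], [2, 3]], dtype=float)
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=1e6)  # very large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)  # M = 2 / ||w||
print("w =", w, "margin width M =", margin)
```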
Introduction
[Figure: the margin of width M between f(x) = −1 and f(x) = +1, with support vectors x+ and x−, the weight vector w, and Region +1 / Region −1.]

Optimization: argmax M = 2/||w||  →  argmin ||w||/2  →  argmin (1/2)||w||², such that all data points are on the correct side of the margins (positive and negative side).
Introduction
[Figure: the margin with support vectors x+ and x−, the weight vector w, and Region +1 / Region −1.]

Problem:
  argmin_w (1/2)||w||²
  Constraints: if y_i = +1 then w·x_i + b ≥ +1; if y_i = −1 then w·x_i + b ≤ −1.

This is a binary classification problem in which x is the data feature and y is the data label (+1 or −1).
Introduction
[Figure: the margin with support vectors x+ and x−, the weight vector w, and Region +1 / Region −1.]

Problem:
  argmin_w (1/2)||w||²
  Constraints: if y_i = +1 then w·x_i + b ≥ +1; if y_i = −1 then w·x_i + b ≤ −1.

How to solve? The two constraints can be combined into a single one:
  argmin_w (1/2)||w||²
  Constraint: y_i(w·x_i + b) ≥ +1
Introduction
[Figure: the margin with support vectors x+ and x−, the weight vector w, and Region +1 / Region −1.]

Problem:
  argmin_w (1/2)||w||²
  Constraint: y_i(w·x_i + b) ≥ +1

How to solve?
• Use the Lagrange multiplier technique.
Exercise: Optimization problem
Find argmin_{x,y} f(x, y) = x² + y²
Subject to g(x, y) = x + y = 1

High school solution?


Exercise: Optimization problem
Find argmin_{x,y} f(x, y) = x² + y²
Subject to g(x, y) = x + y = 1

High school solution

Substitute y = 1 − x:
  f(x) = x² + (1 − x)² = 2x² − 2x + 1
  f′(x) = 4x − 2 = 0 → x = 1/2 → y = 1/2 → f_min(1/2, 1/2) = 1/2
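For reference, the same substitution can be checked symbolically; a small sketch assuming SymPy is available:

```python
import sympy as sp

x = sp.symbols("x")
f = x**2 + (1 - x)**2            # substitute y = 1 - x from the constraint
x_star = sp.solve(sp.diff(f, x), x)[0]
print(x_star, 1 - x_star, f.subs(x, x_star))  # 1/2, 1/2, 1/2
```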
Exercise: Optimization problem
Find argmin_{x,y} f(x, y) = x² + y²
Subject to g(x, y) = x + y = 1

University solution
Use the Lagrange multiplier technique.

Joseph-Louis Lagrange.
(Italian mathematician and astronomer)
(Extra reading) Optimization problem
Find argmin_{x,y} f(x, y)
Subject to g(x, y) = k

[Figure: the constraint g(x, y) = k drawn as a level curve of the function g(x, y).]
(Extra reading) Optimization problem
Find argmin_{x,y} f(x, y)
Subject to g(x, y) = k

[Figure: the surface of f with the boundary curve (the projection of the constraint curve g onto f); the maximum and minimum of this boundary curve are marked, and the constraint curve g(x, y) = k is shown below as a level curve.]
(Extra reading) Optimization problem
Find argmin_{x,y} f(x, y)
Subject to g(x, y) = k

Solution:
  ∇f(x, y) = λ∇g(x, y)
  g(x, y) = k

λ is the Lagrange multiplier. The gradient ∇f is parallel to ∇g because, at the constrained optimum, both are perpendicular to the tangent line of the constraint curve at that point.
(Extra reading) Optimization problem
Find argmin_{x,y} f(x, y) = x² + y²
Subject to g(x, y) = x + y = 1

University solution: use the Lagrange multiplier technique.

Solution:
  ∇f(x, y) = λ∇g(x, y)
  g(x, y) = x + y = 1
Exercise: Optimization problem
Find argmin_{x,y} f(x, y) = x² + y²
Subject to g(x, y) = x + y = 1

Solution using Lagrange multipliers:
  ∇f(x, y) = λ∇g(x, y)
  x + y = 1

  (∂f/∂x, ∂f/∂y) = λ(∂g/∂x, ∂g/∂y) → (2x, 2y) = (λ, λ)
  Combined with x + y = 1:  x = λ/2, y = λ/2, x + y = 1  →  x = 1/2, y = 1/2, λ = 1

  → f_min(1/2, 1/2) = 1/2
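The same system (∇f = λ∇g plus the constraint) can be solved symbolically; a small sketch assuming SymPy is available:

```python
import sympy as sp

x, y, lam = sp.symbols("x y lambda")
f = x**2 + y**2
g = x + y - 1  # constraint written as g(x, y) = 0

# Stationarity conditions grad f = lambda * grad g, plus the constraint itself.
eqs = [sp.diff(f, x) - lam * sp.diff(g, x),
       sp.diff(f, y) - lam * sp.diff(g, y),
       g]
sol = sp.solve(eqs, [x, y, lam], dict=True)[0]
print(sol, f.subs(sol))  # {x: 1/2, y: 1/2, lambda: 1}, f_min = 1/2
```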

Note: each constraint g_i(x, y) results in its own multiplier λ_i.


Question: how many constraints does SVM have?
Optimal solution
Problem:
  argmin_w (1/2)||w||²
  Constraint: y_i(w·x_i + b) − 1 ≥ 0

Solution:
  w* = Σ_i λ_i y_i x_i
  b* = 1/y_k − w*·x_k   (for any support vector x_k)

Prediction:
  y_new = sign(w*·x_new + b*)

• Each point i has a multiplier λ_i associated with it.
• For points with λ_i > 0, the constraint is tight: y_i(w·x_i + b) − 1 = 0, where y_i = ±1, which means w·x_i + b = ±1 = y_i. These points lie on the margins and are called support vectors.
• For points with λ_i = 0, the constraint holds strictly: y_i(w·x_i + b) − 1 > 0, where y_i = ±1, which means w·x_i + b > +1 or w·x_i + b < −1. These points are not on the margins.

[Figure: training points with their multipliers annotated: λ_1 = 0.2, λ_2 = 0.6, λ_3 = 0.8, λ_8 = 0.3 for support vectors on the margins, and λ_4 = λ_5 = λ_6 = λ_7 = 0 for the remaining points.]
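In scikit-learn's SVC (one common implementation), these quantities can be inspected after training. A minimal sketch with a made-up toy dataset: dual_coef_ stores λ_i·y_i for the support vectors only, so w* can be reconstructed exactly as in the formula above.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data.
X = np.array([[0, 0], [1, 0], [3, 3], [4, 3]], dtype=float)
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C approximates a hard margin

# dual_coef_ holds lambda_i * y_i for the support vectors only;
# all other training points have lambda_i = 0 and do not appear here.
print("support vectors:\n", clf.support_vectors_)
print("lambda_i * y_i:", clf.dual_coef_)

# w* = sum_i lambda_i y_i x_i, reconstructed from the dual coefficients.
w_star = clf.dual_coef_ @ clf.support_vectors_
print("w* =", w_star, "vs clf.coef_ =", clf.coef_)

# Prediction: y_new = sign(w* . x_new + b*)
x_new = np.array([2.0, 1.0])
print(np.sign((w_star @ x_new)[0] + clf.intercept_[0]))
```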
Hard Margin SVM
[Figure: a hard margin, with the boundary f(x) = 0, margin lines f(x) = +1 and f(x) = −1, support vectors x+ and x−, the weight vector w, and Region +1 / Region −1.]

Primal form of SVM (hard margin):
  argmin_w f(w) = (1/2)||w||²
  s.t. g_i(w, b) = y_i(w·x_i + b) − 1 ≥ 0

Hard margin SVM works only when the data are completely linearly separable, without any errors (noise or outliers).
→ The objective is to maximize the margin.
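scikit-learn has no explicit hard-margin mode, but a very large C approximates one; the sketch below (with made-up separable data) fits such a model and verifies the primal constraint y_i(w·x_i + b) ≥ 1:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 0], [3, 3], [4, 3]], dtype=float)
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print(y * (X @ w + b))   # each value should be >= 1 (up to numerical tolerance)
```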
Soft Margin SVM
[Figure: a soft margin, with the boundary f(x) = 0, margin lines f(x) = +1 and f(x) = −1, support vectors x+ and x−, the weight vector w, and Region +1 / Region −1.]

Primal form of SVM with soft margin:
  argmin_w f(w) = (1/2)||w||² + C Σ_i ξ_i
  s.t. g_i(w, b) = y_i(w·x_i + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0

Soft margin SVM is used when the dataset is noisy (with some overlap between positive and negative samples), so there will be some error in classifying the samples with the hyperplane.
→ The objective is to maximize the margin while minimizing the per-sample classification errors.
Soft Margin SVM
[Figure: the soft margin with slack variables annotated: ξ_i = 0 for points on the correct side of the margin, ξ_i > 0 for points inside the margin, and ξ_i > 1 for misclassified points.]

Primal form of SVM with soft margin:
  argmin_w f(w) = (1/2)||w||² + C Σ_i ξ_i
  s.t. g_i(w, b) = y_i(w·x_i + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0

Soft margin SVM is used when the dataset is noisy (with some overlap between positive and negative samples), so there will be some error in classifying the samples with the hyperplane.
→ The objective is to maximize the margin while minimizing the per-sample classification errors (this acts like regularization, penalizing misclassified points).
The hyper-parameter C controls how much we want to avoid misclassifying each training example.
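The role of C can be seen empirically; a minimal sketch with synthetic overlapping data (assumed for illustration), showing how the margin and the number of support vectors change with C:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, overlapping two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(+1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C}: margin={margin:.3f}, support vectors={len(clf.support_vectors_)}")
```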
Hard Margin vs. Soft Margin

[Figure: hard margin (left) vs. soft margin (right) decision boundaries.]

Primal form of SVM with soft margin:
  argmin_w f(w) = (1/2)||w||² + C Σ_i ξ_i
  s.t. g_i(w, b) = y_i(w·x_i + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0

• What happens when C tends to +∞?
• What happens when C tends to 0?
Hard Margin vs. Soft Margin

[Figure: hard margin (left) vs. soft margin (right) decision boundaries.]

If C is large, we want very few misclassifications.
If C is small, we allow more misclassifications.

SVM with soft margin:
  argmin_w f(w) = (1/2)||w||² + C Σ_i ξ_i
  s.t. g_i(w, b) = y_i(w·x_i + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0
Hard Margin vs. Soft Margin

[Figure: hard margin (left) vs. soft margin (right) decision boundaries.]

Question: how to find C?

SVM with soft margin:
  argmin_w f(w) = (1/2)||w||² + C Σ_i ξ_i
  s.t. g_i(w, b) = y_i(w·x_i + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0
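One common way to choose C is cross-validation; a minimal sketch with scikit-learn's GridSearchCV on synthetic data (assumed for illustration):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(+1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# 5-fold cross-validation over a grid of C values.
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```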
SVM Kernel
Polynomial Kernel

Project the 2-D feature (x1, x2) into the 3-D feature (x1², x2², √2·x1·x2).

[Figure: the data after the 3-D feature projection.]
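A small check (a sketch, assuming the map is φ(x) = (x1², x2², √2·x1·x2)) that this explicit projection reproduces the degree-2 polynomial kernel (x·z)²:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)), poly_kernel(x, z))  # both equal (x . z)^2 = 1.0
```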
Pros and cons of SVM

Pros:
• Works well by finding an optimal decision boundary.
• Effective in high-dimensional spaces.
• Uses only a subset of the training samples (the support vectors) to make prediction decisions → memory efficient.

Cons:
• Slow training time when the dataset is large.
• Does not directly provide classification probability estimates.
Summary
• Introduction to SVM.
• SVM as a classifier.
• Optimization problem.
• Soft margin vs. Hard margin.
• SVM Kernel.

Q&A

Thank you
