
Received: 00 Month 0000 Revised: 00 Month 0000 Accepted: 00 Month 0000

DOI: xxx/xxxx

CHAPTER

Support Vector Machine

Chang, Chih-Hao

Department of Statistics, National Chengchi University, Taipei, Taiwan

Summary
Support Vector Machines (SVMs) are a powerful class of supervised learning algorithms widely used for classification and regression tasks. SVMs aim to find the optimal hyperplane that separates classes in the feature space by maximizing the margin between the classes. The Hard-Margin SVM, a foundational concept in SVM, seeks to find a hyperplane that perfectly separates the classes, assuming the data is linearly separable. However, this approach may not be suitable for datasets with noise or overlapping classes. To address this limitation, the Soft-Margin SVM introduces a penalty parameter ($C$) to allow for misclassifications, leading to a more flexible decision boundary. Additionally, SVMs can be extended to handle nonlinear decision boundaries through the use of Kernel SVM, which maps the input features into a higher-dimensional space. Furthermore, Weighted SVMs provide a mechanism to assign different weights to samples, allowing for customized handling of imbalanced datasets. By leveraging the dual formulation of the optimization problem, SVMs efficiently handle high-dimensional datasets and exhibit robust generalization performance. Overall, SVMs offer a versatile framework for classification tasks, capable of handling various data distributions and achieving high accuracy in both linear and nonlinear scenarios.

KEYWORDS:
Binary classification, dual SVM, hard-margin SVM, kernel SVM, soft-margin SVM, weighted SVM


1 LINEAR SUPPORT VECTOR MACHINE

1. Support vector machine (SVM) is an algorithm for binary classification.

2. SVM focuses on maximizing the margin between classes, making it robust and effective for high-dimensional data and handling outliers.


3. A linear classifier should lie as far away from the data points as possible to avoid disturbance from noise in the data; more noise tolerance means more robustness to overfitting.

(a) Data set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$: $y_i \in \{+1, -1\}$ and $\mathbf{x}_i \in \mathbb{R}^d$.
(b) As shown in the figure, the dataset is linearly separable; how many linear classifiers can you draw over the figure?
(c) Consider a scenario where you are identified with a label of $+1$, and your height and weight represent two covariates.
(d) Suppose there exists a linear classifier positioned in close proximity to you, accurately assigning you to the correct category.
(e) In this scenario, the classifier's decision may be highly contingent on your weight, rendering it particularly sensitive to fluctuations in this covariate.
(f) Consequently, even minor changes in your weight could lead to vastly different classification outcomes by this linear classifier.
(g) To mitigate such sensitivity, it would be prudent to draw a decision boundary that lies significantly distant from the data points represented by the black triangle and circle in the figure.
(h) This approach helps ensure that classification decisions are less influenced by individual data points and more indicative of the overall trend in the data.

[Figure: SVM classification plot of the linearly separable toy data, axes x1 versus x2.]

4. Key concepts of SVMs compared with logistic regression and the perceptron learning algorithm are:

(a) Objective:
i. SVM: SVM aims to find the hyperplane that best separates the classes while maximizing the margin between
the closest data points (support vectors) from each class. It seeks to find the decision boundary with the largest
margin, making it robust to outliers and improving generalization.
ii. Logistic Regression: Logistic regression models the probability that a given input belongs to a particular class.
It estimates the parameters of the logistic function to fit the data, optimizing the log-likelihood or minimizing
the cross-entropy error.
iii. Perceptron Learning Algorithm (PLA): The PLA learns a linear classifier by iteratively updating the weights
to minimize the misclassification error. It seeks to find a decision boundary that correctly classifies all training
examples.
(b) Decision Boundary:
i. SVM: The decision boundary in SVM is determined by the support vectors, which are the data points closest
to the decision boundary. The margin between the support vectors from each class is maximized.
ii. Logistic Regression: Logistic regression produces a decision boundary based on the logistic function, which
models the probability of class membership. The decision boundary is typically a linear or nonlinear function
of the input features.
iii. PLA: The decision boundary in PLA is determined by the weights assigned to each input feature. It adjusts the
weights to minimize the misclassification error, leading to a decision boundary that separates the classes.
(c) Loss Function:
i. SVM: SVM uses a hinge loss function, which penalizes misclassifications and encourages maximizing the
margin between classes.

ii. Logistic Regression: Logistic regression uses the logistic loss function, also known as cross-entropy loss, to measure the difference between predicted probabilities and true labels.
iii. PLA: The PLA minimizes the misclassification error, updating the weights to correct misclassifications.
(d) Handling Nonlinear Data:
i. SVM: SVM can handle nonlinear data by using kernel functions, which map the input features into a higher-
dimensional space where a linear decision boundary can be found.
ii. Logistic Regression: Logistic regression can be extended to handle nonlinear data by including polynomial or
interaction terms, or by using techniques such as feature engineering.
iii. PLA: The PLA is inherently linear and may struggle with nonlinear data unless feature transformations are
applied.

5. Hence, we try to fit a linear classifier with the maximal margin that separates all the data points successfully. In other words, for a linear classifier $h(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b = 0$:
$$
\begin{aligned}
& \max \ \mathrm{margin}(h), \\
& \text{subject to } h \text{ classifies every } (\mathbf{x}_i, y_i) \text{ correctly}, \\
& \mathrm{margin}(h) = \min_{i=1,\dots,N} \mathrm{distance}(\mathbf{x}_i, h).
\end{aligned} \tag{1}
$$

6. We say the classifier is correct if $y_i = \mathrm{sign}(\mathbf{w}^\top \mathbf{x}_i + b)$; $i = 1, \dots, N$.

7. We can then rewrite (1) as:
$$
\begin{aligned}
& \max \ \mathrm{margin}(h), \\
& \text{subject to } y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 0, \\
& \mathrm{margin}(h) = \min_{i=1,\dots,N} \mathrm{distance}(\mathbf{x}_i, b, \mathbf{w}).
\end{aligned} \tag{2}
$$

8. We then have the distance from $\mathbf{x}'$ to the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$:
$$
\mathrm{distance}(\mathbf{x}', b, \mathbf{w}) = \frac{1}{\|\mathbf{w}\|} \left| \mathbf{w}^\top \mathbf{x}' + b \right|.
$$
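As a quick numerical illustration of this formula, here is a one-line R computation of the point-to-hyperplane distance; the values of w, b, and x0 below are made up for the example, not taken from the text.

```r
# Distance from a point x0 to the hyperplane w'x + b = 0: |w'x0 + b| / ||w||.
# (Illustrative values, not from the text.)
w  <- c(2, -1)
b  <- 0.5
x0 <- c(1, 3)
abs(sum(w * x0) + b) / sqrt(sum(w^2))
```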
9. Note that $y_i \in \{-1, +1\}$, and since the hyperplane is required to be correct, $y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 0$, we can rewrite the distance to the separating hyperplane as:
$$
\mathrm{distance}(\mathbf{x}_i, b, \mathbf{w}) = \frac{1}{\|\mathbf{w}\|} \, y_i (\mathbf{w}^\top \mathbf{x}_i + b).
$$
10. We then parameterize the problem in (2):
$$
\begin{aligned}
& \max \ \mathrm{margin}(h), \\
& \text{subject to } y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 0; \ i = 1, \dots, N, \\
& \mathrm{margin}(h) = \min_{i=1,\dots,N} \frac{1}{\|\mathbf{w}\|} \, y_i(\mathbf{w}^\top \mathbf{x}_i + b).
\end{aligned} \tag{3}
$$

11. Note that the two linear classifiers $\mathbf{w}^\top \mathbf{x} + b = 0$ and $3\mathbf{w}^\top \mathbf{x} + 3b = 0$ are the same; the scaling factor 3 does not matter.

12. Hence, we can impose a special scaling on the minimal distance:
$$
\min_{i=1,\dots,N} y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1. \tag{4}
$$

13. Hence by (4), we can rewrite (3) as:
$$
\begin{aligned}
& \max_{b, \mathbf{w}} \ \frac{1}{\|\mathbf{w}\|}, \\
& \text{subject to } \min_{i=1,\dots,N} y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1.
\end{aligned} \tag{5}
$$

14. Why can we scale the minimal distance to an arbitrary value?

(a) Suppose $\min_{i=1,\dots,N} y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 2$.
(b) The margin is then $2/\|\mathbf{w}\|$.
(c) Maximizing $2/\|\mathbf{w}\|$ is the same as maximizing $1/\|\mathbf{w}\|$.
(d) The optimization problem is
$$
\max_{b, \mathbf{w}} \ \frac{1}{\|\mathbf{w}\|}, \quad \text{subject to } \min_{i=1,\dots,N} y_i\!\left(\left(\tfrac{\mathbf{w}}{2}\right)^{\!\top} \mathbf{x}_i + \tfrac{b}{2}\right) = 1,
$$
and compare it with the previous one:
$$
\max_{b, \mathbf{w}} \ \frac{1}{\|\mathbf{w}\|}, \quad \text{subject to } \min_{i=1,\dots,N} y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1.
$$
(e) Obviously, $\tilde{\mathbf{w}} = \tfrac{1}{2}\mathbf{w} \ \Rightarrow \ 1/\|\tilde{\mathbf{w}}\| = 2/\|\mathbf{w}\|$.

15. We now relax the constraint $\min_{i=1,\dots,N} y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$ to $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1$ for all $i$, which yields another optimization problem:
$$
\begin{aligned}
& \min_{b, \mathbf{w}} \ \frac{1}{2} \mathbf{w}^\top \mathbf{w}, \\
& \text{subject to } y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1; \ i = 1, \dots, N.
\end{aligned} \tag{6}
$$

16. Why can we relax the constraints?

(a) If the optimal $(b, \mathbf{w}) \notin \{(b, \mathbf{w}) : \min_{i=1,\dots,N} y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1\}$, then say $\min_{i=1,\dots,N} y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1.126$.
(b) As we previously mentioned, there is another equivalent pair: $(b, \mathbf{w}) = (b/1.126, \mathbf{w}/1.126)$.
(c) The resulting margin is
$$
\frac{1}{\|\mathbf{w}/1.126\|} = \frac{1.126}{\|\mathbf{w}\|} > \frac{1}{\|\mathbf{w}\|},
$$
which contradicts the optimality of $(b, \mathbf{w})$.
(d) Hence the optimal $(b, \mathbf{w}) \in \{(b, \mathbf{w}) : \min_{i=1,\dots,N} y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1\}$.

17. Quadratic programming (QP):

(a) QP is a mathematical optimization problem that deals with finding the minimum (or maximum) of a quadratic
function subject to linear constraints.
(b) QP is widely used in various fields such as finance (portfolio optimization), engineering (control theory, structural
optimization), operations research, and many other areas where optimization problems arise.
(c) It's an essential tool for solving complex optimization problems efficiently.

18. Use QP to find $(b, \mathbf{w})$.

(a) The quadratic programming problem takes the form:
$$
\begin{aligned}
& \min_{\mathbf{u}} \ \frac{1}{2} \mathbf{u}^\top \mathbf{Q} \mathbf{u} + \mathbf{p}^\top \mathbf{u}, \\
& \text{subject to } \mathbf{a}_m^\top \mathbf{u} \ge c_m; \ m = 1, \dots, M.
\end{aligned} \tag{7}
$$
(b) The solution is obtained from the process of QP.
(c) From (6) and (7), by letting
$$
\begin{aligned}
\mathbf{u} &= (b, \mathbf{w}^\top)^\top; \\
\mathbf{Q} &= \mathrm{diag}(0, \mathbf{I}_d); \\
\mathbf{p} &= \mathbf{0}; \\
M &= N; \\
c_i &= 1; \ i = 1, \dots, N; \\
\mathbf{a}_i &= y_i (1, \mathbf{x}_i^\top)^\top; \ i = 1, \dots, N.
\end{aligned}
$$
(d) This is a linear hard-margin SVM.
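To make the setup in item 18 concrete, here is a minimal R sketch that feeds (6) into a generic QP solver via the mapping in (7). The quadprog package, the toy data (X, y), and the tiny ridge added to the first diagonal entry of Q (solve.QP requires a strictly positive-definite matrix) are assumptions of this sketch, not part of the text.

```r
# Hard-margin primal SVM (6) solved as the QP (7) with quadprog::solve.QP.
# Assumes a linearly separable toy dataset.
library(quadprog)

set.seed(1)
N <- 40; d <- 2
X <- rbind(matrix(rnorm(N / 2 * d, mean =  3), ncol = d),
           matrix(rnorm(N / 2 * d, mean = -3), ncol = d))
y <- rep(c(+1, -1), each = N / 2)

# u = (b, w); Q = diag(0, I_d) (a 1e-8 ridge keeps it positive definite);
# p = 0; constraints a_i' u >= 1 with a_i = y_i * (1, x_i').
Qmat <- diag(c(1e-8, rep(1, d)))
pvec <- rep(0, d + 1)
Amat <- t(cbind(1, X) * y)        # (d+1) x N; column i equals y_i * (1, x_i)
cvec <- rep(1, N)

sol <- solve.QP(Dmat = Qmat, dvec = pvec, Amat = Amat, bvec = cvec, meq = 0)
b <- sol$solution[1]
w <- sol$solution[-1]
```

The active constraints reported by the solver (sol$iact) correspond to the points sitting on the margin, i.e., the support vectors.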

19. Soft-margin and hard-margin SVMs are two variants of SVMs that differ in how they handle datasets with respect to linear separability and classification errors.

(a) Hard-Margin SVM:


i. Objective: The objective of a hard-margin SVM is to find the maximum-margin hyperplane that perfectly
separates the instances of different classes in the feature space without allowing any misclassifications.
ii. Linear Separability Requirement: Hard-margin SVM requires that the dataset be linearly separable, meaning
that there exists at least one hyperplane that can completely separate the instances of different classes without
any misclassifications.
iii. No Misclassification Tolerance: Hard-margin SVM does not tolerate any misclassifications in the training data.
It enforces strict constraints that all data points must be correctly classified and lie on the correct side of the
decision boundary.
iv. No Regularization Parameter: Hard-margin SVM does not incorporate a regularization parameter. Instead, it
focuses solely on maximizing the margin while satisfying the linear separability constraints.
v. Sensitivity to Outliers: Hard-margin SVM is highly sensitive to outliers or noise in the data. Even a single
outlier can significantly affect the position of the decision boundary and the margin.
(b) Soft-Margin SVM:
i. Objective: The objective of a soft-margin SVM is to find the hyperplane that achieves a balance between
maximizing the margin and minimizing the classification errors by allowing some margin violations or
misclassifications.
ii. Relaxation of Linear Separability Requirement: Soft-margin SVM relaxes the requirement of strict lin-
ear separability. It allows for some misclassifications or margin violations to account for datasets that are not
perfectly separable by a hyperplane.
iii. Introduction of Regularization Parameter: Soft-margin SVM introduces a regularization parameter $C$ that controls the trade-off between maximizing the margin and minimizing the classification errors. A smaller value of $C$ allows for more misclassifications and results in a wider margin, while a larger value of $C$ penalizes misclassifications more heavily, resulting in a narrower margin.
iv. Robustness to Outliers: Soft-margin SVM is more robust to outliers and noisy data compared to hard-margin
SVM. It can tolerate a certain degree of misclassifications, which reduces the impact of outliers on the decision
boundary.
v. Generalization to Non-Separable Data: Soft-margin SVM can handle datasets that are not linearly separable
by finding a hyperplane that minimizes the classification errors while still maximizing the margin to the extent
possible.
(c) In summary, the main differences between soft-margin and hard-margin SVMs lie in their treatment of linear
separability, tolerance for misclassifications, incorporation of a regularization parameter, and robustness to outliers.
(d) Soft-margin SVM provides more flexibility and robustness by allowing for some misclassifications, making it
suitable for real-world datasets that may contain noise or overlapping classes.

20. We will introduce the soft-margin SVM later in this chapter.



2 DUAL SUPPORT VECTOR MACHINE

1. For optimization problems, duality refers to a relationship between two closely related optimization problems, known as
the primal problem and the dual problem.

2. The primal problem is the original optimization problem that we want to solve, while the dual problem is derived from
the primal problem and provides an alternative perspective on the optimization task.

3. In the case of SVMs, the dual SVM formulation refers to the dual problem that arises from the standard primal SVM
optimization problem.

(a) The standard formulation of SVM involves finding a hyperplane that separates the data into two classes while
maximizing the margin between the closest data points (support vectors) from each class.
(b) This problem can be formulated as a convex optimization problem in the primal space, where the objective is to
minimize the norm of the weight vector subject to the constraints imposed by the data points.
(c) The dual SVM formulation, on the other hand, reformulates the optimization problem in terms of the Lagrange
multipliers associated with the constraints.
(d) This formulation allows for the solution of the SVM problem in a higher-dimensional space (the dual space), where
the optimization problem may be easier to solve.
(e) The key advantage of the dual SVM formulation is that it often leads to more efficient algorithms for solving the
SVM problem, particularly when the number of data points is large or when the data is not linearly separable.
(f) Additionally, the dual formulation allows for the use of kernel methods, which enable SVMs to handle nonlinear
data by implicitly mapping the data into a higher-dimensional space.
(g) In summary, the dual SVM formulation provides an alternative approach to solving the SVM optimization problem,
offering advantages in terms of efficiency and flexibility, especially in the presence of large or nonlinear datasets.

4. The duality between the primal and dual problems has several important implications:

(a) Relationship between solutions: The solutions to the primal and dual problems are closely related. In the case of
SVMs, the optimal solution to the primal problem can be recovered from the optimal solution to the dual problem,
and vice versa.
(b) Efficient algorithms: In some cases, solving the dual problem may be computationally more efficient than solving
the primal problem. This is particularly true when the number of data points is large or when the data is not linearly
separable, as in these cases, the dual problem may have a simpler structure.
(c) Kernel methods: The dual formulation of SVMs enables the use of kernel methods, which allow SVMs to handle
nonlinear data by implicitly mapping the data into a higher-dimensional space. Kernel methods are typically more
naturally expressed in terms of the dual problem.

5. The linear SVM:
$$
\begin{aligned}
& \min_{b, \mathbf{w}} \ \frac{1}{2} \mathbf{w}^\top \mathbf{w}, \\
& \text{subject to } y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1; \ i = 1, \dots, N.
\end{aligned}
$$

6. Solution from the method of Lagrange multipliers:
$$
L(b, \mathbf{w}, \boldsymbol{\alpha}) = \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^{N} \alpha_i \left(1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\right),
$$
where
$$
\mathrm{SVM} = \min_{b, \mathbf{w}} \left\{ \max_{\alpha_i \ge 0} L(b, \mathbf{w}, \boldsymbol{\alpha}) \right\}.
$$

7. Why is
$$
\mathrm{SVM} = \min_{b, \mathbf{w}} \left\{ \max_{\alpha_i \ge 0} \left( \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^{N} \alpha_i \left(1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\right) \right) \right\}?
$$

(a) Can $y_i(\mathbf{w}^\top \mathbf{x}_i + b) < 1$ happen?
(b) If yes, then $1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 0$, and the inner maximization sends $\alpha_i \to \infty$.
(c) This cannot be a solution of the SVM.
(d) If $1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b) < 0$, then $\alpha_i = 0$.
(e) Hence $\alpha_i > 0$ can only happen when $1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 0$ (complementary slackness).

8. Complementary slackness:

(a) It says that if a dual variable (i.e., $\alpha_i$) is greater than zero, then the corresponding primal constraint (i.e., $1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b) \le 0$) must hold with equality (be tight).
(b) It also says that if the primal constraint is slack, then the corresponding dual variable must be zero.
(c) This property is a fundamental aspect of duality in optimization and plays a key role in understanding the relationships between primal and dual solutions in optimization problems.

9. In conclusion,
$$
\mathrm{SVM} = \min_{b, \mathbf{w}} \left\{ \max_{\text{all } \alpha_i \ge 0} L(b, \mathbf{w}, \boldsymbol{\alpha}) \right\}.
$$
(a) For any violating $(b, \mathbf{w})$ such that $1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 0$ for some $i$, we have
$$
\max_{\alpha_i \ge 0} \left( \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^{N} \alpha_i \left(1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\right) \right) \to \infty,
$$
since $\alpha_i = \infty$ shall be chosen in this case.
(b) For any feasible $(b, \mathbf{w})$ such that $1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b) \le 0$ for all $i$, we have
$$
\max_{\alpha_i \ge 0} \left( \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^{N} \alpha_i \left(1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\right) \right) = \frac{1}{2} \mathbf{w}^\top \mathbf{w},
$$
since $\alpha_i \left(1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\right) = 0$ for all $i$ shall be chosen in this case.
(c) We are now exactly minimizing $\frac{1}{2} \mathbf{w}^\top \mathbf{w}$ under the constraints $1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b) \le 0$ for all $i$.

10. Lagrange Dual Problem.

(a) For any fixed $\boldsymbol{\alpha}'$ with all $\alpha_i' \ge 0$,
$$
\min_{b, \mathbf{w}} \left\{ \max_{\alpha_i \ge 0} L(b, \mathbf{w}, \boldsymbol{\alpha}) \right\} \ge \min_{b, \mathbf{w}} L(b, \mathbf{w}, \boldsymbol{\alpha}').
$$
(b) Since $\boldsymbol{\alpha}'$ is arbitrary, we have
$$
\min_{b, \mathbf{w}} \left\{ \max_{\alpha_i \ge 0} L(b, \mathbf{w}, \boldsymbol{\alpha}) \right\} \ge \max_{\alpha_i \ge 0} \left\{ \min_{b, \mathbf{w}} L(b, \mathbf{w}, \boldsymbol{\alpha}) \right\},
$$
which is called the Lagrange dual problem and is the outer maximization over $\boldsymbol{\alpha}$ of a lower bound on the original problem.

11. For
$$
L(b, \mathbf{w}, \boldsymbol{\alpha}) = \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^{N} \alpha_i \left(1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\right),
$$
the Lagrange dual problem (weak duality) is:
$$
\mathrm{SVM} = \min_{b, \mathbf{w}} \left\{ \max_{\alpha_i \ge 0} L(b, \mathbf{w}, \boldsymbol{\alpha}) \right\} \ge \max_{\alpha_i \ge 0} \left\{ \min_{b, \mathbf{w}} L(b, \mathbf{w}, \boldsymbol{\alpha}) \right\}.
$$

12. Equality holds under strong duality, which requires:

(a) a convex problem;
(b) existence of a solution (feasibility);
(c) linear constraints.

13. If strong duality holds, there exists a primal-dual optimal solution $(b, \mathbf{w}, \boldsymbol{\alpha})$ for both sides.

14. Simplify the dual problem: under strong duality,
$$
\begin{aligned}
\mathrm{SVM} &= \max_{\alpha_i \ge 0} \ \min_{b, \mathbf{w}} \left( \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^{N} \alpha_i \left(1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\right) \right) \\
&= \max_{\alpha_i \ge 0, \ \sum_i y_i \alpha_i = 0} \ \min_{\mathbf{w}} \left( \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^{N} \alpha_i \left(1 - y_i \mathbf{w}^\top \mathbf{x}_i\right) \right),
\end{aligned}
$$
since
$$
\frac{\partial L(b, \mathbf{w}, \boldsymbol{\alpha})}{\partial b} = -\sum_{i=1}^{N} \alpha_i y_i = 0.
$$

15. Further simplify the dual problem: under strong duality and by
$$
\frac{\partial L(b, \mathbf{w}, \boldsymbol{\alpha})}{\partial w_j} = w_j - \sum_{i=1}^{N} \alpha_i y_i x_{i,j} = 0
\quad \Rightarrow \quad
\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i,
$$
and hence
$$
\begin{aligned}
\mathrm{SVM} &= \max_{\alpha_i \ge 0, \ \sum_i y_i \alpha_i = 0} \ \min_{\mathbf{w}} \left( \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^{N} \alpha_i \left(1 - y_i \mathbf{x}_i^\top \mathbf{w}\right) \right) \\
&= \max_{\alpha_i \ge 0, \ \sum_i y_i \alpha_i = 0, \ \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i} \left( \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^{N} \alpha_i - \mathbf{w}^\top \mathbf{w} \right) \\
&= \max_{\alpha_i \ge 0, \ \sum_i y_i \alpha_i = 0, \ \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i} \left( -\frac{1}{2} \Big\| \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i \Big\|^2 + \sum_{i=1}^{N} \alpha_i \right).
\end{aligned}
$$

16. This suggests that the conditions for a primal-dual optimal $(b, \mathbf{w}, \boldsymbol{\alpha})$ are:

(a) Primal feasible:
$$
y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1;
$$
(b) Dual feasible:
$$
\alpha_i \ge 0;
$$
(c) Dual-inner optimal:
$$
\sum_i y_i \alpha_i = 0, \quad \text{and} \quad \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i;
$$
(d) Primal-inner optimal (complementary slackness):
$$
\alpha_i \left(1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\right) = 0.
$$
(e) The above conditions are called the Karush-Kuhn-Tucker (KKT) conditions and are necessary for optimality.
(f) We will use the KKT conditions to solve for $(b, \mathbf{w})$ from the optimal $\boldsymbol{\alpha}$.

17. The dual problem:
$$
\max_{\alpha_i \ge 0, \ \sum_i y_i \alpha_i = 0, \ \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i} \left( -\frac{1}{2} \Big\| \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i \Big\|^2 + \sum_{i=1}^{N} \alpha_i \right).
$$

18. The equivalent standard dual SVM problem:
$$
\begin{aligned}
& \min_{\boldsymbol{\alpha}} \ \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j - \sum_{i=1}^{N} \alpha_i, \\
& \text{subject to } \sum_{i=1}^{N} y_i \alpha_i = 0; \quad \alpha_i \ge 0 \ \text{ for } i = 1, \dots, N,
\end{aligned} \tag{8}
$$
which has $N$ variables and $N + 1$ constraints, and is a convex quadratic programming problem.

19. Again we can apply the QP process given in (7):
$$
\begin{aligned}
& \min_{\mathbf{u}} \ \frac{1}{2} \mathbf{u}^\top \mathbf{Q} \mathbf{u} + \mathbf{p}^\top \mathbf{u}, \\
& \text{subject to } \mathbf{a}_m^\top \mathbf{u} \ge c_m; \ m = 1, \dots, M,
\end{aligned}
$$
to solve (8) by setting:
$$
\begin{aligned}
\mathbf{u} &= \boldsymbol{\alpha}; \\
\mathbf{Q} &= \{q_{i,j}\}; \quad q_{i,j} = y_i y_j \mathbf{x}_i^\top \mathbf{x}_j; \\
\mathbf{p} &= -\mathbf{1}_N; \\
\mathbf{a}_{0,1} &= \mathbf{y}, \ \text{with } c_{0,1} = 0; \\
\mathbf{a}_{0,2} &= -\mathbf{y}, \ \text{with } c_{0,2} = 0; \\
\mathbf{a}_i &= (0, \dots, 0, 1, 0, \dots, 0)^\top \ \text{(the 1 in the $i$-th position)}, \ \text{with } c_i = 0,
\end{aligned}
$$
where the two constraints $\mathbf{a}_{0,1}$ and $\mathbf{a}_{0,2}$ together encode the equality $\sum_i y_i \alpha_i = 0$.

20. By the KKT conditions, we can obtain $\mathbf{w}$ easily. For the optimal $b$: for any $\alpha_i > 0$, $b = y_i - \mathbf{w}^\top \mathbf{x}_i$.

(a) By the KKT conditions, we have $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$, and the complementary slackness condition $\alpha_i \left(1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\right) = 0$; $i = 1, \dots, N$, gives
$$
b = y_i - \mathbf{w}^\top \mathbf{x}_i, \quad \text{for any } \alpha_i > 0.
$$
(b) It is also interesting to note that when $\alpha_i > 0$, $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$: these are points on the margin boundary and are the support vectors.
(c) Hence, by letting $S = \{s \in \{1, \dots, N\} : \alpha_s > 0\}$, we can obtain
$$
b = \frac{1}{|S|} \sum_{s \in S} \left( y_s - \mathbf{w}^\top \mathbf{x}_s \right), \tag{9}
$$
where $S$ is the set of support vectors: $y_s(\mathbf{w}^\top \mathbf{x}_s + b) = 1$; $s \in S$.
(d) Or alternatively, by letting $S^+ = \{s \in S : y_s = +1\}$ and $S^- = \{s \in S : y_s = -1\}$,
$$
b = -\frac{1}{2} \left( \min_{s \in S^+} \mathbf{w}^\top \mathbf{x}_s + \max_{s \in S^-} \mathbf{w}^\top \mathbf{x}_s \right). \tag{10}
$$

21. Note that the aforementioned two methods for obtaining $b$ are based on the hard-margin SVM:

(a) Both methods have their advantages and may perform differently depending on the dataset.
(b) Method (9) considers all support vectors, providing a more comprehensive estimate of $b$, while method (10) focuses only on the extreme support vectors, potentially providing a more robust estimate in the presence of outliers or in high-dimensional spaces.
(c) If computational resources are not a concern, it's often beneficial to experiment with both methods and choose the one that yields better performance on your validation data or through cross-validation.
(d) Additionally, some SVM libraries or frameworks may have built-in functions to estimate $b$ automatically, utilizing either of these methods or other techniques.
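To make the dual route concrete, here is a minimal R sketch that solves the dual problem (8) with quadprog and then recovers $(\mathbf{w}, b)$ via the KKT conditions and the averaging formula (9). It reuses the toy data (X, y, N) from the earlier primal sketch; the small ridge on the Q matrix and the 1e-6 threshold for declaring $\alpha_i > 0$ are numerical conveniences assumed here, not part of the text.

```r
# Dual hard-margin SVM (8) via quadprog, then w and b from the KKT conditions.
library(quadprog)

Qmat <- (y %o% y) * (X %*% t(X)) + diag(1e-8, N)  # q_ij = y_i y_j x_i' x_j (+ ridge)
dvec <- rep(1, N)                                 # solve.QP minimizes -d'a + (1/2) a'Qa
Amat <- cbind(y, diag(N))                         # first column: equality y' alpha = 0
bvec <- rep(0, N + 1)

sol   <- solve.QP(Dmat = Qmat, dvec = dvec, Amat = Amat, bvec = bvec, meq = 1)
alpha <- sol$solution

sv <- which(alpha > 1e-6)                         # support vectors: alpha_s > 0
w  <- colSums(alpha[sv] * y[sv] * X[sv, , drop = FALSE])
b  <- mean(y[sv] - X[sv, , drop = FALSE] %*% w)   # averaging method (9)
```

Method (10) would instead take minus half the sum of the smallest $\mathbf{w}^\top \mathbf{x}_s$ over positive-class support vectors and the largest over negative-class support vectors.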

22. Compare the dual hard-margin SVM to the primal hard-margin SVM.

(a) Primal linear SVM (suitable for small $d$):
$$
\begin{aligned}
& \min_{b, \mathbf{w}} \ \frac{1}{2} \mathbf{w}^\top \mathbf{w}, \\
& \text{subject to } y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1; \ i = 1, \dots, N.
\end{aligned}
$$
i. $d + 1$ variables;
ii. $N$ constraints.
(b) Dual SVM (suitable for small $N$):
$$
\begin{aligned}
& \min_{\boldsymbol{\alpha}} \ \frac{1}{2} \boldsymbol{\alpha}^\top \mathbf{Q} \boldsymbol{\alpha} - \mathbf{1}^\top \boldsymbol{\alpha}; \quad \mathbf{Q} = \{q_{i,j}\}, \ q_{i,j} = y_i y_j \mathbf{x}_i^\top \mathbf{x}_j, \\
& \text{subject to } \mathbf{y}^\top \boldsymbol{\alpha} = 0; \quad \alpha_i \ge 0 \ \text{ for } i = 1, \dots, N.
\end{aligned}
$$
i. $N$ variables;
ii. $N + 1$ constraints;
iii. Support vectors with $\alpha_i > 0$.

3 KERNEL SUPPORT VECTOR MACHINE

1. How about a nonlinear boundary?

[Figure: SVM classification plots (axes x1 versus x2) illustrating a dataset that calls for a nonlinear decision boundary.]

2. Recall a nonlinear decision boundary can be obtained via Transformation + Linear Models.

3. Let $\Phi: \mathbb{R}^d \to \mathbb{R}^{\tilde{d}}$ and $\mathbf{z}_i = \Phi(\mathbf{x}_i)$; $i = 1, \dots, N$.

4. For the dual SVM, we need to compute $\mathbf{z}_i^\top \mathbf{z}_j = \Phi(\mathbf{x}_i)^\top \Phi(\mathbf{x}_j)$ for $i, j = 1, \dots, N$, which we want to calculate faster than $O(\tilde{d})$.

5. The goal is to do SVM without dependence on $\tilde{d}$ (the kernel trick).

6. Consider a 2nd order polynomial transform:
$$
\Phi_2(\mathbf{x}) = (1, x_1, x_2, \dots, x_d, x_1^2, x_1 x_2, \dots, x_1 x_d, x_2 x_1, x_2^2, \dots, x_2 x_d, \dots, x_d x_1, x_d x_2, \dots, x_d^2),
$$
and hence we have
$$
\begin{aligned}
\Phi_2(\mathbf{x})^\top \Phi_2(\mathbf{x}')
&= 1 + \sum_{j=1}^{d} x_j x_j' + \sum_{j=1}^{d} \sum_{k=1}^{d} x_j x_k x_j' x_k' \\
&= 1 + \sum_{j=1}^{d} x_j x_j' + \sum_{j=1}^{d} x_j x_j' \sum_{k=1}^{d} x_k x_k' \\
&= 1 + \mathbf{x}^\top \mathbf{x}' + (\mathbf{x}^\top \mathbf{x}')^2,
\end{aligned}
$$
which suggests that for the $\Phi_2$ transform, the inner product can be carefully done in $O(d)$ instead of $O(\tilde{d} = d^2)$.

7. We call this transform-and-inner-product of $\mathbf{x}$ the kernel function (here the 2nd order polynomial kernel function):
$$
K_{\Phi_2}(\mathbf{x}, \mathbf{x}') \equiv \Phi_2(\mathbf{x})^\top \Phi_2(\mathbf{x}'), \tag{11}
$$
where $K_{\Phi_2}(\mathbf{x}, \mathbf{x}') = 1 + (\mathbf{x}^\top \mathbf{x}') + (\mathbf{x}^\top \mathbf{x}')^2$.
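The identity (11) is easy to check numerically. The short R sketch below compares the explicit transform against the kernel formula; the helper phi2() and the two sample vectors are made up for the illustration.

```r
# Check: Phi_2(x)' Phi_2(x') equals 1 + <x, x'> + <x, x'>^2.
phi2 <- function(x) c(1, x, as.vector(outer(x, x)))  # (1, x_j, all products x_j x_k)

x1 <- c(0.5, -1.2, 2.0)
x2 <- c(1.0,  0.3, -0.7)

sum(phi2(x1) * phi2(x2))             # explicit O(d^2) inner product
1 + sum(x1 * x2) + sum(x1 * x2)^2    # kernel formula, O(d) -- same value
```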

8. For a kernel function
$$
K_{\Phi}(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i)^\top \Phi(\mathbf{x}_j),
$$
the dual problem under the transform $\Phi$ is:
$$
\begin{aligned}
& \min_{\boldsymbol{\alpha}} \ \frac{1}{2} \boldsymbol{\alpha}^\top \mathbf{Q}_{\Phi} \boldsymbol{\alpha} - \mathbf{1}^\top \boldsymbol{\alpha}, \\
& \text{subject to } \mathbf{y}^\top \boldsymbol{\alpha} = 0; \quad \alpha_i \ge 0 \ \text{ for } i = 1, \dots, N,
\end{aligned}
$$
where $\mathbf{Q}_{\Phi} = \{q_{i,j}\}$ and $q_{i,j} = y_i y_j K_{\Phi}(\mathbf{x}_i, \mathbf{x}_j)$.

9. Polynomial kernel function:
$$
K_Q(\mathbf{x}, \mathbf{x}') = (\zeta + \gamma \mathbf{x}^\top \mathbf{x}')^Q; \quad \gamma > 0, \ \zeta \ge 0,
$$
which is commonly applied for the polynomial SVM.

10. Gaussian kernel function:
$$
K_{\gamma}(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2), \quad \gamma > 0,
$$
which is a kernel of an infinite-dimensional transform; for $d = 1$ (i.e., $\mathbf{x} = x$, taking $\gamma = 1$),
$$
\Phi(x) = \exp(-x^2) \cdot \left( 1, \sqrt{\tfrac{2}{1!}}\, x, \sqrt{\tfrac{2^2}{2!}}\, x^2, \dots \right).
$$
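For later use, the Gaussian kernel matrix over a dataset can be computed in a few lines of R; the sketch below assumes the toy matrix X from the earlier sketches and an illustrative value of gamma.

```r
# Gaussian (RBF) kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2).
gauss_kernel <- function(X, gamma) {
  D2 <- as.matrix(dist(X))^2     # squared Euclidean distances between rows of X
  exp(-gamma * D2)
}

K <- gauss_kernel(X, gamma = 0.5)
dim(K)                           # N x N Gram matrix
```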
11. Hence, for some kernel function $K(\cdot, \cdot)$, the quadratic coefficients $q_{i,j}$ can be rewritten as:
$$
q_{i,j} = y_i y_j \mathbf{z}_i^\top \mathbf{z}_j = y_i y_j K(\mathbf{x}_i, \mathbf{x}_j),
$$
and hence the optimal $b$ (for some $\alpha_s > 0$ such that $1 - y_s(\mathbf{w}^\top \mathbf{z}_s + b) = 0$) is
$$
b = y_s - \mathbf{w}^\top \mathbf{z}_s = y_s - \left( \sum_{i=1}^{N} \alpha_i y_i \mathbf{z}_i \right)^{\!\top} \mathbf{z}_s
= y_s - \sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_s),
$$
and
$$
\mathbf{w}^\top \Phi(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i y_i \mathbf{z}_i^\top \Phi(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}).
$$

12. For the kernel SVM classifier, $g_{\mathrm{svm}}(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^\top \Phi(\mathbf{x}) + b)$, we never need to know $\mathbf{w}$ explicitly; the hyperplane can remain a mystery, and this is called the kernel trick.

13. For the Gaussian SVM, the hypothesis is
$$
\begin{aligned}
g_{\mathrm{svm}}(\mathbf{x}) &= \mathrm{sign}\left( \sum_{\mathrm{SV}} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \right) \\
&= \mathrm{sign}\left( \sum_{\mathrm{SV}} \alpha_i y_i \exp\!\left(-\gamma \|\mathbf{x} - \mathbf{x}_i\|^2\right) + b \right),
\end{aligned}
$$
which is a linear combination of Gaussians centered at the support vectors $\mathbf{x}_i$; the Gaussian kernel is also called the Radial Basis Function (RBF) kernel.

14. The Gaussian SVM finds $\alpha_i$ to combine Gaussians centered at the $\mathbf{x}_i$ and achieves a large margin in an infinite-dimensional space.
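In practice one rarely codes the kernel QP by hand; a library call suffices. Below is a minimal sketch with the e1071 package (an assumption of this sketch; the chapter's own examples use WeightSVM, introduced in Section 6), fitting a Gaussian-kernel SVM to the toy data X, y from the earlier sketches. The cost argument is the soft-margin penalty C discussed in the next section.

```r
# Gaussian (RBF) kernel SVM with e1071.
library(e1071)

fit <- svm(x = X, y = factor(y), kernel = "radial",
           gamma = 0.5, cost = 10, scale = FALSE)

head(predict(fit, X))   # fitted class labels on the training data
fit$index               # row indices of the support vectors (alpha_i > 0)
```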

4 SOFT-MARGIN SVM

1. Outlier?

[Figure: six SVM classification plots, axes x1 versus x2, illustrating the effect of an outlier on the fitted decision boundary.]

2. If the dataset is linearly separable in a high-dimensional transformed feature space, then it is nonlinearly separable in the original feature space.

3. If we ignore the problem of overfitting, then theoretically the Gaussian kernel SVM can separate any dataset, nonlinearly in the original feature space.

4. The Gaussian kernel allows for a highly flexible decision boundary by implicitly mapping the input features into a
potentially infinite-dimensional space.

5. However, if we always insist on perfect separation, the SVM tends to overfit to noise.

6. Can the SVM automatically detect that some data points could be outliers, or that some data points can be ignored in exchange for a more parsimonious decision boundary?

7. We impose noise tolerance by allowing some data points to be incorrectly classified.

[Figure: six SVM classification plots, axes x1 versus x2, illustrating decision boundaries when some margin violations are tolerated.]

8. Hard-margin SVM:
$$
\begin{aligned}
& \min_{b, \mathbf{w}} \ \frac{1}{2} \mathbf{w}^\top \mathbf{w}, \\
& \text{subject to } y_i(\mathbf{w}^\top \mathbf{z}_i + b) \ge 1.
\end{aligned}
$$

9. We tolerate noise by allowing some data points to be incorrectly classified.

10. Record the margin violation of each point by $\xi_i$; $i = 1, \dots, N$.

11. Soft-margin SVM:
$$
\begin{aligned}
& \min_{b, \mathbf{w}, \boldsymbol{\xi}} \ \frac{1}{2} \mathbf{w}^\top \mathbf{w} + C \sum_{i=1}^{N} \xi_i, \\
& \text{subject to } y_i(\mathbf{w}^\top \mathbf{z}_i + b) \ge 1 - \xi_i, \\
& \phantom{\text{subject to }} \xi_i \ge 0; \ i = 1, \dots, N,
\end{aligned}
$$
where $C$ is the trade-off between a large margin and margin violation:

(a) Large $C$: less margin violation;
(b) Small $C$: larger margin.
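The role of C is easy to see empirically. The following sketch (again assuming the e1071 package and the toy X, y from earlier) fits a linear soft-margin SVM with a very small and a very large cost; smaller cost values typically admit more margin violations and hence keep more support vectors.

```r
# Effect of the soft-margin penalty C ("cost" in e1071).
library(e1071)

fit_small_C <- svm(X, factor(y), kernel = "linear", cost = 0.01, scale = FALSE)
fit_large_C <- svm(X, factor(y), kernel = "linear", cost = 100,  scale = FALSE)

# Number of support vectors under each penalty.
c(small_C = nrow(fit_small_C$SV), large_C = nrow(fit_large_C$SV))
```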

12. This is a quadratic program with $\tilde{d} + 1 + N$ variables and $2N$ constraints:
$$
\begin{aligned}
& \min_{b, \mathbf{w}, \boldsymbol{\xi}} \ \frac{1}{2} \mathbf{w}^\top \mathbf{w} + C \sum_{i=1}^{N} \xi_i, \\
& \text{subject to } y_i(\mathbf{w}^\top \mathbf{z}_i + b) \ge 1 - \xi_i, \\
& \phantom{\text{subject to }} \xi_i \ge 0; \ i = 1, \dots, N.
\end{aligned}
$$

13. The dual problem: $\mathrm{SVM} = \min_{b, \mathbf{w}, \boldsymbol{\xi}} \left\{ \max_{\alpha_i \ge 0, \, \beta_i \ge 0} L(b, \mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \right\}$, where
$$
L(b, \mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta})
= \frac{1}{2} \mathbf{w}^\top \mathbf{w} + C \sum_{i=1}^{N} \xi_i + \sum_{i=1}^{N} \beta_i (-\xi_i)
+ \sum_{i=1}^{N} \alpha_i \left( 1 - \xi_i - y_i(\mathbf{w}^\top \mathbf{z}_i + b) \right).
$$

14. Simplify the soft-margin dual problem: under strong duality,
$$
\begin{aligned}
\mathrm{SVM} &= \max_{\alpha_i \ge 0, \, \beta_i \ge 0} \ \min_{b, \mathbf{w}, \boldsymbol{\xi}}
\left( \frac{1}{2} \mathbf{w}^\top \mathbf{w} + C \sum_{i=1}^{N} \xi_i + \sum_{i=1}^{N} \beta_i(-\xi_i)
+ \sum_{i=1}^{N} \alpha_i \left( 1 - \xi_i - y_i(\mathbf{w}^\top \mathbf{z}_i + b) \right) \right) \\
&= \max_{0 \le \alpha_i \le C, \ \beta_i = C - \alpha_i} \ \min_{b, \mathbf{w}}
\left( \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^{N} \alpha_i \left( 1 - y_i(\mathbf{w}^\top \mathbf{z}_i + b) \right) \right),
\end{aligned}
$$
where $\beta_i$ and $\xi_i$ are removed, and the last equality follows from
$$
\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \ \Rightarrow \ \beta_i = C - \alpha_i \ge 0,
$$
which gives $0 \le \alpha_i \le C$.

15. Further simplify the soft-margin dual problem: under strong duality and by
$$
\frac{\partial L}{\partial b} = 0 \ \Rightarrow \ \sum_{i=1}^{N} \alpha_i y_i = 0,
\qquad
\frac{\partial L}{\partial w_j} = 0 \ \Rightarrow \ \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{z}_i,
$$

16. The soft-margin dual SVM:
$$
\mathrm{SVM} = \max_{0 \le \alpha_i \le C, \ \beta_i = C - \alpha_i, \ \sum_i y_i \alpha_i = 0, \ \mathbf{w} = \sum_i \alpha_i y_i \mathbf{z}_i}
\left( -\frac{1}{2} \Big\| \sum_{i=1}^{N} \alpha_i y_i \mathbf{z}_i \Big\|^2 + \sum_{i=1}^{N} \alpha_i \right).
$$

17. A QP with $N$ variables and $2N + 1$ constraints:
$$
\begin{aligned}
& \min_{\boldsymbol{\alpha}} \ \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{z}_i^\top \mathbf{z}_j - \sum_{i=1}^{N} \alpha_i, \\
& \text{subject to } \sum_{i=1}^{N} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C; \ i = 1, \dots, N, \\
& \text{implicitly } \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{z}_i, \quad \beta_i = C - \alpha_i; \ i = 1, \dots, N.
\end{aligned}
$$

18. The standard soft-margin SVM dual problem:
$$
\begin{aligned}
& \min_{\boldsymbol{\alpha}} \ \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{z}_i^\top \mathbf{z}_j - \sum_{i=1}^{N} \alpha_i, \\
& \text{subject to } \sum_{i=1}^{N} y_i \alpha_i = 0, \\
& \phantom{\text{subject to }} 0 \le \alpha_i \le C; \ i = 1, \dots, N, \\
& \text{implicitly } \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{z}_i, \\
& \phantom{\text{implicitly }} \beta_i = C - \alpha_i; \ i = 1, \dots, N,
\end{aligned} \tag{12}
$$
which differs from the hard-margin dual SVM only by the upper bound on $\alpha_i$.

19. We are now able to fit the kernel soft-margin SVM, which is similar to the hard-margin kernel SVM: a QP with $N$ variables and $2N + 1$ constraints,
$$
\begin{aligned}
& \min_{\boldsymbol{\alpha}} \ \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K_{\Phi}(\mathbf{x}_i, \mathbf{x}_j) - \sum_{i=1}^{N} \alpha_i, \\
& \text{subject to } \sum_{i=1}^{N} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C; \ i = 1, \dots, N, \\
& \text{implicitly } \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{z}_i, \quad \beta_i = C - \alpha_i; \ i = 1, \dots, N.
\end{aligned}
$$

20. The hypothesis hyperplane of the soft-margin kernel SVM:
$$
g_{\mathrm{svm}}(\mathbf{x}) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i K_{\Phi}(\mathbf{x}_i, \mathbf{x}) + b \right),
$$
where $b$ can be solved for by complementary slackness.

21. The hypothesis hyperplane:
$$
g_{\mathrm{svm}}(\mathbf{x}) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i K_{\Phi}(\mathbf{x}_i, \mathbf{x}) + b \right).
$$

(a) To find the optimal $b$ (by complementary slackness):
$$
\alpha_i \left( 1 - \xi_i - y_i(\mathbf{w}^\top \mathbf{z}_i + b) \right) = 0, \qquad (C - \alpha_i)\,\xi_i = 0.
$$
i. For support vectors (with $\alpha_s > 0$):
$$
b = y_s - y_s \xi_s - \mathbf{w}^\top \mathbf{z}_s.
$$
ii. For free support vectors (with $0 < \alpha_s < C$):
$$
\xi_s = 0 \ \Rightarrow \ b = y_s - \mathbf{w}^\top \mathbf{z}_s.
$$
(b) Finally, we can solve for a unique $b$ with a free support vector $(\mathbf{x}_s, y_s)$:
$$
b = y_s - \sum_{\text{SV indices } i} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_s).
$$
(c) Further, let $S = \{s \in \{1, \dots, N\} : 0 < \alpha_s < C\}$ be the set of all free support vectors; then
$$
b = \frac{1}{|S|} \sum_{s \in S} \left( y_s - \sum_{\text{SV indices } i} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_s) \right).
$$
(d) Or alternatively, for $S^+ = \{s \in S : y_s = +1\}$ and $S^- = \{s \in S : y_s = -1\}$,
$$
b = -\frac{1}{2} \left( \min_{s \in S^+} \sum_{\text{SV indices } i} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_s)
+ \max_{s \in S^-} \sum_{\text{SV indices } i} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_s) \right).
$$
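A minimal R sketch of the soft-margin dual (12), again using quadprog and the toy X, y, N from the earlier sketches: the only change from the hard-margin dual is the extra set of constraints encoding alpha_i <= C, and b is recovered from the free support vectors as in item 21. The ridge and the numerical thresholds are assumptions of the sketch, not part of the text.

```r
# Soft-margin dual SVM (12) via quadprog; linear kernel for simplicity.
library(quadprog)

C    <- 1
Qmat <- (y %o% y) * (X %*% t(X)) + diag(1e-8, N)
Amat <- cbind(y, diag(N), -diag(N))        # y'alpha = 0, alpha_i >= 0, -alpha_i >= -C
bvec <- c(0, rep(0, N), rep(-C, N))

alpha <- solve.QP(Qmat, rep(1, N), Amat, bvec, meq = 1)$solution
alpha <- pmin(pmax(alpha, 0), C)           # clip tiny numerical violations

w    <- colSums(alpha * y * X)
free <- which(alpha > 1e-6 & alpha < C - 1e-6)   # free SVs: 0 < alpha_i < C
b    <- mean(y[free] - X[free, , drop = FALSE] %*% w)
```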

22. Complementary slackness:
$$
\alpha_i \left( 1 - \xi_i - y_i(\mathbf{w}^\top \mathbf{z}_i + b) \right) = 0, \qquad (C - \alpha_i)\,\xi_i = 0.
$$

(a) Non-SV ($\alpha_i = 0$): $\xi_i = 0$; these points are not support vectors, lie on or outside the margin, and are correctly classified.
(b) Free SV ($0 < \alpha_i < C$): $\xi_i = 0$; these are support vectors lying on the margin and correctly classified.
(c) Bounded SV ($\alpha_i = C$): $\xi_i = 1 - y_i(\mathbf{w}^\top \mathbf{z}_i + b) \ge 0$; these are support vectors that generally lie within the margin or are misclassified, and are seldom on the margin (where $\xi_i = 0$).
(d) If $\xi_i = 0$, the sample $(\mathbf{z}_i, y_i)$ is correctly classified and lies either on the margin (if $y_i(\mathbf{w}^\top \mathbf{z}_i + b) = 1$) or away from the margin (if $y_i(\mathbf{w}^\top \mathbf{z}_i + b) > 1$ and $\alpha_i = 0$).
(e) If $\xi_i > 0$, the sample $(\mathbf{z}_i, y_i)$ may fall inside the margin or be misclassified, but it is still a support vector.
(f) $\alpha_i$ and $\xi_i$ can be used for data analysis.

[Figure: SVM classification plot, axes x1 versus x2, illustrating support vectors and margin violations.]

23. The selection of the penalty $C$:

[Figure: six SVM classification plots, axes x1 versus x2, showing how the fitted decision boundary changes across different values of the penalty $C$.]

24. R-code of the toy example:
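The toy example's R listing is not shown above; as a stand-in, here is a minimal sketch (assuming the e1071 package and a two-cluster Gaussian toy dataset, both assumptions of this sketch) that refits a Gaussian-kernel soft-margin SVM over a grid of penalties, in the spirit of item 23.

```r
# Stand-in sketch: refit a Gaussian-kernel soft-margin SVM over a grid of
# penalties C and inspect how many support vectors each fit keeps.
library(e1071)

set.seed(2)
n <- 50
X <- rbind(matrix(rnorm(n * 2, mean =  2), ncol = 2),
           matrix(rnorm(n * 2, mean = -2), ncol = 2))
y <- factor(rep(c(+1, -1), each = n))

for (C in c(0.01, 1, 100)) {
  fit <- svm(X, y, kernel = "radial", gamma = 0.5, cost = C, scale = FALSE)
  cat("C =", C, "-> number of support vectors:", nrow(fit$SV), "\n")
}
```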

5 WEIGHTED SVMS

1. For each vector $\mathbf{z}_i$, suppose that the classification error is weighted by $\gamma_i$ when the sample point $(\mathbf{z}_i, y_i)$ is incorrectly classified.

2. This is equivalent to considering an extra number of sample points, say $m\gamma_i$ of them, overlapping on $\mathbf{z}_i$, where $m$ is the smallest integer such that $m\gamma_i$; $i = 1, \dots, N$, are all integers.

3. In other words, we now have pseudo samples such that:

(a) Pseudo sample size: $\tilde{N} = m \sum_{i=1}^{N} \gamma_i$.
(b) Pseudo responses: for $i = 1, \dots, N$ and $k = 1, \dots, m\gamma_i$, $y_{i,k} = y_i \in \{+1, -1\}$.
(c) Pseudo covariates: for $i = 1, \dots, N$ and $k = 1, \dots, m\gamma_i$, $\mathbf{z}_{i,k} = \mathbf{z}_i$.
(d) Linear classifier: for $i = 1, \dots, N$ and $k = 1, \dots, m\gamma_i$, $h(\mathbf{z}_{i,k}) = \mathbf{w}^\top \mathbf{z}_{i,k} + b$.

4. Denote the margin violation of the cluster samples $y_{i,k}$; $k = 1, \dots, m\gamma_i$, by $\xi_{i,k}$, with $\xi_{i,k} = \xi_i$ for $k = 1, \dots, m\gamma_i$.

5. Let $\boldsymbol{\xi} = \{\xi_{i,k}\}$ be the $\tilde{N} \times 1$ margin-violation vector.

6. The optimization problem of the soft-margin SVM on the pseudo samples is
$$
\begin{aligned}
& \min_{b, \mathbf{w}, \boldsymbol{\xi}} \ \frac{1}{2} \mathbf{w}^\top \mathbf{w} + C \sum_{i=1}^{N} \sum_{k=1}^{m\gamma_i} \xi_{i,k}, \\
& \text{subject to } y_{i,k}(\mathbf{w}^\top \mathbf{z}_{i,k} + b) \ge 1 - \xi_{i,k}, \\
& \phantom{\text{subject to }} \xi_{i,k} \ge 0; \ i = 1, \dots, N, \ k = 1, \dots, m\gamma_i,
\end{aligned}
$$
where $C$ is the trade-off between a large margin and margin violation.

7. Since $y_{i,k} = y_i$, $\mathbf{z}_{i,k} = \mathbf{z}_i$, and $\xi_{i,k} = \xi_i$, the optimization problem can then be re-written as:
$$
\begin{aligned}
& \min_{b, \mathbf{w}, \boldsymbol{\xi}} \ \frac{1}{2} \mathbf{w}^\top \mathbf{w} + C \sum_{i=1}^{N} m\gamma_i\, \xi_i, \\
& \text{subject to } y_i(\mathbf{w}^\top \mathbf{z}_i + b) \ge 1 - \xi_i, \\
& \phantom{\text{subject to }} \xi_i \ge 0; \ i = 1, \dots, N.
\end{aligned}
$$

8. We then have the dual problem:
$$
\min_{b, \mathbf{w}, \boldsymbol{\xi}} \left\{ \max_{\alpha_i \ge 0, \, \beta_i \ge 0} L(b, \mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \right\},
$$
where
$$
L(b, \mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta})
= \frac{1}{2} \mathbf{w}^\top \mathbf{w} + C \sum_{i=1}^{N} m\gamma_i\, \xi_i + \sum_{i=1}^{N} \beta_i(-\xi_i)
+ \sum_{i=1}^{N} \alpha_i \left( 1 - \xi_i - y_i(\mathbf{w}^\top \mathbf{z}_i + b) \right).
$$

9. Under strong duality, the dual problem is equivalent to
$$
\begin{aligned}
& \max_{\alpha_i \ge 0, \, \beta_i \ge 0} \ \min_{b, \mathbf{w}, \boldsymbol{\xi}} L(b, \mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \\
&= \max_{\alpha_i \ge 0, \, \beta_i \ge 0} \ \min_{b, \mathbf{w}, \boldsymbol{\xi}}
\left( \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^{N} m\gamma_i C\, \xi_i + \sum_{i=1}^{N} \beta_i(-\xi_i)
+ \sum_{i=1}^{N} \alpha_i \left( 1 - \xi_i - y_i(\mathbf{w}^\top \mathbf{z}_i + b) \right) \right) \\
&= \max_{0 \le \alpha_i \le m\gamma_i C, \ \beta_i = m\gamma_i C - \alpha_i} \ \min_{b, \mathbf{w}}
\left( \frac{1}{2} \mathbf{w}^\top \mathbf{w} + \sum_{i=1}^{N} \alpha_i \left( 1 - y_i(\mathbf{w}^\top \mathbf{z}_i + b) \right) \right),
\end{aligned}
$$
where $\beta_i$ and $\xi_i$ are removed, and the last equality follows by
$$
\frac{\partial L}{\partial \xi_i} = m\gamma_i C - \beta_i - \alpha_i = 0,
$$
which gives $\beta_i = m\gamma_i C - \alpha_i \ge 0$ and $0 \le \alpha_i \le m\gamma_i C$.

10. Note that
$$
\frac{\partial L}{\partial b} = 0 \quad \text{and} \quad \frac{\partial L}{\partial \mathbf{w}} = \mathbf{0}
$$
imply that
$$
\sum_{i=1}^{N} \alpha_i y_i = 0, \tag{13}
$$
and
$$
\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{z}_i. \tag{14}
$$

11. Under strong duality and by (13) and (14), we can further simplify the dual problem as:
$$
\max_{0 \le \alpha_i \le m\gamma_i C, \ \sum_{i=1}^{N} \alpha_i y_i = 0, \ \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{z}_i}
\left( -\frac{1}{2} \Big\| \sum_{i=1}^{N} \alpha_i y_i \mathbf{z}_i \Big\|^2 + \sum_{i=1}^{N} \alpha_i \right).
$$

12. We then have the standard weighted soft-margin SVM dual problem:
$$
\begin{aligned}
& \min_{\boldsymbol{\alpha}} \ \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{z}_i^\top \mathbf{z}_j - \sum_{i=1}^{N} \alpha_i, \\
& \text{subject to } \sum_{i=1}^{N} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le m\gamma_i C; \ i = 1, \dots, N, \\
& \text{implicitly } \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{z}_i, \quad \beta_i = m\gamma_i C - \alpha_i; \ i = 1, \dots, N.
\end{aligned} \tag{15}
$$

13. Note that the effect of $m$ can be incorporated into the tuning of the penalty $C$; hence we can absorb it by setting $C \leftarrow mC$ in (15), which turns (15) into:
$$
\begin{aligned}
& \min_{\boldsymbol{\alpha}} \ \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \mathbf{z}_i^\top \mathbf{z}_j - \sum_{i=1}^{N} \alpha_i, \\
& \text{subject to } \sum_{i=1}^{N} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le \gamma_i C; \ i = 1, \dots, N, \\
& \text{implicitly } \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{z}_i, \quad \beta_i = \gamma_i C - \alpha_i; \ i = 1, \dots, N.
\end{aligned} \tag{16}
$$

14. Let $S = \{i \in \{1, \dots, N\} : 0 < \alpha_i \le \gamma_i C\}$ denote the set of all support vectors.

15. We then have the classifier $h(\mathbf{z}) = \mathbf{w}^\top \mathbf{z} + b$ with
$$
\mathbf{w} = \sum_{i \in S} \alpha_i y_i \mathbf{z}_i,
$$
and $b$ obtained from the free support vectors:
$$
b = -\frac{1}{2} \left( \min_{i \in S^+} \mathbf{w}^\top \mathbf{z}_i + \max_{i \in S^-} \mathbf{w}^\top \mathbf{z}_i \right),
$$
where $S^+ = \{i \in S : y_i = +1 \text{ and } 0 < \alpha_i < \gamma_i C\}$ and $S^- = \{i \in S : y_i = -1 \text{ and } 0 < \alpha_i < \gamma_i C\}$ are the sets of free support vectors that lie on the margin.

16. In summary:

(a) The optimization difference between the standard soft-margin SVM and the weighted soft-margin SVM lies in how the upper bounds of the dual variables (i.e., $\alpha_i$) are determined.
(b) In a standard soft-margin SVM, the upper bound of the dual variables is fixed and determined solely by the penalty parameter $C$.
(c) However, in a weighted soft-margin SVM, the upper bounds of the dual variables are adjusted by multiplying the penalty parameter $C$ by the predefined weights $\gamma_i$ associated with each data point.
(d) This adjustment allows for differential treatment of individual data points based on their importance or significance, as determined by the weights assigned to them.
(e) In other words, the significance of a vector (i.e., $\mathbf{z}_i$) in constructing the classifier (i.e., $\mathbf{w}$) increases with a higher weight (i.e., $\gamma_i$) assigned to it.
(f) This reflects the importance attributed to data points with higher weights, allowing them to contribute more substantially to the classifier's construction.

6 FUNCTIONS OF (WEIGHTED) SUPPORT VECTOR MACHINES IN R/RSTUDIO

1. Code in R:

2. Loading Required Libraries: The WeightSVM library provides functions for weighted SVM.

3. Generating Example Data: We create a synthetic dataset with two features (x1 and x2) and a binary response variable y. There are 50 samples each for class $-1$ and class $+1$.

4. Fitting Standard Soft-Margin SVM:

(a) We fit a standard soft-margin SVM using the wsvm function from the WeightSVM package.
(b) We specify the input features x, the response variable y, and set the weights to be equal (rep(1, 100)).
(c) The kernel is set to linear and the cost parameter ($C$) is set to 10.
(d) scale = FALSE ensures that the dataset is not standardized.
(e) The coefficients of the SVM model are obtained using coef(model).

5. Plotting Data and Separating Hyperplane for Standard SVM:

(a) We plot the data points with different colors and shapes based on their class labels.
(b) The separating hyperplane and the margin are plotted using the coefficients obtained from the model.
(c) Support vectors are identified and plotted with different shapes.

6. Defining Different Weights for Weighted Soft-Margin SVM: We define different weights for the positive and negative samples. In this case, positive samples are weighted 3 times more than negative samples.

7. Fitting Weighted Soft-Margin SVM: We fit a weighted soft-margin SVM using the same procedure as the standard
SVM, but with the specified weights.

8. Plotting Data and Separating Hyperplane for Weighted SVM:

(a) Similar to the standard SVM, we plot the data points, separating hyperplane, margin, and support vectors for the
weighted SVM.
(b) The code demonstrates the effect of different weights on the decision boundary and margin of SVMs using the same dataset.
(c) The decision boundary may change based on the weights assigned to different samples, impacting how the SVM model classifies new data points.
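The full R listing is not shown above; the sketch below reconstructs the fitting steps described in items 3, 4, 6, and 7 (data generation, equal-weight fit, and weighted fit), under the assumption that the per-sample weights are passed to wsvm through its weight argument; the plotting steps of items 5 and 8 are omitted.

```r
# Standard vs. weighted soft-margin SVM with the WeightSVM package.
library(WeightSVM)

set.seed(3)
x <- rbind(matrix(rnorm(50 * 2, mean =  2), ncol = 2),   # 50 samples, class +1
           matrix(rnorm(50 * 2, mean = -2), ncol = 2))   # 50 samples, class -1
colnames(x) <- c("x1", "x2")
y_num <- rep(c(1, -1), each = 50)
y <- factor(y_num)

# Standard soft-margin SVM: equal weights, linear kernel, cost C = 10.
model_std <- wsvm(x, y, weight = rep(1, 100),
                  kernel = "linear", cost = 10, scale = FALSE)
coef(model_std)                          # hyperplane coefficients (b, w)

# Weighted soft-margin SVM: positive samples weighted 3 times more.
wts <- ifelse(y_num == 1, 3, 1)
model_wt <- wsvm(x, y, weight = wts,
                 kernel = "linear", cost = 10, scale = FALSE)
coef(model_wt)
```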

9. The performance of the two SVMs is visualized in the accompanying plots.
