Ch4 Support Vector Machine
Chang, Chih-Hao
KEYWORDS:
Binary classification, dual SVM, hard margin SVM, kernel SVM, soft margin SVM, weighted SVM
2. SVM focuses on maximizing the margin between classes, making it robust and effective for high-dimensional data and handling outliers.
3. A linear classifier should stay as far away from the data points as possible to tolerate noise in the data: the greater the noise tolerance, the more robust the classifier is to overfitting.
(d) Suppose there exists a linear classifier positioned in close proximity to you, accurately assigning you to the correct category.
(e) In this scenario, the classifier's decision may be highly sensitive to small disturbances of the data near the linear classifier.
(g) To mitigate such sensitivity, it would be prudent to draw a linear classifier far away from the data points, as illustrated in the figure.
[Figure: linearly separable data points and candidate linear classifiers (axes $x_1$, $x_2$).]
4. Key concepts of SVMs compared with logistic regression and the perceptron learning algorithm are:
(a) Objective:
i. SVM: SVM aims to find the hyperplane that best separates the classes while maximizing the margin between
the closest data points (support vectors) from each class. It seeks to find the decision boundary with the largest
margin, making it robust to outliers and improving generalization.
ii. Logistic Regression: Logistic regression models the probability that a given input belongs to a particular class.
It estimates the parameters of the logistic function to fit the data, optimizing the log-likelihood or minimizing
the cross-entropy error.
iii. Perceptron Learning Algorithm (PLA): The PLA learns a linear classifier by iteratively updating the weights
to minimize the misclassification error. It seeks to find a decision boundary that correctly classifies all training
examples.
(b) Decision Boundary:
i. SVM: The decision boundary in SVM is determined by the support vectors, which are the data points closest
to the decision boundary. The margin between the support vectors from each class is maximized.
ii. Logistic Regression: Logistic regression produces a decision boundary based on the logistic function, which
models the probability of class membership. The decision boundary is typically a linear or nonlinear function
of the input features.
iii. PLA: The decision boundary in PLA is determined by the weights assigned to each input feature. It adjusts the
weights to minimize the misclassification error, leading to a decision boundary that separates the classes.
(c) Loss Function:
i. SVM: SVM uses a hinge loss function, which penalizes misclassifications and encourages maximizing the
margin between classes.
ii. Logistic Regression: Logistic regression uses the logistic loss function, also known as cross-entropy loss, to measure the difference between predicted probabilities and true labels.
iii. PLA: The PLA minimizes the misclassification error, updating the weights to correct misclassifications. (A small sketch comparing these three losses follows this list.)
(d) Handling Nonlinear Data:
i. SVM: SVM can handle nonlinear data by using kernel functions, which map the input features into a higher-
dimensional space where a linear decision boundary can be found.
ii. Logistic Regression: Logistic regression can be extended to handle nonlinear data by including polynomial or
interaction terms, or by using techniques such as feature engineering.
iii. PLA: The PLA is inherently linear and may struggle with nonlinear data unless feature transformations are
applied.
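The contrast in 4(c) can be made concrete by plotting the three losses as functions of the signed margin $s = y\,f(\mathbf{x})$. The following is a small illustrative sketch (not part of the chapter's own code listing):

```r
## Illustrative sketch: 0/1, hinge, and logistic losses as functions of the
## signed margin s = y * f(x). Larger s means a more confident correct decision.
s <- seq(-3, 3, by = 0.01)

zero_one_loss <- as.numeric(s <= 0)     # PLA: misclassification (0/1) error
hinge_loss    <- pmax(0, 1 - s)         # SVM: hinge loss, penalizes margins below 1
logistic_loss <- log(1 + exp(-s))       # logistic regression: cross-entropy with +/-1 labels

matplot(s, cbind(zero_one_loss, hinge_loss, logistic_loss), type = "l", lty = 1,
        xlab = "s = y * f(x)", ylab = "loss", main = "0/1, hinge, and logistic losses")
legend("topright", legend = c("0/1", "hinge", "logistic"), lty = 1, col = 1:3)
```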
5. Hence, we try to fit a linear classifier with the maximal margin that separates all the data points correctly. In other words, for a linear classifier $h(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b = 0$:
\[
\begin{aligned}
\max_{b,\mathbf{w}}\ & \text{margin}(h),\\
\text{subject to}\ & h \text{ classifies every } (\mathbf{x}_n, y_n) \text{ correctly},\\
& \text{margin}(h) = \min_{n=1,\dots,N} \text{distance}(\mathbf{x}_n, h).
\end{aligned}
\tag{1}
\]
11. Note that the two linear classifiers $\mathbf{w}^\top\mathbf{x} + b = 0$ and $3\mathbf{w}^\top\mathbf{x} + 3b = 0$ are the same; the scaling factor 3 does not matter.
12. Hence, we can assume a special scaling of $(b,\mathbf{w})$ such that the minimal distance satisfies
\[
\min_{n=1,\dots,N} y_n(\mathbf{w}^\top\mathbf{x}_n + b) = 1. \tag{4}
\]
(e) Obviously, the resulting margin is $1/\|\mathbf{w}\|$, and the total width between the two margin boundaries is $2\cdot(1/\|\mathbf{w}\|) = 2/\|\mathbf{w}\|$.
15. We now relax the constraint $\min_{n=1,\dots,N} y_n(\mathbf{w}^\top\mathbf{x}_n + b) = 1$ to $y_n(\mathbf{w}^\top\mathbf{x}_n + b) \ge 1$ for all $n$, which yields another optimization problem:
\[
\begin{aligned}
\min_{b,\mathbf{w}}\ & \frac{1}{2}\mathbf{w}^\top\mathbf{w},\\
\text{subject to}\ & y_n(\mathbf{w}^\top\mathbf{x}_n + b) \ge 1;\quad n = 1,\dots,N.
\end{aligned}
\tag{6}
\]
(a) If the optimal $(b,\mathbf{w}) \notin \{(b,\mathbf{w}) : \min_{n=1,\dots,N} y_n(\mathbf{w}^\top\mathbf{x}_n + b) = 1\}$, then say $\min_{n=1,\dots,N} y_n(\mathbf{w}^\top\mathbf{x}_n + b) = 1.126$.
(b) As we previously mentioned, there is another equivalent feasible pair: $(b,\mathbf{w}) = (b/1.126, \mathbf{w}/1.126)$.
(c) The resulting margin is
\[
\frac{1}{\|\mathbf{w}/1.126\|} = \frac{1.126}{\|\mathbf{w}\|} > \frac{1}{\|\mathbf{w}\|},
\]
which contradicts the optimality of $(b,\mathbf{w})$.
(d) Hence the optimal $(b,\mathbf{w}) \in \{(b,\mathbf{w}) : \min_{n=1,\dots,N} y_n(\mathbf{w}^\top\mathbf{x}_n + b) = 1\}$.
(a) QP is a mathematical optimization problem that deals with finding the minimum (or maximum) of a quadratic
function subject to linear constraints.
(b) QP is widely used in various fields such as finance (portfolio optimization), engineering (control theory, structural
optimization), operations research, and many other areas where optimization problems arise.
(c) It’s an essential tool for solving complex optimization problems efficiently.
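As a concrete illustration of using a QP solver for the hard-margin primal problem (6), here is a minimal sketch assuming the quadprog package and a small linearly separable toy dataset (both are assumptions; the chapter's own code appears later and uses WeightSVM):

```r
## Minimal sketch: the hard-margin primal (6) as a QP via quadprog::solve.QP().
## solve.QP() minimizes (1/2) u' D u - d' u subject to A' u >= b0; the variable is
## u = (b, w1, ..., wd). A tiny ridge is put on b so that D is positive definite.
library(quadprog)

set.seed(1)
X <- rbind(matrix(rnorm(40, mean = -2), ncol = 2),
           matrix(rnorm(40, mean =  2), ncol = 2))   # linearly separable toy data
y <- c(rep(-1, 20), rep(1, 20))

d    <- ncol(X)
Dmat <- diag(c(1e-8, rep(1, d)))       # (1/2) w'w, with a tiny ridge on b
dvec <- rep(0, d + 1)
Amat <- t(cbind(y, y * X))             # column n is y_n * (1, x_n)
bvec <- rep(1, nrow(X))                # encodes y_n (w'x_n + b) >= 1

sol   <- solve.QP(Dmat, dvec, Amat, bvec)
b_hat <- sol$solution[1]
w_hat <- sol$solution[-1]
```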
19. Soft-margin and hard-margin SVMs are two variants of SVMs that differ in how they handle datasets with respect to linear separability and classification errors.
1. For optimization problems, duality refers to a relationship between two closely related optimization problems, known as
the primal problem and the dual problem.
2. The primal problem is the original optimization problem that we want to solve, while the dual problem is derived from
the primal problem and provides an alternative perspective on the optimization task.
3. In the case of SVMs, the dual SVM formulation refers to the dual problem that arises from the standard primal SVM
optimization problem.
(a) The standard formulation of SVM involves finding a hyperplane that separates the data into two classes while
maximizing the margin between the closest data points (support vectors) from each class.
(b) This problem can be formulated as a convex optimization problem in the primal space, where the objective is to
minimize the norm of the weight vector subject to the constraints imposed by the data points.
(c) The dual SVM formulation, on the other hand, reformulates the optimization problem in terms of the Lagrange
multipliers associated with the constraints.
(d) This formulation allows for the solution of the SVM problem in a higher-dimensional space (the dual space), where
the optimization problem may be easier to solve.
(e) The key advantage of the dual SVM formulation is that it often leads to more efficient algorithms for solving the SVM problem, particularly when the number of data points is large or when the data is not linearly separable.
(f) Additionally, the dual formulation allows for the use of kernel methods, which enable SVMs to handle nonlinear data by implicitly mapping the data into a higher-dimensional space.
(g) In summary, the dual SVM formulation provides an alternative approach to solving the SVM optimization problem, offering advantages in terms of efficiency and flexibility, especially in the presence of large or nonlinear datasets.
4. The duality between the primal and dual problems has several important implications:
(a) Relationship between solutions: The solutions to the primal and dual problems are closely related. In the case of
SVMs, the optimal solution to the primal problem can be recovered from the optimal solution to the dual problem,
and vice versa.
(b) Efficient algorithms: In some cases, solving the dual problem may be computationally more efficient than solving the primal problem. This is particularly true when the number of data points is large or when the data is not linearly separable, as in these cases, the dual problem may have a simpler structure.
(c) Kernel methods: The dual formulation of SVMs enables the use of kernel methods, which allow SVMs to handle
nonlinear data by implicitly mapping the data into a higher-dimensional space. Kernel methods are typically more
naturally expressed in terms of the dual problem.
5. Recall the hard-margin primal problem:
\[
\min_{b,\mathbf{w}}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w},
\quad \text{subject to } y_n(\mathbf{w}^\top\mathbf{x}_n + b) \ge 1;\ n = 1,\dots,N.
\]
6. Define the Lagrangian
\[
\mathcal{L}(b,\mathbf{w},\boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^\top\mathbf{w} + \sum_{n=1}^{N}\alpha_n\bigl(1 - y_n(\mathbf{w}^\top\mathbf{x}_n + b)\bigr),
\]
where
\[
\text{SVM} = \min_{b,\mathbf{w}}\Bigl(\max_{\alpha_n \ge 0}\ \mathcal{L}(b,\mathbf{w},\boldsymbol{\alpha})\Bigr).
\]
7. Why is
\[
\text{SVM} = \min_{b,\mathbf{w}}\Bigl(\max_{\alpha_n \ge 0}\Bigl\{\frac{1}{2}\mathbf{w}^\top\mathbf{w} + \sum_{n=1}^{N}\alpha_n\bigl(1 - y_n(\mathbf{w}^\top\mathbf{x}_n + b)\bigr)\Bigr\}\Bigr)?
\]
8. Complementary slackness:
(a) It says that if a dual variable (i.e., $\alpha_n$) is strictly greater than zero (slack), then the corresponding primal constraint (i.e., $1 - y_n(\mathbf{w}^\top\mathbf{x}_n + b) \le 0$) must hold with equality (tight).
(b) It also says that if the primal constraint is slack (a strict inequality), then the corresponding dual variable must be tight, i.e., zero.
(c) This property is a fundamental aspect of duality in optimization and plays a key role in understanding the relationships between primal and dual solutions in optimization problems.
9. In conclusion,
\[
\text{SVM} = \min_{b,\mathbf{w}}\Bigl(\max_{\text{all } \alpha_n \ge 0}\ \mathcal{L}(b,\mathbf{w},\boldsymbol{\alpha})\Bigr).
\]
(a) For any violating $(b,\mathbf{w})$ such that $1 - y_n(\mathbf{w}^\top\mathbf{x}_n + b) > 0$ for some $n$, we have
\[
\max_{\alpha_n \ge 0}\Bigl\{\frac{1}{2}\mathbf{w}^\top\mathbf{w} + \sum_{n=1}^{N}\alpha_n\bigl(1 - y_n(\mathbf{w}^\top\mathbf{x}_n + b)\bigr)\Bigr\} = \infty,
\]
since $\alpha_n = \infty$ would be chosen in this case.
(b) For any feasible $(b,\mathbf{w})$ such that $1 - y_n(\mathbf{w}^\top\mathbf{x}_n + b) \le 0$ for all $n$, we have
\[
\max_{\alpha_n \ge 0}\Bigl\{\frac{1}{2}\mathbf{w}^\top\mathbf{w} + \sum_{n=1}^{N}\alpha_n\bigl(1 - y_n(\mathbf{w}^\top\mathbf{x}_n + b)\bigr)\Bigr\} = \frac{1}{2}\mathbf{w}^\top\mathbf{w},
\]
since $\alpha_n\bigl(1 - y_n(\mathbf{w}^\top\mathbf{x}_n + b)\bigr) = 0$ for all $n$ would be chosen in this case.
(c) We are therefore exactly minimizing $\frac{1}{2}\mathbf{w}^\top\mathbf{w}$ under the constraints $1 - y_n(\mathbf{w}^\top\mathbf{x}_n + b) \le 0$ for all $n$.
10. Exchanging the order of the min and max yields
\[
\max_{\alpha_n \ge 0}\Bigl(\min_{b,\mathbf{w}}\ \mathcal{L}(b,\mathbf{w},\boldsymbol{\alpha})\Bigr),
\]
which is called the Lagrange dual problem and is the outer maximization over $\boldsymbol{\alpha}$ of a lower bound on the original problem.
11. For
\[
\mathcal{L}(b,\mathbf{w},\boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^\top\mathbf{w} + \sum_{n=1}^{N}\alpha_n\bigl(1 - y_n(\mathbf{w}^\top\mathbf{x}_n + b)\bigr),
\]
the Lagrange dual problem satisfies (weak duality):
\[
\text{SVM} = \min_{b,\mathbf{w}}\Bigl(\max_{\alpha_n \ge 0}\ \mathcal{L}(b,\mathbf{w},\boldsymbol{\alpha})\Bigr) \;\ge\; \max_{\alpha_n \ge 0}\Bigl(\min_{b,\mathbf{w}}\ \mathcal{L}(b,\mathbf{w},\boldsymbol{\alpha})\Bigr).
\]
12. Strong duality (equality) holds for the SVM problem, since the problem is:
(a) Convex;
(b) Guaranteed to have a solution (when the data are linearly separable);
(c) Subject to linear constraints.
13. If strong duality holds, there exists a primal-dual optimal solution $(b,\mathbf{w},\boldsymbol{\alpha})$ for both sides.
15. Further simplify the dual problem: under strong duality and by
\[
\frac{\partial \mathcal{L}(b,\mathbf{w},\boldsymbol{\alpha})}{\partial w_j} = w_j - \sum_{n=1}^{N}\alpha_n y_n x_{n,j} = 0
\quad\Longrightarrow\quad
\mathbf{w} = \sum_{n=1}^{N}\alpha_n y_n \mathbf{x}_n,
\]
and hence
\[
\begin{aligned}
\text{SVM}
&= \max_{\alpha_n \ge 0,\ \sum_n y_n\alpha_n = 0}\ \min_{\mathbf{w}}\ \Bigl\{\frac{1}{2}\mathbf{w}^\top\mathbf{w} + \sum_{n=1}^{N}\alpha_n\bigl(1 - y_n\,\mathbf{x}_n^\top\mathbf{w}\bigr)\Bigr\}\\
&= \max_{\alpha_n \ge 0,\ \sum_n y_n\alpha_n = 0,\ \mathbf{w}=\sum_n \alpha_n y_n \mathbf{x}_n}\ \Bigl\{\frac{1}{2}\mathbf{w}^\top\mathbf{w} + \sum_{n=1}^{N}\alpha_n - \mathbf{w}^\top\mathbf{w}\Bigr\}\\
&= \max_{\alpha_n \ge 0,\ \sum_n y_n\alpha_n = 0,\ \mathbf{w}=\sum_n \alpha_n y_n \mathbf{x}_n}\ \Bigl\{-\frac{1}{2}\Bigl\|\sum_{n=1}^{N}\alpha_n y_n \mathbf{x}_n\Bigr\|^2 + \sum_{n=1}^{N}\alpha_n\Bigr\}.
\end{aligned}
\]
16. This suggests that the conditions for a primal-dual optimal $(b,\mathbf{w},\boldsymbol{\alpha})$ are:
(a) Primal feasibility: $y_n(\mathbf{w}^\top\mathbf{x}_n + b) \ge 1$ for all $n$;
(b) Dual feasibility: $\alpha_n \ge 0$ for all $n$;
(c) Stationarity: $\sum_{n=1}^{N} y_n\alpha_n = 0$ and $\mathbf{w} = \sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n$;
(d) Complementary slackness: $\alpha_n\bigl(1 - y_n(\mathbf{w}^\top\mathbf{x}_n + b)\bigr) = 0$ for all $n$.
(e) The above conditions are called the Karush-Kuhn-Tucker (KKT) conditions and are necessary for optimality.
(f) We will use the KKT conditions to solve for $(b,\mathbf{w})$ from the optimal $\boldsymbol{\alpha}$.
The dual problem can then be written equivalently as the minimization
\[
\begin{aligned}
\min_{\boldsymbol{\alpha}}\ & \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m y_n y_m \mathbf{x}_n^\top\mathbf{x}_m - \sum_{n=1}^{N}\alpha_n,\\
\text{subject to}\ & \sum_{n=1}^{N} y_n\alpha_n = 0;\quad \alpha_n \ge 0\ \text{ for } n = 1,\dots,N,
\end{aligned}
\tag{8}
\]
which has $N$ variables and $N + 1$ constraints, and is a convex quadratic programming problem.
20. By the KKT conditions, we can obtain $\mathbf{w}$ easily. For the optimal $b$: for any $\alpha_n > 0$, $b = y_n - \mathbf{w}^\top\mathbf{x}_n$.
(a) By the KKT conditions, we have $\mathbf{w} = \sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n$, and by the complementary slackness condition $\alpha_n\bigl(1 - y_n(\mathbf{w}^\top\mathbf{x}_n + b)\bigr) = 0$ for $n = 1,\dots,N$,
\[
b = y_n - \mathbf{w}^\top\mathbf{x}_n, \quad \text{for any } n \text{ with } \alpha_n > 0.
\]
(b) It is also interesting to note that when $\alpha_n > 0$, $y_n(\mathbf{w}^\top\mathbf{x}_n + b) = 1$; these are points on the margin boundary and are the support vectors.
(c) Hence, by letting $S = \{s \in \{1,\dots,N\} : \alpha_s > 0\}$, we can obtain
\[
b = \frac{1}{|S|}\sum_{s \in S}\bigl(y_s - \mathbf{w}^\top\mathbf{x}_s\bigr), \tag{9}
\]
where $S$ is the set of support vectors: $y_s(\mathbf{w}^\top\mathbf{x}_s + b) = 1$ for $s \in S$.
(d) Or alternatively, by letting $S^{+} = \{s \in S : y_s = +1\}$ and $S^{-} = \{s \in S : y_s = -1\}$,
\[
b = -\frac{1}{2}\Bigl(\min_{s \in S^{+}}\mathbf{w}^\top\mathbf{x}_s + \max_{s \in S^{-}}\mathbf{w}^\top\mathbf{x}_s\Bigr). \tag{10}
\]
21. Note that the aforementioned two methods for obtaining $b$ are based on the hard-margin SVM:
(a) Both methods have their advantages and may perform differently depending on the dataset.
(b) The method (9) considers all support vectors, providing a more comprehensive estimate of $b$, while the method (10) focuses only on the extreme support vectors, potentially providing a more robust estimate in the presence of outliers or in high-dimensional spaces.
(c) If computational resources are not a concern, it’s often beneficial to experiment with both methods and choose the one that yields better performance on your validation data or through cross-validation.
(d) Additionally, some SVM libraries or frameworks may have built-in functions to estimate $b$ automatically, utilizing either of these methods or other techniques.
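For comparison with the primal sketch given earlier, here is a minimal sketch of the dual QP (8) followed by the recovery of $b$ via (9), again assuming the quadprog package and reusing the toy data (X, y) from that primal sketch:

```r
## Minimal sketch: the dual hard-margin SVM (8) via quadprog, then w from the KKT
## stationarity condition and b from equation (9). A small ridge keeps Q positive
## definite, since hard-margin dual matrices are often numerically singular.
library(quadprog)

N    <- nrow(X)
Q    <- (y %*% t(y)) * (X %*% t(X))        # q_{n,m} = y_n y_m x_n' x_m
Dmat <- Q + diag(1e-8, N)
dvec <- rep(1, N)                          # minimizing (1/2) a'Qa - sum(a)
Amat <- cbind(y, diag(N))                  # first column: equality  sum_n y_n alpha_n = 0
bvec <- rep(0, N + 1)

alpha <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution

w_hat <- colSums(alpha * y * X)            # w = sum_n alpha_n y_n x_n
sv    <- which(alpha > 1e-6)               # support vectors: alpha_n > 0
b_hat <- mean(y[sv] - X[sv, , drop = FALSE] %*% w_hat)   # equation (9)
```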
22. Compare the dual hard-margin SVM to the primal hard-margin SVM.
(a) Primal SVM:
\[
\min_{b,\mathbf{w}}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w},
\quad \text{subject to } y_n(\mathbf{w}^\top\mathbf{x}_n + b) \ge 1;\ n = 1,\dots,N,
\]
i. $d + 1$ variables;
ii. $N$ constraints.
(b) Dual SVM (suitable for small $N$):
\[
\min_{\boldsymbol{\alpha}}\ \frac{1}{2}\boldsymbol{\alpha}^\top\mathbf{Q}\boldsymbol{\alpha} - \mathbf{1}^\top\boldsymbol{\alpha};\quad \mathbf{Q} = \{q_{n,m}\},\ q_{n,m} = y_n y_m\mathbf{x}_n^\top\mathbf{x}_m,
\]
\[
\text{subject to } \mathbf{y}^\top\boldsymbol{\alpha} = 0;\quad \alpha_n \ge 0\ \text{ for } n = 1,\dots,N,
\]
i. $N$ variables;
ii. $N + 1$ constraints;
iii. Support vectors identified by $\alpha_n > 0$.
[Figure: multi-panel plots of the simulated data and fitted SVM boundaries (axes $x_1$, $x_2$).]
2. Recall a nonlinear decision boundary can be obtained via Transformation + Linear Models.
4. For the dual SVM, we need to compute $\mathbf{z}_n^\top\mathbf{z}_m = \boldsymbol{\Phi}(\mathbf{x}_n)^\top\boldsymbol{\Phi}(\mathbf{x}_m)$ for $n, m = 1,\dots,N$, which we want to calculate faster than $O(\tilde{d})$, the dimension of the transformed feature space.
\[
\begin{aligned}
\boldsymbol{\Phi}_2(\mathbf{x})^\top\boldsymbol{\Phi}_2(\mathbf{x}')
&= 1 + \sum_{j=1}^{d} x_j x'_j + \sum_{j=1}^{d}\sum_{k=1}^{d} x_j x'_j\, x_k x'_k\\
&= 1 + \mathbf{x}^\top\mathbf{x}' + (\mathbf{x}^\top\mathbf{x}')^2,
\end{aligned}
\]
which suggests that for the $\boldsymbol{\Phi}_2$ transform, the inner product can be carefully done in $O(d)$ instead of $O(\tilde{d}) = O(d^2)$.
7. We call this combination of a transform and an inner product of $\mathbf{x}$ the kernel function (here, the 2nd-order polynomial kernel function):
\[
K_{\Phi_2}(\mathbf{x},\mathbf{x}') \equiv \boldsymbol{\Phi}_2(\mathbf{x})^\top\boldsymbol{\Phi}_2(\mathbf{x}'), \tag{11}
\]
where $K_{\Phi_2}(\mathbf{x},\mathbf{x}') = 1 + \mathbf{x}^\top\mathbf{x}' + (\mathbf{x}^\top\mathbf{x}')^2$.
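A quick numerical check of (11), assuming $\boldsymbol{\Phi}_2(\mathbf{x})$ collects the constant 1, all linear terms, and all ordered pairs $x_i x_j$ (consistent with the derivation above, but written here only as an illustration):

```r
## Sketch: verify that the 2nd-order polynomial kernel equals the inner product of
## the explicit transform, while costing only O(d) to evaluate.
phi2 <- function(x) c(1, x, as.vector(outer(x, x)))      # 1, x_j, and all x_i * x_j
k2   <- function(x, xp) 1 + sum(x * xp) + sum(x * xp)^2  # kernel form of (11)

x  <- c(0.5, -1.2, 2.0)
xp <- c(1.0,  0.3, -0.7)

sum(phi2(x) * phi2(xp))   # explicit O(d^2) inner product in the transformed space
k2(x, xp)                 # same value, computed in O(d)
```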
12. For the kernel SVM classifier $g_{\text{svm}}(\mathbf{x}) = \text{sign}(\mathbf{w}^\top\boldsymbol{\Phi}(\mathbf{x}) + b)$, we do not need to know $\mathbf{w}$ explicitly, since $\mathbf{w}^\top\boldsymbol{\Phi}(\mathbf{x}) = \sum_{n=1}^{N}\alpha_n y_n K_{\Phi}(\mathbf{x}_n,\mathbf{x})$ requires only kernel evaluations; the hyperplane classifier can remain a "mystery", and this is called the kernel trick.
13. For the Gaussian kernel $K(\mathbf{x},\mathbf{x}') = \exp\bigl(-\gamma\|\mathbf{x}-\mathbf{x}'\|^2\bigr)$, the classifier becomes
\[
g_{\text{svm}}(\mathbf{x}) = \text{sign}\Bigl(\sum_{n \in \text{SV}}\alpha_n y_n \exp\bigl(-\gamma\|\mathbf{x}-\mathbf{x}_n\|^2\bigr) + b\Bigr),
\]
which is a linear combination of Gaussians centered at the support vectors $\mathbf{x}_n$; this kernel is also called the Radial Basis Function (RBF) kernel.
14. The Gaussian SVM finds $\alpha_n$ to combine Gaussians centered at the support vectors $\mathbf{x}_n$, achieving a large margin in an infinite-dimensional feature space.
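The following hedged illustration fits a Gaussian-kernel SVM on a ring-shaped toy dataset; it assumes the e1071 package (chosen here only for a quick demonstration; the chapter's own code in the last section uses WeightSVM):

```r
## Sketch: Gaussian (RBF) kernel SVM. The fitted decision function is a linear
## combination of Gaussians centered at the support vectors, as in item 13.
library(e1071)

set.seed(2)
X <- matrix(rnorm(200), ncol = 2)
y <- factor(ifelse(rowSums(X^2) > 1.5, 1, -1))   # nonlinear (ring-shaped) labeling

fit <- svm(X, y, kernel = "radial", gamma = 1, cost = 10, scale = FALSE)

nrow(fit$SV)                 # number of Gaussian centers (support vectors)
table(predict(fit, X), y)    # in-sample confusion table
```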
4 SOFT-MARGIN SVM
1. Outlier?
[Figure: multi-panel plots of simulated datasets, including possible outliers, and the corresponding fitted decision boundaries (axes $x_1$, $x_2$).]
2. If the dataset is linearly separable in high-dimensional transformed feature space, the dataset is then separable nonlinearly
in the original feature space.
3. If we ignore the problem of overfitting, theoretically, the Gaussian kernel SVM can separate any dataset: the data become linearly separable in the implicit feature space, i.e., nonlinearly separable in the original space.
4. The Gaussian kernel allows for a highly flexible decision boundary by implicitly mapping the input features into a
potentially infinite-dimensional space.
6. Can the SVM automatically detect that some data points could be outliers, or that some data points can be ignored in favor of a more parsimonious decision boundary?
7. We impose noise tolerance by allowing some data points to be incorrectly classified.
[Figure: multi-panel plots of simulated data and fitted decision boundaries under noise tolerance (axes $x_1$, $x_2$).]
8. Hard-margin SVM:
\[
\min_{b,\mathbf{w}}\ \frac{1}{2}\mathbf{w}^\top\mathbf{w},
\quad \text{subject to } y_n(\mathbf{w}^\top\mathbf{x}_n + b) \ge 1.
\]
9. Tolerance of noise is obtained by allowing data points to be incorrectly classified:
\[
\begin{aligned}
\min_{b,\mathbf{w},\boldsymbol{\xi}}\ & \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_{n=1}^{N}\xi_n,\\
\text{subject to}\ & y_n(\mathbf{w}^\top\mathbf{x}_n + b) \ge 1 - \xi_n,\\
& \xi_n \ge 0;\quad n = 1,\dots,N,
\end{aligned}
\]
where $C$ is the trade-off between a large margin and margin violations.
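To see what the slack variables measure, here is a tiny sketch with hypothetical values of $(b,\mathbf{w})$ (not the chapter's data): the smallest feasible slack for point $n$ is $\xi_n = \max\{0,\ 1 - y_n(\mathbf{w}^\top\mathbf{x}_n + b)\}$, the amount by which the point violates the margin.

```r
## Sketch: margin violations xi_n for a fixed, hypothetical classifier (b, w).
X <- rbind(c(2, 2), c(-2, -1), c(0.2, 0.1))   # three toy points
y <- c(+1, -1, -1)
w <- c(1, 1); b <- 0                          # hypothetical (b, w), for illustration only

xi <- pmax(0, 1 - y * (X %*% w + b))          # smallest feasible slacks
xi                                            # only the third point violates the margin
```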
13. The dual problem:
\[
\text{SVM} = \min_{b,\mathbf{w},\boldsymbol{\xi}}\Bigl(\max_{\alpha_n \ge 0,\ \beta_n \ge 0}\ \mathcal{L}(b,\mathbf{w},\boldsymbol{\xi},\boldsymbol{\alpha},\boldsymbol{\beta})\Bigr),
\]
where
\[
\mathcal{L}(b,\mathbf{w},\boldsymbol{\xi},\boldsymbol{\alpha},\boldsymbol{\beta})
= \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_{n=1}^{N}\xi_n + \sum_{n=1}^{N}\beta_n(-\xi_n)
+ \sum_{n=1}^{N}\alpha_n\bigl(1 - \xi_n - y_n(\mathbf{w}^\top\mathbf{x}_n + b)\bigr).
\]
14. Simplify the soft-margin dual problem: under strong duality,
\[
\begin{aligned}
\text{SVM}
&= \max_{\alpha_n \ge 0,\ \beta_n \ge 0}\ \min_{b,\mathbf{w},\boldsymbol{\xi}}\ \Bigl\{\frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_{n=1}^{N}\xi_n + \sum_{n=1}^{N}\beta_n(-\xi_n)
+ \sum_{n=1}^{N}\alpha_n\bigl(1 - \xi_n - y_n(\mathbf{w}^\top\mathbf{x}_n + b)\bigr)\Bigr\}\\
&= \max_{0 \le \alpha_n \le C,\ \beta_n = C - \alpha_n}\ \min_{b,\mathbf{w}}\ \Bigl\{\frac{1}{2}\mathbf{w}^\top\mathbf{w} + \sum_{n=1}^{N}\alpha_n\bigl(1 - y_n(\mathbf{w}^\top\mathbf{x}_n + b)\bigr)\Bigr\},
\end{aligned}
\]
where the $\beta_n$ and $\xi_n$ are removed, and the last equation follows from
\[
\frac{\partial\mathcal{L}}{\partial\xi_n} = C - \alpha_n - \beta_n = 0
\quad\Longrightarrow\quad
\beta_n = C - \alpha_n \ge 0,
\]
which gives $0 \le \alpha_n \le C$.
15. Further simplify the soft-margin dual problem: under strong duality and by
\[
\frac{\partial\mathcal{L}}{\partial b} = 0 \;\Longrightarrow\; \sum_{n=1}^{N}\alpha_n y_n = 0,
\qquad
\frac{\partial\mathcal{L}}{\partial w_j} = 0 \;\Longrightarrow\; \mathbf{w} = \sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n,
\]
the dual problem becomes
\[
\begin{aligned}
\min_{\boldsymbol{\alpha}}\ & \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m y_n y_m\mathbf{x}_n^\top\mathbf{x}_m - \sum_{n=1}^{N}\alpha_n,\\
\text{subject to}\ & \sum_{n=1}^{N} y_n\alpha_n = 0,\quad 0 \le \alpha_n \le C;\ n = 1,\dots,N,\\
\text{implicitly}\ & \mathbf{w} = \sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n,\quad \beta_n = C - \alpha_n;\ n = 1,\dots,N,
\end{aligned}
\tag{12}
\]
which differs from the hard-margin SVM dual only by the upper bound on $\alpha_n$.
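A minimal quadprog sketch of (12): identical to the earlier hard-margin dual sketch except for the added upper bound $\alpha_n \le C$. The toy data and the choice $C = 1$ are assumptions for illustration only:

```r
## Sketch: soft-margin dual (12) via quadprog, with box constraints 0 <= alpha <= C.
library(quadprog)

set.seed(3)
X <- rbind(matrix(rnorm(40, mean = -1.5), ncol = 2),
           matrix(rnorm(40, mean =  1.5), ncol = 2))   # overlapping toy classes
y <- c(rep(-1, 20), rep(1, 20))
N <- nrow(X); C <- 1

Dmat <- (y %*% t(y)) * (X %*% t(X)) + diag(1e-8, N)
Amat <- cbind(y, diag(N), -diag(N))     # sum(y*alpha) = 0, alpha >= 0, -alpha >= -C
bvec <- c(0, rep(0, N), rep(-C, N))

alpha <- solve.QP(Dmat, rep(1, N), Amat, bvec, meq = 1)$solution

w_hat <- colSums(alpha * y * X)
free  <- which(alpha > 1e-6 & alpha < C - 1e-6)        # free SVs: 0 < alpha_n < C
b_hat <- mean(y[free] - X[free, , drop = FALSE] %*% w_hat)
```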
19. We are now able to fit the kernel soft-margin SVM, which is similar to the hard-margin kernel SVM: a QP with $N$ variables and $2N + 1$ constraints,
\[
\begin{aligned}
\min_{\boldsymbol{\alpha}}\ & \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m y_n y_m K_{\Phi}(\mathbf{x}_n,\mathbf{x}_m) - \sum_{n=1}^{N}\alpha_n,\\
\text{subject to}\ & \sum_{n=1}^{N} y_n\alpha_n = 0,\quad 0 \le \alpha_n \le C;\ n = 1,\dots,N,\\
\text{implicitly}\ & \mathbf{w} = \sum_{n=1}^{N}\alpha_n y_n\boldsymbol{\Phi}(\mathbf{x}_n),\quad \beta_n = C - \alpha_n;\ n = 1,\dots,N.
\end{aligned}
\]
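The same QP works with a kernel matrix in place of $\mathbf{x}_n^\top\mathbf{x}_m$. This sketch reuses the toy data (X, y) and the quadprog constraint objects from the previous sketch, with a Gaussian kernel and the arbitrary choices $\gamma = 1$, $C = 1$:

```r
## Sketch: kernel soft-margin dual with a Gaussian kernel matrix; the classifier
## g(x) = sign(sum_n alpha_n y_n K(x_n, x) + b) never forms w explicitly.
gamma <- 1
K     <- exp(-gamma * as.matrix(dist(X))^2)   # K[n, m] = exp(-gamma ||x_n - x_m||^2)

Dmat  <- (y %*% t(y)) * K + diag(1e-8, N)
alpha <- solve.QP(Dmat, rep(1, N), Amat, bvec, meq = 1)$solution

free  <- which(alpha > 1e-6 & alpha < C - 1e-6)
b_hat <- mean(y[free] - K[free, , drop = FALSE] %*% (alpha * y))
g_hat <- sign(K %*% (alpha * y) + b_hat)      # in-sample kernel predictions
```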
(d) If $\xi_n = 0$, it implies that the sample $(y_n,\mathbf{x}_n)$ is correctly classified and lies on or outside the margin (strictly outside the margin when, in addition, $\alpha_n = 0$).
(f) $\alpha_n$ and $\xi_n$ can be used for data analysis.
[Figure: illustration of the soft-margin boundary and margin violations (axes $x_1$, $x_2$).]
[Figure: multi-panel plots of soft-margin SVM fits on simulated data (axes $x_1$, $x_2$).]
5 WEIGHTED SVMS
1. For each vector $\mathbf{x}_n$, suppose that the classification error is weighted by $u_n$ when the sample point $(y_n,\mathbf{x}_n)$ is incorrectly classified.
2. This is equivalent to considering an extra number of sample points, say $r u_n$, that overlap at $\mathbf{x}_n$, where $r$ is the smallest integer such that $r u_n$, $n = 1,\dots,N$, are all integers.
4. Denote the margin violation of the replicated samples $y_{n,k}$, $k = 1,\dots,r u_n$, by $\xi_{n,k}$, and note that $\xi_{n,k} = \xi_n$ for $k = 1,\dots,r u_n$.
The soft-margin SVM over the replicated samples is
\[
\begin{aligned}
\min_{b,\mathbf{w},\boldsymbol{\xi}}\ & \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_{n=1}^{N}\sum_{k=1}^{r u_n}\xi_{n,k},\\
\text{subject to}\ & y_{n,k}(\mathbf{w}^\top\mathbf{x}_{n,k} + b) \ge 1 - \xi_{n,k},\\
& \xi_{n,k} \ge 0;\quad n = 1,\dots,N,\ k = 1,\dots,r u_n,
\end{aligned}
\]
where $C$ is the trade-off between a large margin and margin violations.
7. Since $y_{n,k} = y_n$, $\mathbf{x}_{n,k} = \mathbf{x}_n$, and $\xi_{n,k} = \xi_n$, the optimization problem can then be re-written as:
\[
\begin{aligned}
\min_{b,\mathbf{w},\boldsymbol{\xi}}\ & \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_{n=1}^{N} r u_n\xi_n,\\
\text{subject to}\ & y_n(\mathbf{w}^\top\mathbf{x}_n + b) \ge 1 - \xi_n,\\
& \xi_n \ge 0;\quad n = 1,\dots,N.
\end{aligned}
\]
where
\[
\mathcal{L}(b,\mathbf{w},\boldsymbol{\xi},\boldsymbol{\alpha},\boldsymbol{\beta})
= \frac{1}{2}\mathbf{w}^\top\mathbf{w} + C\sum_{n=1}^{N} r u_n\xi_n + \sum_{n=1}^{N}\beta_n(-\xi_n)
+ \sum_{n=1}^{N}\alpha_n\bigl(1 - \xi_n - y_n(\mathbf{w}^\top\mathbf{x}_n + b)\bigr).
\]
11. Under the strong duality and by (13) and (14), we can further simplify the dual problem as:
\[
\max_{\substack{0 \le \alpha_n \le r u_n C,\\ \sum_{n=1}^{N}\alpha_n y_n = 0,\ \mathbf{w} = \sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n}}
\Bigl\{-\frac{1}{2}\Bigl\|\sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n\Bigr\|^2 + \sum_{n=1}^{N}\alpha_n\Bigr\},
\]
or equivalently
\[
\begin{aligned}
\min_{\boldsymbol{\alpha}}\ & \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m y_n y_m\mathbf{x}_n^\top\mathbf{x}_m - \sum_{n=1}^{N}\alpha_n,\\
\text{subject to}\ & \sum_{n=1}^{N} y_n\alpha_n = 0,\quad 0 \le \alpha_n \le r u_n C;\ n = 1,\dots,N,\\
\text{implicitly}\ & \mathbf{w} = \sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n,\quad \beta_n = r u_n C - \alpha_n;\ n = 1,\dots,N.
\end{aligned}
\tag{15}
\]
13. Note that the effect of $r$ can be incorporated into the tuning of the penalty $C$; hence we can absorb $r$ into $C$ in (15), which turns into:
\[
\begin{aligned}
\min_{\boldsymbol{\alpha}}\ & \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m y_n y_m\mathbf{x}_n^\top\mathbf{x}_m - \sum_{n=1}^{N}\alpha_n,\\
\text{subject to}\ & \sum_{n=1}^{N} y_n\alpha_n = 0,\quad 0 \le \alpha_n \le u_n C;\ n = 1,\dots,N,\\
\text{implicitly}\ & \mathbf{w} = \sum_{n=1}^{N}\alpha_n y_n\mathbf{x}_n,\quad \beta_n = u_n C - \alpha_n;\ n = 1,\dots,N.
\end{aligned}
\tag{16}
\]
14. Let $S = \{n \in \{1,\dots,N\} : 0 < \alpha_n \le u_n C\}$ denote the set of all support vectors.
The optimal $b$ can again be obtained from the free support vectors, where $S^{+} = \{n \in S : y_n = +1 \text{ and } 0 < \alpha_n < u_n C\}$ and $S^{-} = \{n \in S : y_n = -1 \text{ and } 0 < \alpha_n < u_n C\}$ are the sets of free support vectors.
16. In summary:
(a) The optimization difference between the standard soft-margin SVM and the weighted soft-margin SVM lies in how the upper bounds of the dual variables (i.e., $\alpha_n$) are determined.
(b) In a standard soft-margin SVM, the upper bound of the dual variables is fixed and determined solely by the penalty parameter $C$.
(c) However, in a weighted soft-margin SVM, the upper bounds of the dual variables are adjusted by multiplying the penalty parameter $C$ with the predefined weights $u_n$ associated with each data point.
(d) This adjustment allows for differential treatment of individual data points based on their importance or significance, as determined by the weights assigned to them.
(e) In other words, the significance of a vector (i.e., $\mathbf{x}_n$) in constructing the classifier (i.e., $\mathbf{w}$) increases with higher weights (i.e., $u_n$) assigned to it.
(f) This reflects the importance attributed to data points with higher weights, allowing them to contribute more substantially to the classifier’s construction.
1. Code in R:
2. Loading Required Libraries: The WeightSVM library provides functions for weighted SVM.
3. Generating Example Data: We create a synthetic dataset with two features (x1 and x2) and a binary response variable y. There are 50 samples each for class -1 and class +1.
4. Fitting the Standard Soft-Margin SVM:
(a) We fit a standard soft-margin SVM using the wsvm function from the WeightSVM package.
(b) We specify the input features x, the response variable y, and set the weights to be equal (rep(1, 100)).
(c) The kernel is set to linear and the cost parameter ($C$) is set to 10.
(d) scale = FALSE ensures that the dataset is not standardized.
(e) The coefficients of the SVM model are obtained using coef(model).
5. Plotting the Standard SVM Fit:
(a) We plot the data points with different colors and shapes based on their class labels.
(b) The separating hyperplane and the margin are plotted using the coefficients obtained from the model.
(c) Support vectors are identified and plotted with different shapes.
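The chapter's original listing is not reproduced here; the following is a hedged reconstruction of steps 2 to 5, following the function and argument names described in the text (WeightSVM::wsvm with weight, kernel, cost, scale, and coef(); exact accessors may vary by package version):

```r
## Hedged reconstruction of the standard soft-margin fit described in steps 2-5.
library(WeightSVM)

set.seed(123)
x <- rbind(matrix(rnorm(100, mean = -2), ncol = 2),   # 50 samples for class -1
           matrix(rnorm(100, mean =  2), ncol = 2))   # 50 samples for class +1
y <- factor(c(rep(-1, 50), rep(1, 50)))

model <- wsvm(x, y, weight = rep(1, 100),             # equal weights: standard soft-margin SVM
              kernel = "linear", cost = 10, scale = FALSE)

coef(model)                    # coefficients of the linear classifier, as described in the text
table(predict(model, x), y)    # quick in-sample check of the fitted classifier
```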
6. Defining Different Weights for the Weighted Soft-Margin SVM: We define different weights for the positive and negative samples. In this case, positive samples are weighted 3 times more than negative samples.
7. Fitting Weighted Soft-Margin SVM: We fit a weighted soft-margin SVM using the same procedure as the standard
SVM, but with the specified weights.
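A matching hedged sketch of the weighted fit in steps 6 and 7, reusing x, y, and wsvm from the sketch above; positive samples receive weight 3 and negative samples weight 1:

```r
## Hedged sketch of the weighted soft-margin fit: heavier weights on the positive class.
w_vec <- ifelse(y == "1", 3, 1)               # positive samples weighted 3x more

model_w <- wsvm(x, y, weight = w_vec,
                kernel = "linear", cost = 10, scale = FALSE)

coef(model_w)     # compare with coef(model): the boundary adapts to the heavier class weights
```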
8. Plotting the Weighted SVM Fit:
(a) Similar to the standard SVM, we plot the data points, separating hyperplane, margin, and support vectors for the weighted SVM.
(b) The code demonstrates the effect of different weights on the decision boundary and margin of SVMs using the same dataset.
(c) The decision boundary may change based on the weights assigned to different samples, impacting how the SVM model classifies new data points.