
Support Vector Machine: An Overview

Shihabudheen KV
Dept. of Electrical Engineering, NIT Calicut
Support Vector Machines
• Support Vector Machines are arguably the most important and interesting recent discovery in Machine Learning.
• Support Vector Machines were introduced by Vladimir Naumovich Vapnik.


Which Separating Hyperplane to Use?
[Figure: several candidate hyperplanes separating two classes of points in the (Var1, Var2) plane.]
Maximizing the Margin
Select the separating hyperplane that maximizes the margin!
[Figure: two candidate hyperplanes in the (Var1, Var2) plane, each shown with its margin width; the one with the larger margin is preferred.]
Support Vectors
[Figure: the maximum-margin hyperplane in the (Var1, Var2) plane; the training points lying on the margin boundaries are the support vectors.]
Example
Training points: x1 = [0 0]^T and x2 = [1 0]^T with label +1; x3 = [2 0]^T and x4 = [0 2]^T with label -1.
[Figure: the points (0,0), (1,0), (2,0), (0,2) plotted in the plane, together with the candidate separating lines H1 and H2.]

Equation forms of a straight line in the plane

Intercept form:  x1/p + x2/q = 1,   e.g.  x1/1.5 + x2/1.5 = 1
General form:    a x1 + b x2 + c = 0,   e.g.  -2x1 - 2x2 + 3 = 0,  i.e.  [-2 -2][x1 x2]^T + 3 = 0,  or  w^T x + b = 0
The other two lines of the figure in intercept form:  x1/1 + x2/1 = 1  and  x1/2 + x2/2 = 1
Equations of the three hyperplanes:
  H1: -2x1 - 2x2 + 2 = 0
  H3: -2x1 - 2x2 + 3 = 0   (the separating hyperplane)
  H2: -2x1 - 2x2 + 4 = 0

Distance from the line a x1 + b x2 + c = 0 to a point P(x0, y0):
  |a x0 + b y0 + c| / √(a² + b²)
If P is the origin, the distance is |c| / √(a² + b²).

Distance from the origin to H1 = 2 / √(2² + 2²) = 2/√8
Distance from the origin to H2 = 4/√8
Distance from the origin to H3 = 3/√8
Distance between H1 and H2 = (4 - 2)/√8 = 2/√8 = 2/‖w‖

Rewriting the same three lines relative to the middle one (w = [-2 -2]^T, b = 3):
  -2x1 - 2x2 + 2 = 0  ⇔  -2x1 - 2x2 + 3 = +1  ⇔  2x1 + 2x2 - 3 = -1
  -2x1 - 2x2 + 3 = 0  ⇔  -2x1 - 2x2 + 3 =  0  ⇔  2x1 + 2x2 - 3 =  0
  -2x1 - 2x2 + 4 = 0  ⇔  -2x1 - 2x2 + 3 = -1  ⇔  2x1 + 2x2 - 3 = +1

[Figure: the points (0,0), (1,0), (2,0), (0,2) with the lines H1 and H2; the +1 points lie in the region 2x1 + 2x2 - 3 ≤ -1 and the -1 points in the region 2x1 + 2x2 - 3 ≥ +1, so every training point satisfies y_i(w^T x_i + b) ≥ 1.]
Maximizing the margin width 2/‖w‖ subject to y_i(w^T x_i + b) ≥ 1 for all i is equivalent to minimizing (1/2) w^T w under the same constraints. This is a constrained optimization problem; solving it requires some new tools.
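As a quick numerical check of the hand-worked example, the following sketch (assuming NumPy and scikit-learn are available) fits a linear SVM with a very large C, which approximates the hard-margin problem stated above, and recovers the canonical hyperplane -2x1 - 2x2 + 3 = 0.

# Minimal sketch (scikit-learn/NumPy assumed): hard margin approximated by a large C.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w =", clf.coef_[0])                          # expected roughly [-2, -2]
print("b =", clf.intercept_[0])                     # expected roughly 3
print("support vectors:\n", clf.support_vectors_)   # (1,0), (2,0) and (0,2)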
Lagrange function

Construct the Lagrange function (or Lagrangian):

  J(w, b, α) = (1/2) w^T w − Σ_{i=1}^{N} α_i [ y_i (w^T x_i + b) − 1 ]

where the α_i are Lagrange multipliers.

According to the Karush-Kuhn-Tucker (KKT) conditions, for non-zero multipliers the corresponding constraints are met exactly. Formally,

  α_i [ y_i (w^T x_i + b) − 1 ] = 0,   i = 1, …, N.

The patterns x_i for which α_i > 0 are called support vectors.
Setting ∂J(w, b, α)/∂w = 0:
  ∂/∂w { (1/2) w^T w − Σ_{i=1}^{N} α_i [ y_i (w^T x_i + b) − 1 ] } = 0
  ⇒  w − Σ_{i=1}^{N} α_i y_i x_i = 0
  ⇒  w = Σ_{i=1}^{N} α_i y_i x_i

Setting ∂J(w, b, α)/∂b = 0:
  ∂/∂b { (1/2) w^T w − Σ_{i=1}^{N} α_i [ y_i (w^T x_i + b) − 1 ] } = 0
  ⇒  Σ_{i=1}^{N} α_i y_i = 0
Two conditions of optimality:
  Condition 1:  ∂J(w, b, α)/∂w = 0  ⇒  w = Σ_{i=1}^{N} α_i y_i x_i
  Condition 2:  ∂J(w, b, α)/∂b = 0  ⇒  Σ_{i=1}^{N} α_i y_i = 0
Dual problem
Given a constrained optimization problem with a convex cost function and linear constraints, a dual problem can be formulated in which the Lagrange multipliers provide the solution.

Duality Theorem (Bertsekas, 1995)
(a) If the primal problem has an optimal solution, then the dual problem has an optimal solution with the same optimal value.
(b) For w_o to be an optimal primal solution and α_o to be an optimal dual solution, it is necessary and sufficient that w_o is feasible for the primal problem and that
  Φ(w_o) = J(w_o, b_o, α_o) = min_w J(w, b_o, α_o),
where Φ denotes the primal cost function.
Since
  w = Σ_{i=1}^{N} α_i y_i x_i
we have
  w^T w = Σ_{i=1}^{N} α_i y_i w^T x_i

We can then expand J(w, b, α):
  J(w, b, α) = (1/2) w^T w − Σ_{i=1}^{N} α_i [ y_i (w^T x_i + b) − 1 ]
             = (1/2) w^T w − Σ_{i=1}^{N} α_i y_i w^T x_i − b Σ_{i=1}^{N} α_i y_i + Σ_{i=1}^{N} α_i
             = (1/2) w^T w − w^T w − b Σ_{i=1}^{N} α_i y_i + Σ_{i=1}^{N} α_i
             = −(1/2) w^T w − b Σ_{i=1}^{N} α_i y_i + Σ_{i=1}^{N} α_i

Since Σ_{i=1}^{N} α_i y_i = 0, the middle term vanishes. Hence
  J(w, b, α) = −(1/2) w^T w + Σ_{i=1}^{N} α_i
Furthermore, using w = Σ_{i=1}^{N} α_i y_i x_i,
  w^T w = Σ_{i=1}^{N} α_i y_i w^T x_i = Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_i^T x_j

Accordingly, the objective function can be reformulated as
  J(w, b, α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_i^T x_j

Setting Q(α) = J(w, b, α), we obtain
  Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_i^T x_j,   where α_i ≥ 0
Dual Problem

The dual problem: given the training sample {(x_i, y_i)}, find the Lagrange multipliers {α_i} that maximize the objective function
  Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_i^T x_j
subject to the constraints
  Σ_{i=1}^{N} α_i y_i = 0
  α_i ≥ 0   for i = 1, 2, …, N

The dual problem is cast entirely in terms of the training data: the objective Q(α) to be maximized depends on the input patterns only through the dot products x_i^T x_j.
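To make the dependence on dot products concrete, here is a small sketch (NumPy assumed; the function name is illustrative) that evaluates Q(α) for the linear kernel directly from the Gram matrix.

# Sketch (NumPy assumed): evaluate the dual objective
#   Q(alpha) = sum_i alpha_i - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j x_i^T x_j
import numpy as np

def dual_objective(alpha, X, y):
    """Q(alpha) for the linear kernel; X is (N, d), y in {-1, +1}, alpha is (N,)."""
    K = X @ X.T                          # Gram matrix of dot products x_i^T x_j
    return alpha.sum() - 0.5 * alpha @ (np.outer(y, y) * K) @ alpha

# The four training points of the running example, with a feasible guess for alpha
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
alpha = np.array([0.0, 1.0, 0.5, 0.5])   # satisfies sum(alpha * y) = 0 and alpha >= 0
print(dual_objective(alpha, X, y))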
Classifying new data points
• Once the parameters (α*, b*) are found by solving the required quadratic optimisation on the training set of points, the SVM is ready to be used for classifying new points.
• Given a new point x, its class membership is sign[f(x, α*, b*)], where
    f(x, α*, b*) = w*^T x + b* = Σ_{i=1}^{N} α_i* y_i x_i^T x + b* = Σ_{i∈SV} α_i* y_i x_i^T x + b*
• Data enter only in the form of dot products!
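The decision function depends only on the support vectors. A small sketch (NumPy assumed; names are illustrative), using multipliers for the running example chosen so that w = Σ α_i y_i x_i = [-2 -2]^T and Σ α_i y_i = 0:

# Sketch (NumPy assumed): classify a new point using only the support vectors,
#   f(x) = sum_{i in SV} alpha_i* y_i (x_i . x) + b*.
import numpy as np

def svm_decision(x_new, alpha, b, X_sv, y_sv):
    return np.sum(alpha * y_sv * (X_sv @ x_new)) + b

# Support vectors of the running example; alpha chosen so that
# w = sum_i alpha_i y_i x_i = [-2, -2] and sum_i alpha_i y_i = 0, with b* = 3.
X_sv  = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
y_sv  = np.array([+1.0, -1.0, -1.0])
alpha = np.array([4.0, 3.0, 1.0])
b     = 3.0

for x in ([0.0, 0.0], [1.5, 1.5]):
    f = svm_decision(np.array(x), alpha, b, X_sv, y_sv)
    print(x, "->", int(np.sign(f)))   # +1 on the origin side of the hyperplane, -1 beyond it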
In summary, the dual problem is:
  maximise   Q(α) = −(1/2) Σ_{i,j=1}^{N} α_i α_j y_i y_j x_i^T x_j + Σ_{i=1}^{N} α_i
  subject to Σ_{i=1}^{N} α_i y_i = 0,   α_i ≥ 0,  i = 1, …, N

Applying this to the four-point example (x1, x2 with y = +1; x3, x4 with y = −1) and the linear kernel K(x_i, x_j) = x_i^T x_j:

  Σ_i α_i = α1 + α2 + α3 + α4

  The Gram matrix is
    K = [ 0  0  0  0
          0  1  2  0
          0  2  4  0
          0  0  0  4 ]
  e.g.  K_23 = x2^T x3 = [1 0][2 0]^T = 2.

  −(1/2) Σ_{i,j=1}^{N} α_i α_j y_i y_j x_i^T x_j = −(1/2) [α1  α2  −α3  −α4] K [α1  α2  −α3  −α4]^T

Hence
  Q(α) = (α1 + α2 + α3 + α4) − (1/2)(α2² − 4 α2 α3 + 4 α3² + 4 α4²)
subject to
  α1 + α2 − α3 − α4 = 0,   α_i ≥ 0,  i = 1, …, 4.
Non-Linearly Separable Data

Introduce slack variables ξ_i: allow some instances to fall within the margin, but penalize them.

  w·x_i + b ≥ +1 − ξ_i   for y_i = +1
  w·x_i + b ≤ −1 + ξ_i   for y_i = −1
  ξ_i ≥ 0

or, compactly,
  y_i (w·x_i + b) ≥ 1 − ξ_i

[Figure: the hyperplane w·x + b = 0 and the margin boundaries w·x + b = ±1 in the (Var1, Var2) plane; the margin half-width is 1/‖w‖, and the slack ξ_i measures how far a violating point falls inside the margin.]
Formulating the Optimization Problem

The constraint becomes:
  y_i (w·x_i + b) ≥ 1 − ξ_i,  ∀x_i,    ξ_i ≥ 0

The objective function penalizes misclassified instances and those within the margin by introducing an extra cost term:
  min  (1/2)‖w‖² + C Σ_i ξ_i

C trades off margin width against misclassifications.

[Figure: the same geometry as before, with slack variables ξ_i attached to the points that violate the margin.]
Linear, Soft-Margin SVMs

  min  (1/2)‖w‖² + C Σ_i ξ_i
  s.t.  y_i (w·x_i + b) ≥ 1 − ξ_i,  ∀x_i,    ξ_i ≥ 0

• As C → ∞, we get closer to the hard-margin solution.
• The soft-margin problem always has a solution.
• It yields smoother surfaces (in the non-linear case).
• The ξ_i do not appear in the dual problem.


Robustness of Soft vs Hard Margin SVMs

[Figure: the decision boundary w·x + b = 0 for a soft-margin SVM (left) and a hard-margin SVM (right) on the same data; the soft-margin SVM absorbs an outlier with a small slack ξ_i and keeps a wide margin.]

A small numerical sketch of this trade-off is given below.
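The sketch below (assuming scikit-learn and NumPy; the data, including the added outlier, are made up for illustration) shows how C controls the trade-off: a small C keeps a wide margin and tolerates the outlier with a large slack, while a large C approaches the hard-margin behaviour and shrinks the margin to classify the outlier correctly.

# Sketch (scikit-learn/NumPy assumed): the earlier four points plus a -1 "outlier"
# placed close to the +1 class.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 2.0], [1.2, 0.2]])
y = np.array([+1, +1, -1, -1, -1])

for C in (0.1, 1.0, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # xi_i = max(0, 1 - y_i f(x_i))
    print(f"C={C:7.1f}  margin width={2.0 / np.linalg.norm(w):.3f}  total slack={slack.sum():.3f}")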
The Soft-Margin Classifier (cont.)
The Lagrangian for the soft-margin problem is

  L_p(w, b, ξ, α, r) = (1/2)⟨w, w⟩ + C Σ_{i=1}^{l} ξ_i − Σ_{i=1}^{l} α_i [ y_i (⟨x_i, w⟩ + b) − 1 + ξ_i ] − Σ_{i=1}^{l} r_i ξ_i        (1)

with α_i ≥ 0 and r_i ≥ 0. Differentiating with respect to w, ξ_i and b:

  ∂L_p/∂w   = w − Σ_{i=1}^{l} α_i y_i x_i = 0        (2)
  ∂L_p/∂ξ_i = C − α_i − r_i = 0                      (3)
  ∂L_p/∂b   = Σ_{i=1}^{l} α_i y_i = 0                (4)

Substituting (2), (3) and (4) into the primal Lagrangian (1), we can form the dual formulation of the soft-margin problem.
Soft Margin: Dual Lagrangian

The dual problem:
  maximise  L_D = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩
  subject to  0 ≤ α_i ≤ C   and   Σ_i α_i y_i = 0

The non-separable case differs from the separable case only in that the constraint α_i ≥ 0 is replaced by the more stringent box constraint 0 ≤ α_i ≤ C (a small numerical sketch follows below).
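A minimal sketch of this change (assuming NumPy and SciPy; SLSQP is used purely for illustration): the dual QP of the running example is unchanged except that each α_i is now confined to the box [0, C].

# Sketch (NumPy/SciPy assumed): soft-margin dual of the four-point example.
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
H = np.outer(y, y) * (X @ X.T)                            # H_ij = y_i y_j x_i^T x_j

def solve_soft_dual(C):
    res = minimize(lambda a: 0.5 * a @ H @ a - a.sum(),   # minimize -Q(alpha)
                   x0=np.zeros(4),
                   bounds=[(0.0, C)] * 4,                 # 0 <= alpha_i <= C
                   constraints={"type": "eq", "fun": lambda a: a @ y})
    return np.round(res.x, 3)

print(solve_soft_dual(1e6))   # effectively hard margin: roughly [0, 4, 3, 1]
print(solve_soft_dual(1.0))   # some multipliers clipped at C=1: roughly [0.5, 1, 1, 0.5]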
Mapping Data to a High-Dimensional Space
• Find a function φ(x) that maps the data to a different space; the SVM formulation then becomes:
    min  (1/2)‖w‖² + C Σ_i ξ_i
    s.t.  y_i (w·φ(x_i) + b) ≥ 1 − ξ_i,  ∀x_i,    ξ_i ≥ 0
• Data appear as φ(x); the weights w are now weights in the new space.
• Explicit mapping is expensive if φ(x) is very high dimensional.
• Solving the problem without explicitly mapping the data is desirable.
The Dual of the SVM Formulation

Original SVM formulation:
    min_{w,b}  (1/2)‖w‖² + C Σ_i ξ_i
    s.t.  y_i (w·φ(x_i) + b) ≥ 1 − ξ_i,  ∀x_i,    ξ_i ≥ 0
  – n inequality constraints
  – n positivity constraints
  – n slack variables ξ_i

The dual of this problem:
    max_α  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j ( φ(x_i)·φ(x_j) )
    s.t.  C ≥ α_i ≥ 0,  ∀x_i,    Σ_i α_i y_i = 0
  – n box constraints
  – one equality constraint
  – n variables α_i (Lagrange multipliers)
  – a more complicated objective function

Data only appear as the inner products φ(x_i)·φ(x_j).
The Kernel Trick
• φ(x_i)·φ(x_j) means: map the data into the new space, then take the inner product of the new vectors.
• We can find a function such that K(x_i, x_j) = φ(x_i)·φ(x_j), i.e. the kernel evaluated on the data equals the inner product of the images of the data.
• Then we do not need to explicitly map the data into the high-dimensional space to solve the optimization problem (for training).

Example 1: x, y are points in a 2-D input space, mapped into a 3-D feature space:
  x = [x1 x2]^T,  y = [y1 y2]^T,  φ(x) = [x1², √2 x1 x2, x2²]^T
  ⇒  K(x, y) = φ(x)·φ(y) = (x·y)²
 
◼ The linear classifier relies on the dot product between vectors: K(x_i, x_j) = x_i^T x_j.
◼ If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes K(x_i, x_j) = φ(x_i)^T φ(x_j).
◼ A kernel function is some function that corresponds to an inner product in some expanded feature space.
◼ Example 2: 2-dimensional vectors x = [x1 x2]; let K(x_i, x_j) = (1 + x_i^T x_j)².
  We need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j) for some mapping φ:
    K(x_i, x_j) = (1 + x_i^T x_j)²
                = 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2
                = [1  x_i1²  √2 x_i1 x_i2  x_i2²  √2 x_i1  √2 x_i2]^T [1  x_j1²  √2 x_j1 x_j2  x_j2²  √2 x_j1  √2 x_j2]
                = φ(x_i)^T φ(x_j),   where φ(x) = [1  x1²  √2 x1 x2  x2²  √2 x1  √2 x2]^T
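A quick numerical check of Example 2 (NumPy assumed; function names are illustrative):

# Sketch (NumPy assumed): verify that (1 + x.y)^2 equals phi(x).phi(y) for
# phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2].
import numpy as np

def phi(x):
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=2), rng.normal(size=2)

lhs = (1.0 + xi @ xj) ** 2        # kernel evaluated in the 2-D input space
rhs = phi(xi) @ phi(xj)           # inner product in the 6-D feature space
print(np.isclose(lhs, rhs))       # True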
Examples of Kernel Functions
◼ Linear:  K(x_i, x_j) = x_i^T x_j
◼ Polynomial of power p:  K(x_i, x_j) = (1 + x_i^T x_j)^p
◼ Gaussian (radial-basis function network):  K(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) )
◼ Sigmoid:  K(x_i, x_j) = tanh(β0 x_i^T x_j + β1)
(Minimal implementations of these kernels are sketched below.)
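Minimal implementations of the four kernels (NumPy assumed; parameter names are illustrative):

# Sketch (NumPy assumed): the kernels listed above, written for single vectors.
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def poly_kernel(xi, xj, p=2):
    return (1.0 + xi @ xj) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, poly_kernel, rbf_kernel, sigmoid_kernel):
    print(k.__name__, k(xi, xj))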


SVM-Architecture
[Figure: the SVM network architecture: the input vector x feeds one kernel unit K(x, x_i) per support vector; the kernel outputs are combined with weights α_i y_i and the bias b, and the sign of the sum gives the class.]
XOR Problem
Kernel: K(x_i, x_j) = (1 + x_i^T x_j)²

Input vector X = (x1, x2)     Class label Y ∈ {+1, −1}
  ( 1,  1)   →  −1
  ( 1, −1)   →  +1
  (−1,  1)   →  +1
  (−1, −1)   →  −1

K(x_i, x_j) = (1 + [x_i1  x_i2] [x_j1  x_j2]^T)² = (1 + x_i1 x_j1 + x_i2 x_j2)²
            = 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2

The Gram matrix is
  K = [ 9 1 1 1
        1 9 1 1
        1 1 9 1
        1 1 1 9 ]
e.g. (MATLAB check)  K11 = (1+[-1 -1]*[-1 -1]')^2 = 9,   K12 = (1+[-1 -1]*[-1 1]')^2 = 1

Notice that in the dual problem the images of the input vectors are involved only through inner products, so the optimisation can be performed in the (lower-dimensional) input space, with the inner product replaced by the inner-product kernel:

  Q(α) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j d_i d_j φ(x_i)^T φ(x_j)
       = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j d_i d_j K(x_i, x_j)
       = α1 + α2 + α3 + α4 − (1/2)( 9α1² − 2α1α2 − 2α1α3 + 2α1α4 + 9α2² + 2α2α3 − 2α2α4 + 9α3² − 2α3α4 + 9α4² )

Setting ∂Q/∂α_i = 0 gives the linear system
  1 = 9α1 − α2 − α3 + α4
  1 = −α1 + 9α2 + α3 − α4
  1 = −α1 + α2 + 9α3 − α4
  1 = α1 − α2 − α3 + 9α4
whose solution is α1 = α2 = α3 = α4 = 1/8 (which also satisfies the constraint Σ_i d_i α_i = 0); a numerical check follows below.
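The Gram matrix and the solution α_i = 1/8 can be checked numerically (NumPy assumed):

# Sketch (NumPy assumed): build the XOR Gram matrix with K(xi, xj) = (1 + xi.xj)^2
# and solve the stationarity equations grad Q(alpha) = 0.
import numpy as np

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
d = np.array([-1.0, +1.0, +1.0, -1.0])           # class labels

K = (1.0 + X @ X.T) ** 2                         # 9 on the diagonal, 1 elsewhere
H = np.outer(d, d) * K                           # H_ij = d_i d_j K(x_i, x_j)

alpha = np.linalg.solve(H, np.ones(4))           # the four linear equations above
print(alpha)                                     # [0.125 0.125 0.125 0.125]
print(alpha @ d)                                 # the constraint sum_i d_i alpha_i = 0 holds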
MultiClass SVMs
With C classes in total:
• One-versus-all (OVA): train C binary SVMs, one per class (that class versus all the rest).
• One-versus-one, also called all-vs-all (AVA): train C(C−1)/2 binary SVMs, one per pair of classes.
(Both strategies are sketched below.)
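A brief sketch with scikit-learn (assumed available; the tiny three-class dataset below is made up for illustration):

# Sketch (scikit-learn/NumPy assumed): with 3 classes, OVA trains 3 binary SVMs
# and OvO (AVA) trains 3*2/2 = 3 binary SVMs.
import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [3, 0], [3, 1], [0, 4], [1, 4]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])                            # three class labels

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
print(len(ovo.estimators_), len(ova.estimators_))           # number of binary SVMs trained
print(ovo.predict([[2.5, 0.5]]), ova.predict([[2.5, 0.5]]))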


Support Vector Regression
Linear ε-Insensitive Loss Regression

  min  (1/2)‖w‖² + C Σ_{i=1}^{l} (ξ_i + ξ_i*)
  subject to
    y_i − ⟨w, x_i⟩ − b ≤ ε + ξ_i
    ⟨w, x_i⟩ + b − y_i ≤ ε + ξ_i*
    ξ_i, ξ_i* ≥ 0

ε decides the width of the insensitive zone; C sets the trade-off between the training error and ‖w‖.
➢ ε and C must be tuned simultaneously.
Is regression more difficult than classification?
Dual Formulation
• The Lagrangian function will help us formulate the dual problem:

  L = (1/2)‖w‖² + C Σ_{i=1}^{l} (ξ_i + ξ_i*)
      − Σ_{i=1}^{l} α_i ( ε + ξ_i − y_i + ⟨w, x_i⟩ + b )
      − Σ_{i=1}^{l} α_i* ( ε + ξ_i* + y_i − ⟨w, x_i⟩ − b )
      − Σ_{i=1}^{l} ( β_i ξ_i + β_i* ξ_i* )

• ε: insensitive-zone width;  α_i, α_i*, β_i, β_i*: Lagrange multipliers;
  ξ_i: difference value for points above the ε-band;  ξ_i*: difference value for points below the ε-band.

• Optimality conditions:
  ∂L/∂w    = w − Σ_{i=1}^{l} (α_i − α_i*) x_i = 0
  ∂L/∂b    = Σ_{i=1}^{l} (α_i* − α_i) = 0
  ∂L/∂ξ_i  = C − α_i − β_i = 0
  ∂L/∂ξ_i* = C − α_i* − β_i* = 0
Dual Formulation (cont.)
• Dual problem:

  max  −(1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} (α_i − α_i*)(α_j − α_j*) ⟨x_i, x_j⟩ − ε Σ_{i=1}^{l} (α_i + α_i*) + Σ_{i=1}^{l} y_i (α_i − α_i*)

  subject to  Σ_i (α_i − α_i*) = 0,    α_i, α_i* ∈ [0, C]

• Solving gives  w = Σ_i (α_i − α_i*) x_i,  and the regression function is

  f(x) = Σ_{i=1}^{l} (α_i − α_i*) ⟨x_i, x⟩ + b
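In practice, ε-insensitive regression is available off the shelf; a minimal sketch (assuming scikit-learn and NumPy; the data and parameter values are illustrative):

# Sketch (scikit-learn/NumPy assumed): epsilon-insensitive SVR on noisy 1-D data.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0.0, 2.0 * np.pi, 60).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)    # epsilon sets the insensitive zone
print("support vectors used:", len(model.support_))         # points on or outside the eps-tube
print("prediction at pi/2:", model.predict([[np.pi / 2]]))  # roughly sin(pi/2) = 1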
Extended versions of SVM
• Least-squares support vector machine (LS-SVM)
  Suykens, J. A., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293-300.
• Proximal support vector machine (PSVM)
  Fung, G. M., & Mangasarian, O. L. (2005). Multicategory proximal support vector machine classifiers. Machine Learning, 59(1-2), 77-97.
Comparison with Neural Networks

Neural Networks:
• Hidden layers map to lower-dimensional spaces
• Search space has multiple local minima
• Training is expensive
• Classification is extremely efficient
• Requires the number of hidden units and layers to be chosen
• Very good accuracy in typical domains

SVMs:
• Kernel maps to a very high-dimensional space
• Search space has a unique minimum
• Training is extremely efficient
• Classification is extremely efficient
• Kernel and cost C are the two parameters to select
• Very good accuracy in typical domains
• Extremely robust
SVM: Nonlinear Case
Model Selection Procedure

- We have to decide which kernel function and which value of C to use.
- "In practice a Gaussian radial basis or a low degree polynomial kernel is a good start." [Andrew Moore]
- We then check which set of parameters (such as C, or γ if we choose the Gaussian radial basis kernel) is the most appropriate by K-fold cross-validation:
  1) Randomly divide all the available training examples into K equal-sized subsets.
  2) Use all but one subset to train the SVM with the chosen parameters.
  3) Use the held-out subset to measure the classification error.
  4) Repeat steps 2 and 3 for each subset.
  5) Average the results to get an estimate of the generalization error of the SVM classifier.
- The SVM is tested using this procedure for various parameter settings. In the end, the model with the smallest generalization error is adopted, and the SVM classifier is then trained with these parameters over the whole training set.
- For the Gaussian RBF kernel, trying exponentially growing sequences of C and γ is a practical method to identify good parameters. A good choice* is the following grid:
    C = 2^-5, 2^-4, …, 2^15
    γ = 2^-15, 2^-14, …, 2^3
  * This grid is suggested by LibSVM (an integrated and easy-to-use tool for SVM classification). A sketch of such a grid search is given below.
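The exponential grid above maps directly onto a cross-validated grid search; a sketch (assuming scikit-learn and NumPy; the dataset is only a placeholder):

# Sketch (scikit-learn/NumPy assumed): K-fold cross-validated grid search over the
# exponentially growing grid of C and gamma suggested above.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)               # placeholder dataset

param_grid = {
    "C":     2.0 ** np.arange(-5, 16),          # 2^-5 ... 2^15
    "gamma": 2.0 ** np.arange(-15, 4),          # 2^-15 ... 2^3
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))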
Software: Popular Implementations

SVMlight: http://svmlight.joachims.org/
By Joachims; one of the most widely used SVM classification and regression packages. Distributed as C++ source and binaries for Linux, Windows, Cygwin, and Solaris. Kernels: polynomial, radial basis function, and neural (tanh).

LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LIBSVM (Library for Support Vector Machines) is developed by Chang and Lin and is also widely used. Developed in C++ and Java, it also supports multi-class classification, weighted SVM for unbalanced data, cross-validation, and automatic model selection. It has interfaces for Python, R, Splus, MATLAB, Perl, Ruby, and LabVIEW. Kernels: linear, polynomial, radial basis function, and neural (tanh).

For applications: http://www.clopinet.com/isabelle/Projects/SVM/applist.html
