6.867 Nonlinear models, Kernels
Fall 2016
Dual problem
\[
\max_{\alpha}\;\; -\frac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, y_i y_j\,\langle x_i, x_j\rangle \;+\; \sum_i \alpha_i
\]
\[
\text{subject to}\quad \sum_i \alpha_i y_i = 0 \quad\text{and}\quad \alpha_i \in [0, C]
\]
Optimality (KKT) conditions
\[
w = \sum_i \alpha_i y_i x_i, \qquad
\alpha_i\bigl[\, y_i[\langle w, x_i\rangle + b] + \xi_i - 1 \,\bigr] = 0, \qquad
\xi_i\,\eta_i = 0
\]
(with \(\eta_i\) the multiplier for \(\xi_i \ge 0\)), which imply
\[
\alpha_i = 0 \;\Rightarrow\; y_i[\langle w, x_i\rangle + b] \ge 1, \qquad
0 < \alpha_i < C \;\Rightarrow\; y_i[\langle w, x_i\rangle + b] = 1, \qquad
\alpha_i = C \;\Rightarrow\; y_i[\langle w, x_i\rangle + b] \le 1
\]
courtesy: A. Smola (CMU, 2013)
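As a concrete illustration, here is a minimal NumPy sketch that recovers w from the expansion above and sorts training points into the three KKT regimes. The function name and the assumption that dual variables alpha, labels y in {-1, +1}, data matrix X, offset b, and C are already available from some solver are all hypothetical.

```python
import numpy as np

def kkt_regimes(alpha, X, y, b, C, tol=1e-6):
    """Sort training points into the three KKT regimes of the soft margin dual.

    alpha_i = 0      -> y_i [<w, x_i> + b] >= 1  (outside or on the margin)
    0 < alpha_i < C  -> y_i [<w, x_i> + b]  = 1  (free support vector, on the margin)
    alpha_i = C      -> y_i [<w, x_i> + b] <= 1  (bound support vector, margin violation)
    """
    w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
    margins = y * (X @ w + b)                # y_i [<w, x_i> + b]
    zero  = alpha <= tol
    free  = (alpha > tol) & (alpha < C - tol)
    bound = alpha >= C - tol
    return margins, zero, free, bound
```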
The Kernel Trick

Linear soft margin problem
\[
\min_{w,b}\;\; \frac{1}{2}\|w\|^2 + C\sum_i \xi_i
\]
\[
\text{subject to}\quad y_i[\langle w, x_i\rangle + b] \ge 1 - \xi_i \quad\text{and}\quad \xi_i \ge 0
\]
Dual problem
\[
\max_{\alpha}\;\; -\frac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, y_i y_j\,\langle x_i, x_j\rangle \;+\; \sum_i \alpha_i
\]
\[
\text{subject to}\quad \sum_i \alpha_i y_i = 0 \quad\text{and}\quad \alpha_i \in [0, C]
\]
Support vector expansion
\[
f(x) = \sum_i \alpha_i y_i\,\langle x_i, x\rangle + b
\]
courtesy: A. Smola (CMU, 2013)
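A short sketch of the linear support vector expansion above, assuming the dual variables alpha, labels y, training data X, and offset b come from an already-solved dual (the names are illustrative):

```python
import numpy as np

def decision_linear(alpha, X, y, b, x_new):
    """f(x) = sum_i alpha_i y_i <x_i, x> + b.
    Only points with alpha_i > 0 (the support vectors) contribute to the sum."""
    return float(np.sum(alpha * y * (X @ x_new)) + b)
```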
The Kernel Trick

Linear soft margin problem (in feature space)
\[
\min_{w,b}\;\; \frac{1}{2}\|w\|^2 + C\sum_i \xi_i
\]
\[
\text{subject to}\quad y_i[\langle w, \phi(x_i)\rangle + b] \ge 1 - \xi_i \quad\text{and}\quad \xi_i \ge 0
\]
Dual problem
\[
\max_{\alpha}\;\; -\frac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, y_i y_j\, k(x_i, x_j) \;+\; \sum_i \alpha_i
\]
\[
\text{subject to}\quad \sum_i \alpha_i y_i = 0 \quad\text{and}\quad \alpha_i \in [0, C]
\]
Support vector expansion
\[
f(x) = \sum_i \alpha_i y_i\, k(x_i, x) + b
\]
courtesy: A. Smola (CMU, 2013)
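The kernelized expansion differs only in replacing the inner product ⟨x_i, x⟩ with k(x_i, x); the feature map φ never has to be computed. A sketch with a Gaussian kernel, using the same hypothetical names as above (gamma is an arbitrary choice):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decision_kernel(alpha, X, y, b, x_new, gamma=1.0):
    """Kernelized support vector expansion: f(x) = sum_i alpha_i y_i k(x_i, x) + b."""
    k = np.array([rbf_kernel(x_i, x_new, gamma) for x_i in X])
    return float(np.sum(alpha * y * k) + b)
```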
[Figure slides: decision boundary plots of the soft margin SVM for C = 1, 2, 5, 10, 20, 50, 100; the sequence of plots is repeated three times.]
courtesy: A. Smola (CMU, 2013)
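Something close to these plots can be reproduced with off-the-shelf tools. A sketch (assuming scikit-learn and matplotlib, with make_moons as a stand-in for the slides' toy data) that sweeps C while keeping a fixed Gaussian kernel:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # toy 2-D data
xx, yy = np.meshgrid(np.linspace(-2, 3, 300), np.linspace(-1.5, 2, 300))

fig, axes = plt.subplots(1, 7, figsize=(21, 3))
for ax, C in zip(axes, [1, 2, 5, 10, 20, 50, 100]):
    clf = SVC(kernel="rbf", gamma=1.0, C=C).fit(X, y)
    # Decision function on a grid: 0 level is the boundary, +/-1 levels are the margin.
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contour(xx, yy, Z, levels=[-1, 0, 1], linestyles=["--", "-", "--"])
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)
    ax.set_title(f"C = {C}")
plt.show()
```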
now with a narrower kernel
[Figure slides: the same sequence of decision boundary plots with a narrower kernel.]
courtesy: A. Smola (CMU, 2013)
now with a very wide kernel
[Figure slides: the same sequence of decision boundary plots with a very wide kernel.]
courtesy: A. Smola (CMU, 2013)
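The kernel width can be swept the same way. Note that in scikit-learn's RBF kernel a larger gamma corresponds to a narrower kernel; this sketch reuses X, y from the previous block:

```python
from sklearn.svm import SVC

for gamma in [0.1, 1.0, 10.0, 100.0]:          # wide kernel ... narrow kernel
    clf = SVC(kernel="rbf", gamma=gamma, C=10).fit(X, y)
    print(f"gamma = {gamma:6.1f}: {clf.n_support_.sum():3d} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```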
Nonlinear separation

Increasing C allows for more nonlinearities and decreases the number of errors
The support vector boundary need not be contiguous
The kernel width adjusts the function class

[Figure 7.10 ("Pattern Recognition"): 2D toy example of a binary classification problem solved using a soft margin SVC. In all cases a Gaussian kernel is used; from left to right the kernel width decreases. For a large width the decision boundary is almost linear and the data set cannot be separated without error. Solid lines represent decision boundaries; dotted lines depict the edge of the margin, where the constraint holds with equality and ξ_i = 0.]

courtesy: A. Smola (CMU, 2013)
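Since both C and the kernel width adjust the effective function class, in practice they are usually chosen jointly by cross-validation. A minimal sketch, again assuming scikit-learn and the toy data X, y from the earlier sketches:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Grid values are illustrative; cross-validation picks the (C, gamma) pair
# with the best held-out accuracy.
param_grid = {"C": [1, 10, 100], "gamma": [0.1, 1.0, 10.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```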