Support Vector Machines & Kernels
Lecture 5
David Sontag
New York University
Slides adapted from Luke Zettlemoyer and Carlos Guestrin
Multi-class SVM
As in the binary SVM, we introduce slack variables and maximize the margin.
To predict, we take the highest-scoring class.
Now, how do we learn it? One common formulation is sketched below.
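A sketch of the standard multi-class objective, assuming one weight vector w_y per class and the same slack/penalty setup (C, ξ_j) as in the binary case:

\[
\min_{w,\xi}\ \tfrac{1}{2}\sum_{y}\|w_y\|^2 + C\sum_j \xi_j
\quad\text{s.t.}\quad w_{y_j}\cdot x_j \;\ge\; w_{y'}\cdot x_j + 1 - \xi_j \ \ \forall j,\ \forall y'\neq y_j,
\qquad \xi_j \ge 0
\]

\[
\text{To predict:}\quad \hat{y} = \arg\max_{y}\ w_y\cdot x
\]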
How to deal with imbalanced data?
• In many practical applications we may have
imbalanced data sets
• We may want errors to be equally distributed
between the positive and negative classes
• A slight modification to the SVM objective
does the trick!
Class-specific weighting of the slack variables
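A sketch of that modification, with separate (illustrative) slack penalties C₊ and C₋ for the positive and negative classes:

\[
\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2
\;+\; C_{+}\!\!\sum_{j:\,y_j=+1}\!\!\xi_j
\;+\; C_{-}\!\!\sum_{j:\,y_j=-1}\!\!\xi_j
\quad\text{s.t.}\quad y_j(w\cdot x_j + b)\ \ge\ 1-\xi_j,\qquad \xi_j \ge 0
\]

Raising C₊ relative to C₋ makes errors on the positive class more expensive.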
What’s Next!
• Learn one of the most interesting and
exciting recent advancements in machine
learning
– The “kernel trick”
– High dimensional feature spaces at no extra
cost!
• But first, a detour
– Constrained optimization!
Constrained optimization
Example: minimize x²
• No constraint: x* = 0
• Subject to x ≥ −1: x* = 0
• Subject to x ≥ 1: x* = 1
How do we solve with constraints?
Lagrange Multipliers!!!
Lagrange multipliers – Dual variables
Problem: minimize x² subject to x ≥ b
Rewrite the constraint: x − b ≥ 0
Add a Lagrange multiplier α ≥ 0 and introduce the Lagrangian (objective):
   L(x, α) = x² − α(x − b)
We will solve:
   min_x max_{α≥0} L(x, α)
Why is this equivalent?
• min is fighting max! (note the newly added constraint α ≥ 0 on the multiplier)
  – If x < b: (x − b) < 0, so max_{α≥0} −α(x − b) = ∞. min won't let this happen!
  – If x > b: (x − b) > 0, so max_{α≥0} −α(x − b) = 0 with α* = 0. min is cool with 0, and L(x, α) = x² (the original objective).
  – If x = b: α can be anything, and L(x, α) = x² (the original objective).
• The min on the outside forces max to behave, so the constraint will be satisfied.
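As a quick worked check of this toy problem (my own arithmetic, not from the slide): minimizing L(x, α) = x² − α(x − b) over x gives

\[
\frac{\partial L}{\partial x} = 2x - \alpha = 0 \;\Rightarrow\; x = \frac{\alpha}{2},
\qquad
g(\alpha) = L\!\left(\tfrac{\alpha}{2}, \alpha\right) = -\frac{\alpha^2}{4} + \alpha b .
\]

Maximizing g over α ≥ 0 gives α* = 2b and x* = b when b > 0 (e.g. x* = 1 for the constraint x ≥ 1), and α* = 0, x* = 0 when b ≤ 0, matching the constrained solutions above.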
Dual SVM derivation (1) – the linearly
separable case
Original optimization problem:
Rewrite constraints; one Lagrange multiplier per example
Lagrangian:
Our goal now is to solve:
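Spelled out, these are the standard hard-margin pieces (consistent with the ∂L/∂w expression used in the next step):

\[
\text{Original problem:}\quad \min_{w,b}\ \tfrac{1}{2}\|w\|^2
\quad\text{s.t.}\quad y_j(w\cdot x_j + b) \ge 1 \ \ \forall j,
\qquad\text{rewritten as}\quad y_j(w\cdot x_j + b) - 1 \ge 0
\]

\[
\text{Lagrangian:}\quad L(w,b,\alpha) = \tfrac{1}{2}\|w\|^2 - \sum_j \alpha_j\big[\,y_j(w\cdot x_j + b) - 1\,\big],
\qquad
\text{solve}\ \ \min_{w,b}\ \max_{\alpha\ge 0}\ L(w,b,\alpha)
\]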
Dual SVM derivation (2) – the linearly
separable case
(Primal)
Swap min and max
(Dual)
Slater’s condition from convex optimization guarantees that
these two optimization problems are equivalent!
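Schematically, the two problems are:

\[
\text{(Primal)}\quad \min_{w,b}\ \max_{\alpha\ge 0}\ L(w,b,\alpha)
\qquad\longleftrightarrow\qquad
\text{(Dual)}\quad \max_{\alpha\ge 0}\ \min_{w,b}\ L(w,b,\alpha)
\]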
Dual SVM derivation (3) – the linearly separable case

(Dual)   max_{α≥0} min_{w,b} L(w, b, α)

Can solve for the optimal w, b as a function of α:

   ∂L/∂w = w − Σ_j α_j y_j x_j = 0   ⇒   w = Σ_j α_j y_j x_j
   ∂L/∂b = −Σ_j α_j y_j = 0   ⇒   Σ_j α_j y_j = 0

Substituting these values back in (and simplifying), we obtain:

(Dual)   max_{α≥0}   Σ_j α_j − ½ Σ_{j,k} α_j α_k y_j y_k (x_j · x_k)   subject to   Σ_j α_j y_j = 0

The sums run over all training examples; the α_j α_k y_j y_k factors are scalars; x_j · x_k is a dot product.

So, in the dual formulation we will solve for α directly!
• w and b are computed from α (if needed)
Dual SVM derivation (3) – the linearly
separable case
Lagrangian:   L(w, b, α) = ½ ‖w‖² − Σ_j α_j [ y_j (w · x_j + b) − 1 ]

α_j > 0 for some j implies that the corresponding constraint is tight. We use this to obtain b:

(1)   y_j (w · x_j + b) = 1
(2)   b = y_j − w · x_j      (since y_j ∈ {−1, +1}, 1/y_j = y_j)
(3)   b = y_j − Σ_k α_k y_k (x_k · x_j)
Dual for the non-separable case – same basic
story (we will skip details)
Primal: Solve for w,b,α:
Dual:
What changed?
• Added upper bound of C on αi!
• Intuitive explanation:
• Without slack, αi → ∞ when constraints are violated (points misclassified)
• The upper bound of C limits the αi, so misclassifications are allowed
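For reference, the standard soft-margin primal and the dual it yields (this is where the upper bound on αj comes from):

\[
\text{(Primal)}\quad \min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_j \xi_j
\quad\text{s.t.}\quad y_j(w\cdot x_j + b)\ \ge\ 1-\xi_j,\ \ \xi_j\ge 0
\]

\[
\text{(Dual)}\quad \max_{\alpha}\ \sum_j \alpha_j - \tfrac{1}{2}\sum_{j,k}\alpha_j\alpha_k\, y_j y_k\,(x_j\cdot x_k)
\quad\text{s.t.}\quad 0\le \alpha_j\le C,\ \ \sum_j \alpha_j y_j = 0
\]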
Wait a minute: why did we learn about the dual
SVM?
• There are some quadratic programming
algorithms that can solve the dual faster than the
primal
– At least for small datasets
• But, more importantly, the “kernel trick”!!!
Reminder: What if the data is not
linearly separable?
Use features of features
of features of features….
φ(x) = ( x(1), …, x(n), x(1) x(2), x(1) x(3), …, e^{x(1)}, … )
Feature space can get really large really quickly!
Higher order polynomials
[Plot: number of monomial terms vs. number of input dimensions, for polynomial degrees d = 2, 3, 4]
m – number of input features, d – degree of polynomial
The number of terms grows fast! For d = 6, m = 100: about 1.6 billion terms.
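This count follows from a standard combinatorial fact (not stated on the slide): the number of monomials of degree exactly d in m input features is

\[
\binom{m+d-1}{d}, \qquad\text{e.g.}\quad \binom{105}{6}\approx 1.6\times 10^{9}\ \ \text{for } d=6,\ m=100 .
\]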
Dual formulation only depends on
dot-products, not on w!
First, we introduce features (remember: the examples x only appear in one dot product):
Next, replace the dot product with a kernel:
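That is, wherever the dot product between (featurized) examples appears in the dual objective, we substitute the kernel K(x_j, x_k):

\[
\max_{\alpha\ge 0}\ \sum_j \alpha_j - \tfrac{1}{2}\sum_{j,k}\alpha_j\alpha_k\, y_j y_k\,\phi(x_j)\cdot\phi(x_k)
\quad\longrightarrow\quad
\max_{\alpha\ge 0}\ \sum_j \alpha_j - \tfrac{1}{2}\sum_{j,k}\alpha_j\alpha_k\, y_j y_k\, K(x_j, x_k)
\]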
Why is this useful???
Efficient dot-product of polynomials

Polynomials of degree exactly d:

d = 1:
   φ(u)·φ(v) = (u1, u2) · (v1, v2) = u1 v1 + u2 v2 = u·v

d = 2:
   φ(u)·φ(v) = (u1², u1 u2, u2 u1, u2²) · (v1², v1 v2, v2 v1, v2²)
             = u1² v1² + 2 u1 v1 u2 v2 + u2² v2²
             = (u1 v1 + u2 v2)²
             = (u·v)²

For any d (we will skip the proof):   φ(u)·φ(v) = (u·v)^d
• Cool! Taking a dot product and exponentiating gives the same result as mapping into the high-dimensional space and then taking the dot product.
Finally: the “kernel trick”!
• Never compute features explicitly!!!
– Compute dot products in closed form
• Constant-time high-dimensional dot-
products for many classes of features
• But, O(n²) time in the size of the dataset to compute the objective
– A naïve implementation is slow
– Much work on speeding this up
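A minimal numerical sketch of the trick in Python/NumPy (my own example, not from the lecture), using the degree-2 feature map from the earlier slide: the explicit features and the kernel give the same value, but the kernel never builds the feature vector.

import numpy as np

def phi(u):
    # Explicit degree-2 feature map for 2-D inputs: (u1^2, u1*u2, u2*u1, u2^2)
    return np.array([u[0] * u[0], u[0] * u[1], u[1] * u[0], u[1] * u[1]])

def poly_kernel(u, v, d=2):
    # Kernel evaluation (u . v)^d: never constructs the high-dimensional features
    return np.dot(u, v) ** d

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

print(np.dot(phi(u), phi(v)))   # explicit mapping: 1.0
print(poly_kernel(u, v))        # kernel trick:     1.0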
Common kernels
• Polynomials of degree exactly d
• Polynomials of degree up to d
• Gaussian kernels
• Sigmoid
• And many others: very active area of research!
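In their standard forms, listed in the same order as above (the parameter names σ, η, ν are generic, not from the slide):

\[
K(u,v) = (u\cdot v)^d,
\qquad
K(u,v) = (u\cdot v + 1)^d,
\qquad
K(u,v) = \exp\!\left(-\frac{\|u-v\|^2}{2\sigma^2}\right),
\qquad
K(u,v) = \tanh(\eta\, u\cdot v + \nu)
\]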