CSE517A Machine Learning Fall 2022
Lecture 2: Optimization for Machine Learning
Instructor: Marion Neumann
Reading: fcml Comment 4.1; lfd 3.3.2 (Gradient Descent); pml1 Chapter 8 (Optimization)
Learning Objective
Solving the structural risk minimization problem entails solving an optimization problem. We want to
understand the basic and most popular optimization procedures used in machine learning and be able to
weigh up their advantages and disadvantages.
Our Application
Let’s assume our spam filter is a linear classifier using the log-loss and l1 regularization. This means we have to solve the following optimization problem for w:

    min_w  (1/n) ∑_{i=1}^n log(1 + e^{−y_i (w^⊺ x_i + b)}) + λ ∑_{i=1}^d ∣w_i∣
What is this classifier called?
Training this classifier essentially means finding a solution to this optimization
problem. Now we need an algorithm to do that!
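As a preview of how this objective looks in code, here is a minimal NumPy sketch; the function and variable names are illustrative, X is assumed to hold one training example per row, and the labels y_i are in {−1, +1}.

    import numpy as np

    def l1_logistic_objective(w, b, X, y, lam):
        # log-loss with l1 regularization:
        # (1/n) sum_i log(1 + exp(-y_i (w^T x_i + b))) + lam * ||w||_1
        margins = y * (X @ w + b)
        return np.mean(np.log1p(np.exp(-margins))) + lam * np.sum(np.abs(w))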
1 Recap: SRM Objective
In general, the objective function of structural risk minimization as defined in the last lecture is given by:
    min_w L(w) = min_w (1/n) ∑_{i=1}^n ℓ(h_w(x_i), y_i) + λ r(w)        (1)
Another concrete example of this objective function stated in Eq. (1) is regularized least squares regression,
which is also known as ridge regression:
    min_w (1/n) ∑_{i=1}^n (w^⊺ x_i − y_i)² + λ ∑_{i=1}^d w_i²        (2)
Or in matrix notation using X ∈ Rd×n Eq. (2) can be represented as:
    min_w (1/n) (X^⊺ w − y)^⊺ (X^⊺ w − y) + λ w^⊺ w        (3)

where the first term expands as (X^⊺ w − y)^⊺ (X^⊺ w − y) = w^⊺ X X^⊺ w − 2 w^⊺ X y + y^⊺ y.
We will be using this notation quite a bit, so try to familiarize yourself with it as best as you can. Later we
will have to take derivatives, which can also be elegantly done using matrix notation.
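As a small illustration of this notation (a sketch only; all names are illustrative), the following NumPy functions evaluate Eq. (3) with X ∈ R^{d×n} holding one training point per column, once via the residual form and once via the expanded quadratic form shown above:

    import numpy as np

    def ridge_objective(w, X, y, lam):
        # Eq. (3): (1/n) (X^T w - y)^T (X^T w - y) + lam * w^T w, with X of shape (d, n)
        n = X.shape[1]
        r = X.T @ w - y
        return (r @ r) / n + lam * (w @ w)

    def ridge_objective_expanded(w, X, y, lam):
        # same value via w^T X X^T w - 2 w^T X y + y^T y
        n = X.shape[1]
        return (w @ X @ X.T @ w - 2 * (w @ X @ y) + y @ y) / n + lam * (w @ w)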
1 Probabilistic Machine Learning by Kevin Murphy (https://probml.github.io/pml-book/book1.html)
How to solve the minimization problem?
We have a couple of different options here and not every one will necessarily work out for all objectives we
will encounter:
(1) Closed form solution, cf. Exercise 1.1
(2) Derive the constrained optimization problem (primal/dual) and solve it with a QP-solver, cf. Exercise 1.2
(3) Gradient descent (gd) and its variants, e.g. for logistic regression (and actually most optimization
problems we will encounter)
Note that we treat λ as a hyperparameter that is not optimized together with the model parameters (i.e., w). Typi-
cally λ is tuned via cross-validation using telescope-search, random search, or Bayesian optimization.
Aside: telescope-search for hyperparameter optimization:
(1) find coarse range for λ
(2) refine range around best choice for λ (“zoom in”)
Exercise 1.1. Derive the solution for ols and ridge regression in matrix notation (cf. Section 4 in Lec-
ture 1: srm to check your solution).
Exercise 1.2. The support vector machine (svm) solution to the linear classification problem
h(x) = sign(w⊺ x + b)
on D = {(xi , yi )}i=1...n is the separating hyperplane that maximizes the margin, where x, xi , w ∈ Rd and
b, yi ∈ R. (Consider the hard margin svm formulation without slack variables in this exercise.)
(a) Derive the formula for the margin γ(w, b) of a separating hyperplane H as the distance between the
hyperplane and the nearest training point.
(b) Now, we want to maximize this margin and ensure that H is indeed separating the training data. Write
down the optimization problem that yields the maximum margin classifier and derive the (primal)
svm optimization problem. Show your derivations and briefly explain every step.
(c) How many parameters (aka variables) does the primal optimization problem have? How many con-
straints?
(d) Derive the unconstrained svm formulation (srm objective)
    min_w  C ∑_{i=1}^n max[1 − y_i (w^⊺ x_i + b), 0] + ∣∣w∣∣₂²

from the primal, where w^⊺ x_i + b = f_w(x_i), the max[⋅] term is the hinge-loss, and ∣∣w∣∣₂² is the l2-regularizer. What is λ in this representation?
(e) Carefully derive the dual svm optimization problem. Include all steps of your derivation with brief
verbose explanations.
(f) How many parameters (aka variables) does the dual optimization problem have? How many con-
straints?
(g) When is the primal optimization as solution to the svm classifier disadvantageous? Discuss two situa-
tions.
(h) State the dual form of the svm classifier used for predictions and explain why only support vectors
contribute to the prediction.
2 Gradient Descent
The goal is to minimize a learning objective ℓ(w) such as the loss function, the srm objective L(w), or
other objectives like the negative log likelihood or negative log posterior that we will encounter later in this
course. Assuming that ℓ(w) is convex, continuous, and differentiable (once or twice), there exists a global
minimum.
To find this minimum we use the following strategy:
Start at a random location in the parameter space and then follow the direction of the descending gradient
of ` as illustrated in Figure 1.
Figure 1: Gradient descent illustration plotting w versus ℓ(w) for a one-dimensional w, indicating the
gradient of ℓ at two locations w1 and w2 in the parameter space.
The basic gradient descent algorithm works as follows:
Algorithm 1 Gradient Descent
Initialize w0
repeat
    w_{t+1} ← w_t + s
until ∣∣w_{t+1} − w_t∣∣ ≤ ε
How to choose gradient step s in Algorithm 1?
We want:
    ℓ(w_{t+1}) ≤ ℓ(w_t)   ⟺   ℓ(w_t + s) ≤ ℓ(w_t)        (4)
If ∣∣s∣∣ is small, we can use the Taylor approximation:
    ℓ(w + s) = ℓ(w) + ∇ℓ(w)^⊺ s + (1/2) s^⊺ ∇²ℓ(w) s + …        (5)

where truncating after the ∇ℓ(w)^⊺ s term gives the first order approximation and keeping the (1/2) s^⊺ ∇²ℓ(w) s term gives the second order approximation.
Gradient descent uses the first order approximation.
Let g(w) = ∇ℓ(w) be the vector of partial first order derivatives, then Equation 4 can be written as:

    ℓ(w_t + s) ≈ ℓ(w_t) + g(w_t)^⊺ s ≤ ℓ(w_t)   ⟺   g(w_t)^⊺ s ≤ 0        (6)
We achieve this for example when choosing: s = −g(wt ).
In general,
s = −α g(wt ) (7)
where α > 0 is the learning rate. Setting α is a “dark art”: only if it is sufficiently small does gradient descent
converge (see Figure 2 [Left]). If it is too large, the algorithm can easily diverge out of control (see Figure 2
[Right]).
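Putting Algorithm 1 together with the step s = −α g(w_t), a minimal NumPy sketch of gradient descent could look as follows (grad, the tolerance eps, and the iteration cap are placeholders you would supply):

    import numpy as np

    def gradient_descent(grad, w0, alpha=0.1, eps=1e-6, max_iter=10000):
        # vanilla gd: w_{t+1} = w_t - alpha * g(w_t); stop when ||w_{t+1} - w_t|| <= eps
        w = w0
        for _ in range(max_iter):
            w_new = w - alpha * grad(w)
            if np.linalg.norm(w_new - w) <= eps:
                return w_new
            w = w_new
        return w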
Figure 2: Left: Gradient descent converges when α is small. Right: Gradient descent may diverge when α
is too large.
Some choices of the learning rate α
The choice of α is essential to a good gd implementation, cf. Figure 3.
Figure 3: Gradient descent with different step-sizes. The left and middle plots show the losses as a function
of the number of iterations; the middle plot shows the loss zoomed in around the bottom left corner. The
right plot shows the path of the weight vector w_t over time. In this case the step-size of α = 0.05 (blue
line left, yellow right) converges the fastest.
Here are some ways of choosing the learning rate α:
(1) Safe choice:

    α_t = 1/t

gd will converge, but very slowly.
(2) Heuristic/ad-hoc choice (see the code sketch after this list):

    α_{t+1} = 1.01 α_t   if ℓ(w_{t+1}) ≤ ℓ(w_t)
    α_{t+1} = 0.5 α_t    if ℓ(w_{t+1}) > ℓ(w_t)

It is not guaranteed that gd will converge.
(3) Line search: Choose the α that minimizes the training error ε_tr:

    α*_{t+1} = argmin_α ∑_{i=1}^n ℓ_{w_{t+1}}(x_i, y_i),
where ℓ_{w_{t+1}}(x_i, y_i) is the training error, which is not necessarily the value of the objective function we are
optimizing in gd. This requires solving an additional optimization problem in every iteration of gradient
descent.
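For concreteness, here is a sketch of the heuristic/ad-hoc choice (2): a gd loop that grows α by 1% when the objective decreases and halves it otherwise. All names are illustrative; loss and grad are user-supplied functions.

    import numpy as np

    def gd_adaptive(loss, grad, w0, alpha=0.1, eps=1e-6, max_iter=10000):
        # ad-hoc learning rate: alpha <- 1.01*alpha if the loss improved, else alpha <- 0.5*alpha
        w = w0
        for _ in range(max_iter):
            w_new = w - alpha * grad(w)
            alpha = 1.01 * alpha if loss(w_new) <= loss(w) else 0.5 * alpha
            if np.linalg.norm(w_new - w) <= eps:
                return w_new
            w = w_new
        return w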
Exercise 2.1. gd for ols
(a) Why would you want to perform gd to train an ols regression model instead of computing the closed
form solution derived in Exercise 1.1?
(b) Write down the gradient descent update for the ols problem in terms of the matrix X ∶=
[x1 x2 x3 ... xn ], the vector y ∶= [y1 y2 ... yn ]T and the weight vector w ∈ Rd .
(c) True or false? Gradient descent on a convex function (such as the ols objective) is guaranteed to
converge. Justify your answer.
(d) Add an elastic net regularizer (λ1 ∥w∥1 + λ2 ∥w∥22 - simple version) to your ols regression model. State
the gradient descent update rule for this new objective function in matrix notation.
3 Newton’s Method
Let’s use the second order approximation in Equation 5. This is Newton’s method and it assumes that the
loss ℓ(w) is twice differentiable.
Note that we can represent ∇²ℓ(w) by the Hessian matrix H, which contains all second order partial
derivatives H_ij = ∂²ℓ / (∂w_i ∂w_j). Note that H is a symmetric square matrix that is always positive
semi-definite for our convex objectives (cf. problem on hw1).
Reminder: A symmetric matrix M is positive semi-definite if it has only non-negative eigenvalues or,
equivalently, for any vector x we must have x⊺ M x ≥ 0.
So, to choose the gradient step s we want ℓ(w + s) to be small. Hence,
    s = argmin_s ℓ(w) + ∇ℓ(w)^⊺ s + (1/2) s^⊺ ∇²ℓ(w) s
      = argmin_s ℓ(w) + g(w)^⊺ s + (1/2) s^⊺ H(w) s        (8)
where g(w) = ∇ℓ(w) and H(w) = ∇²ℓ(w).
Note that here we are optimizing over a convex parabola that represents the gradient direction and the
curvature of our objective function.
To find the minimum of Equation 8, we take its first derivative, set it to zero, and solve for s:

    0 = g(w) + H(w) s   ⟹   s = −H(w)^{−1} g(w)        (9)
This choice of s converges extremely fast if the approximation is sufficiently accurate. Otherwise it can
diverge. This is typically the case if the function is flat or almost flat with respect to some dimension. In
that case the second derivatives are close to zero, their inverses become very large, and the result is gigantic
steps.
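A minimal sketch of Newton's method based on Eq. (9), assuming gradient and Hessian functions are supplied; np.linalg.solve is used instead of explicitly forming H(w)^{−1}:

    import numpy as np

    def newtons_method(grad, hess, w0, eps=1e-8, max_iter=100):
        # Newton update: w_{t+1} = w_t - H(w_t)^{-1} g(w_t)
        w = w0
        for _ in range(max_iter):
            s = -np.linalg.solve(hess(w), grad(w))
            if np.linalg.norm(s) <= eps:
                return w + s
            w = w + s
        return w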
3.1 Avoid Divergence of Newton’s Method
To avoid divergence of Newton’s method, a good approach is to combine gradient descent and Newton’s
method. Essentially, we first move toward the right region slowly but in a goal-directed way using gradient
descent (or even stochastic gradient descent) and then finish the optimization quickly with Newton’s method.
Typically, the second order approximation used by Newton’s method is more likely to be appropriate near
the optimum.
In the illustrated example, gradient descent converges from all initial starting points, but only after over
100 iterations. Newton’s method, when it converges (Figure 4), is much faster (convergence after 8 iterations),
but it can also diverge (Figure 5). Figure 6 shows the hybrid approach of taking 6 gradient descent steps and
then switching to Newton’s method; it converges in only 10 updates.
Figure 4: A starting point where Newton’s Method converges in 8 iterations.
Figure 5: A starting point where Newton’s Method diverges.
Figure 6: Same starting point as in Figure 5, however Newton’s method is only used after 6 gradient steps
and converges in a few steps.
3.2 Quasi-Newton Methods
Also note that computing H(w)^{−1} is expensive (O(d³)), and oftentimes we use approximations of the
Hessian matrix or its inverse directly. Such methods are commonly known as quasi-Newton methods, and
one of their most popular representatives is (limited-memory) BFGS (after Broyden-Fletcher-Goldfarb-
Shanno). The easiest, but also crudest, approximation is to use diag(H(w)) instead of H(w), which is easily
invertible.
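In practice, one rarely implements BFGS by hand; for instance, SciPy exposes limited-memory BFGS through scipy.optimize.minimize. The following sketch, with made-up toy data plugged into the ridge objective of Eq. (3), illustrates the call:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X, y, lam = rng.normal(size=(5, 100)), rng.normal(size=100), 0.1   # X in R^{d x n}, toy data

    loss = lambda w: np.mean((X.T @ w - y) ** 2) + lam * (w @ w)
    grad = lambda w: 2 * X @ (X.T @ w - y) / y.size + 2 * lam * w

    result = minimize(loss, x0=np.zeros(5), jac=grad, method="L-BFGS-B")
    w_opt = result.x   # quasi-Newton solution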
3.3 [optional] Conjugate Gradient
Another way to avoid the computation of the Hessian, while still incorporating 2nd order information, is to use
conjugate search directions. Essentially, we run gd with line search and compute the gradient step s (i.e.,
the search direction) based on the previously used directions. The new direction will be conjugate to the
previously used ones. This way we end up using far fewer search directions, and hence gd converges faster.
Algorithm 2 shows the basic conjugate gradient descent (cgd) algorithm.
Algorithm 2 Conjugate Gradient Descent
Initialize w_0, d_0 = −g(w_0), g_1 = −g(w_0)
repeat
    α ← argmin_α ℓ(w_t + α d_t)                  // line search
    w_{t+1} ← w_t + α d_t
    g_t = g_{t+1}, g_{t+1} = −g(w_{t+1})         // remember old gradient, compute new gradient
    d_{t+1} = g_{t+1} + β d_t                    // new conjugate direction
until ∣∣w_{t+1} − w_t∣∣ ≤ ε
Note that g_t is a vector and there are various possible ways to set β, e.g.
β = (g_{t+1}^⊺ (g_{t+1} − g_t)) / (d_t^⊺ (g_{t+1} − g_t)) (after Hestenes-Stiefel), with dot products in
numerator and denominator. Unfortunately, cgd does not work with noisy gradients (which are produced
by stochastic gradient descent discussed below).
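SciPy also ships a nonlinear conjugate gradient solver (method="CG"; note that its β formula may differ from Hestenes-Stiefel). Reusing loss, grad, and the toy data from the L-BFGS sketch above:

    result_cg = minimize(loss, x0=np.zeros(5), jac=grad, method="CG")
    w_cg = result_cg.x   # conjugate gradient solution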
Exercise 3.1. True or false? Justify your answer.
(a) Ordinary least squares regression can be solved with Newton’s method.
(b) Newton’s Method can be used to optimize the (standard) SVM hinge loss.
Exercise 3.2. If you were to optimize the ols objective with Newton’s method, how many steps would
you need until convergence (you can assume the Hessian is invertible)? You can either derive the answer
formally or state it verbally with clear justifications.
4 Best Practice
If your data is not too big (approx. n ≤ 100,000), then quasi-Newton or conjugate gradient descent are
the methods of choice. If you have a huge number of training points, then we typically use approximated
gradients in combination with an adaptive learning rate. In the following we discuss these two strategies.
4.1 Momentum Method
When dealing with non-convex functions, e.g. in neural network training, we want to avoid local minima.
One simple way to do so is to have an adaptive learning rate that uses part of the previous gradient:
s = −α g̃(wt ) (10)
where g̃(wt ) ← g(wt ) + µ g̃(wt−1 ). This is known as the momentum method and its idea is to use some
portion of the previous gradient (“momentum”) to push you out of small local minima.
There are various other ways of computing adaptive learning rates, such as adam or rmsprop. See this blog
post for more information: http://ruder.io/optimizing-gradient-descent/index.html#whichoptimizertochoose.
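A sketch of the momentum update of Eq. (10); the value μ = 0.9 is just a common illustrative choice, and grad is a user-supplied gradient function:

    import numpy as np

    def gd_momentum(grad, w0, alpha=0.01, mu=0.9, n_iter=1000):
        # g_tilde <- g(w_t) + mu * g_tilde;  w_{t+1} <- w_t - alpha * g_tilde
        w, g_tilde = w0, np.zeros_like(w0)
        for _ in range(n_iter):
            g_tilde = grad(w) + mu * g_tilde
            w = w - alpha * g_tilde
        return w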
4.2 Stochastic Gradient Descent
The idea of stochastic gradient descent (sgd) is to make gd more efficient. We consider ℓ(w) on a subset
of training examples. In practice, sgd converges faster than gd, and the noisy updates can help escape local
minima. Two versions are typically considered:
• use one training point at a time:
    g(w) = ∂ℓ(x_i, y_i)/∂w
  (very high fluctuations in the objective function)
• mini-batches (randomly partition D into sets with m training examples):
    g(w) = ∑_{i=1}^m ∂ℓ(x_i, y_i)/∂w    for m ≪ n
sgd is extremely popular as it is easy to implement and fast for large n. As setting the learning rate
is challenging, we typically combine sgd with the momentum method (or other methods using adaptive
learning rates, such as adam). Note that there are parallel implementations of sgd; however, there is still a
lot of ongoing research in this area.
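A minimal mini-batch sgd sketch; all names are illustrative, grad_i(w, x, y) is a user-supplied per-example gradient, X stores one example per row, and the batch gradients are averaged rather than summed:

    import numpy as np

    def sgd_minibatch(grad_i, X, y, w0, alpha=0.01, m=32, n_epochs=10, seed=0):
        # each epoch: shuffle the data, then take one gradient step per mini-batch of size m
        rng = np.random.default_rng(seed)
        w, n = w0, len(y)
        for _ in range(n_epochs):
            for batch in np.array_split(rng.permutation(n), max(1, n // m)):
                g = np.mean([grad_i(w, X[i], y[i]) for i in batch], axis=0)
                w = w - alpha * g
        return w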
4.3 Note: Random Restarts for Non-convex Functions
Oftentimes our objective functions are non-convex. We will still use the methods discussed in this lecture
to optimize those objectives. The solution we get is now typically not the global minimum, but a local one.
So, when dealing with non-convex functions we typically use several randomly selected starting points and
choose the best solution among the (s)gd results. This is then called gd with random restarts. Note that
there is a trivial way of parallelizing this algorithm.
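A sketch of (s)gd with random restarts, reusing the gradient_descent function sketched in Section 2 (loss, grad, and the dimension d are placeholders):

    import numpy as np

    def gd_random_restarts(loss, grad, d, n_restarts=10, seed=0):
        # run gd from several random initializations and keep the best local solution;
        # the runs are independent, so this parallelizes trivially
        rng = np.random.default_rng(seed)
        candidates = [gradient_descent(grad, rng.normal(size=d)) for _ in range(n_restarts)]
        return min(candidates, key=loss)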
5 Summary
We discussed how to optimize convex, continuous, and differentiable functions using gradient descent and
Newton’s method. Further, we introduced several variations of those methods and a couple of best practice
tips on which methods to use. You should be familiar with the following methods:
• gradient descent (and its gradient step derivation)
• Newton’s method
• momentum method
• stochastic gradient descent
Exercise 5.1. Practice Retrieving!
For this summary exercise, it is intended that your answers are based on your own (current) understanding
of the concepts (and not on the definitions you read and copy from these notes or from elsewhere). Don’t
hesitate to say it out loud to your seat neighbor, your pet or stuffed animal, or to yourself before writing
it down. Research studies show that this practice of retrieval and phrasing out loud will help you retain
the knowledge!
(a) Using your own words, summarize each of these methods in 2-3 sentences by retrieving the knowledge
from the top of your head.
(b) What is the difference between gd and Newton? Name one advantage of gd over Newton. Name one
advantage of Newton over gd.
(c) Recap the assumptions for gd.
(d) Discuss the implications of those assumptions being violated. Consider theoretical and practical impli-
cations. What can you do (if anything) in the case where each of the assumptions is violated? Consider
one at a time and come up with concrete things that can be done (if anything can be done) to deal
with the violation.
(e) How does the momentum method improve upon vanilla gd?
(f) How does stochastic gd improve upon the vanilla version?
And always remember: It’s not bad to get it wrong. Getting it wrong is part of learning! Use your notes or
other resources to get the correct answer or come to our office hours to get help!
Our Application
With respect to our spam filter application, we are now able to solve the
various possible objective functions. That means we are able to train various
learning models and try to find the best one for a given application/dataset.
This is exactly what you will do in written homework 1 and implementation
project 1: you will compute the derivatives of some desired srm objective func-
tions, implement the gradient descent algorithm, and evaluate various models on
a dataset of emails labeled as spam and ham (not spam).