CSE517A Machine Learning Fall 2022
Lecture 2: Optimization for Machine Learning
Instructor: Marion Neumann
Reading: fcml Comment 4.1; lfd 3.3.2 (Gradient Descent); pml1 Chapter 8 (Optimization)
Learning Objective
Solving the structural risk minimization problem entails solving an optimization problem. We want to
understand the basic and most popular optimization procedures used in machine learning and be able to
weigh up their advantages and disadvantages.
Our Application
Let’s assume our spam filter is a linear classifier using the log-loss and l1 regularization. This means we have to solve the following optimization problem for w:

    min_w  (1/n) ∑_{i=1}^n log(1 + e^{−y_i (w^⊺ x_i + b)}) + λ ∑_{i=1}^d ∣w_i∣
What is this classifier called?
Training this classifier essentially means finding a solution to this optimization
problem. Now we need an algorithm to do that!
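As a preview of how this objective looks in code, here is a minimal NumPy sketch; the function and variable names are illustrative, X is assumed to hold one training example per row, and the labels y_i are in {−1, +1}.

    import numpy as np

    def l1_logistic_objective(w, b, X, y, lam):
        # log-loss with l1 regularization:
        # (1/n) sum_i log(1 + exp(-y_i (w^T x_i + b))) + lam * ||w||_1
        margins = y * (X @ w + b)
        return np.mean(np.log1p(np.exp(-margins))) + lam * np.sum(np.abs(w))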
1 Recap: SRM Objective
In general, the objective function of structural risk minimization as defined in the last lecture is given by:
    min_w L(w) = min_w (1/n) ∑_{i=1}^n ℓ(h_w(x_i), y_i) + λ r(w)        (1)
Another concrete example of this objective function stated in Eq. (1) is regularized least squares regression,
which is also known as ridge regression:
    min_w (1/n) ∑_{i=1}^n (w^⊺ x_i − y_i)² + λ ∑_{i=1}^d w_i²        (2)
Or in matrix notation using X ∈ Rd×n Eq. (2) can be represented as:
    min_w (1/n) (X^⊺ w − y)^⊺ (X^⊺ w − y) + λ w^⊺ w        (3)

where the first term expands as (X^⊺ w − y)^⊺ (X^⊺ w − y) = w^⊺ X X^⊺ w − 2 w^⊺ X y + y^⊺ y.
We will be using this notation quite a bit, so try to familiarize yourself with it as best as you can. Later we
will have to take derivatives, which can also be elegantly done using matrix notation.
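As a small illustration of this notation (a sketch only; all names are illustrative), the following NumPy functions evaluate Eq. (3) with X ∈ R^{d×n} holding one training point per column, once via the residual form and once via the expanded quadratic form shown above:

    import numpy as np

    def ridge_objective(w, X, y, lam):
        # Eq. (3): (1/n) (X^T w - y)^T (X^T w - y) + lam * w^T w, with X of shape (d, n)
        n = X.shape[1]
        r = X.T @ w - y
        return (r @ r) / n + lam * (w @ w)

    def ridge_objective_expanded(w, X, y, lam):
        # same value via w^T X X^T w - 2 w^T X y + y^T y
        n = X.shape[1]
        return (w @ X @ X.T @ w - 2 * (w @ X @ y) + y @ y) / n + lam * (w @ w)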
1 Probabilistic Machine Learning by Kevin Murphy (https://probml.github.io/pml-book/book1.html)
How to solve the minimization problem?
We have a couple of different options here and not every one will necessarily work out for all objectives we
will encounter:
(1) Closed form solution, cf. Exercise 1.1
(2) Derive the constrained optimization problem (primal/dual) and solve it with a QP-solver, cf. Exercise 1.2
(3) Gradient descent (gd) and its variants, e.g. for logistic regression (and actually most optimization
problems we will encounter)
Note that we treat λ as a hyperparameter that is not optimized together with the model parameters (i.e., w). Typi-
cally λ is tuned via cross-validation using telescope-search, random search, or Bayesian optimization.
Aside: telescope-search for hyperparameter optimization:
(1) find coarse range for λ
(2) refine range around best choice for λ (“zoom in”)
Exercise 1.1. Derive the solution for ols and ridge regression in matrix notation (cf. Section 4 in Lec-
ture 1: srm to check your solution).
Exercise 1.2. The support vector machine (svm) solution to the linear classification problem
h(x) = sign(w⊺ x + b)
on D = {(xi , yi )}i=1...n is the separating hyperplane that maximizes the margin, where x, xi , w ∈ Rd and
b, yi ∈ R. (Consider the hard margin svm formulation without slack variables in this exercise.)
(a) Derive the formula for the margin γ(w, b) of a separating hyperplane H as the distance between the
hyperplane and the nearest training point.
(b) Now, we want to maximize this margin and ensure that H is indeed separating the training data. Write
down the optimization problem that yields the maximum margin classifier and derive the (primal)
svm optimization problem. Show your derivations and briefly explain every step.
(c) How many parameters (aka variables) does the primal optimization problem have? How many con-
straints?
(d) Derive the unconstrained svm formulation (srm objective)
    min_w  C ∑_{i=1}^n max[1 − y_i (w^⊺ x_i + b), 0] + ∣∣w∣∣₂²

from the primal, where w^⊺ x_i + b = f_w(x_i), the max[⋅] term is the hinge-loss, and ∣∣w∣∣₂² is the l2-regularizer. What is λ in this representation?
(e) Carefully derive the dual svm optimization problem. Include all steps of your derivation with brief
verbose explanations.
(f) How many parameters (aka variables) does the dual optimization problem have? How many con-
straints?
(g) When is the primal optimization as solution to the svm classifier disadvantageous? Discuss two situa-
tions.
(h) State the dual form of the svm classifier used for predictions and explain why only support vectors
contribute to the prediction.
2 Gradient Descent
The goal is to minimize a learning objective ℓ(w) such as the loss function, the srm objective L(w), or
other objectives like the negative log likelihood or negative log posterior that we will encounter later in this
course. Assuming that ℓ(w) is convex, continuous, and differentiable (once or twice), there exists a global
minimum.
To find this minimum we use the following strategy:
Start at a random location in the parameter space and then follow the direction of the descending gradient
of ` as illustrated in Figure 1.
Figure 1: Gradient descent illustration plotting w versus ℓ(w) for a one-dimensional w, indicating the
gradient of ℓ at two locations w1 and w2 in the parameter space.
The basic gradient descent algorithm works as follows:
Algorithm 1 Gradient Descent
Initialize w0
repeat
    w_{t+1} ← w_t + s
until ∣∣w_{t+1} − w_t∣∣ ≤ ε
How to choose gradient step s in Algorithm 1?
We want:
    ℓ(w_{t+1}) ≤ ℓ(w_t)   ⟺   ℓ(w_t + s) ≤ ℓ(w_t)        (4)
If ∣∣s∣∣ is small, we can use the Taylor approximation:
    ℓ(w + s) = ℓ(w) + ∇ℓ(w)^⊺ s + (1/2) s^⊺ ∇²ℓ(w) s + …        (5)

where truncating after the ∇ℓ(w)^⊺ s term gives the first order approximation and keeping the (1/2) s^⊺ ∇²ℓ(w) s term gives the second order approximation.
Gradient descent uses the first order approximation.
Let g(w) = ∇ℓ(w) be the vector of partial first order derivatives, then Equation 4 can be written as:

    ℓ(w_t + s) ≈ ℓ(w_t) + g(w_t)^⊺ s ≤ ℓ(w_t)   ⟺   g(w_t)^⊺ s ≤ 0        (6)
We achieve this for example when choosing: s = −g(wt ).
In general,
s = −α g(wt ) (7)
where α > 0 is the learning rate. Setting α is a “dark art”: only if it is sufficiently small does gradient descent
converge (see Figure 2 [Left]). If it is too large, the algorithm can easily diverge out of control (see Figure 2
[Right]).
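Putting Algorithm 1 together with the step s = −α g(w_t), a minimal NumPy sketch of gradient descent could look as follows (grad, the tolerance eps, and the iteration cap are placeholders you would supply):

    import numpy as np

    def gradient_descent(grad, w0, alpha=0.1, eps=1e-6, max_iter=10000):
        # vanilla gd: w_{t+1} = w_t - alpha * g(w_t); stop when ||w_{t+1} - w_t|| <= eps
        w = w0
        for _ in range(max_iter):
            w_new = w - alpha * grad(w)
            if np.linalg.norm(w_new - w) <= eps:
                return w_new
            w = w_new
        return w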
Figure 2: Left: Gradient descent converges when α is small. Right: Gradient descent may diverge when α
is too large.
Some choices of the learning rate α
The choice of α is essential to a good gd implementation, cf. Figure 3.
Figure 3: Gradient descent with different step-sizes. The left and middle plots show the losses as a function
of the number of iterations; the middle plot shows the loss zoomed in around the bottom left corner. The
right plot shows the path of the weight vector w_t over time. In this case the step-size of α = 0.05 (blue
line left, yellow right) converges the fastest.
Here are some ways of choosing the learning rate α:
(1) Safe choice:

    α_t = 1/t

gd will converge, but very slowly.
(2) Heuristic/ad-hoc choice (see the code sketch after this list):

    α_{t+1} = 1.01 α_t   if ℓ(w_{t+1}) ≤ ℓ(w_t)
    α_{t+1} = 0.5 α_t    if ℓ(w_{t+1}) > ℓ(w_t)

It is not guaranteed that gd will converge.
(3) Line search: Choose the α that minimizes the training error ε_tr:

    α*_{t+1} = argmin_α ∑_{i=1}^n ℓ_{w_{t+1}}(x_i, y_i),
where ℓ_{w_{t+1}}(x_i, y_i) is the training error, which is not necessarily the value of the objective function we are
optimizing in gd. This requires solving an additional optimization problem in every iteration of gradient
descent.
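For concreteness, here is a sketch of the heuristic/ad-hoc choice (2): a gd loop that grows α by 1% when the objective decreases and halves it otherwise. All names are illustrative; loss and grad are user-supplied functions.

    import numpy as np

    def gd_adaptive(loss, grad, w0, alpha=0.1, eps=1e-6, max_iter=10000):
        # ad-hoc learning rate: alpha <- 1.01*alpha if the loss improved, else alpha <- 0.5*alpha
        w = w0
        for _ in range(max_iter):
            w_new = w - alpha * grad(w)
            alpha = 1.01 * alpha if loss(w_new) <= loss(w) else 0.5 * alpha
            if np.linalg.norm(w_new - w) <= eps:
                return w_new
            w = w_new
        return w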
Exercise 2.1. gd for ols
(a) Why would you want to perform gd to train an ols regression model instead of computing the closed
form solution derived in Exercise 1.1?
(b) Write down the gradient descent update for the ols problem in terms of the matrix X ∶=
[x1 x2 x3 ... xn ], the vector y ∶= [y1 y2 ... yn ]T and the weight vector w ∈ Rd .
(c) True or false? Gradient descent on a convex function (such as the ols objective) is guaranteed to
converge. Justify your answer.
(d) Add an elastic net regularizer (λ1 ∥w∥1 + λ2 ∥w∥22 - simple version) to your ols regression model. State
the gradient descent update rule for this new objective function in matrix notation.
3 Newton’s Method
Let’s use the second order approximation in Equation 5. This is Newton’s method and it assumes that the
loss ℓ(w) is twice differentiable.
Note that we can represent ∇²ℓ(w) by the Hessian matrix H, which contains all second order partial
derivatives H_ij = ∂²ℓ / (∂w_i ∂w_j). Note that H is a symmetric square matrix that is always positive
semi-definite for our convex objectives (cf. problem on hw1).
Reminder: A symmetric matrix M is positive semi-definite if it has only non-negative eigenvalues or,
equivalently, for any vector x we must have x⊺ M x ≥ 0.
So, to choose the gradient step s we want ℓ(w + s) to be small. Hence,
    s = argmin_s ℓ(w) + ∇ℓ(w)^⊺ s + (1/2) s^⊺ ∇²ℓ(w) s
      = argmin_s ℓ(w) + g(w)^⊺ s + (1/2) s^⊺ H(w) s        (8)
where g(w) = ∇ℓ(w) and H(w) = ∇²ℓ(w).
Note that here we are optimizing over a convex parabola that represents the gradient direction and the
curvature of our objective function.
To find the minimum of Equation 8, we take its first derivative, set it to zero, and solve for s:

    0 = g(w) + H(w) s   ⟹   s = −H(w)^{−1} g(w)        (9)
This choice of s converges extremely fast if the approximation is sufficiently accurate. Otherwise it can
diverge. This is typically the case if the function is flat or almost flat with respect to some dimension. In
that case the second derivatives are close to zero, their inverses become very large, and the result is gigantic
steps.
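A minimal sketch of Newton's method based on Eq. (9), assuming gradient and Hessian functions are supplied; np.linalg.solve is used instead of explicitly forming H(w)^{−1}:

    import numpy as np

    def newtons_method(grad, hess, w0, eps=1e-8, max_iter=100):
        # Newton update: w_{t+1} = w_t - H(w_t)^{-1} g(w_t)
        w = w0
        for _ in range(max_iter):
            s = -np.linalg.solve(hess(w), grad(w))
            if np.linalg.norm(s) <= eps:
                return w + s
            w = w + s
        return w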
3.1 Avoid Divergence of Newton’s Method
To avoid divergence of Newton’s method, a good approach is to combine gradient descent and Newton’s
method. Essentially, we first move toward the right region slowly but in a goal-directed way using gradient
descent (or even stochastic gradient descent) and then finish the optimization quickly with Newton’s method.
Typically, the second order approximation used by Newton’s method is more likely to be appropriate near
the optimum.
In the illustrated example, gradient descent converges from all initial starting points, but only after over
100 iterations. Newton’s method, when it converges (Figure 4), is much faster (convergence after 8 iterations),
but it can also diverge (Figure 5). Figure 6 shows the hybrid approach of taking 6 gradient descent steps and
then switching to Newton’s method; it converges in only 10 updates.
Figure 4: A starting point where Newton’s Method converges in 8 iterations.
Figure 5: A starting point where Newton’s Method diverges.
Figure 6: Same starting point as in Figure 5, however Newton’s method is only used after 6 gradient steps
and converges in a few steps.
3.2 Quasi-Newton Methods
Also note that computing H(w)^{−1} is expensive (O(d³)), and oftentimes we use approximations of the
Hessian matrix or its inverse directly. Such methods are commonly known as quasi-Newton methods, and
one of their most popular representatives is (limited-memory) BFGS (after Broyden-Fletcher-Goldfarb-
Shanno). The easiest, but also crudest, approximation is to use diag(H(w)) instead of H(w), which is easily
invertible.
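In practice, one rarely implements BFGS by hand; for instance, SciPy exposes limited-memory BFGS through scipy.optimize.minimize. The following sketch, with made-up toy data plugged into the ridge objective of Eq. (3), illustrates the call:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X, y, lam = rng.normal(size=(5, 100)), rng.normal(size=100), 0.1   # X in R^{d x n}, toy data

    loss = lambda w: np.mean((X.T @ w - y) ** 2) + lam * (w @ w)
    grad = lambda w: 2 * X @ (X.T @ w - y) / y.size + 2 * lam * w

    result = minimize(loss, x0=np.zeros(5), jac=grad, method="L-BFGS-B")
    w_opt = result.x   # quasi-Newton solution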
3.3 [optional] Conjugate Gradient
Another way to avoid the computation of the Hessian, while still incorporating 2nd order information, is to use
conjugate search directions. Essentially, we run gd with line search and compute the gradient step s (i.e.,
the search direction) based on the previously used directions. The new direction will be conjugate to the
previously used ones. This way we end up using far fewer search directions, and hence gd converges faster.
Algorithm 2 shows the basic conjugate gradient descent (cgd) algorithm.
Algorithm 2 Conjugate Gradient Descent
Initialize w_0, d_0 = −g(w_0), g_1 = −g(w_0)
repeat
    α ← argmin_α ℓ(w_t + α d_t)                  // line search
    w_{t+1} ← w_t + α d_t
    g_t = g_{t+1}, g_{t+1} = −g(w_{t+1})         // remember old gradient, compute new gradient
    d_{t+1} = g_{t+1} + β d_t                    // new conjugate direction
until ∣∣w_{t+1} − w_t∣∣ ≤ ε
Note that g_t is a vector and there are various possible ways to set β, e.g.
β = (g_{t+1}^⊺ (g_{t+1} − g_t)) / (d_t^⊺ (g_{t+1} − g_t)) (after Hestenes-Stiefel), with dot products in
numerator and denominator. Unfortunately, cgd does not work with noisy gradients (which are produced
by stochastic gradient descent discussed below).
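SciPy also ships a nonlinear conjugate gradient solver (method="CG"; note that its β formula may differ from Hestenes-Stiefel). Reusing loss, grad, and the toy data from the L-BFGS sketch above:

    result_cg = minimize(loss, x0=np.zeros(5), jac=grad, method="CG")
    w_cg = result_cg.x   # conjugate gradient solution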
Exercise 3.1. True or false? Justify your answer.
(a) Ordinary least squares regression can be solved with Newton’s method.
(b) Newton’s Method can be used to optimize the (standard) SVM hinge loss.
Exercise 3.2. If you were to optimize the ols objective with Newton’s method, how many steps would
you need until convergence (you can assume the Hessian is invertible)? You can either derive the answer
formally or state it verbally with clear justifications.
4 Best Practice
If your data is not too big (approx. n ≤ 100,000), then quasi-Newton or conjugate gradient descent are
the methods of choice. If you have a huge number of training points, then we typically use approximated
gradients in combination with an adaptive learning rate. In the following we discuss these two strategies.
4.1 Momentum Method
When dealing with non-convex functions, e.g. in neural network training, we want to avoid local minima.
One simple way to do so is to have an adaptive learning rate that uses part of the previous gradient:
s = −α g̃(wt ) (10)
where g̃(wt ) ← g(wt ) + µ g̃(wt−1 ). This is known as the momentum method and its idea is to use some
portion of the previous gradient (“momentum”) to push you out of small local minima.
There are various other ways of computing adaptive learning rates, such as adam or rmsprop. See this blog
post for more information: http://ruder.io/optimizing-gradient-descent/index.html#whichoptimizertochoose.
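A sketch of the momentum update of Eq. (10); the value μ = 0.9 is just a common illustrative choice, and grad is a user-supplied gradient function:

    import numpy as np

    def gd_momentum(grad, w0, alpha=0.01, mu=0.9, n_iter=1000):
        # g_tilde <- g(w_t) + mu * g_tilde;  w_{t+1} <- w_t - alpha * g_tilde
        w, g_tilde = w0, np.zeros_like(w0)
        for _ in range(n_iter):
            g_tilde = grad(w) + mu * g_tilde
            w = w - alpha * g_tilde
        return w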
4.2 Stochastic Gradient Descent
The idea of stochastic gradient descent (sgd) is to make gd more efficient. We consider ℓ(w) on a subset
of training examples. In practice, sgd converges faster than gd, and the noisy updates can help escape local
minima. Two versions are typically considered:
• use one training point at a time:
    g(w) = ∂ℓ(x_i, y_i)/∂w
  (very high fluctuations in the objective function)
• mini-batches (randomly partition D into sets with m training examples):
    g(w) = ∑_{i=1}^m ∂ℓ(x_i, y_i)/∂w    for m ≪ n
sgd is extremely popular as it is easy to implement and fast for large n. As setting the learning rate
is challenging, we typically combine sgd with the momentum method (or other methods using adaptive
learning rates, such as adam). Note that there are parallel implementations of sgd; however, there is still a
lot of ongoing research in this area.
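A minimal mini-batch sgd sketch; all names are illustrative, grad_i(w, x, y) is a user-supplied per-example gradient, X stores one example per row, and the batch gradients are averaged rather than summed:

    import numpy as np

    def sgd_minibatch(grad_i, X, y, w0, alpha=0.01, m=32, n_epochs=10, seed=0):
        # each epoch: shuffle the data, then take one gradient step per mini-batch of size m
        rng = np.random.default_rng(seed)
        w, n = w0, len(y)
        for _ in range(n_epochs):
            for batch in np.array_split(rng.permutation(n), max(1, n // m)):
                g = np.mean([grad_i(w, X[i], y[i]) for i in batch], axis=0)
                w = w - alpha * g
        return w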
4.3 Note: Random Restarts for Non-convex Functions
Oftentimes our objective functions are non-convex. We will still use the methods discussed in this lecture
to optimize those objectives. The solution we get is now typically not the global minimum, but a local one.
So, when dealing with non-convex functions we typically use several randomly selected starting points and
choose the best solution among the (s)gd results. This is then called gd with random restarts. Note that
there is a trivial way of parallelizing this algorithm.
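A sketch of (s)gd with random restarts, reusing the gradient_descent function sketched in Section 2 (loss, grad, and the dimension d are placeholders):

    import numpy as np

    def gd_random_restarts(loss, grad, d, n_restarts=10, seed=0):
        # run gd from several random initializations and keep the best local solution;
        # the runs are independent, so this parallelizes trivially
        rng = np.random.default_rng(seed)
        candidates = [gradient_descent(grad, rng.normal(size=d)) for _ in range(n_restarts)]
        return min(candidates, key=loss)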
5 Summary
We discussed how to optimize convex, continuous, and differentiable functions using gradient descent and
Newton’s method. Further, we introduced several variations of those methods and a couple of best practice
tips on which methods to use. You should be familiar with the following methods:
• gradient descent (and its gradient step derivation)
• Newton’s method
• momentum method
• stochastic gradient descent
Exercise 5.1. Practice Retrieving!
For this summary exercise, it is intended that your answers are based on your own (current) understanding
of the concepts (and not on the definitions you read and copy from these notes or from elsewhere). Don’t
hesitate to say it out loud to your seat neighbor, your pet or stuffed animal, or to yourself before writing
it down. Research studies show that this practice of retrieval and phrasing out loud will help you retain
the knowledge!
(a) Using your own words, summarize each of these methods in 2-3 sentences by retrieving the knowledge
from the top of your head.
(b) What is the difference between gd and Newton? Name one advantage of gd over Newton. Name one
advantage of Newton over gd.
(c) Recap the assumptions for gd.
(d) Discuss the implications of those assumptions being violated. Consider theoretical and practical impli-
cations. What can you do (if anything) in the case where each of the assumptions is violated? Consider
one at a time and come up with concrete things that can be done (if anything can be done) to deal
with the violation.
(e) How does the momentum method improve upon vanilla gd?
(f) How does stochastic gd improve upon the vanilla version?
And always remember: It’s not bad to get it wrong. Getting it wrong is part of learning! Use your notes or
other resources to get the correct answer or come to our office hours to get help!
Our Application
With respect to our spam filter application, we are now able to solve the
various possible objective functions. That means we are able to train various
learning models and try to find the best one for a given application/dataset.
This is exactly what you will do in written homework 1 and implementation
project 1: you will compute the derivatives of some desired srm objective func-
tions, implement the gradient descent algorithm, and evaluate various models on
a dataset of emails labeled as spam and ham (not spam).