
44715: Optimization II 2019S

Lecture 7: Stochastic Gradient Descent


Lecturer: Yimin Zhong Scribes: None

Note: In all notes, bold face letters denote vectors.

The problem of interest is the following:


f(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x) \qquad (7.1)

where f_i is the loss associated with the i-th observation in the data set. The objective function f is therefore the statistical average of the losses. The most common example is so-called least squares, where the f_i are squared residuals. Another common example is maximum likelihood.
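For instance, in a linear least-squares problem with data pairs (a_i, b_i) (an illustrative instance of (7.1), not taken from the notes), one may take

f_i(x) = \frac{1}{2} (a_i^T x - b_i)^2, \qquad \nabla f_i(x) = (a_i^T x - b_i)\, a_i,

so that f(x) = \frac{1}{2m} \|Ax - b\|^2, where the rows of A are the a_i^T.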

7.1 (Batch) Gradient Descent

The general theory says that if f is differentiable, then any minimum must occur at a stationary point, and gradient descent guarantees convergence to a stationary point.
The cost lies mostly in computing the gradient: at each step, all n entries of the gradient have to be calculated in order to update x. But differentiating f involves all m component functions, so if the data set is huge this is expensive to compute.

x_{k+1} = x_k - \eta \nabla f(x_k) = x_k - \frac{\eta}{m} \sum_{i=1}^{m} \nabla f_i(x_k) \qquad (7.2)
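As a minimal illustration of (7.2), the following Python sketch runs batch gradient descent on the least-squares example above; the data A, b, the step size, and the function name are illustrative assumptions, not part of the notes.

    import numpy as np

    def batch_gradient_descent(A, b, eta=0.01, num_iters=100):
        """Batch GD for f(x) = (1/2m) ||Ax - b||^2 (illustrative sketch)."""
        m, n = A.shape
        x = np.zeros(n)                    # initial guess x_0
        for _ in range(num_iters):
            grad = A.T @ (A @ x - b) / m   # full gradient: average over all m terms
            x = x - eta * grad             # iteration (7.2)
        return x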

The idea of stochastic gradient descent is to use only a small portion of the data to compute the gradient, which is usually much faster, although the accuracy of the gradient suffers. As we know from last quarter, a linearly convergent method is already slow, so extra accuracy in evaluating the gradient might not provide much improvement.

7.2 Stochastic Gradient Descent

Although a small portion of the data set is not enough to represent the true gradient exactly, it is rare for the "sampled" gradient to be far away from the true one. Stochastic gradient descent (SGD) is the following:

x_{k+1} = x_k - \eta \nabla f_{i_k}(x_k) \qquad (7.3)

for a single index i_k, where i_k is sampled from the index set {1, 2, . . . , m}; so on average, we expect the sampled gradient to equal the true one. The sampling strategy is not well understood mathematically, but it is well understood experimentally (see Project 2).
In practice, we do not impose a specific rule on the ordering of the sampled indices i_k, which means we could obtain a sequence like {1, 2, 1, 1, 2, 2, 3, . . . } in which indices repeat locally. Clearly this may not provide good convergence, since the objective function does not depend on only one data point. Still, it is not trivial to show that this strategy is always worse than the "clever" alternative: instead of sampling on the fly, we shuffle the index set {1, 2, . . . , m} to obtain a permutation, and then perform the stochastic gradient descent through the permuted indices one by one.
The algorithm is simple (a code sketch follows the steps below):

1. Choose an initial value x_0 and the step size (learning rate) η.

2. Shuffle the index set {1, 2, . . . , m} to obtain an ordering T, then perform iteration (7.3) sequentially through T. Go to step 2 until convergence.
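A minimal sketch of this shuffled, one-sample SGD in Python is given below, on the same illustrative least-squares data A, b; the fixed number of epochs standing in for a convergence test is also an assumption.

    import numpy as np

    def sgd(A, b, eta=0.01, num_epochs=10):
        """One-sample SGD with reshuffling at every epoch (illustrative sketch)."""
        m, n = A.shape
        x = np.zeros(n)                            # step 1: initial x_0
        rng = np.random.default_rng(0)
        for _ in range(num_epochs):                # step 2, repeated until "converged"
            T = rng.permutation(m)                 # shuffle {1, ..., m}
            for i in T:
                grad_i = (A[i] @ x - b[i]) * A[i]  # gradient of a single f_i
                x = x - eta * grad_i               # iteration (7.3)
        return x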

7.3 Mini-batch Gradient Descent

Mini-batch gradient descent sits somewhere between batch GD and SGD. The idea is that instead of a single data point, it uses a "mini" batch of data to compute the gradient. Experimentally this converges much better, since more samples give a more accurate approximation of the gradient. The algorithm is very simple too (a code sketch follows the steps below).

1. Choose an initial value x_0 and the step size (learning rate) η.

2. Shuffle the index set {1, 2, . . . , m} to obtain an ordering T, group T into a few mini-batches T_1, T_2, . . . , and perform the iteration

x_{k+1} = x_k - \frac{\eta}{L} \sum_{i \in T_j} \nabla f_i(x_k) \qquad (7.4)

with each mini-batch T_j taken sequentially from T, where L = |T_j| is the mini-batch size. Go to step 2 until convergence.
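Here is a minimal Python sketch of this mini-batch loop, following the same illustrative least-squares setup; the batch size and other names are hypothetical.

    import numpy as np

    def minibatch_gd(A, b, eta=0.01, batch_size=32, num_epochs=10):
        """Mini-batch GD: shuffle, split into batches, apply (7.4) (illustrative sketch)."""
        m, n = A.shape
        x = np.zeros(n)
        rng = np.random.default_rng(0)
        for _ in range(num_epochs):
            T = rng.permutation(m)                        # shuffled index set
            for start in range(0, m, batch_size):
                Tj = T[start:start + batch_size]          # one mini-batch of indices
                Aj, bj = A[Tj], b[Tj]
                grad = Aj.T @ (Aj @ x - bj) / len(Tj)     # averaged gradient, L = |T_j|
                x = x - eta * grad                        # iteration (7.4)
        return x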

7.4 Convergence and learning rate

The learning rate plays an important role in convergence; unlike what we learned last quarter, the convergence rate here is sublinear in general.
Under some convexity assumptions, if η_k = c/k for some constant c (so-called diminishing step sizes, meaning η decays at a slow rate), then the convergence (in the expectation sense) is of order O(k^{-1/2}). If you want the expected error to be small, say 10^{-2}, the number of iterations must be roughly of the order 10^4. The proof is not difficult; if you are familiar with expectations, you can read the references for this lecture. The idea is the same as in the proof for gradient descent with a fixed step length from last quarter.
Because of the slow convergence, taking the last iterate x_k as our result might not be a good idea, since it has too much variance. Another popular choice is the averaged iterate \bar{x}_k, the mean of all previous iterates:

\bar{x}_k = \frac{1}{k} \sum_{i=1}^{k} x_i

This is usually more robust, but the convergence is still about O(k^{-1/2}) in the general case. If f is (strongly) convex and its gradient is Lipschitz continuous, then we can get O(k^{-1}) convergence.
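The average can be maintained on the fly with a running-mean update instead of storing all iterates; a short sketch (the variable names are hypothetical):

    # inside the SGD loop, after computing the new iterate x at step k = 1, 2, ...:
    # running mean: x_bar_k = ((k - 1) * x_bar_{k-1} + x_k) / k
    x_bar = x_bar + (x - x_bar) / k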

7.5 Some challenges

Mini-batch GD does not guarantee good convergence in theory, but in practice it performs better. There are some issues with the algorithm:

1. The learning rate (step size) is important, and choosing it is difficult.
2. The adjustment of the learning rate: how to make the learning rate vanish effectively is not clear.
3. The learning rate should change adaptively: different parameters may need different learning rates; this is more of a scaling issue.
4. The nature of the problem, especially non-convexity.

7.6 Annealing the learning rate

The step size or learning rate is not fixed in stochastic gradient descent and its variants. The best practice is to anneal it in a smooth way. There are several common approaches (a code sketch follows the list):

1. Step decay: keep the learning rate unchanged for a number of iterations, then reduce it, and repeat.
2. Exponential decay: choose a relatively small decay rate c and let the learning rate change as a_0 exp(−ck), where k is the iteration number.
3. 1/k decay: this matches the theoretical result; people use a_0/(1 + ck), where k is the iteration number and a_0, c are parameters.
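The three schedules can be written as simple Python functions; the constants a0, c, and the drop period used for step decay are illustrative choices.

    import numpy as np

    def step_decay(k, a0=0.1, drop=0.5, period=100):
        """Halve the learning rate every `period` iterations."""
        return a0 * drop ** (k // period)

    def exp_decay(k, a0=0.1, c=0.001):
        """Exponential decay a0 * exp(-c k)."""
        return a0 * np.exp(-c * k)

    def inv_decay(k, a0=0.1, c=0.001):
        """1/k decay a0 / (1 + c k)."""
        return a0 / (1 + c * k)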

7.7 Acceleration

7.7.1 Momentum

We learned the so-called momentum method last quarter as the heavy-ball method: the trend from the previous step is partially preserved to help convergence.

p_k = \gamma p_{k-1} + \eta \nabla f(x_k), \qquad x_{k+1} = x_k - p_k \qquad (7.5)
where p_{k-1} is the displacement from the previous step, and γ is usually set to 0.9 or a similar value.
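A minimal Python sketch of the momentum update (7.5); grad_f stands for whatever (full or sampled) gradient function is being used and is a hypothetical placeholder.

    def momentum_step(x, p, grad_f, eta=0.01, gamma=0.9):
        """One momentum (heavy-ball) step, iteration (7.5)."""
        p_new = gamma * p + eta * grad_f(x)   # keep part of the previous displacement
        x_new = x - p_new
        return x_new, p_new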

7.7.2 Nesterov

The Nesterov accelerated gradient (NAG) is another version of the momentum method, which makes a small correction in where the gradient is evaluated.
p_k = \gamma p_{k-1} + \eta \nabla f(x_k - \gamma p_{k-1}), \qquad x_{k+1} = x_k - p_k \qquad (7.6)

The good thing about Nesterov is that the heavy-ball method only lets the ball follow the slope, whereas Nesterov looks a little ahead by evaluating the gradient at the predicted point x_k − γp_{k-1}.
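A corresponding sketch of one NAG step (7.6); as before, grad_f is a hypothetical placeholder for the gradient function.

    def nesterov_step(x, p, grad_f, eta=0.01, gamma=0.9):
        """One Nesterov (NAG) step, iteration (7.6): gradient at the look-ahead point."""
        p_new = gamma * p + eta * grad_f(x - gamma * p)   # look-ahead correction
        x_new = x - p_new
        return x_new, p_new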

7.7.3 Adagrad

Adagrad is a way to change the learning rate adaptively. Write the gradient as g_k = ∇f(x_k); then, component-wise, plain gradient descent reads
x_{k+1,i} = x_{k,i} - \eta g_{k,i} \qquad (7.7)
Adagrad instead chooses a different effective learning rate for each parameter:
x_{k+1,i} = x_{k,i} - \frac{\eta}{\sqrt{G_{k,i} + \varepsilon}} \, g_{k,i} \qquad (7.8)

where G_k is a vector whose i-th element is the sum of squares of the gradients with respect to x_i up to step k, and ε is a small smoothing term (say 10^{-8}).
The weakness of Adagrad is the accumulation in the denominator, which may make the learning rate shrink to zero too fast.
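A minimal sketch of the Adagrad update (7.8); grad_f, the initial point x0, and the constants are illustrative assumptions.

    import numpy as np

    def adagrad(x0, grad_f, eta=0.1, eps=1e-8, num_iters=1000):
        """Adagrad: per-coordinate learning rates from accumulated squared gradients."""
        x = x0.copy()
        G = np.zeros_like(x)                      # running sum of squared gradients
        for _ in range(num_iters):
            g = grad_f(x)
            G += g ** 2                           # accumulate g_{k,i}^2 per coordinate
            x = x - eta / np.sqrt(G + eps) * g    # iteration (7.8)
        return x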

7.7.4 Adadelta

This method does essentially one thing: it reduces the sum in the denominator by introducing a decay. It redefines G_k as a running average E[g_k^2], which is again a vector:

E[g_k^2] = \gamma E[g_{k-1}^2] + (1 - \gamma) g_k^2 \qquad (7.9)

with, say, γ = 0.9. Then G_k in (7.8) is replaced by E[g_k^2].
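A sketch of the method as described here, i.e. the Adagrad step (7.8) with the accumulator replaced by the running average (7.9); the function and parameter names are illustrative assumptions.

    import numpy as np

    def adadelta_like(x0, grad_f, eta=0.01, gamma=0.9, eps=1e-8, num_iters=1000):
        """Adagrad-style step with the running average E[g^2] of (7.9) in the denominator."""
        x = x0.copy()
        Eg2 = np.zeros_like(x)                        # running average of squared gradients
        for _ in range(num_iters):
            g = grad_f(x)
            Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2  # update (7.9)
            x = x - eta / np.sqrt(Eg2 + eps) * g      # (7.8) with G_k replaced by E[g^2]
        return x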
