
44715: Optimization II 2019S

Lecture 7: Stochastic Gradient Descent


Lecturer: Yimin Zhong Scribes: None

Note: In all notes, bold face letters denote vectors.

The problem of interest is the following:


f(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x) \qquad (7.1)

where f_i is the loss associated with the i-th observation in the data set. The objective function f is therefore the statistical average of the losses. The most common example is so-called least squares, where the f_i are squared residuals. Another common example is maximum likelihood.
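For instance, in a linear least-squares problem with data pairs (a_i, b_i) (an illustrative instance of (7.1), not taken from the notes), one may take

f_i(x) = \frac{1}{2} (a_i^T x - b_i)^2, \qquad \nabla f_i(x) = (a_i^T x - b_i)\, a_i,

so that f(x) = \frac{1}{2m} \|Ax - b\|^2, where the rows of A are the a_i^T.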

7.1 (Batch) Gradient Descent

The general theory says that if f is differentiable, then any minimum must occur at a stationary point, and gradient descent guarantees convergence to a stationary point.
The cost lies mostly in computing the gradient: at each step, all n entries of the gradient have to be calculated in order to update x. But differentiating f involves all m component functions, so if the data set is huge this is expensive to compute.

x_{k+1} = x_k - \eta \nabla f(x_k) = x_k - \frac{\eta}{m} \sum_{i=1}^{m} \nabla f_i(x_k) \qquad (7.2)
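As a minimal illustration of (7.2), the following Python sketch runs batch gradient descent on the least-squares example above; the data A, b, the step size, and the function name are illustrative assumptions, not part of the notes.

    import numpy as np

    def batch_gradient_descent(A, b, eta=0.01, num_iters=100):
        """Batch GD for f(x) = (1/2m) ||Ax - b||^2 (illustrative sketch)."""
        m, n = A.shape
        x = np.zeros(n)                    # initial guess x_0
        for _ in range(num_iters):
            grad = A.T @ (A @ x - b) / m   # full gradient: average over all m terms
            x = x - eta * grad             # iteration (7.2)
        return x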

The idea of stochastic gradient descent is to use only a small portion of the data to compute the gradient, which is usually much faster, although the accuracy of the gradient suffers. As we know from last quarter, a linearly convergent method is already slow, so extra accuracy in evaluating the gradient might not provide much improvement.

7.2 Stochastic Gradient Descent

Although a small portion of the data set is not enough to represent the true gradient exactly, it is rare for the "sampled" gradient to be far away from the true one. Stochastic gradient descent (SGD) is the following:

x_{k+1} = x_k - \eta \nabla f_{i_k}(x_k) \qquad (7.3)

for a single index i_k, where i_k is sampled from the index set {1, 2, . . . , m}; so on average, we expect the sampled gradient to equal the true one. The sampling strategy is not well understood mathematically, but it is well understood experimentally (see Project 2).
In practice, we do not impose a specific rule on the ordering of the sampled indices i_k, which means we could obtain a sequence like {1, 2, 1, 1, 2, 2, 3, . . . } in which indices repeat locally. Clearly this may not provide good convergence, since the objective function does not depend on only one data point. Still, it is not trivial to show that this strategy is always worse than the "clever" alternative: instead of sampling on the fly, we shuffle the index set {1, 2, . . . , m} to obtain a permutation, and then perform the stochastic gradient descent through the permuted indices one by one.
The algorithm is simple (a code sketch follows the steps below):

1. Choose an initial value x_0 and the step size (learning rate) η.

2. Shuffle the index set {1, 2, . . . , m} to obtain an ordering T, then perform iteration (7.3) sequentially through T. Go to step 2 until convergence.
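A minimal sketch of this shuffled, one-sample SGD in Python is given below, on the same illustrative least-squares data A, b; the fixed number of epochs standing in for a convergence test is also an assumption.

    import numpy as np

    def sgd(A, b, eta=0.01, num_epochs=10):
        """One-sample SGD with reshuffling at every epoch (illustrative sketch)."""
        m, n = A.shape
        x = np.zeros(n)                            # step 1: initial x_0
        rng = np.random.default_rng(0)
        for _ in range(num_epochs):                # step 2, repeated until "converged"
            T = rng.permutation(m)                 # shuffle {1, ..., m}
            for i in T:
                grad_i = (A[i] @ x - b[i]) * A[i]  # gradient of a single f_i
                x = x - eta * grad_i               # iteration (7.3)
        return x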

7.3 Mini-batch Gradient Descent

Mini-batch gradient descent sits somewhere between batch GD and SGD. The idea is that instead of a single data point, it uses a "mini" batch of data to compute the gradient. Experimentally this converges much better, since more samples give a more accurate approximation of the gradient. The algorithm is very simple too (a code sketch follows the steps below).

1. Choose an initial value x_0 and the step size (learning rate) η.

2. Shuffle the index set {1, 2, . . . , m} to obtain an ordering T, group T into a few mini-batches T_1, T_2, . . . , and perform the iteration

x_{k+1} = x_k - \frac{\eta}{L} \sum_{i \in T_j} \nabla f_i(x_k) \qquad (7.4)

with each mini-batch T_j taken sequentially from T, where L = |T_j| is the mini-batch size. Go to step 2 until convergence.
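Here is a minimal Python sketch of this mini-batch loop, following the same illustrative least-squares setup; the batch size and other names are hypothetical.

    import numpy as np

    def minibatch_gd(A, b, eta=0.01, batch_size=32, num_epochs=10):
        """Mini-batch GD: shuffle, split into batches, apply (7.4) (illustrative sketch)."""
        m, n = A.shape
        x = np.zeros(n)
        rng = np.random.default_rng(0)
        for _ in range(num_epochs):
            T = rng.permutation(m)                        # shuffled index set
            for start in range(0, m, batch_size):
                Tj = T[start:start + batch_size]          # one mini-batch of indices
                Aj, bj = A[Tj], b[Tj]
                grad = Aj.T @ (Aj @ x - bj) / len(Tj)     # averaged gradient, L = |T_j|
                x = x - eta * grad                        # iteration (7.4)
        return x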

7.4 Convergence and learning rate

The learning rate plays an important role in convergence; unlike what we learned last quarter, the convergence rate here is sublinear in general.
Under some convexity assumptions, if η_k = c/k for some constant c (so-called diminishing step sizes, meaning η decays at a slow rate), then the convergence (in the expectation sense) is of order O(k^{-1/2}). If you want the expected error to be small, say 10^{-2}, the number of iterations must be roughly of the order 10^4. The proof is not difficult; if you are familiar with expectations, you can read the references for this lecture. The idea is the same as in the proof for gradient descent with a fixed step length from last quarter.
Because of the slow convergence, taking the last iterate x_k as our result might not be a good idea, since it has too much variance. Another popular choice is the averaged iterate \bar{x}_k, the mean of all previous iterates:

\bar{x}_k = \frac{1}{k} \sum_{i=1}^{k} x_i

This is usually more robust, but the convergence is still about O(k^{-1/2}) in the general case. If f is (strongly) convex and its gradient is Lipschitz continuous, then we can get O(k^{-1}) convergence.
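The average can be maintained on the fly with a running-mean update instead of storing all iterates; a short sketch (the variable names are hypothetical):

    # inside the SGD loop, after computing the new iterate x at step k = 1, 2, ...:
    # running mean: x_bar_k = ((k - 1) * x_bar_{k-1} + x_k) / k
    x_bar = x_bar + (x - x_bar) / k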

7.5 Some challenges

Mini-batch GD does not guarantee good convergence in theory, but in practice it performs better. There are some issues with the algorithm:

1. The learning rate (step size) is important, and choosing it is difficult.
2. The adjustment of the learning rate: how to make the learning rate vanish effectively is not clear.
3. The learning rate should change adaptively: different parameters may need different learning rates; this is more of a scaling issue.
4. The nature of the problem, especially non-convexity.

7.6 Annealing the learning rate

The step size or learning rate is not fixed in stochastic gradient descent and its variants. The best practice is to anneal it in a smooth way. There are several common approaches (a code sketch follows the list):

1. Step decay: keep the learning rate unchanged for a number of iterations, then reduce it, and repeat.
2. Exponential decay: choose a relatively small decay rate c and let the learning rate change as a_0 exp(−ck), where k is the iteration number.
3. 1/k decay: this matches the theoretical result; people use a_0/(1 + ck), where k is the iteration number and a_0, c are parameters.
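The three schedules can be written as simple Python functions; the constants a0, c, and the drop period used for step decay are illustrative choices.

    import numpy as np

    def step_decay(k, a0=0.1, drop=0.5, period=100):
        """Halve the learning rate every `period` iterations."""
        return a0 * drop ** (k // period)

    def exp_decay(k, a0=0.1, c=0.001):
        """Exponential decay a0 * exp(-c k)."""
        return a0 * np.exp(-c * k)

    def inv_decay(k, a0=0.1, c=0.001):
        """1/k decay a0 / (1 + c k)."""
        return a0 / (1 + c * k)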

7.7 Acceleration

7.7.1 Momentum

We learned the so-called momentum method last quarter as the heavy-ball method: the trend from the previous step is partially preserved to help convergence.

p_k = \gamma p_{k-1} + \eta \nabla f(x_k), \qquad x_{k+1} = x_k - p_k \qquad (7.5)
where p_{k-1} is the displacement from the previous step, and γ is usually set to 0.9 or a similar value.
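A minimal Python sketch of the momentum update (7.5); grad_f stands for whatever (full or sampled) gradient function is being used and is a hypothetical placeholder.

    def momentum_step(x, p, grad_f, eta=0.01, gamma=0.9):
        """One momentum (heavy-ball) step, iteration (7.5)."""
        p_new = gamma * p + eta * grad_f(x)   # keep part of the previous displacement
        x_new = x - p_new
        return x_new, p_new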

7.7.2 Nesterov

The Nesterov accelerated gradient (NAG) is another version of the momentum method, which makes a small correction in where the gradient is evaluated.
p_k = \gamma p_{k-1} + \eta \nabla f(x_k - \gamma p_{k-1}), \qquad x_{k+1} = x_k - p_k \qquad (7.6)

The good thing about Nesterov is that the heavy-ball method only lets the ball follow the slope, whereas Nesterov looks a little ahead by evaluating the gradient at the predicted point x_k − γp_{k-1}.
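A corresponding sketch of one NAG step (7.6); as before, grad_f is a hypothetical placeholder for the gradient function.

    def nesterov_step(x, p, grad_f, eta=0.01, gamma=0.9):
        """One Nesterov (NAG) step, iteration (7.6): gradient at the look-ahead point."""
        p_new = gamma * p + eta * grad_f(x - gamma * p)   # look-ahead correction
        x_new = x - p_new
        return x_new, p_new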

7.7.3 Adagrad

Adagrad is a way to change the learning rate adaptively. Write the gradient as g_k = ∇f(x_k); then, component-wise, plain gradient descent reads
x_{k+1,i} = x_{k,i} - \eta g_{k,i} \qquad (7.7)
Adagrad instead chooses a different effective learning rate for each parameter:
x_{k+1,i} = x_{k,i} - \frac{\eta}{\sqrt{G_{k,i} + \varepsilon}} \, g_{k,i} \qquad (7.8)

where G_k is a vector whose i-th element is the sum of squares of the gradients with respect to x_i up to step k, and ε is a small smoothing term (say 10^{-8}).
The weakness of Adagrad is the accumulation in the denominator, which may make the learning rate shrink to zero too fast.
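A minimal sketch of the Adagrad update (7.8); grad_f, the initial point x0, and the constants are illustrative assumptions.

    import numpy as np

    def adagrad(x0, grad_f, eta=0.1, eps=1e-8, num_iters=1000):
        """Adagrad: per-coordinate learning rates from accumulated squared gradients."""
        x = x0.copy()
        G = np.zeros_like(x)                      # running sum of squared gradients
        for _ in range(num_iters):
            g = grad_f(x)
            G += g ** 2                           # accumulate g_{k,i}^2 per coordinate
            x = x - eta / np.sqrt(G + eps) * g    # iteration (7.8)
        return x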

7.7.4 Adadelta

This method does essentially one thing: it reduces the sum in the denominator by introducing a decay. It redefines G_k as a running average E[g_k^2], which is again a vector:

E[g_k^2] = \gamma E[g_{k-1}^2] + (1 - \gamma) g_k^2 \qquad (7.9)

with, say, γ = 0.9. Then G_k in (7.8) is replaced by E[g_k^2].
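A sketch of the method as described here, i.e. the Adagrad step (7.8) with the accumulator replaced by the running average (7.9); the function and parameter names are illustrative assumptions.

    import numpy as np

    def adadelta_like(x0, grad_f, eta=0.01, gamma=0.9, eps=1e-8, num_iters=1000):
        """Adagrad-style step with the running average E[g^2] of (7.9) in the denominator."""
        x = x0.copy()
        Eg2 = np.zeros_like(x)                        # running average of squared gradients
        for _ in range(num_iters):
            g = grad_f(x)
            Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2  # update (7.9)
            x = x - eta / np.sqrt(Eg2 + eps) * g      # (7.8) with G_k replaced by E[g^2]
        return x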
