Unit V NNHDL
• The expected generalization error (risk) is taken over the true data-generating
distribution p_data. If we had access to p_data, minimizing the risk would be a pure
optimization problem. When we only have a finite training set rather than p_data, we have a
machine learning problem. The latter can be converted back into an optimization problem by
replacing p_data with the empirical distribution p̂_data obtained from the training set, and
minimizing the resulting empirical risk. This is called empirical risk minimization (ERM):
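The risk and the empirical risk, reconstructed here in the book's notation (L is the per-example loss, f(x; θ) the model's prediction and m the number of training examples):

J*(θ) = E_{(x,y)∼p_data} [ L(f(x; θ), y) ]
J(θ) = E_{(x,y)∼p̂_data} [ L(f(x; θ), y) ] = (1/m) Σᵢ L(f(x⁽ⁱ⁾; θ), y⁽ⁱ⁾)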
Although this looks similar to plain optimization, there are two main problems. Firstly, ERM
is prone to overfitting, since high-capacity models (models with the ability to learn extremely
complex functions) can simply memorize the training set. Secondly, ERM might not be
feasible. Most optimization algorithms are based on Gradient Descent (GD), which requires a
derivative and hence may not work with certain loss functions like the 0–1 loss (which is not
differentiable).
• It is for the reasons mentioned above that a surrogate loss function (SLF) is used
instead, which acts as a proxy. For example, the negative log-likelihood of the true class is
used as a surrogate for the 0–1 loss. The code sketch below illustrates why the 0–1 loss
won't work with Gradient Descent while cross-entropy, being a smooth function, would.
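A minimal sketch (not the original snippet) that compares numerical derivatives of the two losses with respect to the raw prediction score z, for true class y = 1:

```python
import numpy as np

def zero_one_loss(z, y=1.0):
    p = 1.0 / (1.0 + np.exp(-z))       # turn the raw score into a probability
    pred = float(p > 0.5)              # hard thresholding at 0.5
    return float(pred != y)            # 1 if wrong, 0 if right

def cross_entropy_loss(z, y=1.0):
    p = 1.0 / (1.0 + np.exp(-z))
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def numerical_grad(loss_fn, z, eps=1e-5):
    return (loss_fn(z + eps) - loss_fn(z - eps)) / (2 * eps)

for z in [-2.0, 0.0, 0.6, 2.0]:
    print(f"z={z:+.1f}  d(0-1)/dz={numerical_grad(zero_one_loss, z):.1f}"
          f"  d(CE)/dz={numerical_grad(cross_entropy_loss, z):+.3f}")
# The 0-1 loss is flat almost everywhere (zero derivative) with an unusable spike
# at the decision boundary, so GD gets no signal; cross-entropy gives a smooth,
# informative gradient everywhere.
```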
It can be seen that the 0–1 loss is a non-differentiable function and hence, not compatible
with gradient-based algorithms like Gradient Descent. Cross-entropy is a smooth
approximation of the 0–1 loss.
Using an SLF might even turn out to be beneficial: the test error can keep improving because
the classifier keeps pushing the classes further apart, yielding a more reliable classifier. To see
this, suppose we were using the 0–1 loss with a threshold of, say, 0.5 to assign each class. If
the true class is 1, the model has no incentive to push the prediction score closer to 1 once it
is above 0.5. With cross-entropy, however, since the raw prediction scores are used, the model
keeps pushing the prediction closer to the true class.
• There are 3 types of sampling-based algorithms: batch gradient descent (BGD),
where the entire training set is used to make a single update; stochastic gradient
descent (SGD), where a single example is used to make a weight update; and mini-
batch gradient descent (MBGD), where a batch (not to be confused with BGD) of
examples is randomly sampled from the entire training set and used to make an
update (see the sketch after the notes below). Mini-batch GD is nowadays commonly
referred to as Stochastic GD.
Taken from Andrew Ng’s Machine Learning course. Here, as mentioned above, mini-batch
gradient descent has been presented as stochastic gradient descent.
It is a common practice to use batch sizes that are powers of 2, which gives better run-time on
certain hardware. Small batches tend to have a regularizing effect because of the noise they
inject, as each update is made after seeing only a very small portion of the entire training set,
i.e. a batch of samples.
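A minimal sketch of the mini-batch update loop, assuming a toy linear model with squared-error loss (the data and hyper-parameters here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
w = np.zeros(5)
lr, batch_size = 0.01, 32               # powers of 2 often map well onto hardware

def gradient(w, Xb, yb):
    # gradient of 0.5 * mean squared error for a linear model
    return Xb.T @ (Xb @ w - yb) / len(yb)

for epoch in range(10):
    perm = rng.permutation(len(X))      # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        w -= lr * gradient(w, X[idx], y[idx])   # mini-batch update

# BGD would instead use gradient(w, X, y) once per epoch;
# "pure" SGD would use batch_size = 1.
```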
• The mini-batches should be selected randomly. It is sufficient to shuffle the dataset
once and then iterate over it multiple times. In the first epoch, the network sees each
example for the first time and hence the estimate of the gradient is an unbiased estimate
of the gradient of the true generalization error. From the second epoch onward, however,
the estimate becomes biased because it is re-sampling from data the network has already
seen. Such a sampling scheme is called Random Reshuffling, and although its analysis
even for generalized linear models (which are strongly convex) is still an open problem,
reasonable efforts have been made to show that this biased estimate of the gradient is
decent enough (at least as good as SGD with i.i.d. sampling).
• Ill-conditioning of the Hessian Matrix: The Hessian matrix and the condition number
were covered in our summary for Chapter 4. For the sake of completeness, the
Hessian matrix H of a function f with a vector-valued input x is given as:
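In index notation (the standard definition), the Hessian collects all second-order partial derivatives of f:

H(f)(x)ᵢⱼ = ∂²f / (∂xᵢ ∂xⱼ)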
Source: https://en.wikipedia.org/wiki/Hessian_matrix
The ill-conditioning of H (defined below) can cause SGD to get stuck, in the sense that even
very small steps may increase the cost function. A more rigorous explanation of this,
contributed by Raghav Somani, is given next.
If local convexity is not assumed, it is difficult to tell, using only first-order optimization
methods, whether we are converging to a saddle point or to a local minimum.
In Chapter 4, it was shown that taking a step of ϵg along the negative gradient adds the term
given below to the cost function:
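The relevant second-order Taylor expansion, reconstructed here (g is the gradient, H the Hessian and ϵ the learning rate):

f(x₀ − ϵg) ≈ f(x₀) − ϵ g'g + ½ ϵ² g'Hg

so the added term is ½ ϵ² g'Hg − ϵ g'g.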
Ill-conditioning is said to occur when the first term (½ ϵ² g'Hg) exceeds the second (ϵ g'g),
since then the cost would be increasing. In many cases the gradient is large, leading to a large
squared gradient norm g'g; however, g'Hg can be even larger. This leads to slower learning,
as we need to reduce the learning rate to make the first term smaller than the second. To
clarify: since the first term contains ϵ², and ϵ is less than 1, ϵ² < ϵ. So, to prevent
ill-conditioning when g'Hg > g'g, the only option is to reduce the learning rate, which slows
learning. Thus, although ideally the gradient norm should decrease during training (since our
primary aim is to reach a minimum where the gradient is 0), we can still get successful
training even with the gradient norm increasing, as shown below:
• Local minima: Nearly any Deep Learning (DL) model is guaranteed to have an
extremely large number of local minima (LM), arising due to the model
identifiability problem.
There might be many local minima where the model may get stuck due to zero gradient.
Source: http://www.yaldex.com/game-development/1592730043_ch18lev1sec4.html
A model is said to be identifiable if a sufficiently large training set can rule out all but
one setting of the model parameters. In case of neural networks, we can obtain equivalent
models by swapping the position of the neurons. Thus, they are not identifiable.
Swapping the two hidden nodes leads to equivalent models. Thus, even after having a
sufficiently large training set, there is not a unique setting of parameters. This is the model
identifiability problem that neural networks suffer from.
However, all the local minima caused by non-identifiability have the same value of the cost
function and are therefore not a problem. On the other hand, if local minima with high cost
were common, it would become a serious problem, as shown above. Many points other than
local minima can also lead to low gradients. Nowadays, it is common to aim for a low, but not
minimal, cost value.
• Plateaus, Saddle Points and Other Flat Regions: A saddle point (SP) is another type of
point with zero gradient, where some of the surrounding points have higher value and
others have lower. Intuitively, this means that a saddle point acts as a local minimum
along some directions and a local maximum along others. Thus, the Hessian at an SP has
both positive and negative eigenvalues (a very good explanation for this can be
found here. TL;DR: for a function to curve upwards or downwards around a point,
as in the case of a local minimum or a local maximum, the eigenvalues should all have
the same sign, positive for a local minimum and negative for a local maximum).
It’s called a saddle point as it looks like the saddle of a horse. Source: deeplearning.ai. I know
Andrew Ng is awesome ;)
For many classes of random functions, saddle points become more common in high
dimensions, with the ratio of the number of SPs to LMs growing exponentially with n for an
n-dimensional space. Many random functions also have the amazing property that near points
with low cost, the Hessian tends to have mostly positive eigenvalues. Empirically, SGD tends
to rapidly escape high-cost saddle points; for this, it needs sufficiently high stochastic variance
along all directions. Stochastic gradient Langevin dynamics is a family of SGD-based
algorithms that add white noise to SGD in order to escape local minima and saddle points
with high probability under mild conditions.
There also might be wide, flat regions of constant value, thereby having a zero gradient.
These can be problematic if the cost is high in these regions.
It is problematic to get stuck in a plateau where the value of the cost function is high.
Source: https://en.wikipedia.org/wiki/Local_search_(constraint_satisfaction)
• Cliffs and Exploding Gradients: Neural Networks (NNs) can sometimes have
extremely steep regions resembling cliffs, caused by the repeated multiplication of
weights. Suppose we use a 3-layer (input-hidden-output) neural network with all
activation functions linear and the same number of input, hidden and output neurons,
so that the same weight matrix W can be used for each layer. The output is y = W·h,
where h = W·x is the hidden layer, finally giving y = W·W·x = W²x. So, deep
neural networks involve the multiplication of a large number of parameters, leading to
sharp non-linearities in the parameter space. These non-linearities give rise to very high
gradients in some places, and at the edge of such a cliff an update step might throw the
parameters extremely far.
Image depicting the problem of exploding gradients when approaching a cliff. 1) Usual
training going on with the parameters moving towards the lower cost region. 2) The gradient
at the bottom left-most point pointed downwards (correct direction) but the step-size was too
large, which caused the parameters to land at a point having large cost value. 3) The gradient
at this new point moved the parameters in a completely different position undoing most of the
training done until that point.
This can be taken care of by using Gradient Clipping (GC). The gradient indicates only the
direction in which to make the update; if the GD update proposes a very large step, GC
intervenes to reduce the step size, as in the sketch below.
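A minimal sketch of gradient clipping by norm (the threshold value is illustrative, not from the book): the step keeps its direction but its length is capped.

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)   # keep the direction, shrink the step
    return grad

g = np.array([30.0, -40.0])               # a "cliff" gradient with norm 50
print(clip_by_norm(g, max_norm=1.0))      # -> [ 0.6 -0.8 ], same direction, norm 1
```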
To see where exploding and vanishing gradients come from, suppose the same weight matrix
W is applied repeatedly, say t times. Assuming W has the eigendecomposition
W = V diag(λ) V⁻¹, we get Wᵗ = V diag(λ)ᵗ V⁻¹. Thus, any eigenvalues whose absolute value
is not near one either explode or vanish when raised to the power t, leading to the Vanishing
and Exploding Gradient problem. The use of the same weight matrix at every step is exactly
what happens in Recurrent NNs (RNNs), where this is a serious problem.
Values even slightly above or below 1 explode or vanish, respectively, upon being
compounded. You might have seen this poster in a separate context.
• Inexact Gradients: Most optimization algorithms use a noisy/biased estimate of the
gradient, either because the estimate is based on sampling or because the true
gradient is intractable. For example, when training a Restricted Boltzmann
Machine (RBM), an approximation of the gradient is used: the contrastive
divergence algorithm gives a technique for approximating the gradient of its
intractable log-likelihood.
• Neural networks might not end up at any critical point at all, and such critical points
might not even necessarily exist. Many of these problems can be avoided if there exists
a region connected reasonably directly to the solution by a path that local descent can
follow, and if we are able to initialize learning within that well-behaved region. Thus,
choosing good initial points is an important subject of study.
3. Basic Algorithms
• Stochastic Gradient Descent: This has already been described, but there are
certain things to keep in mind regarding SGD. The learning rate ϵ is a very
important parameter for SGD and should, in general, be gradually reduced over the
course of training. This is because the random sampling of mini-batches acts as a source
of noise that can make SGD keep oscillating around the minimum without actually
reaching it. This is shown below:
Source: https://goo.gl/tq6Xof
The true gradient of the total cost function (involving the entire dataset) actually becomes
0 when we reach the minimum, which is why BGD can use a fixed learning rate. The
following conditions on the learning rates guarantee convergence under convexity
assumptions in the case of SGD:
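The standard sufficient conditions on the sequence of learning rates ϵ_k, reconstructed here:

Σ_{k=1}^{∞} ϵ_k = ∞   and   Σ_{k=1}^{∞} ϵ_k² < ∞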
Setting the learning rate too low makes training proceed slowly and might leave the algorithm
stuck at a high cost value. Setting it too high can lead to large oscillations that might even
push the parameters outside the optimal region. A good heuristic is to monitor the first several
iterations and pick a learning rate higher than the best-performing one over that window, but
not so high as to cause instability.
Source: https://www.jeremyjordan.me/nn-learning-rate/
A big advantage of SGD is that the time taken to compute a weight update does not grow with
the number of training examples, since each update is computed from a batch of samples
whose size is independent of the total number of training examples. Theoretically, for a
convex problem, BGD reduces the excess error at a rate of O(1/k) after k iterations, whereas
SGD only achieves O(1/√k). However, SGD compensates for this with the rapid progress it
makes after just a few updates, and with its ability to make quick updates even when the
training set is very large.
• Momentum: With momentum, the step size (which earlier equalled learning rate ×
gradient) now depends on how large and how aligned the sequence of recent gradients is.
If the gradient at each iteration points in the same direction (say g), the contributions keep
accumulating, leading to a larger step size. Once the velocity reaches a constant (terminal)
value, the step size becomes ϵ||g|| / (1 − α). Thus, using α = 0.9 makes the maximum speed
10 times that of plain gradient descent. Common values of α are 0.5, 0.9 and 0.99.
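For reference, the standard momentum update, reconstructed here (v is the velocity, g the gradient of the mini-batch cost):

v ← α v − ϵ g
θ ← θ + v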
Viewing it as the Newtonian dynamics of a particle sliding down a hill, the momentum
algorithm consists of solving a set of differential equations via numerical simulation. There
are two kinds of forces involved as shown below:
Momentum can be seen as two forces operating together: 1) a force proportional to the
negative of the gradient, so that whenever the particle descends a steep part of the surface it
gathers speed and continues sliding in that direction until it goes uphill again; and 2) a viscous
drag force (friction) proportional to −v(t), without which the particle would keep oscillating
back and forth, as the negative of the gradient would keep forcing it to move downhill. A
viscous force is suitable because it is weak enough to allow the gradient to cause motion, yet
strong enough to resist any motion when the gradient does not justify moving.
Read more about momentum in this excellent blog post on Distill: Why Momentum Really
Works.
The intuition behind Nesterov momentum is that upon being at a point θ in the parameter
space, the momentum update is going to shift the point by αv. So, we are soon going to end
up in the vicinity of (θ + αv). Thus, it might be better to compute the gradient from that point
onward. The figure below describes this visually:
Source: http://cs231n.github.io/neural-networks-3/#sgd
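In equations, the standard Nesterov momentum update (reconstructed here) evaluates the gradient at the look-ahead point:

v ← α v − ϵ ∇θ J(θ + α v)
θ ← θ + v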
This concludes the 1st part of our summary for Chapter 8. We hope that this gave you a good
intuition and understanding on the fundamental concepts behind Optimization used in Deep
Learning. Part II of the summary would contain more advanced concepts like algorithms with
adaptive learning rates and batch-normalization which are also fundamental in making Deep
Learning practically applicable to real-world problems. To stay updated with our
posts, follow our publication, Inveterate Learner. If you’re looking for the summary of
previous chapters or are more comfortable reading from a Jupyter Notebook, feel free to have
a look over our repository here. Finally, I thank my co-author for the publication, Ameya
Godbole, for reviewing the post and suggesting many important areas for further explanations
along with several nitpicks. A special thanks to Raghav Somani for providing deeper
mathematical insights and clarity in a lot of places that gave me (and I hope everyone else
too) a better understanding of the concepts.
Source: https://towardsdatascience.com/random-initialization-for-neural-networks-a-thing-of-
the-past-bfcdd806bf9e
We have a limited understanding of neural network optimization, but the one property we
know with complete certainty is that the initialization should break symmetry. This means
that if two hidden units are connected to the same input units, they should have different
initializations; otherwise the gradient would update both units in exactly the same way and
we would learn nothing new from the additional unit (a small numerical sketch of this follows
below). The idea of having each unit learn something different motivates random initialization
of the weights, which is also computationally cheaper.
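A minimal sketch (a toy 2-unit hidden layer with sigmoid activations and squared-error loss; all values are illustrative) showing that identically initialized units receive identical updates and never differentiate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = np.array([1.0, 2.0]), 1.0
W1 = np.ones((2, 2)) * 0.5          # both hidden units initialized identically
w2 = np.array([0.5, 0.5])

for step in range(100):
    h = sigmoid(W1 @ x)             # hidden activations (identical for both units)
    y_hat = w2 @ h
    # gradients of 0.5 * (y_hat - y)^2
    dh = (y_hat - y) * w2
    dW1 = np.outer(dh * h * (1 - h), x)
    W1 -= 0.1 * dW1
    w2 -= 0.1 * (y_hat - y) * h

print(W1)                            # both rows are still identical: symmetry never broke
```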
Biases are often chosen heuristically (zero mostly) and only the weights are randomly
initialized, almost always from a Gaussian or uniform distribution. The scale of the
distribution is of utmost concern. Large weights might have better symmetry-breaking effect
but might lead to chaos (extreme sensitivity to small perturbations in the input) and exploding
values during forward & back propagation. As an example of how large weights can lead to
chaos, suppose a slight noise ϵ is added to the input. If we apply just a simple linear
transformation W * x, the noise contributes an extra term W * ϵ to the output; when the
weights are large, this term makes a significant contribution to the output. SGD and
its variants tend to halt in areas near the initial values, thereby expressing a prior that the path
to the final parameters from the initial values is discoverable by steepest descent algorithms.
A more mathematical explanation for the symmetry breaking can be found in the Appendix.
Various suggestions have been made for appropriate initialization of the parameters. The
most commonly used ones include sampling the weights of each fully-connected layer
having m inputs and n outputs uniformly from the following distributions:
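The two distributions being referred to, reconstructed from the book (the second is the Glorot/Xavier uniform initializer), sample each weight as:

W ∼ U( −1/√m , 1/√m )        or        W ∼ U( −√(6/(m+n)) , √(6/(m+n)) )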
U(a, b) represents the uniform distribution, whose probability density is 1/(b − a) for every
value between a and b (inclusive) and 0 for every other value.
These initializations have already been incorporated into the most commonly used Deep
Learning frameworks, so you can simply specify which initializer to use and the framework
takes care of sampling appropriately. For example Keras, a very popular deep learning
framework, has a module called initializers, where the second distribution (of the two
mentioned above) is implemented as glorot_uniform, as sketched below.
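A minimal Keras sketch of picking an initializer (the layer size is arbitrary; glorot_uniform is already the default for Dense layers, so spelling it out here is only for illustration):

```python
from tensorflow import keras

layer = keras.layers.Dense(
    units=64,
    kernel_initializer="glorot_uniform",   # the second distribution mentioned above
    bias_initializer="zeros",              # biases heuristically set to zero
)
```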
One drawback of using 1/√m as the standard deviation is that the weights end up being very
small when a layer has many input/output units. Motivated by the idea of keeping the total
amount of input to each unit independent of the number of input units m, sparse
initialization sets each unit to have exactly k non-zero weights. However, GD takes a long
time to correct incorrectly large values, so this initialization can cause problems.
If the weights are too small, the range of activations across the mini-batch will shrink as the
activations propagate forward through the network. By repeatedly identifying the first layer
with unacceptably small activations and increasing its weights, it is possible to eventually
obtain a network with reasonable initial activations throughout.
The biases are relatively easier to choose. Setting the biases to zero is compatible with most
weight initialization schemes, except in a few cases, e.g. for an output unit (to prevent
saturation at initialization) or when a unit acts as a gate for making a decision. Refer to
the chapter for details.
Figure explaining the problem with AdaGrad. Accumulated gradients can cause the learning
rate to be reduced far too much in the later stages leading to slower learning.
Suppose the gradient of one of the parameters has reduced to 100 while that of the other
parameter is still around 750. However, because of the accumulation at every update, the
accumulated gradients would still have almost the same value. For example, let the
accumulated gradients over the steps be, for Parameter 1: 1000 + 900 + 700 + 400 + 100 =
3100, giving a scaling of roughly 1/3100 ≈ 0.0003, and for Parameter 2: 1000 + 900 + 850 +
800 + 750 = 4300, giving roughly 1/4300 ≈ 0.0002. This leads to a similar decrease in the
learning rates for both parameters, even though the parameter with the smaller current
gradient may have its learning rate reduced far too much, leading to slower learning.
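For reference, the AdaGrad accumulation and update rules in standard form, reconstructed here (δ is a small constant for numerical stability, ⊙ an element-wise product):

r ← r + g ⊙ g
Δθ = −( ϵ / (δ + √r) ) ⊙ g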
• RMSProp: RMSProp addresses the problem caused by accumulated gradients in
AdaGrad. It modifies the gradient accumulation step to an exponentially weighted
moving average in order to discard history from the extreme past. The RMSProp
update is given by:
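Reconstructed in the same notation, the RMSProp rules are:

r ← ρ r + (1 − ρ) g ⊙ g
Δθ = −( ϵ / √(δ + r) ) ⊙ g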
ρ is the weighting used for the exponential averaging. As more updates are made, the
contribution of past gradient values shrinks, since ρ < 1 and ρ > ρ² > ρ³ > …
This allows the algorithm to converge rapidly after finding a convex bowl, as if it were an
instance of AdaGrad initialized within that bowl. Let me explain why this is so. Consider the
figure below. The region represented by 1 indicates usual RMSProp parameter updates as
given by the update equation, which is nothing but exponentially averaged AdaGrad updates.
Once the optimization process lands on A, it essentially lands at the top of a convex bowl. At
this point, intuitively, all the updates before A can be seen to be forgotten due to the
exponential averaging and it can be seen as if (exponentially averaged) AdaGrad updates start
from point A onwards.
Intuition behind RMSProp. 1) Usual parameter updates 2) Once it reaches the convex bowl,
exponentially weighted averaging would cause the effect of earlier gradients to reduce and to
simplify, we can assume their contribution to be zero. This can be seen as if AdaGrad had
been used with the training initiated inside the convex bowl
The figure below shows the comparison between the various optimization methods discussed
above. It can be clearly seen that algorithms with adaptive learning rates provide faster
convergence:
NAG here refers to Nesterov Accelerated Gradient which is the same as Nesterov
Momentum. Source: http://ruder.io/optimizing-gradient-descent/index.html#adam
• Newton's Method: Newton's method rescales the gradient by the inverse Hessian,
updating the parameters as θ ← θ − H⁻¹g. For quadratic surfaces (i.e. where the cost
function is quadratic), this directly gives the optimal result in a single step, whereas
gradient descent would still need to iterate. For surfaces that are not quadratic, as long as
the Hessian remains positive definite, the optimal point can be approached by iterating a
2-step process: 1) compute the inverse of the Hessian and 2) update the parameters.
Saddle points are problematic for Newton's method: if not all eigenvalues of the Hessian are
positive, Newton's method may move the update in the wrong direction. A way to avoid
this is to add regularization:
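The regularized update, reconstructed here in standard form:

θ ← θ − [ H + αI ]⁻¹ g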
However, if there is strong negative curvature, i.e. the eigenvalues are very negative, α needs
to be sufficiently large to offset them, in which case the Hessian becomes dominated by the
αI diagonal term. The update then reduces to the standard gradient divided by α, i.e. roughly
θ ← θ − g/α.
Another problem restricting the use of Newton’s method is the computational cost. It
takes O(k³) time to calculate the inverse of the Hessian where k is the number of parameters.
It’s not uncommon for Deep Neural Networks to have about a million parameters and since
the parameters are updated every iteration, this inverse needs to be calculated at every
iteration, which is not computationally feasible.
• Conjugate Gradients: One weakness of the method of steepest descent (i.e. GD) is
that line searches happen along the direction of the gradient. Suppose the previous
search direction is d(t-1). Once the line search terminates (which happens when the gradient
along the current search direction vanishes) at the minimum, the next search
direction d(t) is given by the gradient at that point, which is orthogonal to d(t-1)
(if it were not orthogonal, it would have some component along d(t-1), which cannot
be the case, since at the minimum the gradient along d(t-1) has vanished).
Upon getting the minimum along the current search direction, the minimum along the
previous search direction is not preserved, undoing, in a sense, the progress made in
previous search direction.
In the method of conjugate gradients, we seek a search direction that is conjugate to the
previous line search direction:
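The new search direction, reconstructed here in standard form (β_t is a scalar):

d(t) = ∇θ J(θ) + β_t d(t-1)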
Now, the previous search direction contributes towards finding the next search direction:
d(t) and d(t-1) are conjugate if d(t)' H d(t-1) = 0, and β_t decides how much of d(t-1) is
added back to the current search direction. There are two popular choices for β_t: Fletcher-
Reeves and Polak-Ribière. This discussion assumed the cost function to be quadratic, in which
case the conjugate directions ensure that the gradient along the previous direction does not
increase in magnitude. To extend the method to training neural networks, one additional
change is needed: since the cost function is no longer quadratic, there is no guarantee that the
conjugate direction preserves the minimum along the previous search directions. Thus, the
algorithm includes occasional resets in which the method of conjugate gradients is restarted
with a line search along the unaltered gradient.
• BFGS: This algorithm tries to bring the advantages of Newton’s method without the
additional computational burden by approximating the inverse of H by M(t), which is
iteratively refined using low-rank updates. Finally, line search is conducted along the
direction M(t)g(t). However, BFGS requires storing the matrix M(t), which takes O(n²)
memory, making it infeasible for models with very many parameters. An approach called
Limited-Memory BFGS (L-BFGS) tackles this infeasibility by computing the matrix M(t)
using the same method as BFGS but assuming that M(t−1) is the identity matrix.
Going back to the earlier example of y*, let the activations of layer l−1 be given by
h(l-1) = x W1 W2 … W(l-1). Now, if x is drawn from a unit Gaussian, then h(l-1) also comes
from a Gaussian, though not necessarily one with zero mean and unit variance, as it is a linear
transformation of x. BN makes it zero mean and unit variance. Therefore, with
y* = Wl h(l-1), learning now becomes much simpler, as the parameters at the lower layers
mostly do not have any effect. This simplicity was, of course, achieved by rendering the lower
layers useless; in a realistic deep network with non-linearities, the lower layers remain useful.
Finally, the complete reparameterization of BN is given by replacing H with γH' + β. This is
done to retain the expressive power of the network, with the mean of the reparameterized
activations now determined solely by β. Also, among the choice of normalizing X or XW + B,
the authors recommend the latter, specifically XW, since B becomes redundant because of β.
Practically, this means that when using a Batch Normalization layer, the biases of the
preceding layer should be turned off. In a deep learning framework like Keras, this can be
done by setting use_bias=False in the convolutional layer, as in the sketch below.
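A minimal Keras sketch of that practical point (the layer sizes and input shape are arbitrary): the convolution's bias is turned off because BN's β plays that role.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(32, kernel_size=3, use_bias=False),  # bias redundant before BN
    keras.layers.BatchNormalization(),   # learns gamma (scale) and beta (shift)
    keras.layers.Activation("relu"),
])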
• Coordinate Descent: Generally, a single weight update is made by taking the gradient
with respect to every parameter. However, in cases where some of the parameters
might be independent (discussed below) of the remaining, it might be more efficient
to take the gradient with respect to those independent sets of parameters separately for
making updates. Let me clarify that with an example. Suppose we have the following
cost function:
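The sparse coding cost function being referred to, reconstructed from the book's notation (the prime denotes transpose, as elsewhere in this summary):

J(H, W) = Σᵢⱼ |Hᵢⱼ| + Σᵢⱼ ( X − W'H )²ᵢⱼ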
This cost function describes the learning problem called sparse coding. Here, H refers to the
sparse representation of X and W is the set of weights used to linearly decode H to retrieve X.
An explanation of why this cost function enforces the learning of a sparse representation
of X follows. The first term penalizes values of H that are far from 0 (positive or negative,
because of the absolute value |H|), forcing most of the values to be 0, and hence sparse. The
second term penalizes the difference between X and the linear decoding of H by W, thereby
forcing the two to take similar values. In this way, H is learned as a sparse “representation”
of X. The cost function would generally also include a regularization term like weight decay,
which has been omitted here for simplicity. Here, we can divide the entire list of parameters into two
sets, W and H. Minimizing the cost function with respect to any of these sets of parameters is
a convex problem. Coordinate Descent (CD) refers to minimizing the cost function with
respect to only one parameter at a time. It has been shown that by repeatedly cycling through
all the parameters, we are guaranteed to arrive at a (local) minimum. If, instead of one
parameter, we optimize over a set of parameters at a time, as we did above with W and H, it is
called block coordinate descent (the interested reader should explore Alternating
Minimization). CD makes sense if
either the parameters are clearly separable into independent groups or if optimizing with
respect to certain set of parameters is more efficient than with respect to others.
The points A, B, C and D indicates the locations in the parameter space where coordinate
descent landed after each gradient step.
Source: https://www.researchgate.net/figure/Coordinate-Descent-CD-CD-algorithm-searches-
along-one-coordinate-direction-in-each_fig2_262805949
Coordinate descent may fail terribly when one variable influences the optimal value of
another variable.
• Polyak Averaging: Polyak averaging consists of averaging several points in the
parameter space that the optimization algorithm traverses through. So, if the algorithm
encounters the points θ(1), θ(2), … during optimization, the output of Polyak
averaging is:
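The averaged iterate, reconstructed here in standard form:

θ̂(t) = (1/t) Σᵢ θ(i),   summing over i = 1 … t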
Most optimization problems in deep learning are non-convex: the path taken by the
optimization algorithm is quite complicated, and a point visited in the distant past may lie
quite far from the current point in parameter space. Including such a distant point in the
average may therefore not be useful, which is why an exponentially decaying running average
is used instead. This scheme, where recent iterates are weighted more heavily than past ones,
is referred to here as Polyak-Ruppert averaging:
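The exponentially decaying running average, reconstructed here in standard form (α < 1 controls how quickly old iterates are forgotten):

θ̂(t) = α θ̂(t−1) + (1 − α) θ(t)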
• Supervised Pre-training: Sometimes it’s hard to directly train to solve for a specific
task. Instead it might be better to train for solving a simpler task and use that as an
initialization point for training to solve the more challenging task. As an intuition for
why this seems logical, consider that you didn’t have any background in integration
and were asked to learn how to compute the following integral:
If you’re anyone close to a normal person, your first reaction would be:
Source: https://imgflip.com/i/gdnbg
However, wouldn’t it be better if you were asked to understand the more basic integrations
first:
I hope you understand what I meant with this example: learning a simpler task puts you in a
better position to tackle the more complex one. This particular strategy of training to solve a
simpler task before facing the herculean one is called pretraining. A particular type of
pretraining, called greedy supervised pretraining, first breaks a given supervised learning
problem into simpler supervised learning problems and then solves for the optimal version of
each component in isolation. Building on the above intuition, the hypothesis for why this
works is that it gives better guidance to the intermediate layers of the network and helps in
both generalization and optimization. More often than not, greedy pretraining is followed by a
fine-tuning stage in which all the parts are jointly optimized to search for the optimal solution
to the full problem. As an example, the figure below shows how each hidden layer is trained
one at a time, where the input to the hidden layer currently being trained is the output of the
previously trained hidden layer.
Greedy supervised pretraining (a) The first hidden layer is being trained only using the
original inputs and outputs. (b) For training the second hidden layer, the hidden-output
connection from the first hidden layer is removed and the output of the first hidden layer is
used as the input.
Also, FitNets shows an alternative way to guide the training process. Deep networks are hard
to train mainly because the deeper the model gets, the more non-linearities are introduced. The
authors propose first training a shallower and wider teacher network. Then a second network,
which is thinner and deeper, called the student network, is trained to predict not only the final
outputs but also the values of the intermediate layers of the teacher network. For those who
are unsure what deep, shallow, wide and thin mean here, refer to the following diagram:
Explanation of the terms “shallow”, “deep”, “thin” and “wide” in the context of neural
networks.
The idea is that predicting the intermediate layers of the teacher network provides hints as to
how the layers of the student network should be used, and this aids the optimization
procedure. It was shown that without these hints to the hidden layers, the student network
performs poorly on both the training and the test data.
• Designing Models to Aid Optimization: Most of the work in deep learning has been
towards making the models easier to optimize rather than designing a more powerful
optimization algorithm. This is evident from the fact that Stochastic Gradient
Descent, which is primarily used for training deep models today, has been in use since
the 1960s. Many of the current design choices lean towards using linear
transformations between layers and activation functions like ReLU [max(0, x)],
which are linear over most of their domain and enjoy large gradients compared to sigmoidal
units, which saturate easily. Also, a linear function keeps increasing in a particular direction;
thus, if there is an error, there is a clear direction in which the output should move to
reduce it.
Source: https://arxiv.org/pdf/1512.03385.pdf
Residual connections reduce the length of the shortest path from the output to the lower
layers, thereby allowing a larger gradient to flow through and hence, tackling the vanishing
gradient problem. Similarly, GoogLeNet attached auxiliary copies of the output head to
intermediate layers so that a larger gradient flows to those layers. Once training is complete,
these auxiliary heads are removed.
Inception architecture with the outputs at the intermediate layers being marked.