Optimization
What are optimizers?
• Optimizers and loss functions are two fundamental components that
help improve a model’s performance.
• A loss function evaluates a model's effectiveness by computing the
difference between predicted and expected outputs. Common loss
functions include log loss, hinge loss, and mean squared error.
• An optimizer improves the model by adjusting its parameters
(weights and biases) to minimize the loss function value. Examples
include RMSProp, ADAM and SGD (Stochastic Gradient Descent).
• The optimizer’s role is to find the best combination of weights and
biases that leads to the most accurate predictions.
Why Do We Need Optimizers?
• Imagine you’re blindfolded on a hill and want to reach the lowest
point (valley).
• You take small steps downhill, adjusting direction as you go.
• This is optimization in deep learning.
• In deep learning:
▪ The hill = Loss function (error we want to minimize).
▪ The steps = Updates to weights.
▪ The method of deciding steps = Optimizer.
Important Terms
• Epoch – One complete pass of the algorithm over the whole training
dataset.
• Sample – A single row of a dataset.
• Batch – The number of samples used for one update of the model
parameters.
• Learning rate – A parameter that controls how much the model weights
are updated at each step.
• Cost Function/Loss Function – A function used to calculate the cost,
i.e., the difference between the predicted value and the actual value.
• Weights/Bias – The learnable parameters in a model that control the
signal between two neurons.
What is a Gradient?
• On a curve, the slope tells us how steep it is and in which direction it goes (up or down).
• At every point on a curve, we can measure its slope. That slope
tells us whether we are going uphill or downhill.
• The gradient is just the slope of the curve (i.e., of the function).
Gradient Descent (GD) optimizer
• Gradient = slope = direction of steepest increase.
• We want to train our neural network by reducing its error (loss).
• This is like finding the bottom of a valley.
• Since we want to minimize the error, we move opposite to the gradient (a minimal sketch follows below).
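As a toy illustration of this downhill rule (our own example, not from the original material): one gradient descent run on the loss J(θ) = θ², whose gradient is 2θ.

```python
# Toy gradient descent on J(theta) = theta^2 (gradient: 2*theta).
# Starting point and learning rate are illustrative choices.
theta = 5.0   # initial parameter
eta = 0.1     # learning rate

for step in range(50):
    grad = 2 * theta            # slope of the loss at the current theta
    theta = theta - eta * grad  # step opposite to the gradient

print(theta)  # close to 0: the bottom of the "valley"
```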
Advantages and disadvantages of Gradient Descent
• Gradient Descent (also called Batch Gradient Descent) updates weights after computing gradients on the entire
dataset.
• If we have millions of records, training becomes slow and
computationally very expensive.
• Pros:
1) Easy to understand
2) Easy to implement
3) Stable and accurate convergence.
• Cons:
1) Because this method calculates the gradient for the entire dataset in one
update, the computation is very slow.
2) It requires large memory and is computationally expensive for large datasets.
Stochastic Gradient Descent (SGD)
• It is a variant of Gradient Descent that updates the model parameters
one sample at a time. If the dataset has 10K records, SGD updates the
model parameters 10K times per epoch.
• SGD is slow to converge because it needs a forward and a backward
propagation for every record.
• The path to the global minimum also becomes very noisy (a sketch of the per-sample loop follows below).
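A minimal sketch of that per-sample loop (a hypothetical toy problem of our own: fitting y ≈ w·x):

```python
import random

# SGD sketch: fit y ~ w*x on toy data, updating w after EVERY sample.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]  # (x, y) pairs
w, eta = 0.0, 0.01

for epoch in range(100):
    random.shuffle(data)            # visit samples in random order
    for x, y in data:               # one parameter update PER sample
        grad = 2 * (w * x - y) * x  # d/dw of the squared error (w*x - y)^2
        w = w - eta * grad          # noisy step based on a single record

print(w)  # close to 2.0, but the path there is noisy
```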
Stochastic Gradient Descent (SGD)
• Advantages of Stochastic Gradient Descent
Frequent updates of the model parameters.
Requires less memory.
Allows the use of large datasets, as it processes only one example at a time.
Faster updates help escape local minima.
• Disadvantages of Stochastic Gradient Descent
The frequent updates produce noisy gradients, which may cause the error
to increase instead of decrease.
High variance in the updates.
Frequent updates are computationally expensive.
Local Minima/Global Minima
• Imagine a landscape with many valleys and hills.
• Some valleys are shallow, some are deeper.
• The deepest valley of all = Global Minimum.
• The smaller valleys = Local Minima.
Definition
• Local Minimum: A point where the function has a smaller value than
nearby points (a small dip/valley), but not necessarily the smallest value
overall.
• Global Minimum: The absolute lowest point of the function over the entire
domain.
Learning Rate in Gradient Descent
• Update rule: θ = θ − η·∇J(θ)
• Here, θ (theta) means the parameters (i.e., weights and biases).
• J(θ) is the loss function/cost function.
• η is the learning rate.
3 cases of Learning rate
• Very Small Learning Rate
• Steps are too tiny.
• Convergence is very slow (takes many iterations).
• But it’s usually stable (won’t overshoot).
• Very Large Learning Rate
• Steps are too big.
• May overshoot the minimum.
• Can cause oscillations (zig-zagging without settling).
• Sometimes, it may even diverge (loss keeps increasing instead of decreasing).
• Optimal Learning Rate
• Balanced step size.
• Converges quickly and remains stable (the three cases are compared numerically below).
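The three cases can be checked numerically on the toy loss J(θ) = θ² (our own illustrative values): with gradient 2θ, each step multiplies θ by (1 − 2η), so too large an η makes |θ| grow instead of shrink.

```python
# Effect of the learning rate on J(theta) = theta^2 (gradient: 2*theta).
def run_gd(eta, steps=20, theta=1.0):
    for _ in range(steps):
        theta = theta - eta * (2 * theta)  # theta <- (1 - 2*eta) * theta
    return theta

print(run_gd(0.001))  # very small: barely moves (slow but stable)
print(run_gd(1.1))    # very large: |theta| grows each step (diverges)
print(run_gd(0.3))    # reasonable: near 0 within a few steps
```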
Mini Batch Gradient Descent Optimizer
• In this variant of gradient descent, instead of using all the training data, we use only a subset of
the dataset to calculate the gradient.
• Since we use a batch of data (typically between 10 and 1,000 samples) instead of the whole dataset, we
need fewer iterations. That is why the mini-batch gradient descent algorithm is often faster than both
the stochastic gradient descent and batch gradient descent algorithms.
• As the algorithm uses batching, you do not need to load all the training data into memory, which
makes the process more efficient to implement.
• Moreover, the cost curve in mini-batch gradient descent is noisier than in the batch gradient
descent algorithm but smoother than in the stochastic gradient descent algorithm.
• Because of this, mini-batch gradient descent strikes a good balance between
convergence speed and stability (a sketch of the batching loop follows below).
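A sketch of that batching loop on the same kind of toy problem (all names and values here are illustrative):

```python
import random

# Mini-batch GD sketch: fit y ~ w*x, averaging the gradient over a
# small batch per update instead of one sample or the whole dataset.
data = [(x / 10, 2 * x / 10 + random.uniform(-0.1, 0.1)) for x in range(100)]
w, eta, batch_size = 0.0, 0.1, 10

for epoch in range(50):
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # averaging over the batch gives a less noisy gradient than SGD
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w = w - eta * grad

print(w)  # close to 2.0, with a smoother path than per-sample SGD
```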
Mini Batch Gradient Descent Optimizer
Pros:
• It leads to more stable convergence than SGD.
• More efficient gradient calculations.
• Requires less memory than batch gradient descent.
Cons:
• It requires an additional hyperparameter, the mini-batch size, which you must
tune to achieve the required accuracy, in addition to the learning rate.
SGD with Momentum
• It is based on the idea of exponentially weighted moving average.
• Exponentially weighted moving average:
• Imagine you’re checking the daily temperature for a week.
• To smooth out the noise, you take the average of the last 3 days → that’s a simple moving
average.
• Problem: Every past point in the window has equal weight, and old data is dropped suddenly.
Motivation for EWMA
• What if we want to give more importance to recent data but not forget the past completely?
• That's what the Exponentially Weighted Moving Average (EWMA) does.
• It applies a decaying weight to past observations.
Exponentially weighted moving average (EWMA)
• EWMA recurrence: s_t = β·s_{t−1} + (1 − β)·x_t, where x_t is the new observation
and s_t is the smoothed value (a numeric sketch follows this slide).
• If β is small (say 0.1) → we rely heavily on the new value (reacts quickly,
but noisy).
• If β is large (say 0.9) → we rely more on the past (smooth, but slower to
react).
• Think of keeping a running average of your exam marks.
• If you give more weight to your recent exam, the average changes quickly.
• If you give equal or more weight to older exams, the average changes slowly.
• Why is this useful in optimization?
• SGD gradients jump a lot (noisy).
• By using EWMA, we smooth the gradients.
• This is exactly the idea behind Momentum.
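A small sketch of the recurrence on made-up daily temperatures (the readings and β are invented for illustration):

```python
# EWMA: s_t = beta * s_{t-1} + (1 - beta) * x_t
temps = [30, 32, 31, 40, 33, 31, 30]  # made-up readings; 40 is a noisy spike
beta = 0.9

s = temps[0]  # seed the average with the first reading
for x in temps[1:]:
    s = beta * s + (1 - beta) * x  # recent values count more; old ones decay
    print(round(s, 2))
# The smoothed series barely reacts to the one-day spike at 40.
```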
SGD with Momentum
• Problem with SGD: The updates can zig-zag a lot, especially in narrow
valleys (because gradient direction keeps changing).
Intuition of Momentum
• Imagine pushing a ball downhill.
• In plain SGD: the ball moves only based on the current slope.
• With Momentum: the ball remembers the past direction and keeps
rolling smoothly.
• So, momentum helps us:
Move faster in the correct direction.
Reduce zig-zag oscillations.
SGD with Momentum
• Momentum gradient descent is a variant of gradient descent
that adds a momentum term to the update rule.
• The momentum term accumulates the gradient values over time and
reduces the oscillations in the cost function, leading to faster
convergence.
• This is particularly useful in cases where the cost function has a lot of
noise or curvature, which can cause traditional gradient descent to
get stuck in local minima.
Equation of SGD with Momentum
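One standard way to write the update, consistent with the β and v_{t−1} notation used in the NAG slides that follow (this is the common textbook form, not necessarily the exact equation from the original slide):

v_t = β·v_{t−1} + η·∇J(θ)   (velocity: an EWMA-style accumulation of gradients)
θ = θ − v_t                 (move by the velocity, not by the raw gradient)

A minimal sketch of it on the toy loss J(θ) = θ² (β and η are illustrative):

```python
# SGD with momentum on J(theta) = theta^2 (gradient: 2*theta).
theta, v = 5.0, 0.0
beta, eta = 0.9, 0.05

for step in range(100):
    grad = 2 * theta
    v = beta * v + eta * grad  # velocity remembers past directions
    theta = theta - v          # step by the accumulated velocity

print(theta)  # close to 0, reached with fewer zig-zags than plain SGD
```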
Advantages of SGD with Momentum
• Faster Convergence
• Momentum accelerates movement in the right direction (like rolling downhill
faster).
• Helps reach minima quicker than plain SGD.
• Reduces Zig-Zag Oscillations
• In narrow valleys, plain SGD keeps zig-zagging.
• Momentum smooths the path, giving a straighter trajectory.
• Escapes Shallow Local Minima
• Because it carries “speed” from past updates, it can roll past small bumps
(local minima) instead of getting stuck.
• Stability
• Less sensitive to noisy gradient updates (since it averages them).
Disadvantages of SGD with Momentum
• Adds another hyperparameter (the momentum coefficient β) that must be tuned.
• The accumulated velocity can overshoot the minimum and oscillate around it
before settling.
• Needs extra memory to store a velocity value for every parameter.
Nesterov Accelerated Gradient (NAG)
• Update rule (look-ahead form):
v_t = β·v_{t−1} + η·∇J(θ − β·v_{t−1})
θ = θ − v_t
• β = momentum coefficient (typically 0.9)
Controls how much of the previous velocity to keep
• v_{t−1} = previous velocity from the last step
The "memory" of where we were going before
• β·v_{t−1} = momentum component (keep 90% of the previous direction)
• ∇J(θ − β·v_{t−1}) = gradient evaluated at the look-ahead position (sketched below)
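A sketch of the look-ahead update on the same toy loss (β and η are illustrative choices):

```python
# Nesterov accelerated gradient on J(theta) = theta^2.
theta, v = 5.0, 0.0
beta, eta = 0.9, 0.05

def grad(t):
    return 2 * t  # gradient of theta^2

for step in range(100):
    lookahead = theta - beta * v          # peek where momentum is heading
    v = beta * v + eta * grad(lookahead)  # correct using the gradient THERE
    theta = theta - v

print(theta)  # close to 0, with less overshoot than plain momentum
```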
Adagrad
In plain GD, we use one fixed learning rate (η) for all parameters.
But in real problems:
• Some parameters (features) need big updates (rarely active features).
• Others need small updates (frequently active features).
• Idea of Adagrad
• Adjust the learning rate automatically for each parameter.
Adagrad
Suppose, you have two parameters:
• Parameter A has seen gradients: [0.1, 0.1, 0.1] → accumulated squared sum = 0.03
• Parameter B has seen gradients: [2.0, 1.5, 1.8] → accumulated squared sum = 9.49
(both verified in the sketch below)
When the next update comes:
• Parameter A gets a bigger step (because √0.03 is small)
• Parameter B gets a smaller step (because √9.49 is large)
This means,
• Parameters that get updated often → learning rate shrinks for them.
• Parameters that get updated rarely → learning rate stays larger.
• It’s like giving each parameter its own “personalized” learning speed.
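The arithmetic above in code form (a sketch of Adagrad's per-parameter scaling; eps is the usual small constant added to avoid division by zero):

```python
import math

# Adagrad sketch: each parameter accumulates its own sum of SQUARED
# gradients, and its effective step size is alpha / sqrt(cache).
def adagrad_scale(past_grads, alpha=1.0, eps=1e-8):
    cache = sum(g * g for g in past_grads)  # grows forever
    return cache, alpha / (math.sqrt(cache) + eps)

print(adagrad_scale([0.1, 0.1, 0.1]))  # cache 0.03 -> step ~5.77*alpha (big)
print(adagrad_scale([2.0, 1.5, 1.8]))  # cache 9.49 -> step ~0.32*alpha (small)
```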
Adagrad
Advantages:
• No need to manually tune learning rate for each parameter.
• Works really well with sparse data (like text / NLP, where some words
appear rarely).
Disadvantages:
• The learning rate keeps shrinking due to the growing cache (gradient
accumulation), which can stop learning too soon.
• That’s why later algorithms like RMSProp and Adam were developed
(they “fix” this).
RMSProp (Root Mean Square Propagation)
• RMSProp is a variant of gradient descent that also adapts the
learning rate for each parameter, but instead of using the historical
gradient values, it uses a moving average of the squared gradient
values.
• It forgets very old gradients and focuses only on the recent trend of
gradients.
• This helps to reduce the learning rate for parameters with large recent
squared gradients, which would otherwise cause the algorithm to oscillate or
diverge.
Adagrad's Problem: The Vanishing Learning Rate
• Adagrad accumulates ALL gradients forever:
• cache = cache + gradient²
• Let's see what happens over time:
• Step 1: gradient = 1.0 → cache = 0 + 1 = 1
• Step 2: gradient = 1.0 → cache = 1 + 1 = 2
• Step 3: gradient = 1.0 → cache = 2 + 1 = 3
• Step 4: gradient = 1.0 → cache = 3 + 1 = 4...
• Step 100: gradient = 1.0 → cache = 99 + 1 = 100
Adagrad's Problem: The Vanishing Learning Rate
• Effective learning rate:
• Step 1: α/√1 = α/1.0 = α
• Step 2: α/√2 ≈ 0.7α
• Step 3: α/√3 ≈ 0.58α
• Step 4: α/√4 = 0.5α...
• Step 100: α/√100 = 0.1α
• Result: Learning rate keeps shrinking toward zero!
• Eventually, updates become so tiny that learning essentially stops.
RMSprop's Solution: Forgetful Memory
• RMSprop uses exponential moving average:
• cache = β * cache + (1-β) * gradient²
• With β = 0.9, same scenario:
• Step 1: cache = 0.9×0 + 0.1×1 = 0.1
• Step 2: cache = 0.9×0.1 + 0.1×1 = 0.19
• Step 3: cache = 0.9×0.19 + 0.1×1 = 0.271
• Step 4: cache = 0.9×0.271 + 0.1×1 = 0.344...
• Step 100: cache ≈ 1.0 (converges!)
RMSprop's Solution: Forgetful Memory
Effective learning rate:
• Step 1: α/√0.1 ≈ 3.16α (higher than Adagrad!)
• Step 2: α/√0.19 ≈ 2.29α
• Step 3: α/√0.271 ≈ 1.92α
• Step 4: α/√0.344 ≈ 1.70α...
• Step 100: α/√1.0 = α (stabilizes!)
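The two accumulation rules side by side, reproducing the numbers above (constant gradient of 1.0 and β = 0.9, as in the walkthrough):

```python
import math

alpha, beta, grad = 1.0, 0.9, 1.0
ada_cache, rms_cache = 0.0, 0.0

for step in range(1, 101):
    ada_cache += grad ** 2                                 # Adagrad: unbounded
    rms_cache = beta * rms_cache + (1 - beta) * grad ** 2  # RMSprop: leaky
    if step in (1, 2, 3, 4, 100):
        print(step,
              round(alpha / math.sqrt(ada_cache), 2),  # shrinks toward 0
              round(alpha / math.sqrt(rms_cache), 2))  # settles near alpha
```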