Optimization
What are optimizers?
• Optimizers and loss functions are two fundamental components that
help improve a model’s performance.
• A loss function evaluates a model's effectiveness by computing the
difference between predicted and expected outputs. Common loss
functions include log loss, hinge loss, and mean squared error.
• An optimizer improves the model by adjusting its parameters
(weights and biases) to minimize the loss function value. Examples
include RMSProp, ADAM and SGD (Stochastic Gradient Descent).
• The optimizer’s role is to find the best combination of weights and
biases that leads to the most accurate predictions.
Why Do We Need Optimizers?
• Imagine you’re blindfolded on a hill and want to reach the lowest
point (valley).
• You take small steps downhill, adjusting direction as you go.
• This is optimization in deep learning.
• In deep learning:
▪ The hill = Loss function (error we want to minimize).
▪ The steps = Updates to weights.
▪ The method of deciding steps = Optimizer.
Important Terms
• Epoch – One complete pass of the algorithm over the whole training
dataset.
• Sample – A single row of a dataset.
• Batch – The number of samples used for one update of the model
parameters.
• Learning rate – A parameter that controls how much the model weights
are updated at each step.
• Cost Function/Loss Function – A function used to calculate the cost,
i.e., the difference between the predicted value and the actual value.
• Weights/Bias – The learnable parameters in a model that control the
signal between two neurons.
What is a Gradient?
• On a curve, the slope tells us how steep it is and in which direction it goes (up or down).
• At every point on a curve, we can measure its slope. That slope
tells us whether we are going uphill or downhill.
• The gradient is just the slope of the curve (i.e., of the function).
Gradient Descent (GD) optimizer
• Gradient = slope = direction of steepest increase.
• We want to train our neural network by reducing its error (loss).
• This is like finding the bottom of a valley.
• Since we want to minimize the error, we move opposite to the gradient (a minimal sketch follows below).
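As a toy illustration of this downhill rule (our own example, not from the original material): one gradient descent run on the loss J(θ) = θ², whose gradient is 2θ.

```python
# Toy gradient descent on J(theta) = theta^2 (gradient: 2*theta).
# Starting point and learning rate are illustrative choices.
theta = 5.0   # initial parameter
eta = 0.1     # learning rate

for step in range(50):
    grad = 2 * theta            # slope of the loss at the current theta
    theta = theta - eta * grad  # step opposite to the gradient

print(theta)  # close to 0: the bottom of the "valley"
```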
Advantages and disadvantages of Gradient Descent
• Gradient Descent (also called Batch Gradient Descent) updates weights after computing gradients on the entire
dataset.
• If we have millions of records, training becomes slow and
computationally very expensive.
• Pros:
1) Easy to understand
2) Easy to implement
3) Stable and accurate convergence.
• Cons:
1) Because this method calculates the gradient for the entire dataset in one
update, the computation is very slow.
2) It requires large memory and is computationally expensive for large datasets.
Stochastic Gradient Descent (SGD)
• It is a variant of Gradient Descent that updates the model parameters
one sample at a time. If the dataset has 10K records, SGD updates the
model parameters 10K times per epoch.
• SGD is slow to converge because it needs a forward and a backward
propagation for every record.
• The path to the global minimum also becomes very noisy (a sketch of the per-sample loop follows below).
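A minimal sketch of that per-sample loop (a hypothetical toy problem of our own: fitting y ≈ w·x):

```python
import random

# SGD sketch: fit y ~ w*x on toy data, updating w after EVERY sample.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]  # (x, y) pairs
w, eta = 0.0, 0.01

for epoch in range(100):
    random.shuffle(data)            # visit samples in random order
    for x, y in data:               # one parameter update PER sample
        grad = 2 * (w * x - y) * x  # d/dw of the squared error (w*x - y)^2
        w = w - eta * grad          # noisy step based on a single record

print(w)  # close to 2.0, but the path there is noisy
```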
Stochastic Gradient Descent (SGD)
• Advantages of Stochastic Gradient Descent
Frequent updates of the model parameters.
Requires less memory.
Allows the use of large datasets, as it processes only one example at a time.
Faster updates help escape local minima.
• Disadvantages of Stochastic Gradient Descent
The frequent updates produce noisy gradients, which may cause the error
to increase instead of decrease.
High variance in the updates.
Frequent updates are computationally expensive.
Local Minima/Global Minima
• Imagine a landscape with many valleys and hills.
• Some valleys are shallow, some are deeper.
• The deepest valley of all = Global Minimum.
• The smaller valleys = Local Minima.
Definition
• Local Minimum: A point where the function has a smaller value than
nearby points (a small dip/valley), but not necessarily the smallest value
overall.
• Global Minimum: The absolute lowest point of the function over the entire
domain.
Learning Rate in Gradient Descent
• Update rule: θ = θ − η·∇J(θ)
• Here, θ (theta) means the parameters (i.e., weights and biases).
• J(θ) is the loss function/cost function.
• η is the learning rate.
3 cases of Learning rate
• Very Small Learning Rate
• Steps are too tiny.
• Convergence is very slow (takes many iterations).
• But it’s usually stable (won’t overshoot).
• Very Large Learning Rate
• Steps are too big.
• May overshoot the minimum.
• Can cause oscillations (zig-zagging without settling).
• Sometimes, it may even diverge (loss keeps increasing instead of decreasing).
• Optimal Learning Rate
• Balanced step size.
• Converges quickly and remains stable (the three cases are compared numerically below).
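The three cases can be checked numerically on the toy loss J(θ) = θ² (our own illustrative values): with gradient 2θ, each step multiplies θ by (1 − 2η), so too large an η makes |θ| grow instead of shrink.

```python
# Effect of the learning rate on J(theta) = theta^2 (gradient: 2*theta).
def run_gd(eta, steps=20, theta=1.0):
    for _ in range(steps):
        theta = theta - eta * (2 * theta)  # theta <- (1 - 2*eta) * theta
    return theta

print(run_gd(0.001))  # very small: barely moves (slow but stable)
print(run_gd(1.1))    # very large: |theta| grows each step (diverges)
print(run_gd(0.3))    # reasonable: near 0 within a few steps
```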
Mini Batch Gradient Descent Optimizer
• In this variant of gradient descent, instead of using all the training data, we use only a subset of
the dataset to calculate the gradient.
• Since we use a batch of data (typically between 10 and 1,000 samples) instead of the whole dataset, we
need fewer iterations. That is why the mini-batch gradient descent algorithm is often faster than both
the stochastic gradient descent and batch gradient descent algorithms.
• As the algorithm uses batching, you do not need to load all the training data into memory, which
makes the process more efficient to implement.
• Moreover, the cost curve in mini-batch gradient descent is noisier than in the batch gradient
descent algorithm but smoother than in the stochastic gradient descent algorithm.
• Because of this, mini-batch gradient descent strikes a good balance between
convergence speed and stability (a sketch of the batching loop follows below).
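A sketch of that batching loop on the same kind of toy problem (all names and values here are illustrative):

```python
import random

# Mini-batch GD sketch: fit y ~ w*x, averaging the gradient over a
# small batch per update instead of one sample or the whole dataset.
data = [(x / 10, 2 * x / 10 + random.uniform(-0.1, 0.1)) for x in range(100)]
w, eta, batch_size = 0.0, 0.1, 10

for epoch in range(50):
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # averaging over the batch gives a less noisy gradient than SGD
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w = w - eta * grad

print(w)  # close to 2.0, with a smoother path than per-sample SGD
```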
Mini Batch Gradient Descent Optimizer
Pros:
• It leads to more stable convergence than SGD.
• More efficient gradient calculations.
• Requires less memory than batch gradient descent.
Cons:
• It requires an additional hyperparameter, the mini-batch size, which you must
tune to achieve the required accuracy, in addition to the learning rate.
SGD with Momentum
• It is based on the idea of exponentially weighted moving average.
• Exponentially weighted moving average:
• Imagine you’re checking the daily temperature for a week.
• To smooth out the noise, you take the average of the last 3 days → that’s a simple moving
average.
• Problem: Every past point in the window has equal weight, and old data is dropped suddenly.
Motivation for EWMA
• What if we want to give more importance to recent data but not forget the past completely?
• That's what the Exponentially Weighted Moving Average (EWMA) does.
• It applies a decaying weight to past observations.
Exponentially weighted moving average (EWMA)
• EWMA recurrence: s_t = β·s_{t−1} + (1 − β)·x_t, where x_t is the new observation
and s_t is the smoothed value (a numeric sketch follows this slide).
• If β is small (say 0.1) → we rely heavily on the new value (reacts quickly,
but noisy).
• If β is large (say 0.9) → we rely more on the past (smooth, but slower to
react).
• Think of keeping a running average of your exam marks.
• If you give more weight to your recent exam, the average changes quickly.
• If you give equal or more weight to older exams, the average changes slowly.
• Why is this useful in optimization?
• SGD gradients jump a lot (noisy).
• By using EWMA, we smooth the gradients.
• This is exactly the idea behind Momentum.
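A small sketch of the recurrence on made-up daily temperatures (the readings and β are invented for illustration):

```python
# EWMA: s_t = beta * s_{t-1} + (1 - beta) * x_t
temps = [30, 32, 31, 40, 33, 31, 30]  # made-up readings; 40 is a noisy spike
beta = 0.9

s = temps[0]  # seed the average with the first reading
for x in temps[1:]:
    s = beta * s + (1 - beta) * x  # recent values count more; old ones decay
    print(round(s, 2))
# The smoothed series barely reacts to the one-day spike at 40.
```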
SGD with Momentum
• Problem with SGD: The updates can zig-zag a lot, especially in narrow
valleys (because gradient direction keeps changing).
Intuition of Momentum
• Imagine pushing a ball downhill.
• In plain SGD: the ball moves only based on the current slope.
• With Momentum: the ball remembers the past direction and keeps
rolling smoothly.
• So, momentum helps us:
Move faster in the correct direction.
Reduce zig-zag oscillations.
SGD with Momentum
• Momentum gradient descent is a variant of gradient descent
that adds a momentum term to the update rule.
• The momentum term accumulates the gradient values over time and
reduces the oscillations in the cost function, leading to faster
convergence.
• This is particularly useful in cases where the cost function has a lot of
noise or curvature, which can cause traditional gradient descent to
get stuck in local minima.
Equation of SGD with Momentum
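One standard way to write the update, consistent with the β and v_{t−1} notation used in the NAG slides that follow (this is the common textbook form, not necessarily the exact equation from the original slide):

v_t = β·v_{t−1} + η·∇J(θ)   (velocity: an EWMA-style accumulation of gradients)
θ = θ − v_t                 (move by the velocity, not by the raw gradient)

A minimal sketch of it on the toy loss J(θ) = θ² (β and η are illustrative):

```python
# SGD with momentum on J(theta) = theta^2 (gradient: 2*theta).
theta, v = 5.0, 0.0
beta, eta = 0.9, 0.05

for step in range(100):
    grad = 2 * theta
    v = beta * v + eta * grad  # velocity remembers past directions
    theta = theta - v          # step by the accumulated velocity

print(theta)  # close to 0, reached with fewer zig-zags than plain SGD
```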
Advantages of SGD with Momentum
• Faster Convergence
• Momentum accelerates movement in the right direction (like rolling downhill
faster).
• Helps reach minima quicker than plain SGD.
• Reduces Zig-Zag Oscillations
• In narrow valleys, plain SGD keeps zig-zagging.
• Momentum smooths the path, giving a straighter trajectory.
• Escapes Shallow Local Minima
• Because it carries “speed” from past updates, it can roll past small bumps
(local minima) instead of getting stuck.
• Stability
• Less sensitive to noisy gradient updates (since it averages them).
Disadvantages of SGD with Momentum
• Adds another hyperparameter (the momentum coefficient β) that must be tuned.
• The accumulated velocity can overshoot the minimum and oscillate around it
before settling.
• Needs extra memory to store a velocity value for every parameter.
Nesterov Accelerated Gradient (NAG)
• Update rule (look-ahead form):
v_t = β·v_{t−1} + η·∇J(θ − β·v_{t−1})
θ = θ − v_t
• β = momentum coefficient (typically 0.9)
Controls how much of the previous velocity to keep
• v_{t−1} = previous velocity from the last step
The "memory" of where we were going before
• β·v_{t−1} = momentum component (keep 90% of the previous direction)
• ∇J(θ − β·v_{t−1}) = gradient evaluated at the look-ahead position (sketched below)
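A sketch of the look-ahead update on the same toy loss (β and η are illustrative choices):

```python
# Nesterov accelerated gradient on J(theta) = theta^2.
theta, v = 5.0, 0.0
beta, eta = 0.9, 0.05

def grad(t):
    return 2 * t  # gradient of theta^2

for step in range(100):
    lookahead = theta - beta * v          # peek where momentum is heading
    v = beta * v + eta * grad(lookahead)  # correct using the gradient THERE
    theta = theta - v

print(theta)  # close to 0, with less overshoot than plain momentum
```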
Adagrad
In plain GD, we use one fixed learning rate (η) for all parameters.
But in real problems:
• Some parameters (features) need big updates (rarely active features).
• Others need small updates (frequently active features).
• Idea of Adagrad
• Adjust the learning rate automatically for each parameter.
Adagrad
Suppose, you have two parameters:
• Parameter A has seen gradients: [0.1, 0.1, 0.1] → accumulated squared sum = 0.03
• Parameter B has seen gradients: [2.0, 1.5, 1.8] → accumulated squared sum = 9.49
(both verified in the sketch below)
When the next update comes:
• Parameter A gets a bigger step (because √0.03 is small)
• Parameter B gets a smaller step (because √9.49 is large)
This means,
• Parameters that get updated often → learning rate shrinks for them.
• Parameters that get updated rarely → learning rate stays larger.
• It’s like giving each parameter its own “personalized” learning speed.
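The arithmetic above in code form (a sketch of Adagrad's per-parameter scaling; eps is the usual small constant added to avoid division by zero):

```python
import math

# Adagrad sketch: each parameter accumulates its own sum of SQUARED
# gradients, and its effective step size is alpha / sqrt(cache).
def adagrad_scale(past_grads, alpha=1.0, eps=1e-8):
    cache = sum(g * g for g in past_grads)  # grows forever
    return cache, alpha / (math.sqrt(cache) + eps)

print(adagrad_scale([0.1, 0.1, 0.1]))  # cache 0.03 -> step ~5.77*alpha (big)
print(adagrad_scale([2.0, 1.5, 1.8]))  # cache 9.49 -> step ~0.32*alpha (small)
```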
Adagrad
Advantages:
• No need to manually tune learning rate for each parameter.
• Works really well with sparse data (like text / NLP, where some words
appear rarely).
Disadvantages:
• The learning rate keeps shrinking due to the growing cache (gradient
accumulation), which can stop learning too soon.
• That’s why later algorithms like RMSProp and Adam were developed
(they “fix” this).
RMSProp (Root Mean Square Propagation)
• RMSProp is a variant of gradient descent that also adapts the
learning rate for each parameter, but instead of using the historical
gradient values, it uses a moving average of the squared gradient
values.
• It forgets very old gradients and focuses only on the recent trend of
gradients.
• This helps to reduce the learning rate for parameters with large recent
squared gradients, which would otherwise cause the algorithm to oscillate or
diverge.
Adagrad's Problem: The Vanishing Learning Rate
• Adagrad accumulates ALL gradients forever:
• cache = cache + gradient²
• Let's see what happens over time:
• Step 1: gradient = 1.0 → cache = 0 + 1 = 1
• Step 2: gradient = 1.0 → cache = 1 + 1 = 2
• Step 3: gradient = 1.0 → cache = 2 + 1 = 3
• Step 4: gradient = 1.0 → cache = 3 + 1 = 4...
• Step 100: gradient = 1.0 → cache = 99 + 1 = 100
Adagrad's Problem: The Vanishing Learning Rate
• Effective learning rate:
• Step 1: α/√1 = α/1.0 = α
• Step 2: α/√2 ≈ 0.7α
• Step 3: α/√3 ≈ 0.58α
• Step 4: α/√4 = 0.5α...
• Step 100: α/√100 = 0.1α
• Result: Learning rate keeps shrinking toward zero!
• Eventually, updates become so tiny that learning essentially stops.
RMSprop's Solution: Forgetful Memory
• RMSprop uses exponential moving average:
• cache = β * cache + (1-β) * gradient²
• With β = 0.9, same scenario:
• Step 1: cache = 0.9×0 + 0.1×1 = 0.1
• Step 2: cache = 0.9×0.1 + 0.1×1 = 0.19
• Step 3: cache = 0.9×0.19 + 0.1×1 = 0.271
• Step 4: cache = 0.9×0.271 + 0.1×1 = 0.344...
• Step 100: cache ≈ 1.0 (converges!)
RMSprop's Solution: Forgetful Memory
Effective learning rate:
• Step 1: α/√0.1 ≈ 3.16α (higher than Adagrad!)
• Step 2: α/√0.19 ≈ 2.29α
• Step 3: α/√0.271 ≈ 1.92α
• Step 4: α/√0.344 ≈ 1.70α...
• Step 100: α/√1.0 = α (stabilizes!)
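The two accumulation rules side by side, reproducing the numbers above (constant gradient of 1.0 and β = 0.9, as in the walkthrough):

```python
import math

alpha, beta, grad = 1.0, 0.9, 1.0
ada_cache, rms_cache = 0.0, 0.0

for step in range(1, 101):
    ada_cache += grad ** 2                                 # Adagrad: unbounded
    rms_cache = beta * rms_cache + (1 - beta) * grad ** 2  # RMSprop: leaky
    if step in (1, 2, 3, 4, 100):
        print(step,
              round(alpha / math.sqrt(ada_cache), 2),  # shrinks toward 0
              round(alpha / math.sqrt(rms_cache), 2))  # settles near alpha
```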