Recap: Machine learning algorithms
• Supervised learning: classification (discrete outputs), regression (continuous outputs)
• Unsupervised learning: clustering (discrete structure), dimensionality reduction (continuous structure)
MOTIVATION
• Gradient descent is good for finding global minima/maxima if the function is convex
• It is good for finding local minima/maxima if the function is not convex
• It is used for optimizing many models in machine learning, in conjunction with:
✓ Linear Regression
✓ Logistic Regression
✓ Support Vector Machines
FUNCTION EXAMPLE
[figure: example function]
QUICKEST EVER REVIEW OF MULTIVARIATE CALCULUS
• Derivative
• Partial Derivative
• Gradient Vector
DERIVATIVE — slope of the tangent line
f(x) = x^2
f'(x) = 2x
f''(x) = 2
Easy when a function is univariate.
PARTIAL DERIVATIVE – MULTIVARIATE FUNCTIONS
For multivariate functions (e.g., two variables) we need partial derivatives – one per dimension. Examples of multivariate functions:
f(x, y) = x^2 + y^2 (convex)    f(x, y) = −x^2 − y^2 (concave)    f(x, y) = cos^2(x) + y^2    f(x, y) = cos^2(x) + cos^2(y)
PARTIAL DERIVATIVE – MULTIVARIATE FUNCTIONS
• To visualize the partial derivative for each of the dimensions x and y, we can imagine a plane that “cuts” our surface along the two dimensions; once again we get the slope of the tangent line.
Surface: f(x, y) = 9 − x^2 − y^2    Plane: y = 1    Cut: f(x, 1) = 8 − x^2
Slope/derivative of the cut: f'(x) = −2x
PARTIAL DERIVATIVE – MULTIVARIATE FUNCTIONS
• If we partially differentiate a function with respect to x, we pretend y is constant.
f(x, y) = 9 − x^2 − y^2
∂f/∂x = −2x    ∂f/∂y = −2y
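As a quick check of these partials, a short sympy sketch (assuming sympy is available) recovers the same derivatives symbolically:

import sympy as sp

x, y = sp.symbols("x y")
f = 9 - x**2 - y**2
print(sp.diff(f, x))  # -2*x  (y treated as constant)
print(sp.diff(f, y))  # -2*y  (x treated as constant)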
PARTIAL DERIVATIVE – MULTIVARIATE FUNCTIONS
• The two tangent lines that pass through a point define the tangent plane at that point.
GRADIENT VECTOR
• The gradient is the vector whose coordinates are the partial derivatives of the function:
f(x, y) = 9 − x^2 − y^2
∂f/∂x = −2x    ∂f/∂y = −2y
∇f = (∂f/∂x) i + (∂f/∂y) j = (∂f/∂x, ∂f/∂y) = (−2x, −2y)
Note: the gradient vector is not parallel to the tangent plane.
GRADIENT DESCENT
• Method to find local optima of a differentiable function f
• Intuition: the gradient tells us the direction of greatest increase; the negative gradient gives us the direction of greatest decrease
• Take steps in directions that reduce the function value
• The definition of the derivative guarantees that if we take a small enough step in the direction of the negative gradient, the function will decrease in value
• How small is small enough?
GRADIENT DESCENT
Gradient Descent Algorithm:
• Pick an initial point x_0
• Iterate until convergence
x_{t+1} = x_t − γ_t ∇f(x_t)
where γ_t is the t-th step size (sometimes called the learning rate)
Possible stopping criterion: iterate until ‖∇f(x_t)‖ ≤ ε for some ε > 0
How small should ε be?
GRADIENT DESCENT
f(x) = x^2, step size: 0.8
x^(0) = −4
x^(1) = −4 − 0.8 ⋅ 2 ⋅ (−4) = 2.4
x^(2) = 2.4 − 0.8 ⋅ 2 ⋅ 2.4 = −1.44
x^(3) = 0.864
x^(4) = −0.5184
x^(5) = 0.31104
⋮
x^(30) = −8.84296e−07
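A minimal Python sketch of the update rule x_{t+1} = x_t − γ∇f(x_t), reproducing the worked example above (all names are illustrative):

def gradient_descent(grad, x0, step_size, tol=1e-6, max_iters=1000):
    # Iterate x <- x - step_size * grad(x) until the gradient is tiny.
    x = x0
    for t in range(max_iters):
        g = grad(x)
        if abs(g) <= tol:
            break
        x = x - step_size * g
    return x

# Worked example from the slides: f(x) = x^2, so grad f(x) = 2x.
print(gradient_descent(grad=lambda x: 2 * x, x0=-4.0, step_size=0.8))
# ~ -3e-07: the iterates -4, 2.4, -1.44, 0.864, ... shrink in magnitude
# by a factor of |1 - 0.8*2| = 0.6 each step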
GRADIENT DESCENT
Step size: 0.9 — [figure]
GRADIENT DESCENT
Step size: 0.2 — [figure]
GRADIENT DESCENT
Step size matters!
LINE SEARCH
• Instead of picking a fixed step size that may or may not actually result in a decrease in the function value, we can minimize the function along the direction specified by the gradient to guarantee that the next iterate decreases the function value
• In other words, choose x_{t+1} ∈ arg min_{γ ≥ 0} f(x_t − γ∇f(x_t))
• This is called exact line search
• This optimization problem can be expensive to solve exactly
• However, if f is convex, this is a univariate convex optimization problem
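A sketch of one exact-line-search step using scipy's 1-D minimizer (the search interval (0, 10) is an illustrative assumption; in general the minimizing γ must be bracketed):

import numpy as np
from scipy.optimize import minimize_scalar

def exact_line_search_step(f, grad_f, x):
    d = grad_f(x)
    # Minimize the univariate function phi(gamma) = f(x - gamma * d) over gamma >= 0.
    res = minimize_scalar(lambda gamma: f(x - gamma * d), bounds=(0.0, 10.0), method="bounded")
    return x - res.x * d

# Example: f(x, y) = x^2 + 10 y^2
f = lambda v: v[0] ** 2 + 10 * v[1] ** 2
grad_f = lambda v: np.array([2 * v[0], 20 * v[1]])
print(exact_line_search_step(f, grad_f, np.array([1.0, 1.0])))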
BACKTRACKING LINE SEARCH
• Instead of exact line search, we could simply use a strategy that finds some step size that decreases the function value (one must exist)
• Backtracking line search: start with a large step size, γ, and keep shrinking it until f(x_t − γ∇f(x_t)) < f(x_t)
• This always guarantees a decrease, but it may not decrease as much as exact line search
• Still, this is typically much faster in practice as it only requires a few function evaluations
BACKTRACKING LINE SEARCH
• To implement backtracking line search, choose two parameters α ∈ (0, 0.5), β ∈ (0, 1)
• Set γ = 1
• While f(x_t − γ∇f(x_t)) > f(x_t) − α ⋅ γ ⋅ ‖∇f(x_t)‖^2
    • γ = βγ
• Set x_{t+1} = x_t − γ∇f(x_t)
Iterations continue until a step size is found that decreases the function “enough”
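A minimal sketch of this loop in Python (the parameter defaults are illustrative values within the stated ranges):

import numpy as np

def backtracking_step(f, grad_f, x, alpha=0.2, beta=0.5):
    g = grad_f(x)
    gamma = 1.0
    # Shrink gamma until the sufficient-decrease condition holds:
    # f(x - gamma * g) <= f(x) - alpha * gamma * ||g||^2
    while f(x - gamma * g) > f(x) - alpha * gamma * np.dot(g, g):
        gamma = beta * gamma
    return x - gamma * g

Each pass through the while loop costs only one extra function evaluation, which is why backtracking is cheap in practice.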
BACKTRACKING LINE SEARCH
α = 0.2, β = 0.99 — [figure]
BACKTRACKING LINE SEARCH
α = 0.1, β = 0.3 — [figure]
GRADIENT DESCENT: CONVEX FUNCTIONS
• For convex functions, local optima are always global optima (this follows from the definition of convexity)
• If gradient descent converges to a critical point, then the result is a global minimizer
• Not all convex functions are differentiable; can we still apply gradient descent?
GRADIENTS OF CONVEX FUNCTIONS
• For a differentiable convex function g(x), its gradients yield linear underestimators: g(y) ≥ g(x) + ∇g(x)^T (y − x) for all y
• Consequently, a zero gradient corresponds to a global optimum
[figure: tangent lines lying below the graph of g(x)]
SUBGRADIENTS
• For a convex function g(x), a subgradient at a point x_0 is given by any line l such that l(x_0) = g(x_0) and l(x) ≤ g(x) for all x, i.e., it is a linear underestimator
• If 0 is a subgradient at x_0, then x_0 is a global minimum
[figure: several lines touching g(x) at x_0 and lying below the graph]
SUBGRADIENTS
• If a convex function is differentiable at a point x, then it has a unique subgradient at x, given by the gradient
• If a convex function is not differentiable at a point x, it can have many subgradients
• E.g., the set of subgradients of the convex function |x| at the point x = 0 is given by the set of slopes [−1, 1]
• The set of all subgradients of f at x forms a convex set, i.e., if g and h are subgradients, then 0.5g + 0.5h is also a subgradient
• Subgradients are only guaranteed to exist for convex functions
SUBGRADIENT EXAMPLE
• Subgradient of g(x) = max(f_1(x), f_2(x)) for f_1, f_2 convex functions?
• If f_1(x) > f_2(x): ∇f_1(x)
• If f_2(x) > f_1(x): ∇f_2(x)
• If f_1(x) = f_2(x): ∇f_1(x) and ∇f_2(x) are both subgradients (and so are all convex combinations of these)
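This case analysis translates directly into code; a sketch, where f1 and f2 are the convex pieces and g1, g2 are hypothetical (sub)gradient oracles for them:

def subgradient_of_max(f1, f2, g1, g2, x):
    # Return a valid subgradient of g(x) = max(f1(x), f2(x)).
    # At a tie, either piece's subgradient (or any convex combination) works.
    return g1(x) if f1(x) >= f2(x) else g2(x)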
SUBGRADIENT DESCENT
Subgradient Descent Algorithm:
• Pick an initial point x_0
• Iterate until convergence
x_{t+1} = x_t − γ_t s_f(x_t)
where γ_t is the t-th step size and s_f(x_t) is a subgradient of f at x_t
Can you use line search here?
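A minimal sketch on f(x) = |x|, whose subgradient is sign(x) (any value in [−1, 1] is valid at x = 0); with the fixed step size 0.9 used on the next slide, the iterates oscillate instead of converging:

def subgradient_descent(subgrad, x0, step_size, num_iters=6):
    xs = [x0]
    for t in range(num_iters):
        xs.append(xs[-1] - step_size * subgrad(xs[-1]))
    return xs

sign = lambda x: 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
print(subgradient_descent(sign, x0=2.0, step_size=0.9))
# [2.0, 1.1, 0.2, -0.7, 0.2, -0.7, 0.2] -- bounces between 0.2 and -0.7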
SUBGRADIENT DESCENT
Step size: 0.9 — [figure]
DIMINISHING STEP SIZE RULES
• A fixed step size may not result in convergence for non-differentiable functions
• Instead, we can use a diminishing step size:
• Required property: the step size must decrease as the number of iterations increases, but not so quickly that the algorithm fails to make progress
• Common diminishing step size rules:
• γ_t = a / (b + t) for some a > 0, b ≥ 0
• γ_t = a / √t for some a > 0
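Swapping in the diminishing rule γ_t = a/√t makes the same |x| example converge; a sketch (a = 1 is an illustrative choice):

import math

def subgradient_descent_diminishing(subgrad, x0, a=1.0, num_iters=1000):
    x = x0
    for t in range(1, num_iters + 1):
        x = x - (a / math.sqrt(t)) * subgrad(x)  # gamma_t = a / sqrt(t)
    return x

sign = lambda x: 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
print(subgradient_descent_diminishing(sign, x0=2.0))
# oscillates toward 0 with amplitude shrinking like 1/sqrt(t)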
SUBGRADIENT DESCENT
Diminishing step size — [figure]
THEORETICAL GUARANTEES
• The hard work in convex optimization is to identify conditions that guarantee quick convergence to within a small error of the optimum
• Let f_best^(t) = min_{t′ ∈ {1, …, t}} f(x_{t′})
• For a fixed step size, γ, we are guaranteed that
lim_{t→∞} f_best^(t) − inf_x f(x) ≤ ε(γ)
where ε(γ) is some positive constant that depends on γ
• If f is differentiable, then we have ε(γ) = 0 whenever γ is small enough (more on rates of convergence later)
GRADIENT DESCENT CODE (PYTHON)
f(x) = x^4 − 3x^3 + 2
f'(x) = 4x^3 − 9x^2
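The original listing did not survive extraction; a minimal reconstruction consistent with the stated f and f' might look like this (the starting point and step size are illustrative choices):

# f(x) = x^4 - 3x^3 + 2 has a local minimum at x = 9/4 (where f'(x) = 0)
f_prime = lambda x: 4 * x**3 - 9 * x**2

x = 6.0           # illustrative starting point
step_size = 0.01
tol = 1e-8

# Iterate x <- x - step_size * f'(x) until the update is tiny.
while True:
    x_new = x - step_size * f_prime(x)
    if abs(x_new - x) < tol:
        break
    x = x_new

print(x)  # ~2.25, the minimizer of f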
GRADIENT DESCENT WITH MOMENTUM
v_t = γ v_{t−1} + η ∇_θ J(θ); typically γ is chosen to be around 0.9
MINI-BATCH GRADIENT DESCENT
• It is wasteful to compute the loss over the entire training set to perform a single parameter update for large datasets
• E.g., ImageNet has 14M images
• GD (a.k.a. vanilla GD) is replaced with mini-batch GD
• Mini-batch gradient descent
• Approach:
• Compute the loss ℒ(θ) on a mini-batch of images, update the parameters θ, and repeat until all images are used
• At the next epoch, shuffle the training data, and repeat the above process
• Mini-batch GD results in much faster training
• Typical batch size: 32 to 256 images
• It works because the examples in the training data are correlated
• I.e., the gradient from a mini-batch is a good approximation of the gradient of the entire training set
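A sketch of this training loop in NumPy (grad_loss is a hypothetical per-batch gradient oracle; all names are illustrative):

import numpy as np

def minibatch_gd(grad_loss, theta, data, batch_size=32, lr=0.01, num_epochs=10):
    n = len(data)
    for epoch in range(num_epochs):
        order = np.random.permutation(n)  # reshuffle the training data each epoch
        for start in range(0, n, batch_size):
            batch = data[order[start:start + batch_size]]
            theta = theta - lr * grad_loss(theta, batch)  # one update per mini-batch
    return theta

Setting batch_size=1 recovers stochastic gradient descent in the strict single-example sense discussed next.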
STOCHASTIC GRADIENT DESCENT
• Stochastic gradient descent
• SGD uses mini-batches that consist of a single input example
• E.g., a one-image mini-batch
• Although this method is very fast, it may cause significant fluctuations in the loss function
• Therefore, it is less commonly used, and mini-batch GD is preferred
• In most DL libraries, SGD typically means mini-batch SGD (with an option to add momentum)
PROBLEMS WITH GRADIENT DESCENT
• Besides the local minima problem, the GD algorithm can be very slow at plateaus, and it can get stuck at saddle points
[figure: cost ℒ(θ) vs. θ — very slow at a plateau (∇ℒ(θ) ≈ 0); stuck at a saddle point (∇ℒ(θ) = 0); stuck at a local minimum (∇ℒ(θ) = 0)]
Slide credit: Hung-yi Lee – Deep Learning Tutorial
GRADIENT DESCENT WITH MOMENTUM
• Gradient descent with momentum uses the momentum of the gradient for parameter optimization
[figure: cost ℒ(θ) vs. θ — movement = negative of gradient + momentum; the negative gradient, momentum, and real movement are shown at a point where gradient = 0]
Slide credit: Hung-yi Lee – Deep Learning Tutorial
GRADIENT DESCENT WITH MOMENTUM
• Parameter update in GD with momentum: θ_new = θ_old − V_new
• Where: V_new = β V_old + α ∇ℒ(θ_old)
• Compare to vanilla GD: θ_new = θ_old − α ∇ℒ(θ_old)
• The term V_new is called momentum
• This term accumulates the gradients from the past several steps
• It is similar to the momentum of a heavy ball rolling down the hill
• The parameter β is referred to as the coefficient of momentum
• A typical value of the parameter β is 0.9
• This method updates the parameters θ in the direction of the weighted average of the past gradients
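The two update equations above translate to a short sketch (β follows the slide's typical value; the learning rate α = 0.01 is an illustrative choice):

def momentum_step(theta, V, grad_fn, beta=0.9, alpha=0.01):
    # V_new = beta * V_old + alpha * grad L(theta_old); theta_new = theta_old - V_new
    V_new = beta * V + alpha * grad_fn(theta)
    theta_new = theta - V_new
    return theta_new, V_new

Because V_new folds in β V_old at every step, each update moves along a geometrically weighted average of all past gradients.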