Recap: Machine learning algorithms
• Supervised learning: classification (discrete outputs), regression (continuous outputs)
• Unsupervised learning: clustering (discrete structure), dimensionality reduction (continuous structure)
MOTIVATION
• Gradient descent is good for finding global minima/maxima if the function is convex
• It is good for finding local minima/maxima if the function is not convex
• It is used for optimizing many models in machine learning, in conjunction with:
✓ Linear Regression
✓ Logistic Regression
✓ Support Vector Machines
FUNCTION EXAMPLE
[figure: example function]
QUICKEST EVER REVIEW OF MULTIVARIATE CALCULUS
• Derivative
• Partial Derivative
• Gradient Vector
DERIVATIVE — slope of the tangent line
f(x) = x^2
f'(x) = 2x
f''(x) = 2
Easy when a function is univariate.
PARTIAL DERIVATIVE – MULTIVARIATE FUNCTIONS
For multivariate functions (e.g., two variables) we need partial derivatives – one per dimension. Examples of multivariate functions:
f(x, y) = x^2 + y^2 (convex)    f(x, y) = −x^2 − y^2 (concave)    f(x, y) = cos^2(x) + y^2    f(x, y) = cos^2(x) + cos^2(y)
PARTIAL DERIVATIVE – MULTIVARIATE FUNCTIONS
• To visualize the partial derivative for each of the dimensions x and y, we can imagine a plane that “cuts” our surface along the two dimensions; once again we get the slope of the tangent line.
Surface: f(x, y) = 9 − x^2 − y^2    Plane: y = 1    Cut: f(x, 1) = 8 − x^2
Slope/derivative of the cut: f'(x) = −2x
PARTIAL DERIVATIVE – MULTIVARIATE FUNCTIONS
• If we partially differentiate a function with respect to x, we pretend y is constant.
f(x, y) = 9 − x^2 − y^2
∂f/∂x = −2x    ∂f/∂y = −2y
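As a quick check of these partials, a short sympy sketch (assuming sympy is available) recovers the same derivatives symbolically:

import sympy as sp

x, y = sp.symbols("x y")
f = 9 - x**2 - y**2
print(sp.diff(f, x))  # -2*x  (y treated as constant)
print(sp.diff(f, y))  # -2*y  (x treated as constant)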
PARTIAL DERIVATIVE – MULTIVARIATE FUNCTIONS
• The two tangent lines that pass through a point define the tangent plane at that point.
GRADIENT VECTOR
• The gradient is the vector whose coordinates are the partial derivatives of the function:
f(x, y) = 9 − x^2 − y^2
∂f/∂x = −2x    ∂f/∂y = −2y
∇f = (∂f/∂x) i + (∂f/∂y) j = (∂f/∂x, ∂f/∂y) = (−2x, −2y)
Note: the gradient vector is not parallel to the tangent plane.
GRADIENT DESCENT
• Method to find local optima of a differentiable function f
• Intuition: the gradient tells us the direction of greatest increase; the negative gradient gives us the direction of greatest decrease
• Take steps in directions that reduce the function value
• The definition of the derivative guarantees that if we take a small enough step in the direction of the negative gradient, the function will decrease in value
• How small is small enough?
GRADIENT DESCENT
Gradient Descent Algorithm:
• Pick an initial point x_0
• Iterate until convergence
x_{t+1} = x_t − γ_t ∇f(x_t)
where γ_t is the t-th step size (sometimes called the learning rate)
Possible stopping criterion: iterate until ‖∇f(x_t)‖ ≤ ε for some ε > 0
How small should ε be?
GRADIENT DESCENT
f(x) = x^2, step size: 0.8
x^(0) = −4
x^(1) = −4 − 0.8 ⋅ 2 ⋅ (−4) = 2.4
x^(2) = 2.4 − 0.8 ⋅ 2 ⋅ 2.4 = −1.44
x^(3) = 0.864
x^(4) = −0.5184
x^(5) = 0.31104
⋮
x^(30) = −8.84296e−07
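A minimal Python sketch of the update rule x_{t+1} = x_t − γ∇f(x_t), reproducing the worked example above (all names are illustrative):

def gradient_descent(grad, x0, step_size, tol=1e-6, max_iters=1000):
    # Iterate x <- x - step_size * grad(x) until the gradient is tiny.
    x = x0
    for t in range(max_iters):
        g = grad(x)
        if abs(g) <= tol:
            break
        x = x - step_size * g
    return x

# Worked example from the slides: f(x) = x^2, so grad f(x) = 2x.
print(gradient_descent(grad=lambda x: 2 * x, x0=-4.0, step_size=0.8))
# ~ -3e-07: the iterates -4, 2.4, -1.44, 0.864, ... shrink in magnitude
# by a factor of |1 - 0.8*2| = 0.6 each step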
GRADIENT DESCENT
Step size: 0.9 — [figure]
GRADIENT DESCENT
Step size: 0.2 — [figure]
GRADIENT DESCENT
Step size matters!
LINE SEARCH
• Instead of picking a fixed step size that may or may not actually result in a decrease in the function value, we can minimize the function along the direction specified by the gradient to guarantee that the next iterate decreases the function value
• In other words, choose x_{t+1} ∈ arg min_{γ ≥ 0} f(x_t − γ∇f(x_t))
• This is called exact line search
• This optimization problem can be expensive to solve exactly
• However, if f is convex, this is a univariate convex optimization problem
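A sketch of one exact-line-search step using scipy's 1-D minimizer (the search interval (0, 10) is an illustrative assumption; in general the minimizing γ must be bracketed):

import numpy as np
from scipy.optimize import minimize_scalar

def exact_line_search_step(f, grad_f, x):
    d = grad_f(x)
    # Minimize the univariate function phi(gamma) = f(x - gamma * d) over gamma >= 0.
    res = minimize_scalar(lambda gamma: f(x - gamma * d), bounds=(0.0, 10.0), method="bounded")
    return x - res.x * d

# Example: f(x, y) = x^2 + 10 y^2
f = lambda v: v[0] ** 2 + 10 * v[1] ** 2
grad_f = lambda v: np.array([2 * v[0], 20 * v[1]])
print(exact_line_search_step(f, grad_f, np.array([1.0, 1.0])))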
BACKTRACKING LINE SEARCH
• Instead of exact line search, we could simply use a strategy that finds some step size that decreases the function value (one must exist)
• Backtracking line search: start with a large step size, γ, and keep shrinking it until f(x_t − γ∇f(x_t)) < f(x_t)
• This always guarantees a decrease, but it may not decrease as much as exact line search
• Still, this is typically much faster in practice as it only requires a few function evaluations
BACKTRACKING LINE SEARCH
• To implement backtracking line search, choose two parameters α ∈ (0, 0.5), β ∈ (0, 1)
• Set γ = 1
• While f(x_t − γ∇f(x_t)) > f(x_t) − α ⋅ γ ⋅ ‖∇f(x_t)‖^2
    • γ = βγ
• Set x_{t+1} = x_t − γ∇f(x_t)
Iterations continue until a step size is found that decreases the function “enough”
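A minimal sketch of this loop in Python (the parameter defaults are illustrative values within the stated ranges):

import numpy as np

def backtracking_step(f, grad_f, x, alpha=0.2, beta=0.5):
    g = grad_f(x)
    gamma = 1.0
    # Shrink gamma until the sufficient-decrease condition holds:
    # f(x - gamma * g) <= f(x) - alpha * gamma * ||g||^2
    while f(x - gamma * g) > f(x) - alpha * gamma * np.dot(g, g):
        gamma = beta * gamma
    return x - gamma * g

Each pass through the while loop costs only one extra function evaluation, which is why backtracking is cheap in practice.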
BACKTRACKING LINE SEARCH
α = 0.2, β = 0.99 — [figure]
BACKTRACKING LINE SEARCH
α = 0.1, β = 0.3 — [figure]
GRADIENT DESCENT: CONVEX FUNCTIONS
• For convex functions, local optima are always global optima (this follows from the definition of convexity)
• If gradient descent converges to a critical point, then the result is a global minimizer
• Not all convex functions are differentiable; can we still apply gradient descent?
GRADIENTS OF CONVEX FUNCTIONS
• For a differentiable convex function g(x), its gradients yield linear underestimators: g(y) ≥ g(x) + ∇g(x)^T (y − x) for all y
• Consequently, a zero gradient corresponds to a global optimum
[figure: tangent lines lying below the graph of g(x)]
SUBGRADIENTS
• For a convex function g(x), a subgradient at a point x_0 is given by any line l such that l(x_0) = g(x_0) and l(x) ≤ g(x) for all x, i.e., it is a linear underestimator
• If 0 is a subgradient at x_0, then x_0 is a global minimum
[figure: several lines touching g(x) at x_0 and lying below the graph]
SUBGRADIENTS
• If a convex function is differentiable at a point x, then it has a unique subgradient at x, given by the gradient
• If a convex function is not differentiable at a point x, it can have many subgradients
• E.g., the set of subgradients of the convex function |x| at the point x = 0 is given by the set of slopes [−1, 1]
• The set of all subgradients of f at x forms a convex set, i.e., if g and h are subgradients, then 0.5g + 0.5h is also a subgradient
• Subgradients are only guaranteed to exist for convex functions
SUBGRADIENT EXAMPLE
• Subgradient of g(x) = max(f_1(x), f_2(x)) for f_1, f_2 convex functions?
• If f_1(x) > f_2(x): ∇f_1(x)
• If f_2(x) > f_1(x): ∇f_2(x)
• If f_1(x) = f_2(x): ∇f_1(x) and ∇f_2(x) are both subgradients (and so are all convex combinations of these)
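This case analysis translates directly into code; a sketch, where f1 and f2 are the convex pieces and g1, g2 are hypothetical (sub)gradient oracles for them:

def subgradient_of_max(f1, f2, g1, g2, x):
    # Return a valid subgradient of g(x) = max(f1(x), f2(x)).
    # At a tie, either piece's subgradient (or any convex combination) works.
    return g1(x) if f1(x) >= f2(x) else g2(x)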
SUBGRADIENT DESCENT
Subgradient Descent Algorithm:
• Pick an initial point x_0
• Iterate until convergence
x_{t+1} = x_t − γ_t s_f(x_t)
where γ_t is the t-th step size and s_f(x_t) is a subgradient of f at x_t
Can you use line search here?
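A minimal sketch on f(x) = |x|, whose subgradient is sign(x) (any value in [−1, 1] is valid at x = 0); with the fixed step size 0.9 used on the next slide, the iterates oscillate instead of converging:

def subgradient_descent(subgrad, x0, step_size, num_iters=6):
    xs = [x0]
    for t in range(num_iters):
        xs.append(xs[-1] - step_size * subgrad(xs[-1]))
    return xs

sign = lambda x: 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
print(subgradient_descent(sign, x0=2.0, step_size=0.9))
# [2.0, 1.1, 0.2, -0.7, 0.2, -0.7, 0.2] -- bounces between 0.2 and -0.7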
SUBGRADIENT DESCENT
Step size: 0.9 — [figure]
DIMINISHING STEP SIZE RULES
• A fixed step size may not result in convergence for non-differentiable functions
• Instead, we can use a diminishing step size:
• Required property: the step size must decrease as the number of iterations increases, but not so quickly that the algorithm fails to make progress
• Common diminishing step size rules:
• γ_t = a / (b + t) for some a > 0, b ≥ 0
• γ_t = a / √t for some a > 0
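Swapping in the diminishing rule γ_t = a/√t makes the same |x| example converge; a sketch (a = 1 is an illustrative choice):

import math

def subgradient_descent_diminishing(subgrad, x0, a=1.0, num_iters=1000):
    x = x0
    for t in range(1, num_iters + 1):
        x = x - (a / math.sqrt(t)) * subgrad(x)  # gamma_t = a / sqrt(t)
    return x

sign = lambda x: 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
print(subgradient_descent_diminishing(sign, x0=2.0))
# oscillates toward 0 with amplitude shrinking like 1/sqrt(t)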
SUBGRADIENT DESCENT
Diminishing step size — [figure]
THEORETICAL GUARANTEES
• The hard work in convex optimization is to identify conditions that guarantee quick convergence to within a small error of the optimum
• Let f_best^(t) = min_{t′ ∈ {1, …, t}} f(x_{t′})
• For a fixed step size, γ, we are guaranteed that
lim_{t→∞} f_best^(t) − inf_x f(x) ≤ ε(γ)
where ε(γ) is some positive constant that depends on γ
• If f is differentiable, then we have ε(γ) = 0 whenever γ is small enough (more on rates of convergence later)
GRADIENT DESCENT CODE (PYTHON)
f(x) = x^4 − 3x^3 + 2
f'(x) = 4x^3 − 9x^2
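The original listing did not survive extraction; a minimal reconstruction consistent with the stated f and f' might look like this (the starting point and step size are illustrative choices):

# f(x) = x^4 - 3x^3 + 2 has a local minimum at x = 9/4 (where f'(x) = 0)
f_prime = lambda x: 4 * x**3 - 9 * x**2

x = 6.0           # illustrative starting point
step_size = 0.01
tol = 1e-8

# Iterate x <- x - step_size * f'(x) until the update is tiny.
while True:
    x_new = x - step_size * f_prime(x)
    if abs(x_new - x) < tol:
        break
    x = x_new

print(x)  # ~2.25, the minimizer of f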
GRADIENT DESCENT WITH MOMENTUM
v_t = γ v_{t−1} + η ∇_θ J(θ); typically γ is chosen to be around 0.9
MINI-BATCH GRADIENT DESCENT
• It is wasteful to compute the loss over the entire training set to perform a single parameter update for large datasets
• E.g., ImageNet has 14M images
• GD (a.k.a. vanilla GD) is replaced with mini-batch GD
• Mini-batch gradient descent
• Approach:
• Compute the loss ℒ(θ) on a mini-batch of images, update the parameters θ, and repeat until all images are used
• At the next epoch, shuffle the training data, and repeat the above process
• Mini-batch GD results in much faster training
• Typical batch size: 32 to 256 images
• It works because the examples in the training data are correlated
• I.e., the gradient from a mini-batch is a good approximation of the gradient of the entire training set
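A sketch of this training loop in NumPy (grad_loss is a hypothetical per-batch gradient oracle; all names are illustrative):

import numpy as np

def minibatch_gd(grad_loss, theta, data, batch_size=32, lr=0.01, num_epochs=10):
    n = len(data)
    for epoch in range(num_epochs):
        order = np.random.permutation(n)  # reshuffle the training data each epoch
        for start in range(0, n, batch_size):
            batch = data[order[start:start + batch_size]]
            theta = theta - lr * grad_loss(theta, batch)  # one update per mini-batch
    return theta

Setting batch_size=1 recovers stochastic gradient descent in the strict single-example sense discussed next.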
STOCHASTIC GRADIENT DESCENT
• Stochastic gradient descent
• SGD uses mini-batches that consist of a single input example
• E.g., a one-image mini-batch
• Although this method is very fast, it may cause significant fluctuations in the loss function
• Therefore, it is less commonly used, and mini-batch GD is preferred
• In most DL libraries, SGD typically means mini-batch SGD (with an option to add momentum)
PROBLEMS WITH GRADIENT DESCENT
• Besides the local minima problem, the GD algorithm can be very slow at plateaus, and it can get stuck at saddle points
[figure: cost ℒ(θ) vs. θ — very slow at a plateau (∇ℒ(θ) ≈ 0); stuck at a saddle point (∇ℒ(θ) = 0); stuck at a local minimum (∇ℒ(θ) = 0)]
Slide credit: Hung-yi Lee – Deep Learning Tutorial
GRADIENT DESCENT WITH MOMENTUM
• Gradient descent with momentum uses the momentum of the gradient for parameter optimization
[figure: cost ℒ(θ) vs. θ — movement = negative of gradient + momentum; the negative gradient, momentum, and real movement are shown at a point where gradient = 0]
Slide credit: Hung-yi Lee – Deep Learning Tutorial
GRADIENT DESCENT WITH MOMENTUM
• Parameter update in GD with momentum: θ_new = θ_old − V_new
• Where: V_new = β V_old + α ∇ℒ(θ_old)
• Compare to vanilla GD: θ_new = θ_old − α ∇ℒ(θ_old)
• The term V_new is called momentum
• This term accumulates the gradients from the past several steps
• It is similar to the momentum of a heavy ball rolling down the hill
• The parameter β is referred to as the coefficient of momentum
• A typical value of the parameter β is 0.9
• This method updates the parameters θ in the direction of the weighted average of the past gradients
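The two update equations above translate to a short sketch (β follows the slide's typical value; the learning rate α = 0.01 is an illustrative choice):

def momentum_step(theta, V, grad_fn, beta=0.9, alpha=0.01):
    # V_new = beta * V_old + alpha * grad L(theta_old); theta_new = theta_old - V_new
    V_new = beta * V + alpha * grad_fn(theta)
    theta_new = theta - V_new
    return theta_new, V_new

Because V_new folds in β V_old at every step, each update moves along a geometrically weighted average of all past gradients.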