
optimization

(SI 416) – lecture 9

Harsha Hutridurga

IIT Bombay



recap

♣ Take a continuously differentiable function f : Rn → R


♣ Gradient descent algorithm reads as follows:
  - Begin with an x(0) ∈ Rn
  - Build iterates using x(n+1) = x(n) − δ∇f(x(n)) for n = 0, 1, 2, . . .

♣ If f is strongly convex, then there is a unique global minimizer x∗


♣ If f is further assumed to be β-smooth, then picking δ ∈ (0, β⁻¹)
yields a minimizing sequence, i.e. f(x(n+1)) ≤ f(x(n))
♣ Furthermore, we have the estimate (with λ > 0 the strong convexity parameter of f):

  ‖x(n) − x∗‖ ≤ (1/(1 + 2δλ))^(n/2) ‖x(0) − x∗‖   for n = 0, 1, . . .

♣ Hence we deduced that


n = O(ln(ε⁻¹))  ⟹  ‖x(n) − x∗‖ = O(ε)

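As a concrete illustration (not part of the lecture), here is a minimal Python sketch of the gradient descent iteration above; the quadratic test function, the step size δ = 0.05 and the iteration count are illustrative choices:

import numpy as np

def gradient_descent(grad_f, x0, delta, n_iters):
    # x^(n+1) = x^(n) - delta * grad f(x^(n))
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - delta * grad_f(x)
    return x

# Illustrative strongly convex example: f(x) = 0.5 * ||A x - b||^2
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A.T @ (A @ x - b)
x_star_approx = gradient_descent(grad_f, x0=[0.0, 0.0], delta=0.05, n_iters=500)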


recap (contd.)

♣ Take a twice continuously differentiable function f : Rn → R


♣ Further assume that ∇²f(x) is invertible for all x ∈ Rn
♣ Newton’s algorithm reads as follows:
  - Begin with an x(0) ∈ Rn
  - Build iterates using x(n+1) = x(n) − δ [∇²f(x(n))]⁻¹ ∇f(x(n)) for n = 0, 1, 2, . . .
♣ Suppose f is strongly convex. Then
- there is a unique global minimizer x∗
- the Hessian ∇²f(x) is positive definite
♣ Picking δ ≪ 1 small enough yields a minimizing sequence:

f (x(n+1) ) ≤ f (x(n) )

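A minimal Python sketch of the (damped) Newton iteration above; it solves a linear system rather than forming the inverse Hessian, and the test function f(x) = Σ(exp(x_i) − x_i), minimized at x = 0, is an illustrative choice, not from the lecture:

import numpy as np

def newton_method(grad_f, hess_f, x0, delta=1.0, n_iters=20):
    # x^(n+1) = x^(n) - delta * [hess f(x^(n))]^(-1) grad f(x^(n))
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        p = np.linalg.solve(hess_f(x), grad_f(x))
        x = x - delta * p
    return x

# Illustrative example: f(x) = sum_i (exp(x_i) - x_i), minimized at x = 0
grad_f = lambda x: np.exp(x) - 1.0
hess_f = lambda x: np.diag(np.exp(x))
x_star_approx = newton_method(grad_f, hess_f, x0=[1.0, -0.5])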


recap (contd.)

♣ Assume further that f is smooth in the following sense:


‖∇²f(x)v − ∇²f(y)v‖ ≤ γ ‖x − y‖ ‖v‖ for all x, y, v ∈ Rn
for some γ > 0
♣ For δ = 1, we have the estimate:
  ‖x(n) − x∗‖ ≤ (γ/(2λ))^(2^n − 1) ‖x(0) − x∗‖^(2^n)

♣ Note that
 
  ‖x(0) − x∗‖ ≤ λ/γ   ⟹   ‖x(n) − x∗‖ ≤ (2λ/γ) · 2^(−2^n)

♣ Hence we deduced that


n = O(ln(ln(ε⁻¹)))  ⟹  ‖x(n) − x∗‖ = O(ε)

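To see why only O(ln(ln(ε⁻¹))) iterations are needed, note how rapidly 2^(−2^n) decays; a quick illustrative check in Python:

# 2^(-2^n) for n = 0,...,5: 0.5, 0.25, 6.25e-02, 3.9e-03, 1.5e-05, 2.3e-10
errors = [2.0 ** -(2 ** n) for n in range(6)]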


rates of convergence

♣ If a sequence {x(n)} ⊂ Rn converges to a point x∗ ∈ Rn, then

  lim_{n→∞} ‖x(n) − x∗‖ = 0

♣ For a convergent sequence, we can talk about the rate of convergence


- The convergence is linear if there exists a θ ∈ (0, 1) such that
  ‖x(n+1) − x∗‖ ≤ θ ‖x(n) − x∗‖ for all n sufficiently large
- The convergence is superlinear if
  lim_{n→∞} ‖x(n+1) − x∗‖ / ‖x(n) − x∗‖ = 0
- The convergence is quadratic if there exists a C > 0 such that
  ‖x(n+1) − x∗‖ ≤ C ‖x(n) − x∗‖² for all n sufficiently large
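
To make these notions concrete, consider the hypothetical error sequences 2^(−n) (linear, with θ = 1/2), 1/n! (superlinear) and 2^(−2^n) (quadratic), all converging to x∗ = 0; a quick check of the relevant ratios in Python (the sequences are illustrative, not from the lecture):

import math

lin  = [2.0 ** -n for n in range(1, 10)]                 # linear
sup  = [1.0 / math.factorial(n) for n in range(1, 10)]   # superlinear
quad = [2.0 ** -(2 ** n) for n in range(1, 6)]           # quadratic

ratios_lin  = [lin[i + 1] / lin[i] for i in range(len(lin) - 1)]          # all 0.5
ratios_sup  = [sup[i + 1] / sup[i] for i in range(len(sup) - 1)]          # tend to 0
ratios_quad = [quad[i + 1] / quad[i] ** 2 for i in range(len(quad) - 1)]  # all 1.0 (C = 1)
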
rate of convergence (contd.)

♣ Recall: For the gradient descent algorithm to minimize a strongly
convex β-smooth function, we had

  ‖x(n+1) − x∗‖ ≤ (1/(1 + 2δλ))^(1/2) ‖x(n) − x∗‖

♣ Hence the convergence here is linear


♣ Recall: For Newton’s algorithm to minimize a smooth strongly
convex function, we had

  ‖x(n+1) − x∗‖ ≤ (γ/(2λ)) ‖x(n) − x∗‖²

♣ Hence the convergence here is quadratic


line search algorithms

♣ Start with an initial vector x(0) ∈ Rn and a direction p(0) ∈ Rn


♣ Find the next iterate x(1) along the line x(0) + αp(0) with α > 0 s.t.
f (x(1) ) ≤ f (x(0) )

♣ At the point x(1) , pick a new direction p(1) ∈ Rn


♣ Find the next iterate x(2) along the line x(1) + αp(1) with α > 0 s.t.
f (x(2) ) ≤ f (x(1) )

♣ General principle of line search algorithms:


- At the current iterate x(n), choose a direction p(n)
- Pick the next iterate x(n+1) along the line x(n) + αp(n) with α > 0
such that
f (x(n+1) ) ≤ f (x(n) )

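A minimal Python sketch of this general template; the helper names choose_direction and choose_step, the test function and the constant step are illustrative placeholders, not from the lecture:

import numpy as np

def line_search_method(choose_direction, choose_step, x0, n_iters=50):
    # generic line search: pick a direction p^(n), then a step alpha > 0
    # such that f(x^(n) + alpha p^(n)) <= f(x^(n))
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        p = choose_direction(x)
        alpha = choose_step(x, p)
        x = x + alpha * p
    return x

# Illustrative instance: steepest descent direction with a constant step
grad_f = lambda x: 2.0 * x                     # gradient of f(x) = ||x||^2
x_min = line_search_method(lambda x: -grad_f(x), lambda x, p: 0.25, x0=[2.0, -1.0])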


line search algorithms (contd.)

♣ At each iteration step, we may perform a one-dimensional
minimization problem:

  min_{α>0} f(x(n) + αp(n))

♣ But, in practice, we are content with finding a candidate that
comes close to solving the above one-dimensional problem
♣ The direction p(n) is referred to as the search direction
♣ Recall the steepest descent algorithm:
x(n+1) = x(n) − δ∇f (x(n) )

♣ So, here the search direction at the nth iteration step is


p(n) = −∇f (x(n) )

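One common practical recipe for "coming close" is backtracking: start from a trial α and shrink it until the function value decreases. A minimal sketch (this particular rule, and the test function below, are illustrative assumptions, not prescribed in the lecture):

import numpy as np

def backtracking_step(f, x, p, alpha0=1.0, shrink=0.5, max_halvings=30):
    # shrink alpha until f(x + alpha p) < f(x); return 0.0 if no decrease is found
    fx, alpha = f(x), alpha0
    for _ in range(max_halvings):
        if f(x + alpha * p) < fx:
            return alpha
        alpha *= shrink
    return 0.0

# Illustrative use with the steepest descent direction p = -grad f(x)
f      = lambda x: np.exp(x[0]) + x[0] ** 2 + x[1] ** 2
grad_f = lambda x: np.array([np.exp(x[0]) + 2 * x[0], 2 * x[1]])
x = np.array([1.0, 1.0])
alpha = backtracking_step(f, x, -grad_f(x))    # 0.5 for this example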


line search algorithms (contd.)

♣ At the iterate x(n) and for any search direction p(n) , we have
  f(x(n) + αp(n)) = f(x(n)) + α ⟨∇f(x(n)), p(n)⟩ + (α²/2) ⟨∇²f(x(n) + sp(n)) p(n), p(n)⟩
for some s ∈ (0, α), thanks to Taylor’s theorem.
♣ Define a function g : [0, ∞) → R as follows:
g(α) := f (x(n) + αp(n) ) for α ∈ [0, ∞).

♣ Observe that
  g′(0) = ⟨∇f(x(n)), p(n)⟩

♣ That is, the rate of change of f at the point x(n) in the direction
p(n) is given by ⟨∇f(x(n)), p(n)⟩
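
A quick numerical sanity check of the identity g′(0) = ⟨∇f(x(n)), p(n)⟩ by a finite difference; the function, point and direction below are arbitrary illustrative choices, not from the lecture:

import numpy as np

f      = lambda x: 0.5 * (x @ x) + np.sin(x[0])
grad_f = lambda x: x + np.array([np.cos(x[0]), 0.0])

x, p, h = np.array([0.3, -1.2]), np.array([1.0, 2.0]), 1e-6
slope_fd    = (f(x + h * p) - f(x)) / h    # finite-difference slope g'(0)
slope_exact = grad_f(x) @ p                # <grad f(x), p>
# slope_fd and slope_exact agree up to O(h)
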
line search algorithms (contd.)

♣ If we are interested in finding a unit direction of maximum
decrease at the point x(n), we should understand

  min_{p∈Rn, ‖p‖=1} ⟨∇f(x(n)), p⟩

♣ Recall that, if θn denotes the angle between ∇f (x(n) ) and p, then


  ⟨∇f(x(n)), p⟩ = ‖p‖ ‖∇f(x(n))‖ cos(θn) = ‖∇f(x(n))‖ cos(θn)

♣ So, the minimum possible value of ⟨∇f(x(n)), p⟩ is obtained when cos(θn) = −1

♣ Observe that the unit vector p which realises that is

  p = −∇f(x(n)) / ‖∇f(x(n))‖



line search algorithms (contd.)

♣ We have seen that steepest descent is a line search algorithm
where we take the search direction
  p(n) = −∇f(x(n))

♣ Taylor’s theorem says


  f(x(n) + αp(n)) = f(x(n)) + α ⟨∇f(x(n)), p(n)⟩ + (α²/2) ⟨∇²f(x(n) + sp(n)) p(n), p(n)⟩

♣ Hence, if we take 0 < α ≪ 1, and if we ensure that

  ⟨∇f(x(n)), p(n)⟩ < 0

then we find that f (x(n+1) ) < f (x(n) )


♣ Any such direction p(n) is referred to as a descent direction
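
A small numerical check of this claim (the test function, point and direction are arbitrary illustrative choices, not from the lecture): for a direction with ⟨∇f(x), p⟩ < 0, the value f(x + αp) drops once α is small enough.

import numpy as np

f      = lambda x: np.exp(x[0]) + x[1] ** 2
grad_f = lambda x: np.array([np.exp(x[0]), 2.0 * x[1]])

x = np.array([0.5, 1.0])
p = np.array([-1.0, -0.2])
slope = grad_f(x) @ p                                          # negative: descent direction
decreases = [f(x + a * p) < f(x) for a in (1.0, 0.1, 0.01)]    # all True in this example
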
line search algorithms (contd.)

♣ For any search direction p, we have by Taylor’s theorem:


  f(x(n) + p) = f(x(n)) + ⟨∇f(x(n)), p⟩ + (1/2) ⟨∇²f(x(n) + sp) p, p⟩
for some s ∈ (0, 1).
♣ Let us assume that ∇²f(x(n) + sp) ≈ ∇²f(x(n))
♣ Hence we obtain
  f(x(n) + p) ≈ f(x(n)) + ⟨∇f(x(n)), p⟩ + (1/2) ⟨∇²f(x(n)) p, p⟩ =: F(p)

♣ Observe that F is a quadratic function in p


♣ If ∇²f is positive definite, then F(p) has a unique global minimizer
♣ Recall: that global minimizer p∗ is a critical point of F, i.e.
  ∇F(p∗) = 0  ⟹  p∗ = −[∇²f(x(n))]⁻¹ ∇f(x(n))

♣ This is the search direction in Newton’s algorithm


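A minimal sketch of obtaining this search direction from the quadratic model: minimize F(p) by solving ∇²f(x)p = −∇f(x) rather than inverting the Hessian (the Hessian and gradient below are illustrative data, not from the lecture):

import numpy as np

def newton_direction(grad, hess):
    # minimizer of F(p) = f(x) + <grad, p> + 0.5 <hess p, p>, hess positive definite
    return -np.linalg.solve(hess, grad)

H = np.array([[4.0, 1.0], [1.0, 3.0]])   # illustrative positive definite Hessian
g = np.array([1.0, -2.0])                # illustrative gradient
p_star = newton_direction(g, H)
residual = g + H @ p_star                # gradient of the model at p_star: ~0
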
line search algorithms (contd.)

♣ Newton’s algorithm is also a line search algorithm


♣ The search direction in Newton’s algorithm is
  p(n) = −[∇²f(x(n))]⁻¹ ∇f(x(n))

♣ If ∇²f is strictly positive definite, then

  ⟨∇f(x(n)), p(n)⟩ = −⟨∇f(x(n)), [∇²f(x(n))]⁻¹ ∇f(x(n))⟩ < 0

♣ Hence the above p(n) is a descent direction



end of lecture 9
thank you for your attention

