optimization
(SI 416) – lecture 9
Harsha Hutridurga
IIT Bombay
recap
♣ Take a continuously differentiable function f : R^n → R
♣ Gradient descent algorithm reads as follows:
    Begin with an x^(0) ∈ R^n
    Build iterates using x^(n+1) = x^(n) − δ∇f(x^(n)) for n = 0, 1, 2, . . .
♣ If f is strongly convex, then there is a unique global minimizer x∗
♣ If f is further assumed to be β-smooth, then picking δ ∈ (0, β^{−1})
  yields a minimizing sequence, i.e. f(x^(n+1)) ≤ f(x^(n))
♣ Furthermore, we have the estimate:
    ‖x^(n) − x∗‖ ≤ (1/(1 + 2δλ))^{n/2} ‖x^(0) − x∗‖ for n = 0, 1, . . .
♣ Hence we deduced that
    n = O(ln(ε^{−1})) =⇒ ‖x^(n) − x∗‖ = O(ε)
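♣ A minimal Python sketch of this iteration (the quadratic objective used below is an illustrative choice, not part of the lecture):

```python
import numpy as np

def gradient_descent(grad_f, x0, delta, n_iters):
    """Gradient descent iterates: x^(n+1) = x^(n) - delta * grad f(x^(n))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - delta * grad_f(x)
    return x

# Illustrative strongly convex example: f(x) = 0.5 * ||x||^2, so grad f(x) = x
# and the unique global minimizer is x* = 0.
print(gradient_descent(grad_f=lambda x: x, x0=[1.0, -2.0], delta=0.1, n_iters=100))
```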
recap (contd.)
♣ Take a twice continuously differentiable function f : R^n → R
♣ Further assume that ∇²f(x) is invertible for all x ∈ R^n
♣ Newton’s algorithm reads as follows:
    Begin with an x^(0) ∈ R^n
    Build iterates using x^(n+1) = x^(n) − δ [∇²f(x^(n))]^{−1} ∇f(x^(n))
    for n = 0, 1, 2, . . .
♣ Suppose f is strongly convex. Then
    ▸ there is a unique global minimizer x∗
    ▸ the Hessian ∇²f(x) is positive definite
♣ Picking δ ≪ 1 small enough yields a minimizing sequence:
    f(x^(n+1)) ≤ f(x^(n))
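♣ A minimal Python sketch of this iteration (the quadratic test function below is illustrative, not from the lecture):

```python
import numpy as np

def newton(grad_f, hess_f, x0, delta=1.0, n_iters=20):
    """Damped Newton iterates: x^(n+1) = x^(n) - delta * [hess f(x^(n))]^{-1} grad f(x^(n))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        # Solve hess f(x) p = grad f(x) rather than forming the inverse explicitly.
        p = np.linalg.solve(hess_f(x), grad_f(x))
        x = x - delta * p
    return x

# Illustrative quadratic: f(x) = 0.5 <Ax, x> - <b, x> with A positive definite,
# so grad f(x) = Ax - b, hess f(x) = A, and the minimizer solves A x* = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(newton(lambda x: A @ x - b, lambda x: A, x0=[0.0, 0.0]))
print(np.linalg.solve(A, b))  # agrees with the iterate above
```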
recap (contd.)
♣ Assume further that f is smooth in the following sense:
    ‖∇²f(x)v − ∇²f(y)v‖ ≤ γ ‖x − y‖ ‖v‖ for all x, y, v ∈ R^n
  for some γ > 0
♣ For δ = 1, we have the estimate:
    ‖x^(n) − x∗‖ ≤ (γ/(2λ))^{2^n − 1} ‖x^(0) − x∗‖^{2^n}
♣ Note that
    ‖x^(0) − x∗‖ ≤ λ/γ =⇒ ‖x^(n) − x∗‖ ≤ (2λ/γ) 2^{−2^n}
♣ Hence we deduced that
    n = O(ln(ln(ε^{−1}))) =⇒ ‖x^(n) − x∗‖ = O(ε)
rates of convergence
♣ If a sequence {x^(n)} ⊂ R^n converges to a point x∗ ∈ R^n, then
    lim_{n→∞} ‖x^(n) − x∗‖ = 0
♣ For a convergent sequence, we can talk about the rate of convergence
    ▸ The convergence is linear if there exists a θ ∈ (0, 1) such that
        ‖x^(n+1) − x∗‖ ≤ θ ‖x^(n) − x∗‖
      for all n sufficiently large
    ▸ The convergence is superlinear if
        lim_{n→∞} ‖x^(n+1) − x∗‖ / ‖x^(n) − x∗‖ = 0
    ▸ The convergence is quadratic if there exists a C > 0 such that
        ‖x^(n+1) − x∗‖ ≤ C ‖x^(n) − x∗‖²
      for all n sufficiently large
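♣ These definitions can be checked numerically on an error sequence; the sequences below are made-up illustrations, not from the lecture:

```python
# Illustrative error sequences e_n = ||x^(n) - x*||:
linear    = [0.5 ** n for n in range(10)]        # satisfies e_{n+1} = 0.5 * e_n  (theta = 0.5)
quadratic = [0.5 ** (2 ** n) for n in range(5)]  # satisfies e_{n+1} = e_n^2      (C = 1)

# Ratios e_{n+1}/e_n: roughly constant in (0, 1) for linear convergence,
# tending to 0 for superlinear (and in particular quadratic) convergence.
print([e1 / e0 for e0, e1 in zip(linear, linear[1:])])
print([e1 / e0 for e0, e1 in zip(quadratic, quadratic[1:])])
```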
rate of convergence (contd.)
♣ Recall: For the gradient descent algorithm to minimize a strongly
  convex β-smooth function, we had
    ‖x^(n+1) − x∗‖ ≤ (1/(1 + 2δλ))^{1/2} ‖x^(n) − x∗‖
♣ Hence the convergence here is linear
♣ Recall: For Newton’s algorithm to minimize a smooth strongly
  convex function, we had
    ‖x^(n+1) − x∗‖ ≤ (γ/(2λ)) ‖x^(n) − x∗‖²
♣ Hence the convergence here is quadratic
line search algorithms
♣ Start with an initial vector x^(0) ∈ R^n and a direction p^(0) ∈ R^n
♣ Find the next iterate x^(1) along the line x^(0) + αp^(0) with α > 0 s.t.
    f(x^(1)) ≤ f(x^(0))
♣ At the point x^(1), pick a new direction p^(1) ∈ R^n
♣ Find the next iterate x^(2) along the line x^(1) + αp^(1) with α > 0 s.t.
    f(x^(2)) ≤ f(x^(1))
♣ General principle of line search algorithms:
    ▸ At the current iterate x^(n), choose a direction p^(n)
    ▸ Pick the next iterate x^(n+1) along the line x^(n) + αp^(n) with α > 0
      such that
        f(x^(n+1)) ≤ f(x^(n))
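♣ The general principle, written as a small Python template; choose_direction and choose_step are placeholder hooks of my own naming, to be filled in by the specific choices discussed on the next slides:

```python
import numpy as np

def line_search_method(f, x0, choose_direction, choose_step, n_iters=50):
    """Generic line search: at x^(n) pick a direction p^(n), then a step alpha > 0
    along x^(n) + alpha * p^(n) so that f(x^(n+1)) <= f(x^(n))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        p = choose_direction(x)        # search direction p^(n)
        alpha = choose_step(f, x, p)   # step length alpha > 0
        x = x + alpha * p              # next iterate x^(n+1)
    return x
```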
line search algorithms (contd.)
♣ At each iteration step, we may solve a one-dimensional
  minimization problem:
    min_{α>0} f(x^(n) + αp^(n))
♣ But, in practice, we are content with finding a candidate that
  comes close to solving the above one-dimensional problem
♣ The direction p^(n) is referred to as the search direction
♣ Recall the steepest descent algorithm:
    x^(n+1) = x^(n) − δ∇f(x^(n))
♣ So, here the search direction at the nth iteration step is
    p^(n) = −∇f(x^(n))
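♣ One crude way to come close to solving the one-dimensional problem is backtracking: start from some α and shrink it until f decreases. This is a standard heuristic, sketched here with arbitrary constants, not something derived in the lecture:

```python
def backtracking_step(f, x, p, alpha0=1.0, shrink=0.5, max_tries=30):
    """Shrink alpha geometrically until f(x + alpha * p) <= f(x);
    returns 0.0 (i.e. stay put) if no such alpha is found."""
    alpha = alpha0
    for _ in range(max_tries):
        if f(x + alpha * p) <= f(x):
            return alpha
        alpha *= shrink
    return 0.0
```

♣ With the steepest descent direction p^(n) = −∇f(x^(n)) and this step rule plugged into the template above, one recovers a basic steepest descent line search.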
line search algorithms (contd.)
♣ At the iterate x^(n) and for any search direction p^(n), we have
    f(x^(n) + αp^(n)) = f(x^(n)) + α ⟨∇f(x^(n)), p^(n)⟩
                        + (α²/2) ⟨∇²f(x^(n) + sp^(n)) p^(n), p^(n)⟩
  for some s ∈ (0, α), thanks to Taylor’s theorem.
♣ Define a function g : [0, ∞) → R as follows:
    g(α) := f(x^(n) + αp^(n)) for α ∈ [0, ∞).
♣ Observe that
    g′(0) = ⟨∇f(x^(n)), p^(n)⟩
♣ That is, the rate of change of f at the point x^(n) in the direction
  p^(n) is given by ⟨∇f(x^(n)), p^(n)⟩
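♣ A quick numerical check of g′(0) = ⟨∇f(x^(n)), p^(n)⟩; the function, point, and direction below are illustrative choices of mine:

```python
import numpy as np

f = lambda x: x[0] ** 2 + 3 * x[1] ** 2              # illustrative smooth function
grad_f = lambda x: np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, -1.0])    # stands in for the iterate x^(n)
p = np.array([0.5, 2.0])     # some search direction p^(n)
h = 1e-6

print((f(x + h * p) - f(x)) / h)   # finite-difference slope g'(0), approx -11
print(np.dot(grad_f(x), p))        # <grad f(x), p> = 2*0.5 + (-6)*2 = -11
```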
line search algorithms (contd.)
♣ If we are interested in finding a unit direction of maximum
  decrease at the point x^(n), we should understand
    min_{p ∈ R^n, ‖p‖ = 1} ⟨∇f(x^(n)), p⟩
♣ Recall that, if θ_n denotes the angle between ∇f(x^(n)) and p, then
    ⟨∇f(x^(n)), p⟩ = ‖p‖ ‖∇f(x^(n))‖ cos(θ_n) = ‖∇f(x^(n))‖ cos(θ_n)
♣ So, the minimum possible value of ⟨∇f(x^(n)), p⟩ is obtained when
  cos(θ_n) = −1
♣ Observe that the unit vector p which realises that is
    p = − ∇f(x^(n)) / ‖∇f(x^(n))‖
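♣ A small sanity check that this unit vector does minimize the inner product; the gradient value and the random comparison directions are purely illustrative:

```python
import numpy as np

g = np.array([2.0, -6.0])            # stands in for grad f(x^(n))
p_star = -g / np.linalg.norm(g)      # the claimed minimizer over unit vectors

# Compare <g, p_star> = -||g|| against random unit directions:
# none of them should give a smaller inner product.
rng = np.random.default_rng(0)
directions = rng.normal(size=(1000, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
print(np.dot(g, p_star), (directions @ g).min())
```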
line search algorithms (contd.)
♣ We have seen that steepest descent is a line search algorithm
  where we take the search direction
    p^(n) = −∇f(x^(n))
♣ Taylor’s theorem says
    f(x^(n) + αp^(n)) = f(x^(n)) + α ⟨∇f(x^(n)), p^(n)⟩
                        + (α²/2) ⟨∇²f(x^(n) + sp^(n)) p^(n), p^(n)⟩
♣ Hence, if we take 0 < α ≪ 1, and if we ensure that
    ⟨∇f(x^(n)), p^(n)⟩ < 0
  then we find that f(x^(n+1)) < f(x^(n))
♣ Any such direction p^(n) is referred to as a descent direction
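♣ A tiny illustration (with a toy convex function of my own choosing): a direction with ⟨∇f(x^(n)), p^(n)⟩ < 0 decreases f once α is small enough, though not necessarily for large α:

```python
import numpy as np

f = lambda x: x[0] ** 2 + 3 * x[1] ** 2              # illustrative smooth convex function
grad_f = lambda x: np.array([2 * x[0], 6 * x[1]])

x = np.array([1.0, -1.0])
p = -grad_f(x)    # a descent direction: <grad f(x), p> = -||grad f(x)||^2 < 0
for alpha in [1.0, 0.1, 0.01]:
    print(alpha, f(x + alpha * p) < f(x))   # decrease holds once alpha is small enough
```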
line search algorithms (contd.)
♣ For any search direction p, we have by Taylor’s theorem:
    f(x^(n) + p) = f(x^(n)) + ⟨∇f(x^(n)), p⟩ + (1/2) ⟨∇²f(x^(n) + sp) p, p⟩
  for some s ∈ (0, 1).
♣ Let us assume that ∇²f(x^(n) + sp) ≈ ∇²f(x^(n))
♣ Hence we obtain
    f(x^(n) + p) ≈ f(x^(n)) + ⟨∇f(x^(n)), p⟩ + (1/2) ⟨∇²f(x^(n)) p, p⟩ =: F(p)
♣ Observe that F is a quadratic function in p
♣ If ∇²f(x^(n)) is positive definite, then F(p) has a unique global minimizer
♣ Recall: that global minimizer p∗ is a critical point of F, i.e.
    ∇F(p∗) = 0 =⇒ p∗ = −[∇²f(x^(n))]^{−1} ∇f(x^(n))
♣ This is the search direction in Newton’s algorithm
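♣ A minimal sketch of this computation: build the quadratic model F(p) and minimize it by solving ∇²f(x^(n)) p = −∇f(x^(n)); the gradient and Hessian data below are illustrative stand-ins:

```python
import numpy as np

# Illustrative data standing in for grad f(x^(n)) and hess f(x^(n)) (positive definite):
g = np.array([1.0, -2.0])
H = np.array([[4.0, 1.0], [1.0, 3.0]])

# Quadratic model F(p), dropping the constant f(x^(n)):
F = lambda p: np.dot(g, p) + 0.5 * np.dot(H @ p, p)

p_star = -np.linalg.solve(H, g)   # critical point: grad F(p*) = g + H p* = 0
# Perturbing p_star in any direction does not decrease F (since H is positive definite):
print(F(p_star))
print(F(p_star + np.array([0.01, 0.0])), F(p_star + np.array([0.0, 0.01])))
```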
line search algorithms (contd.)
♣ Newton’s algorithm is also a line search algorithm
♣ The search direction in Newton’s algorithm is
    p^(n) = −[∇²f(x^(n))]^{−1} ∇f(x^(n))
♣ If ∇²f is strictly positive definite, then
    ⟨∇f(x^(n)), p^(n)⟩ = −⟨∇f(x^(n)), [∇²f(x^(n))]^{−1} ∇f(x^(n))⟩ < 0
  whenever ∇f(x^(n)) ≠ 0
♣ Hence the above p^(n) is a descent direction
end of lecture 9
thank you for your attention