Chapter 4
Unconstrained Minimization in Rⁿ:
Newton Methods
1. Motivation
Again the problem here is

min_{x∈Rⁿ} f(x)    (1)
The basic idea is to take a quadratic approximation around the current point xi and minimize this approximation to get xi+1. So we have:

min_{x∈Rⁿ} f(xi) + ∇f(xi)(x − xi) + (1/2)(x − xi)∇²f(xi)(x − xi).
By setting the gradient of this approximation to zero, we get:

∇f(xi) + ∇²f(xi)(xi+1 − xi) = 0.

If ∇²f(xi)⁻¹ exists then

xi+1 = xi − ∇²f(xi)⁻¹∇f(xi),

which is the classical Newton iteration. An equivalent interpretation is obtained by linearizing ∇f(x) around xi and solving the linearized equations for xi+1.
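To make the iteration concrete, here is a minimal sketch in Python/NumPy (not from the text; the quartic test function and its derivatives are my own choice for illustration):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Classical Newton iteration: x_{i+1} = x_i - Hess(x_i)^{-1} grad(x_i).

    Assumes the Hessian is nonsingular at every iterate; no globalization.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve the linear system instead of forming the inverse explicitly.
        x = x - np.linalg.solve(hess(x), g)
    return x

# Illustrative example: f(x) = x1^4 + x1*x2 + (1 + x2)^2 (a smooth function
# with a nondegenerate local minimizer near (0.70, -1.35)).
grad = lambda x: np.array([4 * x[0]**3 + x[1], x[0] + 2 * (1 + x[1])])
hess = lambda x: np.array([[12 * x[0]**2, 1.0], [1.0, 2.0]])
x_star = newton(grad, hess, x0=[0.75, -1.25])
```

Solving the linear system ∇²f(xi)d = −∇f(xi) is preferred over forming the inverse explicitly, although both realize the same iteration.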
2. Local Convergence of Newton’s Method and its Superlinear and Quadratic Rates of Convergence
1 Theorem Let g: Rⁿ → Rⁿ be differentiable on an open neighborhood of a solution x̄ of g(x) = 0 and let ∇g be continuous and nonsingular at x̄. Then:

(a) x̄ is a point of attraction of the Newton iteration

xk+1 = xk − ∇g(xk)⁻¹g(xk)

and

lim_{k→∞} ‖xk+1 − x̄‖ / ‖xk − x̄‖ = 0    (Superlinear quotient rate)

(b) Moreover, if

‖∇g(x) − ∇g(x̄)‖ ≤ α‖x − x̄‖

for all x in some neighborhood N(x̄) of x̄, that is g ∈ LC¹_α(N(x̄)), then there is a positive constant c s.t.:

‖xk+1 − x̄‖ ≤ c‖xk − x̄‖²    (Quadratic quotient rate)

for all k ≥ k0, where k0 depends on x0.
Proof.
(a) We first show that ∇g(x) is nonsingular for all x in a neighborhood of x̄. Set β = ‖∇g(x̄)⁻¹‖ and let 0 < ε < (2β)⁻¹. By the continuity of ∇g at x̄ there is a δ > 0 s.t. ‖∇g(x) − ∇g(x̄)‖ ≤ ε whenever x ∈ S := {x | ‖x − x̄‖ ≤ δ}. Hence by the Banach Perturbation Lemma A.18:

‖(∇g(x̄) + ∇g(x) − ∇g(x̄))⁻¹‖ = ‖∇g(x)⁻¹‖ ≤ β/(1 − βε) < 2β.
Define the Newton mapping:

h(x) = x − ∇g(x)⁻¹g(x).

By B.6.2:

∇h(x̄) = I − ∇g(x̄)⁻¹∇g(x̄) = 0.

Hence by the Ostrowski Point of Attraction Theorem B.6.1, x̄ is a point of attraction.
Moreover, h is differentiable at x̄, that is:

lim_{k→∞} ‖h(xk) − h(x̄) − ∇h(x̄)(xk − x̄)‖ / ‖xk − x̄‖ = 0,

whenever lim_{k→∞} xk = x̄. Since ∇h(x̄) = 0, h(xk) = xk+1 and h(x̄) = x̄, this gives:

lim_{k→∞} ‖xk+1 − x̄‖ / ‖xk − x̄‖ = 0.
(b) By the Quadratic Bound Lemma B.1.3:

‖g(x) − g(x̄) − ∇g(x̄)(x − x̄)‖ ≤ (α/2)‖x − x̄‖²,

and

‖h(x) − h(x̄)‖ = ‖x − ∇g(x)⁻¹g(x) − x̄‖
≤ ‖−∇g(x)⁻¹(g(x) − g(x̄) − ∇g(x̄)(x − x̄))‖ + ‖∇g(x)⁻¹(∇g(x) − ∇g(x̄))(x − x̄)‖
≤ βα‖x − x̄‖² + 2βα‖x − x̄‖²,

in a suitable neighborhood of x̄, using ‖∇g(x)⁻¹‖ < 2β from part (a). Hence:

‖xk+1 − x̄‖ ≤ 3βα‖xk − x̄‖²,

provided that, for a given x0, k0 is chosen so that xk lies in that neighborhood for all k ≥ k0; by part (a) such a k0 is guaranteed to exist.
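The two rates in Theorem 1 can be observed numerically. In the following sketch (my own example, not from the text; g(x) = x³ − 2x − 5 is a classical one-dimensional test equation whose root has ∇g nonsingular), the quotient ratios e_{k+1}/e_k should tend to 0 while e_{k+1}/e_k² stays bounded:

```python
import numpy as np

# Classic 1-D illustration: g(x) = x^3 - 2x - 5, with root x_bar ≈ 2.0945515
# at which g'(x_bar) ≠ 0, so Theorem 1 applies.
g  = lambda x: x**3 - 2 * x - 5
dg = lambda x: 3 * x**2 - 2

xs = [2.5]
for _ in range(6):
    xs.append(xs[-1] - g(xs[-1]) / dg(xs[-1]))

x_bar = xs[-1]                      # converged to machine precision
errs = [abs(x - x_bar) for x in xs[:-2]]
# Quotient ratios: e_{k+1}/e_k -> 0 (superlinear), e_{k+1}/e_k^2 bounded (quadratic).
lin_ratios  = [errs[i + 1] / errs[i]     for i in range(len(errs) - 1)]
quad_ratios = [errs[i + 1] / errs[i]**2  for i in range(len(errs) - 1)]
```

Here the quadratic ratios settle near |g″(x̄)| / (2|g′(x̄)|), the constant predicted by part (b).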
3. Local convergence of Newton’s method, global convergence of damped Newton’s method and their quadratic rate of convergence
1 Theorem (Local Newton convergence & its quadratic rate)
Consider the Newton algorithm

xi+1 = xi − ∇²f(xi)⁻¹∇f(xi)

for solving ∇f(x) = 0 where f: Rⁿ → R. Assume that:

(i) ∃x̄: ∇f(x̄) = 0
(ii) ‖∇²f(y) − ∇²f(x)‖ ≤ R‖y − x‖, ∀x, y ∈ B(x̄), for some R > 0
(iii) ‖∇²f(x)⁻¹‖ ≤ 1/m, ∀x ∈ B(x̄), for some m > 0, and B_{2m/R}(x̄) ⊂ B(x̄)
(iv) ‖x0 − x̄‖ < 2m/R

Then

(a): (i), (ii) and (iii) =⇒ ‖xi+1 − x̄‖ ≤ (R/2m)‖xi − x̄‖²
(b): (i), (ii), (iii) and (iv) =⇒ ∀i: xi ∈ B_{2m/R}(x̄) & xi → x̄

Note: B_{2m/R}(x̄) ⊂ B(x̄) can be induced by taking R larger, but then x0 must be closer to x̄.
Proof.
(a)

xi+1 − x̄ = xi − x̄ − ∇²fi⁻¹(∇fi − ∇f(x̄))    (by (i))
= xi − x̄ − ∇²fi⁻¹ [∇f(x̄ + t(xi − x̄))]_{t=0}^{t=1}
= xi − x̄ − ∇²fi⁻¹ ∫₀¹ (∇²f(x̄ + t(xi − x̄)) − ∇²fi + ∇²fi)(xi − x̄) dt
(by continuity of ∇²f from (ii)).

Since the ∇²fi term integrates to ∇²fi(xi − x̄) and cancels xi − x̄, we get:

‖xi+1 − x̄‖ ≤ ‖∇²fi⁻¹‖ ∫₀¹ ‖∇²f(x̄ + t(xi − x̄)) − ∇²fi‖ · ‖xi − x̄‖ dt    (by Lemma B.2.2)
≤ (1/m) ∫₀¹ R(1 − t)‖xi − x̄‖ dt · ‖xi − x̄‖    (by (ii), (iii))
= (R/2m)‖xi − x̄‖².
(b) By induction:

(R/2m)‖xi − x̄‖ ≤ ((R/2m)‖x0 − x̄‖)^{2^i},    i = 1, 2, . . .

(Induction proof: write ai := (R/2m)‖xi − x̄‖, so the claim is ai ≤ a0^{2^i}. The inequality holds for i = 1. Suppose it holds for i, that is ai ≤ a0^{2^i}. Then:

ai+1 ≤ (ai)² ≤ (a0^{2^i})² = a0^{2^{i+1}},

where the first inequality follows from part (a). Thus the induction is extended to i + 1.) Hence, since a0 = (R/2m)‖x0 − x̄‖ < 1 by (iv):

‖xi − x̄‖ ≤ (2m/R)((R/2m)‖x0 − x̄‖)^{2^i} < 2m/R,

so xi ∈ B_{2m/R}(x̄) for all i, and lim_{i→∞} ‖xi − x̄‖ = 0.
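A quick one-dimensional sanity check of the bound in part (a) (my own example, not from the text; the constants R and m are computed by hand for this f):

```python
# f(x) = x^2/2 + x^4/12, so f'(x) = x + x^3/3, f''(x) = 1 + x^2, x_bar = 0.
# On B(x_bar) = [-1, 1]: |f''(x) - f''(y)| = |x^2 - y^2| <= 2|x - y|  (R = 2)
# and |f''(x)^{-1}| <= 1  (m = 1), so the bound reads e_{i+1} <= (R/2m) e_i^2 = e_i^2.
R, m = 2.0, 1.0
x = 0.9            # satisfies (iv): |x0 - x_bar| = 0.9 < 2m/R = 1
for _ in range(5):
    x_next = x - (x + x**3 / 3) / (1 + x**2)   # Newton step on f'(x) = 0
    assert abs(x_next) <= (R / (2 * m)) * abs(x)**2 + 1e-15
    x = x_next
```

Each step verifies the quadratic bound directly, and five steps already drive the error far below machine-visible levels.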
2 Theorem (Globally convergent damped Newton & its rate of convergence)
Consider the damped Newton algorithm:

Direction choice: pi = −∇²f(xi)⁻¹∇f(xi)
Stepsize: xi+1 = xi + λi pi, with λi chosen such that:

f(xi) − f(xi+1) ≥ σ2(−∇f(xi)pi),

for some forcing function σ2. (Hence all stepsizes of SCAT1 can be used.¹)

Assumption
(v) f ∈ C² and ∃M ≥ m > 0 s.t. ∀x, y ∈ Rⁿ:

m‖y‖² ≤ y∇²f(x)y ≤ M‖y‖²

Then

(a): (v) =⇒ {xi} → x̄ where x̄ is the unique solution of min_{x∈Rⁿ} f(x)
(b): (ii), (v) and “min. or first stationary point along pi” stepsize =⇒

‖xi+1 − x̄‖ ≤ (M^{1/2}R / (2m^{3/2}))‖xi − x̄‖²
¹ O. L. Mangasarian: “Parallel gradient distribution in unconstrained optimization”, SIAM Journal on Control and Optimization 33, 1995, 1916-1925, Theorem 2.
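One concrete stepsize rule of the forcing-function type is Armijo backtracking. The following sketch (my own choice of line search and test function, neither prescribed by the theorem) implements the damped Newton algorithm with it; the test function satisfies (v) with m = 1, M = 2:

```python
import numpy as np

def damped_newton(f, grad, hess, x0, c=1e-4, tol=1e-10, max_iter=100):
    """Damped Newton: p_i = -Hess(x_i)^{-1} grad(x_i) with Armijo backtracking.

    The Armijo test f(x) - f(x + lam*p) >= c*lam*(-grad(x) @ p) guarantees a
    decrease proportional to -grad(x) @ p, a forcing-function-type condition.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = -np.linalg.solve(hess(x), g)        # Newton direction
        lam = 1.0
        # Backtracking: halve the step until sufficient decrease holds.
        while f(x + lam * p) > f(x) - c * lam * (-g @ p):
            lam *= 0.5
        x = x + lam * p
    return x

# Example satisfying (v): f(x) = sum(x_i^2/2 + log cosh x_i), whose Hessian
# is diagonal with entries 1 + sech^2(x_i) in [m, M] = [1, 2]; minimizer is 0.
f    = lambda x: np.sum(x**2 / 2 + np.log(np.cosh(x)))
grad = lambda x: x + np.tanh(x)
hess = lambda x: np.diag(1 + 1 / np.cosh(x)**2)
x_min = damped_newton(f, grad, hess, np.array([5.0, -3.0]))
```

Near the solution the full step λi = 1 passes the Armijo test, so the iteration reduces to the undamped Newton method and inherits its quadratic rate, as part (b) asserts.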
Proof.
(a) −∇fi pi = ∇fi ∇²fi⁻¹ ∇fi ≥ (1/M)‖∇fi‖² =: σ(‖∇fi‖)

Because f(xi+1) ≤ f(xi), the bounded set S = {x | f(x) ≤ f(x0)} ⊂ {x | ‖x − x0‖ ≤ (2/m)‖∇f(x0)‖} contains {xi}. But:

‖pi‖² = ∇fi ∇²fi⁻² ∇fi ≤ (1/m²)‖∇fi‖² ≤ (1/m²) max_{x∈S} ‖∇f(x)‖².

So {(xi, pi)} is bounded and hence must have an accumulation point. But for each such accumulation point (x̄, p̄), the forcing-function condition gives ∇f(x̄) = 0, and by the strong convexity in (v) such a stationary point x̄ is the unique solution of min_{x∈Rⁿ} f(x). So {xi} → x̄.
(b) We need the following lemma:

3 Lemma Let f: Rⁿ → R, f ∈ C², and let M ≥ m > 0 s.t. ∀x, y ∈ Rⁿ:

m‖y‖² ≤ y∇²f(x)y ≤ M‖y‖².

Then:

f(y) ≤ f(x), ∇f(x̄) = 0 =⇒ ‖y − x̄‖ ≤ (M/m)^{1/2}‖x − x̄‖
Proof of Lemma

f(y) = f(x̄) + ∇f(x̄)(y − x̄) + (1/2)(y − x̄)∇²f(s)(y − x̄)
f(x) = f(x̄) + ∇f(x̄)(x − x̄) + (1/2)(x − x̄)∇²f(t)(x − x̄),

where s ∈ (x̄, y) and t ∈ (x̄, x). Since ∇f(x̄) = 0, subtraction gives:

0 ≥ f(y) − f(x) = (1/2)(y − x̄)∇²f(s)(y − x̄) − (1/2)(x − x̄)∇²f(t)(x − x̄)
≥ (1/2)m‖y − x̄‖² − (1/2)M‖x − x̄‖²,

so m‖y − x̄‖² ≤ M‖x − x̄‖², which is the claimed bound.
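The lemma can be sanity-checked numerically; a quadratic f with Hessian diag(m, M) (my own example, not from the text) makes the constants exact:

```python
import numpy as np

# f(x) = (m*x1^2 + M*x2^2)/2 has constant Hessian diag(m, M) and minimizer
# x_bar = 0, so the lemma reads: f(y) <= f(x)  =>  ||y|| <= (M/m)**0.5 * ||x||.
m, M = 1.0, 4.0
f = lambda z: 0.5 * (m * z[0]**2 + M * z[1]**2)

rng = np.random.default_rng(0)
checked = 0
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    if f(y) <= f(x):                # hypothesis of the lemma
        assert np.linalg.norm(y) <= (M / m)**0.5 * np.linalg.norm(x) + 1e-12
        checked += 1
```

For this f the bound follows directly: m‖y‖² ≤ 2f(y) ≤ 2f(x) ≤ M‖x‖².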
Going back to the proof of part (b), let x̄i = xi − ∇²fi⁻¹∇fi, that is, x̄i is the “xi+1” of the undamped Newton algorithm. Because we are minimizing along pi and x̄i = xi + pi, we have f(xi+1) ≤ f(x̄i), and hence by the above lemma (with y = xi+1, x = x̄i):

‖xi+1 − x̄‖ ≤ (M/m)^{1/2}‖x̄i − x̄‖.

By part (a) of the previous theorem, Theorem 1,

‖x̄i − x̄‖ ≤ (R/2m)‖xi − x̄‖².

Hence:

‖xi+1 − x̄‖ ≤ (M^{1/2}R / (2m^{3/2}))‖xi − x̄‖².