CS 726: Nonlinear Optimization 1
Lecture 3: Differentiability
Michael C. Ferris
Computer Sciences Department
University of Wisconsin-Madison
January 29, 2021
Background Material
Since we spent part of the lecture on the Background Quiz, I would
like you to review the whole of [Wright and Recht(2020), Chapter 2].
For the definitions of local and global minimizers, see
[Wright and Recht(2020), Section 2.1].
Reading through this additional material, along with the Bertsekas
material mentioned last time, may be helpful.
Taxonomy of Solutions
2.1 Taxonomy of Solutions to Optimization Problems
Before we can begin designing algorithms, we must determine what it means
to solve an optimization problem. Suppose that f is a function mapping some
domain D ⊂ R^n to the real line R. We have the following definitions.
• x* ∈ D is a local minimizer of f if there is a neighborhood N of x* such that
f(x) ≥ f(x*) for all x ∈ N ∩ D.
• x* ∈ D is a global minimizer of f if f(x) ≥ f(x*) for all x ∈ D.
• x* ∈ D is a strict local minimizer if it is a local minimizer for some
neighborhood N of x*, and in addition f(x) > f(x*) for all x ∈ N with x ≠ x*.
• x* is an isolated local minimizer if there is a neighborhood N of x* such
that f(x) ≥ f(x*) for all x ∈ N ∩ D and, in addition, N contains no local
minimizers other than x*.
• x* is the unique minimizer if it is the only global minimizer.
For the constrained optimization problem
    min_{x ∈ Ω} f(x),    (2.1)
where Ω ⊂ D ⊂ R^n is a closed set, we modify the terminology slightly to use
the word “solution” rather than “minimizer.” That is, we have the following
definitions.
Note the distinction between a local solution and a global solution.
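To make the taxonomy concrete, here is a small sketch (my own illustration, not from the text or the slides): a one-dimensional quartic with two local minimizers, where the starting point determines which minimizer a local method finds.

```python
import numpy as np
from scipy.optimize import minimize

# A quartic with two local minimizers; only one is the global minimizer.
f = lambda x: x[0] ** 4 - 3 * x[0] ** 2 + x[0]

for x0 in (-2.0, 2.0):                # two different starting points
    res = minimize(f, np.array([x0]))
    print(f"start {x0:+.1f} -> x* = {res.x[0]:+.4f}, f(x*) = {res.fun:+.4f}")
# Both answers are local minimizers; the run with the smaller objective
# value (the one started at -2.0) has found the global minimizer.
```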
Taylor’s Theorem and Differentiability
Convention
f : R^n → R.
  ▸ Df(x) is a 1 × n row vector.
  ▸ ∇f(x) = [Df(x)]^T (a column vector).
g : R^n → R^m.
  ▸ Dg(x) ∈ R^{m×n}.
  ▸ [Dg(x)]^T = ∇g(x) ∈ R^{n×m}.
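A quick way to keep these shape conventions straight is to check them numerically. The sketch below (my own, using an arbitrary test map) forms a forward-difference Jacobian of g : R^3 → R^2 and confirms that Dg(x) ∈ R^{2×3} while ∇g(x) = [Dg(x)]^T ∈ R^{3×2}.

```python
import numpy as np

def g(x):
    # An arbitrary test map g : R^3 -> R^2, chosen only for illustration.
    return np.array([x[0] * x[1], np.sin(x[2])])

def jacobian_fd(g, x, h=1e-6):
    """Forward-difference approximation of Dg(x), an m-by-n matrix."""
    m, n = g(x).size, x.size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (g(x + e) - g(x)) / h   # column j: partials w.r.t. x_j
    return J

x = np.array([1.0, 2.0, 0.5])
J = jacobian_fd(g, x)
print(J.shape)     # (2, 3): Dg(x) is m x n
print(J.T.shape)   # (3, 2): grad g(x) = [Dg(x)]^T is n x m
```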
Theorem (First-order Taylor)
([Wright and Recht(2020), Theorem 2.1])
Let f ∈ C¹. Then
1. f(x + p) = f(x) + ∇f(x + γp)^T p for some γ ∈ (0, 1].
2. f(x + p) = f(x) + ∇f(x)^T p + o(‖p‖), where lim_{t↓0} o(t)/t = 0.
3. f(x + p) = f(x) + ∫₀¹ ∇f(x + γp)^T p dγ.
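A numerical sanity check of part 2 (my own sketch, not from the text): the remainder f(x + p) − f(x) − ∇f(x)^T p should vanish faster than ‖p‖, so the ratio printed below should go to zero as t ↓ 0.

```python
import numpy as np

f = lambda x: np.exp(x[0]) + x[1] ** 2               # a smooth test function
grad = lambda x: np.array([np.exp(x[0]), 2 * x[1]])  # its gradient

x = np.array([0.3, -1.0])
d = np.array([1.0, 2.0])                             # a fixed direction
for t in (1e-1, 1e-2, 1e-3, 1e-4):
    p = t * d
    remainder = f(x + p) - f(x) - grad(x) @ p
    print(t, remainder / np.linalg.norm(p))          # ratio -> 0 as t -> 0
```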
Theorem (Second-order Taylor)
([Wright and Recht(2020), Theorem 2.1])
Let f ∈ C². Then
1. f(x + p) = f(x) + ∇f(x)^T p + ½ p^T ∇²f(x + γp) p for some γ ∈ (0, 1].
2. f(x + p) = f(x) + ∇f(x)^T p + ½ p^T ∇²f(x) p + o(‖p‖²).
3. ∇f(x + p) = ∇f(x) + ∫₀¹ ∇²f(x + γp) p dγ.
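The same kind of check (again my own sketch) for part 2 of the second-order expansion: with the Hessian term included, the remainder is o(‖p‖²), so even after dividing by ‖p‖² the ratio still vanishes.

```python
import numpy as np

f = lambda x: np.exp(x[0]) + x[1] ** 2
grad = lambda x: np.array([np.exp(x[0]), 2 * x[1]])
hess = lambda x: np.array([[np.exp(x[0]), 0.0], [0.0, 2.0]])

x = np.array([0.3, -1.0])
d = np.array([1.0, 2.0])
for t in (1e-1, 1e-2, 1e-3):
    p = t * d
    model = f(x) + grad(x) @ p + 0.5 * p @ hess(x) @ p
    print(t, (f(x + p) - model) / np.linalg.norm(p) ** 2)  # -> 0 as t -> 0
```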
Theorem
First Order Sufficiency Condition: Let f : R^n → R̄ (an extended
real-valued function). Suppose f is convex and x̄ is a local minimizer of
f. Then x̄ is a global minimizer of f over R^n, that is, f(x̄) = min_{x ∈ R^n} f(x).
Proof.
Suppose f is convex and x̄ is a local minimizer of f, and suppose ∃ x̂
which is better (i.e., f(x̂) < f(x̄)). Construct an ε-ball around the point x̄
such that f(x) ≥ f(x̄) for all x ∈ B_ε(x̄).
[Figure: the segment joining x̂ and x̄, with x_λ inside the ε-ball]
Now define x_λ = x̄ + λ(x̂ − x̄). Note this is equivalent to λx̂ + (1 − λ)x̄,
so x_λ is on the segment joining x̄ and x̂. If we take λ > 0 sufficiently
small, we can ensure x_λ ∈ B_ε(x̄). Therefore f(x_λ) ≥ f(x̄).
Proof (cont.)
But we also know f(x̄) ≤ f(x_λ), and
f(x_λ) = f(λx̂ + (1 − λ)x̄)
by our definition of x_λ. Since f is convex,
f(λx̂ + (1 − λ)x̄) ≤ λf(x̂) + (1 − λ)f(x̄).
By assumption, f(x̂) < f(x̄), so
f(x_λ) ≤ λf(x̂) + (1 − λ)f(x̄) < λf(x̄) + (1 − λ)f(x̄) = f(x̄).
This produces a point in the neighborhood we defined that is strictly
better than the local minimizer, a contradiction; so x̄ must be a global
minimizer.
It is worth noting that this result holds regardless of differentiability; it
depends only on the convexity of f.
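A tiny numerical rendering of the argument (my own sketch): for a convex f with a better point x̂, every λ > 0 already yields f(x_λ) < f(x̄), so no ball around x̄ can certify a local minimum.

```python
f = lambda x: (x - 2.0) ** 2      # convex; global minimizer at x = 2
xbar, xhat = 0.0, 2.0             # pretend xbar is a local min; xhat is better

for lam in (0.5, 0.1, 0.01, 0.001):
    x_lam = xbar + lam * (xhat - xbar)
    print(lam, f(x_lam) < f(xbar))   # True for every lam > 0
# Arbitrarily close to xbar there are strictly better points, so xbar
# cannot be a local minimizer: exactly the contradiction in the proof.
```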
Uniqueness of Global Minimizers
Theorem
Uniqueness Result: Let f : R^n → R̄ be strictly convex. Then, if f has a
global minimizer, that global minimizer is unique.
Proof.
Let x¹, x² be distinct global minimizers. For 0 < λ < 1, because f is
strictly convex,
f(λx¹ + (1 − λ)x²) < λf(x¹) + (1 − λ)f(x²).
For both x¹ and x² to be global minimizers we must have f(x¹) = f(x²), so
replace f(x²) with f(x¹) on the right-hand side of the above inequality:
f(λx¹ + (1 − λ)x²) < λf(x¹) + (1 − λ)f(x¹) = f(x¹).
This says ∃ a point x between x¹ and x² such that f(x) < f(x¹), a
contradiction. Hence f must have a unique global minimizer.
As we will see throughout this text, a crucial quantity in optimization is the
Lipschitz constant L for the gradient of f, which is defined to satisfy
    ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, for all x, y ∈ dom(f).    (2.7)
We say that a continuously differentiable function f with this property is
L-smooth or has L-Lipschitz gradients. We say that f is L₀-Lipschitz if
    |f(x) − f(y)| ≤ L₀‖x − y‖, for all x, y ∈ dom(f).    (2.8)
From (2.2), we have
    f(y) − f(x) − ∇f(x)^T (y − x) = ∫₀¹ [∇f(x + γ(y − x)) − ∇f(x)]^T (y − x) dγ.
By using (2.7), we have
    [∇f(x + γ(y − x)) − ∇f(x)]^T (y − x) ≤ ‖∇f(x + γ(y − x)) − ∇f(x)‖ ‖y − x‖ ≤ Lγ‖y − x‖².
By substituting this bound into the previous integral, we obtain the following
result.
Lemma 2.2 Given an L-smooth function f, we have for any x, y ∈ dom(f)
that
    f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2)‖y − x‖².    (2.9)
Lemma 2.2 asserts that f can be upper bounded by a quadratic function
whose value at x is equal to f(x).
When f is twice continuously differentiable, we can characterize the
constant L in terms of the eigenvalues of the Hessian ∇²f(x): we can take L
to be any bound on the absolute values of the eigenvalues of ∇²f(x) over dom(f).
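A numerical illustration of Lemma 2.2 (my own sketch): for the least-squares function f(x) = ½‖Ax − b‖², the Hessian is AᵀA everywhere, so L = λ_max(AᵀA) works in (2.7), and the quadratic upper bound (2.9) should hold at any pair of points.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.eigvalsh(A.T @ A).max()   # largest Hessian eigenvalue

for _ in range(5):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    bound = f(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2
    assert f(y) <= bound + 1e-9         # the bound (2.9) holds
print("Lemma 2.2 bound verified at random point pairs")
```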
Strict and Strong Convexity
Definition
From now on, x_1, x_n, etc. will be used to refer to components of vectors,
and x¹, xⁿ, etc. will be used to refer to distinct points.
Definition
Strictly Convex: A function f : R^n → R̄ is strictly convex if for all x, y
with x ≠ y and all α ∈ (0, 1),
f((1 − α)x + αy) < (1 − α)f(x) + αf(y).
Note this definition is identical to that of convex functions, except the
inequality is now strict (the endpoints α = 0, 1 are excluded, since equality
always holds there). For example, f(x) = ‖x‖² is strictly convex, while an
affine function is convex but not strictly convex.
Strong Convexity
Let f : Ω → R and h : Ω → R^n, where Ω ⊂ R^n is an open convex set.
Definition
f is strongly convex with modulus ρ on Ω if ∃ ρ > 0 such that for all
x, y ∈ Ω and λ ∈ [0, 1],
    f((1 − λ)x + λy) ≤ (1 − λ)f(x) + λf(y) − (ρ/2)λ(1 − λ)‖x − y‖².
Definition
h is strongly monotone with modulus ρ on Ω if ∃ ρ > 0 such that for all
x, y ∈ Ω,
    ⟨h(x) − h(y), x − y⟩ ≥ ρ‖x − y‖².
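As a concrete instance (my own sketch, not from the slides): f(x) = ½xᵀQx with Q symmetric positive definite is strongly convex with modulus ρ = λ_min(Q), and the defining inequality can be checked directly.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
Q = M.T @ M + 0.5 * np.eye(4)        # symmetric positive definite
f = lambda x: 0.5 * x @ Q @ x
rho = np.linalg.eigvalsh(Q).min()    # strong convexity modulus

for _ in range(5):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    lam = rng.uniform()
    lhs = f((1 - lam) * x + lam * y)
    rhs = ((1 - lam) * f(x) + lam * f(y)
           - 0.5 * rho * lam * (1 - lam) * np.linalg.norm(x - y) ** 2)
    assert lhs <= rhs + 1e-9         # the strong convexity inequality
print("strong convexity inequality holds with rho =", rho)
```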
Theorem (Strong convexity)
If f is continuously differentiable on Ω, then the following are equivalent:
(a) f is strongly convex with modulus ρ on Ω;
(b) for all x, y ∈ Ω,
    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (ρ/2)‖x − y‖²;
(c) ∇f is strongly monotone with modulus ρ on Ω.
If f is twice continuously differentiable on Ω, then
(d) for all x, y, z ∈ Ω, ⟨x − y, ∇²f(z)(x − y)⟩ ≥ ρ‖x − y‖²
is equivalent to the above.
Proof.
We show (a) ⟺ (b) ⟺ (c).
(a) ⟹ (b): The hypothesis gives
    [f(x + λ(y − x)) − f(x)]/λ ≤ f(y) − f(x) − (1 − λ)(ρ/2)‖x − y‖²,
so taking the limit as λ ↓ 0,
    ⟨∇f(x), y − x⟩ ≤ f(y) − f(x) − (ρ/2)‖x − y‖².
Proof (cont.)
(b) ⟹ (c): Applying (b) twice gives
    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (ρ/2)‖x − y‖²
    f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (ρ/2)‖x − y‖².
Adding these inequalities gives
    f(y) + f(x) ≥ f(x) + f(y) + ⟨∇f(x) − ∇f(y), y − x⟩ + ρ‖x − y‖²,
from which the result follows.
Proof (cont.)
(c) ⟹ (b): The hypothesis gives
    ∫₀¹ ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt ≥ ∫₀¹ ρ t ‖x − y‖² dt,
which implies the result: the left-hand side equals
f(y) − f(x) − ⟨∇f(x), y − x⟩ by the integral form of Taylor's theorem,
and the right-hand side equals (ρ/2)‖x − y‖².
Proof (cont.)
(b) ⟹ (a): Letting y = u and x = (1 − λ)u + λv in (b) gives
    f(u) ≥ f((1 − λ)u + λv) + ⟨∇f((1 − λ)u + λv), λ(u − v)⟩ + (ρ/2)‖λ(u − v)‖².    (1)
Also letting y = v and x = (1 − λ)u + λv in (b) implies
    f(v) ≥ f((1 − λ)u + λv) + ⟨∇f((1 − λ)u + λv), (1 − λ)(v − u)⟩ + (ρ/2)‖(1 − λ)(v − u)‖².    (2)
Adding (1 − λ) times (1) to λ times (2), the inner-product terms cancel and
the quadratic terms combine to (ρ/2)λ(1 − λ)‖u − v‖², which gives the
required result.
Proof (cont.)
To complete the proof, we assume that f is twice continuously
differentiable on Ω.
(d) ⟹ (c): This follows from the hypothesis since
    ⟨∇f(x) − ∇f(y), x − y⟩ = ⟨∫₀¹ ∇²f(y + t(x − y))(x − y) dt, x − y⟩ ≥ ρ‖x − y‖².
(c) ⟹ (d): Let x, y, z ∈ Ω. Then z + τ(x − y) ∈ Ω for sufficiently small τ > 0, so
    ⟨x − y, ∇²f(z + τ(x − y))(x − y)⟩ = ⟨x − y, ∇f(z + τ(x − y)) − ∇f(z)⟩/τ + o(1)
                                      ≥ ρ‖x − y‖² + o(1).
The result follows in the limit as τ → 0.
Then we give the definition of the strong convexity modulus m:
    f((1 − α)x + αy) + ½ m α(1 − α)‖x − y‖₂² ≤ (1 − α)f(x) + αf(y).
Strong convexity clearly implies strict convexity.
Lemma
If f is differentiable and strictly convex, that is, for x ≠ y and λ ∈ (0, 1),
    f(x + λ(y − x)) < f(x) + λ(f(y) − f(x)),
then
    f(y) > f(x) + ∇f(x)^T (y − x).
Proof.
Suppose f(y) = f(x) + ⟨∇f(x), y − x⟩ for some x ≠ y. Let
    φ(t) = f(x + t(y − x)) − f(x) − t⟨∇f(x), y − x⟩.
The supposition says φ(1) = φ(0) = 0, and we note that φ′(0) = 0. Since φ is
convex, φ(t) ≤ (1 − t)φ(0) + tφ(1) = 0 for t ∈ [0, 1], while the gradient
inequality gives φ(t) ≥ φ(0) + tφ′(0) = 0. So φ(t) = φ(0) for all t ∈ [0, 1],
which contradicts φ being strictly convex.
Since convexity already gives f(y) ≥ f(x) + ⟨∇f(x), y − x⟩, it follows that
f(y) > f(x) + ⟨∇f(x), y − x⟩ for all x ≠ y.
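A last quick check (my own sketch): f(x) = x⁴ is strictly convex but not strongly convex, and the strict gradient inequality of the lemma holds at distinct points.

```python
import numpy as np

f = lambda x: x ** 4        # strictly convex on R, but not strongly convex
df = lambda x: 4 * x ** 3   # derivative

rng = np.random.default_rng(2)
for _ in range(5):
    x, y = rng.uniform(-2.0, 2.0, size=2)
    if x != y:
        assert f(y) > f(x) + df(x) * (y - x)   # strict inequality of the lemma
print("f(y) > f(x) + f'(x)(y - x) verified at random distinct points")
```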
S. J. Wright and B. Recht.
Optimization for Data Analysis.
in proof, 2020.