L. Vandenberghe ECE236C (Spring 2020)
2. Subgradients
• definition
• subgradient calculus
• duality and optimality conditions
• directional derivative
2.1
Basic inequality
recall the basic inequality for differentiable convex functions:
f (y) ≥ f (x) + ∇ f (x)T (y − x) for all y ∈ dom f
[figure: graph of f with the first-order approximation touching at (x, f (x)); the supporting hyperplane has normal direction (∇ f (x), −1)]
• the first-order approximation of f at x is a global lower bound
• ∇ f (x) defines non-vertical supporting hyperplane to epigraph of f at (x, f (x)):
(∇ f (x), −1)T ((y, t) − (x, f (x))) ≤ 0 for all (y, t) ∈ epi f
Subgradients 2.2
Subgradient
g is a subgradient of a convex function f at x ∈ dom f if
f (y) ≥ f (x) + gT (y − x) for all y ∈ dom f
[figure: graph of f (y) with affine lower bounds f (x1) + g1T (y − x1) and f (x1) + g2T (y − x1) touching at x1, and f (x2) + g3T (y − x2) touching at x2]
g1, g2 are subgradients at x1; g3 is a subgradient at x2
Subgradients 2.3
Subdifferential
the subdifferential ∂ f (x) of f at x is the set of all subgradients:
∂ f (x) = {g | gT (y − x) ≤ f (y) − f (x), ∀y ∈ dom f }
Properties
• ∂ f (x) is a closed convex set (possibly empty)
this follows from the definition: ∂ f (x) is an intersection of halfspaces
• if x ∈ int dom f then ∂ f (x) is nonempty and bounded
proof on next two pages
Subgradients 2.4
Proof: we show that ∂ f (x) is nonempty when x ∈ int dom f
• (x, f (x)) is in the boundary of the convex set epi f
• therefore there exists a supporting hyperplane to epi f at (x, f (x)):
∃(a, b) ≠ 0 such that (a, b)T ((y, t) − (x, f (x))) ≤ 0 for all (y, t) ∈ epi f
• b > 0 gives a contradiction as t → ∞
• b = 0 gives a contradiction for y = x + εa with small ε > 0
• therefore b < 0, and g = (1/|b|) a is a subgradient of f at x
Subgradients 2.5
Proof: ∂ f (x) is bounded when x ∈ int dom f
• for small r > 0, define a set of 2n points
B = {x ± r ek | k = 1, . . . , n} ⊂ dom f
and define M = max { f (y) | y ∈ B} < ∞
• for every g ∈ ∂ f (x), there is a point y ∈ B with
r ‖g‖∞ = gT (y − x)
(choose an index k with |gk| = ‖g‖∞, and take y = x + r sign(gk) ek)
• since g is a subgradient, this implies that
f (x) + r ‖g‖∞ = f (x) + gT (y − x) ≤ f (y) ≤ M
• we conclude that ∂ f (x) is bounded:
‖g‖∞ ≤ (M − f (x))/r for all g ∈ ∂ f (x)
Subgradients 2.6
Example
f (x) = max { f1(x), f2(x)} with f1, f2 convex and differentiable
[figure: graph of f (y) = max { f1(y), f2(y)} together with the graphs of f1(y) and f2(y)]
• if f1( x̂) = f2( x̂), subdifferential at x̂ is line segment [∇ f1( x̂), ∇ f2( x̂)]
• if f1( x̂) > f2( x̂), subdifferential at x̂ is {∇ f1( x̂)}
• if f1( x̂) < f2( x̂), subdifferential at x̂ is {∇ f2( x̂)}
Subgradients 2.7
Examples
Absolute value f (x) = |x|
∂ f (x) = {−1} if x < 0, ∂ f (x) = [−1, 1] if x = 0, ∂ f (x) = {1} if x > 0
[figure: graphs of f (x) and of the set-valued map ∂ f (x)]
Euclidean norm f (x) = ‖x‖2
∂ f (x) = {x/‖x‖2} if x ≠ 0, ∂ f (x) = {g | ‖g‖2 ≤ 1} if x = 0
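A small numerical check (not from the slides; random data assumed): the formula above gives one subgradient of ‖x‖2, and the subgradient inequality can be verified at random test points.

    # one subgradient of f(x) = ||x||_2: x/||x||_2 away from 0, and 0 at x = 0
    import numpy as np

    def subgrad_l2norm(x):
        nrm = np.linalg.norm(x)
        return x / nrm if nrm > 0 else np.zeros_like(x)   # any g with ||g||_2 <= 1 works at 0

    rng = np.random.default_rng(0)
    x = rng.standard_normal(4)
    g = subgrad_l2norm(x)
    for _ in range(5):                                    # f(y) >= f(x) + g^T (y - x)
        y = rng.standard_normal(4)
        assert np.linalg.norm(y) >= np.linalg.norm(x) + g @ (y - x) - 1e-12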
Subgradients 2.8
Monotonicity
the subdifferential of a convex function is a monotone operator:
(u − v)T (x − y) ≥ 0 for all x , y , u ∈ ∂ f (x), v ∈ ∂ f (y)
Proof: by definition
f (y) ≥ f (x) + uT (y − x), f (x) ≥ f (y) + v T (x − y)
combining the two inequalities shows monotonicity
Subgradients 2.9
Examples of non-subdifferentiable functions
the following functions are not subdifferentiable at x = 0
• f : R → R, dom f = R+
f (x) = 1 if x = 0, f (x) = 0 if x > 0
• f : R → R, dom f = R+
f (x) = −√x
the only supporting hyperplane to epi f at (0, f (0)) is vertical
Subgradients 2.10
Subgradients and sublevel sets
if g is a subgradient of f at x , then
f (y) ≤ f (x) =⇒ gT (y − x) ≤ 0
[figure: the sublevel set {y | f (y) ≤ f (x)} with a supporting hyperplane at x defined by a subgradient]
the nonzero subgradients at x define supporting hyperplanes to the sublevel set
{y | f (y) ≤ f (x)}
Subgradients 2.11
Outline
• definition
• subgradient calculus
• duality and optimality conditions
• directional derivative
Subgradient calculus
Weak subgradient calculus: rules for finding one subgradient
• sufficient for most nondifferentiable convex optimization algorithms
• if you can evaluate f (x), you can usually compute a subgradient
Strong subgradient calculus: rules for finding ∂ f (x) (all subgradients)
• some algorithms, optimality conditions, etc., need entire subdifferential
• can be quite complicated
we will assume that x ∈ int dom f
Subgradients 2.12
Basic rules
Differentiable functions: ∂ f (x) = {∇ f (x)} if f is differentiable at x
Nonnegative linear combination
if f (x) = α1 f1(x) + α2 f2(x) with α1, α2 ≥ 0, then
∂ f (x) = α1 ∂ f1(x) + α2 ∂ f2(x)
(right-hand side is addition of sets)
Affine transformation of variables: if f (x) = h(Ax + b), then
∂ f (x) = AT ∂h(Ax + b)
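As a sketch of the affine rule (the choice h = ‖·‖2 and the random A, b are assumptions for illustration), a subgradient of f (x) = h(Ax + b) is AT u for any u ∈ ∂h(Ax + b).

    # subgradient of f(x) = ||A x + b||_2 via the affine transformation rule
    import numpy as np

    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((5, 3)), rng.standard_normal(5)

    def subgrad(x):
        r = A @ x + b
        u = r / np.linalg.norm(r) if np.linalg.norm(r) > 0 else np.zeros_like(r)
        return A.T @ u                                    # A^T times a subgradient of ||.||_2

    f = lambda z: np.linalg.norm(A @ z + b)
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    g = subgrad(x)
    print(f(y) >= f(x) + g @ (y - x) - 1e-12)             # subgradient inequality holds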
Subgradients 2.13
Pointwise maximum
f (x) = max { f1(x), . . . , fm (x)}
define I(x) = {i | fi (x) = f (x)}, the ‘active’ functions at x
Weak result
to compute a subgradient at x , choose any k ∈ I(x), any subgradient of fk at x
Strong result
∂ f (x) = conv ( ∪_{i ∈ I(x)} ∂ fi (x) )
• the convex hull of the union of subdifferentials of ‘active’ functions at x
• if fi ’s are differentiable, ∂ f (x) = conv {∇ fi (x) | i ∈ I(x)}
Subgradients 2.14
Example: piecewise-linear function
f (x) = max_{i=1,...,m} (aiT x + bi )
[figure: graph of a piecewise-linear f (x) with the affine pieces aiT x + bi ]
the subdifferential at x is a polyhedron
∂ f (x) = conv {ai | i ∈ I(x)}
with I(x) = {i | aiT x + bi = f (x)}
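A minimal sketch of the weak rule for this example (random ai, bi assumed): evaluate all affine pieces, pick an active index, and return the corresponding ai.

    # one subgradient of f(x) = max_i (a_i^T x + b_i)
    import numpy as np

    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((6, 3)), rng.standard_normal(6)   # rows of A are a_i^T

    def f_and_subgrad(x):
        vals = A @ x + b
        k = int(np.argmax(vals))          # an index in I(x)
        return vals[k], A[k]              # f(x) and the subgradient a_k

    fx, g = f_and_subgrad(rng.standard_normal(3))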
Subgradients 2.15
Example: ℓ1-norm
f (x) = ‖x‖1 = max_{s ∈ {−1,1}^n} sT x
the subdifferential is a product of intervals ∂ f (x) = J1 × · · · × Jn, with
Jk = [−1, 1] if xk = 0, Jk = {1} if xk > 0, Jk = {−1} if xk < 0
[figure: ∂ f at the three points below, drawn as a square, a vertical segment, and a single point]
∂ f (0, 0) = [−1, 1] × [−1, 1],  ∂ f (1, 0) = {1} × [−1, 1],  ∂ f (1, 1) = {(1, 1)}
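In code, np.sign picks one element of this product of intervals (it selects 0 ∈ [−1, 1] at zero coordinates); a quick sketch:

    # one subgradient of f(x) = ||x||_1
    import numpy as np

    def subgrad_l1(x):
        return np.sign(x)                 # any value in [-1, 1] is valid where x_k = 0

    print(subgrad_l1(np.array([1.0, 0.0, -2.0])))   # [ 1.  0. -1.]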
Subgradients 2.16
Pointwise supremum
f (x) = sup_{α ∈ A} fα (x), with fα (x) convex in x for every α
Weak result: to find a subgradient at x̂ ,
• find any β for which f ( x̂) = f β ( x̂) (assuming maximum is attained)
• choose any g ∈ ∂ f β ( x̂)
(Partial) strong result: define I(x) = {α ∈ A | fα (x) = f (x)}
conv ( ∪_{α ∈ I(x)} ∂ fα (x) ) ⊆ ∂ f (x)
equality requires extra conditions (for example, A compact, fα continuous in α)
Subgradients 2.17
Exercise: maximum eigenvalue
Problem: explain how to find a subgradient of
f (x) = λmax(A(x)) = sup_{‖y‖2=1} yT A(x)y
where A(x) = A0 + x1 A1 + · · · + xn An with symmetric coefficients Ai
Solution: to find a subgradient at x̂ ,
• choose any unit eigenvector y with eigenvalue λmax(A( x̂))
• the gradient of yT A(x)y at x̂ is a subgradient of f :
(yT A1 y, . . . , yT An y) ∈ ∂ f ( x̂)
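A numerical sketch of this exercise (random symmetric matrices assumed), using an eigenvector for the largest eigenvalue:

    # subgradient of f(x) = lambda_max(A0 + x1 A1 + ... + xn An)
    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 3, 4
    mats = [rng.standard_normal((p, p)) for _ in range(n + 1)]
    A = [(M + M.T) / 2 for M in mats]                 # symmetric A0, ..., An

    A_of = lambda x: A[0] + sum(xk * Ak for xk, Ak in zip(x, A[1:]))
    f = lambda x: np.linalg.eigh(A_of(x))[0][-1]      # largest eigenvalue

    x_hat = rng.standard_normal(n)
    y = np.linalg.eigh(A_of(x_hat))[1][:, -1]         # unit eigenvector for lambda_max
    g = np.array([y @ Ak @ y for Ak in A[1:]])        # element of the subdifferential

    z = rng.standard_normal(n)
    print(f(z) >= f(x_hat) + g @ (z - x_hat) - 1e-9)  # subgradient inequality holds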
Subgradients 2.18
Minimization
f (x) = inf_y h(x, y), with h jointly convex in (x, y)
Weak result: to find a subgradient at x̂ ,
• find ŷ that minimizes h( x̂, y) (assuming minimum is attained)
• find subgradient (g, 0) ∈ ∂h( x̂, ŷ)
Proof: for all x , y ,
h(x, y) ≥ h( x̂, ŷ) + gT (x − x̂) + 0T (y − ŷ)
= f ( x̂) + gT (x − x̂)
therefore
f (x) = inf_y h(x, y) ≥ f (x̂) + gT (x − x̂)
Subgradients 2.19
Exercise: Euclidean distance to convex set
Problem: explain how to find a subgradient of
f (x) = inf_{y ∈ C} ‖x − y‖2
where C is a closed convex set
Solution: to find a subgradient at x̂ ,
• if f ( x̂) = 0 (that is, x̂ ∈ C), take g = 0
• if f ( x̂) > 0, find projection ŷ = P( x̂) on C and take
g = (1/‖ŷ − x̂‖2) (x̂ − ŷ) = (1/‖x̂ − P(x̂)‖2) (x̂ − P(x̂))
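A sketch of this solution with C taken to be the unit Euclidean ball (an assumption made here so that the projection has the closed form P(x) = x/max{1, ‖x‖2}):

    # subgradient of the distance to the unit Euclidean ball
    import numpy as np

    proj = lambda x: x / max(1.0, np.linalg.norm(x))  # projection onto C

    def dist_subgrad(x):
        p = proj(x)
        if np.allclose(x, p):                         # x in C: f(x) = 0, take g = 0
            return np.zeros_like(x)
        return (x - p) / np.linalg.norm(x - p)

    print(dist_subgrad(np.array([3.0, 4.0])))         # [0.6 0.8]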
Subgradients 2.20
Composition
f (x) = h( f1(x), . . . , fk (x)), h convex and nondecreasing, fi convex
Weak result: to find a subgradient at x̂ ,
• find z ∈ ∂h( f1( x̂), . . . , fk ( x̂)) and gi ∈ ∂ fi ( x̂)
• then g = z1 g1 + · · · + z k gk ∈ ∂ f ( x̂)
reduces to standard formula for differentiable h, fi
Proof:
f (x) ≥ h ( f1(x̂) + g1T (x − x̂), . . . , fk (x̂) + gkT (x − x̂) )
≥ h ( f1(x̂), . . . , fk (x̂)) + zT ( g1T (x − x̂), . . . , gkT (x − x̂) )
= h ( f1(x̂), . . . , fk (x̂)) + (z1 g1 + · · · + zk gk )T (x − x̂)
= f (x̂) + gT (x − x̂)
Subgradients 2.21
Optimal value function
define f (u, v) as the optimal value of convex problem
minimize f0(x)
subject to fi (x) ≤ ui, i = 1, . . . , m
Ax = b + v
(functions fi are convex; optimization variable is x )
Weak result: suppose f (û, v̂) is finite and strong duality holds with the dual
maximize  inf_x ( f0(x) + Σ_i λi ( fi (x) − ûi ) + νT (Ax − b − v̂) )
subject to  λ ⪰ 0
if λ̂, ν̂ are optimal dual variables (for right-hand sides û, v̂ ) then (−λ̂, −ν̂) ∈ ∂ f (û, v̂)
Subgradients 2.22
Proof: by weak duality for problem with right-hand sides u, v
f (u, v) ≥ inf_x ( f0(x) + Σ_i λ̂i ( fi (x) − ui ) + ν̂T (Ax − b − v) )
= inf_x ( f0(x) + Σ_i λ̂i ( fi (x) − ûi ) + ν̂T (Ax − b − v̂) ) − λ̂T (u − û) − ν̂T (v − v̂)
= f (û, v̂) − λ̂T (u − û) − ν̂T (v − v̂)
Subgradients 2.23
Expectation
f (x) = E h(x, u), with u random and h convex in x for every u
Weak result: to find a subgradient at x̂ ,
• choose a function u ↦ g(u) with g(u) ∈ ∂x h(x̂, u)
• then, g = E g(u) ∈ ∂ f (x̂)
Proof: by convexity of h and definition of g(u),
f (x) = E h(x, u)
≥ E ( h(x̂, u) + g(u)T (x − x̂) )
= f ( x̂) + gT (x − x̂)
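A Monte Carlo sketch (the distribution of u and the choice h(x, u) = |x − u| are assumptions for illustration): averaging pointwise subgradients over samples estimates a subgradient of the expectation.

    # f(x) = E |x - u| with u uniform on [0, 1]; the exact derivative on (0, 1) is 2x - 1
    import numpy as np

    rng = np.random.default_rng(0)
    u = rng.uniform(0.0, 1.0, size=100000)

    avg_subgrad = lambda x: np.mean(np.sign(x - u))   # estimate of E sign(x - u)
    print(avg_subgrad(0.3), 2 * 0.3 - 1)              # both approximately -0.4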
Subgradients 2.24
Outline
• definition
• subgradient calculus
• duality and optimality conditions
• directional derivative
Optimality conditions — unconstrained
x⋆ minimizes f (x) if and only if
0 ∈ ∂ f (x⋆)
[figure: a nondifferentiable convex function with a horizontal supporting line at the minimizer x⋆]
this follows directly from the definition of subgradient:
f (y) ≥ f (x⋆) + 0T (y − x⋆) for all y ⇐⇒ 0 ∈ ∂ f (x⋆)
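A small worked check of this condition (the function f (x) = |x| + (x − 0.4)²/2 is an assumption chosen for illustration): x⋆ = 0 is optimal because 0 ∈ ∂ f (0) = [−1, 1] + (0 − 0.4) = [−1.4, 0.6], and a grid search agrees.

    # verify that x* = 0 minimizes f(x) = |x| + (x - 0.4)^2 / 2
    import numpy as np

    f = lambda x: np.abs(x) + 0.5 * (x - 0.4) ** 2
    xs = np.linspace(-2.0, 2.0, 400001)
    print(xs[np.argmin(f(xs))])                       # approximately 0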
Subgradients 2.25
Example: piecewise-linear minimization
f (x) = max_{i=1,...,m} (aiT x + bi )
Optimality condition
0 ∈ conv {ai | i ∈ I(x⋆)} where I(x) = {i | aiT x + bi = f (x)}
• in other words, x⋆ is optimal if and only if there is a λ with
λ ⪰ 0,  1T λ = 1,  Σ_{i=1}^m λi ai = 0,  λi = 0 for i ∉ I(x⋆)
• these are the optimality conditions for the equivalent linear program
primal:  minimize t, subject to Ax + b ⪯ t1
dual:  maximize bT λ, subject to AT λ = 0, λ ⪰ 0, 1T λ = 1
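A numerical sketch of these conditions (random data and the use of scipy.optimize.linprog and nnls are assumptions here): solve the epigraph LP, find the active pieces at the solution, and check that 0 lies in the convex hull of the active ai by a nonnegative least-squares fit.

    import numpy as np
    from scipy.optimize import linprog, nnls

    rng = np.random.default_rng(0)
    m, n = 20, 3
    A, b = rng.standard_normal((m, n)), rng.standard_normal(m)   # rows of A are a_i^T

    # variables z = (x, t): minimize t subject to A x + b <= t 1
    c = np.r_[np.zeros(n), 1.0]
    res = linprog(c, A_ub=np.hstack([A, -np.ones((m, 1))]), b_ub=-b,
                  bounds=[(None, None)] * (n + 1))
    x_star, f_star = res.x[:n], res.x[n]

    active = np.flatnonzero(A @ x_star + b > f_star - 1e-7)      # I(x*)

    # 0 in conv{a_i : i in I(x*)}: find lam >= 0 with sum lam = 1, sum lam_i a_i = 0
    M = np.vstack([A[active].T, np.ones(len(active))])
    lam, resid = nnls(M, np.r_[np.zeros(n), 1.0])
    print(resid)                                                 # ~0 confirms optimality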
Subgradients 2.26
Optimality conditions — constrained
minimize f0(x)
subject to fi (x) ≤ 0, i = 1, . . . , m
assume dom fi = Rn, so functions fi are subdifferentiable everywhere
Karush–Kuhn–Tucker conditions
if strong duality holds, then x⋆, λ⋆ are primal, dual optimal if and only if
1. x⋆ is primal feasible
2. λ⋆ ⪰ 0
3. λi⋆ fi (x⋆) = 0 for i = 1, . . . , m
4. x⋆ is a minimizer of L(x, λ⋆) = f0(x) + Σ_{i=1}^m λi⋆ fi (x):
0 ∈ ∂ f0(x⋆) + Σ_{i=1}^m λi⋆ ∂ fi (x⋆)
Subgradients 2.27
Outline
• definition
• subgradient calculus
• duality and optimality conditions
• directional derivative
Directional derivative
Definition (for general f ): the directional derivative of f at x in the direction y is
f ′(x; y) = lim_{α↘0} ( f (x + αy) − f (x) ) / α
= lim_{t→∞} t ( f (x + y/t) − f (x) )
(if the limit exists)
• f ′(x; y) is the right derivative of g(α) = f (x + αy) at α = 0
• f ′(x; y) is homogeneous in y :
f ′(x; λy) = λ f ′(x; y) for λ ≥ 0
Subgradients 2.28
Directional derivative of a convex function
Equivalent definition (for convex f ): replace lim with inf
f ′(x; y) = inf_{α>0} ( f (x + αy) − f (x) ) / α
= inf_{t>0} t ( f (x + y/t) − f (x) )
Proof
• the function h(y) = f (x + y) − f (x) is convex in y , with h(0) = 0
• its perspective th(y/t) is nonincreasing in t (ECE236B ex. A2.5); hence
f ′(x; y) = lim_{t→∞} t h(y/t) = inf_{t>0} t h(y/t)
Subgradients 2.29
Properties
consequences of the expressions (for convex f )
f ′(x; y) = inf_{α>0} ( f (x + αy) − f (x) ) / α
= inf_{t>0} t ( f (x + y/t) − f (x) )
• f ′(x; y) is convex in y (partial minimization of a convex function of (y, t))
• f ′(x; y) defines a lower bound on f in the direction y :
f (x + αy) ≥ f (x) + α f ′(x; y) for all α ≥ 0
Subgradients 2.30
Directional derivative and subgradients
for convex f and x ∈ int dom f
f ′(x; y) = sup_{g ∈ ∂ f (x)} gT y
f ′(x; y) is the support function of ∂ f (x)
[figure: the set ∂ f (x), a direction y , and the maximizing subgradient ĝ with ĝT y = f ′(x; y)]
• generalizes f ′(x; y) = ∇ f (x)T y for differentiable functions
• implies that f ′(x; y) exists for all x ∈ int dom f , all y (see page 2.4)
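A quick numerical check of this identity for f = ‖·‖1 at x = (1, 0) (this specific instance is chosen for illustration): there ∂ f (x) = {1} × [−1, 1], so the support function is y1 + |y2|, which matches the difference quotient.

    # directional derivative of ||.||_1 at (1, 0) versus the support function of its subdifferential
    import numpy as np

    f = lambda z: np.abs(z).sum()
    x, y, a = np.array([1.0, 0.0]), np.array([-0.3, 0.7]), 1e-6
    print((f(x + a * y) - f(x)) / a, y[0] + abs(y[1]))   # both approximately 0.4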
Subgradients 2.31
Proof: if g ∈ ∂ f (x) then from page 2.29
f ′(x; y) ≥ inf_{α>0} ( f (x) + αgT y − f (x) ) / α = gT y
it remains to show that f ′(x; y) = ĝT y for at least one ĝ ∈ ∂ f (x)
• f ′(x; y) is convex in y with domain Rn, hence subdifferentiable at all y
• let ĝ be a subgradient of f ′(x; ·) at y : then for all v , λ ≥ 0,
λ f ′(x; v) = f ′(x; λv) ≥ f ′(x; y) + ĝT (λv − y)
• taking λ → ∞ shows that f ′(x; v) ≥ ĝT v ; from the lower bound on page 2.30,
f (x + v) ≥ f (x) + f ′(x; v) ≥ f (x) + ĝT v for all v
hence ĝ ∈ ∂ f (x)
• taking λ = 0 we see that f ′(x; y) ≤ ĝT y
Subgradients 2.32
Descent directions and subgradients
y is a descent direction of f at x if f ′(x; y) < 0
• the negative gradient of a differentiable f is a descent direction (if ∇ f (x) ≠ 0)
• negative subgradient is not always a descent direction
Example: f (x1, x2) = |x1 | + 2|x2 |
[figure: the point (1, 0) and the subgradient g = (1, 2), drawn in the (x1, x2)-plane]
g = (1, 2) ∈ ∂ f (1, 0), but y = (−1, −2) is not a descent direction at (1, 0)
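A one-line check of this example: moving from (1, 0) along y = −g increases f for every positive step size, since f ′((1, 0); y) = −1 + 2·2 = 3 > 0.

    # f(x1, x2) = |x1| + 2|x2| increases along y = (-1, -2) from x = (1, 0)
    import numpy as np

    f = lambda z: abs(z[0]) + 2 * abs(z[1])
    x, y = np.array([1.0, 0.0]), np.array([-1.0, -2.0])
    for t in [1e-3, 1e-2, 1e-1]:
        print(t, f(x + t * y) - f(x))                    # equals 3t > 0 for small t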
Subgradients 2.33
Steepest descent direction
Definition: (normalized) steepest descent direction at x ∈ int dom f is
∆xnsd = argmin_{‖y‖2 ≤ 1} f ′(x; y)
∆xnsd is the primal solution y of the pair of dual problems (BV §8.1.3)
primal:  minimize (over y ) f ′(x; y), subject to ‖y‖2 ≤ 1
dual:  maximize (over g ) −‖g‖2, subject to g ∈ ∂ f (x)
• dual optimal g⋆ is the subgradient with least norm
• f ′(x; ∆xnsd) = −‖g⋆‖2
• if 0 ∉ ∂ f (x), ∆xnsd = −g⋆/‖g⋆‖2
• ∆xnsd can be expensive to compute
[figure: the set ∂ f (x), its least-norm element g⋆, and the direction ∆xnsd with g⋆T ∆xnsd = f ′(x; ∆xnsd)]
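Continuing the example on the previous page (a sketch; the box form of the subdifferential is specific to that example): at x = (1, 0), ∂ f (x) = {1} × [−2, 2] for f (x1, x2) = |x1| + 2|x2|, so the least-norm subgradient is the projection of 0 onto that box.

    # least-norm subgradient and steepest descent direction at (1, 0)
    import numpy as np

    lo, hi = np.array([1.0, -2.0]), np.array([1.0, 2.0])  # interval ends of the box
    g_star = np.clip(0.0, lo, hi)                          # projection of 0 onto the box
    dx_nsd = -g_star / np.linalg.norm(g_star)
    print(g_star, dx_nsd)                                  # g* = (1, 0), dx_nsd = (-1, 0)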
Subgradients 2.34
Subgradients and distance to sublevel sets
if f is convex, f (y) < f (x), g ∈ ∂ f (x), then for small t > 0,
‖x − tg − y‖2² = ‖x − y‖2² − 2t gT (x − y) + t² ‖g‖2²
≤ ‖x − y‖2² − 2t ( f (x) − f (y)) + t² ‖g‖2²
< ‖x − y‖2²
• −g is descent direction for ‖x − y‖2, for any y with f (y) < f (x)
• in particular, −g is descent direction for distance to any minimizer of f
Subgradients 2.35
References
• A. Beck, First-Order Methods in Optimization (2017), chapter 3.
• D. P. Bertsekas, A. Nedić, A. E. Ozdaglar, Convex Analysis and Optimization
(2003), chapter 4.
• J.-B. Hiriart-Urruty, C. Lemaréchal, Convex Analysis and Minimization
Algorithms (1993), chapter VI.
• Yu. Nesterov, Lectures on Convex Optimization (2018), section 3.1.
• B. T. Polyak, Introduction to Optimization (1987), section 5.1.
Subgradients 2.36