Indian Institute of Technology Madras
Optimization
Nirav Bhatt
Email: niravbhatt@iitm.ac.in
Office: BT 307 Block II
Biotechnology Department
September 27, 2024
September 27, 2024 1/44
Content
I Unconstrained vs Constrained optimization
I Types of optimization problems:
I Linear programming (LP)
I Quadratic programming (QP)
I Nonlinear programming (NLP)
I Dynamic optimization as an NLP
I Overview of Numerical solution approaches
I Book: Nocedal J, Wright SJ. Numerical Optimization. New York:
Springer; 1999. Reference chapters: 1, 2, and 12
I Reference Book: Chong EKP, Zak SH. An Introduction to Optimization.
September 27, 2024 2/44
Learning outcomes
Introduction to optimization and different forms of optimization
I The students are expected to learn
I Different types of optimization problems and their applications
I KKT conditions for finding an optimal solution
I Overview of line search and trust-region methods for numerical
optimization
September 27, 2024 3/44
Optimization
Elements
I Mathematical Optimization (or Mathematical Programming):
  Select a best option or a set of options from the available set
I Consider an optimization problem with decision variables
  x = [x1, x2, ..., xn]^T :
      max_x f(x)
      s.t. Ax ≤ b
           1^T x ≤ 1
  or
      max_{x1,x2,...,xn} f(x1, x2, ..., xn)
I f(x) is called the cost function, objective function, or loss function
I Ax ≤ b and ∑_{i=1}^{n} xi ≤ 1 are constraints
September 27, 2024 4/44
Optimization
Some Problems in ML/DL
I Regression Analysis: Given a dataset (y, X), fit a function of the form
  f(X, θ) : R^n × R^p → R
      min_θ ‖y − f(X, θ)‖_2^2
I Classification Problems: Given (y, X), fit a function of the form
  f(X, θ) : R^n × R^p → {0, 1}
      min_θ − (1/m) ∑_{i=1}^{m} [ yi log(pi(x, θ)) + (1 − yi) log(1 − pi(x, θ)) ]
  where pi(·) is the probability of class 1.
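As an illustration, a minimal Python sketch of the classification objective above, assuming a linear model pi(x, θ) = σ(xi^T θ); the data, learning rate, and iteration count are made up.

```python
import numpy as np

# Minimal sketch: minimize the average cross-entropy loss by gradient descent,
# assuming p_i(x, theta) = sigmoid(x_i^T theta). All values are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 samples, 3 features
theta_true = np.array([1.0, -2.0, 0.5])
y = (X @ theta_true + rng.normal(size=100) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(3)
eta = 0.5                                      # learning rate
for _ in range(1000):
    p = sigmoid(X @ theta)                     # p_i(x, theta)
    grad = X.T @ (p - y) / len(y)              # gradient of the average cross-entropy
    theta -= eta * grad
print(theta)                                   # roughly aligned with theta_true
```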
September 27, 2024 5/44
Optimization
Constraints
I Types of constraints
I Linear constraints
      Ax ≤ b        (inequality)
      Aeq x = beq   (equality)
I Nonlinear constraints
      g(x) ≤ 0      (inequality)
      h(x) = 0      (equality)
I Bounds: xlb ≤ x ≤ xub
I Set (e.g., integer) constraints: x ∈ S, for example S = {−1, 0, 1}
September 27, 2024 6/44
Optimization
Types
I Types based on constraints: (i) unconstrained optimization and
  (ii) constrained optimization; (i) static optimization and (ii)
  dynamic optimization
I Types based on the smoothness of the functions or the variable set:
  (i) continuous optimization and (ii) integer optimization
I Types based on the form of the objective function and constraints
I Linear Programming (LP)
I Quadratic Programming (QP)
I Nonlinear Programming (NLP) or Nonlinear Optimization
I Mixed Integer LP or NLP
September 27, 2024 7/44
Optimization
I Unconstrained Optimization: x are the decision variables
      max_x  or  min_x  f(x)                               (1)
I Constrained Optimization
      max_x f(x)
      s.t. gi(x) ≤ bi ,  i = 1, . . . , p                  (2)
           aj^T x = cj ,  j = 1, . . . , q
           x^LB ≤ x ≤ x^UB
I x* denotes an optimal solution
September 27, 2024 8/44
Motivation
Dynamic Optimization
I The current profit is a function of the current production x(t)
  and the rate of change of production ẋ(t)
I The continuous-time problem can be defined as:
      max_x J[x] = ∫_{t=0}^{T} f(t, x(t), ẋ(t)) dt          (3)
      s.t. x(t) ≥ 0,  x(0) = x0
I Objective: Find the function x(t) that maximizes the functional J[x]
I x*(t): Optimal trajectory
September 27, 2024 9/44
Optimization
I Linear Programming (LP)
      min_x c^T x
      s.t. Ax ≤ b,                                         (4)
           Cx = d,
           x^LB ≤ x ≤ x^UB
I Quadratic Programming (QP)
      min_x x^T H x + c^T x
      s.t. Ax ≤ b,                                         (5)
           Cx = d,
           x^LB ≤ x ≤ x^UB
  H: symmetric matrix
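As an illustration, a minimal sketch of solving a small LP of the form (4) with SciPy; the data c, A, b, C, d and the bounds below are made up.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative data for an LP of the form (4); all values are made up.
c = np.array([1.0, 2.0])                  # objective: min c^T x
A = np.array([[1.0, 1.0], [-1.0, 2.0]])   # inequality constraints A x <= b
b = np.array([4.0, 2.0])
C = np.array([[1.0, -1.0]])               # equality constraint C x = d
d = np.array([0.5])
bounds = [(0.0, 3.0), (0.0, 3.0)]         # x^LB <= x <= x^UB

res = linprog(c, A_ub=A, b_ub=b, A_eq=C, b_eq=d, bounds=bounds)
print(res.x, res.fun)                     # optimal point and objective value
```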
September 27, 2024 10/44
Optimization
I Nonlinear Programming (NLP)
      min_x f(x)
      s.t. gi(x) ≤ 0,  i = 1, . . . , M,                   (6)
           hj(x) = 0,  j = 1, . . . , N,
           x^LB ≤ x ≤ x^UB
I Integer Quadratic Programming
      min_x x^T H x + c^T x
      s.t. Ax ≤ b,                                         (7)
           Cx = d,
           some xi are integers
September 27, 2024 11/44
Motivation
Static Optimization
I Static Optimization: the solution is an optimal number or a finite set
  of numbers
I Static Optimization:
      max_x f(x)                                           (8)
I Assumption: f(x) is continuously differentiable
I First-order necessary condition: ∂f/∂x (x*) = 0
I Second-order necessary condition: ∂²f/∂x² (x*) ≤ 0
I Example: The operating point x* that maximizes the profit f(x),
  where x: # of units
September 27, 2024 12/44
Motivation
Static Optimization
I Example
      max_x 1000000 + 4000x − x²                           (9)
I ∂f/∂x (x*) = 0 ⇒ 4000 − 2x* = 0 ⇒ x* = 2000
I ∂²f/∂x² = −2 < 0
September 27, 2024 13/44
Optimization
Static Optimization
I Static Optimization with several variables
      max_{x1,...,xn} f(x1, . . . , xn)                    (10)
I Example: A plant can produce n items. Find the operating point
  (x1, . . . , xn) that maximizes the profit f(x1, . . . , xn)
September 27, 2024 14/44
Motivation
Static Optimization: Detour to Some Problems in Linear Algebra
I Recall: Ax = b has no exact solution when b is not in the column
  space of A
I Project b onto the subspace spanned by the columns of A
I Error e = Ax − b; find the closest point to b in the column space of A
I Optimization Problem
      min_x ‖Ax − b‖_p ,   p = 1, 2, 3, . . . , ∞
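A small numerical illustration of the p = 2 case (least squares); the matrix A and vector b below are made up.

```python
import numpy as np

# Project b onto the column space of A by solving min_x ||Ax - b||_2.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)  # least-squares solution
p = A @ x_star                                  # projection of b onto col(A)
e = p - b                                       # error vector

print(x_star, p)
print(A.T @ e)   # ~0: the error is orthogonal to the columns of A
```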
September 27, 2024 15/44
Motivation
Static Optimization
I Example
      max_x 1000000 + 300x1 + 500x2 − x1² − x2²            (11)
I ∂f/∂x1 (x*) = 0 ⇒ 300 − 2x1 = 0 ⇒ x1* = 150
I ∂f/∂x2 (x*) = 0 ⇒ 500 − 2x2 = 0 ⇒ x2* = 250
I ∂²f/∂x1² = ∂²f/∂x2² = −2 < 0
I Are these conditions correct?
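As a quick check, the stationary point and Hessian of the example above can be computed symbolically; a minimal SymPy sketch:

```python
import sympy as sp

# Find the stationary point of f(x1, x2) and inspect the Hessian;
# for a maximum the Hessian should be negative definite.
x1, x2 = sp.symbols('x1 x2', real=True)
f = 1000000 + 300*x1 + 500*x2 - x1**2 - x2**2

grad = [sp.diff(f, v) for v in (x1, x2)]
stationary = sp.solve(grad, [x1, x2])        # {x1: 150, x2: 250}
H = sp.hessian(f, (x1, x2))                  # [[-2, 0], [0, -2]]
print(stationary, H.eigenvals())             # both eigenvalues are -2 < 0 -> maximum
```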
September 27, 2024 16/44
Static Optimization
Local vs Global Solutions
I Local Minimizer
  A point x* is a local minimizer if there is a neighborhood N of
  x* such that f(x*) ≤ f(x) for all x ∈ N.
  A point x* is a strict local minimizer (also called a strong local
  minimizer) if there exists a neighborhood N of x* such that
  f(x*) < f(x) for all x ∈ N with x ≠ x*.
I Global Minimizer
  A point x* is a global minimizer if f(x*) ≤ f(x) for all x.
I x* is a stationary point if ∇f(x*) = 0.
September 27, 2024 17/44
Static Optimization
Unconstrained Optimization
I First-order Necessary Conditions
  If x* is a local minimizer and f is continuously differentiable in
  an open neighborhood of x*, then ∇f(x*) = 0.
I Second-order Necessary Conditions
  If x* is a local minimizer of f and ∇²f exists and is continuous
  in an open neighborhood of x*, then ∇f(x*) = 0 and ∇²f(x*) is
  positive semidefinite.
I Second-order Sufficient Conditions
  Suppose that ∇²f is continuous in an open neighborhood of x*,
  ∇f(x*) = 0, and ∇²f(x*) is positive definite. Then x* is a strict
  local minimizer of f.
September 27, 2024 18/44
Constrained Optimization
Equality Constraints
I Consider the following optimization problem
      min f(x)
      s.t. hi(x) = 0,  i = 1, 2, . . . , m                 (12)
I Converting a constrained problem into an unconstrained one
I Lagrange function with Lagrange multipliers λ1, . . . , λm:
      L(x, λ) = f(x) + ∑_{i=1}^{m} λi hi(x)
September 27, 2024 19/44
Constrained Optimization
I First-order necessary conditions
      ∇f(x*) + λ1* ∇h1(x*) + . . . + λm* ∇hm(x*) = 0
      hi(x*) = 0,  i = 1, 2, . . . , m
I Second-order necessary condition
      w^T Lxx(x*, λ*) w ≥ 0,  ∀ w such that ∇hi(x*)·w = 0, i = 1, 2, . . . , m
  where
      Lxx(x*, λ*) = ∇²f(x*) + ∑_{i=1}^{m} λi* ∇²hi(x*)
I Second-order sufficient condition: Lxx(x*, λ*) is positive definite on
  the tangent space defined by ∇hi(x*)·w = 0, i = 1, 2, . . . , m
September 27, 2024 20/44
Example: Constrained Optimization
I Problem:
      min x1 + x2
      s.t. x1² + x2² − 2 = 0
I ∇f = [1, 1]^T ,  ∇h = [2x1, 2x2]^T
[Figure: the constraint circle x1² + x2² = 2 in the (x1, x2) plane, with ∇f and ∇h
drawn at the candidate points (1, 1) and (−1, −1), where the two gradients are parallel]
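As an illustration, a minimal SymPy sketch that solves the first-order (Lagrange) conditions for this example:

```python
import sympy as sp

# Solve the Lagrange conditions for  min x1 + x2  s.t.  x1^2 + x2^2 - 2 = 0.
x1, x2, lam = sp.symbols('x1 x2 lambda', real=True)
f = x1 + x2
h = x1**2 + x2**2 - 2
L = f + lam * h

eqs = [sp.diff(L, x1), sp.diff(L, x2), h]
for s in sp.solve(eqs, [x1, x2, lam], dict=True):
    print(s, 'f =', f.subs(s))   # candidates (1, 1) and (-1, -1); (-1, -1) is the minimizer
```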
September 27, 2024 21/44
Constrained Optimization: Example
KKT Conditions
I Constrained optimization problem:
      min f(x)
      s.t. hi(x) = 0,  i = 1, . . . , m
           gj(x) ≤ 0,  j = 1, . . . , n
  where f, hi, and gj are smooth, real-valued functions on a subset of R^n
I x is a feasible point if it satisfies the equality and inequality constraints
I A constraint j is said to be active at a point x if gj(x) = 0
I A(x): the set of all constraints active at the point x
I For a feasible point x with active set A(x), if the gradients ∇hi (all i)
  and ∇gj, ∀ j ∈ A(x), are linearly independent, then x satisfies the
  Linear Independence Constraint Qualification (LICQ)
September 27, 2024 22/44
Constrained Optimization
KKT Conditions
I Lagrangian for the problem with Lagrange multipliers λ and µ:
      L(x, λ, µ) = f(x) + ∑_{i=1}^{m} λi hi(x) + ∑_{j=1}^{n} µj gj(x)
I First-order necessary conditions (Karush-Kuhn-Tucker conditions):
  for a local solution x*,
      ∇x L(x*, λ*, µ*) = 0
      hi(x*) = 0,  i = 1, . . . , m
      gj(x*) ≤ 0,  j = 1, . . . , n
      µj* ≥ 0,  j = 1, . . . , n   (the λi* are unrestricted in sign)
      µj* gj(x*) = 0   (complementarity condition)
I Complementarity condition: ensures µj = 0, ∀ j ∉ A(x*)
September 27, 2024 23/44
Constrained Optimization
KKT Conditions
I Complementarity condition: ensures µj = 0, ∀ j ∉ A(x*)
I Second-order necessary condition
      w^T ∇²xx L(x*, λ*, µ*) w ≥ 0,  ∀ w ∈ T(x*)
      T(x*) = {w | ∇hi(x*)^T w = 0, ∀ i, and ∇gj(x*)^T w = 0, ∀ j ∈ A(x*)}
I Geometry of the multipliers (figure, panels (a)-(d)):
  (a) µj > 0, ∀ j ∈ A(x*): ∇f points inward (otherwise the problem would be
      a maximization) and the active ∇gj point outward (otherwise f could
      still be decreased), so that −∇f(x*) = ∑_j µj ∇gj(x*)
  (b) µ1, µ2 < 0;  (c) µ1 > 0, µ2 < 0;  (d) µ1, µ2 > 0
September 27, 2024 24/44
Constrained Optimization
KKT Conditions: Problem
I Optimization problem
      min x1² + 2x2²
      s.t. x1 + x2 ≥ 3
           x2 − x1² ≥ 1
I The Lagrangian function:
      l(x, µ1, µ2) = x1² + 2x2² − µ1(x1 + x2 − 3) − µ2(x2 − x1² − 1),   µ1, µ2 ≥ 0
I KKT Conditions
  1. 2x1 − µ1 + 2µ2 x1 = 0
  2. 4x2 − µ1 − µ2 = 0
  3. x1 + x2 ≥ 3
  4. x2 − x1² ≥ 1
  5. µ1(x1 + x2 − 3) = 0
  6. µ2(x2 − x1² − 1) = 0
  7. µ1, µ2 ≥ 0
September 27, 2024 25/44
Constrained Optimization
KKT Conditions: Problem
KKT Conditions
  1. 2x1 − µ1 + 2µ2 x1 = 0
  2. 4x2 − µ1 − µ2 = 0
  3. x1 + x2 ≥ 3
  4. x2 − x1² ≥ 1
  5. µ1(x1 + x2 − 3) = 0
  6. µ2(x2 − x1² − 1) = 0
  7. µ1, µ2 ≥ 0
I Case 1: µ1 = µ2 = 0 (no active constraint), then (x1*, x2*) = (0, 0)
I Case 2: µ1 > 0, µ2 = 0 (first constraint active), then (x1*, x2*) = (2, 1)
  and µ1 = 4 > 0; does not satisfy Condition 4
I Case 3: µ1 = 0, µ2 > 0 (second constraint active), then (x1*, x2*) = (0, 1)
  and µ2 = 4 > 0; does not satisfy Condition 3
I Case 4: µ1 > 0, µ2 > 0 (both constraints active), then (x1*, x2*) = (−2, 5)
  or (x1*, x2*) = (1, 2)
I (x1*, x2*) = (−2, 5) =⇒ µ1 + 4µ2 = −4
I (x1*, x2*) = (1, 2) =⇒ µ1 = 6 and µ2 = 2
September 27, 2024 26/44
Constrained Optimization
KKT Conditions: Problem
KKT Conditions
  1. 2x1 − µ1 + 2µ2 x1 = 0
  2. 4x2 − µ1 − µ2 = 0
  3. x1 + x2 ≥ 3
  4. x2 − x1² ≥ 1
  5. µ1(x1 + x2 − 3) = 0
  6. µ2(x2 − x1² − 1) = 0
  7. µ1, µ2 ≥ 0
I Case 1: µ1 = µ2 = 0 (no active constraint), then (x1*, x2*) = (0, 0)
  =⇒ violates 3 and 4
I Case 2: µ1 > 0, µ2 = 0 (first constraint active), then (x1*, x2*) = (2, 1)
  and µ1 = 4 > 0 =⇒ violates 4
I Case 3: µ1 = 0, µ2 > 0 (second constraint active), then (x1*, x2*) = (0, 1)
  and µ2 = 4 > 0 =⇒ violates 3
I Case 4: µ1 > 0, µ2 > 0 (both constraints active), then (x1*, x2*) = (−2, 5)
  or (x1*, x2*) = (1, 2)
I (x1*, x2*) = (−2, 5) =⇒ µ1 = 28, µ2 = −8 < 0: not possible, violates 7
I (x1*, x2*) = (1, 2) =⇒ µ1 = 6 and µ2 = 2: all KKT conditions satisfied
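As a numerical cross-check, the same problem can be handed to SciPy's SLSQP solver (which expects inequality constraints in the form g(x) ≥ 0); a minimal sketch:

```python
import numpy as np
from scipy.optimize import minimize

# min x1^2 + 2 x2^2  s.t.  x1 + x2 >= 3,  x2 - x1^2 >= 1
obj = lambda x: x[0]**2 + 2 * x[1]**2
cons = [{'type': 'ineq', 'fun': lambda x: x[0] + x[1] - 3},
        {'type': 'ineq', 'fun': lambda x: x[1] - x[0]**2 - 1}]

res = minimize(obj, x0=np.array([0.0, 0.0]), method='SLSQP', constraints=cons)
print(res.x)   # approximately (1, 2), matching the KKT analysis
```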
September 27, 2024 27/44
Example: Constrained Optimization
I Find the semi-major and semi-minor axes of the ellipse
defined by
      (x1 + x2)² + 2(x1 − x2)² − 8 = 0
I Hint: Calculate the farthest (nearest) point on the ellipse from the origin
I Problem formulation:
      min (or max)  x1² + x2²
      s.t. (x1 + x2)² + 2(x1 − x2)² − 8 = 0                (13)
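A minimal SymPy sketch solving (13) via the Lagrange conditions; the stationary points give both the nearest and the farthest points on the ellipse:

```python
import sympy as sp

# Stationary points of x1^2 + x2^2 on the ellipse give the semi-axes.
x1, x2, lam = sp.symbols('x1 x2 lambda', real=True)
f = x1**2 + x2**2
h = (x1 + x2)**2 + 2*(x1 - x2)**2 - 8

L = f + lam * h
eqs = [sp.diff(L, x1), sp.diff(L, x2), h]
for s in sp.solve(eqs, [x1, x2, lam], dict=True):
    print(s, 'distance^2 =', sp.simplify(f.subs(s)))
# distance^2 = 4 (semi-major axis 2) and 2 (semi-minor axis sqrt(2))
```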
September 27, 2024 28/44
Numerical Optimization
Different Types of Algorithms
I Optimization algorithms generate a sequence of iterates {xk}_{k=0}^{∞}
I Termination criteria:
  I No more progress can be made
  I A solution has been approximated with sufficient accuracy
I How to generate a new point xk+1 from xk?
  Use information about f at xk and/or at previous iterates
I xk+1 must be such that f(xk+1) < f(xk)
I Strategies to find a new xk+1:
  I Line search
  I Trust region
September 27, 2024 29/44
Numerical Optimization
Different Types of Algorithms
I Strategies
  I Line search
  I Trust region
I Line search strategy: choose a direction pk from xk so that
  f(xk+1) < f(xk)
  I xk+1 = xk + α pk ,  α > 0
  I Choose a direction pk and find a step length α
I Objective function:
      min_{α>0} f(xk + α pk)                               (14)
September 27, 2024 30/44
Numerical Optimization
Different Types of Algorithms
I Line Search: Objective function:
      min_{α>0} f(xk + α pk)                               (15)
I The best direction is pk = −∇f(xk) = −(∂f/∂x)|_{xk} , the steepest-descent
  direction
[Figures: the steepest-descent direction; a general descent direction in a line search]
September 27, 2024 31/44
Numerical Optimization
Different Types of Algorithms
I Line Search: Objective function:
      min_{α>0} f(xk + α pk)                               (16)
I The best direction: pk = −∇f(xk) = −(∂f/∂x)|_{xk}
I Newton's direction:
      pk = −(∇²f(xk))^{−1} ∇f(xk)
  When a quadratic approximation is sufficient to represent f(x),
  Newton's direction is reliable.
I Conjugate gradient directions:
      pk = −∇f(xk) + βk pk−1 ,   βk: a scalar
  Conjugate vectors: pi^T A pj = 0 for i ≠ j, where A is a symmetric
  positive definite matrix
September 27, 2024 32/44
Numerical Optimization
Line Search Strategy
I Why conjugate gradient directions?*
*: Honkela, A. et al., Journal of Machine Learning Research, 11, 2010.
September 27, 2024 33/44
Numerical Optimization
Line Search Strategy
I Step size α
  I pk : provides the direction
  I αk : determines how far to move along pk , i.e., how much f is reduced
I Challenge: many evaluations of f (and sometimes ∇f)
I Simplest requirement on αk : f(xk + αk pk) < f(xk)
I Practical conditions: Wolfe conditions, Goldstein conditions
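A minimal backtracking line-search sketch enforcing the Armijo (sufficient-decrease) part of the Wolfe conditions; the test function and constants are illustrative.

```python
import numpy as np

# Backtracking line search: shrink alpha until
# f(x + alpha p) <= f(x) + c1 * alpha * grad_f(x)^T p  (sufficient decrease).
def backtracking(f, grad_f, x, p, alpha0=1.0, rho=0.5, c1=1e-4):
    alpha = alpha0
    fx, gx = f(x), grad_f(x)
    while f(x + alpha * p) > fx + c1 * alpha * gx @ p:
        alpha *= rho
    return alpha

# Example: one steepest-descent step on f(x) = x1^2 + 10 x2^2
f = lambda x: x[0]**2 + 10 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])

x = np.array([1.0, 1.0])
p = -grad_f(x)                            # descent direction
alpha = backtracking(f, grad_f, x, p)
print(alpha, x + alpha * p, f(x + alpha * p))
```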
September 27, 2024 34/44
Numerical Optimization
Trust Region
I Trust-region method
  I Idea: construct a model function mk of f around xk using information
    about f
  I mk may not be a good approximation of f far from xk
  I Limit the search for xk+1 to a region around xk : the trust region
I Objective function (subproblem)
      min_{pk} mk(xk + pk)   s.t. ‖pk‖ ≤ ∆k                 (17)
September 27, 2024 35/44
Numerical Optimization
Trust Region
I If the candidate step does not give a sufficient decrease in f, shrink
  the trust region and re-solve
I Trust-region subproblem
      min_{pk} mk(xk + pk)   s.t. ‖pk‖ ≤ ∆k                 (18)
I mk : a quadratic approximation of f around xk
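A minimal trust-region sketch, assuming a quadratic model mk(p) = f(xk) + g^T p + ½ p^T B p, a Cauchy-point step, and the usual radius update based on the ratio of actual to predicted reduction; it is an illustration, not a production implementation.

```python
import numpy as np

def trust_region(f, grad, hess, x, delta=1.0, delta_max=10.0, tol=1e-6, max_iter=200):
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        gBg = g @ B @ g
        tau = 1.0 if gBg <= 0 else min(1.0, np.linalg.norm(g)**3 / (delta * gBg))
        p = -tau * delta * g / np.linalg.norm(g)          # Cauchy point
        pred = -(g @ p + 0.5 * p @ B @ p)                 # predicted reduction m(0) - m(p)
        rho = (f(x) - f(x + p)) / pred                    # agreement between model and f
        if rho < 0.25:
            delta *= 0.25                                 # poor model: shrink the region
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
            delta = min(2 * delta, delta_max)             # good step at the boundary: expand
        if rho > 0.1:
            x = x + p                                     # accept the step
    return x

# Illustrative use on f(x) = x1^2 + 10 x2^2 (minimizer at the origin)
f = lambda x: x[0]**2 + 10 * x[1]**2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 20.0]])
print(trust_region(f, grad, hess, np.array([1.0, 1.0])))
```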
September 27, 2024 36/44
Optimization for Machine and Deep Learning: Terminology
I Central Idea to DS/ML/AI: Minimize or maximize an objective
function
I Deep learning objective: minimize the loss function or
objective function
I Direct solution & Iterative solution
I Arthur Samuel's paradigm:
  a "mechanism" to improve performance by tweaking the weights
  (parameters)
I The loss function is tweaked by varying the model parameters:
  a million-dimensional space!
September 27, 2024 37/44
Gradient Descent
[Figure: three sketches of the loss function versus a weight w]
      ∇w Loss < 0: increment w (w = w + ∆w)
      ∇w Loss > 0: decrement w (w = w − ∆w)
      ∇w Loss = 0: optimum w (∆w = 0, a stationary point)
What should ∆w be?
      ∆w = −η ∇w Loss  ⇒  w ← w − η ∇w Loss,   where η: learning rate
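A bare-bones gradient-descent loop implementing the update w ← w − η ∇w Loss on a simple one-dimensional loss; the function and constants are illustrative.

```python
# Gradient descent on Loss(w) = (w - 3)^2, minimized at w = 3.
loss = lambda w: (w - 3.0)**2
grad = lambda w: 2.0 * (w - 3.0)

w, eta = 0.0, 0.1                      # initial weight and learning rate
for step in range(100):
    w = w - eta * grad(w)              # the update rule above
print(w, loss(w))                      # w is close to 3, loss close to 0
```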
September 27, 2024 38/44
Gradient Descent
Learning Rate
I Learning rate too big ⇒ oscillation, possibly divergence
I Learning rate too small ⇒ long time to convergence
[Figure: loss function versus w for the two cases]
I The learning rate should adapt:
  I Near a local solution: proceed carefully with a small learning rate
  I Far from a local solution: proceed quickly with a large learning rate
September 27, 2024 39/44
Gradient Descent
Three types
I Batch GD
  I Use the entire data set for computing the gradient
  I Update rule: w = w − η ∇w L(w)
  I Slow; a large data set may not fit in memory
I Stochastic GD (online)
  I Use one data point (xi, yi) for computing the gradient
  I Update rule: w = w − η ∇w L(w, xi, yi)
  I Much faster; updates have high variance and L fluctuates heavily
I Mini-batch GD
  I Use a subset of data points for computing the gradient
  I Update rule: w = w − η ∇w L(w, x_{i:i+n}, y_{i:i+n})
  I Best of both methods; reduces the variance of the parameter updates
Algorithm of choice: mini-batch gradient descent for NN and DL
Mini-batch GD is often referred to as SGD
Typical batch sizes of 50 or 256 are used
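A minimal mini-batch SGD sketch on a linear least-squares loss L(w) = (1/2m) ‖Xw − y‖²; the data, batch size, and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=1000)

w, eta, batch = np.zeros(5), 0.05, 50
for epoch in range(20):
    idx = rng.permutation(len(y))                 # shuffle once per epoch
    for start in range(0, len(y), batch):
        j = idx[start:start + batch]              # indices of one mini-batch
        Xb, yb = X[j], y[j]
        grad = Xb.T @ (Xb @ w - yb) / len(j)      # gradient on the mini-batch
        w -= eta * grad                           # SGD update
print(np.linalg.norm(w - w_true))                 # small: w approaches w_true
```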
September 27, 2024 40/44
Gradient Descent
Momentum SGD
I SGD: sometimes slow
I How do we accelerate learning?
I Momentum approaches:
  I Use the concept of momentum from physics
  I v: a velocity variable that accumulates the past gradients
  I Update rule
      v ← β v − η ∇w L(wk, xi, yi)
      wk+1 = wk + v
  I β ∈ [0, 1): momentum parameter
  I The larger β is, the more the current direction is affected by past
    gradients
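A sketch of the momentum update above on an illustrative quadratic loss; consistent gradient directions build up velocity.

```python
import numpy as np

loss_grad = lambda w: np.array([2 * w[0], 20 * w[1]])   # gradient of w1^2 + 10 w2^2

w = np.array([1.0, 1.0])
v = np.zeros(2)
eta, beta = 0.02, 0.9
for _ in range(200):
    v = beta * v - eta * loss_grad(w)   # velocity update (accumulates past gradients)
    w = w + v                           # parameter update
print(w)                                # close to the minimizer (0, 0)
```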
September 27, 2024 41/44
Gradient Descent
Stochastic GD
I Learning rate: an important hyperparameter
I Adaptive learning rates: adapt to the direction and magnitude of the
  gradients of the different parameters
I AdaGrad learning rate:
  scale the learning rate of each parameter by the square root of the sum
  of all the past squared gradients
      vk+1 = vk + ∇w L(wk) ⊙ ∇w L(wk)
      wk+1 = wk − (η / √(δ + vk+1)) ⊙ ∇w L(wk)
I RMSProp learning rate: an exponential moving average instead of the sum
      vk+1 = ρ vk + (1 − ρ) ∇w L(wk) ⊙ ∇w L(wk)
      wk+1 = wk − (η / √(δ + vk+1)) ⊙ ∇w L(wk)
I Other algorithm: adaptive moments ("Adam"), a combination of RMSProp
  and momentum SGD
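A sketch of the RMSProp update above (AdaGrad is the same with a plain running sum of squared gradients instead of the exponential moving average); the loss and constants are illustrative.

```python
import numpy as np

loss_grad = lambda w: np.array([2 * w[0], 20 * w[1]])   # gradient of w1^2 + 10 w2^2

w = np.array([1.0, 1.0])
v = np.zeros(2)
eta, rho, delta = 0.01, 0.9, 1e-8
for _ in range(2000):
    g = loss_grad(w)
    v = rho * v + (1 - rho) * g * g          # exponential moving average of g ⊙ g
    w = w - eta / np.sqrt(delta + v) * g     # per-parameter scaled step
print(w)                                     # ends up near the minimizer (0, 0)
```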
September 27, 2024 42/44
Gradient Descent
Stochastic GD: Conclusions
I Which algorithm should I choose for a particular application?
I Which is the best algorithm to apply?
I Answer: none, or all
I Robust performance: RMSProp and the AdaDelta family (Schaul et al., 2013)
I Hyperparameter tuning does not matter as much with adaptive learning rates
I The choice of algorithm largely depends on the user's familiarity with
  the algorithm
Schaul et al. "No more pesky learning rates." International Conference on Machine Learning, 2013.
September 27, 2024 43/44
Gradient Descent
Stochastic GD: Summary
I The elements needed for applying stochastic GD
  I The form of the loss function
  I A way to compute gradients with respect to the parameters
  I Initialization of the parameters, the learning rate, and other
    hyperparameters
September 27, 2024 44/44