
Optimization

Nonlinear programming:
Multi-dimensional minimization methods
Classification of unconstrained minimization methods

Direct search methods:
• Univariate method
• Box evolution method
• Simplex method
• Pattern search methods
• Powell's method
• Hooke-Jeeves method
• Rosenbrock's method

Descent methods:
• Steepest descent (Cauchy method)
• Fletcher-Reeves method
• Newton's method
• Marquardt method
• Quasi-Newton methods
  • Davidon-Fletcher-Powell method
  • Broyden-Fletcher-Goldfarb-Shanno method
Indirect search (descent method)
Gradient of a function
The gradient of a function is an n-component vector given by:

$$\nabla f = \mathbf{g} = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\[4pt] \dfrac{\partial f}{\partial x_2} \\[4pt] \vdots \\[4pt] \dfrac{\partial f}{\partial x_n} \end{bmatrix}_{n \times 1}$$
Indirect search (descent method)

The gradient has a very important property: if we move along the gradient direction from any point in n-dimensional space, the function value increases at the fastest rate. Hence the gradient direction is called the direction of steepest ascent. Unfortunately, the direction of steepest ascent is a local property, not a global one.
Indirect search (descent method)
• The gradient vectors g evaluated at points 1, 2, 3, and 4 lie along the directions 11′, 22′, 33′, and 44′, respectively.
• Thus the function value increases at the fastest rate in the direction 11′ at point 1, but not at point 2. Similarly, the function value increases at the fastest rate in the direction 22′ at point 2, but not at point 3.
• In other words, the direction of steepest ascent generally varies from point to point, and if we make infinitely small moves along the direction of steepest ascent, the path will be a curved line like the curve 1-2-3-4 in the figure.
Indirect search (descent method)
• Since the gradient vector represents the direction of steepest ascent, the negative of the gradient vector denotes the direction of steepest descent.

• Thus, any method that makes use of the gradient vector can be expected to find the minimum point faster than one that does not make use of the gradient vector.

• All the descent methods make use of the gradient vector, either directly or indirectly, in finding the search directions.

Theorem 1: The gradient vector represents the direction of steepest ascent.

Theorem 2: The maximum rate of change of f at any point X is equal to the magnitude of the gradient vector at the same point.
Indirect search (descent method)
• In general, if $df/ds = \nabla f^T \mathbf{u} > 0$ along a vector dX (where u is the unit vector along dX), it is called a direction of ascent, and if df/ds < 0, it is called a direction of descent.

Evaluation of the gradient

The evaluation of the gradient requires the computation of the partial derivatives ∂f/∂xᵢ, i = 1, 2, ..., n. There are three situations where the evaluation of the gradient poses certain problems:

1. The function is differentiable at all points, but the calculation of the components of the gradient, ∂f/∂xᵢ, is either impractical or impossible.
2. The expressions for the partial derivatives ∂f/∂xᵢ can be derived, but they require large computational time for evaluation.
3. The gradient ∇f is not defined at all points.
Indirect search (descent method)
1st case: The function is differentiable at all points, but the calculation of the components of the gradient, ∂f/∂xᵢ, is either impractical or impossible.
• In the first case, we can use the forward finite-difference formula

$$\left.\frac{\partial f}{\partial x_i}\right|_{\mathbf{X}_m} \simeq \frac{f(\mathbf{X}_m + \Delta x_i \mathbf{u}_i) - f(\mathbf{X}_m)}{\Delta x_i}, \quad i = 1, 2, \ldots, n$$

to approximate the partial derivative ∂f/∂xᵢ at Xₘ. If the function value at the base point Xₘ is known, this formula requires one additional function evaluation to find (∂f/∂xᵢ)|Xₘ. Thus, it requires n additional function evaluations to evaluate the approximate gradient ∇f|Xₘ. For better results, we can use the central finite-difference formula to find the approximate partial derivative (∂f/∂xᵢ)|Xₘ:

$$\left.\frac{\partial f}{\partial x_i}\right|_{\mathbf{X}_m} \simeq \frac{f(\mathbf{X}_m + \Delta x_i \mathbf{u}_i) - f(\mathbf{X}_m - \Delta x_i \mathbf{u}_i)}{2\,\Delta x_i}, \quad i = 1, 2, \ldots, n$$
Indirect search (descent method)
• In these two equations, ∆xᵢ is a small scalar quantity and uᵢ is a vector of order n whose ith component has a value of 1 and all other components have a value of zero.

• In practical computations, the value of ∆xᵢ has to be chosen with some care. If ∆xᵢ is too small, the difference between the values of the function evaluated at (Xₘ + ∆xᵢuᵢ) and (Xₘ − ∆xᵢuᵢ) may be very small and numerical round-off errors may dominate. On the other hand, if ∆xᵢ is too large, the truncation error may predominate in the calculation of the gradient.

• If the expressions for the partial derivatives can be derived but they require large computational time for evaluation (Case 2), the finite-difference formulas are to be preferred whenever the exact gradient evaluation requires more computational time.
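To make the bookkeeping concrete, here is a minimal sketch of both formulas in Python, assuming NumPy is available; the names grad_forward and grad_central are illustrative, not from the notes:

```python
import numpy as np

def grad_forward(f, x, dx=1e-6):
    """Forward-difference gradient: n extra evaluations beyond f(x)."""
    n = len(x)
    g = np.zeros(n)
    fx = f(x)                          # function value at the base point X_m
    for i in range(n):
        e = np.zeros(n); e[i] = 1.0    # unit vector u_i
        g[i] = (f(x + dx * e) - fx) / dx
    return g

def grad_central(f, x, dx=1e-6):
    """Central-difference gradient: 2n extra evaluations, better accuracy."""
    n = len(x)
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n); e[i] = 1.0
        g[i] = (f(x + dx * e) - f(x - dx * e)) / (2.0 * dx)
    return g

# Example: f(x1, x2) = x1 - x2 + 2*x1^2 + 2*x1*x2 + x2^2
f = lambda x: x[0] - x[1] + 2*x[0]**2 + 2*x[0]*x[1] + x[1]**2
print(grad_central(f, np.array([0.0, 0.0])))   # ≈ [ 1. -1.]
```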
Indirect search (descent method)
• If the gradient is not defined at all points (Case 3), we cannot use the finite-difference formulas.
• For example, consider the function with a kink shown in the figure. If the central-difference formula

$$\left.\frac{\partial f}{\partial x}\right|_{X_m} \simeq \frac{f(X_m + \Delta x) - f(X_m - \Delta x)}{2\,\Delta x}$$

is used to evaluate the derivative df/dx at Xₘ, we obtain a value of α₁ for a step size ∆x₁ and a value of α₂ for a step size ∆x₂. Since, in reality, the derivative does not exist at the point Xₘ, the use of the finite-difference formulas might lead to a complete breakdown of the minimization process. In such cases, the minimization can be done only by one of the direct search techniques discussed earlier.
Steepest descent (Cauchy method)
• The use of the negative of the gradient vector as a direction for
minimization was first made by Cauchy in 1847.

• In this method, we start from an initial trial point X₁ and iteratively move along the steepest descent directions until the optimum point is found.

• The steepest descent method can be summarized by the following steps:

1. Start with an arbitrary initial point X₁. Set the iteration number as i = 1.
2. Find the search direction dᵢ as
$$\mathbf{d}_i = -\nabla f(\mathbf{X}_i) = -\mathbf{g}_i$$
3. Determine the optimal step length λᵢ* in the direction dᵢ and set
$$\mathbf{X}_{i+1} = \mathbf{X}_i + \lambda_i^* \mathbf{d}_i = \mathbf{X}_i - \lambda_i^* \mathbf{g}_i$$
Steepest descent (Cauchy method)
4. Test the new point, X_{i+1}, for optimality. If X_{i+1} is optimum, stop the process. Otherwise, go to step 5.

5. Set the new iteration number i = i + 1 and go to step 2.

The method of steepest descent may appear to be the best unconstrained minimization technique, since each one-dimensional search starts in the 'best direction'. However, owing to the fact that the steepest descent direction is a local property, the method is not really effective in most problems.
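A minimal sketch of these five steps in Python, assuming NumPy and SciPy are available; minimize_scalar stands in for the one-dimensional minimization of f(Xᵢ + λdᵢ), and all names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad, x1, eps=1e-6, max_iter=1000):
    """Cauchy's method: step along -g with an exact 1-D line search."""
    x = np.asarray(x1, dtype=float)
    for i in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:   # optimality test (step 4)
            break
        d = -g                         # steepest descent direction (step 2)
        # optimal step length lambda* along d (step 3)
        lam = minimize_scalar(lambda t: f(x + t * d)).x
        x = x + lam * d                # step 5: next iterate
    return x
```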
Example
Minimize $f(x_1, x_2) = x_1 - x_2 + 2x_1^2 + 2x_1 x_2 + x_2^2$ starting from the point $\mathbf{X}_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$.

Solution:
Iteration 1: The gradient of f is given by:

$$\nabla f = \begin{bmatrix} \partial f/\partial x_1 \\ \partial f/\partial x_2 \end{bmatrix} = \begin{bmatrix} 1 + 4x_1 + 2x_2 \\ -1 + 2x_1 + 2x_2 \end{bmatrix}, \qquad \mathbf{g}_1 = \nabla f(\mathbf{X}_1) = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

Therefore,

$$\mathbf{d}_1 = -\mathbf{g}_1 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$$
• To find X₂, we need to find the optimal step length λ₁*. For this, we minimize

$$f(\mathbf{X}_1 + \lambda_1 \mathbf{d}_1) = f(-\lambda_1, \lambda_1) = \lambda_1^2 - 2\lambda_1$$

with respect to λ₁. Since df/dλ₁ = 0 at λ₁* = 1, we obtain:

$$\mathbf{X}_2 = \mathbf{X}_1 + \lambda_1^* \mathbf{d}_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix} + 1\begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$$

• As

$$\mathbf{g}_2 = \nabla f(\mathbf{X}_2) = \begin{bmatrix} -1 \\ -1 \end{bmatrix} \neq \begin{bmatrix} 0 \\ 0 \end{bmatrix},$$

X₂ is not optimum.

Iteration 2:

$$\mathbf{d}_2 = -\mathbf{g}_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

To minimize

$$f(\mathbf{X}_2 + \lambda_2 \mathbf{d}_2) = f(-1 + \lambda_2,\; 1 + \lambda_2) = 5\lambda_2^2 - 2\lambda_2 - 1,$$

we set df/dλ₂ = 0. This gives λ₂* = 1/5, and hence

$$\mathbf{X}_3 = \mathbf{X}_2 + \lambda_2^* \mathbf{d}_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix} + \frac{1}{5}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} -0.8 \\ 1.2 \end{bmatrix}$$

Since the components of the gradient at X₃,

$$\mathbf{g}_3 = \begin{bmatrix} 0.2 \\ -0.2 \end{bmatrix},$$

are not zero, we proceed to the next iteration.

Iteration 3:

$$\mathbf{d}_3 = -\mathbf{g}_3 = \begin{bmatrix} -0.2 \\ 0.2 \end{bmatrix}$$

As

$$f(\mathbf{X}_3 + \lambda_3 \mathbf{d}_3) = f(-0.8 - 0.2\lambda_3,\; 1.2 + 0.2\lambda_3) = 0.04\lambda_3^2 - 0.08\lambda_3 - 1.20,$$

we set df/dλ₃ = 0. This gives λ₃* = 1.0. Therefore,

$$\mathbf{X}_4 = \mathbf{X}_3 + \lambda_3^* \mathbf{d}_3 = \begin{bmatrix} -0.8 \\ 1.2 \end{bmatrix} + 1.0\begin{bmatrix} -0.2 \\ 0.2 \end{bmatrix} = \begin{bmatrix} -1.0 \\ 1.4 \end{bmatrix}$$

The gradient at X₄ is given by:

$$\mathbf{g}_4 = \begin{bmatrix} -0.2 \\ -0.2 \end{bmatrix} \neq \mathbf{0}$$

Since the components of the gradient at X₄ are not equal to zero, X₄ is not optimum and hence we have to proceed to the next iteration. This process has to be continued until the optimum point,

$$\mathbf{X}^* = \begin{bmatrix} -1.0 \\ 1.5 \end{bmatrix},$$

is found.
Convergence Criteria
The following criteria can be used to terminate the iterative process:

1. When the change in function value in two consecutive iterations is small:

$$\left|\frac{f(\mathbf{X}_{i+1}) - f(\mathbf{X}_i)}{f(\mathbf{X}_i)}\right| \leq \varepsilon_1$$

2. When the partial derivatives (components of the gradient) of f are small:

$$\left|\frac{\partial f}{\partial x_i}\right| \leq \varepsilon_2, \quad i = 1, 2, \ldots, n$$

3. When the change in the design vector in two consecutive iterations is small:

$$\|\mathbf{X}_{i+1} - \mathbf{X}_i\| \leq \varepsilon_3$$
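A sketch of how the three criteria might be combined in practice, assuming NumPy; the tolerance names eps1, eps2, eps3 correspond to ε₁, ε₂, ε₃ and are illustrative:

```python
import numpy as np

def converged(f_old, f_new, x_old, x_new, g_new,
              eps1=1e-8, eps2=1e-6, eps3=1e-8):
    """Return True if any of the three termination criteria is met."""
    c1 = abs(f_new - f_old) / max(abs(f_old), 1e-30) <= eps1  # relative change in f
    c2 = np.all(np.abs(g_new) <= eps2)                        # small gradient components
    c3 = np.linalg.norm(np.asarray(x_new) - np.asarray(x_old)) <= eps3  # small design change
    return c1 or c2 or c3
```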
Conjugate Gradient Methods
• The convergence characteristics of the steepest descent method can be improved greatly by modifying it into a conjugate gradient method, which can be considered as a conjugate-directions method involving the use of the gradient of the function.

• We saw that any minimization method that makes use of conjugate directions is quadratically convergent. This property of quadratic convergence is very useful because it ensures that the method will minimize a quadratic function in n steps or fewer.

• Since any general function can be approximated reasonably well by a quadratic near the optimum point, any quadratically convergent method is expected to find the optimum point in a finite number of iterations.
Conjugate Gradient Method
• We have seen that Powell's conjugate direction method requires n single-variable minimizations per iteration and sets up a new conjugate direction at the end of each iteration.

• Thus, it requires in general n² single-variable minimizations to find the minimum of a quadratic function.

• On the other hand, if we can evaluate the gradients of the objective function, we can set up a new conjugate direction after every one-dimensional minimization, and hence we can achieve faster convergence.
Development of Conjugate Gradient Methods
• Consider the development of an algorithm by modifying the steepest descent method applied to a quadratic function

$$f(\mathbf{X}) = \tfrac{1}{2}\mathbf{X}^T \mathbf{A} \mathbf{X} + \mathbf{B}^T \mathbf{X} + C$$

by imposing the condition that the successive directions be mutually conjugate.

• Let X₁ be the starting point for the minimization and let the first search direction be the steepest descent direction:

$$\mathbf{d}_1 = -\nabla f_1 = -(\mathbf{A}\mathbf{X}_1 + \mathbf{B})$$

$$\mathbf{X}_2 = \mathbf{X}_1 + \lambda_1^* \mathbf{d}_1 \quad \text{or} \quad \mathbf{d}_1 = \frac{\mathbf{X}_2 - \mathbf{X}_1}{\lambda_1^*}$$

where λ₁* is the minimizing step length in the direction d₁, so that

$$\mathbf{d}_1^T \left.\nabla f\right|_{\mathbf{X}_2} = \mathbf{d}_1^T \mathbf{g}_2 = 0$$
Development of the Fletcher-Reeves Method
The equation $\mathbf{d}_1^T \mathbf{g}_2 = 0$ can be expanded as

$$\mathbf{d}_1^T \left[\mathbf{A}(\mathbf{X}_1 + \lambda_1^* \mathbf{d}_1) + \mathbf{B}\right] = 0$$

from which the value of λ₁* can be found as:

$$\lambda_1^* = -\frac{\mathbf{d}_1^T(\mathbf{A}\mathbf{X}_1 + \mathbf{B})}{\mathbf{d}_1^T \mathbf{A} \mathbf{d}_1} = -\frac{\mathbf{d}_1^T \mathbf{g}_1}{\mathbf{d}_1^T \mathbf{A} \mathbf{d}_1}$$

Now express the second search direction as a linear combination of d₁ and −g₂:

$$\mathbf{d}_2 = -\mathbf{g}_2 + \beta_1 \mathbf{d}_1$$

where β₁ is to be chosen so as to make d₁ and d₂ conjugate. This requires that:

$$\mathbf{d}_1^T \mathbf{A} \mathbf{d}_2 = 0$$

Substituting $\mathbf{d}_2 = -\mathbf{g}_2 + \beta_1 \mathbf{d}_1$ into the above equation leads to:

$$\mathbf{d}_1^T \mathbf{A}(-\mathbf{g}_2 + \beta_1 \mathbf{d}_1) = 0 \quad \Rightarrow \quad \beta_1 = \frac{\mathbf{g}_2^T \mathbf{A} \mathbf{d}_1}{\mathbf{d}_1^T \mathbf{A} \mathbf{d}_1}$$
Development of the Conjugate Gradient Methods
The difference of the gradients (g₂ − g₁) can be expressed as:

$$\mathbf{g}_2 - \mathbf{g}_1 = (\mathbf{A}\mathbf{X}_2 + \mathbf{B}) - (\mathbf{A}\mathbf{X}_1 + \mathbf{B}) = \mathbf{A}(\mathbf{X}_2 - \mathbf{X}_1) = \lambda_1^* \mathbf{A} \mathbf{d}_1$$

• So,

$$\beta_1 = \frac{\mathbf{g}_2^T \mathbf{A} \mathbf{d}_1}{\mathbf{d}_1^T \mathbf{A} \mathbf{d}_1} = \frac{\mathbf{g}_2^T(\mathbf{g}_2 - \mathbf{g}_1)}{\mathbf{d}_1^T(\mathbf{g}_2 - \mathbf{g}_1)} \qquad (1)$$

• Also, we know $\mathbf{d}_1^T \mathbf{g}_2 = \mathbf{g}_2^T \mathbf{d}_1 = 0$ and $\mathbf{d}_1^T = -\mathbf{g}_1^T$, so

$$\beta_1 = \frac{\mathbf{g}_2^T(\mathbf{g}_2 - \mathbf{g}_1)}{\mathbf{d}_1^T(\mathbf{g}_2 - \mathbf{g}_1)} = \frac{\mathbf{g}_2^T(\mathbf{g}_2 - \mathbf{g}_1)}{\mathbf{g}_1^T \mathbf{g}_1} \qquad (2)$$

• Finally, $\mathbf{g}_1^T \mathbf{g}_2 = -\mathbf{d}_1^T \mathbf{g}_2 = 0$ gives

$$\beta_1 = \frac{\mathbf{g}_2^T(\mathbf{g}_2 - \mathbf{g}_1)}{\mathbf{g}_1^T \mathbf{g}_1} = \frac{\mathbf{g}_2^T \mathbf{g}_2}{\mathbf{g}_1^T \mathbf{g}_1} \qquad (3)$$
Development of the Conjugate Gradient Methods
The same methodology can be extended easily to obtain equations for the βₖ values. So equations (1), (2), and (3) can be written in general form as:

$$\beta_k = \frac{\mathbf{g}_{k+1}^T(\mathbf{g}_{k+1} - \mathbf{g}_k)}{\mathbf{d}_k^T(\mathbf{g}_{k+1} - \mathbf{g}_k)} \qquad (1)\ \text{Hestenes-Stiefel formula}$$

$$\beta_k = \frac{\mathbf{g}_{k+1}^T(\mathbf{g}_{k+1} - \mathbf{g}_k)}{\mathbf{g}_k^T \mathbf{g}_k} \qquad (2)\ \text{Polak-Ribière formula}$$

$$\beta_k = \frac{\mathbf{g}_{k+1}^T \mathbf{g}_{k+1}}{\mathbf{g}_k^T \mathbf{g}_k} \qquad (3)\ \text{Fletcher-Reeves formula}$$
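The three formulas differ only in how βₖ is assembled from gₖ, gₖ₊₁, and dₖ; a minimal side-by-side sketch, assuming NumPy 1-D arrays (names are illustrative):

```python
import numpy as np

def beta_hestenes_stiefel(g_new, g_old, d_old):
    """Eq. (1): Hestenes-Stiefel."""
    y = g_new - g_old
    return (g_new @ y) / (d_old @ y)

def beta_polak_ribiere(g_new, g_old):
    """Eq. (2): Polak-Ribiere."""
    return (g_new @ (g_new - g_old)) / (g_old @ g_old)

def beta_fletcher_reeves(g_new, g_old):
    """Eq. (3): Fletcher-Reeves."""
    return (g_new @ g_new) / (g_old @ g_old)
```

For a quadratic function with exact line searches, all three give identical values; on general nonlinear functions they can behave quite differently.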
Fletcher-Reeves Method
The iterative procedure of the Fletcher-Reeves method can be stated as follows (a sketch in code is given after the list):
1. Start with an arbitrary initial point X₁.
2. Set the first search direction $\mathbf{d}_1 = -\nabla f(\mathbf{X}_1) = -\mathbf{g}_1$.
3. Find the point X₂ according to the relation
$$\mathbf{X}_2 = \mathbf{X}_1 + \lambda_1^* \mathbf{d}_1$$
where λ₁* is the optimal step length in the direction d₁. Set i = 2 and go to the next step.
4. Find $\mathbf{g}_i = \nabla f(\mathbf{X}_i)$, and set
$$\mathbf{d}_i = -\mathbf{g}_i + \frac{\|\mathbf{g}_i\|^2}{\|\mathbf{g}_{i-1}\|^2}\,\mathbf{d}_{i-1}$$
5. Compute the optimum step length λᵢ* in the direction dᵢ, and find the new point
$$\mathbf{X}_{i+1} = \mathbf{X}_i + \lambda_i^* \mathbf{d}_i$$
6. Test for the optimality of the point X_{i+1}. If X_{i+1} is optimum, stop the process. Otherwise, set the value of i = i + 1 and go to step 4.
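A compact sketch of this procedure, assuming NumPy and SciPy; the names are illustrative. The periodic restart discussed in the remarks that follow is included through the restart parameter, with the Fletcher-Reeves recommendation m = n + 1 as the default:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fletcher_reeves(f, grad, x1, eps=1e-6, max_iter=1000, restart=None):
    x = np.asarray(x1, dtype=float)
    n = len(x)
    m = restart if restart is not None else n + 1   # restart interval, m = n + 1
    g = grad(x)
    d = -g                                          # step 2: first direction is steepest descent
    for i in range(max_iter):
        if np.linalg.norm(g) <= eps:                # step 6: optimality test
            break
        lam = minimize_scalar(lambda t: f(x + t * d)).x   # steps 3/5: optimal step length
        x = x + lam * d
        g_new = grad(x)
        if (i + 1) % m == 0:
            d = -g_new                              # periodic restart with steepest descent
        else:
            beta = (g_new @ g_new) / (g @ g)        # step 4: Fletcher-Reeves beta
            d = -g_new + beta * d
        g = g_new
    return x
```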
Fletcher-Reeves Method
Remarks:
1. Since the directions dᵢ used in this method are A-conjugate, the process should converge in n cycles or fewer for a quadratic function. However, for ill-conditioned quadratics (whose contours are highly eccentric and distorted), the method may require much more than n cycles for convergence. The reason for this has been found to be the cumulative effect of rounding errors. Since dᵢ is given by

$$\mathbf{d}_i = -\mathbf{g}_i + \frac{\|\mathbf{g}_i\|^2}{\|\mathbf{g}_{i-1}\|^2}\,\mathbf{d}_{i-1},$$

any error resulting from the inaccuracies involved in the determination of λᵢ*, and from the round-off error involved in accumulating the successive $\|\mathbf{g}_i\|^2 \mathbf{d}_{i-1} / \|\mathbf{g}_{i-1}\|^2$ terms, is carried forward through the vector dᵢ.
Fletcher-Reeves Method
Thus, the search directions dᵢ will be progressively contaminated by these errors. Hence it is necessary, in practice, to restart the method periodically after every, say, m steps by taking the new search direction as the steepest descent direction. That is, after every m steps, d_{m+1} is set equal to −∇f_{m+1} instead of the usual form. Fletcher and Reeves have recommended a value of m = n + 1, where n is the number of design variables.

2. Despite the limitations indicated above, the Fletcher-Reeves method is vastly superior to the steepest descent method and the pattern search methods, but it turns out to be rather less efficient than the Newton and quasi-Newton (variable metric) methods discussed in later sections.
Example
Minimize $f(x_1, x_2) = x_1 - x_2 + 2x_1^2 + 2x_1 x_2 + x_2^2$ starting from the point $\mathbf{X}_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$.

Solution:
Iteration 1

The search direction is taken as the steepest descent direction:

$$\mathbf{S}_1 = -\nabla f(\mathbf{X}_1) = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$$

• To find the optimal step length λ₁* along S₁, we minimize

$$f(\mathbf{X}_1 + \lambda_1 \mathbf{S}_1) = f(-\lambda_1, \lambda_1) = \lambda_1^2 - 2\lambda_1$$

with respect to λ₁. Here df/dλ₁ = 2λ₁ − 2 = 0 at λ₁* = 1.

• Therefore,

$$\mathbf{X}_2 = \mathbf{X}_1 + \lambda_1^* \mathbf{S}_1 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$$

Iteration 2: Since $\nabla f_2 = \nabla f(\mathbf{X}_2) = \begin{bmatrix} -1 \\ -1 \end{bmatrix}$, the equation

$$\mathbf{S}_i = -\nabla f_i + \frac{\|\nabla f_i\|^2}{\|\nabla f_{i-1}\|^2}\,\mathbf{S}_{i-1}$$

gives the next search direction as

$$\mathbf{S}_2 = -\nabla f_2 + \frac{\|\nabla f_2\|^2}{\|\nabla f_1\|^2}\,\mathbf{S}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix} + \frac{2}{2}\begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 2 \end{bmatrix}$$

• To find λ₂*, we minimize

$$f(\mathbf{X}_2 + \lambda_2 \mathbf{S}_2) = f(-1,\; 1 + 2\lambda_2) = 4\lambda_2^2 - 2\lambda_2 - 1$$

with respect to λ₂. As df/dλ₂ = 8λ₂ − 2 = 0 at λ₂* = 1/4, we obtain:

$$\mathbf{X}_3 = \mathbf{X}_2 + \lambda_2^* \mathbf{S}_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix} + \frac{1}{4}\begin{bmatrix} 0 \\ 2 \end{bmatrix} = \begin{bmatrix} -1.0 \\ 1.5 \end{bmatrix}$$

Thus the optimum point is reached in two iterations. Even if we do not know this point to be optimum, we will not be able to move from this point in the next iteration. This can be verified as follows:

Iteration 3:
Now

$$\nabla f_3 = \nabla f(\mathbf{X}_3) = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

Thus,

$$\mathbf{S}_3 = -\nabla f_3 + \frac{\|\nabla f_3\|^2}{\|\nabla f_2\|^2}\,\mathbf{S}_2 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

This shows that there is no search direction to reduce f further, and hence X₃ is optimum.
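Running the fletcher_reeves sketch from the previous section on this function reproduces this two-iteration convergence (usage only; reuses the illustrative names defined there):

```python
import numpy as np  # fletcher_reeves() is the sketch defined above

f    = lambda x: x[0] - x[1] + 2*x[0]**2 + 2*x[0]*x[1] + x[1]**2
grad = lambda x: np.array([1 + 4*x[0] + 2*x[1], -1 + 2*x[0] + 2*x[1]])

print(fletcher_reeves(f, grad, [0.0, 0.0]))   # ≈ [-1.0, 1.5] after two line searches
```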
Rate of Convergence
• Consider the quadratic objective function
$$f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T \mathbf{H} \mathbf{x} + \mathbf{C}^T \mathbf{x}$$
• Optimal solution: $\mathbf{x}^* = -\mathbf{H}^{-1}\mathbf{C}$
• So, $f(\mathbf{x}^*) = -\tfrac{1}{2}\mathbf{C}^T \mathbf{H}^{-1} \mathbf{C}$
• Let xₖ be the kth point of the steepest descent algorithm, i.e.,
$$f(\mathbf{x}_k) = \tfrac{1}{2}\mathbf{x}_k^T \mathbf{H} \mathbf{x}_k + \mathbf{C}^T \mathbf{x}_k$$
• The current direction is $\mathbf{d}_k = -\mathbf{g}(\mathbf{x}_k) = -\mathbf{H}\mathbf{x}_k - \mathbf{C}$.
• Let us compute the next iterate:
$$\phi(\alpha) = f(\mathbf{x}_k + \alpha \mathbf{d}_k) = \tfrac{1}{2}(\mathbf{x}_k + \alpha \mathbf{d}_k)^T \mathbf{H} (\mathbf{x}_k + \alpha \mathbf{d}_k) + \mathbf{C}^T(\mathbf{x}_k + \alpha \mathbf{d}_k)$$
Expanding,
$$\phi(\alpha) = \tfrac{1}{2}\mathbf{x}_k^T \mathbf{H} \mathbf{x}_k + \alpha\,\mathbf{d}_k^T \mathbf{H} \mathbf{x}_k + \tfrac{1}{2}\alpha^2 \mathbf{d}_k^T \mathbf{H} \mathbf{d}_k + \mathbf{C}^T \mathbf{x}_k + \alpha\,\mathbf{C}^T \mathbf{d}_k$$
$$= f(\mathbf{x}_k) + \alpha\,\mathbf{d}_k^T(\mathbf{H}\mathbf{x}_k + \mathbf{C}) + \tfrac{1}{2}\alpha^2 \mathbf{d}_k^T \mathbf{H} \mathbf{d}_k$$
$$= f(\mathbf{x}_k) - \alpha\,\mathbf{d}_k^T \mathbf{d}_k + \tfrac{1}{2}\alpha^2 \mathbf{d}_k^T \mathbf{H} \mathbf{d}_k$$
Setting
$$\frac{\partial \phi}{\partial \alpha} = -\mathbf{d}_k^T \mathbf{d}_k + \alpha\,\mathbf{d}_k^T \mathbf{H} \mathbf{d}_k = 0$$
gives
$$\alpha = \frac{\mathbf{d}_k^T \mathbf{d}_k}{\mathbf{d}_k^T \mathbf{H} \mathbf{d}_k}$$
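Since the optimal step length has this closed form for a quadratic, no numerical line search is needed; a minimal sketch, assuming NumPy, applied to the quadratic form of the earlier example function (its Hessian H and linear term C are written out below):

```python
import numpy as np

def exact_step_quadratic(H, d):
    """Optimal step alpha = (d^T d)/(d^T H d) for f = 0.5 x^T H x + C^T x."""
    return (d @ d) / (d @ H @ d)

# One steepest descent step on a quadratic:
H = np.array([[4.0, 2.0], [2.0, 2.0]])    # Hessian of f = x1 - x2 + 2x1^2 + 2x1x2 + x2^2
C = np.array([1.0, -1.0])
x = np.zeros(2)
d = -(H @ x + C)                          # d_k = -g(x_k)
x = x + exact_step_quadratic(H, d) * d    # exact line search step
print(x)                                  # [-1.  1.], matching lambda_1* = 1 in the example
```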
Rate of Convergence
Substituting this α gives the decrease per step:

$$f(\mathbf{x}_k + \alpha \mathbf{d}_k) = f(\mathbf{x}_k) - \alpha\,\mathbf{d}_k^T \mathbf{d}_k + \tfrac{1}{2}\alpha^2 \mathbf{d}_k^T \mathbf{H} \mathbf{d}_k = f(\mathbf{x}_k) - \frac{1}{2}\frac{(\mathbf{d}_k^T \mathbf{d}_k)^2}{\mathbf{d}_k^T \mathbf{H} \mathbf{d}_k}$$

Now,

$$\frac{f(\mathbf{x}_{k+1}) - f(\mathbf{x}^*)}{f(\mathbf{x}_k) - f(\mathbf{x}^*)} = 1 - \frac{\dfrac{1}{2}\dfrac{(\mathbf{d}_k^T \mathbf{d}_k)^2}{\mathbf{d}_k^T \mathbf{H} \mathbf{d}_k}}{\tfrac{1}{2}\mathbf{x}_k^T \mathbf{H} \mathbf{x}_k + \mathbf{C}^T \mathbf{x}_k + \tfrac{1}{2}\mathbf{C}^T \mathbf{H}^{-1} \mathbf{C}}$$

Since the denominator can be written as

$$\tfrac{1}{2}(\mathbf{H}\mathbf{x}_k + \mathbf{C})^T \mathbf{H}^{-1} (\mathbf{H}\mathbf{x}_k + \mathbf{C}) = \tfrac{1}{2}\mathbf{d}_k^T \mathbf{H}^{-1} \mathbf{d}_k,$$

we obtain

$$\frac{f(\mathbf{x}_{k+1}) - f(\mathbf{x}^*)}{f(\mathbf{x}_k) - f(\mathbf{x}^*)} = 1 - \frac{(\mathbf{d}_k^T \mathbf{d}_k)^2}{(\mathbf{d}_k^T \mathbf{H} \mathbf{d}_k)(\mathbf{d}_k^T \mathbf{H}^{-1} \mathbf{d}_k)} = 1 - \frac{1}{\beta}$$
Rate of Convergence
Kantorovich inequality
Let H ∈ ℝ^{N×N} be a symmetric positive definite matrix. Let λ₁ and λₙ be, respectively, the largest and smallest eigenvalues of H. Then, for any y ≠ 0,

$$\frac{(\mathbf{y}^T \mathbf{y})^2}{(\mathbf{y}^T \mathbf{H} \mathbf{y})(\mathbf{y}^T \mathbf{H}^{-1} \mathbf{y})} \geq \frac{4\lambda_1 \lambda_n}{(\lambda_1 + \lambda_n)^2}$$

Using the above inequality,

$$\frac{f(\mathbf{x}_{k+1}) - f(\mathbf{x}^*)}{f(\mathbf{x}_k) - f(\mathbf{x}^*)} = 1 - \frac{1}{\beta} \leq 1 - \frac{4\lambda_1 \lambda_n}{(\lambda_1 + \lambda_n)^2} = \left(\frac{\lambda_1 - \lambda_n}{\lambda_1 + \lambda_n}\right)^2$$

so

$$\frac{f(\mathbf{x}_{k+1}) - f(\mathbf{x}^*)}{f(\mathbf{x}_k) - f(\mathbf{x}^*)} \leq \left(\frac{\lambda_1 - \lambda_n}{\lambda_1 + \lambda_n}\right)^2 \leq \delta < 1 \quad \Rightarrow \quad \textbf{linear convergence}$$
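For instance, for a quadratic with λ₁ = 100 and λₙ = 1 (condition number 100), the per-step bound evaluates to

$$\frac{f(\mathbf{x}_{k+1}) - f(\mathbf{x}^*)}{f(\mathbf{x}_k) - f(\mathbf{x}^*)} \leq \left(\frac{100 - 1}{100 + 1}\right)^2 \approx 0.9606,$$

so, in the worst case, each steepest descent step is guaranteed to remove only about 4% of the remaining error in f.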
Rate of Convergence
• The ratio $\kappa(\mathbf{H}) = \lambda_1/\lambda_n$ is called the condition number of H. The larger the condition number, the closer the ratio $(\lambda_1 - \lambda_n)/(\lambda_1 + \lambda_n)$ is to 1, and therefore the slower the convergence.

• If λ₁ = λₙ, then only one iteration is needed: in this case H is a multiple of the identity matrix, the contours are concentric circles, and the steepest descent direction points straight at the solution.

• As the condition number $\kappa(\mathbf{H}) = \lambda_1/\lambda_n$ increases, the contours become more elongated, which increases the amount of zig-zagging.
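This effect can be observed numerically; the sketch below (illustrative, assuming NumPy) counts exact-line-search steepest descent steps on 2-D quadratics f = ½xᵀHx with increasing condition number:

```python
import numpy as np

def sd_iterations(H, x0, eps=1e-8, max_iter=100000):
    """Count exact-line-search steepest descent steps on f = 0.5 x^T H x."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        d = -(H @ x)                              # negative gradient
        if np.linalg.norm(d) <= eps:
            return k
        x = x + ((d @ d) / (d @ H @ d)) * d       # exact step alpha
    return max_iter

for kappa in [1, 10, 100, 1000]:
    H = np.diag([float(kappa), 1.0])              # eigenvalues kappa and 1
    print(kappa, sd_iterations(H, [1.0, 1.0]))
# The iteration count grows sharply with the condition number,
# reflecting the increased zig-zagging of the iterates.
```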
