
L. Vandenberghe ECE133A (Fall 2024)

12. Cholesky factorization

• positive definite matrices


• examples
• Cholesky factorization
• complex positive definite matrices
• kernel methods

12.1
Definitions

• a symmetric matrix 𝐴 ∈ R𝑛×𝑛 is positive semidefinite if

𝑥𝑇 𝐴𝑥 ≥ 0 for all 𝑥

• a symmetric matrix 𝐴 ∈ R𝑛×𝑛 is positive definite if

𝑥𝑇 𝐴𝑥 > 0 for all 𝑥 ≠ 0

the positive definite matrices are a subset of the positive semidefinite matrices

note: if 𝐴 is symmetric and 𝑛 × 𝑛, then 𝑥𝑇 𝐴𝑥 is the function

x^T A x = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j = \sum_{i=1}^n A_{ii} x_i^2 + 2 \sum_{i > j} A_{ij} x_i x_j

this is called a quadratic form
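a quick numerical check of this identity, as a sketch in NumPy (the symmetric matrix A and the vector x below are arbitrary random examples):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
A = (A + A.T) / 2                 # symmetrize
x = rng.standard_normal(4)

quad = x @ A @ x                  # x^T A x
diag = np.sum(np.diag(A) * x**2)  # sum_i A_ii x_i^2
off = 2 * sum(A[i, j] * x[i] * x[j] for i in range(4) for j in range(i))
assert np.isclose(quad, diag + off)
```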

Cholesky factorization 12.2


Example

A = \begin{bmatrix} 9 & 6 \\ 6 & a \end{bmatrix}

x^T A x = 9x_1^2 + 12 x_1 x_2 + a x_2^2 = (3x_1 + 2x_2)^2 + (a - 4) x_2^2

• 𝐴 is positive definite for 𝑎 > 4

𝑥𝑇 𝐴𝑥 > 0 for all nonzero 𝑥

• 𝐴 is positive semidefinite but not positive definite for 𝑎 = 4

𝑥𝑇 𝐴𝑥 ≥ 0 for all 𝑥, 𝑥𝑇 𝐴𝑥 = 0 for 𝑥 = (2, −3)

• 𝐴 is not positive semidefinite for 𝑎 < 4

𝑥𝑇 𝐴𝑥 < 0 for 𝑥 = (2, −3)

Cholesky factorization 12.3


Simple properties

• every positive definite matrix 𝐴 is nonsingular

𝐴𝑥 = 0 =⇒ 𝑥𝑇 𝐴𝑥 = 0 =⇒ 𝑥=0

(last step follows from positive definiteness)

• every positive definite matrix 𝐴 has positive diagonal elements

A_{ii} = e_i^T A e_i > 0

• every positive semidefinite matrix 𝐴 has nonnegative diagonal elements

A_{ii} = e_i^T A e_i ≥ 0

Cholesky factorization 12.4


Schur complement

partition 𝑛 × 𝑛 symmetric matrix 𝐴 as


 
A = \begin{bmatrix} A_{11} & A_{2:n,1}^T \\ A_{2:n,1} & A_{2:n,2:n} \end{bmatrix}

• the Schur complement of 𝐴11 is defined as the (𝑛 − 1) × (𝑛 − 1) matrix

S = A_{2:n,2:n} - \frac{1}{A_{11}} A_{2:n,1} A_{2:n,1}^T

• if 𝐴 is positive definite, then 𝑆 is positive definite


to see this, take any x ≠ 0 and define y = -(A_{2:n,1}^T x)/A_{11}; then

x^T S x = \begin{bmatrix} y \\ x \end{bmatrix}^T \begin{bmatrix} A_{11} & A_{2:n,1}^T \\ A_{2:n,1} & A_{2:n,2:n} \end{bmatrix} \begin{bmatrix} y \\ x \end{bmatrix} > 0

because 𝐴 is positive definite
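a small numerical illustration of this property, as a sketch in NumPy (the positive definite matrix A below is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M.T @ M + np.eye(5)                                  # positive definite by construction

S = A[1:, 1:] - np.outer(A[1:, 0], A[1:, 0]) / A[0, 0]   # Schur complement of A_11
np.linalg.cholesky(S)                                    # succeeds, so S is positive definite
```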

Cholesky factorization 12.5


Singular positive semidefinite matrices

• we mentioned that positive definite matrices are nonsingular (page 12.4)


• if 𝐴 is positive semidefinite, but not positive definite, then it is singular

to see this, suppose 𝐴 is positive semidefinite but not positive definite

• there exists a nonzero 𝑥 with 𝑥𝑇 𝐴𝑥 = 0


• since 𝐴 is positive semidefinite the following function is nonnegative:

f(t) = (x - tAx)^T A (x - tAx)
     = x^T A x - 2t\, x^T A^2 x + t^2\, x^T A^3 x
     = -2t \|Ax\|^2 + t^2\, x^T A^3 x

• 𝑓 (𝑡) ≥ 0 for all 𝑡 is only possible if ∥ 𝐴𝑥∥ = 0; therefore 𝐴𝑥 = 0


• hence there exists a nonzero 𝑥 with 𝐴𝑥 = 0, so 𝐴 is singular

Cholesky factorization 12.6


Outline

• positive definite matrices


• examples
• Cholesky factorization
• complex positive definite matrices
• kernel methods
Exercise: resistor circuit

[circuit diagram: sources y_1, y_2 drive currents x_1, x_2 through resistors R_1, R_2, with a shared resistor R_3 carrying x_1 + x_2]

\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} R_1 + R_3 & R_3 \\ R_3 & R_2 + R_3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}

show that the matrix

A = \begin{bmatrix} R_1 + R_3 & R_3 \\ R_3 & R_2 + R_3 \end{bmatrix}

is positive definite if R_1, R_2, R_3 are positive

Cholesky factorization 12.7


Solution

Solution from physics

• 𝑥𝑇 𝐴𝑥 = 𝑦𝑇 𝑥 is the power delivered by sources, dissipated by resistors


• power dissipated by the resistors is positive unless both currents are zero

Algebraic solution

x^T A x = (R_1 + R_3) x_1^2 + 2 R_3 x_1 x_2 + (R_2 + R_3) x_2^2
        = R_1 x_1^2 + R_2 x_2^2 + R_3 (x_1 + x_2)^2
        ≥ 0

and x^T A x = 0 only if x_1 = x_2 = 0

Cholesky factorization 12.8


Gram matrix

recall the definition of Gram matrix of a matrix 𝐵 (page 4.20):

𝐴 = 𝐵𝑇 𝐵

• every Gram matrix is positive semidefinite:

x^T A x = x^T B^T B x = \|Bx\|^2 ≥ 0 for all x

• a Gram matrix is positive definite if

x^T A x = x^T B^T B x = \|Bx\|^2 > 0 for all x ≠ 0

in other words, 𝐵 has linearly independent columns
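a sketch illustrating both cases with NumPy (the example matrices are arbitrary): a Cholesky factorization (page 12.14) of the Gram matrix succeeds exactly when the columns of B are linearly independent:

```python
import numpy as np

B = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])     # independent columns
np.linalg.cholesky(B.T @ B)                             # succeeds: Gram matrix is PD

B2 = np.array([[1.0, 2.0], [1.0, 2.0], [0.0, 0.0]])    # dependent columns
try:
    np.linalg.cholesky(B2.T @ B2)
except np.linalg.LinAlgError:
    print("B2.T @ B2 is positive semidefinite but not positive definite")
```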

Cholesky factorization 12.9


Graph Laplacian

recall definition of node–arc incidence matrix of a directed graph (page 3.29)

 1

 if edge 𝑗 ends at vertex 𝑖
𝐵𝑖 𝑗 = −1 if edge 𝑗 starts at vertex 𝑖

 0 otherwise

assume there are no self-loops and at most one edge between any two vertices

[figure: directed graph with vertices 1, 2, 3, 4 and edges 1: 1→2, 2: 1→4, 3: 2→3, 4: 3→1, 5: 3→4]

B = \begin{bmatrix} -1 & -1 & 0 & 1 & 0 \\ 1 & 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 & -1 \\ 0 & 1 & 0 & 0 & 1 \end{bmatrix}

Cholesky factorization 12.10


Graph Laplacian

the positive semidefinite matrix 𝐴 = 𝐵𝐵𝑇 is called the Laplacian of the graph

A_{ij} = \begin{cases} \text{degree of vertex } i & \text{if } i = j \\ -1 & \text{if } i \ne j \text{ and vertices } i \text{ and } j \text{ are adjacent} \\ 0 & \text{otherwise} \end{cases}

the degree of a vertex is the number of edges incident to it

[figure: the same directed graph as on page 12.10]

A = BB^T = \begin{bmatrix} 3 & -1 & -1 & -1 \\ -1 & 2 & -1 & 0 \\ -1 & -1 & 3 & -1 \\ -1 & 0 & -1 & 2 \end{bmatrix}
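a sketch reproducing this example with NumPy (the edge list is read off the incidence matrix above); the weighted Laplacian of page 12.13 is formed the same way with B diag(w) Bᵀ:

```python
import numpy as np

edges = [(1, 2), (1, 4), (2, 3), (3, 1), (3, 4)]   # edge k: (start, end), 1-based
n = 4
B = np.zeros((n, len(edges)))
for j, (start, end) in enumerate(edges):
    B[start - 1, j] = -1.0                          # edge j starts at this vertex
    B[end - 1, j] = 1.0                             # edge j ends at this vertex

A = B @ B.T
print(A)                                            # matches the Laplacian above

w = np.ones(len(edges))                             # example edge weights
A_w = B @ np.diag(w) @ B.T                          # weighted Laplacian (page 12.13)
```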

Cholesky factorization 12.11


Laplacian quadratic form

recall the interpretation of matrix–vector multiplication with 𝐵𝑇 (page 3.31)

• if y is a vector of node potentials, then B^T y contains the potential differences:

(B^T y)_j = y_k - y_l \quad \text{if edge } j \text{ goes from vertex } l \text{ to } k

• y^T A y = y^T BB^T y is the sum of squared potential differences:

y^T A y = \|B^T y\|^2 = \sum_{\text{edges } i \to j} (y_j - y_i)^2

this is also known as the Dirichlet energy function

Example: for the graph on the previous page

y^T A y = (y_2 - y_1)^2 + (y_4 - y_1)^2 + (y_3 - y_2)^2 + (y_1 - y_3)^2 + (y_4 - y_3)^2

Cholesky factorization 12.12


Weighted graph Laplacian

• we associate a nonnegative weight 𝑤 𝑘 with edge 𝑘


• the weighted graph Laplacian is the matrix 𝐴 = 𝐵 diag(𝑤)𝐵𝑇
(diag(𝑤) is the diagonal matrix with vector 𝑤 on its diagonal)

A_{ij} = \begin{cases} \sum_{k \in \mathcal{N}_i} w_k & \text{if } i = j \text{ (where } \mathcal{N}_i \text{ is the set of edges incident to vertex } i\text{)} \\ -w_k & \text{if } i \ne j \text{ and edge } k \text{ is between vertices } i \text{ and } j \\ 0 & \text{otherwise} \end{cases}



[figure: the same directed graph as on page 12.10]

A = \begin{bmatrix} w_1 + w_2 + w_4 & -w_1 & -w_4 & -w_2 \\ -w_1 & w_1 + w_3 & -w_3 & 0 \\ -w_4 & -w_3 & w_3 + w_4 + w_5 & -w_5 \\ -w_2 & 0 & -w_5 & w_2 + w_5 \end{bmatrix}

this is the conductance matrix of a resistive circuit (𝑤 𝑘 is conductance in branch 𝑘 )


Cholesky factorization 12.13
Outline

• positive definite matrices


• examples
• Cholesky factorization
• complex positive definite matrices
• kernel methods
Cholesky factorization

every positive definite matrix 𝐴 ∈ R𝑛×𝑛 can be factored as

𝐴 = 𝑅𝑇 𝑅

where 𝑅 is upper triangular with positive diagonal elements

• complexity of computing 𝑅 is (1/3)n³ flops


• 𝑅 is called the Cholesky factor of 𝐴
• can be interpreted as “square root” of a positive definite matrix
• gives a practical method for testing positive definiteness
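for instance, a minimal sketch of such a test, assuming NumPy (np.linalg.cholesky raises LinAlgError when the factorization fails):

```python
import numpy as np

def is_positive_definite(A: np.ndarray) -> bool:
    """Return True if the symmetric matrix A is positive definite."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

# the 2x2 example of page 12.3: positive definite exactly when a > 4
print(is_positive_definite(np.array([[9.0, 6.0], [6.0, 5.0]])))   # True  (a = 5)
print(is_positive_definite(np.array([[9.0, 6.0], [6.0, 3.0]])))   # False (a = 3)
```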

Cholesky factorization 12.14


Cholesky factorization algorithm

\begin{bmatrix} A_{11} & A_{1,2:n} \\ A_{2:n,1} & A_{2:n,2:n} \end{bmatrix}
= \begin{bmatrix} R_{11} & 0 \\ R_{1,2:n}^T & R_{2:n,2:n}^T \end{bmatrix} \begin{bmatrix} R_{11} & R_{1,2:n} \\ 0 & R_{2:n,2:n} \end{bmatrix}
= \begin{bmatrix} R_{11}^2 & R_{11} R_{1,2:n} \\ R_{11} R_{1,2:n}^T & R_{1,2:n}^T R_{1,2:n} + R_{2:n,2:n}^T R_{2:n,2:n} \end{bmatrix}

1. compute the first row of R:

R_{11} = \sqrt{A_{11}}, \qquad R_{1,2:n} = \frac{1}{R_{11}} A_{1,2:n}

2. compute the 2,2 block R_{2:n,2:n} from

A_{2:n,2:n} - R_{1,2:n}^T R_{1,2:n} = R_{2:n,2:n}^T R_{2:n,2:n}

this is a Cholesky factorization of order n − 1
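a minimal NumPy sketch of this recursive algorithm, unrolled into a loop over the rows of R (for illustration only; assumes A is symmetric positive definite, no pivot checks):

```python
import numpy as np

def cholesky_factor(A: np.ndarray) -> np.ndarray:
    A = np.array(A, dtype=float)      # work on a copy
    n = A.shape[0]
    R = np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.sqrt(A[k, k])                          # step 1: R_11
        R[k, k+1:] = A[k, k+1:] / R[k, k]                   # step 1: R_{1,2:n}
        A[k+1:, k+1:] -= np.outer(R[k, k+1:], R[k, k+1:])   # step 2: new subproblem
    return R

A = np.array([[25.0, 15, -5], [15, 18, 0], [-5, 0, 11]])
R = cholesky_factor(A)                # matches the worked example that follows
assert np.allclose(R.T @ R, A)
```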

Cholesky factorization 12.15


Discussion

the algorithm works for positive definite 𝐴 of size 𝑛 × 𝑛

• step 1: if 𝐴 is positive definite then 𝐴11 > 0


• step 2: if 𝐴 is positive definite, then

A_{2:n,2:n} - R_{1,2:n}^T R_{1,2:n} = A_{2:n,2:n} - \frac{1}{A_{11}} A_{2:n,1} A_{2:n,1}^T

is positive definite (see page 12.5)


• hence the algorithm works for 𝑛 = 𝑚 if it works for 𝑛 = 𝑚 − 1
• it obviously works for 𝑛 = 1; therefore it works for all 𝑛

Cholesky factorization 12.16


Example

 25 15 −5   𝑅11 0 0   𝑅11 𝑅12 𝑅13 


 15 18 0  0  0
     
=  𝑅12 𝑅22  𝑅22 𝑅23 
 −5 0 11   0 0
    
 𝑅13 𝑅23 𝑅33  𝑅33 
     

 5 0 0   5 3 −1 
 3 3 0  0 3 1
  
= 
 −1 1 3   0 0 3 
  
 

Cholesky factorization 12.17


Example

 25 15 −5   𝑅11 0 0   𝑅11 𝑅12 𝑅13 


 15 18 0  =  𝑅12 0  0
     
𝑅22  𝑅22 𝑅23 
 −5 0 11   𝑅13  0 0
   
   𝑅23 𝑅33 
  𝑅33 

• first row of 𝑅
 25 15 −5   5 0 0   5 3 −1 
 15 18 0  =  3 0  0
     
𝑅22  𝑅22 𝑅23 
 −5 0 11   −1  0 0
   
   𝑅23 𝑅33 
  𝑅33 

• second row of 𝑅

18 0 3  0
      
𝑅22 𝑅22 𝑅23
3 −1 =


0 11 −1 𝑅23 𝑅33 0 𝑅33

9 3 3 0 3 1
    
=
3 10 1 𝑅33 0 𝑅33
2 , i.e., 𝑅 = 3
• third column of 𝑅 : 10 − 1 = 𝑅33 33

Cholesky factorization 12.18


Solving equations with positive definite 𝐴

solve 𝐴𝑥 = 𝑏 with 𝐴 a positive definite 𝑛 × 𝑛 matrix

Algorithm

• factor 𝐴 as 𝐴 = 𝑅𝑇 𝑅
• solve 𝑅𝑇 𝑅𝑥 = 𝑏
– solve 𝑅𝑇 𝑦 = 𝑏 by forward substitution
– solve 𝑅𝑥 = 𝑦 by back substitution

Complexity: (1/3)n³ + 2n² ∼ (1/3)n³ flops

• factorization: (1/3)n³ flops
• forward and backward substitution: 2n² flops
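a sketch of these steps with SciPy's triangular solvers (the matrix is the one factored on page 12.17; the right-hand side b is an arbitrary choice):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

A = np.array([[25.0, 15, -5], [15, 18, 0], [-5, 0, 11]])
b = np.array([1.0, 2.0, 3.0])

R = cholesky(A)                         # factor: A = R^T R (R upper triangular)
y = solve_triangular(R, b, trans='T')   # forward substitution: R^T y = b
x = solve_triangular(R, y)              # back substitution:    R x = y
assert np.allclose(A @ x, b)
```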

Cholesky factorization 12.19


Cholesky factorization of Gram matrix

• suppose 𝐵 is an 𝑚 × 𝑛 matrix with linearly independent columns


• the Gram matrix 𝐴 = 𝐵𝑇 𝐵 is positive definite (page 4.20)

two methods for computing the Cholesky factor of 𝐴, given 𝐵

1. compute 𝐴 = 𝐵𝑇 𝐵, then Cholesky factorization of 𝐴

𝐴 = 𝑅𝑇 𝑅

2. compute QR factorization 𝐵 = 𝑄𝑅 ; since

𝐴 = 𝐵𝑇 𝐵 = 𝑅𝑇 𝑄𝑇 𝑄𝑅 = 𝑅𝑇 𝑅

the matrix 𝑅 is the Cholesky factor of 𝐴
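a sketch comparing the two methods on the example of the next page (NumPy; the signs of the QR factor's diagonal are normalized to match the Cholesky convention of a positive diagonal):

```python
import numpy as np

B = np.array([[3.0, -6], [4, -8], [0, 1]])

R1 = np.linalg.cholesky(B.T @ B).T           # method 1 (transpose of the lower factor)
Q, R2 = np.linalg.qr(B)                      # method 2
R2 = np.sign(np.diag(R2))[:, None] * R2      # flip rows with a negative diagonal entry
print(np.allclose(R1, R2))                   # True
```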

Cholesky factorization 12.20


Example

 3 −6 
25 −50
 
𝐵 =  4 −8  ,
 
𝐴 = 𝐵𝑇 𝐵 =
−50 101

 0 1 

1. Cholesky factorization:

5 0 5 −10
  
𝐴=
−10 1 0 1

2. QR factorization

 3 −6   3/5 0  
 5 −10

𝐵 =  4 −8  =  4/5 0 
  
 0 1   0 1  0 1
   

Cholesky factorization 12.21


Comparison of the two methods

Numerical stability: QR factorization method is more stable

• see the example on page 8.16


• QR method computes 𝑅 without “squaring” 𝐵 (i.e., forming 𝐵𝑇 𝐵)
• this is important when the columns of 𝐵 are “almost” linearly dependent

Complexity

• method 1: cost of symmetric product 𝐵𝑇 𝐵 plus Cholesky factorization

mn² + (1/3)n³ flops

• method 2: 2𝑚𝑛2 flops for QR factorization


• method 1 is faster but only by a factor of at most two (if 𝑚 ≫ 𝑛)

Cholesky factorization 12.22


Sparse positive definite matrices

Cholesky factorization of dense matrices

• (1/3)n³ flops
• on a standard computer: a few seconds or less, for 𝑛 up to several 1000

Cholesky factorization of sparse matrices

• if 𝐴 is very sparse, 𝑅 is often (but not always) sparse


• if 𝑅 is sparse, the cost of the factorization is much less than (1/3)n³
• exact cost depends on 𝑛, number of nonzero elements, sparsity pattern
• very large sets of equations can be solved by exploiting sparsity

Cholesky factorization 12.23


Sparse Cholesky factorization

if 𝐴 is sparse and positive definite, it is usually factored as

𝐴 = 𝑃𝑅𝑇 𝑅𝑃𝑇

𝑃 a permutation matrix; 𝑅 upper triangular with positive diagonal elements

Interpretation: we permute the rows and columns of 𝐴 and factor

𝑃𝑇 𝐴𝑃 = 𝑅𝑇 𝑅

• the choice of permutation greatly affects the sparsity of 𝑅 (see the sketch below)


• there exist several heuristic methods for choosing a good permutation
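the effect of the ordering can be seen on a small dense example; the following NumPy sketch factors an "arrow" matrix with a dense first row and column, then the reversed ordering, and counts nonzeros in the factors:

```python
import numpy as np

n = 6
A = 10.0 * np.eye(n)
A[0, :] = A[:, 0] = 1.0
A[0, 0] = 10.0                                     # diagonally dominant, hence PD

R = np.linalg.cholesky(A).T                        # factor of A: completely filled in
p = np.arange(n)[::-1]                             # reverse ordering: dense row/col last
Rp = np.linalg.cholesky(A[np.ix_(p, p)]).T         # factor of P^T A P: no fill-in
print(np.count_nonzero(R), np.count_nonzero(Rp))   # 21 versus 11 nonzeros
```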

Cholesky factorization 12.24


Example

[figure: sparsity pattern of 𝐴 and its Cholesky factor; pattern of P^T AP and its Cholesky factor]

Cholesky factorization 12.25


Solving sparse positive definite equations

solve 𝐴𝑥 = 𝑏 with 𝐴 a sparse positive definite matrix

Algorithm

1. compute sparse Cholesky factorization 𝐴 = 𝑃𝑅𝑇 𝑅𝑃𝑇


2. permute right-hand side: 𝑐 := 𝑃𝑇 𝑏
3. solve 𝑅𝑇 𝑦 = 𝑐 by forward substitution
4. solve 𝑅𝑧 = 𝑦 by back substitution
5. permute solution: 𝑥 := 𝑃𝑧
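a sketch of the five steps (NumPy/SciPy; a dense factorization stands in for a sparse Cholesky routine, and p is an arbitrary example permutation stored as an index vector):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)                 # an arbitrary positive definite matrix
b = rng.standard_normal(6)
p = np.arange(6)[::-1]                      # example permutation

R = cholesky(A[np.ix_(p, p)])               # 1. P^T A P = R^T R
c = b[p]                                    # 2. c := P^T b
y = solve_triangular(R, c, trans='T')       # 3. forward substitution: R^T y = c
z = solve_triangular(R, y)                  # 4. back substitution:    R z = y
x = np.empty_like(z); x[p] = z              # 5. x := P z
assert np.allclose(A @ x, b)
```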

Cholesky factorization 12.26


Outline

• positive definite matrices


• examples
• Cholesky factorization
• complex positive definite matrices
• kernel methods
Quadratic form

suppose A is n × n and Hermitian (A_{ij} = \bar{A}_{ji})

x^H A x = \sum_{i=1}^n \sum_{j=1}^n A_{ij} \bar{x}_i x_j
        = \sum_{i=1}^n A_{ii} |x_i|^2 + \sum_{i>j} \left( A_{ij} \bar{x}_i x_j + \bar{A}_{ij} x_i \bar{x}_j \right)
        = \sum_{i=1}^n A_{ii} |x_i|^2 + 2 \sum_{i>j} \operatorname{Re}\left( A_{ij} \bar{x}_i x_j \right)

note that x^H A x is real for all x ∈ C^n

Cholesky factorization 12.27


Complex positive definite matrices

• a Hermitian 𝑛 × 𝑛 matrix 𝐴 is positive semidefinite if

𝑥 𝐻 𝐴𝑥 ≥ 0 for all 𝑥 ∈ C𝑛

• a Hermitian 𝑛 × 𝑛 matrix 𝐴 is positive definite if

𝑥 𝐻 𝐴𝑥 > 0 for all nonzero 𝑥 ∈ C𝑛

Cholesky factorization

every positive definite matrix 𝐴 ∈ C𝑛×𝑛 can be factored as

A = R^H R

where 𝑅 is upper triangular with positive real diagonal elements
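NumPy's Cholesky routine also covers the complex Hermitian case; a small sketch (the example matrix is arbitrary; NumPy returns the lower factor L with A = LLᴴ):

```python
import numpy as np

A = np.array([[4.0, 1 + 2j], [1 - 2j, 6.0]])   # Hermitian positive definite
L = np.linalg.cholesky(A)
R = L.conj().T                                  # upper factor with A = R^H R
assert np.allclose(R.conj().T @ R, A)
```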

Cholesky factorization 12.28


Outline

• positive definite matrices


• examples
• Cholesky factorization
• complex positive definite matrices
• kernel methods
Regularized least squares model fitting

• we revisit the data fitting problem with linear-in-parameters model (page 9.9)

f̂(x) = θ_1 f_1(x) + θ_2 f_2(x) + · · · + θ_p f_p(x) = θ^T F(x)

• F(x) = (f_1(x), . . . , f_p(x)) is the p-vector of basis functions

Regularized least squares model fitting (page 10.7)

\text{minimize} \quad \sum_{k=1}^N \left( \theta^T F(x^{(k)}) - y^{(k)} \right)^2 + \lambda \sum_{j=1}^p \theta_j^2

• (𝑥 (1) , 𝑦 (1) ) , . . . , (𝑥 (𝑁) , 𝑦 (𝑁) ) are 𝑁 examples


• to simplify notation, we add regularization for all coefficients 𝜃 1, . . . , 𝜃 𝑝
• the next discussion can be modified to handle f_1(x) = 1, with regularization \sum_{j=2}^p \theta_j^2

Cholesky factorization 12.29


Regularized least squares problem in matrix notation

minimize ∥ 𝐴𝜃 − 𝑏∥ 2 + 𝜆∥𝜃 ∥ 2

• 𝐴 has size 𝑁 × 𝑝 (number of examples × number of basis functions)

A = \begin{bmatrix} F(x^{(1)})^T \\ F(x^{(2)})^T \\ \vdots \\ F(x^{(N)})^T \end{bmatrix}
  = \begin{bmatrix} f_1(x^{(1)}) & f_2(x^{(1)}) & \cdots & f_p(x^{(1)}) \\ f_1(x^{(2)}) & f_2(x^{(2)}) & \cdots & f_p(x^{(2)}) \\ \vdots & \vdots & & \vdots \\ f_1(x^{(N)}) & f_2(x^{(N)}) & \cdots & f_p(x^{(N)}) \end{bmatrix}

• 𝑏 is the 𝑁 -vector 𝑏 = (𝑦 (1) , . . . , 𝑦 (𝑁) )


• we discuss methods for problems with 𝑁 ≪ 𝑝 ( 𝐴 is very wide)
• the equivalent “stacked” least squares problem (p.10.3) has size ( 𝑝 + 𝑁) × 𝑝
• QR factorization method may be too expensive when 𝑁 ≪ 𝑝

Cholesky factorization 12.30


Solution of regularized LS problem

from the normal equations:

θ̂ = (A^T A + λI)^{-1} A^T b = A^T (AA^T + λI)^{-1} b

• the second expression follows from the “push-through” identity

(A^T A + λI)^{-1} A^T = A^T (AA^T + λI)^{-1}

this is easily proved by writing it as A^T (AA^T + λI) = (A^T A + λI) A^T


• from the second expression for θ̂ and the definition of A,

f̂(x) = θ̂^T F(x) = w^T A F(x) = \sum_{i=1}^N w_i F(x^{(i)})^T F(x)

where w = (AA^T + λI)^{-1} b
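the push-through identity is easy to verify numerically; a sketch with NumPy, using arbitrary random data with N ≪ p (a very wide A):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 5, 50, 0.1
A = rng.standard_normal((N, p))
b = rng.standard_normal(N)

theta1 = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ b)   # p x p system
theta2 = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(N), b)   # N x N system
assert np.allclose(theta1, theta2)
```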

Cholesky factorization 12.31


Algorithm

1. compute the 𝑁 × 𝑁 matrix 𝑄 = 𝐴𝐴𝑇 , which has elements

Q_{ij} = F(x^{(i)})^T F(x^{(j)}), \qquad i, j = 1, . . . , N

2. use a Cholesky factorization to solve the equation

(𝑄 + 𝜆𝐼)𝑤 = 𝑏

Remarks

• θ̂ = A^T w is not needed; w is sufficient to evaluate the function f̂(x):

f̂(x) = \sum_{i=1}^N w_i F(x^{(i)})^T F(x)

• complexity: (1/3)N³ flops for the factorization, plus the cost of computing Q

Cholesky factorization 12.32


Example: multivariate polynomials

f̂(x) is a polynomial of degree d (or less) in n variables x = (x_1, . . . , x_n)

• f̂(x) is a linear combination of all possible monomials

x_1^{k_1} x_2^{k_2} \cdots x_n^{k_n}

where k_1, . . . , k_n are nonnegative integers with k_1 + k_2 + · · · + k_n ≤ d

• the number of different monomials is

\binom{n+d}{n} = \frac{(n+d)!}{n! \, d!}

Example: for n = 2, d = 3 there are ten monomials

1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3

Cholesky factorization 12.33


Multinomial formula

(x_0 + x_1 + · · · + x_n)^d = \sum_{k_0 + k_1 + \cdots + k_n = d} \frac{d!}{k_0! \, k_1! \cdots k_n!} \, x_0^{k_0} x_1^{k_1} \cdots x_n^{k_n}

the sum is over all nonnegative integers k_0, k_1, . . . , k_n with sum d

• setting x_0 = 1 gives

(1 + x_1 + x_2 + · · · + x_n)^d = \sum_{k_1 + \cdots + k_n \le d} c_{k_1 k_2 \cdots k_n} \, x_1^{k_1} x_2^{k_2} \cdots x_n^{k_n}

• the sum includes all monomials of degree d or less in the variables x_1, . . . , x_n
• the coefficient c_{k_1 k_2 \cdots k_n} is defined as

c_{k_1 k_2 \cdots k_n} = \frac{d!}{k_0! \, k_1! \, k_2! \cdots k_n!} \qquad \text{with } k_0 = d - k_1 - \cdots - k_n

Cholesky factorization 12.34


Vector of monomials

write a polynomial of degree d or less, with variables x ∈ R^n, as

f̂(x) = θ^T F(x)

• F(x) is the vector of basis functions

\sqrt{c_{k_1 \cdots k_n}} \, x_1^{k_1} x_2^{k_2} \cdots x_n^{k_n} \qquad \text{for all } k_1 + k_2 + \cdots + k_n \le d

• the length of F(x) is p = (n + d)!/(n! d!)

• the multinomial formula gives a simple formula for the inner products F(u)^T F(v):

F(u)^T F(v) = \sum_{k_1 + \cdots + k_n \le d} c_{k_1 k_2 \cdots k_n} (u_1^{k_1} \cdots u_n^{k_n})(v_1^{k_1} \cdots v_n^{k_n}) = (1 + u_1 v_1 + \cdots + u_n v_n)^d

• only 2n + 1 flops are needed for this inner product of vectors of length p = (n + d)!/(n! d!)
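the identity can be checked directly; this sketch (NumPy and itertools) enumerates the exponents, builds F from the coefficients c_{k_1···k_n} defined above, and compares the inner product with the kernel value for n = 2, d = 3:

```python
import numpy as np
from itertools import product
from math import factorial, prod

def F(x, d=3):
    n = len(x)
    entries = []
    for k in product(range(d + 1), repeat=n):        # candidate exponents k_1, ..., k_n
        if sum(k) <= d:
            k0 = d - sum(k)
            c = factorial(d) / (factorial(k0) * prod(factorial(ki) for ki in k))
            entries.append(np.sqrt(c) * prod(xi**ki for xi, ki in zip(x, k)))
    return np.array(entries)

u, v = np.array([0.3, -1.2]), np.array([0.5, 2.0])
assert np.isclose(F(u) @ F(v), (1 + u @ v) ** 3)
print(len(F(u)))                                     # 10 monomials, as in the example below
```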

Cholesky factorization 12.35


Example

vector of monomials of degree d = 3 or less in n = 2 variables:

F(u)^T F(v) = \begin{bmatrix} 1 \\ \sqrt{3}\, u_1 \\ \sqrt{3}\, u_2 \\ \sqrt{3}\, u_1^2 \\ \sqrt{6}\, u_1 u_2 \\ \sqrt{3}\, u_2^2 \\ u_1^3 \\ \sqrt{3}\, u_1^2 u_2 \\ \sqrt{3}\, u_1 u_2^2 \\ u_2^3 \end{bmatrix}^T \begin{bmatrix} 1 \\ \sqrt{3}\, v_1 \\ \sqrt{3}\, v_2 \\ \sqrt{3}\, v_1^2 \\ \sqrt{6}\, v_1 v_2 \\ \sqrt{3}\, v_2^2 \\ v_1^3 \\ \sqrt{3}\, v_1^2 v_2 \\ \sqrt{3}\, v_1 v_2^2 \\ v_2^3 \end{bmatrix} = (1 + u_1 v_1 + u_2 v_2)^3

Cholesky factorization 12.36


Least squares fitting of multivariate polynomials

fit polynomial of 𝑛 variables, degree ≤ 𝑑 , to points (𝑥 (1) , 𝑦 (1) ) , . . . , (𝑥 (𝑁) , 𝑦 (𝑁) )

Algorithm (see page 12.32)


1. compute the 𝑁 × 𝑁 matrix 𝑄 with elements

Q_{ij} = K(x^{(i)}, x^{(j)}) \qquad \text{where } K(u, v) = (1 + u^T v)^d

2. use a Cholesky factorization to solve the equation (𝑄 + 𝜆𝐼)𝑤 = 𝑏

• the fitted polynomial is

f̂(x) = \sum_{i=1}^N w_i K(x^{(i)}, x) = \sum_{i=1}^N w_i (1 + (x^{(i)})^T x)^d

• complexity: nN² flops for computing Q, plus (1/3)N³ for the factorization, i.e.,

nN² + (1/3)N³ flops
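a sketch of this algorithm on synthetic data (NumPy/SciPy; the target function, the sizes N, n, d, and λ below are arbitrary illustrative choices):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
N, n, d, lam = 200, 2, 3, 1e-3
X = rng.standard_normal((N, n))                       # rows are the x^(i)
y = X[:, 0] ** 3 - 2 * X[:, 0] * X[:, 1] + 1.0        # an exact degree-3 polynomial

Q = (1 + X @ X.T) ** d                                # Q_ij = K(x^(i), x^(j))
w = cho_solve(cho_factor(Q + lam * np.eye(N)), y)     # solve (Q + lam*I) w = b

def fhat(x):
    return w @ (1 + X @ x) ** d                       # sum_i w_i (1 + (x^(i))^T x)^d

x0 = np.array([0.5, -0.7])
print(fhat(x0), x0[0] ** 3 - 2 * x0[0] * x0[1] + 1.0)   # nearly equal
```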
Cholesky factorization 12.37
Kernel methods

Kernel function: a generalized inner product 𝐾 (𝑢, 𝑣)

• 𝐾 (𝑢, 𝑣) is inner product of vectors of basis functions 𝐹 (𝑢) and 𝐹 (𝑣)


• 𝐹 (𝑢) may be infinite-dimensional
• kernel methods work with 𝐾 (𝑢, 𝑣) directly, do not require 𝐹 (𝑢)

Examples

• the polynomial kernel function 𝐾 (𝑢, 𝑣) = (1 + 𝑢𝑇 𝑣) 𝑑


• the Gaussian radial basis function kernel

K(u, v) = \exp\left( -\frac{\|u - v\|^2}{2\sigma^2} \right)

• kernels exist for computing with graphs, texts, strings of symbols, . . .
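for instance, the Gaussian kernel above takes one line (a sketch; σ is a user-chosen width parameter):

```python
import numpy as np

def rbf_kernel(u, v, sigma=1.0):
    """Gaussian radial basis function kernel K(u, v)."""
    return np.exp(-np.linalg.norm(u - v) ** 2 / (2 * sigma ** 2))

print(rbf_kernel(np.zeros(2), np.zeros(2)))   # 1.0, since K(u, u) = 1
```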

Cholesky factorization 12.38


Example: handwritten digit classification
we apply the method of page 12.37 to least squares classification

• training set is 10000 images from MNIST data set (≈ 1000 examples per digit)
• vector 𝑥 is the vector of pixel intensities (size n = 28² = 784)
• we use the polynomial kernel with degree 𝑑 = 3:

𝐾 (𝑢, 𝑣) = (1 + 𝑢𝑇 𝑣) 3

hence 𝐹 (𝑧) has length 𝑝 = (𝑛 + 𝑑)!/(𝑛! 𝑑!) = 80931145


• we calculate ten Boolean classifiers

f̂_k(x) = sign(f̃_k(x)), \qquad k = 1, . . . , 10

f̂_k(x) distinguishes digit k − 1 (outcome +1) from the other digits (outcome −1)
• the Boolean classifiers are combined in the multi-class classifier

f̂(x) = \underset{k=1,\ldots,10}{\operatorname{argmax}} \, f̃_k(x)

Cholesky factorization 12.39


Least squares Boolean classifier

Algorithm: compute Boolean classifier for digit 𝑘 − 1 versus the rest


1. compute the N × N matrix Q with elements

Q_{ij} = (1 + (x^{(i)})^T x^{(j)})^d, \qquad i, j = 1, . . . , N

2. define the N-vector b = (y^{(1)}, . . . , y^{(N)}) with elements

y^{(i)} = \begin{cases} +1 & \text{if } x^{(i)} \text{ is an example of digit } k - 1 \\ -1 & \text{otherwise} \end{cases}

3. solve the equation (Q + λI)w = b

the solution w gives the Boolean classifier for digit k − 1 versus the rest:

f̃_k(x) = \sum_{i=1}^N w_i (1 + (x^{(i)})^T x)^d

Cholesky factorization 12.40


Complexity

• the matrix 𝑄 is the same for each of the ten Boolean classifiers
• hence, only the right-hand side of the equation

(Q + λI)w = b

is different for each Boolean classifier

Complexity

• constructing Q requires N²/2 inner products of length n: nN² flops

• Cholesky factorization of Q + λI: (1/3)N³ flops
• solving (Q + λI)w = b for the 10 right-hand sides: 20N² flops
• total is (1/3)N³ + nN² flops (see the sketch below)
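a sketch of this shared-factorization scheme (NumPy/SciPy; the arrays X and labels below are random stand-ins for the MNIST training data, not the real data set):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
N, n, d, lam = 500, 784, 3, 1e4
X = rng.standard_normal((N, n))                  # stand-in "images"
labels = rng.integers(0, 10, N)                  # stand-in digit labels

Q = (1 + X @ X.T) ** d                           # one kernel matrix ...
factor = cho_factor(Q + lam * np.eye(N))         # ... and one factorization
Y = np.where(labels[:, None] == np.arange(10), 1.0, -1.0)   # ten +/-1 target vectors
W = cho_solve(factor, Y)                         # ten solves, 2N^2 flops each

def classify(x):
    scores = W.T @ (1 + X @ x) ** d              # the ten values f~_k(x)
    return int(np.argmax(scores))                # predicted digit k - 1
```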

Cholesky factorization 12.41


Classification error

[figure: classification error (%) on the training set and the test set versus λ, for λ from 10⁻² to 10⁶]

percentage of misclassified digits versus λ

Cholesky factorization 12.42


Confusion matrix

Predicted digit
Digit 0 1 2 3 4 5 6 7 8 9 Total

0 965 1 0 0 0 1 8 2 3 0 980
1 0 1127 2 1 1 0 2 1 1 0 1135
2 6 2 988 4 1 1 5 16 8 1 1032
3 0 0 7 973 0 12 0 8 6 4 1010
4 1 3 0 0 957 0 3 1 3 14 982
5 3 0 0 5 0 874 5 2 2 1 892
6 9 4 0 0 5 2 937 0 1 0 958
7 0 13 13 1 5 0 0 987 2 7 1028
8 3 1 3 11 4 4 3 5 934 6 974
9 3 4 2 7 13 3 1 6 4 966 1009
All 990 1155 1015 1002 986 897 964 1028 964 999 10000

• multiclass classifier (λ = 10⁴) on 10000 test examples


• 292 digits are misclassified (2.9% error)

Cholesky factorization 12.43


Examples of misclassified digits

[figure: examples of misclassified digits, arranged by true digit and predicted digit]

Cholesky factorization 12.44


Examples of misclassified digits

[figure: further examples of misclassified digits, arranged by true digit and predicted digit]

Cholesky factorization 12.45
