Cholesky
Cholesky
12.1
Definitions
𝑥𝑇 𝐴𝑥 ≥ 0 for all 𝑥
𝑛 ∑︁
𝑛 𝑛
𝐴𝑖𝑖 𝑥𝑖2 +2
𝑇
∑︁ ∑︁ ∑︁
𝑥 𝐴𝑥 = 𝐴𝑖 𝑗 𝑥𝑖 𝑥 𝑗 = 𝐴𝑖 𝑗 𝑥𝑖 𝑥 𝑗
𝑖=1 𝑗=1 𝑖=1 𝑖> 𝑗
9 6
𝐴=
6 𝑎
𝐴𝑥 = 0 =⇒ 𝑥𝑇 𝐴𝑥 = 0 =⇒ 𝑥=0
1
𝑆 = 𝐴2:𝑛,2:𝑛 − 𝐴2:𝑛,1 𝐴𝑇2:𝑛,1
𝐴11
𝑇
𝑦 𝐴11 𝐴𝑇2:𝑛,1 𝑦
𝑇
𝑥 𝑆𝑥 = >0
𝑥 𝐴2:𝑛,1 𝐴2:𝑛,2:𝑛 𝑥
R1 R2
x1 x2
y1 + R3 + y2
− −
𝑦1 𝑅1 + 𝑅3 𝑅3 𝑥1
=
𝑦2 𝑅3 𝑅2 + 𝑅3 𝑥2
Algebraic solution
and 𝑥𝑇 𝐴𝑥 = 0 only if 𝑥 1 = 𝑥 2 = 0
𝐴 = 𝐵𝑇 𝐵
𝑥𝑇 𝐴𝑥 = 𝑥𝑇 𝐵𝑇 𝐵𝑥 = ∥𝐵𝑥∥ 2 ≥ 0 ∀𝑥
𝑥𝑇 𝐴𝑥 = 𝑥𝑇 𝐵𝑇 𝐵𝑥 = ∥𝐵𝑥∥ 2 > 0 ∀𝑥 ≠ 0
1
if edge 𝑗 ends at vertex 𝑖
𝐵𝑖 𝑗 = −1 if edge 𝑗 starts at vertex 𝑖
0 otherwise
assume there are no self-loops and at most one edge between any two vertices
3
2 3
−1 −1 0 1 0
1 0 −1 0 0
1 4 5
𝐵=
0 0 1 −1 −1
0 1 0 0 1
2
1 4
the positive semidefinite matrix 𝐴 = 𝐵𝐵𝑇 is called the Laplacian of the graph
degree of vertex 𝑖 if 𝑖 = 𝑗
if 𝑖 ≠ 𝑗 and vertices 𝑖 and 𝑗 are adjacent
𝐴𝑖 𝑗 = −1
0 otherwise
3
2 3
3 −1 −1 −1
2 −1 0
1 4 5 𝑇
−1
𝐴 = 𝐵𝐵 =
3 −1
−1 −1
−1 0 −1 2
2
1 4
2
(𝑦 𝑗 − 𝑦𝑖 ) 2
∑︁
𝑦 𝐴𝑦 = ∥𝐵 𝑦∥ =
𝑇 𝑇
edges 𝑖 → 𝑗
𝑦𝑇 𝐴𝑦 = (𝑦 2 − 𝑦 1) 2 + (𝑦 4 − 𝑦 1) 2 + (𝑦 3 − 𝑦 2) 2 + (𝑦 1 − 𝑦 3) 2 + (𝑦 4 − 𝑦 3) 2
𝑤 1 + 𝑤 2 + 𝑤 4 −𝑤 1 −𝑤 4 −𝑤 2
0
1 4 5 −𝑤 1 𝑤1 + 𝑤3 −𝑤 3
𝐴=
−𝑤 4 −𝑤 3 𝑤 3 + 𝑤 4 + 𝑤 5 −𝑤 5
0
−𝑤 2 −𝑤 5 𝑤 2 + 𝑤 5
2
1 4
𝐴 = 𝑅𝑇 𝑅
0
" #
𝑅11
𝐴11 𝐴1,2:𝑛 𝑅11 𝑅1,2:𝑛
=
𝐴2:𝑛,1 𝐴2:𝑛,2:𝑛 𝑅𝑇1,2:𝑛 𝑅𝑇2:𝑛,2:𝑛 0 𝑅2:𝑛,2:𝑛
2
" #
𝑅11 𝑅11 𝑅1,2:𝑛
=
𝑅11 𝑅𝑇1,2:𝑛 𝑅𝑇1,2:𝑛 𝑅1,2:𝑛 + 𝑅𝑇2:𝑛,2:𝑛 𝑅2:𝑛,2:𝑛
√︁ 1
𝑅11 = 𝐴11, 𝑅1,2:𝑛 = 𝐴1,2:𝑛
𝑅11
1
𝐴2:𝑛,2:𝑛 − 𝑅𝑇1,2:𝑛 𝑅1,2:𝑛 = 𝐴2:𝑛,2:𝑛 − 𝐴2:𝑛,1 𝐴𝑇2:𝑛,1
𝐴11
5 0 0 5 3 −1
3 3 0 0 3 1
=
−1 1 3 0 0 3
• first row of 𝑅
25 15 −5 5 0 0 5 3 −1
15 18 0 = 3 0 0
𝑅22 𝑅22 𝑅23
−5 0 11 −1 0 0
𝑅23 𝑅33
𝑅33
• second row of 𝑅
18 0 3 0
𝑅22 𝑅22 𝑅23
3 −1 =
−
0 11 −1 𝑅23 𝑅33 0 𝑅33
9 3 3 0 3 1
=
3 10 1 𝑅33 0 𝑅33
2 , i.e., 𝑅 = 3
• third column of 𝑅 : 10 − 1 = 𝑅33 33
Algorithm
• factor 𝐴 as 𝐴 = 𝑅𝑇 𝑅
• solve 𝑅𝑇 𝑅𝑥 = 𝑏
– solve 𝑅𝑇 𝑦 = 𝑏 by forward substitution
– solve 𝑅𝑥 = 𝑦 by back substitution
• factorization: 13 𝑛3
• forward and backward substitution: 2𝑛2
𝐴 = 𝑅𝑇 𝑅
𝐴 = 𝐵𝑇 𝐵 = 𝑅𝑇 𝑄𝑇 𝑄𝑅 = 𝑅𝑇 𝑅
3 −6
25 −50
𝐵 = 4 −8 ,
𝐴 = 𝐵𝑇 𝐵 =
−50 101
0 1
1. Cholesky factorization:
5 0 5 −10
𝐴=
−10 1 0 1
2. QR factorization
3 −6 3/5 0
5 −10
𝐵 = 4 −8 = 4/5 0
0 1 0 1 0 1
Complexity
21 3
𝑚𝑛 + 𝑛 flops
3
• 31 𝑛3 flops
• on a standard computer: a few seconds or less, for 𝑛 up to several 1000
𝐴 = 𝑃𝑅𝑇 𝑅𝑃𝑇
𝑃𝑇 𝐴𝑃 = 𝑅𝑇 𝑅
Algorithm
𝑛 ∑︁
𝑛
𝐴𝑖 𝑗 𝑥¯𝑖 𝑥 𝑗
𝐻
∑︁
𝑥 𝐴𝑥 =
𝑖=1 𝑗=1
𝑛
2
( 𝐴𝑖 𝑗 𝑥¯𝑖 𝑥 𝑗 + 𝐴¯ 𝑖 𝑗 𝑥𝑖 𝑥¯ 𝑗 )
∑︁ ∑︁
= 𝐴𝑖𝑖 |𝑥𝑖 | +
𝑖=1 𝑖> 𝑗
𝑛
2
𝐴𝑖𝑖 |𝑥𝑖 | + 2 Re 𝐴𝑖 𝑗 𝑥¯𝑖 𝑥 𝑗
∑︁ ∑︁
=
𝑖=1 𝑖> 𝑗
𝑥 𝐻 𝐴𝑥 ≥ 0 for all 𝑥 ∈ C𝑛
Cholesky factorization
𝐴 = 𝑅𝐻 𝑅
• we revisit the data fitting problem with linear-in-parameters model (page 9.9)
𝑁 2 𝑝
minimize (𝑘) (𝑘)
𝜃 2𝑗
∑︁ ∑︁
𝜃 𝐹 (𝑥
𝑇
)−𝑦 +𝜆
𝑘=1 𝑗=1
minimize ∥ 𝐴𝜃 − 𝑏∥ 2 + 𝜆∥𝜃 ∥ 2
𝑁
𝑓ˆ(𝑥) = 𝜃ˆ𝑇 𝐹 (𝑥) = 𝑤𝑇 𝐴𝐹 (𝑥) = 𝑤𝑖 𝐹 (𝑥 (𝑖) )𝑇 𝐹 (𝑥)
∑︁
𝑖=1
𝑄 𝑖 𝑗 = 𝐹 (𝑥 (𝑖) )𝑇 𝐹 (𝑥 ( 𝑗) ), 𝑖, 𝑗 = 1, . . . , 𝑁
(𝑄 + 𝜆𝐼)𝑤 = 𝑏
Remarks
𝑁
𝑓ˆ(𝑥) = 𝑤𝑖 𝐹 (𝑥 (𝑖) )𝑇 𝐹 (𝑥)
∑︁
𝑖=1
𝑥1𝑘 1 𝑥2𝑘 2 · · · 𝑥 𝑛𝑘 𝑛
(𝑛 + 𝑑)!
𝑛+𝑑
=
𝑛 𝑛! 𝑑!
(𝑑 + 1)!
𝑥0𝑘 0 𝑥1𝑘 1 · · · 𝑥 𝑛𝑘 𝑛
∑︁
(𝑥0 + 𝑥1 + · · · + 𝑥 𝑛 ) = 𝑑
𝑘 0 +···+𝑘 𝑛 =𝑑 𝑘 0! 𝑘 1! · · · 𝑘 𝑛 !
• setting 𝑥0 = 1 gives
• the sum includes all monomials of degree 𝑑 or less with variables 𝑥1, . . . , 𝑥 𝑛
• coefficient 𝑐 𝑘 1 𝑘 2 ···𝑘 𝑛 is defined as
(𝑑 + 1)!
𝑐 𝑘 1 𝑘 2 ···𝑘 𝑛 = with 𝑘 0 = 𝑑 − 𝑘 1 − · · · − 𝑘 𝑛
𝑘 0! 𝑘 1! 𝑘 2! · · · 𝑘 𝑛 !
𝑓ˆ(𝑥) = 𝜃𝑇 𝐹 (𝑥)
𝑐 𝑘 1 𝑘 2 ···𝑘 𝑛 (𝑢 1𝑘 1 · · · 𝑢 𝑛𝑘 𝑛 )(𝑣 1𝑘 1 · · · 𝑣 𝑛𝑘 𝑛 )
∑︁
𝐹 (𝑢) 𝐹 (𝑣)
𝑇
=
𝑘 1 +···+𝑘 𝑛 ≤𝑑
= (1 + 𝑢 1 𝑣 1 + · · · + 𝑢 𝑛 𝑣 𝑛 ) 𝑑
1 𝑇 1
√ √
3𝑢 3𝑣
√ 1 √ 1
3𝑢 3𝑣
√ 22 √ 22
3𝑢 1 3𝑣
√ 1
√
6𝑢 𝑢 6𝑣 𝑣
√ 122 √ 122
𝐹 (𝑢)𝑇 𝐹 (𝑣) =
3𝑢 2 3𝑣 2
𝑢 31 𝑣 31
√ 2 √ 2
3𝑢 𝑢 3𝑣 𝑣
√ 1 22 √ 1 22
3𝑢 1𝑢 2 3𝑣 1 𝑣 2
𝑢 32 𝑣 32
= (1 + 𝑢 1 𝑣 1 + 𝑢 2 𝑣 2) 3
𝑁 𝑁
𝑓ˆ(𝑥) = (𝑖)
𝑤𝑖 (1 + (𝑥 (𝑖) )𝑇 𝑥) 𝑑
∑︁ ∑︁
𝑤𝑖 𝐾 (𝑥 , 𝑥) =
𝑖=1 𝑖=1
21 3
𝑛𝑁 + 𝑁 flops
3
Cholesky factorization 12.37
Kernel methods
Examples
∥𝑢 − 𝑣∥ 2
𝐾 (𝑢, 𝑣) = exp (− 2
)
2𝜎
• training set is 10000 images from MNIST data set (≈ 1000 examples per digit)
• vector 𝑥 is vector of pixel intensities (size 𝑛 = 282 = 784)
• we use the polynomial kernel with degree 𝑑 = 3:
𝐾 (𝑢, 𝑣) = (1 + 𝑢𝑇 𝑣) 3
𝑓ˆ𝑘 (𝑥) distinguishes digit 𝑘 − 1 (outcome +1) form other digits (outcome −1)
• the Boolean classifiers are combined in the multi-class classifier
𝑄 𝑖 𝑗 = (1 + (𝑥 (𝑖) )𝑇 𝑥 ( 𝑗) ) 𝑑 , 𝑖, 𝑗 = 1, . . . , 𝑁
the solution 𝑤 gives the Boolean classifier for digit 𝑘 − 1 versus rest
𝑁
𝑓˜𝑘 (𝑥) = 𝑤𝑖 (1 + (𝑥 (𝑖) )𝑇 𝑥) 𝑑
∑︁
𝑖=1
• the matrix 𝑄 is the same for each of the ten Boolean classifiers
• hence, only the right-hand side of the equation
(𝑄 + 𝜆𝐼)𝑤 = 𝑦 d
Complexity
Training set
6 Test set
Classification error (%) 5
Predicted digit
Digit 0 1 2 3 4 5 6 7 8 9 Total
0 965 1 0 0 0 1 8 2 3 0 980
1 0 1127 2 1 1 0 2 1 1 0 1135
2 6 2 988 4 1 1 5 16 8 1 1032
3 0 0 7 973 0 12 0 8 6 4 1010
4 1 3 0 0 957 0 3 1 3 14 982
5 3 0 0 5 0 874 5 2 2 1 892
6 9 4 0 0 5 2 937 0 1 0 958
7 0 13 13 1 5 0 0 987 2 7 1028
8 3 1 3 11 4 4 3 5 934 6 974
9 3 4 2 7 13 3 1 6 4 966 1009
All 990 1155 1015 1002 986 897 964 1028 964 999 10000
Predicted digit
Digit 0 1 2 3 4 5 6 7 8 9
Predicted digit
Digit 0 1 2 3 4 5 6 7 8 9