lec07
Chris J. Maddison
University of Toronto
Sometimes the data form clusters, where samples within a cluster are
similar to each other, and samples in different clusters are dissimilar:
[Figure: clustered data points x^(n), each assigned to a cluster center m_k, with responsibility r_k^(n) = 1 for the assigned cluster.]
K-means Objective
K-means Objective: Find cluster centers {m_k}_{k=1}^K and assignments {r^(n)}_{n=1}^N to minimize the sum of squared distances of data points {x^(n)} to their assigned centers.
▶ Data sample n = 1, ..., N: x^(n) ∈ R^D (observed),
▶ Cluster center k = 1, ..., K: m_k ∈ R^D (not observed),
▶ Responsibilities: Cluster assignment for sample n: r^(n) ∈ R^K, a 1-of-K encoding (not observed).
Mathematically:
$$\min_{\{m_k\},\{r^{(n)}\}} \hat{R}\big(\{m_k\},\{r^{(n)}\}\big) = \min_{\{m_k\},\{r^{(n)}\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_k^{(n)} \,\big\lVert m_k - x^{(n)} \big\rVert^2$$
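As a concrete reading of this objective, here is a minimal sketch in NumPy (the array names X, M, R and their shapes are illustrative assumptions, not notation from the course):

```python
import numpy as np

def kmeans_cost(X, M, R):
    """Sum of squared distances of each data point to its assigned center.

    X: (N, D) data points, M: (K, D) cluster centers,
    R: (N, K) one-hot responsibilities (r_k^(n) = 1 for the assigned cluster).
    """
    # Squared distance between every point and every center: shape (N, K)
    sq_dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    # Each point only pays for the center it is assigned to
    return (R * sq_dists).sum()
```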
Optimization problem:
$$\min_{\{m_k\},\{r^{(n)}\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_k^{(n)} \,\big\lVert m_k - x^{(n)} \big\rVert^2$$
The problem is hard when minimizing jointly over the parameters {m_k} and {r^(n)}.
But note that if we fix one and minimize over the other, the problem becomes easy.
If we fix the centers {m_k}, then we can easily find the optimal assignment r^(n) for each sample n:

$$\min_{r^{(n)}} \sum_{k=1}^{K} r_k^{(n)} \,\big\lVert m_k - x^{(n)} \big\rVert^2$$

(The minimum is attained by assigning sample n entirely to its nearest center.)
Likewise, if we fix the assignments {r^(n)}, then we can easily find the optimal centers {m_k}.
▶ Set each cluster's center to the average of its assigned data points: for l = 1, 2, ..., K,
$$0 = \frac{\partial}{\partial m_l} \sum_{n=1}^{N} \sum_{k=1}^{K} r_k^{(n)} \,\big\lVert m_k - x^{(n)} \big\rVert^2 = 2 \sum_{n=1}^{N} r_l^{(n)} \big( m_l - x^{(n)} \big) \;\Longrightarrow\; m_l = \frac{\sum_n r_l^{(n)} x^{(n)}}{\sum_n r_l^{(n)}}$$
Let's alternate between minimizing R̂({m_k}, {r^(n)}) with respect to {m_k} and with respect to {r^(n)}.
This is called alternating minimization
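A minimal sketch of this alternating-minimization loop, assuming NumPy and illustrative names (X is an (N, D) data array; initializing at K random data points and running a fixed number of iterations are simplifications, not the course's reference implementation):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate between the assignment step and the refitting step."""
    rng = np.random.default_rng(seed)
    N, _ = X.shape
    # Initialize the centers at K randomly chosen data points
    M = X[rng.choice(N, size=K, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: give each point to its nearest center (optimal r^(n))
        sq_dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        R = np.eye(K)[sq_dists.argmin(axis=1)]                         # one-hot, (N, K)
        # Refitting step: set each center to the mean of its assigned points (optimal m_l)
        counts = R.sum(axis=0)                                         # points per cluster, (K,)
        nonempty = counts > 0
        M[nonempty] = (R.T @ X)[nonempty] / counts[nonempty, None]
    return M, R
```

Each pass through the loop performs one assignment step and one refitting step; both are exactly the closed-form minimizers derived above.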
[Figure: K-means cost function after each assignment step (blue) and refitting step (red). The algorithm has converged after the third refitting step.]
Local Minima
$$\Longrightarrow\; r^{(n)} = \operatorname{softmax}\!\Big( -\beta \,\big\{ \lVert m_k - x^{(n)} \rVert^2 \big\}_{k=1}^{K} \Big)$$
How to set β?
Clusters with unequal weight and width?
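For concreteness, a small sketch of these soft responsibilities (NumPy; the names X, M, and beta are illustrative, and beta is the free inverse-temperature parameter discussed above):

```python
import numpy as np

def soft_assignments(X, M, beta):
    """Soft responsibilities r^(n) = softmax(-beta * ||m_k - x^(n)||^2 over k)."""
    sq_dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    logits = -beta * sq_dists
    logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
    R = np.exp(logits)
    return R / R.sum(axis=1, keepdims=True)       # each row sums to 1
```

As β grows, the responsibilities concentrate on the nearest center and approach the hard K-means assignments.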
$$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$$
If you know the mean and the
variance, then you can identify the
Gaussian distribution exactly.
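As a quick sanity check of this formula, a sketch (scipy.stats.norm is used only for comparison and is an assumption about available libraries):

```python
import numpy as np
from scipy.stats import norm  # used only to cross-check

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-3.0, 3.0, 7)
assert np.allclose(gaussian_pdf(x, mu=1.0, sigma=2.0), norm.pdf(x, loc=1.0, scale=2.0))
```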
All univariate Gaussians are shifted and scaled versions of the standard normal: if ϵ ∼ N(0, 1), then x = µ + σϵ has distribution N(µ, σ²).
Proof:
E[x] = µ + σE[ϵ] = µ
and
Var(x) = σ² Var(ϵ) = σ²
Since a Gaussian is defined by its mean and its variance, we've shown that x ∼ N(µ, σ²).
I.e., densities are related to the standard normal by scaling by σ (the square root of the variance) and shifting by the mean µ.
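A small simulation illustrating this shift-and-scale relationship (a sketch; the sample size and printed comparison are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5

eps = rng.standard_normal(1_000_000)   # eps ~ N(0, 1)
x = mu + sigma * eps                   # x should be distributed as N(mu, sigma^2)

print(x.mean(), x.var())               # close to mu = 2.0 and sigma^2 = 0.25
```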
Covariance is a matrix
$$\Sigma = \operatorname{Cov}(x) = \mathbb{E}\big[(x - \mu)(x - \mu)^\top\big] = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1D} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2D} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{D1} & \sigma_{D2} & \cdots & \sigma_D^2 \end{pmatrix}$$
$$p(x) = \frac{1}{(2\pi)^{D/2}\,\lvert\Sigma\rvert^{1/2}} \exp\!\Big[ -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \Big]$$
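A direct transcription of this density (a sketch; scipy.stats.multivariate_normal is used only to cross-check the result and is an assumption about available libraries):

```python
import numpy as np
from scipy.stats import multivariate_normal  # used only to cross-check

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density evaluated at a single point x."""
    D = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([0.3, 0.7])
assert np.isclose(mvn_pdf(x, mu, Sigma), multivariate_normal(mu, Sigma).pdf(x))
```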
1. Symmetric, i.e., Σ = Σ^⊤.
2. Positive semi-definite, i.e., v^⊤Σv = v^⊤E[(x − µ)(x − µ)^⊤]v = E[((x − µ)^⊤v)²] ≥ 0 for all v ≠ 0.
▶ A symmetric matrix A is positive semidefinite if v^⊤Av ≥ 0 for all non-zero v. It is positive definite if v^⊤Av > 0 for all non-zero v.
3. For a Gaussian distribution with a density, Σ is positive definite.
4. Cov(Ax + b) = AΣA^⊤ (generalizes Var(ax + b) = a² Var(x)).
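An empirical check of these properties on simulated data (a sketch; the sample covariance only approximates Σ, so the comparisons are approximate, and the matrices A, b are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=200_000)

S = np.cov(X, rowvar=False)                   # sample covariance, approximately Sigma
print(np.allclose(S, S.T))                    # 1. symmetric
print(np.all(np.linalg.eigvalsh(S) >= 0))     # 2. positive semi-definite

A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
b = np.array([3.0, -1.0])
Y = X @ A.T + b                               # rows are A x^(n) + b
print(np.cov(Y, rowvar=False))                # 4. approximately A Sigma A^T
print(A @ Sigma @ A.T)
```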
Multivariate Gaussians are also shifted and scaled versions of the standard multivariate normal distribution!
▶ The standard multivariate normal has µ = 0 and Σ = I
Multivariate analog of the shift is simple: it’s a vector µ
But what about the scale?
▶ But in the multivariate case, the covariance Σ is a matrix!
Does Σ^{1/2} exist, and can we scale by it?
First, what does it mean to scale by a matrix?
For a symmetric matrix A:
1. R^D has an orthonormal basis consisting of the eigenvectors of A.
2. There exists an orthonormal matrix Q and a diagonal matrix Λ such that A = QΛQ^⊤. This is called the spectral decomposition of A.
▶ The columns of Q are (unit) eigenvectors of A.
▶ The diagonal entries of Λ are the eigenvalues of A, in the order corresponding to the eigenvectors in Q.
So, scaling v by a matrix A is QΛQ^⊤v, which consists of rotating v by Q^⊤, scaling by the eigenvalues, then rotating back by Q. This can be thought of intuitively as stretching / compressing space along the eigenvectors of A by the corresponding eigenvalues.
Note: positive definite matrices have positive eigenvalues, and positive semidefinite
matrices have non-negative eigenvalues.
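A small numerical illustration of the spectral decomposition (a sketch using np.linalg.eigh, which is intended for symmetric matrices; the example matrix is illustrative):

```python
import numpy as np

A = np.array([[1.0, 0.5],
              [0.5, 1.0]])                  # symmetric

eigvals, Q = np.linalg.eigh(A)              # eigenvalues (ascending) and orthonormal eigenvectors
Lam = np.diag(eigvals)

print(np.allclose(A, Q @ Lam @ Q.T))        # A = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(2)))      # columns of Q are orthonormal
print(eigvals)                              # [0.5, 1.5]: positive, so A is positive definite
```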
Multivariate Scaling (Intuitive)
To “scale” by a positive definite matrix, you scale along each eigenvector by the corresponding eigenvalue.
For example, consider the matrix (note it's symmetric!)
$$\begin{pmatrix} 2 & 0 \\ 0 & 0.5 \end{pmatrix}$$
and its action on the eigenvectors
$$\begin{pmatrix} 1 \\ 0 \end{pmatrix} \perp \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$
Likewise, consider the symmetric matrix
$$\begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}$$
and its action on the eigenvectors
$$\begin{pmatrix} 1 \\ 1 \end{pmatrix} \perp \begin{pmatrix} -1 \\ 1 \end{pmatrix}$$
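To see this action numerically for the second example (a sketch; the eigenvalues 1.5 and 0.5 follow from the matrix above):

```python
import numpy as np

A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
v1 = np.array([1.0, 1.0])     # eigenvector with eigenvalue 1.5
v2 = np.array([-1.0, 1.0])    # eigenvector with eigenvalue 0.5

print(A @ v1, 1.5 * v1)       # stretched along v1: [1.5, 1.5]
print(A @ v2, 0.5 * v2)       # compressed along v2: [-0.5, 0.5]
```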
For positive definite Σ, consider a spectral decomposition Σ = QΛQ^⊤.
Define Σ^{1/2} = QΛ^{1/2}Q^⊤, where Λ^{1/2} is a diagonal matrix whose diagonal entries are the square roots of the eigenvalues of Σ corresponding to the eigenvectors in Q.
Now, if ϵ ∼ N(0, I) then x = µ + Σ^{1/2}ϵ ∼ N(µ, Σ). To see why:
$$\mathbb{E}[x] = \mu + \Sigma^{1/2}\,\mathbb{E}[\epsilon] = \mu$$
and
$$\operatorname{Cov}(x) = \Sigma^{1/2}\operatorname{Cov}(\epsilon)\big(\Sigma^{1/2}\big)^\top = \Sigma^{1/2}\big(\Sigma^{1/2}\big)^\top = Q\Lambda^{1/2}Q^\top Q\Lambda^{1/2}Q^\top = Q\Lambda Q^\top = \Sigma$$
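Putting these pieces together, a sketch of drawing samples x = µ + Σ^{1/2}ϵ using the eigendecomposition-based square root (the values of µ and Σ are illustrative; the empirical mean and covariance only approximately match):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Matrix square root via the spectral decomposition: Sigma^{1/2} = Q Lambda^{1/2} Q^T
eigvals, Q = np.linalg.eigh(Sigma)
Sigma_half = Q @ np.diag(np.sqrt(eigvals)) @ Q.T

eps = rng.standard_normal((500_000, 2))     # each row is eps ~ N(0, I)
X = mu + eps @ Sigma_half.T                 # each row is x = mu + Sigma^{1/2} eps

print(X.mean(axis=0))                       # approximately mu
print(np.cov(X, rowvar=False))              # approximately Sigma
```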