
Machine Learning Course - CS-433

K-Means Clustering

Nov 15, 2018

© Mohammad Emtiyaz Khan 2015


minor changes by Martin Jaggi 2016
minor changes by Martin Jaggi 2017
changes by Ruediger Urbanke 2018
Last updated: November 13, 2018
Clustering
Clusters are groups of points such that the distances within a cluster (group) are small compared to the distances between clusters (groups).
The clusters are typically defined by finding some “centers”
µ_1, µ_2, . . . , µ_K. Each such center then defines one cluster
via an assignment mapping z_n ∈ {1, 2, . . . , K} for all N
data vectors x_n ∈ R^D.

K-means clustering
Perhaps the best-known clustering algorithm is the K-means
algorithm. Assume that K is known. The optimization that
leads to the clusters is the following:
\min_{z,\mu} \; L(z,\mu) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \|x_n - \mu_k\|_2^2
\quad \text{s.t.} \quad \mu_k \in \mathbb{R}^D, \quad z_{nk} \in \{0,1\}, \quad \sum_{k=1}^{K} z_{nk} = 1,

where

z_n = [z_{n1}, z_{n2}, \ldots, z_{nK}]^\top, \quad z = [z_1, z_2, \ldots, z_N]^\top, \quad \mu = [\mu_1, \mu_2, \ldots, \mu_K]^\top.

Let us discuss this in some more detail. For fixed centers \mu_k,
the cost is minimized if we map each sample to its nearest
center, where we measure distance in terms of the Euclidean
distance. This is accomplished by means of the indicator
variables z_{nk} \in \{0,1\}, where for each sample n we require
\sum_{k=1}^{K} z_{nk} = 1. In words, each sample is assigned to exactly
one center (cluster), and this is indicated by setting the
corresponding indicator variable z_{nk} to 1 and all the other
ones z_{nk'} to 0. In addition, we still have to minimize over the
best choice of centers.
The above description leads very naturally to an algorithm.
Algorithm: Initialize µk ∀k.
Iterate:
1. For all n, compute zn given µ.
2. For all k, compute µk given z.
As we discussed, step (1) is clear. Once we have fixed the centers,
the best assignment is to map each point to the nearest
center. But step (2) is equally natural. If we have fixed assignments,
then it is easy to compute the best centers. This can
be seen formally by taking derivatives of the cost function
w.r.t. \mu_k and solving for the centers. We get:

\mu_k = \frac{\sum_{n=1}^{N} z_{nk} \, x_n}{\sum_{n=1}^{N} z_{nk}}
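In a bit more detail (a standard computation, written out here for completeness): for fixed z, setting the derivative with respect to \mu_k to zero gives

\frac{\partial L}{\partial \mu_k} = -2 \sum_{n=1}^{N} z_{nk} (x_n - \mu_k) = 0
\quad \Longrightarrow \quad
\mu_k \sum_{n=1}^{N} z_{nk} = \sum_{n=1}^{N} z_{nk} \, x_n,

which rearranges to the formula above.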

Hence, the best centers are just the means of each cluster,
which explains the name ‘K-means’.
Let us look at the following example. Figure (a) shows a set
of points. We have K = 2 and two initial centers are shown.
Step 1: Given the centers, in the first step we compute
the optimal assignments. I.e., for all n, compute zn given
µ. The result is shown in Figure (b) via a color-coding. We
map each point to its closest center.
(Figure: panel (a) shows the data points with the two initial centers; panel (b) shows each point colored according to its closest center.)

z_{nk} =
\begin{cases}
1 & \text{if } k = \arg\min_{j=1,2,\ldots,K} \|x_n - \mu_j\|_2^2, \\
0 & \text{otherwise.}
\end{cases}
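As a small illustration, a minimal NumPy sketch of this assignment step (the array names X and mu are assumptions, with X of shape (N, D) holding the samples as rows and mu of shape (K, D) holding the centers as rows):

import numpy as np

def assign_clusters(X, mu):
    # dists[n, k] = ||x_n - mu_k||_2^2 for all pairs, shape (N, K).
    dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # One-hot indicator matrix: z[n, k] = 1 iff mu_k is the closest center to x_n.
    z = np.zeros_like(dists)
    z[np.arange(X.shape[0]), dists.argmin(axis=1)] = 1
    return z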
Step 2: In the second step we now freeze the assignments
and re-compute the best center for each cluster. I.e., for all
k, compute \mu_k given z. As we mentioned, this computation
is just the computation of the mean of each cluster.

(Figure: panel (b) shows the current assignments; panel (c) shows the re-computed centers.)
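A matching sketch of this update step, under the same assumed array names as above (and assuming, for simplicity, that no cluster is empty):

def update_centers(X, z):
    # Number of points assigned to each center, shape (K,).
    counts = z.sum(axis=0)
    # Sum of the points in each cluster divided by the cluster size, shape (K, D).
    return (z.T @ X) / counts[:, None]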


Summary of K-means
Initialize µk ∀k, then iterate:
1. For all n, compute zn given µ.
z_{nk} =
\begin{cases}
1 & \text{if } k = \arg\min_{j} \|x_n - \mu_j\|_2^2, \\
0 & \text{otherwise.}
\end{cases}

2. For all k, compute µk given z.


\mu_k = \frac{\sum_{n=1}^{N} z_{nk} \, x_n}{\sum_{n=1}^{N} z_{nk}}

Convergence to a local optimum is assured since each step
decreases the cost (see Bishop, Exercise 9.1). But note that
we are not guaranteed to reach the globally optimal solution
with this iterative algorithm.
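Putting the two sketched steps together, a hypothetical minimal loop that also records the cost L(z, \mu) after every iteration, so that the monotone decrease (and the dependence on the initialization) can be checked numerically:

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the centers with K distinct data points chosen at random.
    mu = X[rng.choice(X.shape[0], size=K, replace=False)]
    costs = []
    for _ in range(max_iter):
        z = assign_clusters(X, mu)           # step 1: assignments given centers
        mu = update_centers(X, z)            # step 2: centers given assignments
        # z @ mu has the assigned center in row n, so this is exactly L(z, mu).
        costs.append(((X - z @ mu) ** 2).sum())
        # (A real implementation would stop once the assignments no longer change.)
    return z, mu, costs

Running this from several random seeds and keeping the result with the smallest final cost is a common way to mitigate the local-optimum issue.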

Coordinate descent
K-means is a coordinate descent algorithm, where, to find
\min_{z,\mu} L(z,\mu), we start with some \mu^{(0)} and repeat the following:

z^{(t+1)} := \arg\min_{z} \, L(z, \mu^{(t)}),
\mu^{(t+1)} := \arg\min_{\mu} \, L(z^{(t+1)}, \mu).
Examples
K-means for the “old-faithful” dataset (Bishop’s Figure 9.1)

(Figure: nine panels of Bishop’s Figure 9.1 on the Old Faithful data, showing the assignments and the centers over iterations 0 through 4, alternating between the assignment step and the center-update step.)

Consider a second example. In the figures below, the colors
contained in the photo were quantized to a small number
K = 2, 3, . . . , and these “center” colors were chosen by
clustering. This is a form of data compression; it is known
in the signal processing community as vector quantization.
(Figure: the image quantized with K = 2, K = 3, and K = 10 colors, next to the original image.)
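A hedged sketch of how such a quantized image could be produced with the kmeans routine sketched earlier (the array img is an assumption: an H × W × 3 array of RGB values, with every pixel treated as a point in R^3):

def quantize_colors(img, K):
    H, W, _ = img.shape
    pixels = img.reshape(-1, 3).astype(float)   # every pixel becomes a 3-d sample
    z, mu, _ = kmeans(pixels, K)                # cluster the pixel colors
    # Replace every pixel by the center color of its cluster.
    return (z @ mu).reshape(H, W, 3)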

Probabilistic model for K-means


So far we have presented K-means using a geometric reasoning.
But like so many other models in ML, it also has a
probabilistic interpretation.
Assume that, conditioned on a point being associated with cluster
k, we consider it a sample from the D-dimensional Gaussian
with mean \mu_k and covariance matrix I. I.e., the likelihood
of a sample x given the cluster assignment z and the centers
\mu = \{\mu_k\}_{k=1}^{K} is

p(x \mid z, \mu) = \prod_{k=1}^{K} \left[ \mathcal{N}(x \mid \mu_k, I) \right]^{z_k}.

Then the likelihood associated with the whole data set S_{\text{train}} = \{x_n\}_{n=1}^{N} is

p(X \mid z, \mu) = \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ \mathcal{N}(x_n \mid \mu_k, I) \right]^{z_{nk}}.

Taking, as always, the log and multiplying by −1 in order
to get a minimization problem, this gives us

-\log p(X \mid z, \mu) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \|x_n - \mu_k\|_2^2 + \text{const},

which, up to the additive constant and the overall factor of 1/2, corresponds exactly to our original formulation.
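For completeness, the intermediate step (standard Gaussian algebra, spelled out here): with covariance matrix I, each factor contributes

-\log \mathcal{N}(x_n \mid \mu_k, I) = \frac{1}{2} \|x_n - \mu_k\|_2^2 + \frac{D}{2} \log(2\pi),

so summing the terms selected by the indicators z_{nk} gives the expression above; the additive constant and the factor 1/2 do not change the minimizing z and \mu.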

K-means as a Matrix Factorization


K-means can be interpreted as a matrix factorization problem.
We will learn much more about the matrix factorization
problem in a future lecture, but the set-up is very natural.
We can write:
\min_{z,\mu} \; L(z,\mu) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \|x_n - \mu_k\|_2^2 = \|X^\top - M Z^\top\|_{\text{Frob}}^2
\quad \text{s.t.} \quad \mu_k \in \mathbb{R}^D, \quad z_{nk} \in \{0,1\}, \quad \sum_{k=1}^{K} z_{nk} = 1.

As always, X is the N × D data matrix whose rows are the
individual feature vectors; hence X^\top has dimensions D × N.
The matrix Z is the N × K matrix whose rows are the
indicator vectors z_n, and M is the real-valued matrix of
dimension D × K whose columns are the centers \mu_k. Note
that MZ^\top is a D × N matrix whose n-th column is one of
the K centers \mu_k, namely the center assigned to the n-th
sample. Therefore X^\top − MZ^\top is a D × N matrix that
contains in its n-th column the difference between the n-th
sample and the center that we assigned it to. The Frobenius
norm squares all these entries and sums them up; this is
hence exactly the K-means cost. In order to make this
Frobenius norm small, our task is to “factorize” the matrix
X^\top in the form MZ^\top, where the matrix Z is restricted
to entries from {0, 1} and each row of Z contains exactly a
single 1. This is why we say that this
is a matrix factorization formulation.
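A small numerical check of this identity, as a sketch reusing the assign_clusters routine and random data (all names here are assumptions, not part of the lecture):

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))                    # N = 50 samples in D = 2
mu = rng.normal(size=(3, 2))                    # K = 3 centers
Z = assign_clusters(X, mu)                      # N x K indicator matrix
M = mu.T                                        # D x K matrix of centers

cost_sum = ((X - Z @ mu) ** 2).sum()                        # double-sum form of L
cost_frob = np.linalg.norm(X.T - M @ Z.T, ord='fro') ** 2   # Frobenius-norm form
assert abs(cost_sum - cost_frob) < 1e-9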
Issues with K-means
1. Computation can be heavy for large N, D and K.
2. Clusters are forced to be spherical (e.g., they cannot be elliptical).
3. Each example can belong to only one cluster (“hard”
cluster assignments).
