
STA 314: Statistical Methods for Machine Learning I

Lecture 7 - Unsupervised Learning, K-Means

Chris J. Maddison

University of Toronto



Unsupervised learning

What can we do without labels?


How can we even define what learning means without labels?
Unsupervised learning is the study of learning without labels.
In some sense, the ML community does not exactly agree on what it means to do unsupervised learning, but intuitively, unsupervised learning is the task of grouping, explaining, and finding structure in data.



Motivating Examples

Some examples of situations where you'd use unsupervised learning:


▶ You want to understand how a scientific field has changed over time.
You want to take a large database of papers and model how the
distribution of topics changes from year to year. But what are the
topics?
▶ You’re a biologist studying animal behavior, so you want to infer a
high-level description of their behavior from video. You don’t know the
set of behaviors ahead of time.
▶ You want to reduce your energy consumption, so you take a time series of your energy consumption and try to break it down into
separate components (refrigerator, washing machine, etc.).
Common themes:
▶ you have some data, and you want to infer some structure underlying
the data.
▶ you have some data, and you want to be able to make more data that looks similar.



Clustering

Today we will look at the simplest type of unsupervised learning: clustering.
Clustering is the task of organizing data into groups or clusters.
We will study the simplest method for doing this: the K-means algorithm.



Clustering

Sometimes the data form clusters, where samples within a cluster are
similar to each other, and samples in different clusters are dissimilar:

Such a distribution is multimodal, since it has multiple modes, or regions of high probability mass.
Grouping data points into clusters, with no observed labels, is called
clustering. It is an unsupervised learning technique.
E.g. clustering machine learning papers based on topic (deep learning,
Bayesian models, etc.)
▶ But topics are never observed (unsupervised).
Clustering problem

Assume the data {x^(1), ..., x^(N)} live in a Euclidean space, x^(n) ∈ R^D.
Assume each data point belongs to one of K clusters.
Assume the data points from the same cluster are similar, i.e. close in Euclidean distance.
How can we identify those clusters (the data points that belong to each cluster)? Let's formulate this as an optimization problem.
K-means Objective
[Figure: data points x^(n) and cluster centers m_k, with an assignment r_k^(n) = 1 marking the center to which a point belongs.]

K-means Objective: Find centers {m_k}_{k=1}^K and assignments {r^(n)}_{n=1}^N to minimize the sum of squared distances of data points {x^(n)} to their assigned centers.
Data sample n = 1, .., N: x^(n) ∈ R^D (observed),
Cluster center k = 1, .., K: m_k ∈ R^D (not observed),
Responsibilities: cluster assignment for sample n: r^(n) ∈ R^K, 1-of-K encoding (not observed).
K-means Objective

K-means Objective: Find cluster centers {m_k}_{k=1}^K and assignments {r^(n)}_{n=1}^N to minimize the sum of squared distances of data points {x^(n)} to their assigned centers.
▶ Data sample n = 1, .., N: x^(n) ∈ R^D (observed),
▶ Cluster center k = 1, .., K: m_k ∈ R^D (not observed),
▶ Responsibilities: cluster assignment for sample n: r^(n) ∈ R^K, 1-of-K encoding (not observed).
Mathematically:

    min_{{m_k},{r^(n)}} R̂({m_k}, {r^(n)}) = min_{{m_k},{r^(n)}} ∑_{n=1}^N ∑_{k=1}^K r_k^(n) ||m_k − x^(n)||^2

where r_k^(n) = I[x^(n) is assigned to cluster k], i.e., r^(n) = [0, .., 1, .., 0].
Finding an optimal solution is an NP-hard problem!
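To make the objective concrete, here is a small NumPy sketch (an illustrative addition, not from the lecture) that evaluates R̂ given data X, centers M, and a 1-of-K responsibility matrix R; the array names are assumptions.

    import numpy as np

    def kmeans_cost(X, M, R):
        # X: (N, D) data points x^(n); M: (K, D) centers m_k; R: (N, K) 1-of-K responsibilities
        sq_dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        return (R * sq_dists).sum()  # only the assigned center contributes for each point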


K-means Objective
Optimization problem:
    min_{{m_k},{r^(n)}} ∑_{n=1}^N ∑_{k=1}^K r_k^(n) ||m_k − x^(n)||^2
    (the summand is the squared distance between x^(n) and its assigned cluster center)

Since r_k^(n) = I[x^(n) is assigned to cluster k], i.e., r^(n) = [0, .., 1, .., 0], the inner sum is over K terms but only one of them is non-zero.
E.g. say sample x^(n) is assigned to cluster k = 3; then r^(n) = [0, 0, 1, 0, ...] and

    ∑_{k=1}^K r_k^(n) ||m_k − x^(n)||^2 = ||m_3 − x^(n)||^2


How to optimize?: Alternating Minimization

Optimization problem:
    min_{{m_k},{r^(n)}} ∑_{n=1}^N ∑_{k=1}^K r_k^(n) ||m_k − x^(n)||^2

The problem is hard when minimizing jointly over the parameters {m_k}, {r^(n)}.
But note that if we fix one and minimize over the other, then it becomes easy.
This doesn't guarantee the same solution as the joint minimization, though!


How to optimize?: Alternating Minimization
Optimization problem:
    min_{{m_k},{r^(n)}} ∑_{n=1}^N ∑_{k=1}^K r_k^(n) ||m_k − x^(n)||^2

Fix the centers {m_k}; then we can easily find the optimal assignments {r^(n)} for each sample n:

    min_{r^(n)} ∑_{k=1}^K r_k^(n) ||m_k − x^(n)||^2

▶ Assign each point to the cluster with the nearest center:

    r_k^(n) = 1 if k = arg min_j ||x^(n) − m_j||^2, and 0 otherwise

▶ E.g. if x^(n) is assigned to cluster k̂,

    r^(n) = [0, 0, ..., 1, ..., 0]    (only the k̂-th entry is 1)
Alternating Minimization

Likewise, if we fix the assignments {r^(n)}, then we can easily find the optimal centers {m_k}.
▶ Set each cluster's center to the average of its assigned data points: for l = 1, 2, ..., K

    0 = ∂/∂m_l ∑_{n=1}^N ∑_{k=1}^K r_k^(n) ||m_k − x^(n)||^2 = 2 ∑_{n=1}^N r_l^(n) (m_l − x^(n))   ⟹   m_l = ∑_n r_l^(n) x^(n) / ∑_n r_l^(n)

Let's alternate between minimizing R̂({m_k}, {r^(n)}) with respect to {m_k} and with respect to {r^(n)}.
This is called alternating minimization.


K-means Algorithm
High-level overview of the algorithm:

Initialization: randomly initialize cluster centers


The algorithm iteratively alternates between two steps:
▶ Assignment step: Assign each data point to the closest cluster
▶ Refitting step: Move each cluster center to the mean of the data
assigned to it

[Figure: assignments (left) and refitted means (right).]



Figure from Bishop. Simple demo: http://syskall.com/kmeans.js/
The K-means Algorithm

Initialization: Set K cluster means m_1, . . . , m_K to random values.

Repeat until convergence (until assignments do not change):
▶ Assignment: Optimize R̂ w.r.t. {r}: each data point x^(n) is assigned to the nearest center,

    k̂ = arg min_k ||m_k − x^(n)||^2

  and the responsibilities use a 1-hot (1-of-K) encoding,

    r_k^(n) = I[k̂ = k]   for k = 1, .., K

▶ Refitting: Optimize R̂ w.r.t. {m}: each center is set to the mean of the data assigned to it,

    m_k = ∑_n r_k^(n) x^(n) / ∑_n r_k^(n)
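As a concrete illustration of the two steps above, here is a minimal NumPy sketch (an assumption for illustration, not the course's reference implementation; the function name kmeans and the initialization by sampling K data points are choices made here).

    import numpy as np

    def kmeans(X, K, max_iters=100, seed=0):
        # Returns centers M of shape (K, D) and hard assignments of shape (N,).
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        N = X.shape[0]
        # Initialization: use K randomly chosen data points as the initial centers
        M = X[rng.choice(N, size=K, replace=False)].copy()
        assign = None
        for _ in range(max_iters):
            # Assignment step: each point goes to its nearest center
            sq_dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # (N, K)
            new_assign = sq_dists.argmin(axis=1)
            if assign is not None and np.array_equal(new_assign, assign):
                break  # assignments did not change: converged
            assign = new_assign
            # Refitting step: move each center to the mean of its assigned points
            for k in range(K):
                if np.any(assign == k):
                    M[k] = X[assign == k].mean(axis=0)
        return M, assign

The convergence test mirrors the slide: stop once the assignment step leaves the assignments unchanged.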


K-means for Vector Quantization

Figure from Bishop

Given an image, construct a “dataset” of pixels represented by their RGB intensities.
Run K-means, then replace each pixel by its cluster center.
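A sketch of this vector-quantization recipe, using scikit-learn's KMeans as one possible implementation; the random array standing in for a real image is an assumption for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    image = rng.random((64, 64, 3))              # stand-in for a real H x W x 3 RGB image

    pixels = image.reshape(-1, 3)                # "dataset" of pixels, one row per pixel
    km = KMeans(n_clusters=8, n_init=10).fit(pixels)
    quantized = km.cluster_centers_[km.labels_].reshape(image.shape)  # each pixel -> its cluster center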
K-means for Image Segmentation

Given an image, construct a “dataset” of pixels, represented by their RGB intensities and grid locations.
Run K-means (with some modifications) to get superpixels.
Questions about K-means

Why does update set mk to mean of assigned points?


What if we used a different distance measure?
How can we choose the best distance?
How to choose K ?
Will it converge?

Hard cases – unequal spreads, non-circular spreads, in-between points



Why K-means Converges
The K-means algorithm reduces the cost at each iteration.
▶ Whenever an assignment is changed, the sum of squared distances R̂ of data points from their assigned cluster centers is reduced.
▶ Whenever a cluster center is moved, R̂ is reduced.

Test for convergence: If the assignments do not change in the assignment step, we have converged (to at least a local minimum).
This will always happen after a finite number of iterations, since the number of possible cluster assignments is finite.

K-means cost function after each assignment step (blue) and refitting step
(red). The algorithm has converged after the third refitting step.
Local Minima

A bad local optimum

There is nothing to prevent k-means getting stuck at local minima.
We could try many random starting points.



Soft K-means

Instead of making hard assignments of data points to clusters, we can make soft assignments. One cluster may have a responsibility of 0.7 for a datapoint and another may have a responsibility of 0.3.
▶ Allows a cluster to use more information about the data in the refitting
step.
▶ How do we decide on the soft assignments?
▶ We already saw this in multi-class classification:
▶ 1-of-K encoding vs softmax assignments



Soft K-means Algorithm

Initialization: Set K means {m_k} to random values.

Repeat until convergence (measured by how much R̂ changes):
▶ Assignment: Each data point n is given a soft “degree of assignment” to each cluster mean k, based on the responsibilities

    r_k^(n) = exp[−β ||m_k − x^(n)||^2] / ∑_j exp[−β ||m_j − x^(n)||^2]

  i.e., r^(n) = softmax(−β {||m_k − x^(n)||^2}_{k=1}^K)

▶ Refitting: The model parameters (the means) are adjusted to match the sample means of the datapoints they are responsible for:

    m_k = ∑_n r_k^(n) x^(n) / ∑_n r_k^(n)
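A minimal sketch of one soft K-means iteration (an illustrative addition; the helper name soft_kmeans_step is an assumption, and how β is set is left to the user).

    import numpy as np

    def soft_kmeans_step(X, M, beta):
        # One iteration: compute soft responsibilities, then refit the means.
        logits = -beta * ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        logits -= logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
        R = np.exp(logits)
        R /= R.sum(axis=1, keepdims=True)             # row-wise softmax: responsibilities
        M_new = (R.T @ X) / R.sum(axis=0)[:, None]    # responsibility-weighted means
        return R, M_new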


Questions about Soft K-means

Some remaining issues

How to set β?
Clusters with unequal weight and width?

These aren't straightforward to address with K-means. Instead, in the sequel, we'll reformulate clustering using a generative model.

As β → ∞, soft k-Means becomes k-Means! (Exercise)



PCA Overview

We now turn to the second unsupervised learning algorithm for this course: principal component analysis (PCA).
Dimensionality reduction: map data to a lower dimensional space
▶ Save computation/memory
▶ Reduce overfitting, achieve better generalization
▶ Visualize in 2 dimensions
PCA is a linear model. It’s useful for understanding lots of other
algorithms.
▶ Autoencoders
▶ Matrix factorizations (next week)
PCA is finding linear structure in data.



Gaussian distribution

First, we will review the Gaussian, or normal, distribution.


The Central Limit Theorem says that sums of lots of independent
random variables are approximately Gaussian. So, Gaussians show up
everywhere in nature.
In machine learning, we use Gaussians a lot because they make the
calculations easy.



Gaussian distribution

The normal distribution is parameterized by µ ∈ R and σ^2 > 0:

    N(x; µ, σ^2) = (1 / √(2πσ^2)) exp( −(x − µ)^2 / (2σ^2) )

If you know the mean and the variance, then you can identify the Gaussian distribution exactly.


Shifting + scaling univariate Gaussians

All univariate Gaussians are shifted and scaled versions of the standard normal: if ϵ ∼ N(0, 1), then x = µ + σϵ has distribution N(µ, σ^2).

Proof:

    E[x] = µ + σE[ϵ] = µ

and

    Var(x) = σ^2 Var(ϵ) = σ^2

Since a Gaussian is defined by its mean and its variance, we've shown that x ∼ N(µ, σ^2).
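A quick numerical sanity check of this reparameterization (an illustrative addition, not from the slides; the particular µ and σ are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 2.0, 3.0
    eps = rng.standard_normal(1_000_000)   # samples from N(0, 1)
    x = mu + sigma * eps                   # reparameterized samples, approximately N(mu, sigma^2)
    print(x.mean(), x.var())               # close to 2.0 and 9.0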


Shifting + scaling univariate Gaussians

I.e., densities are related to the standard normal by scaling by the standard deviation σ (the square root of the variance) and shifting by the mean µ:



Multivariate Gaussian statistics

Similar intuitions apply to multivariate Gaussians!

A multivariate Gaussian (or multivariate normal) distribution is uniquely defined by a mean vector µ and a covariance matrix Σ, denoted N(µ, Σ) or N(x; µ, Σ).

The mean is a vector:

    µ = E[x] = (µ_1, ..., µ_d)^⊤

The covariance is a matrix:

    Σ = Cov(x) = E[(x − µ)(x − µ)^⊤] =
        ( σ_1^2   σ_12    ⋯   σ_1D  )
        ( σ_12    σ_2^2   ⋯   σ_2D  )
        (  ⋮       ⋮      ⋱    ⋮    )
        ( σ_D1    σ_D2    ⋯   σ_D^2 )


Multivariate Gaussian Distribution

A normally distributed variable x ∼ N(µ, Σ) has density:

    p(x) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp[ −(1/2) (x − µ)^⊤ Σ^{−1} (x − µ) ]
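To connect the formula to code, here is a sketch that evaluates this density directly and compares it against scipy.stats.multivariate_normal; the particular µ, Σ, and test point are made-up values for illustration.

    import numpy as np
    from scipy.stats import multivariate_normal

    def gaussian_pdf(x, mu, Sigma):
        d = len(mu)
        diff = x - mu
        quad = diff @ np.linalg.solve(Sigma, diff)                    # (x - mu)^T Sigma^{-1} (x - mu)
        norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * quad) / norm_const

    mu = np.array([0.0, 1.0])
    Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
    x = np.array([0.2, 0.3])
    print(gaussian_pdf(x, mu, Sigma))
    print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))             # should agree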


Key properties of covariances
Covariances are:
1. symmetric, i.e., Σ^⊤ = Σ
2. positive semi-definite, i.e., v^⊤ Σ v = v^⊤ E[(x − µ)(x − µ)^⊤] v = E[((x − µ)^⊤ v)^2] ≥ 0 for all v ≠ 0
   ▶ A symmetric matrix A is positive semidefinite if v^⊤ A v ≥ 0 for all non-zero v. It is positive definite if v^⊤ A v > 0 for all non-zero v.
3. For a Gaussian distribution with a density, Σ is positive definite.

Some other key properties:
4. Σ = E[xx^⊤] − µµ^⊤ (generalizes Var(x) = E[x^2] − µ^2)
5. Cov(Ax + b) = A Σ A^⊤ (generalizes Var(ax + b) = a^2 Var(x))
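A short numerical illustration of property 5, Cov(Ax + b) = AΣA^⊤ (the particular A, b, and Σ are arbitrary choices for the check):

    import numpy as np

    rng = np.random.default_rng(0)
    Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
    A = np.array([[1.0, 2.0], [0.0, 3.0]])
    b = np.array([5.0, -1.0])

    x = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=500_000)  # x ~ N(0, Sigma)
    y = x @ A.T + b
    print(np.cov(y, rowvar=False))   # empirical covariance of y, approximately...
    print(A @ Sigma @ A.T)           # ...the predicted A Sigma A^T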


Shifting and scaling a multivariate Gaussian

Multivariate Gaussians are also shifted and scaled versions of the standard multivariate normal distribution!
▶ The standard multivariate normal has µ = 0 and Σ = I.
The multivariate analog of the shift is simple: it's a vector µ.
But what about the scale?
▶ In the multivariate case, the covariance Σ is a matrix!
Does Σ^{1/2} exist, and can we scale by it?
First, what does it mean to scale by a matrix?


Multivariate Scaling (Formal)
Recall from linear algebra (without proof) the Spectral Theorem: if A ∈ R^{d×d} is symmetric, then
1. R^d has an orthonormal basis consisting of the eigenvectors of A.
2. There exists an orthonormal matrix Q and a diagonal matrix Λ such that A = QΛQ^⊤. This is called the spectral decomposition of A.
   ▶ The columns of Q are (unit) eigenvectors of A.
   ▶ The diagonal entries of Λ are the eigenvalues of A, in the order corresponding to the eigenvectors in Q.

So, scaling v by a matrix A is QΛQ^⊤ v, which consists of rotating v by Q^⊤, scaling by the eigenvalues, then rotating back by Q. This can be thought of intuitively as stretching / compressing space along the eigenvectors of A by the corresponding eigenvalues.

Note: positive definite matrices have positive eigenvalues, and positive semidefinite matrices have non-negative eigenvalues.
Multivariate Scaling (Intuitive)
To “scale” by a positive definite matrix, you scale along each eigenvector by the
corresponding eigenvalues:
Consider the matrix (note it's symmetric!)

    ( 2    0  )
    ( 0   0.5 )

and its action on the eigenvectors (1, 0) ⊥ (0, 1).


Multivariate Scaling (Intuitive) (optional draw-on slide for intuition)

To “scale” by a positive definite matrix, you scale along each eigenvector by the
corresponding eigenvalues:
Consider the matrix (note it's symmetric!)

    ( 1    0.5 )
    ( 0.5   1  )

and its action on the eigenvectors (1, 1) ⊥ (−1, 1).


Shifting and scaling a multivariate Gaussian
So here is the “scale” intuition:


For positive definite Σ, consider a spectral decomposition QΛQ .
1/2 1/2 ⊤ 1/2
Define Σ = QΛ Q , where Λ is a diagonal matrix whose diagonal
entries are the square roots of the eigenvalues of Σ corresponding to the
eigenvectors in Q.
1/2
Now, if ϵ ∼ N (0, I) then x = µ + Σ ϵ ∼ N (µ, Σ). To see why:
1/2
E[x] = µ + Σ E[ϵ] = µ

and
1/2 1/2 1/2 1/2 1/2 1/2
Cov(x) = Σ Cov(ϵ)(Σ ) = Σ (Σ ) = QΛ Q QΛ Q
⊤ ⊤ ⊤ ⊤


= QΛQ = Σ

What effect does this have on the density? Let’s explore.

Intro ML (UofT) STA314-Lec7 36 / 43
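A NumPy sketch of this construction (an illustrative addition; np.linalg.eigh is used because Σ is symmetric, and the particular µ and Σ are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])     # positive definite

    # Spectral decomposition Sigma = Q Lambda Q^T, then Sigma^{1/2} = Q Lambda^{1/2} Q^T
    eigvals, Q = np.linalg.eigh(Sigma)
    Sigma_half = Q @ np.diag(np.sqrt(eigvals)) @ Q.T

    # Shift and scale standard normal samples: x = mu + Sigma^{1/2} eps ~ N(mu, Sigma)
    eps = rng.standard_normal((500_000, 2))
    x = mu + eps @ Sigma_half.T
    print(x.mean(axis=0))              # approximately mu
    print(np.cov(x, rowvar=False))     # approximately Sigma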


Bivariate Gaussian
    Σ = ( 1  0 )      Σ = 0.5 ( 1  0 )      Σ = 2 ( 1  0 )
        ( 0  1 )              ( 0  1 )            ( 0  1 )

[Figure: the probability density function and a contour plot of the pdf for each Σ.]


Bivariate Gaussian
    Σ = ( 1  0 )      Σ = ( 2  0 )      Σ = ( 1  0 )
        ( 0  1 )          ( 0  1 )          ( 0  2 )

[Figure: the probability density function and a contour plot of the pdf for each Σ.]
Bivariate Gaussian
    Σ = ( 1  0 )      Σ = ( 1   0.5 ) = Q_1 ( 1.5   0  ) Q_1^⊤      Σ = ( 1   0.8 ) = Q_2 ( 1.8   0  ) Q_2^⊤
        ( 0  1 )          ( 0.5   1 )       (  0   0.5 )                ( 0.8   1 )       (  0   0.2 )

Test your intuition: Does Q_1 = Q_2?

[Figure: the probability density function and a contour plot of the pdf for each Σ.]
Bivariate Gaussian
    Σ = ( 1  0 )      Σ = ( 1   0.5 ) = Q_1 ( 1.5   0  ) Q_1^⊤      Σ = (  1   −0.5 ) = Q_2 ( λ_1   0  ) Q_2^⊤
        ( 0  1 )          ( 0.5   1 )       (  0   0.5 )                ( −0.5   1  )       (  0   λ_2 )

Test your intuition: Does Q_1 = Q_2? What are λ_1 and λ_2?

[Figure: the probability density function and a contour plot of the pdf for each Σ.]
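One way to check these intuitions numerically, using the eigendecompositions of the two non-diagonal covariances (an illustrative addition, not part of the slides):

    import numpy as np

    for Sigma in (np.array([[1.0, 0.5], [0.5, 1.0]]),
                  np.array([[1.0, -0.5], [-0.5, 1.0]])):
        eigvals, Q = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
        print("eigenvalues:", eigvals)
        print("eigenvector columns:")
        print(Q)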
Bivariate Gaussian

[Figure: further examples of bivariate Gaussian densities and contour plots.]


PCA

Next week: PCA.


Dimensionality reduction: map data to a lower dimensional space
PCA is finding linear structure in data.

