CSC 311: Introduction to Machine Learning
Lecture 11 - k-Means and EM Algorithm
    Roger Grosse     Rahul G. Krishnan             Guodong Zhang
                    University of Toronto, Fall 2021
Overview
   In the previous lecture, we covered PCA, Autoencoders and
   Matrix Factorization—all unsupervised learning algorithms.
      I   Each algorithm can be used to approximate high dimensional data
          using some lower dimensional form.
   Those methods made an interesting assumption: the data depend on
   latent variables that are never observed. Such models are called
   latent variable models.
      I   For PCA, these correspond to the code vectors (representation).
   Today’s lecture:
      I   First, introduce K-means, a simple algorithm for clustering, i.e.
          grouping data points into clusters
      I   Then, we will reformulate clustering as a latent variable model
          and apply the EM algorithm
Clustering
   Sometimes the data form clusters, where samples within a cluster
   are similar to each other, and samples in different clusters are
   dissimilar:
   Such a distribution is multimodal, since it has multiple modes, or
   regions of high probability mass.
   Grouping data points into clusters, with no observed labels, is
   called clustering. It is an unsupervised learning technique.
   E.g. clustering machine learning papers based on topic (deep
   learning, Bayesian models, etc.)
      I   But topics are never observed (unsupervised).
K-means intuition
   K-means assumes there are k clusters, and each point is close to its
   cluster center, or mean (the mean of points in the cluster).
      I   If we knew the cluster assignment, we could easily compute the
          centers.
      I   If we knew the centers, we could easily compute cluster assignment.
      I   Chicken and egg problem!
   Very simple (and useful) heuristic - start randomly and alternate
   between the two!
K-means
   Initialization: randomly initialize cluster centers
   The algorithm iteratively alternates between two steps:
      I   Assignment step: Assign each data point to the closest cluster
      I   Refitting step: Move each cluster center to the center of gravity of
          the data assigned to it
   [Figure: the assignments after the assignment step (left) and the refitted means after the refitting step (right).]
Figure from Bishop                     Simple demo: http://syskall.com/kmeans.js/
K-means Objective
What is actually being optimized?
  K-means Objective:
  Find cluster centers m and assignments r to minimize the sum of squared
  distances of data points {x(n) } to their assigned cluster centers
           \min_{\{m\},\{r\}} J(\{m\},\{r\}) = \min_{\{m\},\{r\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_k^{(n)} \|m_k - x^{(n)}\|^2

                    s.t. \sum_k r_k^{(n)} = 1 \;\forall n, \quad\text{where } r_k^{(n)} \in \{0,1\} \;\forall k, n

   where r_k^{(n)} = 1 means that x^{(n)} is assigned to cluster k (with center m_k)
    Finding the exact optimum can be shown to be NP-hard.
    K-means can be seen as block coordinate descent on this objective
    (analogous to ALS for matrix completion)
        I   Assignment step = minimize w.r.t. {r_k^{(n)}}
        I   Refitting step = minimize w.r.t. {m_k}
How to optimize?: Alternating Minimization
Optimization problem:
                  \min_{\{m_k\},\{r^{(n)}\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_k^{(n)} \|m_k - x^{(n)}\|^2
    If we fix the centers {mk } then we can easily find the optimal
    assignments {r(n) } for each sample n
                                 \min_{r^{(n)}} \sum_{k=1}^{K} r_k^{(n)} \|m_k - x^{(n)}\|^2
       I   Assign each point to the cluster with the nearest center
                  r_k^{(n)} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x^{(n)} - m_j\|^2 \\ 0 & \text{otherwise} \end{cases}
       I   E.g. if x(n) is assigned to cluster k̂,
                        r^{(n)} = [0, 0, \ldots, 1, \ldots, 0]^\top   (only the \hat{k}-th entry is 1)
Alternating Minimization
    Likewise, if we fix the assignments {r^{(n)}}, then we can easily find the
    optimal centers {m_k}:
            0 = \frac{\partial}{\partial m_l} \sum_{n=1}^{N} \sum_{k=1}^{K} r_k^{(n)} \|m_k - x^{(n)}\|^2
              = 2 \sum_{n=1}^{N} r_l^{(n)} (m_l - x^{(n)})
            \implies m_l = \frac{\sum_n r_l^{(n)} x^{(n)}}{\sum_n r_l^{(n)}}
    K-Means simply alternates between minimizing w.r.t. assignments and
   centers. This is an instance of alternating minimization, or block
   coordinate descent.
The K-means Algorithm
   Initialization: Set K cluster means m1 , . . . , mK to random values
   Repeat until convergence (until assignments do not change):
       I   Assignment: Optimize J w.r.t. {r}: Each data point x^{(n)} is assigned
           to the nearest center:
                          \hat{k}^{(n)} = \arg\min_k \|m_k - x^{(n)}\|^2
          and Responsibilities (1-hot or 1-of-K encoding)
                          r_k^{(n)} = \mathbb{I}[\hat{k}^{(n)} = k]   for k = 1, \ldots, K
      I   Refitting: Optimize J w.r.t. {m}: Each center is set to mean of
          data assigned to it
                                  m_k = \frac{\sum_n r_k^{(n)} x^{(n)}}{\sum_n r_k^{(n)}}.
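    To make the alternation concrete, here is a minimal NumPy sketch of the algorithm above. The function name kmeans, the initialization at K randomly chosen data points, and the iteration cap are illustrative choices rather than anything specified on the slides.

        import numpy as np

        def kmeans(X, K, max_iters=100, seed=0):
            """Minimal k-means sketch. X: (N, D) array. Returns (centers, assignments)."""
            rng = np.random.default_rng(seed)
            # Initialize centers at K distinct data points (one common choice).
            centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
            assign = None
            for _ in range(max_iters):
                # Assignment step: each point goes to its nearest center.
                dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
                new_assign = dists.argmin(axis=1)
                if assign is not None and np.array_equal(new_assign, assign):
                    break  # assignments unchanged => converged (to a local optimum)
                assign = new_assign
                # Refitting step: move each center to the mean of the points assigned to it.
                for k in range(K):
                    if np.any(assign == k):
                        centers[k] = X[assign == k].mean(axis=0)
            return centers, assign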
Why K-means Converges
   K-means algorithm reduces the cost at each iteration.
     I Whenever an assignment is changed, the sum squared distances J of
       data points from their assigned cluster centers is reduced.
     I Whenever a cluster center is moved, J is reduced.
   Test for convergence: If the assignments do not change in the
   assignment step, we have converged (to at least a local minimum).
   This will always happen after a finite number of iterations, since the
   number of possible cluster assignments is finite
    [Figure: the K-means cost function after each assignment step (blue) and refitting
    step (red); the algorithm has converged after the third refitting step.]
Local Minima
    The objective J is non-convex, so coordinate descent on J is not
    guaranteed to converge to the global minimum.
    There is nothing to prevent k-means getting stuck at local minima.
    We could try many random starting points.
    [Figure: a bad local optimum.]
K-means for Vector Quantization
Figure from Bishop
       Given an image, construct a “dataset” of pixels, each represented by its RGB
       intensities
       Run k-means, then replace each pixel by its cluster center
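       A rough sketch of this color quantization, assuming scikit-learn's KMeans is
       available; the function name quantize_colors and the choice K = 8 are illustrative.

        import numpy as np
        from sklearn.cluster import KMeans

        def quantize_colors(img, K=8):
            """Vector quantization: replace each pixel's color by its cluster center."""
            H, W, C = img.shape
            pixels = img.reshape(-1, C).astype(float)          # the "dataset" of RGB pixels
            km = KMeans(n_clusters=K, n_init=10).fit(pixels)   # run k-means on pixel colors
            quantized = km.cluster_centers_[km.labels_]        # look up each pixel's center
            return quantized.reshape(H, W, C)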
K-means for Image Segmentation
    Given an image, construct a “dataset” of pixels, represented by their RGB
    intensities and grid locations
   Run k-means (with some modifications) to get superpixels
Soft K-means
   Instead of making hard assignments of data points to clusters, we can
   make soft assignments. One cluster may have a responsibility of .7 for a
   datapoint and another may have a responsibility of .3.
      I   Allows a cluster to use more information about the data in the
          refitting step.
      I   How do we decide on the soft assignments?
      I   We already saw this in multi-class classification:
            I   1-of-K encoding vs softmax assignments
Soft K-means Algorithm
   Initialization: Set K means {mk } to random values
   Repeat until convergence (measured by how much J changes):
      I   Assignment: Each data point n given soft “degree of assignment” to
          each cluster mean k, based on responsibilities
                  r_k^{(n)} = \frac{\exp[-\beta \|m_k - x^{(n)}\|^2]}{\sum_j \exp[-\beta \|m_j - x^{(n)}\|^2]}

             \implies r^{(n)} = \mathrm{softmax}\big(-\beta \{\|m_k - x^{(n)}\|^2\}_{k=1}^{K}\big)
      I   Refitting: Model parameters, means, are adjusted to match sample
          means of datapoints they are responsible for:
                                     m_k = \frac{\sum_n r_k^{(n)} x^{(n)}}{\sum_n r_k^{(n)}}
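    A minimal NumPy sketch of one soft k-means iteration as described above; subtracting the per-row maximum before exponentiating is a standard numerical-stability trick, not part of the slide.

        import numpy as np

        def soft_kmeans_step(X, centers, beta):
            """One soft k-means iteration: soft assignments, then refitting."""
            # Soft assignment: softmax over negative (beta-scaled) squared distances.
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
            logits = -beta * dists
            logits -= logits.max(axis=1, keepdims=True)   # subtract row max for stability
            R = np.exp(logits)
            R /= R.sum(axis=1, keepdims=True)             # responsibilities, rows sum to 1
            # Refitting: each center becomes the responsibility-weighted mean of the data.
            new_centers = (R.T @ X) / R.sum(axis=0)[:, None]
            return new_centers, R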
Questions about Soft K-means
Some remaining issues
    How to set β?
    Clusters with unequal weight and width?
These aren’t straightforward to address with K-means. Instead, in the sequel,
we’ll reformulate clustering using a generative model.
As β → ∞, soft k-Means becomes k-Means! (Exercise)
                  Gaussian Mixture Models
A Generative View of Clustering
   What if the data don’t look like spherical blobs?
      I   elongated clusters
      I   discrete data
   Remainder of this lecture: formulating clustering as a probabilistic
   model
      I   specify assumptions about how the observations relate to latent
          variables
       I   use an algorithm called E-M to (approximately) maximize the
           likelihood of the observations
   This lets us generalize clustering to non-spherical centers or to
   non-Gaussian observation models (as in this week’s tutorial).
   This lecture is when probabilistic modeling starts to shine!
Generative Models Recap
   Recall generative (Bayes) classifiers:
                               p(x, t) = p(x | t) p(t)
      I   We fit p(t) and p(x | t) using labeled data.
   If t is never observed, we call it a latent variable, or hidden
   variable, and generally denote it with z instead.
      I   The things we can observe (i.e. x) are called observables.
   By marginalizing out z, we get a density over the observables:
                       p(x) = \sum_z p(x, z) = \sum_z p(x \mid z)\, p(z)
   This is called a latent variable model.
    If p(z) is a categorical distribution, this is a mixture model, and
    different values of z correspond to different components.
Gaussian Mixture Model (GMM)
Most common mixture model: Gaussian mixture model (GMM)
    A GMM represents a distribution as
                            p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)
    with πk the mixing coefficients, where:
                          \sum_{k=1}^{K} \pi_k = 1   and   \pi_k \ge 0 \;\; \forall k
     This defines a density over x, so we can fit the parameters using
     maximum likelihood. We are trying to match the data density of x as closely
     as possible.
       I This is a hard optimization problem (and the focus of this lecture).
    GMMs are universal approximators of densities (analogously to our
    universality result for MLPs). Even diagonal GMMs are universal
    approximators.
Gaussian Mixture Model (GMM)
   We can also write the model as a generative process:
                For i = 1, \ldots, N:
                                  z^{(i)} \sim \mathrm{Categorical}(\pi)
                    x^{(i)} \mid z^{(i)} \sim \mathcal{N}(\mu_{z^{(i)}}, \Sigma_{z^{(i)}})
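    A minimal NumPy sketch of this generative process; the function name sample_gmm and the argument layout are illustrative assumptions.

        import numpy as np

        def sample_gmm(pi, mus, Sigmas, N, seed=0):
            """Draw N samples: z ~ Categorical(pi), then x | z ~ N(mu_z, Sigma_z)."""
            rng = np.random.default_rng(seed)
            z = rng.choice(len(pi), size=N, p=pi)   # latent component for each sample
            x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
            return x, z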
The Generative Model
     [Figure: 500 points drawn from a mixture of 3 Gaussians. (a) Samples from p(x | z). (b) Samples from the marginal p(x). (c) Responsibilities p(z | x).]
Maximum Likelihood with Latent Variables
    How should we choose the parameters \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}?
   Maximum likelihood principle: choose parameters to maximize
   likelihood of observed data
   We don’t observe the cluster assignments z, we only see the data x
    Given data \mathcal{D} = \{x^{(n)}\}_{n=1}^{N}, choose parameters to maximize:

                            \log p(\mathcal{D}) = \sum_{n=1}^{N} \log p(x^{(n)})
   We can find p(x) by marginalizing out z:
                p(x) = \sum_{k=1}^{K} p(z = k, x) = \sum_{k=1}^{K} p(z = k)\, p(x \mid z = k)
Visualizing a Mixture of Gaussians – 1D Gaussians
      If you fit a Gaussian to the data: [figure]
      Now, we are trying to fit a GMM (with K = 2 in this example): [figure]
 [Slide credit: K. Kutulakos]
Visualizing a Mixture of Gaussians – 2D Gaussians
                  Expectation-Maximization (E-M)
Fitting GMMs: Maximum Likelihood
   Some shorthand notation: let θ = {πk , µk , Σk } denote the full set of
   model parameters. Let X = {x(i) } and Z = {z (i) }.
   Maximum likelihood objective:
                \log p(\mathbf{X}; \theta) = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x^{(i)}; \mu_k, \Sigma_k) \right)
   In general, no closed-form solution
   Not identifiable: solution is invariant to permutations
   Challenges in optimizing this using gradient descent?
      I   Non-convex (due to permutation symmetry)
      I   Need to enforce non-negativity constraint on πk and PSD constraint
          on Σk
      I   Derivatives w.r.t. Σk are expensive/complicated.
   We need a different approach!
Fitting GMMs: Maximum Likelihood
   Warning: you don’t want the global maximum. You can achieve
   arbitrarily high training likelihood by placing a small-variance
   Gaussian component on a training example.
   This is known as a singularity.
Latent Variable Models: Inference
   If we knew the parameters θ = {πk , µk , Σk }, we could infer which
   component a data point x(i) probably belongs to by inferring its
   latent variable z (i) .
   This is just posterior inference, which we do using Bayes’ Rule:
                  \Pr(z^{(i)} = k \mid x^{(i)}) = \frac{\Pr(z = k)\, p(x \mid z = k)}{\sum_{\ell} \Pr(z = \ell)\, p(x \mid z = \ell)}

    Just like Naïve Bayes, GDA, etc. at test time.
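    A small sketch of this posterior inference step, assuming SciPy's multivariate_normal for the Gaussian density; computing in log space before normalizing is a common stability choice, not something the slide specifies.

        import numpy as np
        from scipy.stats import multivariate_normal

        def responsibilities(X, pi, mus, Sigmas):
            """Posterior Pr(z = k | x) for every data point, via Bayes' rule in log space."""
            K = len(pi)
            # log[ Pr(z = k) p(x | z = k) ] for every point and component, shape (N, K)
            log_joint = np.stack(
                [np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                 for k in range(K)], axis=1)
            log_joint -= log_joint.max(axis=1, keepdims=True)  # stabilize before exponentiating
            R = np.exp(log_joint)
            return R / R.sum(axis=1, keepdims=True)             # normalize rows to sum to 1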
Latent Variable Models: Learning
   If we somehow knew the latent variables for every data point, we
   could simply maximize the joint log-likelihood.
              \log p(\mathbf{X}, \mathbf{Z}; \theta) = \sum_{i=1}^{N} \log p(x^{(i)}, z^{(i)}; \theta)
                                                     = \sum_{i=1}^{N} \left[ \log p(z^{(i)}) + \log p(x^{(i)} \mid z^{(i)}) \right].
   This is just like GDA at training time. Our formulas from Week 8,
   written in a suggestive notation:
                      \pi_k    = \frac{1}{N} \sum_{i=1}^{N} r_k^{(i)}

                      \mu_k    = \frac{\sum_{i=1}^{N} r_k^{(i)} x^{(i)}}{\sum_{i=1}^{N} r_k^{(i)}}

                      \Sigma_k = \frac{1}{\sum_{i=1}^{N} r_k^{(i)}} \sum_{i=1}^{N} r_k^{(i)} (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^\top

                      r_k^{(i)} = \mathbb{I}[z^{(i)} = k]
Latent Variable Models
   But we don’t know the z (i) , so we need to marginalize them out. Now
   the log-likelihood is more awkward.
            \log p(\mathbf{X}; \theta) = \sum_{i=1}^{N} \log p(x^{(i)} \mid \theta)
                                       = \sum_{i=1}^{N} \log \sum_{z^{(i)}=1}^{K} p(x^{(i)} \mid z^{(i)}; \{\mu_k\}, \{\Sigma_k\})\, p(z^{(i)} \mid \pi)
   Problem: the log is outside the sum, so things don’t simplify.
   We have a chicken-and-egg problem, just like with K-Means!
      I   Given θ, inferring the z (i) is easy.
      I   Given the z (i) , learning θ (with maximum likelihood) is easy.
      I   Doing both simultaneously is hard.
   Can you guess the algorithm?
Intuitively, How Can We Fit a Mixture of Gaussians?
   We use the Expectation-Maximization algorithm, which alternates
   between two steps:
     1. Expectation step (E-step): Compute the posterior probability over
        z given our current model - i.e. how much do we think each
        Gaussian generates each datapoint.
     2. Maximization step (M-step): Assuming that the data really was
        generated this way, change the parameters of each Gaussian to
        maximize the probability that it would generate the data it is
        currently responsible for.
    [Figure: datapoints annotated with soft responsibilities, e.g. .95/.05 for points clearly attributed to one Gaussian and .5/.5 for ambiguous points.]
Expectation Maximization for GMM Overview
 1. E-step:
       I   Assign the responsibility r_k^{(i)} of component k for data point i using
           the posterior probability:
                          r_k^{(i)} = \Pr(z^{(i)} = k \mid x^{(i)}; \theta)
 2. M-step:
      I Apply the maximum likelihood updates, where each component is
        fit with a weighted dataset. The weights are proportional to the
        responsibilities.
                      \pi_k    = \frac{1}{N} \sum_{i=1}^{N} r_k^{(i)}

                      \mu_k    = \frac{\sum_{i=1}^{N} r_k^{(i)} x^{(i)}}{\sum_{i=1}^{N} r_k^{(i)}}

                      \Sigma_k = \frac{1}{\sum_{i=1}^{N} r_k^{(i)}} \sum_{i=1}^{N} r_k^{(i)} (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^\top
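    As a sketch of how the E-step and M-step fit together in code: the function name em_gmm, the initialization, the small diagonal jitter added to each covariance, and the use of scipy.stats.multivariate_normal are all illustrative assumptions, not prescribed by the slides.

        import numpy as np
        from scipy.stats import multivariate_normal

        def em_gmm(X, K, n_iters=50, seed=0):
            """Compact EM sketch for a GMM; the initialization here is one arbitrary choice."""
            N, D = X.shape
            rng = np.random.default_rng(seed)
            pi = np.full(K, 1.0 / K)
            mus = X[rng.choice(N, size=K, replace=False)].astype(float)
            Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
            for _ in range(n_iters):
                # E-step: responsibilities r_k^(i) = Pr(z^(i) = k | x^(i); theta), shape (N, K).
                log_joint = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                                      for k in range(K)], axis=1)
                log_joint -= log_joint.max(axis=1, keepdims=True)
                R = np.exp(log_joint)
                R /= R.sum(axis=1, keepdims=True)
                # M-step: weighted maximum-likelihood updates.
                Nk = R.sum(axis=0)                     # effective number of points per component
                pi = Nk / N
                mus = (R.T @ X) / Nk[:, None]
                for k in range(K):
                    diff = X - mus[k]
                    Sigmas[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
            return pi, mus, Sigmas, R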
Example
   Suppose we recorded a bunch of temperatures in March for
   Toronto and Miami, but forgot to record which was which, and
   they’re all jumbled together.
   Let’s try to separate them out using a mixture of Gaussians and
   E-M.
    [Figures: random initialization of the mixture, followed by the E-step and M-step results at steps 1, 2, 3, and 10.]
Expectation-Maximization
    [Figure: EM for multivariate Gaussians.]
   In tutorial, you will fit a mixture of Bernoullis model.
Relation to k-Means
   The K-Means Algorithm:
     1. Assignment step: Assign each data point to the closest cluster
     2. Refitting step: Move each cluster center to the average of the data
        assigned to it
   The EM Algorithm:
     1. E-step: Compute the posterior probability over z given our current
        model
     2. M-step: Maximize the probability that it would generate the data it
        is currently responsible for.
    Can you find the similarities between the soft k-Means algorithm
    and the EM algorithm with shared covariance \frac{1}{\beta} I?
   Both rely on alternating optimization methods and can suffer from
   bad local optima.
                            Why EM Works
                  (the rest of this lecture is optional)
Jensen’s Inequality (optional)
    Recall: if a function f is convex, then

               f\Big( \sum_i \lambda_i x_i \Big) \le \sum_i \lambda_i f(x_i),

    where the \{\lambda_i\} satisfy \lambda_i \ge 0 and \sum_i \lambda_i = 1.
    If we treat the \lambda_i as the parameters of a categorical distribution,
    \lambda_i = \Pr(X = x_i), this can be rewritten as:

               f(\mathbb{E}[X]) \le \mathbb{E}[f(X)].
   This is known as Jensen’s Inequality. It
   holds for continuous distributions as well.
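    As a quick numerical check (not on the original slide): take the convex function f(x) = x^2 with \lambda = (1/2, 1/2) on the points x_1 = 0 and x_2 = 2. Then

               f(\mathbb{E}[X]) = f\big(\tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 2\big) = f(1) = 1 \;\le\; \tfrac{1}{2} f(0) + \tfrac{1}{2} f(2) = 2 = \mathbb{E}[f(X)].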
Jensen’s Inequality (optional)
   A function f (x) is concave if −f (x) is convex. In this case, we flip
   Jensen’s Inequality:
                           f (E[X]) ≥ E[f (X)].
   When would you expect the inequality to be tight?
Where does EM come from? (optional)
   Recall: the log-likelihood function is awkward because it has a
   summation inside the log:
          \log p(\mathbf{X}; \theta) = \sum_i \log p(x^{(i)}; \theta) = \sum_i \log \left( \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) \right)

    Introduce a new distribution q(z^{(i)}) (we'll see what this is shortly):

          \log p(\mathbf{X}; \theta) = \sum_i \log \left( \sum_{z^{(i)}} q(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right)
                                     = \sum_i \log \mathbb{E}_{q(z^{(i)})}\!\left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right]

    Notice that log is a concave function. So we can use Jensen's Inequality
    to push the log inwards, obtaining the variational lower bound:

          \log p(\mathbf{X}; \theta) \ge \sum_i \mathbb{E}_{q(z^{(i)})}\!\left[ \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right] \triangleq \mathcal{L}(q, \theta)
Where does EM come from? (optional)
    We just derived a lower bound on the log-likelihood:

          \log p(\mathbf{X}; \theta) \ge \sum_i \mathbb{E}_{q(z^{(i)})}\!\left[ \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right] \triangleq \mathcal{L}(q, \theta)

    Simplifying the right-hand side:

          \mathcal{L}(q, \theta) = \sum_i \mathbb{E}_{q(z^{(i)})}[\log p(x^{(i)}, z^{(i)}; \theta)]
                                 - \underbrace{\mathbb{E}_{q(z^{(i)})}[\log q(z^{(i)})]}_{\text{constant w.r.t. } \theta}

    The expected log-probability will turn out to be nice.
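    A standard identity, not derived on the slides but consistent with the bound above, makes the slack explicit: for each data point,

          \log p(x^{(i)}; \theta) = \mathbb{E}_{q(z^{(i)})}\!\left[ \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right] + \mathrm{KL}\big( q(z^{(i)}) \,\|\, p(z^{(i)} \mid x^{(i)}; \theta) \big),

    so the bound is tight exactly when q(z^{(i)}) equals the posterior p(z^{(i)} \mid x^{(i)}; \theta). This previews the choice of q on the next slide.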
Where does EM come from? (optional)
   Everything so far holds for any choice of q. But what should we
   actually pick?
   Jensen’s inequality gives a lower bound on the log-likelihood, so
   the best we can achieve is to make the bound tight (i.e. equality).
   Denote the current parameters as θ old .
    It turns out the posterior probability p(z^{(i)} \mid x^{(i)}; \theta^{\text{old}}) is a very
    good choice for q. Plugging it into the lower bound:

       \sum_i \mathbb{E}_{q(z^{(i)})}\!\left[ \log \frac{p(x^{(i)}, z^{(i)}; \theta^{\text{old}})}{q(z^{(i)})} \right]
          = \sum_i \mathbb{E}_{q(z^{(i)})}\!\left[ \log \frac{p(x^{(i)}, z^{(i)}; \theta^{\text{old}})}{p(z^{(i)} \mid x^{(i)}; \theta^{\text{old}})} \right]
          = \sum_i \mathbb{E}_{q(z^{(i)})}\!\left[ \log p(x^{(i)}; \theta^{\text{old}}) \right]
          = \sum_i \log p(x^{(i)}; \theta^{\text{old}})
          = \log p(\mathbf{X}; \theta^{\text{old}})

    Equality achieved!
Where does EM come from? (optional)
An aside:
     How could you pick q(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{\text{old}}) if you didn't
     already know the answer?
     Observe: if f is strictly concave, then Jensen's inequality becomes an
     equality exactly when the random variable X is deterministic.
     Hence, to solve

        \log \mathbb{E}_{q(z^{(i)})}\!\left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right] = \mathbb{E}_{q(z^{(i)})}\!\left[ \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right],

     we should set q(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta), so that the ratio inside the
     expectation is constant in z^{(i)}.
Where does EM come from? (optional)
   E-step: compute the responsibilities using Bayes’ Rule:
                 r_k^{(i)} \triangleq q(z^{(i)} = k) = \Pr(z^{(i)} = k \mid x^{(i)}; \theta^{\text{old}})

    Rewriting the variational lower bound in terms of the responsibilities:

          \mathcal{L}(q, \theta) = \sum_i \sum_k r_k^{(i)} \log \Pr(z^{(i)} = k; \pi)
                                 + \sum_i \sum_k r_k^{(i)} \log p(x^{(i)} \mid z^{(i)} = k; \{\mu_k\}, \{\Sigma_k\})
                                 + \text{const}
   M-step: maximize L(q, θ) with respect to θ, giving θ new . This
   can be done analytically, and gives the parameter updates we saw
   previously.
   The two steps are guaranteed to improve the log-likelihood:
          log p(X; θ new ) ≥ L(q, θ new ) ≥ L(q, θ old ) = log p(X; θ old ).
EM: Recap (optional)
Recap of EM derivation:
    We’re trying to maximize the log-likelihood log p(X; θ).
    The exact log-likelihood is awkward, but we can use Jensen’s
    Inequality to lower bound it with a nicer function L(q, θ), the
     variational lower bound, which depends on a choice of q.
    The E-step chooses q to make the bound tight at the current
     parameters \theta^{\text{old}}. Mechanistically, this means computing the
     responsibilities r_k^{(i)} = \Pr(z^{(i)} = k \mid x^{(i)}; \theta^{\text{old}}).
    The M-step maximizes L(q, θ) with respect to θ, giving θ new . For
    GMMs, this can be done analytically.
    The combination of the E-step and M-step is guaranteed to
    improve the true log-likelihood.
Visualization of the EM Algorithm (optional)
   The EM algorithm involves alternately computing a lower bound on the
   log likelihood for the current parameter values and then maximizing this
   bound to obtain the new parameter values.
GMM E-Step: Responsibilities (optional)
Let's see how it works on the GMM:
    Conditional probability (using Bayes’ rule) of z given x
        r_k = \Pr(z = k \mid x) = \frac{\Pr(z = k)\, p(x \mid z = k)}{p(x)}
                                = \frac{p(z = k)\, p(x \mid z = k)}{\sum_{j=1}^{K} p(z = j)\, p(x \mid z = j)}
                                = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}
GMM E-Step (optional)
    Once we have computed r_k^{(i)} = \Pr(z^{(i)} = k \mid x^{(i)}), we can compute the expected
    complete-data log-likelihood

        \mathbb{E}_{p(z^{(i)} \mid x^{(i)})}\!\left[ \sum_i \log p(x^{(i)}, z^{(i)} \mid \theta) \right]
           = \sum_i \sum_k r_k^{(i)} \left( \log \Pr(z^{(i)} = k \mid \theta) + \log p(x^{(i)} \mid z^{(i)} = k, \theta) \right)
           = \sum_i \sum_k r_k^{(i)} \left( \log \pi_k + \log \mathcal{N}(x^{(i)}; \mu_k, \Sigma_k) \right)
           = \sum_k \sum_i r_k^{(i)} \log \pi_k + \sum_k \sum_i r_k^{(i)} \log \mathcal{N}(x^{(i)}; \mu_k, \Sigma_k)

    We need to fit K Gaussians; we just need to weight the examples by the r_k^{(i)}.
GMM M-Step (optional)
    Need to optimize

        \sum_k \sum_i r_k^{(i)} \log \pi_k + \sum_k \sum_i r_k^{(i)} \log \mathcal{N}(x^{(i)}; \mu_k, \Sigma_k)

    Solving for \mu_k and \Sigma_k is like fitting K separate Gaussians, but with
    weights r_k^{(i)}.
    The solution is similar to what we have already seen:

                      \mu_k    = \frac{1}{N_k} \sum_{i=1}^{N} r_k^{(i)} x^{(i)}

                      \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} r_k^{(i)} (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^\top

                      \pi_k    = \frac{N_k}{N}, \quad\text{with } N_k = \sum_{i=1}^{N} r_k^{(i)}
EM Algorithm for GMM (optional)
   Initialize the means µk , covariances Σk and mixing coefficients πk
   Iterate until convergence:
      I E-step: Evaluate the responsibilities given current parameters
                 r_k^{(i)} = p(z^{(i)} = k \mid x^{(i)}) = \frac{\pi_k \, \mathcal{N}(x^{(i)} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)}
      I   M-step: Re-estimate the parameters given current responsibilities
                      \mu_k    = \frac{1}{N_k} \sum_{i=1}^{N} r_k^{(i)} x^{(i)}

                      \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} r_k^{(i)} (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^\top

                      \pi_k    = \frac{N_k}{N}, \quad\text{with } N_k = \sum_{i=1}^{N} r_k^{(i)}
      I   Evaluate log likelihood and check for convergence
                \log p(\mathbf{X} \mid \pi, \mu, \Sigma) = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x^{(i)} \mid \mu_k, \Sigma_k) \right)
GMM Recap
  A probabilistic view of clustering - Each cluster corresponds to a
  different Gaussian.
  Model using latent variables.
  General approach, can replace Gaussian with other distributions
  (continuous or discrete)
   More generally, mixture models are very powerful: they are universal
   distribution approximators
  Optimization is done using the EM algorithm.