CSC 311: Introduction to Machine Learning
Lecture 11 - k-Means and EM Algorithm
    Roger Grosse     Rahul G. Krishnan             Guodong Zhang
                    University of Toronto, Fall 2021
Overview
   In the previous lecture, we covered PCA, Autoencoders and
   Matrix Factorization—all unsupervised learning algorithms.
      I   Each algorithm can be used to approximate high dimensional data
          using some lower dimensional form.
   Those methods made an interesting assumption: the data depend on
   latent variables that are never observed. Such models are called
   latent variable models.
      I   For PCA, these correspond to the code vectors (representation).
   Today’s lecture:
      I   First, introduce K-means, a simple algorithm for clustering, i.e.
          grouping data points into clusters
      I   Then, we will reformulate clustering as a latent variable model
          and apply the EM algorithm
Clustering
   Sometimes the data form clusters, where samples within a cluster
   are similar to each other, and samples in different clusters are
   dissimilar:
   Such a distribution is multimodal, since it has multiple modes, or
   regions of high probability mass.
   Grouping data points into clusters, with no observed labels, is
   called clustering. It is an unsupervised learning technique.
   E.g. clustering machine learning papers based on topic (deep
   learning, Bayesian models, etc.)
      I   But topics are never observed (unsupervised).
K-means intuition
   K-means assumes there are k clusters, and each point is close to its
   cluster center, or mean (the mean of points in the cluster).
      I   If we knew the cluster assignment, we could easily compute the
          centers.
      I   If we knew the centers, we could easily compute cluster assignment.
      I   Chicken and egg problem!
   Very simple (and useful) heuristic - start randomly and alternate
   between the two!
K-means
   Initialization: randomly initialize cluster centers
   The algorithm iteratively alternates between two steps:
      I   Assignment step: Assign each data point to the closest cluster
      I   Refitting step: Move each cluster center to the center of gravity of
          the data assigned to it
   [Figure: the assignments after the assignment step (left) and the refitted means after the refitting step (right).]
Figure from Bishop                     Simple demo: http://syskall.com/kmeans.js/
K-means Objective
What is actually being optimized?
  K-means Objective:
  Find cluster centers m and assignments r to minimize the sum of squared
  distances of data points {x(n) } to their assigned cluster centers
           \min_{\{m\},\{r\}} J(\{m\},\{r\}) = \min_{\{m\},\{r\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_k^{(n)} \|m_k - x^{(n)}\|^2

                    s.t. \sum_k r_k^{(n)} = 1 \;\forall n, \quad\text{where } r_k^{(n)} \in \{0,1\} \;\forall k, n

   where r_k^{(n)} = 1 means that x^{(n)} is assigned to cluster k (with center m_k)
    Finding the exact optimum can be shown to be NP-hard.
    K-means can be seen as block coordinate descent on this objective
    (analogous to ALS for matrix completion)
        I   Assignment step = minimize w.r.t. {r_k^{(n)}}
        I   Refitting step = minimize w.r.t. {m_k}
How to optimize?: Alternating Minimization
Optimization problem:
                  \min_{\{m_k\},\{r^{(n)}\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_k^{(n)} \|m_k - x^{(n)}\|^2
    If we fix the centers {mk } then we can easily find the optimal
    assignments {r(n) } for each sample n
                                 \min_{r^{(n)}} \sum_{k=1}^{K} r_k^{(n)} \|m_k - x^{(n)}\|^2
       I   Assign each point to the cluster with the nearest center
                  r_k^{(n)} = \begin{cases} 1 & \text{if } k = \arg\min_j \|x^{(n)} - m_j\|^2 \\ 0 & \text{otherwise} \end{cases}
       I   E.g. if x(n) is assigned to cluster k̂,
                        r^{(n)} = [0, 0, \ldots, 1, \ldots, 0]^\top   (only the \hat{k}-th entry is 1)
Alternating Minimization
    Likewise, if we fix the assignments {r^{(n)}}, then we can easily find the
    optimal centers {m_k}:
            0 = \frac{\partial}{\partial m_l} \sum_{n=1}^{N} \sum_{k=1}^{K} r_k^{(n)} \|m_k - x^{(n)}\|^2
              = 2 \sum_{n=1}^{N} r_l^{(n)} (m_l - x^{(n)})
            \implies m_l = \frac{\sum_n r_l^{(n)} x^{(n)}}{\sum_n r_l^{(n)}}
    K-Means simply alternates between minimizing w.r.t. assignments and
   centers. This is an instance of alternating minimization, or block
   coordinate descent.
The K-means Algorithm
   Initialization: Set K cluster means m1 , . . . , mK to random values
   Repeat until convergence (until assignments do not change):
       I   Assignment: Optimize J w.r.t. {r}: Each data point x^{(n)} is assigned
           to the nearest center:
                          \hat{k}^{(n)} = \arg\min_k \|m_k - x^{(n)}\|^2
          and Responsibilities (1-hot or 1-of-K encoding)
                          r_k^{(n)} = \mathbb{I}[\hat{k}^{(n)} = k]   for k = 1, \ldots, K
      I   Refitting: Optimize J w.r.t. {m}: Each center is set to mean of
          data assigned to it
                                  m_k = \frac{\sum_n r_k^{(n)} x^{(n)}}{\sum_n r_k^{(n)}}.
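    To make the alternation concrete, here is a minimal NumPy sketch of the algorithm above. The function name kmeans, the initialization at K randomly chosen data points, and the iteration cap are illustrative choices rather than anything specified on the slides.

        import numpy as np

        def kmeans(X, K, max_iters=100, seed=0):
            """Minimal k-means sketch. X: (N, D) array. Returns (centers, assignments)."""
            rng = np.random.default_rng(seed)
            # Initialize centers at K distinct data points (one common choice).
            centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
            assign = None
            for _ in range(max_iters):
                # Assignment step: each point goes to its nearest center.
                dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
                new_assign = dists.argmin(axis=1)
                if assign is not None and np.array_equal(new_assign, assign):
                    break  # assignments unchanged => converged (to a local optimum)
                assign = new_assign
                # Refitting step: move each center to the mean of the points assigned to it.
                for k in range(K):
                    if np.any(assign == k):
                        centers[k] = X[assign == k].mean(axis=0)
            return centers, assign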
Why K-means Converges
   K-means algorithm reduces the cost at each iteration.
     I Whenever an assignment is changed, the sum squared distances J of
       data points from their assigned cluster centers is reduced.
     I Whenever a cluster center is moved, J is reduced.
   Test for convergence: If the assignments do not change in the
   assignment step, we have converged (to at least a local minimum).
   This will always happen after a finite number of iterations, since the
   number of possible cluster assignments is finite
    [Figure: the K-means cost function after each assignment step (blue) and refitting
    step (red); the algorithm has converged after the third refitting step.]
Local Minima
    The objective J is non-convex, so coordinate descent on J is not
    guaranteed to converge to the global minimum.
    There is nothing to prevent k-means getting stuck at local minima.
    We could try many random starting points.
    [Figure: a bad local optimum.]
K-means for Vector Quantization
Figure from Bishop
       Given an image, construct a “dataset” of pixels, each represented by its RGB
       intensities
       Run k-means, then replace each pixel by its cluster center
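       A rough sketch of this color quantization, assuming scikit-learn's KMeans is
       available; the function name quantize_colors and the choice K = 8 are illustrative.

        import numpy as np
        from sklearn.cluster import KMeans

        def quantize_colors(img, K=8):
            """Vector quantization: replace each pixel's color by its cluster center."""
            H, W, C = img.shape
            pixels = img.reshape(-1, C).astype(float)          # the "dataset" of RGB pixels
            km = KMeans(n_clusters=K, n_init=10).fit(pixels)   # run k-means on pixel colors
            quantized = km.cluster_centers_[km.labels_]        # look up each pixel's center
            return quantized.reshape(H, W, C)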
K-means for Image Segmentation
    Given an image, construct a “dataset” of pixels, represented by their RGB
    intensities and grid locations
   Run k-means (with some modifications) to get superpixels
Soft K-means
   Instead of making hard assignments of data points to clusters, we can
   make soft assignments. One cluster may have a responsibility of .7 for a
   datapoint and another may have a responsibility of .3.
      I   Allows a cluster to use more information about the data in the
          refitting step.
      I   How do we decide on the soft assignments?
      I   We already saw this in multi-class classification:
            I   1-of-K encoding vs softmax assignments
Soft K-means Algorithm
   Initialization: Set K means {mk } to random values
   Repeat until convergence (measured by how much J changes):
      I   Assignment: Each data point n given soft “degree of assignment” to
          each cluster mean k, based on responsibilities
                  r_k^{(n)} = \frac{\exp[-\beta \|m_k - x^{(n)}\|^2]}{\sum_j \exp[-\beta \|m_j - x^{(n)}\|^2]}

             \implies r^{(n)} = \mathrm{softmax}\big(-\beta \{\|m_k - x^{(n)}\|^2\}_{k=1}^{K}\big)
      I   Refitting: Model parameters, means, are adjusted to match sample
          means of datapoints they are responsible for:
                                     m_k = \frac{\sum_n r_k^{(n)} x^{(n)}}{\sum_n r_k^{(n)}}
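    A minimal NumPy sketch of one soft k-means iteration as described above; subtracting the per-row maximum before exponentiating is a standard numerical-stability trick, not part of the slide.

        import numpy as np

        def soft_kmeans_step(X, centers, beta):
            """One soft k-means iteration: soft assignments, then refitting."""
            # Soft assignment: softmax over negative (beta-scaled) squared distances.
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
            logits = -beta * dists
            logits -= logits.max(axis=1, keepdims=True)   # subtract row max for stability
            R = np.exp(logits)
            R /= R.sum(axis=1, keepdims=True)             # responsibilities, rows sum to 1
            # Refitting: each center becomes the responsibility-weighted mean of the data.
            new_centers = (R.T @ X) / R.sum(axis=0)[:, None]
            return new_centers, R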
Questions about Soft K-means
Some remaining issues
    How to set β?
    Clusters with unequal weight and width?
These aren’t straightforward to address with K-means. Instead, in the sequel,
we’ll reformulate clustering using a generative model.
As β → ∞, soft k-Means becomes k-Means! (Exercise)
                  Gaussian Mixture Models
A Generative View of Clustering
   What if the data don’t look like spherical blobs?
      I   elongated clusters
      I   discrete data
   Remainder of this lecture: formulating clustering as a probabilistic
   model
      I   specify assumptions about how the observations relate to latent
          variables
       I   use an algorithm called E-M to (approximately) maximize the
           likelihood of the observations
   This lets us generalize clustering to non-spherical centers or to
   non-Gaussian observation models (as in this week’s tutorial).
   This lecture is when probabilistic modeling starts to shine!
Generative Models Recap
   Recall generative (Bayes) classifiers:
                               p(x, t) = p(x | t) p(t)
      I   We fit p(t) and p(x | t) using labeled data.
   If t is never observed, we call it a latent variable, or hidden
   variable, and generally denote it with z instead.
      I   The things we can observe (i.e. x) are called observables.
   By marginalizing out z, we get a density over the observables:
                       p(x) = \sum_z p(x, z) = \sum_z p(x \mid z)\, p(z)
   This is called a latent variable model.
    If p(z) is a categorical distribution, this is a mixture model, and
    different values of z correspond to different components.
Gaussian Mixture Model (GMM)
Most common mixture model: Gaussian mixture model (GMM)
    A GMM represents a distribution as
                            p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)
    with πk the mixing coefficients, where:
                          \sum_{k=1}^{K} \pi_k = 1   and   \pi_k \ge 0 \;\; \forall k
     This defines a density over x, so we can fit the parameters using
     maximum likelihood. We are trying to match the data density of x as closely
     as possible.
       I This is a hard optimization problem (and the focus of this lecture).
    GMMs are universal approximators of densities (analogously to our
    universality result for MLPs). Even diagonal GMMs are universal
    approximators.
Gaussian Mixture Model (GMM)
   We can also write the model as a generative process:
                For i = 1, \ldots, N:
                                  z^{(i)} \sim \mathrm{Categorical}(\pi)
                    x^{(i)} \mid z^{(i)} \sim \mathcal{N}(\mu_{z^{(i)}}, \Sigma_{z^{(i)}})
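    A minimal NumPy sketch of this generative process; the function name sample_gmm and the argument layout are illustrative assumptions.

        import numpy as np

        def sample_gmm(pi, mus, Sigmas, N, seed=0):
            """Draw N samples: z ~ Categorical(pi), then x | z ~ N(mu_z, Sigma_z)."""
            rng = np.random.default_rng(seed)
            z = rng.choice(len(pi), size=N, p=pi)   # latent component for each sample
            x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
            return x, z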
The Generative Model
     [Figure: 500 points drawn from a mixture of 3 Gaussians. (a) Samples from p(x | z). (b) Samples from the marginal p(x). (c) Responsibilities p(z | x).]
Maximum Likelihood with Latent Variables
    How should we choose the parameters \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}?
   Maximum likelihood principle: choose parameters to maximize
   likelihood of observed data
   We don’t observe the cluster assignments z, we only see the data x
    Given data \mathcal{D} = \{x^{(n)}\}_{n=1}^{N}, choose parameters to maximize:

                            \log p(\mathcal{D}) = \sum_{n=1}^{N} \log p(x^{(n)})
   We can find p(x) by marginalizing out z:
                p(x) = \sum_{k=1}^{K} p(z = k, x) = \sum_{k=1}^{K} p(z = k)\, p(x \mid z = k)
Visualizing a Mixture of Gaussians – 1D Gaussians
      If you fit a Gaussian to the data: [figure]
      Now, we are trying to fit a GMM (with K = 2 in this example): [figure]
 [Slide credit: K. Kutulakos]
Visualizing a Mixture of Gaussians – 2D Gaussians
                  Expectation-Maximization (E-M)
Fitting GMMs: Maximum Likelihood
   Some shorthand notation: let θ = {πk , µk , Σk } denote the full set of
   model parameters. Let X = {x(i) } and Z = {z (i) }.
   Maximum likelihood objective:
                \log p(\mathbf{X}; \theta) = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x^{(i)}; \mu_k, \Sigma_k) \right)
   In general, no closed-form solution
   Not identifiable: solution is invariant to permutations
   Challenges in optimizing this using gradient descent?
      I   Non-convex (due to permutation symmetry)
      I   Need to enforce non-negativity constraint on πk and PSD constraint
          on Σk
      I   Derivatives w.r.t. Σk are expensive/complicated.
   We need a different approach!
Fitting GMMs: Maximum Likelihood
   Warning: you don’t want the global maximum. You can achieve
   arbitrarily high training likelihood by placing a small-variance
   Gaussian component on a training example.
   This is known as a singularity.
Latent Variable Models: Inference
   If we knew the parameters θ = {πk , µk , Σk }, we could infer which
   component a data point x(i) probably belongs to by inferring its
   latent variable z (i) .
   This is just posterior inference, which we do using Bayes’ Rule:
                  \Pr(z^{(i)} = k \mid x^{(i)}) = \frac{\Pr(z = k)\, p(x \mid z = k)}{\sum_{\ell} \Pr(z = \ell)\, p(x \mid z = \ell)}

    Just like Naïve Bayes, GDA, etc. at test time.
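    A small sketch of this posterior inference step, assuming SciPy's multivariate_normal for the Gaussian density; computing in log space before normalizing is a common stability choice, not something the slide specifies.

        import numpy as np
        from scipy.stats import multivariate_normal

        def responsibilities(X, pi, mus, Sigmas):
            """Posterior Pr(z = k | x) for every data point, via Bayes' rule in log space."""
            K = len(pi)
            # log[ Pr(z = k) p(x | z = k) ] for every point and component, shape (N, K)
            log_joint = np.stack(
                [np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                 for k in range(K)], axis=1)
            log_joint -= log_joint.max(axis=1, keepdims=True)  # stabilize before exponentiating
            R = np.exp(log_joint)
            return R / R.sum(axis=1, keepdims=True)             # normalize rows to sum to 1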
Latent Variable Models: Learning
   If we somehow knew the latent variables for every data point, we
   could simply maximize the joint log-likelihood.
              \log p(\mathbf{X}, \mathbf{Z}; \theta) = \sum_{i=1}^{N} \log p(x^{(i)}, z^{(i)}; \theta)
                                                     = \sum_{i=1}^{N} \left[ \log p(z^{(i)}) + \log p(x^{(i)} \mid z^{(i)}) \right].
   This is just like GDA at training time. Our formulas from Week 8,
   written in a suggestive notation:
                      \pi_k    = \frac{1}{N} \sum_{i=1}^{N} r_k^{(i)}

                      \mu_k    = \frac{\sum_{i=1}^{N} r_k^{(i)} x^{(i)}}{\sum_{i=1}^{N} r_k^{(i)}}

                      \Sigma_k = \frac{1}{\sum_{i=1}^{N} r_k^{(i)}} \sum_{i=1}^{N} r_k^{(i)} (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^\top

                      r_k^{(i)} = \mathbb{I}[z^{(i)} = k]
Latent Variable Models
   But we don’t know the z (i) , so we need to marginalize them out. Now
   the log-likelihood is more awkward.
            \log p(\mathbf{X}; \theta) = \sum_{i=1}^{N} \log p(x^{(i)} \mid \theta)
                                       = \sum_{i=1}^{N} \log \sum_{z^{(i)}=1}^{K} p(x^{(i)} \mid z^{(i)}; \{\mu_k\}, \{\Sigma_k\})\, p(z^{(i)} \mid \pi)
   Problem: the log is outside the sum, so things don’t simplify.
   We have a chicken-and-egg problem, just like with K-Means!
      I   Given θ, inferring the z (i) is easy.
      I   Given the z (i) , learning θ (with maximum likelihood) is easy.
      I   Doing both simultaneously is hard.
   Can you guess the algorithm?
Intuitively, How Can We Fit a Mixture of Gaussians?
   We use the Expectation-Maximization algorithm, which alternates
   between two steps:
     1. Expectation step (E-step): Compute the posterior probability over
        z given our current model - i.e. how much do we think each
        Gaussian generates each datapoint.
     2. Maximization step (M-step): Assuming that the data really was
        generated this way, change the parameters of each Gaussian to
        maximize the probability that it would generate the data it is
        currently responsible for.
    [Figure: datapoints annotated with soft responsibilities, e.g. .95/.05 for points clearly attributed to one Gaussian and .5/.5 for ambiguous points.]
Expectation Maximization for GMM Overview
 1. E-step:
       I   Assign the responsibility r_k^{(i)} of component k for data point i using
           the posterior probability:
                          r_k^{(i)} = \Pr(z^{(i)} = k \mid x^{(i)}; \theta)
 2. M-step:
      I Apply the maximum likelihood updates, where each component is
        fit with a weighted dataset. The weights are proportional to the
        responsibilities.
                      \pi_k    = \frac{1}{N} \sum_{i=1}^{N} r_k^{(i)}

                      \mu_k    = \frac{\sum_{i=1}^{N} r_k^{(i)} x^{(i)}}{\sum_{i=1}^{N} r_k^{(i)}}

                      \Sigma_k = \frac{1}{\sum_{i=1}^{N} r_k^{(i)}} \sum_{i=1}^{N} r_k^{(i)} (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^\top
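    As a sketch of how the E-step and M-step fit together in code: the function name em_gmm, the initialization, the small diagonal jitter added to each covariance, and the use of scipy.stats.multivariate_normal are all illustrative assumptions, not prescribed by the slides.

        import numpy as np
        from scipy.stats import multivariate_normal

        def em_gmm(X, K, n_iters=50, seed=0):
            """Compact EM sketch for a GMM; the initialization here is one arbitrary choice."""
            N, D = X.shape
            rng = np.random.default_rng(seed)
            pi = np.full(K, 1.0 / K)
            mus = X[rng.choice(N, size=K, replace=False)].astype(float)
            Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
            for _ in range(n_iters):
                # E-step: responsibilities r_k^(i) = Pr(z^(i) = k | x^(i); theta), shape (N, K).
                log_joint = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                                      for k in range(K)], axis=1)
                log_joint -= log_joint.max(axis=1, keepdims=True)
                R = np.exp(log_joint)
                R /= R.sum(axis=1, keepdims=True)
                # M-step: weighted maximum-likelihood updates.
                Nk = R.sum(axis=0)                     # effective number of points per component
                pi = Nk / N
                mus = (R.T @ X) / Nk[:, None]
                for k in range(K):
                    diff = X - mus[k]
                    Sigmas[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
            return pi, mus, Sigmas, R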
Example
   Suppose we recorded a bunch of temperatures in March for
   Toronto and Miami, but forgot to record which was which, and
   they’re all jumbled together.
   Let’s try to separate them out using a mixture of Gaussians and
   E-M.
    [Figures: random initialization of the mixture, followed by the E-step and M-step results at steps 1, 2, 3, and 10.]
Expectation-Maximization
    [Figure: EM for multivariate Gaussians.]
   In tutorial, you will fit a mixture of Bernoullis model.
Relation to k-Means
   The K-Means Algorithm:
     1. Assignment step: Assign each data point to the closest cluster
     2. Refitting step: Move each cluster center to the average of the data
        assigned to it
   The EM Algorithm:
     1. E-step: Compute the posterior probability over z given our current
        model
     2. M-step: Maximize the probability that it would generate the data it
        is currently responsible for.
    Can you find the similarities between the soft k-Means algorithm
    and the EM algorithm with shared covariance \frac{1}{\beta} I?
   Both rely on alternating optimization methods and can suffer from
   bad local optima.
                            Why EM Works
                  (the rest of this lecture is optional)
Jensen’s Inequality (optional)
    Recall: if a function f is convex, then

               f\Big( \sum_i \lambda_i x_i \Big) \le \sum_i \lambda_i f(x_i),

    where the \{\lambda_i\} satisfy \lambda_i \ge 0 and \sum_i \lambda_i = 1.
    If we treat the \lambda_i as the parameters of a categorical distribution,
    \lambda_i = \Pr(X = x_i), this can be rewritten as:

               f(\mathbb{E}[X]) \le \mathbb{E}[f(X)].
   This is known as Jensen’s Inequality. It
   holds for continuous distributions as well.
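    As a quick numerical check (not on the original slide): take the convex function f(x) = x^2 with \lambda = (1/2, 1/2) on the points x_1 = 0 and x_2 = 2. Then

               f(\mathbb{E}[X]) = f\big(\tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 2\big) = f(1) = 1 \;\le\; \tfrac{1}{2} f(0) + \tfrac{1}{2} f(2) = 2 = \mathbb{E}[f(X)].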
Jensen’s Inequality (optional)
   A function f (x) is concave if −f (x) is convex. In this case, we flip
   Jensen’s Inequality:
                           f (E[X]) ≥ E[f (X)].
   When would you expect the inequality to be tight?
Where does EM come from? (optional)
   Recall: the log-likelihood function is awkward because it has a
   summation inside the log:
          \log p(\mathbf{X}; \theta) = \sum_i \log p(x^{(i)}; \theta) = \sum_i \log \left( \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) \right)

    Introduce a new distribution q(z^{(i)}) (we'll see what this is shortly):

          \log p(\mathbf{X}; \theta) = \sum_i \log \left( \sum_{z^{(i)}} q(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right)
                                     = \sum_i \log \mathbb{E}_{q(z^{(i)})}\!\left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right]

    Notice that log is a concave function. So we can use Jensen's Inequality
    to push the log inwards, obtaining the variational lower bound:

          \log p(\mathbf{X}; \theta) \ge \sum_i \mathbb{E}_{q(z^{(i)})}\!\left[ \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right] \triangleq \mathcal{L}(q, \theta)
Where does EM come from? (optional)
    We just derived a lower bound on the log-likelihood:

          \log p(\mathbf{X}; \theta) \ge \sum_i \mathbb{E}_{q(z^{(i)})}\!\left[ \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right] \triangleq \mathcal{L}(q, \theta)

    Simplifying the right-hand side:

          \mathcal{L}(q, \theta) = \sum_i \mathbb{E}_{q(z^{(i)})}[\log p(x^{(i)}, z^{(i)}; \theta)]
                                 - \underbrace{\mathbb{E}_{q(z^{(i)})}[\log q(z^{(i)})]}_{\text{constant w.r.t. } \theta}

    The expected log-probability will turn out to be nice.
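    A standard identity, not derived on the slides but consistent with the bound above, makes the slack explicit: for each data point,

          \log p(x^{(i)}; \theta) = \mathbb{E}_{q(z^{(i)})}\!\left[ \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right] + \mathrm{KL}\big( q(z^{(i)}) \,\|\, p(z^{(i)} \mid x^{(i)}; \theta) \big),

    so the bound is tight exactly when q(z^{(i)}) equals the posterior p(z^{(i)} \mid x^{(i)}; \theta). This previews the choice of q on the next slide.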
Where does EM come from? (optional)
   Everything so far holds for any choice of q. But what should we
   actually pick?
   Jensen’s inequality gives a lower bound on the log-likelihood, so
   the best we can achieve is to make the bound tight (i.e. equality).
   Denote the current parameters as θ old .
    It turns out the posterior probability p(z^{(i)} \mid x^{(i)}; \theta^{\text{old}}) is a very
    good choice for q. Plugging it into the lower bound:

       \sum_i \mathbb{E}_{q(z^{(i)})}\!\left[ \log \frac{p(x^{(i)}, z^{(i)}; \theta^{\text{old}})}{q(z^{(i)})} \right]
          = \sum_i \mathbb{E}_{q(z^{(i)})}\!\left[ \log \frac{p(x^{(i)}, z^{(i)}; \theta^{\text{old}})}{p(z^{(i)} \mid x^{(i)}; \theta^{\text{old}})} \right]
          = \sum_i \mathbb{E}_{q(z^{(i)})}\!\left[ \log p(x^{(i)}; \theta^{\text{old}}) \right]
          = \sum_i \log p(x^{(i)}; \theta^{\text{old}})
          = \log p(\mathbf{X}; \theta^{\text{old}})

    Equality achieved!
Where does EM come from? (optional)
An aside:
     How could you pick q(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{\text{old}}) if you didn't
     already know the answer?
     Observe: if f is strictly concave, then Jensen's inequality becomes an
     equality exactly when the random variable X is deterministic.
     Hence, to solve

        \log \mathbb{E}_{q(z^{(i)})}\!\left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right] = \mathbb{E}_{q(z^{(i)})}\!\left[ \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q(z^{(i)})} \right],

     we should set q(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta), so that the ratio inside the
     expectation is constant in z^{(i)}.
Where does EM come from? (optional)
   E-step: compute the responsibilities using Bayes’ Rule:
                 r_k^{(i)} \triangleq q(z^{(i)} = k) = \Pr(z^{(i)} = k \mid x^{(i)}; \theta^{\text{old}})

    Rewriting the variational lower bound in terms of the responsibilities:

          \mathcal{L}(q, \theta) = \sum_i \sum_k r_k^{(i)} \log \Pr(z^{(i)} = k; \pi)
                                 + \sum_i \sum_k r_k^{(i)} \log p(x^{(i)} \mid z^{(i)} = k; \{\mu_k\}, \{\Sigma_k\})
                                 + \text{const}
   M-step: maximize L(q, θ) with respect to θ, giving θ new . This
   can be done analytically, and gives the parameter updates we saw
   previously.
   The two steps are guaranteed to improve the log-likelihood:
          log p(X; θ new ) ≥ L(q, θ new ) ≥ L(q, θ old ) = log p(X; θ old ).
EM: Recap (optional)
Recap of EM derivation:
    We’re trying to maximize the log-likelihood log p(X; θ).
    The exact log-likelihood is awkward, but we can use Jensen’s
    Inequality to lower bound it with a nicer function L(q, θ), the
     variational lower bound, which depends on a choice of q.
    The E-step chooses q to make the bound tight at the current
     parameters \theta^{\text{old}}. Mechanistically, this means computing the
     responsibilities r_k^{(i)} = \Pr(z^{(i)} = k \mid x^{(i)}; \theta^{\text{old}}).
    The M-step maximizes L(q, θ) with respect to θ, giving θ new . For
    GMMs, this can be done analytically.
    The combination of the E-step and M-step is guaranteed to
    improve the true log-likelihood.
Visualization of the EM Algorithm (optional)
   The EM algorithm involves alternately computing a lower bound on the
   log likelihood for the current parameter values and then maximizing this
   bound to obtain the new parameter values.
GMM E-Step: Responsibilities (optional)
Let's see how it works on the GMM:
    Conditional probability (using Bayes’ rule) of z given x
        r_k = \Pr(z = k \mid x) = \frac{\Pr(z = k)\, p(x \mid z = k)}{p(x)}
                                = \frac{p(z = k)\, p(x \mid z = k)}{\sum_{j=1}^{K} p(z = j)\, p(x \mid z = j)}
                                = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}
GMM E-Step (optional)
    Once we have computed r_k^{(i)} = \Pr(z^{(i)} = k \mid x^{(i)}), we can compute the expected
    complete-data log-likelihood

        \mathbb{E}_{p(z^{(i)} \mid x^{(i)})}\!\left[ \sum_i \log p(x^{(i)}, z^{(i)} \mid \theta) \right]
           = \sum_i \sum_k r_k^{(i)} \left( \log \Pr(z^{(i)} = k \mid \theta) + \log p(x^{(i)} \mid z^{(i)} = k, \theta) \right)
           = \sum_i \sum_k r_k^{(i)} \left( \log \pi_k + \log \mathcal{N}(x^{(i)}; \mu_k, \Sigma_k) \right)
           = \sum_k \sum_i r_k^{(i)} \log \pi_k + \sum_k \sum_i r_k^{(i)} \log \mathcal{N}(x^{(i)}; \mu_k, \Sigma_k)

    We need to fit K Gaussians; we just need to weight the examples by the r_k^{(i)}.
GMM M-Step (optional)
    Need to optimize

        \sum_k \sum_i r_k^{(i)} \log \pi_k + \sum_k \sum_i r_k^{(i)} \log \mathcal{N}(x^{(i)}; \mu_k, \Sigma_k)

    Solving for \mu_k and \Sigma_k is like fitting K separate Gaussians, but with
    weights r_k^{(i)}.
    The solution is similar to what we have already seen:

                      \mu_k    = \frac{1}{N_k} \sum_{i=1}^{N} r_k^{(i)} x^{(i)}

                      \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} r_k^{(i)} (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^\top

                      \pi_k    = \frac{N_k}{N}, \quad\text{with } N_k = \sum_{i=1}^{N} r_k^{(i)}
EM Algorithm for GMM (optional)
   Initialize the means µk , covariances Σk and mixing coefficients πk
   Iterate until convergence:
      I E-step: Evaluate the responsibilities given current parameters
                 r_k^{(i)} = p(z^{(i)} = k \mid x^{(i)}) = \frac{\pi_k \, \mathcal{N}(x^{(i)} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)}
      I   M-step: Re-estimate the parameters given current responsibilities
                      \mu_k    = \frac{1}{N_k} \sum_{i=1}^{N} r_k^{(i)} x^{(i)}

                      \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} r_k^{(i)} (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^\top

                      \pi_k    = \frac{N_k}{N}, \quad\text{with } N_k = \sum_{i=1}^{N} r_k^{(i)}
      I   Evaluate log likelihood and check for convergence
                \log p(\mathbf{X} \mid \pi, \mu, \Sigma) = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x^{(i)} \mid \mu_k, \Sigma_k) \right)
GMM Recap
  A probabilistic view of clustering - Each cluster corresponds to a
  different Gaussian.
  Model using latent variables.
  General approach, can replace Gaussian with other distributions
  (continuous or discrete)
   More generally, mixture models are very powerful: they are universal
   distribution approximators
  Optimization is done using the EM algorithm.