Gaussian Mixture Models
LE Thi Khuyen
February 2025
1 Context
(Figure source: www.robots.ox.ac.uk)
\[
p(x_i) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k). \tag{3}
\]
Assuming $\{x_1, \ldots, x_N\}$ are independent, the likelihood of the GMM for $N$ points is:
\[
P(x \mid \theta) = \prod_{i=1}^{N} p(x_i) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \tag{4}
\]
The corresponding negative log-likelihood is:
\[
\mathcal{L}(\theta) = -\sum_{i=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k). \tag{5}
\]
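As an illustration, the following minimal Python sketch evaluates the mixture density (3) and the negative log-likelihood (5) for given parameters; the helper name gmm_neg_log_likelihood and the example values are assumptions for illustration, not part of the slides.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_neg_log_likelihood(X, pis, mus, Sigmas):
    # densities[i, k] = pi_k * N(x_i | mu_k, Sigma_k)
    densities = np.column_stack([
        pi * multivariate_normal.pdf(X, mean=mu, cov=Sigma)
        for pi, mu, Sigma in zip(pis, mus, Sigmas)
    ])
    p_x = densities.sum(axis=1)      # p(x_i) as in (3)
    return -np.sum(np.log(p_x))      # negative log-likelihood as in (5)

# Illustrative 1-D, two-component example (assumed values)
X = np.array([[0.1], [0.3], [5.2], [4.8]])
print(gmm_neg_log_likelihood(X, pis=[0.5, 0.5],
                             mus=[np.array([0.0]), np.array([5.0])],
                             Sigmas=[np.eye(1), np.eye(1)]))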
Latent variables
Given a data point $x_i$, define
\[
z_{ik} =
\begin{cases}
1 & \text{if } x_i \in C_k, \\
0 & \text{otherwise.}
\end{cases} \tag{6}
\]
$z_i = (z_{i1}, z_{i2}, \ldots, z_{iK})$ is called a latent variable (unobserved), and $z_{ik} \sim \mathrm{Ber}(\pi_k)$, i.e. $p(z_{ik} = 1) = \pi_k$.
Generative model
Under the GMM, a sample is generated by first drawing $z_i \sim \mathrm{Cat}(\pi_1, \ldots, \pi_K)$ and then, given $z_{ik} = 1$, drawing $x_i \sim \mathcal{N}(\mu_k, \Sigma_k)$.
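A minimal sketch of this generative process, with illustrative (assumed) parameter values:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-component GMM parameters (assumptions)
pis = np.array([0.3, 0.7])
mus = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

def sample_gmm(n):
    # Draw the latent component index z_i ~ Cat(pi), then x_i ~ N(mu_k, Sigma_k)
    z = rng.choice(len(pis), size=n, p=pis)
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z

X, z = sample_gmm(500)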
Setting the derivative of $\mathcal{L}(\theta)$ with respect to $\mu_k$ to zero gives
\[
\frac{\partial \mathcal{L}(\theta)}{\partial \mu_k} = 0 \;\Leftrightarrow\; \sum_{i=1}^{N} \gamma_{ik} x_i = \sum_{i=1}^{N} \gamma_{ik} \mu_k. \tag{9}
\]
We obtain:
\[
\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} x_i, \tag{10}
\]
where
• $\gamma_{ik} = \dfrac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{l=1}^{K} \pi_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)}$ is called the responsibility of mixture component $k$ for vector $x_i$, with $\gamma_{ik} \in [0, 1]$,
• $N_k = \sum_{i=1}^{N} \gamma_{ik}$ is the effective number of vectors assigned to component $k$.
Similarly, setting the derivative with respect to $\Sigma_k$ to zero gives
\[
\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^{T}. \tag{11}
\]
For $\pi_k$, the constraint $\sum_{k=1}^{K} \pi_k = 1$ is handled with a Lagrange multiplier $\lambda$:
\[
\frac{\partial \mathcal{L}(\theta, \lambda)}{\partial \pi_k} = -\sum_{i=1}^{N} \frac{\mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{l=1}^{K} \pi_l \, \mathcal{N}(x_i \mid \mu_l, \Sigma_l)} + \lambda = 0, \tag{12}
\]
Hence,
\[
\sum_{i=1}^{N} \frac{\gamma_{ik}}{\pi_k} = \lambda \;\Leftrightarrow\; \sum_{k=1}^{K} \sum_{i=1}^{N} \gamma_{ik} = \lambda \sum_{k=1}^{K} \pi_k, \tag{13}
\]
Since $\sum_{k=1}^{K} \pi_k = 1$ and $\sum_{k=1}^{K} \gamma_{ik} = 1$ for every $i$, the left-hand side equals $N$. Therefore $\lambda = N$ and
\[
\pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik} = \frac{N_k}{N}.
\]
In summary, the EM algorithm alternates between computing the responsibilities $\gamma_{ik}$ (E-step) and updating the parameters (M-step):
\[
\mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} x_i,
\]
\[
\Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^{T},
\]
\[
\pi_k = \frac{N_k}{N}, \quad \text{where } N_k = \sum_{i=1}^{N} \gamma_{ik}.
\]
Exercise
Implement the EM algorithm for a GMM with K = 2 components on the following 1D-dataset:
Solution
The density functions corresponding to the two initialized distributions are:
• $f(x \mid \mu_1, \sigma_1^2) = \dfrac{1}{\sqrt{2\pi\sigma_1^2}} \exp\!\left(-\dfrac{(x - \mu_1)^2}{2\sigma_1^2}\right) = \dfrac{1}{\sqrt{2\pi}} \exp\!\left(-\dfrac{(x - 2)^2}{2}\right)$
• $f(x \mid \mu_2, \sigma_2^2) = \dfrac{1}{\sqrt{2\pi\sigma_2^2}} \exp\!\left(-\dfrac{(x - \mu_2)^2}{2\sigma_2^2}\right) = \dfrac{1}{2\sqrt{2\pi}} \exp\!\left(-\dfrac{(x - 12)^2}{8}\right)$
E-step
Compute the responsibilities (γik ) for each data point xi and each component k:
We obtain:
• $\gamma_{i1} \approx 1$ for $i = 1, 2, 3$ and $\gamma_{i1} \approx 0$ for $i = 4, 5, 6$
• $\gamma_{i2} \approx 0$ for $i = 1, 2, 3$ and $\gamma_{i2} \approx 1$ for $i = 4, 5, 6$
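The same E-step can be checked numerically. The sketch below uses six illustrative points (the slide's original dataset is not reproduced here, so x = [1, 2, 3, 11, 12, 13] and the equal mixing weights are assumptions) together with the initialized parameters above.

import numpy as np
from scipy.stats import norm

x = np.array([1.0, 2.0, 3.0, 11.0, 12.0, 13.0])   # assumed data points
pi1 = pi2 = 0.5                                    # assumed equal mixing weights
mu1, sigma1 = 2.0, 1.0                             # first initialized component
mu2, sigma2 = 12.0, 2.0                            # second initialized component

# Responsibilities: gamma_ik = pi_k f(x_i|mu_k) / sum_l pi_l f(x_i|mu_l)
num1 = pi1 * norm.pdf(x, loc=mu1, scale=sigma1)
num2 = pi2 * norm.pdf(x, loc=mu2, scale=sigma2)
gamma1 = num1 / (num1 + num2)
gamma2 = num2 / (num1 + num2)
print(np.round(gamma1, 3))   # approximately [1, 1, 1, 0, 0, 0]
print(np.round(gamma2, 3))   # approximately [0, 0, 0, 1, 1, 1]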
Exercise
Train a GMM with the EM algorithm from scratch. Given three datasets following the normal distributions $X_1 \sim \mathcal{N}(0, 1)$, $X_2 \sim \mathcal{N}(-3, 1.5)$, $X_3 \sim \mathcal{N}(8, 2)$:
1. Write Python code to generate these datasets with 1000 samples in each group (see the sketch after this list).
(Hint: use np.random.normal(loc=mean, scale=std, size=samples) or norm.rvs(loc=mean, scale=std, size=samples).)
2. Implement the EM algorithm to estimate the data distribution.
3. Apply the function GaussianMixture from the scikit-learn package to estimate the data distribution. Compare the results from the two approaches.
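A minimal sketch for step 1, following the hint; treating the second parameter of each distribution as a standard deviation (the scale argument) is an assumption:

import numpy as np

np.random.seed(0)   # fixed seed for reproducibility (an assumption)

# 1000 samples from each of the three normal distributions, pooled into one 1-D dataset
x1 = np.random.normal(loc=0, scale=1, size=1000)
x2 = np.random.normal(loc=-3, scale=1.5, size=1000)
x3 = np.random.normal(loc=8, scale=2, size=1000)
X = np.concatenate([x1, x2, x3])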
Expectation step
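A minimal sketch of the E-step for a 1-D GMM, computing the responsibilities $\gamma_{ik}$ from the current parameters (function and variable names are illustrative):

import numpy as np
from scipy.stats import norm

def e_step(X, pis, mus, sigmas):
    # weighted[i, k] = pi_k * N(x_i | mu_k, sigma_k^2)
    weighted = np.column_stack([
        pi * norm.pdf(X, loc=mu, scale=sigma)
        for pi, mu, sigma in zip(pis, mus, sigmas)
    ])
    # Normalize each row so that gamma[i, :] sums to 1
    return weighted / weighted.sum(axis=1, keepdims=True)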
Maximization step
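A minimal sketch of the M-step (updates (10), (11) and $\pi_k = N_k/N$ in one dimension), followed by a short EM loop and the scikit-learn comparison of step 3; it reuses X and e_step from the sketches above, and the initialization and iteration count are assumptions:

import numpy as np
from sklearn.mixture import GaussianMixture

def m_step(X, gamma):
    Nk = gamma.sum(axis=0)                        # effective counts N_k
    pis = Nk / len(X)                             # pi_k = N_k / N
    mus = (gamma * X[:, None]).sum(axis=0) / Nk   # mu_k = (1/N_k) sum_i gamma_ik x_i
    # sigma_k^2 = (1/N_k) sum_i gamma_ik (x_i - mu_k)^2
    sigmas = np.sqrt((gamma * (X[:, None] - mus) ** 2).sum(axis=0) / Nk)
    return pis, mus, sigmas

# EM loop from scratch (K = 3; initialization and 100 iterations are assumptions)
pis, mus, sigmas = np.ones(3) / 3, np.array([-1.0, 1.0, 5.0]), np.ones(3)
for _ in range(100):
    gamma = e_step(X, pis, mus, sigmas)
    pis, mus, sigmas = m_step(X, gamma)
print("from scratch :", pis, mus, sigmas)

# Step 3: scikit-learn estimate for comparison
gm = GaussianMixture(n_components=3, random_state=0).fit(X.reshape(-1, 1))
print("scikit-learn :", gm.weights_, gm.means_.ravel(), np.sqrt(gm.covariances_.ravel()))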
Advantages of GMMs
• Flexibility: GMMs can model a wide range of probability distributions, namely any density that can be represented as a weighted sum of multiple normal distributions.
• Interpretability: the parameters of a GMM (weights, means, covariances) have a clear interpretation, which is useful for understanding the underlying data structure.
• Handles overlapping clusters by assigning each point a probability of belonging to each cluster.
• Handles missing values: since EM estimates missing data probabilistically, GMMs can be used to model incomplete datasets.
Disadvantages of GMMs
• Sensitive to the initialization of the parameters (weights, means, covariances); poor initialization may lead to local optima (bad clustering).
• Assumption of normality, which does not always hold in real-world datasets.
• Computationally more expensive than K-means clustering, making it less scalable to large datasets.
• The number of components $K$ must be predefined; choosing a wrong value of $K$ may lead to overfitting or underfitting.