AML
Nayan Ranjan Paul
Assistant Professor
Silicon Institute of Technology
Module - IV
Probability Theory
●
Probability theory is a mathematical framework for representing uncertain
statements.
●
It provides a means of quantifying uncertainty as well as axioms for deriving new
uncertain statements.
●
In artificial intelligence applications, we use probability theory in two major
ways.
– First, the laws of probability tell us how AI systems should reason, so we
design our algorithms to compute or approximate various expressions derived
using probability theory.
– Second, we can use probability and statistics to theoretically analyze the
behavior of proposed AI systems.
Why Probability?
●
Many branches of computer science deal mostly with entities that are entirely
deterministic and certain.
●
A programmer can usually safely assume that a CPU will execute each machine
instruction flawlessly.
●
Given that many computer scientists and software engineers work in a relatively
clean and certain environment.
●
It can be surprising that machine learning makes heavy use of probability theory.
●
Machine learning must always deal with uncertain quantities and sometimes
stochastic (nondeterministic) quantities.
Why Probability?
●
There are three possible sources of uncertainty:
– Inherent stochasticity in the system being modeled.
– Incomplete observability. Even deterministic systems can appear stochastic when we
cannot observe all the variables that drive the behavior of the system.
– Incomplete modeling. When we use a model that must discard some of the
information we have observed, the discarded information results in uncertainty in the
model’s predictions.
●
In many cases, it is more practical to use a simple but uncertain rule rather than a complex
but certain one.
●
For example, thesimple rule “Most birds fly” is cheap to develop and is broadly useful,
●
while a rule of the form, “Birds fly, except for very young birds that have not yet learned
to fly, sick or injured birds that have lost the ability to fly, flightless species of
birdsincluding the cassowary, ostrich and kiwi. . .” is expensive to develop, maintain and
communicate and, after all this effort, is still brittle and prone to failure.
Probability Theory
●
Probability theory was originally developed to analyze the frequencies of events.
●
It is easy to see how probability theory can be used to study events like drawing a
certain hand of cards in a poker game.
●
These kinds of events are often repeatable.
●
When we say that an outcome has a probability p of occurring, it means that if we
repeated the experiment (e.g., drawing a hand of cards) infinitely many times,
then a proportion p of the repetitions would result in that outcome.
●
This kind of reasoning does not seem immediately applicable to propositions that
are not repeatable.
Probability Theory
●
For example, if a doctor analyzes a patient and says that the patient has a 40
percent chance of having the flu, this means something very different—we cannot
make infinitely many replicas of the patient, nor is there any reason to believe that
different replicas of the patient would present with the same symptoms yet have
varying underlying conditions.
●
In the case of the doctor diagnosing the patient, we use probability to represent a
degree of belief, with 1 indicating absolute certainty that the patient has the flu
and 0 indicating absolute certainty that the patient does not have the flu.
●
The former kind of probability, related directly to the rates at which events occur,
is known as frequentist probability, while the latter, related to qualitative levels of
certainty, is known as Bayesian probability.
●
Probability theory provides a set of formal rules for determining the likelihood of
a proposition being true given the likelihood of other propositions.
Random Variables
●
Random variables are the numerical descriptions of the outcomes of any statistical
experiment.
●
In simple terms, a random variable is a rule or a function that associates a
numerical value with each outcome in a sample space of an experiment.
●
A random variable is a variable whose value is unknown or a function that assigns
values to each of an experiment's outcomes.
●
Arandom variableis a variable that can take on different values randomly.
●
In probability and statistics, random variables are used to quantify outcomes of a
random occurrence, and therefore, can take on many values.
Random Variables
●
Random variables are required to be measurable and are typically real numbers.
●
For example, the letter X may be designated to represent the sum of the resulting
numbers after three dice are rolled. In this case, X could be 3 (1 + 1+ 1), 18 (6 + 6
+ 6), or somewhere between 3 and 18, since the highest number of a die is 6 and
the lowest number is 1.
●
A random variable is different from an algebraic variable.
●
The variable in an algebraic equation is an unknown value that can be calculated.
●
The equation 10 + x = 13 shows that we can calculate the specific value for x
which is 3.
●
On the other hand, a random variable has a set of values, and any of those values
could be the resulting outcome as seen in the example of the dice above.
Types of Random Variables
●
Based on the values a random variable can take, they are classified into
– Discrete Random variable
– Continuous Random variable
Types of Random Variables
●
Discrete Random variable:
– A random variable that can take only a countable number of distinct values or
states, such as 0,1,2,3,4…. is known as a discrete random variable.
– Note that these states are not necessarily the integers; they can also just be
named states that are not considered to have any numerical value.
– Example of discrete random variable includes:
●
The number of heads on tossing a coin.
●
Getting a king from the shuffled deck of cards.
●
The number of successes in the Bernoulli trials.
●
The number of cars passing through a red light in an hour.
– Discrete Random Variables are used to calculate the probability of events,
statistical interference, and machine learning.
Types of Random Variables
●
Continuous Random variable:
– A random variable that can take infinite possible values is known as a
continuous random variable.
– The set of all the possible values of a continuous random variable is an
interval of real numbers.
– Example of continuous random variable includes
●
heights of students
●
weight of employees
●
amount of rainfall.
– Continuous random variables are used to estimate mean and standard
deviation, hypothesis testing, and regression analysis.
Probability Distributions
●
A probability distribution is a description of how likely a random variable or set of random variables is
to take on each of its possible states or values.
or
●
A probability distribution of a random variable is a list of all possible outcomes with the
corresponding probability values.
or
●
The probability distribution for a random variable shows how the probabilities are distributed over the
values of the random variable.
●
Example: The probability distribution for tossing two coins is given as
– Outcome of the toss - HH HT TH TT
– Probability 1/4 1/4 1/4 1/4
●
There are types of probability distribution based on the types of random variables.
– Discrete Probability Distribution
– Continuous Probability Distribution
Probability Distributions
●
Discrete Probability Distribution
– A probability distribution over discrete variables may be described using a
probability mass function(PMF).
●
Continuous Probability Distribution
– When working with continuous random variables, we describe probability
distributions using a probability density function (PDF) rather than a
probability mass function.
Probability Distributions
●
Joint Probability Distribuion:
– Joint probability is the probability that two or more events will occur
simultaneously.
– A joint probability distribution represents a probability distribution for two or
more random variables.
– Instead of events being labelled A and B, the condition is to use X and Y as
given below. f(x,y) = P(X = x, Y = y) The main purpose of this is to look for a
relationship between two variables.
Probability Distributions
●
Marginal Probability Distribuion:
– Sometimes we know the probability distribution over a set of variables and
we want to know the probability distribution over just a subset of them.
– The probability distribution over the subset is known as the marginal
probability distribution
– For example, suppose we have discrete random variables x and y, and we
know P (x, y). We can find P (x) with the sum rule:
Probability Distributions
Chain Rule
●
By the chain rule of probability, we can always represent a joint distribution as
follows, using any ordering of the variables:
●
This observation is known as the chain rule, or product rule, of probability.
●
It follows immediately from the definition of conditional probability.
●
For example, applying the definition twice, we get
Chain Rule
●
Chain rule can also be written as :
where V is the number of variables.
●
For example, suppose all the variables have K states.
●
We can represent p(x1) as a table of O(K) numbers, representing a discrete
distribution.
●
Similarly, we can represent p(x2|x1) as a table of O(K 2) numbers by writing p(x2 = j|
x1 = i) = Tij ; we say that T is a stochastic matrix, since it satisfies the constraint
Σj Tij = 1 for all rows i, and 0 ≤ Tij ≤ 1 for all entries.
●
Similarly, we can represent p(x3|x1, x2) as a 3d table with O(K 3) numbers. These are
called conditional probability tables or CPTs.
●
We see that there are O(KV) parameters in the model.
●
We would need an awful lot of data to learn so many parameters.
Chain Rule
●
One solution is to replace each CPT with a conditional probability distribution or
CPD, such as multinomial logistic regression, i.e., p(xt=k|x1:t-1 ) = S(Wtx1:t-1)k
●
The total number of parameters is now only O(K2V2), making this a compact
density model.
●
However, this model is not useful for other kinds of prediction tasks, since each
variable depends on all the previous variables, So we need another approach.
Independence and Conditional Independence
●
Two random variables x and y are independent if their probability distribution can
be expressed as a product of two factors, one involving only x and one involving
only y
●
Two random variables x and y are conditionally independent given a random
variable z if the conditional probability distribution over x and y factorizes in this
way for every value of z:
●
We can denote independence and conditional independence with compact
notation: x⊥y means that x and y are independent, while x⊥y | z means that x
and y are conditionally independent given z
Independence and Conditional Independence
● Let us see why this might help. Suppose we assume that x t+1⊥x1:t-1|xt , or in other
words, “the future is independent of the past given the present”, This is called the
(first order) Markov assumption.
●
Using this assumption, plus the chain rule, we can write the joint distribution as
follows:
●
This is called a (first-order) Markov chain.
●
Although the first-order Markov assumption is useful for defining distributions on
1d sequences.
●
How can we define distributions on 2d images, or 3d videos, or, in general,
arbitrary collections of variables (such as genes belonging to some biological
pathway)?
●
This is where graphical models come in.
Graphical Model
●
A graphical model (GM) is a way to represent a joint distribution by making
conditional independent assumptions.
●
In particular, the nodes in the graph represent random variables, and the edges
represent CI assumptions.
Graph Terminology
●
A graph G = (V, E) consists of a set of nodes or vertices, V = {1,...,V }, and a set
of edges, E = {(s, t) : s, t ∈ V}.
●
The graph can be represented by its adjacency matrix, in which we write G(s, t)=1
to denote (s, t) ∈ E, that is, if s → t is an edge in the graph.
●
If G(s, t)=1 and G(t, s)=1, then the graph is undirected, otherwise it is directed.
●
We usually assume G(s, s)=0, which means there are no self loops.
Graph Terminology
●
Parent - The parents of a node is the set of all nodes that feed into it: pa(s) = {t :
G(t, s)=1}.
●
Child - The children of a node is the set of all nodes that feed out of it: ch(s) = {t :
G(s,t)=1}.
●
Family - The family of a node is the node and its parents, fam(s) = {s} ∪ pa(s).
●
Root - A root is a node with no parents.
●
Leaf - A leaf is a node with no children.
●
Ancestors - The ancestors are the parents, grand-parents, etc of a node. That is,
the ancestors of t is the set of nodes that connect to t via a trail:
anc(t) = {s : s t}.
Graph Terminology
●
Descendants - The descendants are the children, grand-children, etc of a node.
That is, the descendants of s is the set of nodes that can be reached via trails from
s: desc(s) ! {t : s t}.
●
Neighbors - The neighbors of a node as the set of all immediately connected
nodes, nbr(s) = {t : G(s, t)=1 ∨ G(t, s)=1}. For an undirected graph.
●
Degree - The degree of a node is the number of neighbors. For directed graphs,
we speak of the in-degree and out-degree, which count the number of parents and
children.
●
Cycle or loop - We define a cycle or loop to be a series of nodes such that we can
get back to where we started by following edges, s1 − s2 ··· − sn − s1, n ≥ 2. If
the graph is directed, we may speak of a directed cycle.
●
DAG - A directed acyclic graph or DAG is a directed graph with no directed
cycles.
Graph Terminology
●
Topological ordering For a DAG, a topological ordering or total ordering is a
numbering of the nodes such that parents have lower numbers than their children.
For example, in Figure ,
we can use (1, 2, 3, 4, 5), or (1, 3, 2, 5, 4), etc.
●
Path or trail - A path or trail s t is a series of directed edges leading from s to
t.
●
Tree - An undirected tree is an undirectecd graph with no cycles. A directed tree is
a DAG in which there are no directed cycles. If we allow a node to have multiple
parents, we call it a polytree, otherwise we call it a moral directed tree.
Graph Terminology
●
Forest - A forest is a set of trees.
● Subgraph - A (node-induced) subgraph GA is the graph created by using the nodes
in A and their corresponding edges, GA = (VA, EA).
●
Clique For an undirected graph - A clique is a set of nodes that are all neighbors
of each other.
●
A maximal clique is a clique which cannot be made any larger without losing the
clique property.
– For example, in Figure , {1, 2} is a clique but it is not maximal, since we can
add 3 and still maintain the clique property. In fact, the maximal cliques are as
follows: {1, 2, 3}, {2, 3, 4}, {3, 5}.
Bayesian Network
●
A Bayesian network is a probabilistic graphical model which represents a set of
random variables and their conditional dependencies using a directed acyclic
graph.
●
It is also called a Bayes network, belief network, decision network, or Bayesian
model.
●
A Bayesian Network for a set of variables (nodes) X = { X1,…….Xn}
●
Arcs represent probabilistic dependence among variables
●
Lack of an arc denotes a conditional independence
●
The network structure S is a directed acyclic graph
●
A set P of local probability distributions at each node (Conditional Probability
Table)
Representation in Bayesian Belief Networks
●
Conditional probability table
associated with each node specifies the
conditional distribution for the
variable given its immediate parents in
the graph
●
Each node is asserted to be
conditionally independent of its non-
descendants, given its immediate
parents
Applications of Bayesian Networks
●
Diagnosis: P(cause|symptom)=?
●
Prediction: P(symptom|cause)=?
●
Classification: P(class|data)
●
Decision-making (given a cost function)
Bayesian Networks
●
Structure of the graph Conditional independence relations
● In general, p(X1, X2 ,....XN) = π p(Xi | parents(Xi))
The full joint distribution The graph-structured approximation
●
Requires that graph is acyclic (no directed cycles)
●
2 components to a Bayesian network
– The graph structure (conditional independence assumptions)
– The numerical probabilities (for each variable given its parents)
Examples
A B C ●
Marginal Independence: p(A,B,C) = p(A)
p(B) p(C)
●
Conditionally independent effects:
A
p(A,B,C) = p(B|A)p(C|A)p(A) B and C are
conditionally independent Given A
B C
Examples
●
Independent Causes: p(A,B,C) = p(C|
A B A,B)p(A)p(B) “Explaining away”
●
Markov dependence: p(A,B,C) = p(C|B)
A B C p(B|A)p(A)
Bayesian Network for General Naiive Bayes Model
●
Encoded using a very small number of parameters
●
Linear in the number of variables
Joint distribution from Student Bayesian Network
Bayesian Network Charecterstic
●
Nodes in the network can be topological ordered i.e. there exists a sequence of
nodes such that parent nodes comes before child.
– This property defines Markov order property.
●
It is the assumption a node only depends on its immediate parent, not on
all predecessors.
– Ex. p(x1:x5) = p(x1).p(x2|x1).p(x3|x1,x2).p(x4|x1,x2,x3).p(x5|x1,x2,x3,x4)
= p(x1).p(x2|x1).p(x3|x1).p(x4|x2,x3).p(x5|x3)
– In general
where each term is a CPD
From Factorization to Indipendence
From Factorization to Indipendence
From Factorization to Indipendence
From Factorization to Indipendence
From Factorization to Indipendence
From Factorization to Indipendence
Probabilistic Indipendence Rule
Example
Example
Example
Example
Minimal I-Maps
●
Graph G is minimal I-Map of P if G is an I-Map of P and if there is no G’ (subset
of G) which is an I-Map of P.
●
How to determine A CI B of graph G given C?
●
Deriving this independence for undirected graph, how?
●
The DAG situation sometimes complicated.
D-Separation
●
Directed edge separation(D-Separation)
●
We say an undirected path P is d-separated by a set of nodes E (containing the
evidence) if at least one of the following conditions hold:
– P contains a chain, s → m → t or s ← m ← t, where m ∈ E
– P contains a tent or fork, s ↙m↘ t, where m ∈ E
– P contains a collider or v-structure, s ↘m↙ t, where m is not in E and nor is
any descendant of m
●
A set of nodes A is d-separated from a diferent set of nodes B given a third
observed set E if each undirected path from every node a ∈ A to every node b ∈
B is d-separated by E.
●
The CI properties of a DAG as follows:
xA⊥G xB |xE ⇐⇒ A is d-separated from B given E
Bayes ball algorithm
●
If A is d-separated from B given E
– Shade all nodes in E indicating that they are observed.
– Place “balls” at each node in A, let them “bounce around” according to some
rules
– Then ask if any of the balls reach any of the nodes in B.
●
There are 3 rules (The balls can travel opposite to edge directions)
– A ball can pass through a chain, but not if it is shaded in the middle.
– A ball can pass through a fork, but not if it is shaded in the middle.
– A ball cannot pass through a v-structure, unless it is shaded in the middle.
Bayes ball algorithm
Bayes ball algorithm
●
We can justify the 3 rules of Bayes ball as follows.
1. First consider a chain structure X → Y → Z,
which encodes p(x, y, z) = p(x)p(y|x)p(z|y)
When we condition on y, are x and z independent? We have
and therefore x ⊥ z|y. So observing the middle node of chain breaks it in two.
Bayes ball algorithm
2. Consider the tent structure X ← Y → Z.
The joint is p(x, y, z) = p(y)p(x|y)p(z|y)
When we condition on y, are x and z independent? We have
and therefore x ⊥ z|y. So observing a root node separates its children
Bayes ball algorithm
3. Finally consider a v-structure X → Y ← Z.
The joint is p(x, y, z) = p(x)p(z)p(y|x, z)
When we condition on y, are x and z independent? We have
so x ̸⊥ z|y. However, in the unconditional distribution, we have p(x, z) = p(x)p(z)
●
Conclusion: conditioning on a common child at the bottom of a v-structure makes
its parents become dependent. This important efect is called explaining away,
inter-causal reasoning, or Berkson’s paradox.
Example
●
Which of the following are guarented
to be true?
– V⊥Z
– V⊥Z|T
– U⊥V
– U⊥V|W
– U⊥V|X
– U⊥V|Y
– U⊥V|Z
– W⊥X
– X⊥T|V
– X⊥W|U
Markov Model
• A Markov model is a finite state machine with N distinct
states begins at (Time t = 1) in initial state .
• It moves from current state to next state according to the
transition probabilities associated with the current state
• This kind of system is called Finite or Discrete Markov
model.
Markov Property
• Markov Property : The Current state of the system depends only on
the previous state of the system.
• The State of the system at Time [ T+1 ] depends on the state of the
system at time T.
Xt=1 Xt=2 Xt=3 Xt=4 Xt=5
Discreate Markov Model-Example
• A Discrete Markov Model with 5 states.
• Each aij represents the probability of moving from state ‘ i’ to state
‘j’.
• The probability to start in a given state i is πi .
• The Vector π represents the start probabilities.
• To define Markov model, the following probabilities have to be
specified:
– transition probabilities aij = P(Si | Sj ) and
– initial probabilities πi = P( Si )
Discreate Markov Model-Example
• For the given Markov model, the transition probabilities are
mentioned in the diagram, the initial probabilities for Rain and Dry
is 0.4 and 0.6 respectively. Calculate the probability of a sequence
of states {Dry,Dry,Rain,Rain}
0.7
0.3 Rain Dry 0.7
0.2
Hidden Markov Model
• A Hidden Markov model is a statistical model in which the system
being modelled is assumed to be markov process with unobserved
hidden states.
• HMM is a graphical model which can be used to predict a sequence
of observed variables.
• In Regular Markov models the state is clearly visible to others in
which the state transition probabilities are the parameters only
where as in HMM the state is not visible but the output is visible.
What is an HMM?
• Graphical Model
• Circles indicate states
• Arrows indicate probabilistic dependencies
between states
What is an HMM?
• Green circles are hidden states
• Dependent only on the previous state
• “The past is independent of the future given the
present.”
What is an HMM?
• Purple nodes are observed states
• Dependent only on their corresponding hidden
state
HMM Formalism
S S S S S
K K K K K
• {S, K,
• S : {s1…sN } are the values for the hidden states
• Process moves from one state to another generating a
sequence of states.
• States are not visible, but each state randomly generates one
of M observation or visible states. K : {k1…kM } are the
values for the observations
HMM Formalism
S A S A S A S A S
B B B
K K K K K
• {S, K,
• are the initial state probabilities
• A = {aij} are the state transition probabilities
• B = {bik} are the observation state probabilities
HMM Example
0.7
0.3 low high 0.8
0.2
0.6
0.4 0.6
0.4
Rain Dry
• For the given HMM hidden states are low and high and
observation states are Rain and Dry. The inital probabilities
of low and high are 0.4 and 0.6 respectively. Calculate the
sequence probability for the observed sequence
{Dry,Rain}?
Types of HMM structure
• Ergodic model:
– Every aij is positive
– Every transition is possible
• Bakis (Left-to-Right) model:
– Can not go backward
– Aij = 0 , j<i
– Good for temporal data like speech recognition.
Main issues in HMM
• Evaluation problem:
– Given the the HMM M={S, K, observation sequence O =
o1o2o3....ok , calculate the probability that model M has generated the
sequence O.
– To solve this problem “forward-backword algorithm is used.
• Decoding problem:
– Given the the HMM M={S, K, observation sequence O =
o1o2o3....ok , calculate the most likely sequence of hidden states Si that
produced this observation sequence O.
– To solve this problem Viterbi algorithm is used
• Learning problem:
– Given some training observation sequences O = o1o2o3....ok general
structure of the HMM(number of hidden and visible states), determine
HMM parameters M={ that best fit training data.
– To solve this problem Baum-Weleh algorithm is used.
Inference in an HMM
• Compute the probability of a given observation
sequence
• Given an observation sequence, compute the most
likely hidden state sequence
• Given an observation sequence and set of possible
models, which model most closely fits the data?
Gaussian Mixture Model
●
In many engineering applications, such as signal processing, communication,
machine learning, and others, a Gaussian distribution is commonly used to model
data distributions.
●
Which means a particular data distribution can be approximated by a single
Gaussian distribution.
●
Advantages
– It is very closer to natural distribution
– Mathematical manipulations for Gaussian functions is easy,like detrmining the
nth order derivative is easy.
Gaussian Mixture Model
●
Disadvantage
– Complex data distributions cannot always be modeled accurately by a single
Gaussian.
●
Solution
– A Gaussian Mixture Model (GMM) represents data as a combination of
multiple Gaussian distributions
– Captures diverse patterns in complex datasets.
Gaussian Mixture Model
●
Parameters of Gaussian Distribution
– For single Gaussian / 1-D Gaussian distribution
●
Mean (μ): Central tendency of the distribution
●
Variance (σ²): Measure of spread or dispersion
– For mixture of Gaussian / Multinomial distribution
●
Mean Vector (μ): Generalization of mean to multiple dimensions
●
Covariance Matrix (Σ): Describes the spread and relationships between
dimensions
Gaussian Mixture Model
●
The parameters of Gaussian distribution can be detrmined by some estimation
technique like maximum liklihood estimation (MLE)
●
But, in case of GMM, MLE technique can not be applied to detrmine the
parameters of GMM.
●
Because MLE can not give a closed form solution.
●
The solution is to use an itterative algorithm like Expectation Maximization (EM)
Algorithm to estimate the parameters of GMM.
Gaussian Mixture Model
●
1-D example
– Observations / data x1,x2,.....xn
●
K=2 Gaussian with unknown
μ and σ²
Gaussian Mixture Model
●
1-D example
– Observations / data x1,x2,.....xn
●
K=2 Gaussian with unknown
μ and σ²
●
Estimation of parameters is
trivial if we know the source
of each observation.
Gaussian Mixture Model
●
1-D example
– Observations / data x1,x2,.....xn
●
K=2 Gaussian with unknown
μ and σ²
●
Estimation of parameters is
trivial if we know the source
of each observation.
Gaussian Mixture Model
●
1-D example
– Observations / data x1,x2,.....xn
●
K=2 Gaussian with unknown
μ and σ²
●
Estimation of parameters is
trivial if we know the source
of each observation.
Gaussian Mixture Model
●
1-D example
– What if we do not know the source?
Gaussian Mixture Model
●
1-D example
– What if we do not know the source?
– If we knew the parameters of the Gaussian (μ, σ²)
●
Can guess whether point is more likely to be a or b
Gaussian Mixture Model
●
1-D example
– What if we do not know the source?
– If we knew the parameters of the Gaussian (μ, σ²)
●
Can guess whether point is more likely to be a or b
Gaussian Mixture Model
●
1-D example
– What if we do not know the source?
– If we knew the parameters of the Gaussian (μ, σ²)
●
Can guess whether point is more likely to be a or b
Gaussian Mixture Model
●
1-D example
– What if we do not know the source?
– If we knew the parameters of the Gaussian (μ, σ²)
●
Can guess whether point is more likely to be a or b
The Problem
●
We have the dataset that we believe is drawn from n distributions.
●
We want to identify the parameters of each distributions.
●
We do not know anything about the populations a priori.
– Except we believe that they are Gaussians
●
Fit a set of n Gaussians to the data
The Problem
Mixture Models
●
Formally a Mixture Model is the weighted sum of a number of the pdfs where the
weights are determined by a distribution, .
Gaussian Mixture Model
●
GMM: The weighted sum of a number of Gaussians where the weights are
determined by a distribution .
Gaussian Mixture Model
●
GMM: The weighted sum of a number of Gaussians where the weights are
determined by a distribution .
Gaussian Mixture Models
Expectation Maximization Algorithm
●
The Expectation Maximization (EM) Algorithm (EM algorithm) is a method to
find MLE of the parameters of a statistical model in cases where the equations
cannot be solved directly.
●
Gaussian mixture model is a kind of statistcal model which involves latent
variables (un obsereved or hidden) and hence cannot be solved directly using MLE
method.
●
GMM models can be used to cluster unlabelled data points.
●
That is not knowing what samples came from which class or distribution.
●
Our goal is to use GMM to assign the data points to the appropriate cluster.
●
Since GMM contains latent variables, we apply EM algorithm to solve the
problem.
Expectation Maximization Algorithm
●
In a GMM:
●
●
Expectation Maximization Algorithm
●
General outline of EM Algorithm
– Step-1: Initialize the parameters θ.
– Step-2: Expectation step(E-Step) : Using the observed available data of the
dataset, estimate(guess) the values of the missing data.
– Step-3: Maximization step(M-Step): Complete data generated after the
expectation step is used to update the parameters θ by maximizing likelyhood
function.
– Step-4: Repeat step-2 and 3 until converge.
EM Algorithm for GMM
●
General outline of EM Algorithm
– Step-1: Randomly initialize the parameters
– Step-2: Expectation step(E-Step) :
●
Compute the responsibilities: the probability that each data point belongs
to each Gaussian component.
●
For datapoint xi, the responsibility rik is:
– Here
:Multivariate Gaussian probability density function for
component 𝑘
– r Responsibility that component k generated x
ik i
Expectation Maximization Algorithm
●
General outline of EM Algorithm
– Step-3: Maximization step(M-Step): Update parameters
to maximize the likelihood based on the responsibilities:
●
Update mean:
●
Update covariance matrix:
●
Update mixing coefficients: Here, N is the total number of
data points.
– Step-4: Repeat step-2 and 3 until converge.