0% found this document useful (0 votes)

16 views98 pages

AML-IV New

This document provides an overview of probability theory, its significance in artificial intelligence, and its application in machine learning. It discusses the concepts of random variables, their types, probability distributions, and the importance of independence and conditional independence in modeling. Additionally, it introduces graphical models as a way to represent joint distributions and the terminology associated with graph theory.

Uploaded by

Dibyajyoti Dash

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

16 views98 pages

AML-IV New

Uploaded by

Dibyajyoti Dash

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 98

AML

Nayan Ranjan Paul

Assistant Professor
Silicon Institute of Technology
Module - IV
Probability Theory
●
Probability theory is a mathematical framework for representing uncertain
statements.
●
It provides a means of quantifying uncertainty as well as axioms for deriving new
uncertain statements.
●
In artificial intelligence applications, we use probability theory in two major
ways.
– First, the laws of probability tell us how AI systems should reason, so we
design our algorithms to compute or approximate various expressions derived
using probability theory.
– Second, we can use probability and statistics to theoretically analyze the
behavior of proposed AI systems.
Why Probability?
●
Many branches of computer science deal mostly with entities that are entirely
deterministic and certain.
●
A programmer can usually safely assume that a CPU will execute each machine
instruction flawlessly.
●
Given that many computer scientists and software engineers work in a relatively
clean and certain environment.
●
It can be surprising that machine learning makes heavy use of probability theory.
●
Machine learning must always deal with uncertain quantities and sometimes
stochastic (nondeterministic) quantities.
Why Probability?
●
There are three possible sources of uncertainty:
– Inherent stochasticity in the system being modeled.
– Incomplete observability. Even deterministic systems can appear stochastic when we
cannot observe all the variables that drive the behavior of the system.
– Incomplete modeling. When we use a model that must discard some of the
information we have observed, the discarded information results in uncertainty in the
model’s predictions.
●
In many cases, it is more practical to use a simple but uncertain rule rather than a complex
but certain one.
●
For example, thesimple rule “Most birds fly” is cheap to develop and is broadly useful,
●
while a rule of the form, “Birds fly, except for very young birds that have not yet learned
to fly, sick or injured birds that have lost the ability to fly, flightless species of
birdsincluding the cassowary, ostrich and kiwi. . .” is expensive to develop, maintain and
communicate and, after all this effort, is still brittle and prone to failure.
Probability Theory
●
Probability theory was originally developed to analyze the frequencies of events.
●
It is easy to see how probability theory can be used to study events like drawing a
certain hand of cards in a poker game.
●
These kinds of events are often repeatable.
●
When we say that an outcome has a probability p of occurring, it means that if we
repeated the experiment (e.g., drawing a hand of cards) infinitely many times,
then a proportion p of the repetitions would result in that outcome.
●
This kind of reasoning does not seem immediately applicable to propositions that
are not repeatable.
Probability Theory
●
For example, if a doctor analyzes a patient and says that the patient has a 40
percent chance of having the flu, this means something very different—we cannot
make infinitely many replicas of the patient, nor is there any reason to believe that
different replicas of the patient would present with the same symptoms yet have
varying underlying conditions.
●
In the case of the doctor diagnosing the patient, we use probability to represent a
degree of belief, with 1 indicating absolute certainty that the patient has the flu
and 0 indicating absolute certainty that the patient does not have the flu.
●
The former kind of probability, related directly to the rates at which events occur,
is known as frequentist probability, while the latter, related to qualitative levels of
certainty, is known as Bayesian probability.
●
Probability theory provides a set of formal rules for determining the likelihood of
a proposition being true given the likelihood of other propositions.
Random Variables
●
Random variables are the numerical descriptions of the outcomes of any statistical
experiment.
●
In simple terms, a random variable is a rule or a function that associates a
numerical value with each outcome in a sample space of an experiment.
●
A random variable is a variable whose value is unknown or a function that assigns
values to each of an experiment's outcomes.
●
Arandom variableis a variable that can take on different values randomly.
●
In probability and statistics, random variables are used to quantify outcomes of a
random occurrence, and therefore, can take on many values.
Random Variables
●
Random variables are required to be measurable and are typically real numbers.
●
For example, the letter X may be designated to represent the sum of the resulting
numbers after three dice are rolled. In this case, X could be 3 (1 + 1+ 1), 18 (6 + 6
+ 6), or somewhere between 3 and 18, since the highest number of a die is 6 and
the lowest number is 1.
●
A random variable is different from an algebraic variable.
●
The variable in an algebraic equation is an unknown value that can be calculated.
●
The equation 10 + x = 13 shows that we can calculate the specific value for x
which is 3.
●
On the other hand, a random variable has a set of values, and any of those values
could be the resulting outcome as seen in the example of the dice above.
Types of Random Variables
●
Based on the values a random variable can take, they are classified into
– Discrete Random variable
– Continuous Random variable
Types of Random Variables
●
Discrete Random variable:
– A random variable that can take only a countable number of distinct values or
states, such as 0,1,2,3,4…. is known as a discrete random variable.
– Note that these states are not necessarily the integers; they can also just be
named states that are not considered to have any numerical value.
– Example of discrete random variable includes:
●
The number of heads on tossing a coin.
●
Getting a king from the shuffled deck of cards.
●
The number of successes in the Bernoulli trials.
●
The number of cars passing through a red light in an hour.
– Discrete Random Variables are used to calculate the probability of events,
statistical interference, and machine learning.
Types of Random Variables
●
Continuous Random variable:
– A random variable that can take infinite possible values is known as a
continuous random variable.
– The set of all the possible values of a continuous random variable is an
interval of real numbers.
– Example of continuous random variable includes
●
heights of students
●
weight of employees
●
amount of rainfall.
– Continuous random variables are used to estimate mean and standard
deviation, hypothesis testing, and regression analysis.
Probability Distributions
●
A probability distribution is a description of how likely a random variable or set of random variables is
to take on each of its possible states or values.
or
●
A probability distribution of a random variable is a list of all possible outcomes with the
corresponding probability values.
or
●
The probability distribution for a random variable shows how the probabilities are distributed over the
values of the random variable.
●
Example: The probability distribution for tossing two coins is given as
– Outcome of the toss - HH HT TH TT
– Probability 1/4 1/4 1/4 1/4
●
There are types of probability distribution based on the types of random variables.
– Discrete Probability Distribution
– Continuous Probability Distribution
Probability Distributions
●
Discrete Probability Distribution
– A probability distribution over discrete variables may be described using a
probability mass function(PMF).
●
Continuous Probability Distribution
– When working with continuous random variables, we describe probability
distributions using a probability density function (PDF) rather than a
probability mass function.
Probability Distributions
●
Joint Probability Distribuion:
– Joint probability is the probability that two or more events will occur
simultaneously.
– A joint probability distribution represents a probability distribution for two or
more random variables.
– Instead of events being labelled A and B, the condition is to use X and Y as
given below. f(x,y) = P(X = x, Y = y) The main purpose of this is to look for a
relationship between two variables.
Probability Distributions
●
Marginal Probability Distribuion:
– Sometimes we know the probability distribution over a set of variables and
we want to know the probability distribution over just a subset of them.
– The probability distribution over the subset is known as the marginal
probability distribution
– For example, suppose we have discrete random variables x and y, and we
know P (x, y). We can find P (x) with the sum rule:
Probability Distributions
Chain Rule
●
By the chain rule of probability, we can always represent a joint distribution as
follows, using any ordering of the variables:

●
This observation is known as the chain rule, or product rule, of probability.
●
It follows immediately from the deﬁnition of conditional probability.
●
For example, applying the deﬁnition twice, we get
Chain Rule
●
Chain rule can also be written as :
where V is the number of variables.
●
For example, suppose all the variables have K states.
●
We can represent p(x1) as a table of O(K) numbers, representing a discrete
distribution.
●
Similarly, we can represent p(x2|x1) as a table of O(K 2) numbers by writing p(x2 = j|
x1 = i) = Tij ; we say that T is a stochastic matrix, since it satisfies the constraint
Σj Tij = 1 for all rows i, and 0 ≤ Tij ≤ 1 for all entries.
●
Similarly, we can represent p(x3|x1, x2) as a 3d table with O(K 3) numbers. These are
called conditional probability tables or CPTs.
●
We see that there are O(KV) parameters in the model.
●
We would need an awful lot of data to learn so many parameters.
Chain Rule
●
One solution is to replace each CPT with a conditional probability distribution or
CPD, such as multinomial logistic regression, i.e., p(xt=k|x1:t-1 ) = S(Wtx1:t-1)k

●
The total number of parameters is now only O(K2V2), making this a compact
density model.

●
However, this model is not useful for other kinds of prediction tasks, since each
variable depends on all the previous variables, So we need another approach.
Independence and Conditional Independence
●
Two random variables x and y are independent if their probability distribution can
be expressed as a product of two factors, one involving only x and one involving
only y

●
Two random variables x and y are conditionally independent given a random
variable z if the conditional probability distribution over x and y factorizes in this
way for every value of z:

●
We can denote independence and conditional independence with compact
notation: x⊥y means that x and y are independent, while x⊥y | z means that x
and y are conditionally independent given z
Independence and Conditional Independence
● Let us see why this might help. Suppose we assume that x t+1⊥x1:t-1|xt , or in other
words, “the future is independent of the past given the present”, This is called the
(first order) Markov assumption.
●
Using this assumption, plus the chain rule, we can write the joint distribution as
follows:

●
This is called a (first-order) Markov chain.
●
Although the first-order Markov assumption is useful for defining distributions on
1d sequences.
●
How can we define distributions on 2d images, or 3d videos, or, in general,
arbitrary collections of variables (such as genes belonging to some biological
pathway)?
●
This is where graphical models come in.
Graphical Model
●
A graphical model (GM) is a way to represent a joint distribution by making
conditional independent assumptions.
●
In particular, the nodes in the graph represent random variables, and the edges
represent CI assumptions.
Graph Terminology
●
A graph G = (V, E) consists of a set of nodes or vertices, V = {1,...,V }, and a set
of edges, E = {(s, t) : s, t ∈ V}.

●
The graph can be represented by its adjacency matrix, in which we write G(s, t)=1
to denote (s, t) ∈ E, that is, if s → t is an edge in the graph.

●
If G(s, t)=1 and G(t, s)=1, then the graph is undirected, otherwise it is directed.

●
We usually assume G(s, s)=0, which means there are no self loops.
Graph Terminology
●
Parent - The parents of a node is the set of all nodes that feed into it: pa(s) = {t :
G(t, s)=1}.
●
Child - The children of a node is the set of all nodes that feed out of it: ch(s) = {t :
G(s,t)=1}.
●
Family - The family of a node is the node and its parents, fam(s) = {s} ∪ pa(s).
●
Root - A root is a node with no parents.
●
Leaf - A leaf is a node with no children.
●
Ancestors - The ancestors are the parents, grand-parents, etc of a node. That is,
the ancestors of t is the set of nodes that connect to t via a trail:
anc(t) = {s : s t}.
Graph Terminology
●
Descendants - The descendants are the children, grand-children, etc of a node.
That is, the descendants of s is the set of nodes that can be reached via trails from
s: desc(s) ! {t : s t}.
●
Neighbors - The neighbors of a node as the set of all immediately connected
nodes, nbr(s) = {t : G(s, t)=1 ∨ G(t, s)=1}. For an undirected graph.
●
Degree - The degree of a node is the number of neighbors. For directed graphs,
we speak of the in-degree and out-degree, which count the number of parents and
children.
●
Cycle or loop - We define a cycle or loop to be a series of nodes such that we can
get back to where we started by following edges, s1 − s2 ··· − sn − s1, n ≥ 2. If
the graph is directed, we may speak of a directed cycle.
●
DAG - A directed acyclic graph or DAG is a directed graph with no directed
cycles.
Graph Terminology
●
Topological ordering For a DAG, a topological ordering or total ordering is a
numbering of the nodes such that parents have lower numbers than their children.
For example, in Figure ,

we can use (1, 2, 3, 4, 5), or (1, 3, 2, 5, 4), etc.

●
Path or trail - A path or trail s t is a series of directed edges leading from s to
t.
●
Tree - An undirected tree is an undirectecd graph with no cycles. A directed tree is
a DAG in which there are no directed cycles. If we allow a node to have multiple
parents, we call it a polytree, otherwise we call it a moral directed tree.
Graph Terminology
●
Forest - A forest is a set of trees.
● Subgraph - A (node-induced) subgraph GA is the graph created by using the nodes
in A and their corresponding edges, GA = (VA, EA).
●
Clique For an undirected graph - A clique is a set of nodes that are all neighbors
of each other.
●
A maximal clique is a clique which cannot be made any larger without losing the
clique property.
– For example, in Figure , {1, 2} is a clique but it is not maximal, since we can
add 3 and still maintain the clique property. In fact, the maximal cliques are as
follows: {1, 2, 3}, {2, 3, 4}, {3, 5}.
Bayesian Network
●
A Bayesian network is a probabilistic graphical model which represents a set of
random variables and their conditional dependencies using a directed acyclic
graph.
●
It is also called a Bayes network, belief network, decision network, or Bayesian
model.
●
A Bayesian Network for a set of variables (nodes) X = { X1,…….Xn}
●
Arcs represent probabilistic dependence among variables
●
Lack of an arc denotes a conditional independence
●
The network structure S is a directed acyclic graph
●
A set P of local probability distributions at each node (Conditional Probability
Table)
Representation in Bayesian Belief Networks
●
Conditional probability table
associated with each node specifies the
conditional distribution for the
variable given its immediate parents in
the graph
●
Each node is asserted to be
conditionally independent of its non-
descendants, given its immediate
parents
Applications of Bayesian Networks
●
Diagnosis: P(cause|symptom)=?
●
Prediction: P(symptom|cause)=?
●
Classification: P(class|data)
●
Decision-making (given a cost function)
Bayesian Networks
●
Structure of the graph Conditional independence relations
● In general, p(X1, X2 ,....XN) = π p(Xi | parents(Xi))

The full joint distribution The graph-structured approximation

●
Requires that graph is acyclic (no directed cycles)
●
2 components to a Bayesian network
– The graph structure (conditional independence assumptions)
– The numerical probabilities (for each variable given its parents)
Examples
A B C ●
Marginal Independence: p(A,B,C) = p(A)
p(B) p(C)

●
Conditionally independent effects:
A
p(A,B,C) = p(B|A)p(C|A)p(A) B and C are
conditionally independent Given A
B C
Examples
●
Independent Causes: p(A,B,C) = p(C|
A B A,B)p(A)p(B) “Explaining away”

●
Markov dependence: p(A,B,C) = p(C|B)
A B C p(B|A)p(A)
Bayesian Network for General Naiive Bayes Model

●
Encoded using a very small number of parameters
●
Linear in the number of variables
Joint distribution from Student Bayesian Network
Bayesian Network Charecterstic
●
Nodes in the network can be topological ordered i.e. there exists a sequence of
nodes such that parent nodes comes before child.
– This property defines Markov order property.
●
It is the assumption a node only depends on its immediate parent, not on
all predecessors.
– Ex. p(x1:x5) = p(x1).p(x2|x1).p(x3|x1,x2).p(x4|x1,x2,x3).p(x5|x1,x2,x3,x4)
= p(x1).p(x2|x1).p(x3|x1).p(x4|x2,x3).p(x5|x3)
– In general
where each term is a CPD
From Factorization to Indipendence
From Factorization to Indipendence
From Factorization to Indipendence
From Factorization to Indipendence
From Factorization to Indipendence
From Factorization to Indipendence
Probabilistic Indipendence Rule
Example
Example
Example
Example
Minimal I-Maps
●
Graph G is minimal I-Map of P if G is an I-Map of P and if there is no G’ (subset
of G) which is an I-Map of P.
●
How to determine A CI B of graph G given C?
●
Deriving this independence for undirected graph, how?
●
The DAG situation sometimes complicated.
D-Separation
●
Directed edge separation(D-Separation)
●
We say an undirected path P is d-separated by a set of nodes E (containing the
evidence) if at least one of the following conditions hold:
– P contains a chain, s → m → t or s ← m ← t, where m ∈ E
– P contains a tent or fork, s ↙m↘ t, where m ∈ E
– P contains a collider or v-structure, s ↘m↙ t, where m is not in E and nor is
any descendant of m
●
A set of nodes A is d-separated from a diferent set of nodes B given a third
observed set E if each undirected path from every node a ∈ A to every node b ∈
B is d-separated by E.
●
The CI properties of a DAG as follows:
xA⊥G xB |xE ⇐⇒ A is d-separated from B given E
Bayes ball algorithm
●
If A is d-separated from B given E
– Shade all nodes in E indicating that they are observed.
– Place “balls” at each node in A, let them “bounce around” according to some
rules
– Then ask if any of the balls reach any of the nodes in B.
●
There are 3 rules (The balls can travel opposite to edge directions)
– A ball can pass through a chain, but not if it is shaded in the middle.
– A ball can pass through a fork, but not if it is shaded in the middle.
– A ball cannot pass through a v-structure, unless it is shaded in the middle.
Bayes ball algorithm
Bayes ball algorithm
●
We can justify the 3 rules of Bayes ball as follows.

1. First consider a chain structure X → Y → Z,

which encodes p(x, y, z) = p(x)p(y|x)p(z|y)
When we condition on y, are x and z independent? We have

and therefore x ⊥ z|y. So observing the middle node of chain breaks it in two.
Bayes ball algorithm

2. Consider the tent structure X ← Y → Z.

The joint is p(x, y, z) = p(y)p(x|y)p(z|y)
When we condition on y, are x and z independent? We have

and therefore x ⊥ z|y. So observing a root node separates its children

Bayes ball algorithm

3. Finally consider a v-structure X → Y ← Z.

The joint is p(x, y, z) = p(x)p(z)p(y|x, z)
When we condition on y, are x and z independent? We have

so x ̸⊥ z|y. However, in the unconditional distribution, we have p(x, z) = p(x)p(z)

●
Conclusion: conditioning on a common child at the bottom of a v-structure makes
its parents become dependent. This important efect is called explaining away,
inter-causal reasoning, or Berkson’s paradox.
Example
●
Which of the following are guarented
to be true?
– V⊥Z
– V⊥Z|T
– U⊥V
– U⊥V|W
– U⊥V|X
– U⊥V|Y
– U⊥V|Z
– W⊥X
– X⊥T|V
– X⊥W|U
Markov Model
• A Markov model is a finite state machine with N distinct
states begins at (Time t = 1) in initial state .

• It moves from current state to next state according to the

transition probabilities associated with the current state

• This kind of system is called Finite or Discrete Markov

model.
Markov Property
• Markov Property : The Current state of the system depends only on
the previous state of the system.

• The State of the system at Time [ T+1 ] depends on the state of the
system at time T.

Xt=1 Xt=2 Xt=3 Xt=4 Xt=5

Discreate Markov Model-Example
• A Discrete Markov Model with 5 states.
• Each aij represents the probability of moving from state ‘ i’ to state
‘j’.
• The probability to start in a given state i is πi .
• The Vector π represents the start probabilities.
• To define Markov model, the following probabilities have to be
specified:
– transition probabilities aij = P(Si | Sj ) and
– initial probabilities πi = P( Si )
Discreate Markov Model-Example

• For the given Markov model, the transition probabilities are

mentioned in the diagram, the initial probabilities for Rain and Dry
is 0.4 and 0.6 respectively. Calculate the probability of a sequence
of states {Dry,Dry,Rain,Rain}

0.7

0.3 Rain Dry 0.7

0.2
Hidden Markov Model
• A Hidden Markov model is a statistical model in which the system
being modelled is assumed to be markov process with unobserved
hidden states.
• HMM is a graphical model which can be used to predict a sequence
of observed variables.
• In Regular Markov models the state is clearly visible to others in
which the state transition probabilities are the parameters only
where as in HMM the state is not visible but the output is visible.
What is an HMM?

• Graphical Model
• Circles indicate states
• Arrows indicate probabilistic dependencies
between states
What is an HMM?

• Green circles are hidden states

• Dependent only on the previous state
• “The past is independent of the future given the
present.”
What is an HMM?

• Purple nodes are observed states

• Dependent only on their corresponding hidden
state
HMM Formalism

S S S S S

K K K K K

• {S, K, 
• S : {s1…sN } are the values for the hidden states
• Process moves from one state to another generating a
sequence of states.
• States are not visible, but each state randomly generates one
of M observation or visible states. K : {k1…kM } are the
values for the observations
HMM Formalism

S A S A S A S A S

B B B

K K K K K

• {S, K, 
• are the initial state probabilities
• A = {aij} are the state transition probabilities
• B = {bik} are the observation state probabilities
HMM Example
0.7

0.3 low high 0.8

0.2

0.6
0.4 0.6

0.4
Rain Dry

• For the given HMM hidden states are low and high and
observation states are Rain and Dry. The inital probabilities
of low and high are 0.4 and 0.6 respectively. Calculate the
sequence probability for the observed sequence
{Dry,Rain}?
Types of HMM structure
• Ergodic model:
– Every aij is positive
– Every transition is possible
• Bakis (Left-to-Right) model:
– Can not go backward
– Aij = 0 , j<i
– Good for temporal data like speech recognition.
Main issues in HMM
• Evaluation problem:
– Given the the HMM M={S, K,  observation sequence O =
o1o2o3....ok , calculate the probability that model M has generated the
sequence O.
– To solve this problem “forward-backword algorithm is used.

• Decoding problem:
– Given the the HMM M={S, K,  observation sequence O =
o1o2o3....ok , calculate the most likely sequence of hidden states Si that
produced this observation sequence O.
– To solve this problem Viterbi algorithm is used
• Learning problem:
– Given some training observation sequences O = o1o2o3....ok general
structure of the HMM(number of hidden and visible states), determine
HMM parameters M={  that best fit training data.
– To solve this problem Baum-Weleh algorithm is used.
Inference in an HMM

• Compute the probability of a given observation

sequence
• Given an observation sequence, compute the most
likely hidden state sequence
• Given an observation sequence and set of possible
models, which model most closely fits the data?
Gaussian Mixture Model
●
In many engineering applications, such as signal processing, communication,
machine learning, and others, a Gaussian distribution is commonly used to model
data distributions.
●
Which means a particular data distribution can be approximated by a single
Gaussian distribution.
●
Advantages
– It is very closer to natural distribution
– Mathematical manipulations for Gaussian functions is easy,like detrmining the
nth order derivative is easy.
Gaussian Mixture Model
●
Disadvantage
– Complex data distributions cannot always be modeled accurately by a single
Gaussian.
●
Solution
– A Gaussian Mixture Model (GMM) represents data as a combination of
multiple Gaussian distributions
– Captures diverse patterns in complex datasets.
Gaussian Mixture Model
●
Parameters of Gaussian Distribution
– For single Gaussian / 1-D Gaussian distribution
●
Mean (μ): Central tendency of the distribution
●
Variance (σ²): Measure of spread or dispersion
– For mixture of Gaussian / Multinomial distribution
●
Mean Vector (μ): Generalization of mean to multiple dimensions
●
Covariance Matrix (Σ): Describes the spread and relationships between
dimensions
Gaussian Mixture Model
●
The parameters of Gaussian distribution can be detrmined by some estimation
technique like maximum liklihood estimation (MLE)

●
But, in case of GMM, MLE technique can not be applied to detrmine the
parameters of GMM.

●
Because MLE can not give a closed form solution.

●
The solution is to use an itterative algorithm like Expectation Maximization (EM)
Algorithm to estimate the parameters of GMM.
Gaussian Mixture Model
●
1-D example
– Observations / data x1,x2,.....xn
●
K=2 Gaussian with unknown
μ and σ²
Gaussian Mixture Model
●
1-D example
– Observations / data x1,x2,.....xn
●
K=2 Gaussian with unknown
μ and σ²
●
Estimation of parameters is
trivial if we know the source
of each observation.
Gaussian Mixture Model
●
1-D example
– Observations / data x1,x2,.....xn
●
K=2 Gaussian with unknown
μ and σ²
●
Estimation of parameters is
trivial if we know the source
of each observation.
Gaussian Mixture Model
●
1-D example
– Observations / data x1,x2,.....xn
●
K=2 Gaussian with unknown
μ and σ²
●
Estimation of parameters is
trivial if we know the source
of each observation.
Gaussian Mixture Model
●
1-D example
– What if we do not know the source?
Gaussian Mixture Model
●
1-D example
– What if we do not know the source?
– If we knew the parameters of the Gaussian (μ, σ²)
●
Can guess whether point is more likely to be a or b
Gaussian Mixture Model
●
1-D example
– What if we do not know the source?
– If we knew the parameters of the Gaussian (μ, σ²)
●
Can guess whether point is more likely to be a or b
Gaussian Mixture Model
●
1-D example
– What if we do not know the source?
– If we knew the parameters of the Gaussian (μ, σ²)
●
Can guess whether point is more likely to be a or b
Gaussian Mixture Model
●
1-D example
– What if we do not know the source?
– If we knew the parameters of the Gaussian (μ, σ²)
●
Can guess whether point is more likely to be a or b
The Problem
●
We have the dataset that we believe is drawn from n distributions.
●
We want to identify the parameters of each distributions.
●
We do not know anything about the populations a priori.
– Except we believe that they are Gaussians
●
Fit a set of n Gaussians to the data
The Problem
Mixture Models
●
Formally a Mixture Model is the weighted sum of a number of the pdfs where the
weights are determined by a distribution, .
Gaussian Mixture Model
●
GMM: The weighted sum of a number of Gaussians where the weights are
determined by a distribution .
Gaussian Mixture Model
●
GMM: The weighted sum of a number of Gaussians where the weights are
determined by a distribution .
Gaussian Mixture Models
Expectation Maximization Algorithm
●
The Expectation Maximization (EM) Algorithm (EM algorithm) is a method to
find MLE of the parameters of a statistical model in cases where the equations
cannot be solved directly.
●
Gaussian mixture model is a kind of statistcal model which involves latent
variables (un obsereved or hidden) and hence cannot be solved directly using MLE
method.
●
GMM models can be used to cluster unlabelled data points.
●
That is not knowing what samples came from which class or distribution.
●
Our goal is to use GMM to assign the data points to the appropriate cluster.
●
Since GMM contains latent variables, we apply EM algorithm to solve the
problem.
Expectation Maximization Algorithm
●
In a GMM:
●

●
Expectation Maximization Algorithm
●
General outline of EM Algorithm
– Step-1: Initialize the parameters θ.

– Step-2: Expectation step(E-Step) : Using the observed available data of the

dataset, estimate(guess) the values of the missing data.

– Step-3: Maximization step(M-Step): Complete data generated after the

expectation step is used to update the parameters θ by maximizing likelyhood
function.

– Step-4: Repeat step-2 and 3 until converge.

EM Algorithm for GMM
●
General outline of EM Algorithm
– Step-1: Randomly initialize the parameters
– Step-2: Expectation step(E-Step) :
●
Compute the responsibilities: the probability that each data point belongs
to each Gaussian component.
●
For datapoint xi, the responsibility rik is:
– Here
:Multivariate Gaussian probability density function for
component 𝑘
– r Responsibility that component k generated x
ik i
Expectation Maximization Algorithm
●
General outline of EM Algorithm
– Step-3: Maximization step(M-Step): Update parameters
to maximize the likelihood based on the responsibilities:

●
Update mean:

●
Update covariance matrix:

●
Update mixing coefficients: Here, N is the total number of
data points.

– Step-4: Repeat step-2 and 3 until converge.

Unit1 - Read-Only
No ratings yet
Unit1 - Read-Only
191 pages
Unit 1 - Probability
No ratings yet
Unit 1 - Probability
29 pages
Unit 1 - Prob
No ratings yet
Unit 1 - Prob
48 pages
Unit 1 - Deep Learning
No ratings yet
Unit 1 - Deep Learning
49 pages
Probability 360
No ratings yet
Probability 360
74 pages
vt59.2708-21417172328 1042234447002868 534542395353161222 N.pdfintroduction-Final - PDF NC Cat 103&ccb 1
No ratings yet
vt59.2708-21417172328 1042234447002868 534542395353161222 N.pdfintroduction-Final - PDF NC Cat 103&ccb 1
53 pages
Module 1 Lecture 4-Probability Distributions
No ratings yet
Module 1 Lecture 4-Probability Distributions
39 pages
Random Variables and Distributions - New
No ratings yet
Random Variables and Distributions - New
84 pages
ML Unit2-1
No ratings yet
ML Unit2-1
11 pages
Probability Theory and Stochastic Processes
0% (1)
Probability Theory and Stochastic Processes
155 pages
Reflective Essay of Probability Statistics
No ratings yet
Reflective Essay of Probability Statistics
24 pages
CHP 5
No ratings yet
CHP 5
63 pages
PTSP 2019 Class Notes
No ratings yet
PTSP 2019 Class Notes
177 pages
Probability and Random Processes-1-10
No ratings yet
Probability and Random Processes-1-10
10 pages
Ma s1-2 Discrete-Probability-Distributions 201024
No ratings yet
Ma s1-2 Discrete-Probability-Distributions 201024
41 pages
Random Variables & Probability Module
No ratings yet
Random Variables & Probability Module
60 pages
Probability Distribution: Question Booklet
No ratings yet
Probability Distribution: Question Booklet
8 pages
Probability FoundationalMathofAI S24
No ratings yet
Probability FoundationalMathofAI S24
7 pages
Unit-IV Engineering Maths-III (Defn and Problems)
No ratings yet
Unit-IV Engineering Maths-III (Defn and Problems)
14 pages
Probability and Random Processes
No ratings yet
Probability and Random Processes
167 pages
Module-Iii Probability Distributions: RV Institute of Technology & Management
No ratings yet
Module-Iii Probability Distributions: RV Institute of Technology & Management
30 pages
Unit 1 Review of Probability and Basic Statistics
100% (1)
Unit 1 Review of Probability and Basic Statistics
90 pages
Applications of Probability
No ratings yet
Applications of Probability
11 pages
Lecture 9 - Probability COMP7180
No ratings yet
Lecture 9 - Probability COMP7180
58 pages
Script
No ratings yet
Script
9 pages
Stat - G. Assignment
No ratings yet
Stat - G. Assignment
21 pages
Module 1
No ratings yet
Module 1
12 pages
Random Variable and Mathematical Expectation
No ratings yet
Random Variable and Mathematical Expectation
9 pages
Probability & Testing in Data Analytics
No ratings yet
Probability & Testing in Data Analytics
70 pages
Module 1 CS III Sem
No ratings yet
Module 1 CS III Sem
30 pages
Probability & Queueing Theory Guide
No ratings yet
Probability & Queueing Theory Guide
12 pages
Unit II
No ratings yet
Unit II
140 pages
Probability Probability Distribution Function Probability Density Function Random Variable Bayes' Rule Gaussian Distribution
No ratings yet
Probability Probability Distribution Function Probability Density Function Random Variable Bayes' Rule Gaussian Distribution
26 pages
Probability & Statistics Guide
No ratings yet
Probability & Statistics Guide
120 pages
PART 1.odt
No ratings yet
PART 1.odt
17 pages
Lect1 Math231
No ratings yet
Lect1 Math231
65 pages
Mat 352 Official Note Construction
No ratings yet
Mat 352 Official Note Construction
38 pages
Introduction To Probability Theory: A Short Course On Graphical Models
No ratings yet
Introduction To Probability Theory: A Short Course On Graphical Models
30 pages
Chapter 2 - Random Variables and Probabi - 2016 - Introduction To Statistical Ma
No ratings yet
Chapter 2 - Random Variables and Probabi - 2016 - Introduction To Statistical Ma
14 pages
Probability Is A Branch of Mathematics That Deals With Measuring The Likelihood of Events
No ratings yet
Probability Is A Branch of Mathematics That Deals With Measuring The Likelihood of Events
34 pages
Probability Theory
No ratings yet
Probability Theory
8 pages
3 Prob-Review
No ratings yet
3 Prob-Review
77 pages
Mathematical Statistics: Probability & Distributions
No ratings yet
Mathematical Statistics: Probability & Distributions
87 pages
Scribe: Naive Bayes Classifier
No ratings yet
Scribe: Naive Bayes Classifier
16 pages
NLP Module 2
No ratings yet
NLP Module 2
73 pages
Probability Notes
No ratings yet
Probability Notes
7 pages
From the Foundations of Probability to its Applications - 學術
No ratings yet
From the Foundations of Probability to its Applications - 學術
5 pages
20250313-Week4-Probability and Random Variables
No ratings yet
20250313-Week4-Probability and Random Variables
52 pages
Unit 1 Ssmda Notes
No ratings yet
Unit 1 Ssmda Notes
35 pages
Probability Distributions - Training
No ratings yet
Probability Distributions - Training
43 pages
Probability Notes
No ratings yet
Probability Notes
19 pages
BCS-DS-602: Machine Learning: Dr. Sarika Chaudhary Associate Professor Fet-Cse
No ratings yet
BCS-DS-602: Machine Learning: Dr. Sarika Chaudhary Associate Professor Fet-Cse
18 pages
Applied Maths
No ratings yet
Applied Maths
34 pages
ML Academy - Part II
No ratings yet
ML Academy - Part II
8 pages
Module 1 CS III Sem
No ratings yet
Module 1 CS III Sem
31 pages
Model Questions 19-28
No ratings yet
Model Questions 19-28
8 pages
MM3&4 - Probability and Distributions Summary Notes
No ratings yet
MM3&4 - Probability and Distributions Summary Notes
31 pages
AML Unit 1
No ratings yet
AML Unit 1
160 pages
CNN New
No ratings yet
CNN New
225 pages
ST 1
No ratings yet
ST 1
1 page
NLP M5 Part-2 SPP
No ratings yet
NLP M5 Part-2 SPP
62 pages
NLP M5 Part-1 SPP
No ratings yet
NLP M5 Part-1 SPP
55 pages
NLP M4 Part 1 SPP
No ratings yet
NLP M4 Part 1 SPP
57 pages
Econometrics for University Students
No ratings yet
Econometrics for University Students
65 pages
Structural Reliability Powered by Strurel: Sofistik 2020
No ratings yet
Structural Reliability Powered by Strurel: Sofistik 2020
75 pages
2 Marks MLT Ai&ds
No ratings yet
2 Marks MLT Ai&ds
2 pages
Probability Distribution Formulas
No ratings yet
Probability Distribution Formulas
6 pages
Multiple Random Variables
No ratings yet
Multiple Random Variables
14 pages
QB University Question Paper GAMAT301 Final
No ratings yet
QB University Question Paper GAMAT301 Final
31 pages
Ma1252-Probability and Queueing Theory Unit - I Probability and Random Variable
100% (1)
Ma1252-Probability and Queueing Theory Unit - I Probability and Random Variable
19 pages
TOD501-M22 03 Probability and Distributions Lecture Slides
No ratings yet
TOD501-M22 03 Probability and Distributions Lecture Slides
67 pages
Essential Math For AI Hala Nelson Instant Download
No ratings yet
Essential Math For AI Hala Nelson Instant Download
54 pages
Probability & Stochastic Processes Tutorial
No ratings yet
Probability & Stochastic Processes Tutorial
2 pages
AprilMay 2023
No ratings yet
AprilMay 2023
2 pages
Probability & Statistics Course Info
No ratings yet
Probability & Statistics Course Info
3 pages
PRO-Ch4 (2021-22 Note
No ratings yet
PRO-Ch4 (2021-22 Note
52 pages
BCom Statistics and Data Science
No ratings yet
BCom Statistics and Data Science
42 pages
AMA2104 Probability and Engineering Statistics 3 Joint Distribution
No ratings yet
AMA2104 Probability and Engineering Statistics 3 Joint Distribution
25 pages
MRA Project Milesone-1: BY-Shorya Goel PGP Dsba Oct - 20 B
92% (25)
MRA Project Milesone-1: BY-Shorya Goel PGP Dsba Oct - 20 B
35 pages
Bmat202l Cat 1 D2 2024
No ratings yet
Bmat202l Cat 1 D2 2024
7 pages
R20 CO PO Mapping Table
No ratings yet
R20 CO PO Mapping Table
50 pages
Random Vibration and Spectral Analysis: André Preumont
No ratings yet
Random Vibration and Spectral Analysis: André Preumont
8 pages
The Phi-Coefficient, The Tetrachoric Correlation Coefficient, and The Pearson-Yule Debate
No ratings yet
The Phi-Coefficient, The Tetrachoric Correlation Coefficient, and The Pearson-Yule Debate
19 pages
Be - PQT - Problem Metiral
No ratings yet
Be - PQT - Problem Metiral
13 pages
''Intuitive Understanding of Kalman Filtering With MATLAB'' First Edition Adjouadi Full Digital Chapters
No ratings yet
''Intuitive Understanding of Kalman Filtering With MATLAB'' First Edition Adjouadi Full Digital Chapters
126 pages
Probability for Investment Analysts
No ratings yet
Probability for Investment Analysts
52 pages
Final Study Solutions
No ratings yet
Final Study Solutions
14 pages
Reddy Et Al-2012-Hydrological Processes
No ratings yet
Reddy Et Al-2012-Hydrological Processes
15 pages
Probability Homework Solutions
No ratings yet
Probability Homework Solutions
5 pages
Joint Probability Distributions: Chapter Outline
No ratings yet
Joint Probability Distributions: Chapter Outline
12 pages
(Ebook) Probabilistic Machine Learning: An Introduction by Kevin P. Murphy ISBN 9780262046824, 0262046822 Updated 2025
100% (6)
(Ebook) Probabilistic Machine Learning: An Introduction by Kevin P. Murphy ISBN 9780262046824, 0262046822 Updated 2025
156 pages
PI Kit - MBA Admissions 2023
No ratings yet
PI Kit - MBA Admissions 2023
50 pages
15EMAB203 988 15EMAB203 988 23kle632
No ratings yet
15EMAB203 988 15EMAB203 988 23kle632
2 pages

AML-IV New

Uploaded by

AML-IV New

Uploaded by

AML

Nayan Ranjan Paul

we can use (1, 2, 3, 4, 5), or (1, 3, 2, 5, 4), etc.

The full joint distribution The graph-structured approximation

1. First consider a chain structure X → Y → Z,

2. Consider the tent structure X ← Y → Z.

and therefore x ⊥ z|y. So observing a root node separates its children

3. Finally consider a v-structure X → Y ← Z.

so x ̸⊥ z|y. However, in the unconditional distribution, we have p(x, z) = p(x)p(z)

• It moves from current state to next state according to the

• This kind of system is called Finite or Discrete Markov

Xt=1 Xt=2 Xt=3 Xt=4 Xt=5

• For the given Markov model, the transition probabilities are

0.3 Rain Dry 0.7

• Green circles are hidden states

• Purple nodes are observed states

0.3 low high 0.8

• Compute the probability of a given observation

– Step-2: Expectation step(E-Step) : Using the observed available data of the

– Step-3: Maximization step(M-Step): Complete data generated after the

– Step-4: Repeat step-2 and 3 until converge.

– Step-4: Repeat step-2 and 3 until converge.

You might also like