
Approximate Inference via Sampling (1)

CS698X: Topics in Probabilistic Modeling and Inference


Piyush Rai
Plan
▪ Sampling to approximate distributions
▪ Basic sampling methods
▪ Markov Chain Monte Carlo (MCMC)

Sampling for Approximate Inference
▪ Some typical tasks that we have to solve in probabilistic/fully-Bayesian inference
  ▪ The posterior distribution
  ▪ The posterior predictive distribution
  ▪ The marginal likelihood (needed for model selection, and in computing the posterior too)
  ▪ The expected complete-data log-likelihood (needed in EM)
  ▪ The evidence lower bound (ELBO) (needed in VI)
▪ Sampling methods provide a general way to (approximately) solve these problems
Approximating a Prob. Distribution using Samples
▪ Can approximate any distribution 𝑝(𝑧) using a set of randomly drawn samples from it
  ▪ The samples can be thought of as a histogram-based approximation of the distribution: the height of each bar denotes how many times that location was sampled
  ▪ Given large-enough samples, the height at each location is proportional to the probability density 𝑝(𝑧) at that location
▪ The samples can also be used for computing expectations (Monte-Carlo averaging)

▪ Usually straightforward to generate samples if it is a simple/standard distribution

▪ The interesting bit: Even if the distribution is “difficult” (e.g., an intractable posterior), it
is often possible to generate random samples from such a distribution, as we will see.
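
A minimal sketch of this idea (Python/NumPy; the Gamma(3, 1) target is just an illustrative assumption, since any easy-to-sample distribution works):

# A histogram-based approximation of p(z) from random samples.
# Gamma(3, 1) is an illustrative, easy-to-sample choice of p(z).
import numpy as np

rng = np.random.default_rng(0)
samples = rng.gamma(shape=3.0, scale=1.0, size=100_000)

# Bar heights count how many times each location was sampled;
# density=True rescales so heights approach p(z) for large sample sizes.
heights, edges = np.histogram(samples, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(centers[np.argmax(heights)])  # mode location, ~2.0 for Gamma(3, 1)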
The Empirical Distribution
▪ Sampling based approx. can be formally represented using an empirical distribution
▪ Given 𝐿 points/samples 𝒛^(1), 𝒛^(2), …, 𝒛^(𝐿), the empirical distr. defined by these is

    𝑝̂(𝒛) = Σℓ=1..𝐿 𝑤ℓ 𝛿𝒛^(ℓ)(𝒛)

  where 𝑤ℓ is the weight of point 𝒛^(ℓ) (the weights sum to 1), and 𝛿𝒛^(ℓ) is the Dirac distribution with finite support at 𝒛^(1), 𝒛^(2), …, 𝒛^(𝐿):

    𝛿𝒛(𝐴) = 1 if 𝒛 ∈ 𝐴, and 0 otherwise

  (can think of 𝐴 as being the region over which we want to evaluate the distribution)
Sampling: Some Basic Methods
▪ Most of these basic methods are based on the idea of transformation
  ▪ Generate a random sample 𝑥 from a distribution 𝑞(𝑥) which is easy to sample from
  ▪ Apply a transformation on 𝑥 to make it a random sample 𝑧 from a complex distr. 𝑝(𝑧), via the change-of-variables formula 𝑝(𝑧) = 𝑞(𝑥) |𝜕𝑥/𝜕𝑧| (determinant of Jacobian)
▪ Some popular examples of transformation methods
  ▪ Inverse CDF method: 𝑧 = 𝐹⁻¹(𝑥) with 𝑥 ∼ Unif(0, 1), where 𝐹(𝑧) is the CDF of 𝑝(𝑧)
  ▪ Reparametrization method
  ▪ Box-Muller method: Given 𝑥1, 𝑥2 drawn from Unif(0, 1), generate (𝑧1, 𝑧2) from 𝒩(0, 𝐈2) as

      𝑧1 = √(−2 ln 𝑥1) cos(2𝜋𝑥2),   𝑧2 = √(−2 ln 𝑥1) sin(2𝜋𝑥2)

▪ Transformation methods are simple but have limitations
  ▪ Mostly limited to standard distributions and/or distributions with very few variables
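
A minimal sketch of two of these transformations (Python; the Exponential(1) target for the inverse CDF method is an illustrative assumption, chosen because its CDF inverts in closed form):

# Transformation methods: inverse CDF and Box-Muller.
import numpy as np

rng = np.random.default_rng(0)

# Inverse CDF: for Exponential(1), F(z) = 1 - exp(-z), so
# z = F^{-1}(x) = -ln(1 - x) with x ~ Unif(0, 1).
x = rng.uniform(size=100_000)
z_exp = -np.log(1.0 - x)
print(z_exp.mean())  # ~1.0, the Exponential(1) mean

# Box-Muller: (x1, x2) ~ Unif(0, 1) -> (z1, z2) ~ N(0, I_2).
x1, x2 = rng.uniform(size=100_000), rng.uniform(size=100_000)
r = np.sqrt(-2.0 * np.log(x1))
z1, z2 = r * np.cos(2 * np.pi * x2), r * np.sin(2 * np.pi * x2)
print(z1.std(), z2.std())  # both ~1.0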
Rejection Sampling

▪ Goal: Generate a random sample from a distribution of the form 𝑝(𝑧) = 𝑝̃(𝑧)/𝑍𝑝, assuming
  ▪ We can only evaluate the value of the numerator 𝑝̃(𝑧) for any 𝑧
  ▪ The denominator (normalization constant) 𝑍𝑝 is intractable and we don't know its value
▪ Assume a proposal distribution 𝑞(𝑧) we can generate samples from (it should have the same support as 𝑝(𝑧)), and a constant 𝑀 such that 𝑀𝑞(𝑧) envelopes 𝑝̃(𝑧) everywhere
▪ Rejection sampling then works as follows
  ▪ Sample a random variable 𝑧∗ from 𝑞(𝑧)
  ▪ Sample a uniform r.v. 𝑢 ∼ Unif(0, 𝑀𝑞(𝑧∗))
  ▪ If 𝑢 ≤ 𝑝̃(𝑧∗) then accept 𝑧∗, otherwise reject it
▪ All accepted 𝑧∗'s will be random samples from 𝑝(𝑧). Proof on next slide
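
A hedged sketch of the procedure (Python; the unnormalized target 𝑝̃, the 𝒩(0, 2²) proposal, and the envelope constant 𝑀 are all illustrative assumptions):

# Rejection sampling from an unnormalized target p_tilde(z).
# The target and the N(0, 2^2) proposal are illustrative choices;
# M is picked so that M*q(z) >= p_tilde(z) everywhere.
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):                      # unnormalized target, Z_p unknown
    return np.exp(-0.5 * z**2) * (1.0 + np.sin(3.0 * z) ** 2)

def q_pdf(z, s=2.0):                 # proposal density N(0, s^2)
    return np.exp(-0.5 * (z / s) ** 2) / (s * np.sqrt(2 * np.pi))

M = 11.0                             # here p_tilde/q <= 4*sqrt(2*pi) < 11

z_star = rng.normal(0.0, 2.0, size=100_000)   # z* ~ q(z)
u = rng.uniform(0.0, M * q_pdf(z_star))       # u ~ Unif(0, M q(z*))
accepted = z_star[u <= p_tilde(z_star)]       # keep z* iff u <= p_tilde(z*)
print(accepted.size / z_star.size)            # acceptance rate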


Rejection Sampling
▪ Why is 𝑧 ∼ 𝑞(𝑧) + the accept/reject rule equivalent to 𝑧 ∼ 𝑝(𝑧)?

▪ Let's look at the pdf of the 𝑧's that were accepted, i.e., 𝑝(𝑧|accept)
  ▪ Given a candidate 𝑧, the probability of accepting it is 𝑝(accept|𝑧) = 𝑝̃(𝑧)/𝑀𝑞(𝑧)
  ▪ Therefore 𝑝(𝑧|accept) = 𝑞(𝑧) 𝑝(accept|𝑧)/𝑝(accept) = 𝑝̃(𝑧)/(𝑀 𝑝(accept)) ∝ 𝑝̃(𝑧)
  ▪ Being a normalized pdf proportional to 𝑝̃(𝑧), 𝑝(𝑧|accept) must equal 𝑝(𝑧) = 𝑝̃(𝑧)/𝑍𝑝

Computing Expectations via Monte Carlo Sampling
▪ Often we are interested in computing expectations of the form

    𝔼[𝑓] = ∫ 𝑓(𝑧) 𝑝(𝑧) 𝑑𝑧

  where 𝑓(𝑧) is some function of the random variable 𝑧 ∼ 𝑝(𝑧)
▪ A simple approx. scheme to compute the above expectation: Monte Carlo integration
  ▪ Generate 𝐿 independent samples from 𝑝(𝑧): {𝑧^(ℓ)}ℓ=1..𝐿 ∼ 𝑝(𝑧) (assuming we know how to sample from 𝑝(𝑧))
  ▪ Approximate the expectation by the following empirical average

      𝔼[𝑓] ≈ 𝑓̂ = (1/𝐿) Σℓ=1..𝐿 𝑓(𝑧^(ℓ))

▪ Since the samples are independent of each other, we can show that the estimate 𝑓̂ is unbiased, and that the variance of our estimate decreases as 𝐿 increases
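
A minimal sketch (Python; taking 𝑓(𝑧) = 𝑧² with 𝑝(𝑧) = 𝒩(0, 1) is an assumption for the demo, chosen because 𝔼[𝑓] = 1 is known exactly):

# Monte Carlo integration: E[f] ~ (1/L) * sum_l f(z^(l)), z^(l) ~ p(z).
import numpy as np

rng = np.random.default_rng(0)
f = lambda z: z**2                   # E[z^2] = 1 exactly under N(0, 1)

for L in (100, 10_000, 1_000_000):
    z = rng.normal(size=L)           # L independent samples from p(z)
    print(L, f(z).mean())            # estimate tightens as L grows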
Computing Expectations via Importance Sampling
▪ How to compute Monte Carlo expec. if we don't know how to sample from 𝑝(𝑧)?
  ▪ One way is to use transformation methods or rejection sampling
  ▪ Another way is to use Importance Sampling (assuming 𝑝(𝑧) can at least be evaluated)
▪ Generate 𝐿 indep. samples from a proposal 𝑞(𝑧) we know how to sample from: {𝑧^(ℓ)}ℓ=1..𝐿 ∼ 𝑞(𝑧)
▪ Now approximate the expectation as follows

    𝔼[𝑓] = ∫ 𝑓(𝑧) 𝑝(𝑧) 𝑑𝑧 = ∫ 𝑓(𝑧) (𝑝(𝑧)/𝑞(𝑧)) 𝑞(𝑧) 𝑑𝑧 ≈ (1/𝐿) Σℓ=1..𝐿 (𝑝(𝑧^(ℓ))/𝑞(𝑧^(ℓ))) 𝑓(𝑧^(ℓ))

▪ This is basically "weighted" Monte Carlo integration
  ▪ 𝑤^(ℓ) = 𝑝(𝑧^(ℓ))/𝑞(𝑧^(ℓ)) denotes the importance weight of each sample 𝑧^(ℓ) (see PRML 11.1.4)
▪ IS works even when we can only evaluate 𝑝(𝑧) = 𝑝̃(𝑧)/𝑍𝑝 up to a prop. constant
▪ Note: Monte Carlo and Importance Sampling are NOT sampling methods!
  ▪ They are only used for computing expectations (approximately)
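
A hedged sketch (Python; the same 𝑓(𝑧) = 𝑧², target 𝑝 = 𝒩(0, 1), and proposal 𝑞 = 𝒩(1, 2²) are illustrative assumptions):

# Importance sampling: E[f] ~ (1/L) * sum_l (p(z^(l))/q(z^(l))) * f(z^(l)),
# with the z^(l) drawn from the proposal q rather than the target p.
import numpy as np

def normal_pdf(z, mu, s):
    return np.exp(-0.5 * ((z - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
f = lambda z: z**2                   # E[z^2] = 1 under the target N(0, 1)

z = rng.normal(1.0, 2.0, size=100_000)                 # z^(l) ~ q = N(1, 4)
w = normal_pdf(z, 0.0, 1.0) / normal_pdf(z, 1.0, 2.0)  # importance weights
print(np.mean(w * f(z)))             # ~1.0
# If only p_tilde = Z_p * p can be evaluated, use self-normalized weights:
# np.sum(w * f(z)) / np.sum(w), with w = p_tilde(z) / q(z).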
Limitations of the Basic Methods
▪ Transformation based methods: Usually limited to drawing from standard distributions

▪ Rejection Sampling and Importance Sampling: Require good proposal distributions

  ▪ For rejection sampling, 𝑞(𝑧) should be such that 𝑀𝑞(𝑧) envelopes 𝑝(𝑧) everywhere
  ▪ For importance sampling, 𝔼[𝑓] ≈ (1/𝐿) Σℓ=1..𝐿 (𝑝(𝑧^(ℓ))/𝑞(𝑧^(ℓ))) 𝑓(𝑧^(ℓ)); ideally, we would like 𝑞(𝑧) to give samples from where 𝑝(𝑧) is large, or where 𝑓(𝑧)𝑝(𝑧) is large
  ▪ Difficult to guarantee either if 𝑧 is high-dimensional

▪ In general, it is difficult to find good prop. distr., especially when 𝑧 is high-dim

▪ More sophisticated sampling methods like MCMC work well in such high-dim spaces
Markov Chain Monte Carlo (MCMC)
▪ Goal: Generate samples from some target distribution 𝑝(𝒛) = 𝑝̃(𝒛)/𝑍𝑝, where 𝒛 usually is high-dim (if the target is a posterior, it will be conditioned on data, i.e., 𝑝(𝒛|𝒙))
▪ Assume we can evaluate 𝑝(𝒛) at least up to a proportionality constant, i.e., we can at least evaluate 𝑝̃(𝒛)
▪ MCMC uses a Markov chain which, when converged, starts giving samples from 𝑝(𝒛)
▪ Given the current sample 𝒛^(ℓ) from the chain, MCMC generates the next sample 𝒛^(ℓ+1) as follows
  ▪ Use a proposal distribution 𝑞(𝒛|𝒛^(ℓ)) to generate a candidate sample 𝒛∗ (the proposal should also have the same support as 𝑝(𝒛))
  ▪ Accept/reject 𝒛∗ as the next sample based on an acceptance criterion (will see later; see the sketch below)
  ▪ If accepted, set 𝒛^(ℓ+1) = 𝒛∗. If rejected, set 𝒛^(ℓ+1) = 𝒛^(ℓ)
▪ Important: The proposal distribution 𝑞(𝒛|𝒛^(ℓ)) depends on the previous sample 𝒛^(ℓ)
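
A sketch of one such transition (Python). The acceptance criterion is deferred to a later lecture, so the standard Metropolis rule for a symmetric random-walk proposal is used here purely as a placeholder assumption:

# One MCMC transition. The acceptance criterion comes later in the
# course; here the Metropolis rule for a symmetric random-walk proposal
# q(z | z_curr) = N(z_curr, step^2 I) serves as a placeholder.
import numpy as np

def mcmc_step(z_curr, log_p_tilde, step, rng):
    z_star = z_curr + step * rng.normal(size=z_curr.shape)   # candidate z*
    log_ratio = log_p_tilde(z_star) - log_p_tilde(z_curr)    # Z_p cancels
    if np.log(rng.uniform()) < log_ratio:  # accept with prob min(1, ratio)
        return z_star                      # accepted: z^(l+1) = z*
    return z_curr                          # rejected: z^(l+1) = z^(l)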
MCMC: The Basic Scheme
▪ The chain run infinitely long (i.e., upon convergence) will give ONE sample from 𝑝(𝒛)
▪ But we usually require several samples to approximate 𝑝(𝒛)
  ▪ MCMC is exact in theory but approximate in practice, since we can't run the chain for infinitely long; thus we say that the samples are approximately from the target distribution
▪ This is done as follows (see the code sketch after this list)
  ▪ Start the chain at an initial 𝒛^(0)
  ▪ Using the proposal 𝑞(𝒛|𝒛^(ℓ)), run the chain long enough, say 𝑇1 steps
  ▪ Discard the first 𝑇1 − 1 samples (called "burn-in" samples) and take the last sample 𝒛^(𝑇1); we will treat it as our first sample from 𝑝(𝒛)
  ▪ Continue from 𝒛^(𝑇1) up to 𝑇2 steps, discard the intermediate samples, take the last sample 𝒛^(𝑇2)
  ▪ This discarding (called "thinning") helps ensure that 𝒛^(𝑇1) and 𝒛^(𝑇2) are uncorrelated (a requirement for Monte Carlo approximation)
  ▪ Repeat the same for a total of 𝑆 times
▪ In the end, we now have 𝑆 approximately independent samples from 𝑝(𝒛)
▪ Note: Good choices for 𝑇1 and 𝑇𝑖 − 𝑇𝑖−1 (the thinning gap) are usually based on heuristics
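
A hedged harness around the mcmc_step sketch above, implementing burn-in and thinning (the target, step size, burn-in length, and gap are all illustrative assumptions):

# Burn-in + thinning harness around mcmc_step from the sketch above.
import numpy as np

def run_mcmc(log_p_tilde, z0, n_samples, burn_in=1000, gap=10,
             step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    for _ in range(burn_in):            # discard the first burn-in draws
        z = mcmc_step(z, log_p_tilde, step, rng)
    kept = []
    for _ in range(n_samples):          # keep every gap-th draw afterwards
        for _ in range(gap):
            z = mcmc_step(z, log_p_tilde, step, rng)
        kept.append(z.copy())
    return np.array(kept)               # S approximately indep. samples

# Example: unnormalized log-density of N(3, 1); Z_p is never needed.
samples = run_mcmc(lambda z: -0.5 * np.sum((z - 3.0) ** 2),
                   z0=np.zeros(1), n_samples=2000)
print(samples.mean())                   # ~3.0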
MCMC: Some Basic Theory
▪ A first-order Markov chain assumes 𝑝(𝒛^(ℓ+1)|𝒛^(1), …, 𝒛^(ℓ)) = 𝑝(𝒛^(ℓ+1)|𝒛^(ℓ))
▪ A 1st-order Markov chain 𝒛^(0), 𝒛^(1), …, 𝒛^(𝐿) is a sequence of r.v.'s and is defined by
  ▪ An initial state distribution 𝑝(𝒛^(0))
  ▪ A Transition Function (TF): 𝑇ℓ(𝒛^(ℓ) → 𝒛^(ℓ+1)) = 𝑝(𝒛^(ℓ+1)|𝒛^(ℓ))
  ▪ The TF is a distribution over the values of the next state given the value of the current state
  ▪ Assuming a 𝐾-dim discrete state-space, the TF will be a 𝐾 × 𝐾 probability table
▪ Homogeneous Markov chain: The TF is the same for all ℓ, i.e., 𝑇ℓ = 𝑇


MCMC: Some Basic Theory
▪ Consider the following Markov chain with a 𝐾 = 3 discrete state-space (its transition table 𝑇 was given as a figure on the slide)

    𝑝(𝒛^(0)) = 𝑝(𝑧1^(0), 𝑧2^(0), 𝑧3^(0)) = [0.5, 0.2, 0.3]
    𝑝(𝒛^(1)) = 𝑝(𝒛^(0)) × 𝑇 = [0.2, 0.6, 0.2] (rounded to a single digit after the decimal)

  After doing it a few more (say some 𝑚) times:

    𝑝(𝒛^(0)) × 𝑇^𝑚 = [0.2, 0.4, 0.4] (rounded to a single digit after the decimal)

  The stationary/invariant distribution 𝑝(𝒛) of this Markov chain is multinoulli with 𝜋 = [0.2, 0.4, 0.4]
▪ 𝑝(𝒛) being stationary means that no matter what 𝑝(𝒛^(0)) is, we will reach 𝑝(𝒛)
▪ A Markov chain has a stationary distribution if 𝑇 has the following properties
  ▪ Irreducibility: 𝑇's graph is connected (ensures reachability from anywhere to anywhere)
  ▪ Aperiodicity: 𝑇's graph has no cycles (ensures that the chain isn't trapped in cycles)
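
Since the slide's actual transition table lives in a figure not reproduced here, this sketch uses a hypothetical row-stochastic 𝑇 just to show 𝑝(𝒛^(0)) × 𝑇^𝑚 converging to a stationary distribution (Python):

# Convergence of p^(0) T^m to the stationary distribution.
# T is a hypothetical row-stochastic table (the lecture's actual values
# are in a figure); row i holds p(next state | current state = i).
import numpy as np

T = np.array([[0.1, 0.5, 0.4],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

p = np.array([0.5, 0.2, 0.3])        # initial state distribution p(z^(0))
for _ in range(50):                  # repeatedly apply the transition table
    p = p @ T
print(p)                             # the stationary distribution
print(p @ T)                         # unchanged by T: stationarity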
MCMC: Some Basic Theory
▪ A Markov chain with transition function 𝑇 has stationary distribution 𝑝(𝒛) if 𝑇 satisfies

    𝑝(𝒛) 𝑇(𝒛′|𝒛) = 𝑝(𝒛′) 𝑇(𝒛|𝒛′)

  known as the Detailed Balance condition (here 𝑇(𝒃|𝒂) denotes the transition probability of going from state 𝒂 to state 𝒃)
▪ Integrating out (or summing over) the detailed balance condition on both sides w.r.t. 𝒛′:

    ∫ 𝑝(𝒛) 𝑇(𝒛′|𝒛) 𝑑𝒛′ = ∫ 𝑝(𝒛′) 𝑇(𝒛|𝒛′) 𝑑𝒛′  ⟹  𝑝(𝒛) = ∫ 𝑝(𝒛′) 𝑇(𝒛|𝒛′) 𝑑𝒛′

  thus 𝑝(𝒛) is the stationary distribution of this Markov chain
▪ Thus a Markov chain with detailed balance always converges to a stationary distribution
▪ Detailed balance ensures reversibility
▪ Detailed balance is a sufficient but not necessary condition for having a stationary distr.
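
A numeric check of detailed balance (Python; the chain is a hypothetical random walk on a symmetric weight matrix 𝑊, a standard construction whose stationary distribution is known in closed form):

# Numeric detailed-balance check: pi_i T(j|i) == pi_j T(i|j) for all i, j.
# A random walk on symmetric weights W is reversible: T(j|i) = W_ij / d_i
# with d_i = sum_j W_ij, and its stationary distribution is pi_i = d_i / sum(d).
import numpy as np

W = np.array([[0.0, 2.0, 1.0],       # hypothetical symmetric edge weights
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
d = W.sum(axis=1)
T = W / d[:, None]                   # row-stochastic transition table
pi = d / d.sum()

flow = pi[:, None] * T               # flow[i, j] = pi_i * T(j|i)
print(np.allclose(flow, flow.T))     # True: detailed balance holds
print(np.allclose(pi @ T, pi))       # hence pi is stationary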
Coming Up Next
▪ MCMC algorithms
▪ Metropolis Hastings (MH)
▪ Gibbs sampling (special case of MH)
