Approximate Inference via Sampling (1)
CS698X: Topics in Probabilistic Modeling and Inference
Piyush Rai
Plan
▪ Sampling to approximate distributions
▪ Basic sampling methods
▪ Markov Chain Monte Carlo (MCMC)
Sampling for Approximate Inference
▪ Some typical tasks that we have to solve in probabilistic/fully-Bayesian inference
  ▪ Posterior distribution
  ▪ Posterior predictive distribution
  ▪ Marginal likelihood (needed for model selection, and in computing the posterior too)
  ▪ Expected complete-data log-likelihood (needed in EM)
  ▪ Evidence lower bound (ELBO) (needed in VI)
▪ Sampling methods provide a general way to (approximately) solve these problems
Approximating a Prob. Distribution using Samples
▪ Can approximate any distribution using a set of randomly drawn samples from it
▪ The samples can be thought of as a histogram-based approximation of the distribution: the height of each bar denotes how many times that location was sampled and, given large enough samples, is proportional to the probability density 𝑝(𝑧) at that location
▪ The samples can also be used for computing expectations (Monte-Carlo averaging)
▪ Usually straightforward to generate samples if it is a simple/standard distribution
▪ The interesting bit: Even if the distribution is “difficult” (e.g., an intractable posterior), it
is often possible to generate random samples from such a distribution, as we will see.
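To make this concrete, here is a small Python sketch (my illustration, not from the slides); the Gamma(3, 1) target and the sample sizes are arbitrary choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
L = 100_000                                   # number of samples
z = rng.gamma(shape=3.0, scale=1.0, size=L)   # samples from p(z) = Gamma(3, 1)

# Histogram-based approximation: normalized bar heights track the density p(z)
heights, edges = np.histogram(z, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(heights - stats.gamma.pdf(centers, a=3.0))))  # small

# The same samples give Monte Carlo averages, e.g. E[z] = 3 and E[z^2] = 12
print(z.mean(), (z ** 2).mean())
```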
The Empirical Distribution
▪ A sampling-based approximation can be formally represented using an empirical distribution
▪ Given 𝐿 points/samples 𝒛(1), 𝒛(2), …, 𝒛(𝐿), the empirical distribution defined by these is
  𝑝̂(𝐴) = Σℓ=1…𝐿 𝑤(ℓ) 𝛿𝒛(ℓ)(𝐴)
▪ Here 𝑤(ℓ) is the weight of point 𝒛(ℓ) (the weights sum to 1; typically 𝑤(ℓ) = 1/𝐿), and 𝛿𝒛(ℓ) is the Dirac distribution with finite support at 𝒛(ℓ), i.e., 𝛿𝒛′(𝐴) = 1 if 𝒛′ ∈ 𝐴 and 0 otherwise
▪ Can think of 𝐴 as the region over which we want to evaluate the distribution
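A small sketch (not from the slides): with uniform weights 𝑤(ℓ) = 1/𝐿, evaluating the empirical distribution on a set 𝐴 amounts to counting the fraction of samples that fall in 𝐴; the 𝒩(0, 1) target and the set 𝐴 = [0, 1] are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 50_000
z = rng.normal(size=L)           # samples z^(1), ..., z^(L) from p(z) = N(0, 1)
w = np.full(L, 1.0 / L)          # uniform weights, summing to 1

# Empirical probability of A = [0, 1]: sum of weights of samples inside A
in_A = (z >= 0.0) & (z <= 1.0)
p_hat_A = np.sum(w[in_A])
print(p_hat_A)                   # ≈ Phi(1) - Phi(0) ≈ 0.3413
```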
Sampling: Some Basic Methods
▪ Most of these basic methods are based on the idea of transformation
▪ Generate a random sample 𝑥 from a distribution 𝑞(𝑥) which is easy to sample from
▪ Apply a transformation on 𝑥 to make it a random sample 𝑧 from a complex distribution 𝑝(𝑧); by the change of variables formula, 𝑝(𝑧) = 𝑞(𝑥) |𝜕𝑥/𝜕𝑧| (determinant of the Jacobian)
▪ Some popular examples of transformation methods
  ▪ Inverse CDF method: draw 𝑥 ∼ Unif(0, 1) and set 𝑧 = 𝐹⁻¹(𝑥), where 𝐹(𝑧) is the CDF of 𝑝(𝑧)
  ▪ Reparametrization method
  ▪ Box-Muller method: Given 𝑥1, 𝑥2 from Unif(0, 1), generate (𝑧1, 𝑧2) from 𝒩(0, 𝐈2) as
    𝑧1 = √(−2 ln 𝑥1) cos(2π𝑥2),  𝑧2 = √(−2 ln 𝑥1) sin(2π𝑥2)
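A minimal runnable sketch of both transformations (my illustration; the Exp(1) target for the inverse-CDF part is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Inverse CDF method: for p(z) = Exp(1), F(z) = 1 - e^{-z}, so F^{-1}(x) = -ln(1 - x)
x = rng.uniform(size=n)
z_exp = -np.log1p(-x)            # samples from Exp(1)
print(z_exp.mean())              # ≈ 1, the mean of Exp(1)

# Box-Muller method: two Unif(0, 1) draws -> two independent N(0, 1) draws
x1, x2 = rng.uniform(size=n), rng.uniform(size=n)
z1 = np.sqrt(-2.0 * np.log(x1)) * np.cos(2.0 * np.pi * x2)
z2 = np.sqrt(-2.0 * np.log(x1)) * np.sin(2.0 * np.pi * x2)
print(z1.mean(), z1.std(), np.corrcoef(z1, z2)[0, 1])  # ≈ 0, 1, 0
```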
▪ Transformation Methods are simple but have limitations
▪ Mostly limited to standard distributions and/or distributions with very few variables
Rejection Sampling
▪ Goal: Generate a random sample from a distribution of the form 𝑝(𝑧) = 𝑝̃(𝑧)/𝑍𝑝, assuming
  ▪ We can only evaluate the value of the numerator 𝑝̃(𝑧) for any 𝑧
  ▪ The denominator (normalization constant) 𝑍𝑝 is intractable and we don’t know its value
▪ Assume a proposal distribution 𝑞(𝑧) we can generate samples from (it should have the same support as 𝑝(𝑧)), and a constant 𝑀 such that 𝑀𝑞(𝑧) ≥ 𝑝̃(𝑧) for all 𝑧
▪ Rejection Sampling then works as follows
  ▪ Sample a random variable 𝑧∗ from 𝑞(𝑧)
  ▪ Sample a uniform r.v. 𝑢 ∼ Unif(0, 𝑀𝑞(𝑧∗))
  ▪ If 𝑢 ≤ 𝑝̃(𝑧∗) then accept 𝑧∗, otherwise reject it
▪ All accepted 𝑧∗’s will be random samples from 𝑝(𝑧). Proof on next slide
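A runnable sketch of the procedure (my example, not from the slides): the unnormalized bimodal target, the 𝒩(0, 3²) proposal, and the loose constant 𝑀 = 15 are all arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def p_tilde(z):
    # Unnormalized target: bimodal density; normalizer Z_p assumed unknown
    return np.exp(-0.5 * (z - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (z + 2.0) ** 2)

q = stats.norm(loc=0.0, scale=3.0)    # proposal we can sample from
M = 15.0                              # must satisfy M * q(z) >= p_tilde(z) for all z

samples = []
while len(samples) < 10_000:
    z_star = rng.normal(0.0, 3.0)               # z* ~ q(z)
    u = rng.uniform(0.0, M * q.pdf(z_star))     # u ~ Unif(0, M q(z*))
    if u <= p_tilde(z_star):                    # accept with prob p_tilde / (M q)
        samples.append(z_star)

samples = np.array(samples)   # approximately distributed as p(z) = p_tilde(z)/Z_p
```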
Rejection Sampling
▪ Why is 𝑧 ∼ 𝑞(𝑧) + accept/reject rule equivalent to 𝑧 ∼ 𝑝(𝑧)?
▪ Let’s look at the pdf of the 𝑧’s that were accepted, i.e., 𝑝(𝑧|accept)
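A brief version of the standard argument (filling in the steps): 𝑧∗ ∼ 𝑞(𝑧) is accepted with probability 𝑝(accept|𝑧) = 𝑝̃(𝑧)/(𝑀𝑞(𝑧)), so

  𝑝(𝑧|accept) = 𝑝(accept|𝑧) 𝑞(𝑧) / 𝑝(accept) = (𝑝̃(𝑧)/𝑀) / ∫ (𝑝̃(𝑧)/𝑀) 𝑑𝑧 = 𝑝̃(𝑧)/𝑍𝑝 = 𝑝(𝑧)

Note also that 𝑝(accept) = 𝑍𝑝/𝑀, so a tighter envelope (smaller 𝑀) means fewer rejections.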
Computing Expectations via Monte Carlo Sampling
▪ Often we are interested in computing expectations of the form
  𝔼[𝑓] = ∫ 𝑓(𝑧) 𝑝(𝑧) 𝑑𝑧
where 𝑓(𝑧) is some function of the random variable 𝑧 ∼ 𝑝(𝑧)
▪ A simple approx. scheme to compute the above expectation: Monte Carlo integration
▪ Generate 𝐿 independent samples from 𝑝(𝑧) (assuming we know how to sample from it): {𝑧(ℓ)}ℓ=1…𝐿 ∼ 𝑝(𝑧)
▪ Approximate the expectation by the following empirical average
  𝔼[𝑓] ≈ 𝑓̂ = (1/𝐿) Σℓ=1…𝐿 𝑓(𝑧(ℓ))
▪ Since the samples are independent of each other, we can show the following
  𝔼[𝑓̂] = 𝔼[𝑓]  (the estimate is unbiased)
  var[𝑓̂] = var[𝑓]/𝐿  (the variance in our estimate decreases as 𝐿 increases)
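A quick numerical check (my example): estimating 𝔼[𝑧²] under 𝒩(0, 1), whose true value is 1; repeating the estimator many times shows the mean is on target and the variance shrinks as 1/𝐿 (here var[𝑓] = 2).

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda z: z ** 2            # f(z); true E[f] = 1 under N(0, 1)

def mc_estimate(L):
    z = rng.normal(size=L)      # L independent samples from p(z) = N(0, 1)
    return f(z).mean()          # empirical average f_hat

for L in (10, 100, 1000):
    ests = np.array([mc_estimate(L) for _ in range(2000)])
    # Mean of estimates ≈ 1 (unbiased); variance ≈ var[f]/L = 2/L for this f
    print(L, ests.mean(), ests.var())
```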
Computing Expectations via Importance Sampling
▪ How to compute Monte Carlo expec. if we don’t know how to sample from 𝑝(𝑧)?
▪ One way is to use transformation methods or rejection sampling
▪ Another way is to use Importance Sampling (assuming 𝑝(𝑧) can be evaluated at least)
▪ Generate 𝐿 independent samples from a proposal 𝑞(𝑧) we know how to sample from: {𝑧(ℓ)}ℓ=1…𝐿 ∼ 𝑞(𝑧)
▪ Now approximate the expectation as follows
  𝔼[𝑓] = ∫ 𝑓(𝑧) 𝑝(𝑧) 𝑑𝑧 = ∫ 𝑓(𝑧) (𝑝(𝑧)/𝑞(𝑧)) 𝑞(𝑧) 𝑑𝑧 ≈ (1/𝐿) Σℓ=1…𝐿 (𝑝(𝑧(ℓ))/𝑞(𝑧(ℓ))) 𝑓(𝑧(ℓ))
▪ This is basically “weighted” Monte Carlo integration
▪ 𝑤(ℓ) = 𝑝(𝑧(ℓ))/𝑞(𝑧(ℓ)) denotes the importance weight of each sample 𝑧(ℓ) (see PRML 11.1.4)
▪ IS works even when we can only evaluate 𝑝(𝑧) = 𝑝̃(𝑧)/𝑍𝑝 up to a proportionality constant (using self-normalized weights 𝑤(ℓ)/Σ𝑘 𝑤(𝑘))
▪ Note: Monte Carlo integration and Importance Sampling are NOT sampling methods!
▪ They are only used for computing expectations (approximately)
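A small sketch (my example, with arbitrary target/proposal choices): plain IS when 𝑝(𝑧) can be evaluated, and the self-normalized variant when only 𝑝̃(𝑧) can.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
L = 100_000

p = stats.norm(0.0, 1.0)         # target p(z) (known here; illustration only)
q = stats.norm(1.0, 2.0)         # proposal with the same support, heavier tails
f = lambda z: z ** 2             # true E[f] = 1 under the target

z = q.rvs(size=L, random_state=rng)      # samples from q, not from p
w = p.pdf(z) / q.pdf(z)                  # importance weights w^(l)
print(np.mean(w * f(z)))                 # ≈ 1: weighted Monte Carlo integration

# If only p_tilde = Z_p * p(z) can be evaluated, normalize the weights instead
p_tilde = lambda z: 5.0 * p.pdf(z)       # pretend-unnormalized version (Z_p = 5)
w_t = p_tilde(z) / q.pdf(z)
print(np.sum(w_t * f(z)) / np.sum(w_t))  # self-normalized IS, still ≈ 1
```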
Limitations of the Basic Methods
▪ Transformation based methods: Usually limited to drawing from standard distributions
▪ Rejection Sampling and Importance Sampling: Require good proposal distributions
  ▪ For Rejection Sampling, 𝑞(𝑧) should be such that 𝑀𝑞(𝑧) envelopes 𝑝̃(𝑧) everywhere
  ▪ For Importance Sampling, we would ideally like 𝑞(𝑧) to give samples from where 𝑝(𝑧) is large, or where 𝑓(𝑧)𝑝(𝑧) is large
  ▪ Either property is difficult to guarantee if 𝑧 is high-dimensional
▪ In general, difficult to find good prop. distr. especially when 𝑧 is high-dim
▪ More sophisticated sampling methods like MCMC work well in such high-dim spaces
Markov Chain Monte Carlo (MCMC)
▪ Goal: Generate samples from some target distribution 𝑝(𝒛) = 𝑝̃(𝒛)/𝑍𝑝, where 𝒛 usually is high-dimensional (if the target is a posterior, it will be conditioned on data, i.e., 𝑝(𝒛|𝒙))
▪ Assume we can evaluate 𝑝(𝒛) at least up to a proportionality constant, i.e., we can at least evaluate 𝑝̃(𝒛)
▪ MCMC uses a Markov Chain which, when converged, starts giving samples from 𝑝(𝑧)
▪ Given current sample 𝒛(ℓ) from the chain, MCMC generates the next sample 𝒛(ℓ+1) as
▪ Use a proposal distribution 𝑞(𝒛|𝒛(ℓ)) to generate a candidate sample 𝒛∗ (the proposal should also have the same support as 𝑝(𝒛))
▪ Accept/reject 𝒛∗ as the next sample based on an acceptance criterion (will see later)
▪ If accepted, set 𝒛(ℓ+1) = 𝒛∗. If rejected, set 𝒛(ℓ+1) = 𝒛(ℓ)
▪ Important: The proposal distribution 𝑞(𝒛|𝒛(ℓ) ) depends on the previous sample 𝒛(ℓ)
MCMC: The Basic Scheme
▪ The chain run infinitely long (i.e., upon convergence) will give ONE sample from 𝑝(𝒛)
▪ But we usually require several samples to approximate 𝑝(𝒛)
  ▪ MCMC is exact in theory but approximate in practice, since we can’t run the chain for infinitely long; thus we say that the samples are approximately from the target distribution
▪ This is done as follows
  ▪ Start the chain at an initial 𝒛(0)
  ▪ Using the proposal 𝑞(𝒛|𝒛(ℓ)), run the chain long enough, say 𝑇1 steps
  ▪ Discard the first 𝑇1 − 1 samples (called “burn-in” samples) and take the last sample 𝒛(𝑇1), which we treat as our first sample from 𝑝(𝒛)
▪ Continue from 𝒛(𝑇1) up to 𝑇2 steps, discard intermediate samples, take last sample 𝒛(𝑇2)
▪ This discarding (called “thinning”) helps ensure that 𝒛(𝑇1) and 𝒛(𝑇2) are uncorrelated
▪ Repeat the same for a total of 𝑆 times
▪ In the end, we now have 𝑆 approximately independent samples from 𝑝(𝒛) (approximate independence is a requirement for the Monte Carlo approximation)
▪ Note: Good choices for 𝑇1 and 𝑇𝑖 − 𝑇𝑖−1 (thinning gap) are usually based on heuristics
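A minimal sketch of the overall scheme (my illustration): the accept/reject criterion below is the Metropolis acceptance rule for a symmetric proposal, which the slides cover later; the unnormalized target, burn-in 𝑇1, and thinning gap are arbitrary heuristic choices.

```python
import numpy as np

rng = np.random.default_rng(6)

def p_tilde(z):
    # Unnormalized target, evaluable only up to the constant Z_p
    return np.exp(-0.5 * z ** 2) * (1.0 + np.sin(3.0 * z) ** 2)

def mcmc(n_samples, burn_in=1000, thin=10, step=1.0):
    z = 0.0                                   # initial state z^(0)
    samples, t = [], 0
    while len(samples) < n_samples:
        z_star = z + step * rng.normal()      # proposal q(z|z^(l)): random walk
        # Metropolis acceptance rule (symmetric proposal; covered next lecture)
        if rng.uniform() <= min(1.0, p_tilde(z_star) / p_tilde(z)):
            z = z_star                        # accept: z^(l+1) = z*
        # else: reject, z^(l+1) = z^(l)
        t += 1
        if t > burn_in and t % thin == 0:     # discard burn-in, keep every thin-th
            samples.append(z)
    return np.array(samples)

samples = mcmc(5000)   # approximately independent samples, approximately from p(z)
```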
MCMC: Some Basic Theory
▪ A first order Markov Chain assumes 𝑝 𝒛(ℓ+1) |𝒛 1 , … , 𝒛(ℓ) = 𝑝(𝒛 ℓ+1 |𝒛(ℓ) )
▪ A 1st order Markov Chain 𝒛(0) , 𝒛(1) , … , 𝒛(𝐿) is a sequence of r.v.’s and is defined by
▪ An initial state distribution 𝑝(𝒛 0 )
▪ A Transition Function (TF): 𝑇ℓ 𝒛 ℓ → 𝒛 ℓ+1 = 𝑝(𝒛 ℓ+1 |𝒛(ℓ) )
▪ TF is a distribution over the values of next state given the value of the current state
▪ Assuming a 𝐾-dim discrete state-space, the TF will be a 𝐾 × 𝐾 probability table
▪ Homogeneous Markov Chain: The TF is the same for all ℓ , i.e., 𝑇ℓ = 𝑇
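For concreteness, a tiny simulation (my sketch) of a homogeneous chain over a 𝐾 = 3 discrete state-space; the transition table 𝑇 is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical K x K transition table: T[i, j] = p(next = j | current = i)
T = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.5, 0.2]])
p0 = np.array([0.5, 0.2, 0.3])        # initial state distribution p(z^(0))

z = rng.choice(3, p=p0)               # draw z^(0)
chain = [z]
for _ in range(10):
    z = rng.choice(3, p=T[z])         # draw z^(l+1) from T(z^(l) -> .)
    chain.append(z)
print(chain)
```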
MCMC: Some Basic Theory
▪ Consider the following Markov Chain with a 𝐾 = 3 discrete state-space
  ▪ 𝑝(𝒛(0)) = [𝑝(𝑧(0) = 1), 𝑝(𝑧(0) = 2), 𝑝(𝑧(0) = 3)] = [0.5, 0.2, 0.3]
  ▪ 𝑝(𝒛(1)) = 𝑝(𝒛(0)) × 𝑇 = [0.2, 0.6, 0.2]  (rounded to a single digit after the decimal)
  ▪ After doing this a few more (say some 𝑚) times: 𝑝(𝒛(0)) × 𝑇^𝑚 = [0.2, 0.4, 0.4]  (rounded to a single digit after the decimal)
  ▪ The multinoulli 𝑝(𝒛) with 𝝅 = [0.2, 0.4, 0.4] is the stationary/invariant distribution of this Markov Chain
▪ 𝑝(𝒛) being stationary means that no matter what 𝑝(𝒛(0)) is, we will eventually reach 𝑝(𝒛)
▪ A Markov Chain has a stationary distribution if 𝑇 has the following properties
▪ Irreducibility: T’s graph is connected (ensures reachability from anywhere to anywhere)
▪ Aperiodicity: T’s graph has no cycles (ensures that the chain isn’t trapped in cycles)
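A numeric check (my sketch; the transition table is the same hypothetical 𝑇 as above, not the one from the slide): repeatedly multiplying any initial distribution by 𝑇 converges to a stationary distribution satisfying 𝝅 = 𝝅𝑇.

```python
import numpy as np

# Hypothetical transition table (rows sum to 1); all entries positive,
# so the chain is irreducible and aperiodic
T = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.5, 0.2]])

p = np.array([0.5, 0.2, 0.3])        # any initial distribution p(z^(0))
for _ in range(50):
    p = p @ T                        # p(z^(m)) = p(z^(0)) T^m
print(p)                             # stationary distribution pi
print(np.allclose(p, p @ T))         # True: pi = pi T
```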
MCMC: Some Basic Theory
▪ A Markov Chain with transition function 𝑇 has stationary distribution 𝑝(𝒛) if 𝑇 satisfies the Detailed Balance condition
  𝑝(𝒛) 𝑇(𝒛′|𝒛) = 𝑝(𝒛′) 𝑇(𝒛|𝒛′)
  (here 𝑇(𝒂|𝒃) denotes the transition probability of going from state 𝒃 to state 𝒂)
▪ Integrating out (or summing over) the detailed balance condition on both sides w.r.t. 𝒛′, and using ∫ 𝑇(𝒛′|𝒛) 𝑑𝒛′ = 1, we get
  𝑝(𝒛) = ∫ 𝑝(𝒛′) 𝑇(𝒛|𝒛′) 𝑑𝒛′
  Thus 𝑝(𝒛) is the stationary distribution of this Markov Chain
▪ Thus a Markov Chain with detailed balance always converges to a stationary distribution
▪ Detailed Balance ensures reversibility
▪ Detailed balance is a sufficient but not a necessary condition for having a stationary distribution
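A small check (my sketch): build a transition matrix via the Metropolis rule (a construction that enforces detailed balance, covered next lecture) for an arbitrary 3-state target 𝝅, then verify both detailed balance and stationarity numerically.

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])            # target distribution p(z)
K = len(pi)

# Metropolis construction: uniform symmetric proposal + accept prob min(1, pi_j/pi_i)
T = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        if i != j:
            T[i, j] = (1.0 / K) * min(1.0, pi[j] / pi[i])
    T[i, i] = 1.0 - T[i].sum()            # remaining mass: stay at state i

# Detailed balance: pi_i T[i, j] == pi_j T[j, i] for all i, j
flow = pi[:, None] * T
print(np.allclose(flow, flow.T))          # True

# Hence pi is stationary: summing detailed balance over states gives pi = pi T
print(np.allclose(pi, pi @ T))            # True
```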
Coming Up Next
▪ MCMC algorithms
▪ Metropolis Hastings (MH)
▪ Gibbs sampling (special case of MH)