Chapter 25
Bayesian Estimation of Regression Models
John Fox
and substituting this result into Equation 25.1 yields Bayes’s theorem:

Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)    (25.3)
Bayes’s theorem is named after its discoverer, the Reverend Thomas Bayes, an
18th-century English mathematician.
Bayesian statistical inference is based on the following interpretation of
Equation 25.3: Let A represent some uncertain proposition (a “hypothesis”)
whose truth or falsity we wish to establish—for example, the proposition that a
parameter is equal to a particular value. Let B represent observed data that are
relevant to the truth of the proposition. We interpret the unconditional prob-
ability Pr(A), called the prior probability of A, as our strength of belief in the
truth of A prior to collecting data, and Pr(B|A) as the probability of obtaining
the observed data assuming the truth of A—that is, the likelihood of the data
given A (see on-line Appendix Section D.6.1). The unconditional probability of
the data B is³

Pr(B) = Pr(B|A) Pr(A) + Pr(B|Ā) Pr(Ā)

where Ā is the event not-A (the complement of A). Then Pr(A|B), given by
Equation 25.3 and called the posterior probability of A, represents our revised
strength of belief in A in light of the data B.
2 Indeed, this would be a good time to review Appendix D!
3 This is an application of the law of total probability: Given an event B and a set of k disjoint events A₁, ..., A_k for which Σᵢ₌₁ᵏ Pr(Aᵢ) = 1 (i.e., the events Aᵢ partition the sample space S),

Pr(B) = Σᵢ₌₁ᵏ Pr(B|Aᵢ) Pr(Aᵢ)

In the current context, S = A ∪ Ā; that is, the sample space is the union of A and Ā.
25.1. INTRODUCTION TO BAYESIAN STATISTICAL INFERENCE
Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)
4 The identification of classical or frequentist inference with objective probability and of Bayesian inference with subjective probability is a simplification that glosses over differences within both camps. Such subtleties are well beyond the scope of this presentation, and the shorthand, Bayesian inference versus classical or frequentist inference, is convenient.
5 See Exercise 25.1.
• Imagine that a test for antibodies has been developed that never produces a false negative. Let P represent the event that a person tests positive for antibodies. The conditional probability that a person with antibodies correctly tests positive, called the sensitivity of the test, is then Pr(P|A) = 1.
• Imagine further that the test has a false-positive rate of 10%; that is, 10% of people who don’t have antibodies to the virus nevertheless test positive (perhaps because they’ve been infected by a similar virus). The conditional probability that a person who doesn’t have antibodies incorrectly tests positive is therefore Pr(P|Ā) = .1.⁶
• Imagine, finally, that you test positive. What is the probability that you actually have antibodies to the virus; that is, what is Pr(A|P)?⁷ [Hint: Pr(A|P) is much smaller than Pr(P|A) = 1.]
6 The conditional probability that a person who doesn’t have antibodies correctly tests negative, called the specificity of the test, is Pr(P̄|Ā) = 1 − .1 = .9, but this probability isn’t needed for the problem.
7 A strict frequentist would object to referring to the probability that a specific individual,
like you, has antibodies to the virus because, after all, either you have antibodies or you
don’t. Pr(A|P ) is therefore a subjective probability, reflecting your ignorance of the true
state of affairs. Pr(A|P ) can be given an objective frequentist interpretation as the long-
run proportion (i.e., the relative frequency) of individuals testing positive who are actually
positive.
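Bayes’s theorem answers the question posed in the last bullet. The prevalence Pr(A) isn’t given in this passage, so the 2% value below is purely an illustrative assumption; a minimal sketch:

```python
# Bayes's theorem for the antibody-test example.
# ASSUMPTION: prevalence Pr(A) = 0.02 is illustrative only; it is not given here.
sensitivity = 1.0     # Pr(P | A): the test never produces a false negative
false_positive = 0.1  # Pr(P | not-A): the 10% false-positive rate
prevalence = 0.02     # Pr(A), assumed for illustration

# Law of total probability: Pr(P) = Pr(P|A)Pr(A) + Pr(P|not-A)Pr(not-A)
pr_P = sensitivity * prevalence + false_positive * (1 - prevalence)

# Bayes's theorem: Pr(A|P) = Pr(P|A)Pr(A) / Pr(P)
pr_A_given_P = sensitivity * prevalence / pr_P
print(round(pr_A_given_P, 3))  # 0.169: much smaller than Pr(P|A) = 1
```

Even with a perfectly sensitive test, a positive result is far from conclusive when the condition is rare, which is the point of the hint.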
reasonable to take as prior probabilities Pr(A) = Pr(Ā) = .5. Calling the data B, the likelihood of the data under A and Ā is

Pr(B|A) = .3⁷(1 − .3)³ = .0000750
Pr(B|Ā) = .8⁷(1 − .8)³ = .0016777

As is typically the case, the likelihood of the observed data is small in both cases, but the data are much more likely under Ā than under A. (The likelihood of these data for any value of Pr(H) between 0 and 1 appears in on-line Appendix Figure D.18.)

Using Bayes’s theorem (Equation 25.3), you find the posterior probabilities

Pr(A|B) = (.0000750 × .5) / (.0000750 × .5 + .0016777 × .5) = .0428
Pr(Ā|B) = (.0016777 × .5) / (.0000750 × .5 + .0016777 × .5) = .9572

suggesting that it is much more probable that the selected coin has Pr(H) = .8 than Pr(H) = .3.
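The arithmetic above is easy to reproduce; a minimal sketch:

```python
# Two-coin example: the selected coin has Pr(H) = .3 (event A) or
# Pr(H) = .8 (event A-bar), each with prior probability .5; the data B
# are 7 heads in 10 independent flips.
prior_A = prior_Abar = 0.5
like_A = 0.3**7 * (1 - 0.3)**3     # Pr(B|A)     = .0000750
like_Abar = 0.8**7 * (1 - 0.8)**3  # Pr(B|A-bar) = .0016777

# Law of total probability gives the denominator Pr(B).
pr_B = like_A * prior_A + like_Abar * prior_Abar
post_A = like_A * prior_A / pr_B         # Pr(A|B)
post_Abar = like_Abar * prior_Abar / pr_B
print(round(post_A, 4), round(post_Abar, 4))  # 0.0428 0.9572
```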
p(α|D) = L(α)p(α) / ∫_A L(α′)p(α′) dα′

p(α|D) ∝ L(α)p(α)

That is, the posterior probability or density is proportional to the product of the likelihood and the prior probability or density. As before, we can if necessary divide by Σ_A L(α)p(α) or ∫_A L(α)p(α) dα to recover the posterior probabilities or densities.¹⁰
Two points are noteworthy:
• We require a prior distribution p(α) over the possible values of the param-
eter α (the parameter space) to set the machinery of Bayesian inference
in motion.
9 I’ve resisted starring this material because parameters are almost always continuous. If you’re unfamiliar with calculus, just think of the integral symbol ∫ (which is an elongated “S”) as the continuous analog of a sum Σ; indeed, the analogy is very close: Here the integral is the area under the density function of a continuous random variable, which represents a probability, an idea that is surely familiar from a basic statistics course.
10 The statement is glib, in that it may not be easy in the continuous case to evaluate the integral ∫ L(α)p(α) dα. This potential difficulty motivates the use of conjugate priors, discussed immediately below, and the more generally applicable Markov-chain Monte Carlo methods described later in the chapter (Section 25.2).
p(α|D) = L(α)p(α) / ∫_A L(α′)p(α′) dα′

when α is continuous (and where A represents the set of all values of α).
estimation, but it is the simplest case of the binomial distribution, discussed in Section D.2.1, where there is only a single binomial trial. The probability mass function of the Bernoulli distribution is therefore p(x) = π^x (1 − π)^(1−x), where π is the probability of “success” (i.e., a head in the coin-flipping example), and x = 1 for a success and 0 for a failure.
The Bernoulli likelihood function follows immediately by multiplying the probabilities for the 10 (more generally, n) independent trials. We could work directly with the binomial likelihood and obtain the same result, taking the data as h, the number of heads in n binomial trials, because h is a sufficient statistic for the probability π: See on-line Appendix D.5.4 for an explanation of sufficiency.
That is, given h heads in n independent flips of a coin with probability π of a head on an individual flip, the Bernoulli likelihood function is L_Bern(π) = π^h (1 − π)^(n−h), while the binomial likelihood is L_binom(π) = (n choose h) π^h (1 − π)^(n−h), where (n choose h) ≡ n!/[h!(n − h)!] is the binomial coefficient. Because n and h are both constants after the data are observed, the binomial likelihood is proportional to the Bernoulli likelihood. The Bernoulli likelihood takes the order of the h heads and n − h tails into account while the binomial likelihood ignores the order and attends only to the number of heads; the order is immaterial to estimating the value of π because the flips are independent.
12 But see the discussion of uninformative priors in the following section.
Figure 25.1 Beta priors for various choices of a and b: (a) symmetric priors, for which a = b; (b) asymmetric priors, for which a ≠ b. Beta(0.5, 0.5) is the Jeffreys prior, discussed in Section 25.1.6.
Figure 25.2 shows the posterior distributions for π under these two priors. Under the flat prior, the posterior is proportional to the likelihood, and therefore if you take the mode of the posterior as your estimate of π, you get the maximum-likelihood estimate π̂ = .7.¹³ The posterior for the informative prior a = b = 16, in contrast, has a mode at π ≈ .55, which is much closer to the mode of the prior distribution, π = .5.

It may be disconcerting that the conclusion should depend so crucially on the prior distribution, but this result is a consequence of the very small sample (10 coin flips) in the example: Recall that using a beta prior in this case is like adding a + b observations to the data. As the sample size grows, the likelihood comes to dominate the posterior distribution, and the influence of the prior distribution fades.¹⁴ In the current example, if the coin is flipped n times, then the posterior distribution takes the form

p(π|D) ∝ π^(h+a−1) (1 − π)^(n−h+b−1)

that is, Beta(h + a, n − h + b), and the numbers of heads h and tails n − h will grow with the number of flips. It is intuitively sensible that your prior beliefs should carry greater weight when the sample is small than when it is large.
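With a Beta(a, b) prior and h heads in n flips, the conjugate updating described above yields a Beta(h + a, n − h + b) posterior, whose mode can be computed directly; a minimal sketch of the two cases in the example:

```python
# Conjugate beta updating for the coin example: h = 7 heads in n = 10 flips.
# The Beta(a, b) prior yields a Beta(h + a, n - h + b) posterior.
h, n = 7, 10

def posterior_mode(a, b):
    alpha, beta = h + a, n - h + b           # posterior shape parameters
    return (alpha - 1) / (alpha + beta - 2)  # mode of Beta(alpha, beta), both > 1

print(posterior_mode(1, 1))    # 0.7: the flat prior reproduces the ML estimate
print(posterior_mode(16, 16))  # 0.55: the informative prior pulls toward pi = .5
```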
13 You could instead use the mean or median of the posterior distribution as your estimate of π. In most cases, however, the posterior distribution will approach a normal distribution as the sample size increases, and the posterior mean, median, and mode will therefore be approximately equal if the sample size is sufficiently large.
14 An exception to this rule occurs when the prior distribution assigns 0 density to some
values of the parameter; such values will necessarily have posterior densities of 0 regardless of
the data.
Figure 25.2 Posterior distributions for the probability of a head π under two prior
distributions: the flat beta prior with a = 1, b = 1 (the posterior for which is
shown as a solid curve), and the informative beta prior with a = 16, b = 16 (the
broken curve). The data contain 7 heads in 10 flips of a coin. The two horizontal
lines near the bottom of the graph show 95% central posterior intervals (described
in Section 25.1.7) corresponding to the two priors.
the probability of success, and so I use an uppercase Π ≈ 3.14159 for the math-
ematical constant, whose role here is to ensure that the prior density integrates
to 1. The Jeffreys prior pJ (π) is a beta distribution with a = b = 0.5 (shown in
Figure 25.1) and so is a conjugate prior to the Bernoulli likelihood.
15 The general solution for the Jeffreys prior turns out to be remarkably simple, and is
explored in Exercise 25.4. Although I don’t take it up here, the Jeffreys prior can be extended
to several parameters. There are, as well, other general uninformative priors based on various
principles, such as reference priors, introduced by Bernardo (1979).
Figure 25.3 Three putatively uninformative priors expressed on the probability scale π: The
flat prior on the probability scale (broken line); the flat prior on the logit scale
(dotted line); and the Jeffreys prior (solid line). Because the flat prior on the logit
scale is improper, its height in the figure is essentially arbitrary. A question for the
reader: Are you comfortable with the flat logit prior or the Jeffreys prior as an
expression of weak prior belief?
There are several kinds of uninformative prior distributions. The flat prior assigns equal probability density to all values of a parameter; if the parameter is unbounded, then the flat prior is improper, in that it doesn’t integrate to 1, and a flat prior for a parameter is not in general flat for a transformation of the parameter. The Jeffreys prior for a parameter (introduced by Sir Harold Jeffreys), in contrast, is invariant with respect to transformation of the parameter. Weakly informative priors are often employed in practice and are selected to place broad plausible constraints on the value of a parameter.
Ninety-five percent central posterior intervals for the example are shown for
the two posterior distributions in Figure 25.2 (page 10).16
or

p(α|D) = p(α)L(α) / ∫_A p(α*)L(α*) dᵏα*    (25.7)

where A is the set of all values of the parameter vector α (i.e., A is the multidimensional parameter space). Inference typically focuses on the marginal posterior distribution of each parameter, p(αᵢ|D), or possibly on the distributions of scalar functions of the parameters, such as predicted values in regression, which are functions of regression coefficients, or the ratio or difference of two parameters.¹⁹
16 Page references in this chapter may be to pages within the chapter or to pages in the
printed text (i.e., Chapters 1 through 24). The references should, however, be clear from the
context: In this case, for example, Figure 25.2 is obviously in the current chapter.
17 If you’re not familiar with vector notation, just think of α ≡ [α₁, α₂, ..., α_k]′ as a collection of several parameters.
18 It’s frequently the case in practice that prior distributions are specified independently for
the various parameters, so that the joint prior is the product of the separate (marginal) priors.
19 Although there may be several parameters, a scalar function of the parameters returns a
single number.
25.2. MARKOV-CHAIN MONTE CARLO SIMULATION
p(α|D) = p(α)L(α) / ∫_A p(α*)L(α*) dᵏα*
without having explicitly to evaluate the integral in the denominator, which is typically intractable analytically. MCMC methods have therefore rendered Bayesian inference practical for a broad range of statistical problems.
There are three common (and related) MCMC methods: the Metropolis-Hastings algorithm, the Gibbs sampler, and Hamiltonian Monte Carlo (HMC). HMC is considered the best current method of MCMC for sampling from continuous distributions.
The substitution of p*(·) for p(·) in the second line of Equation 25.8 is justified because the unknown normalizing constant c (recall, the number that makes the density integrate to 1) cancels in the numerator and denominator, making the ratio in the equation computable even though the numerator and denominator in the first line of the equation are not separately computable. Calculate a′ = min(a, 1).
into account the possible bias in the direction of movement of the pro-
posal function from the preceding value. If the proposal is less probable
than the preceding value, then the probability of accepting the proposal
declines with the ratio a, but isn’t 0. Thus, the chain will tend to visit
higher-density regions of the target distribution with greater frequency
but will still explore the entire target distribution. It can be shown (e.g.,
Chib and Greenberg, 1995) that the limiting distribution of the Markov
chain (the distribution to which the sample tends as m → ∞) is indeed the
target distribution, and so the algorithm should work if m is big enough.
The Metropolis-Hastings algorithm is simpler when the proposal distribu-
tion is symmetric, in the sense that f (xi |xi−1 ) = f (xi−1 |xi ). This is true,
for example, when the proposal distribution is multivariate-normal (see on-line
Appendix Section D.3.5) with mean vector xi−1 and some specified covariance
matrix S:
f(xᵢ | xᵢ₋₁) = [1 / ((2π)^(n/2) √(det S))] × exp[−½ (xᵢ − xᵢ₋₁)′ S⁻¹ (xᵢ − xᵢ₋₁)]
= f(xᵢ₋₁ | xᵢ)    (25.9)
Then, a in Equation 25.8 becomes

a = p*(x*) / p*(xᵢ₋₁)    (25.10)
which (again, because the missing normalizing constant c cancels) is equivalent
to the ratio of the target density at the proposed and preceding values of x. This
simplified version of the Metropolis-Hastings algorithm, based on a symmetric
proposal distribution, is the version originally introduced by Metropolis et al.
(1953).
By construction, the Metropolis-Hastings algorithm generates statistically
dependent successive values of x. If an approximately independent sample is
desired, then the sequence of sampled values can be thinned by discarding a
sufficient number of intermediate values of x, retaining only every kth value.
Additionally, because of an unfortunately selected initial value x₀, it may take some time for the sampled sequence to approach its limiting distribution, that is, the target distribution. It may therefore be advantageous to discard a number of values at the beginning of the sequence, termed the burn-in or warm-up period.
It’s not necessary to use MCMC in this case, because it’s easy to approxi-
mate bivariate-normal probabilities or to draw samples from the distribution
directly, but the bivariate-normal distribution provides a simple setting in which
to demonstrate the Metropolis algorithm, and for pedagogical purposes it helps
to know the right answer in advance—that is, the example is selected for its
transparency.
Let’s pretend that we know the bivariate-normal distribution only up to a constant of proportionality. To this end, I omit the normalizing constant, which for this simple example works out to 2π√3 (see on-line Appendix Section D.3.5).
To illustrate that the proposal distribution and the target distribution are
distinct, I use a bivariate-rectangular proposal distribution centered at the pre-
ceding value xi−1 with half-extent δ1 = 2 in the direction of the coordinate x1
and δ2 = 4 in the direction of x2 , reflecting the relative sizes of the standard
deviations of the two variables. This proposal distribution is symmetric, as
required by the simpler Metropolis algorithm. Clearly, because it has finite sup-
port, the rectangular proposal distribution doesn’t cover the entire support of
the bivariate-normal target distribution, which extends infinitely in each direc-
tion, but because the proposal distribution “travels” (i.e., moves in the {x1 , x2 }
plane) with xi , it can generate a valid Markov chain.
I arbitrarily set x₀ = [0, 0]′ and sampled m = 10⁵ values of x. As it turned out, 41.7% of proposals were accepted. To get a sense of how Metropolis sampling proceeds, I show the first 50 accepted proposals, along with the duplicated points corresponding to rejected proposals (of which there are 75), in Figure 25.4. The 95% concentration ellipse for the bivariate-normal distribution is also shown on the graph.²³
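The sampler just described can be sketched compactly. The code below is not the author’s implementation: the run length (20,000 draws) and seed are arbitrary choices for the sketch, the target parameters (μ = (1, 2)′, σ₁² = 1, σ₂² = 4, σ₁₂ = 1, as in Equations 25.11) are used only through the unnormalized density, and the rectangular proposal has the half-extents δ₁ = 2 and δ₂ = 4 given in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

MU = np.array([1.0, 2.0])
SIGMA_INV = np.linalg.inv(np.array([[1.0, 1.0],
                                    [1.0, 4.0]]))

def log_p_star(x):
    """Log of the bivariate-normal target density, up to an additive constant."""
    d = x - MU
    return -0.5 * d @ SIGMA_INV @ d

def metropolis(m=20_000, delta=(2.0, 4.0)):
    """Metropolis sampling with a symmetric rectangular proposal."""
    delta = np.asarray(delta)
    x = np.zeros(2)               # arbitrary start x0 = (0, 0)'
    samples = np.empty((m, 2))
    accepted = 0
    for i in range(m):
        proposal = x + rng.uniform(-delta, delta)  # rectangle centered at x
        # Accept with probability min(a, 1), a = p*(proposal)/p*(x);
        # the unknown normalizing constant c cancels in the ratio.
        if np.log(rng.uniform()) < log_p_star(proposal) - log_p_star(x):
            x = proposal
            accepted += 1
        samples[i] = x            # a rejected proposal duplicates the current point
    return samples, accepted / m

samples, rate = metropolis()
print(rate)                  # roughly 0.4, close to the text's 41.7%
print(samples.mean(axis=0))  # approximately (1, 2)
```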
How well does the Metropolis algorithm approximate the bivariate-normal distribution? Here are the mean vector and covariance matrix of the sampled points, which are quite close to the corresponding parameters in Equations 25.11:

μ̂ = [1.003, 1.987]′

Σ̂ = [ 0.989  0.972 ]
    [ 0.972  3.963 ]
Figure 25.5 shows all of the 10⁵ sampled points together with several theoretical elliptical contours of constant density and corresponding empirical density contours.²⁴ Clearly, the Metropolis algorithm does a good job of recovering the target bivariate-normal density.
As mentioned, the Metropolis algorithm doesn’t produce an independent
random sample from the target distribution. One way to measure the depen-
dence among the sampled values is to compute their autocorrelations. Focus, for
23 As explained in Section 9.4.4, density contours of the multivariate-normal distribution are ellipses.
24 The empirical density contours are computed by a two-dimensional kernel-density estimator; see Silverman (1986, Chap. 4). One-dimensional kernel-density estimation is introduced in Section 3.1.2.
Figure 25.4 First 50 accepted proposals and duplicated points representing rejected proposals,
sampling from the bivariate-normal distribution in Equations 25.11, which is
represented by its 95% concentration ellipse. The solid dots represent the 50
distinct points, corresponding to accepted proposals, starting out as light gray and
getting progressively darker, with the arrows showing the transitions. Duplicated
points corresponding to rejected proposals, shown as hollow dots, are slightly offset
so that they don’t precisely over-plot the accepted proposals.
Figure 25.5 The gray dots show m = 10⁵ values sampled by the Metropolis algorithm from
the bivariate-normal distribution in Equations 25.11. The slightly irregular solid
lines represent estimated density contours enclosing 50%, 95%, and 99% of the
sampled points. The broken lines are the corresponding elliptical density contours
of the bivariate-normal target distribution.
Figure 25.6 Autocorrelations of the sampled values of (a) x1 and (b) x2 produced by the
Metropolis algorithm applied to the bivariate-normal distribution in
Equations 25.11.
example, on the vector of jth sampled coordinates, say xⱼ = [x₁ⱼ, x₂ⱼ, ..., x_mⱼ]′, with mean x̄ⱼ. The sample autocorrelation at lag t is defined as²⁵

r_tj = [ Σ_{i=t+1}^{m} (x_ij − x̄_j)(x_{i−t,j} − x̄_j) ] / [ Σ_{i=1}^{m} (x_ij − x̄_j)² ]    (25.12)
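Equation 25.12 translates directly into code; a minimal sketch:

```python
import numpy as np

def autocorrelation(x, t):
    """Sample autocorrelation of the series x at lag t (Equation 25.12)."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    if t == 0:
        return 1.0
    # The numerator pairs observation i with observation i - t.
    return (d[t:] * d[:-t]).sum() / (d * d).sum()

# Sanity check: white noise should show essentially no autocorrelation.
rng = np.random.default_rng(1)
z = rng.standard_normal(100_000)
print(round(autocorrelation(z, 1), 2))  # close to 0
```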
however, allows us to compare the results of MCMC with the known right an-
swer.
Recall our coin-flipping experiment, which produced h = 7 heads in n = 10 independent flips of a coin, with Bernoulli likelihood L(π|h = 7) = π^h (1 − π)^(n−h) = π⁷(1 − π)³. As in Section 25.1.5, I’ll consider two prior distributions: a flat prior, in which the parameters of the Beta(a, b) distribution (see on-line Appendix Section D.3.8) are set to a = b = 1, and an informative prior centered on the population proportion π = 0.5 (representing a “fair” coin) in which a = b = 16. In the first case, the posterior is Beta(8, 4), and in the second case, it is Beta(23, 19). These posteriors appear in Figure 25.2 (page 10).
For the first simulation, with a flat prior, I set the standard deviation of the normal proposal distribution N(πᵢ₋₁, s²) in the Metropolis algorithm to s = 0.1, starting arbitrarily from π₀ = 0.5 and sampling m = 10⁵ values from the posterior distribution of π, with an acceptance rate of 77.5%. An estimate of the resulting posterior density function is shown in panel (a) of Figure 25.7, along with the true Beta(8, 4) posterior density; panel (b) shows a quantile-comparison (QQ) plot of the sampled values versus the Beta(8, 4) distribution: If the values were sampled from Beta(8, 4), then the points would lie approximately on a 45° straight line (shown on the QQ plot), within the bounds of sampling error.²⁶
The agreement between the approximate posterior produced by the Metropo-
lis algorithm and the true posterior distribution is very good, except at the
extreme left of the distribution, where the sampled values are slightly shorter-
tailed than the Beta(8, 4) distribution. The results for the second simulation,
employing the informative Beta(16, 16) prior, for which the true posterior is
Beta(23, 19) (shown in panels (c) and (d) of Figure 25.7), are similarly encourag-
ing. The acceptance rate for the Metropolis algorithm in the second simulation
was 63.2%. In both cases (but particularly with the flat prior), the Metropolis
samples of π are highly autocorrelated and would require thinning to produce
an approximately independent sample; see Figure 25.8.
In the first case, using the flat Beta(1, 1) prior, an estimate of π based on the median of the true Beta(8, 4) posterior distribution is π̂ = 0.676, and the 95% Bayesian credible interval for π from the 0.025 and 0.975 quantiles of the posterior is 0.390 < π < 0.891. In comparison, using the median and 0.025 and 0.975 quantiles of the Metropolis sample, we have π̂ = 0.677 and 0.392 < π < 0.891.

The analogous results for the second case, with the informative Beta(16, 16) prior, are π̂ = 0.548 and 0.397 < π < 0.693 based on the true Beta(23, 19) posterior, and π̂ = 0.549 and 0.396 < π < 0.692 based on the Metropolis sample.
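The medians and central intervals quoted above are just quantiles of the two beta posteriors. The sketch below approximates them by Monte Carlo using only the standard library; the draw count and seed are arbitrary choices for the sketch, and exact quantiles would normally come from a statistics library:

```python
import random

def beta_quantiles(a, b, probs, n=200_000, seed=42):
    """Monte Carlo approximation to quantiles of a Beta(a, b) distribution."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(a, b) for _ in range(n))
    return [draws[int(p * (n - 1))] for p in probs]

for a, b in [(8, 4), (23, 19)]:
    lo, med, hi = beta_quantiles(a, b, [0.025, 0.5, 0.975])
    print(f"Beta({a}, {b}): median {med:.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
# Close to the text's values: 0.676 with (0.390, 0.891) for Beta(8, 4),
# and 0.548 with (0.397, 0.693) for Beta(23, 19).
```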
Finally, returning to the flat prior, Figure 25.9 demonstrates how making the
standard deviation of the proposal distribution larger (it’s set initially, recall,
to 0.1) can, up to a point, decrease the autocorrelation of the sampled values,
reducing the need for thinning. In this example, setting the standard deviation
of the proposal distribution to 0.4 produces lower autocorrelation in the sampled
26 See Section 3.1.3 for an explanation of QQ plots.
Figure 25.7 Comparing the results produced by the Metropolis algorithm to the true posterior
distribution of the population proportion of heads π , based on an independent
sample of size n = 10 with h = 7 heads, and a prior distribution in the conjugate
beta family. For panels (a) and (b), the flat prior Beta(1, 1) was used, producing
the true posterior Beta(8, 4); in panels (c) and (d), the informative prior
Beta(16, 16) was used, producing the true posterior Beta(23, 19). Panels (a) and
(c) show nonparametric density estimates (solid lines) for the Metropolis samples,
comparing these to the true posterior densities (broken lines); panels (b) and (d)
are quantile-comparison plots for the Metropolis samples versus the true posterior
distributions.
Figure 25.8 Autocorrelations of the Metropolis samples from the two posterior distributions:
(a) based on the flat Beta(1, 1) prior, and (b) based on the informative Beta(16,
16) prior.
Figure 25.9 Autocorrelations of the sampled values of π produced by the Metropolis algorithm
using a flat prior, for various standard deviations of the normal proposal
distribution.
distributions of this form can arise naturally in the process of specifying hierar-
chical Bayesian statistical models in circumstances where it is difficult to derive
the joint distribution p(x) analytically.
The basic Gibbs sampler is simple to describe and proceeds as follows:
where

β₁₂ = σ₁₂ / σ₂²    (25.15)
α₁₂ = μ₁ − β₁₂μ₂
σ₁|₂² = σ₁²[1 − σ₁₂² / (σ₁²σ₂²)]    (25.16)
μ̂ = [1.007, 2.017]′    (25.18)

Σ̂ = [ 0.997  1.002 ]
    [ 1.002  4.022 ]
The sampled values are shown in Figure 25.10 along with a nonparametric
density estimate and the corresponding bivariate-normal density contours. Ev-
idently, the Gibbs sampler, like the Metropolis algorithm, accurately recovers
the means, variances, covariance, and shape of the bivariate-normal target dis-
tribution.
The values of X₁ and X₂ drawn by the Gibbs sampler are autocorrelated, but less so than those produced for this example by the Metropolis algorithm: See Figure 25.11 (cf. Figure 25.6 on page 22). Consequently, we can get an approximately independent sample with less thinning (taking every third or fourth value).
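The Gibbs sampler for this example alternates draws from the two univariate conditional normals defined by Equations 25.15 and 25.16 (and their mirror images for X₂ given X₁). A minimal sketch, using the example’s parameters and an arbitrary run length and seed:

```python
import random

# Target: bivariate normal with mu1 = 1, mu2 = 2, sigma1^2 = 1,
# sigma2^2 = 4, sigma12 = 1. The conditional slopes, intercepts, and
# standard deviations follow Equations 25.15-25.16 and their analogs.
mu1, mu2 = 1.0, 2.0
s1sq, s2sq, s12 = 1.0, 4.0, 1.0

b12 = s12 / s2sq                                    # 0.25
a12 = mu1 - b12 * mu2                               # 0.5
sd1 = (s1sq * (1 - s12**2 / (s1sq * s2sq))) ** 0.5  # sqrt(0.75)

b21 = s12 / s1sq                                    # 1.0
a21 = mu2 - b21 * mu1                               # 1.0
sd2 = (s2sq * (1 - s12**2 / (s1sq * s2sq))) ** 0.5  # sqrt(3)

rng = random.Random(2)
x1, x2 = 0.0, 0.0                        # arbitrary starting values
samples = []
for _ in range(50_000):
    x1 = rng.gauss(a12 + b12 * x2, sd1)  # draw X1 | x2
    x2 = rng.gauss(a21 + b21 * x1, sd2)  # draw X2 | x1
    samples.append((x1, x2))

m1 = sum(s[0] for s in samples) / len(samples)
m2 = sum(s[1] for s in samples) / len(samples)
print(round(m1, 2), round(m2, 2))  # approximately 1 and 2
```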
Hamiltonian Dynamics
Extending an example suggested by Neal (2011), think of a hockey puck (Neal’s
paper and this chapter were written in Canada after all) given an initial push in
a particular direction on a frictionless and completely flat horizontal ice surface:
The puck will continue to travel indefinitely in a straight line and with constant
velocity in the direction in which it’s pushed.28
Now imagine a surface that isn’t flat, such as the surface in Figure 25.12: The
graphs in this figure record the final results of two 100-step simulations, in which
27 Although it’s slightly off-topic, Susskind and Hrabovsky (2013) provide an especially lucid introduction to classical mechanics.
Figure 25.10 The gray dots show m = 10⁵ values drawn by the Gibbs sampler from the bivariate-normal distribution with μ₁ = 1, μ₂ = 2, σ₁² = 1, σ₂² = 4, and σ₁₂ = 1. The slightly irregular solid lines represent estimated density contours enclosing 50%, 95%, and 99% of the sampled points. The broken lines are the corresponding elliptical density contours of the bivariate-normal target distribution.
Figure 25.11 Autocorrelations of the values of (a) x₁ and (b) x₂ drawn by the Gibbs sampler applied to the bivariate-normal distribution with μ₁ = 1, μ₂ = 2, σ₁² = 1, σ₂² = 4, and σ₁₂ = 1.
Figure 25.12 Trajectories described by two pucks released on a frictionless surface. The puck at the left is released with 0 initial momentum; the puck at the right with small momentum in the directions of x₁ and x₂. The surface was generated by the bivariate-normal distribution with μ₁ = μ₂ = 0, σ₁² = σ₂² = 1, and σ₁₂ = 0, as described in the text.
two pucks are released on the surface depicted above the point x₀ = (−3.5, 2)′, where x₁ and x₂ are the horizontal axes of the 3D space:
• The puck at the left is released with 0 momentum, and is therefore initially
subject only to the force of gravity; the puck oscillates between the release
point and an equally high point opposite to it on the surface.
• The puck at the right is released with small momentum (0.5 and 1, respec-
tively) in the positive directions of the x1 and x2 axes; this puck describes
a more complex looping trajectory over the surface but also returns to its
starting point.
Assuming, without any real loss of generality, that the puck has mass equal to
1,29 at any instant, the potential energy of the puck due to gravity is equal to
its height, while its kinetic energy is equal to the sum of the squared values
of its momentum (mass × velocity = 1 × velocity) in the directions of x1 and
x2 . Because there is no loss of energy due to friction, conservation of energy
dictates that the sum of potential energy and kinetic energy remains the same
as the puck moves. When the puck moves downwards on the surface, its velocity
increases, and hence its momentum and kinetic energy increase, while its height
and hence its potential energy decrease; when the momentum of the puck carries
it upwards on the surface, against gravity, the opposite is the case: Momentum
and kinetic energy decrease, height and potential energy increase.
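The conservation argument above can be checked numerically. The sketch below uses the leapfrog integrator (described later in the chapter) to move a unit-mass puck over the bowl-shaped surface U(x) = (x₁² + x₂²)/2, the negative log-density (up to a constant) of two uncorrelated standard normals; the step size and step count are arbitrary choices for the sketch:

```python
# Leapfrog simulation of a unit-mass puck on the bowl U(x) = 0.5*(x1^2 + x2^2),
# the negative log-density (up to a constant) of two uncorrelated standard
# normals. Total energy H = U(x) + K(p), with kinetic energy
# K(p) = 0.5*(p1^2 + p2^2), should be (very nearly) conserved.

def grad_U(x):
    return [x[0], x[1]]  # gradient of U for this quadratic bowl

def energy(x, p):
    return 0.5 * (x[0]**2 + x[1]**2) + 0.5 * (p[0]**2 + p[1]**2)

def leapfrog(x, p, eps=0.05, steps=400):
    x, p = list(x), list(p)
    path = []
    for _ in range(steps):
        p = [pj - 0.5 * eps * g for pj, g in zip(p, grad_U(x))]  # half step: momentum
        x = [xj + eps * pj for xj, pj in zip(x, p)]              # full step: position
        p = [pj - 0.5 * eps * g for pj, g in zip(p, grad_U(x))]  # half step: momentum
        path.append((list(x), list(p)))
    return path

# Release the puck at (-3.5, 2) with momentum (0.5, 1), as in Figure 25.12.
x0, p0 = [-3.5, 2.0], [0.5, 1.0]
path = leapfrog(x0, p0)
h0 = energy(x0, p0)
drift = max(abs(energy(x, p) - h0) for x, p in path)
print(h0, drift)  # the drift is tiny: total energy is nearly constant
```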
By repeatedly introducing randomness into the momentum of the puck,
HMC is able to visit the surface more generally, favoring lower regions of the
surface. In a statistical application, the surface in question is the negative of the
log of a multivariate probability density function (up to an additive constant on
the log-density scale—i.e., a multiplicative constant on the density scale), and
thus low regions correspond to regions of high probability. In the context of
exploring a probability-density surface, the position variables X1 and X2 cor-
respond to the random variables to be sampled, while the momentum variables
are purely artificial, though necessary for the physical Hamiltonian analogy.
Metropolis proposals in HMC are more adapted to the probability surface
to be sampled than in the traditional Metropolis or Metropolis-Hastings algo-
rithms. By adjusting factors (discussed below) that affect the trajectory, it’s
possible to increase the proportion of accepted proposals and to decrease the
autocorrelation of successively sampled values.
The surface shown in Figure 25.12 is for a bivariate-normal distribution with mean vector μ = (0, 0)′ and covariance matrix Σ = I₂ (the order-two identity matrix), that is, two uncorrelated standard-normal random variables. The graphed surface is the negative log-density of this bivariate-normal distribution,
29 The motion of the puck doesn’t depend on its mass—recall Galileo’s famous, if apocryphal,
experiment in which he simultaneously dropped objects of different weight from the Leaning
Tower of Pisa and observed that they hit the ground at the same time—and setting mass to
1 simplifies the formulas given below. Thought of another way, the units of mass, say kg, are
purely conventional and arbitrary, so we’re entitled to pick units that make the mass of the
puck equal to 1.
32 CHAPTER 25. BAYESIAN ESTIMATION OF REGRESSION MODELS
Figure 25.13 Trajectory described by a puck released on a frictionless surface with 0 initial
momentum; 400 steps are shown. The surface was generated by the
bivariate-normal distribution with µ1 = µ2 = 0, σ1² = σ2² = 1, and σ12 = 0.5, as
described in the text.
omitting the normalizing constant. Thus, the negative log-density surface differs
from the graphed surface only by a constant difference in elevation, producing
essentially the same dynamics for pucks sliding along both surfaces. This sur-
face is a simple “bowl,” whose vertical slices are parabolic and whose horizontal
slices are circular, yielding the very simple dynamics illustrated in Figure 25.12.
For a partly contrasting example, consider the surface shown in Figure 25.13,
generated by the bivariate normal distribution with mean vector µ = (0, 0)′ and
covariance matrix
Σ = ( 1    0.5 )
    ( 0.5   1  )                    (25.19)
the same effect can be achieved with uncorrelated Xs that have different standard deviations,
stretching the bowl in the direction of the larger standard deviation.
25.2. MARKOV-CHAIN MONTE CARLO SIMULATION 33
physics, it’s conventional to represent the position variables by q and the momentum variables
by p.
32 How good the approximation is depends on the step size and length of the trajectory.
HMC Sampling
To implement HMC sampling, the leapfrog method can be used to generate
proposals to the Metropolis algorithm (Section 25.2.1).34 To generate a proposal
x∗ , start at the current values xi of the variables to be sampled, and randomly
select the starting momentum in each direction—here, the momentum values
are sampled independently from the standard-normal distribution, N (0, 1). The
Metropolis acceptance ratio (Equation 25.10 on page 18) becomes

a = exp[H(xi, mi) − H(x∗, m∗)]
The acceptance ratio a depends on both the momentum variables m and the
position variables x, because both are necessary to characterize the current state
of the system, even though only the position variables are of real interest.
If the leapfrog method were exact, then (because of conservation of energy)
the energy for the proposal at the end of the path, H(x∗ , m∗ ), would be exactly
equal to the energy at the beginning of the path, H(xi , mi ), in which case the
acceptance ratio a = exp(0) = 1, and the proposal would always be accepted.
The acceptance ratio can only depart from 1 due to discretization error in the
leapfrog method.35 If we “tune” the step size and number of steps well, therefore,
we should expect a high acceptance rate for proposals. To achieve both a high
rate of acceptance and nearly independent draws from the target distribution,
tuning is more critical for HMC than for simpler Metropolis sampling.
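To make the recipe concrete, here is a minimal HMC sampler in Python (an illustrative sketch I've added; the chapter itself uses Stan, and the function and variable names here are mine). It implements the leapfrog steps as Metropolis proposals, with unit mass as in footnote 29, applied to the "bowl" target of Figure 25.12:

```python
import numpy as np

def hmc_sample(U, grad_U, x0, n_draws, eps=0.2, n_steps=20, seed=None):
    """Minimal HMC: leapfrog trajectories as Metropolis proposals (unit mass)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    draws = np.empty((n_draws, x.size))
    accepted = 0
    for i in range(n_draws):
        m = rng.standard_normal(x.size)       # fresh N(0, 1) momentum each draw
        x_new, m_new = x.copy(), m.copy()
        m_new -= 0.5 * eps * grad_U(x_new)    # initial half-step for momentum
        for _ in range(n_steps - 1):
            x_new += eps * m_new              # full position step
            m_new -= eps * grad_U(x_new)      # full momentum step
        x_new += eps * m_new
        m_new -= 0.5 * eps * grad_U(x_new)    # final half-step for momentum
        # total energy H = potential U(x) + kinetic |m|^2 / 2
        h_cur = U(x) + 0.5 * (m @ m)
        h_new = U(x_new) + 0.5 * (m_new @ m_new)
        if np.log(rng.uniform()) < h_cur - h_new:   # Metropolis accept/reject
            x = x_new
            accepted += 1
        draws[i] = x
    return draws, accepted / n_draws

# The "bowl" of Figure 25.12: two uncorrelated standard-normal variables, whose
# negative log-density, up to an additive constant, is U(x) = x'x / 2.
draws, rate = hmc_sample(lambda x: 0.5 * (x @ x), lambda x: x,
                         x0=[0.0, 0.0], n_draws=5000, seed=1)
```

Because the leapfrog discretization error is small for modest step sizes, the acceptance rate is near 1, and the per-dimension step sizes eps can be scaled separately when the variables have different standard deviations.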
I proceed to apply HMC to the bivariate-normal target distribution with
µ1 = 1, µ2 = 2, σ1² = 1, σ2² = 4, and σ12 = 1,
lis algorithm instead of Metropolis-Hastings. We can also permit different “step” sizes for the
various elements of x, effectively multiplying the time-increment ε by scaling factors for the
position variables; this can be a useful approach when the variables have different scales (e.g.,
standard deviations). The resulting path along the surface isn’t a true Hamiltonian trajectory,
but it still provides legitimate update candidates to the Metropolis algorithm. A common al-
ternative is to complicate the Hamiltonian equations by introducing a diagonal n × n “mass
matrix” M, the diagonal entries of which reflect the scales of the variables.
35 Because, however, the leapfrog method is reversible, an error in approximating the Hamil-
used previously in Sections 25.2.1 and 25.2.2 to illustrate the Metropolis-Hastings
and Gibbs algorithms, producing the following estimates, based on m = 10⁵ sampled
values:
µ̂ = [1.001, 2.001]′                    (25.28)

Σ̂ = ( 0.997  1.001 )
     ( 1.001  3.956 )
The percentage of accepted proposals, 98.7%, is much higher than for the stan-
dard Metropolis version of this example in Section 25.2.1. The density estimate
in Figure 25.14 shows that the HMC sample closely reproduces the target distri-
bution, and the sampled values are much less autocorrelated (Figure 25.15) than
the values produced by the traditional Metropolis algorithm or by the Gibbs
sampler (cf., Figures 25.6 and 25.11, respectively on pages 22 and 30).
To achieve these desirable results, I had to select by trial and error suitable
values for the step sizes and number of steps. In particular, the step size was set
twice as large in the direction of X2 (which has standard deviation σ2 = 2) as
in the direction of X1 (which has standard deviation σ1 = 1).
Figure 25.14 The gray dots show m = 10⁵ values drawn by Hamiltonian Monte Carlo from the
bivariate-normal distribution with µ1 = 1, µ2 = 2, σ1² = 1, σ2² = 4, and σ12 = 1.
The slightly irregular solid lines represent estimated density contours enclosing
50%, 95%, and 99% of the sampled points. The broken lines are the corresponding
elliptical density contours of the bivariate-normal target distribution.
Figure 25.15 Autocorrelations of the values of (a) x1 and (b) x2 drawn by HMC applied to the
bivariate-normal distribution with µ1 = 1, µ2 = 2, σ1² = 1, σ2² = 4, and σ12 = 1.
Figure 25.16 Potential issues in MCMC trace plots for a parameter θ : (a) burn-in required; (b)
highly autocorrelated values; (c) lack of convergence; and (d) two independent
chains that have apparently converged to different values.
upper lines are computed respectively for the values below and above the mean.
Both the center and spread of the simulated values appear to be constant over
the 100,000 simulated draws.
Gelman et al. (2013, Section 11.4) also suggest the following numerical proce-
dure for assessing convergence:
1. Simulate two or more independent Markov chains (to be able to detect the
problem illustrated in Figure 25.16 (d)).
2. Discard the first half of each chain (that is, use the first half of the sampled
values as a burn-in period, to avoid the problem in Figure 25.16 (a)).
3. Split the retained values from each chain in half, treating each half as a
separate chain (to be able to detect the problem illustrated in Figure 25.16
(c)).
Figure 25.17 Trace plots of x1 and x2 for the 10⁵ HMC samples drawn from the
bivariate-normal distribution with µ1 = 1, µ2 = 2, σ1² = 1, σ2² = 4, and σ12 = 1.
The broken horizontal lines are drawn at the average values of x1 and x2. The
central solid line in each graph is from a nonparametric regression of the x-values
versus index, and the lower and upper solid lines are from nonparametric
regressions for the values below and above the mean, respectively.
called the potential scale-reduction factor because it estimates how much the
dispersion of the posterior distribution of θ would decline if there were an infinite
number of MCMC samples. An R̂ substantially larger than 1 (say, exceeding 1.1)
suggests that the several chains are heterogeneous, indicative of a convergence
problem.36
I applied Gelman et al.'s procedure to the HMC samples drawn from the
bivariate-normal distribution with µ1 = 1, µ2 = 2, σ1² = 1, σ2² = 4, and σ12 = 1,
sampling a second Markov chain of 10⁵ values, and splitting each chain in half;
I didn't bother with a burn-in period for this example. Thus, p = 4 and m =
5 × 10⁴. The values of R̂ for x1 and x2 were both equal to 1 to several places to
the right of the decimal point, consistent with convergence of the chains to the
target distribution.
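The split-chain R̂ computation can be sketched in a few lines (my own illustration of the logic of Gelman et al.'s procedure, with hypothetical function names; their published estimator differs in details):

```python
import numpy as np

def split_rhat(chains):
    """Potential scale-reduction factor after splitting each chain in half,
    following the logic of Gelman et al. (2013, Section 11.4)."""
    halves = []
    for c in chains:
        c = np.asarray(c, dtype=float)
        n = c.size // 2
        halves += [c[:n], c[n:2 * n]]
    x = np.stack(halves)                   # p split chains, each of length m
    p, m = x.shape
    B = m * x.mean(axis=1).var(ddof=1)     # between-chain variance
    W = x.var(axis=1, ddof=1).mean()       # mean within-chain variance
    var_plus = (m - 1) / m * W + B / m     # pooled estimate of the target variance
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(2)
# Two chains drawn independently from the same target: R-hat near 1.
good = [rng.standard_normal(10_000) for _ in range(2)]
# Two chains centered at different values, as in Figure 25.16 (d): R-hat >> 1.
bad = [rng.standard_normal(10_000), rng.standard_normal(10_000) + 3]
```

The second example shows why splitting and comparing chains works: heterogeneous chains inflate the between-chain variance B, pushing R̂ well above 1.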
Gelman et al. (2013, Section 11.5) also suggest a measure of the effective
sample size for a set of autocorrelated Markov chains. Consider p chains for a
parameter θ, each of size m, thus comprising p × m non-independent draws from
the target distribution, where θij is the ith draw in the jth chain. Adapting
Equation 25.12 (on page 22) to pool the sampled values across the several chains,
the estimated autocorrelations are

r_t = [ Σ_{j=1..p} Σ_{i=t+1..m} (θ_ij − θ̄··)(θ_{i−t,j} − θ̄··) ] / [ Σ_{j=1..p} Σ_{i=1..m} (θ_ij − θ̄··)² ],   t = 1, 2, . . .
36 Brooks and Gelman (1997) introduce an improved estimate of R, but the details need not
concern us here.
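The pooled autocorrelations, and a crude effective sample size built from them, can be computed directly (an illustrative sketch with my own function names; Gelman et al. truncate the sum of autocorrelations more carefully than the simple first-negative-lag rule used here):

```python
import numpy as np

def pooled_autocorr(chains, max_lag):
    """Autocorrelations r_t pooled across p chains, as in the displayed
    equation: deviations are taken from the grand mean of all sampled values."""
    x = np.stack([np.asarray(c, dtype=float) for c in chains])   # p x m array
    d = x - x.mean()
    denom = (d ** 2).sum()
    m = x.shape[1]
    return np.array([(d[:, t:] * d[:, :m - t]).sum() / denom
                     for t in range(1, max_lag + 1)])

def effective_size(chains, max_lag=200):
    """Crude effective sample size p*m / (1 + 2 * sum of r_t), truncating the
    sum at the first negative estimated autocorrelation."""
    r = pooled_autocorr(chains, max_lag)
    if (r < 0).any():
        r = r[:np.argmax(r < 0)]
    return np.stack(chains).size / (1 + 2 * r.sum())

rng = np.random.default_rng(4)
iid = [rng.standard_normal(5000) for _ in range(2)]     # independent draws

def ar1(n, rho, rng):
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for i in range(1, n):
        x[i] = rho * x[i - 1] + rng.standard_normal()   # autocorrelated draws
    return x

corr = [ar1(5000, 0.9, rng) for _ in range(2)]
```

For independent draws the effective size is close to the nominal p × m; for strongly autocorrelated chains it is a small fraction of it.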
Writing statistical models in this form will be convenient in more complex ap-
plications, such as the mixed-effects models considered in Section 25.4.
With the likelihood of the data in hand, we next require the joint prior
distribution of the k + 2 parameters of the model, α, β1, β2, . . . , βk, σε², which is
typically addressed by specifying independent prior distributions for the various
parameters.37
For concreteness, let's focus on Duncan's occupational prestige regression
(introduced in Section 5.2 and discussed at several points in the text). Recall
that the response variable in Duncan's regression is the prestige of each of 45
U.S. occupations, circa 1950, measured as the percentage of ratings of "good"
or better in a national survey. There are two explanatory variables in Duncan's
regression—the educational and income levels of the occupations, measured re-
spectively as the percentage of high-school graduates in the occupation and the
percentage of individuals in the occupation earning $3500 or more, both from
the 1950 U.S. Census.
I previously fit Duncan’s model by least-squares regression (see Sections 5.2.1
and 6.2.2). Table 25.1 compares the least-squares estimates of the regression
coefficients, their standard errors,38 and the error standard deviation to two
sets of Bayesian estimates: one produced by specifying flat independent prior
distributions for all of the parameters of the model, and the other produced by
specifying vaguely informative independent prior distributions for the parame-
ters, both to be explained presently. The table also shows the R̂ convergence
diagnostic for each Bayesian estimate, all of which round to 1.000, along with the
effective sample size, m_eff. Trace plots for the two sets of Bayesian parameter
estimates were unremarkable and aren't shown.39
37 We know (see, e.g., Section 9.4.4) that when the regressors in a linear model are correlated,
so are the estimated regression coefficients, and consequently independent prior distributions
for regression coefficients aren’t generally credible. It’s not obvious, however, how one would
go about specifying non-independent priors for the parameters of the model. See Section 25.5
and footnote 40 for further discussion of this point.
38 In the case of the Bayesian estimates, the “standard errors” are the standard deviations
The flat improper priors employed for the first set of Bayesian estimates are
straightforward for the three regression coefficients:
α ∼ Unif(−∞, ∞)
β1 ∼ Unif(−∞, ∞)
β2 ∼ Unif(−∞, ∞)
The error standard deviation σε is non-negative, and so I used a flat prior for
its log: loge σε ∼ Unif(−∞, ∞).
The second set of Bayesian estimates, employing vaguely informative priors,
requires more explanation:
• I centered the explanatory variables, income and education, to 0 means,
x∗i1 = xi1 − x̄1, x∗i2 = xi2 − x̄2, so that the regression model becomes
Yi |x∗i1, x∗i2 ∼ N (α∗ + β1 x∗i1 + β2 x∗i2, σε²). That's helpful for thinking about
the prior distribution of the intercept, because now the intercept α∗ is
interpretable as the expected value of Y when the xs are equal to their
means (and in least-squares regression α̂∗ would simply be the uncondi-
tional mean of Y ).40
using Stan. For each model, I sampled four independent chains, each of length 10,000, with the
first 5000 iterations used for burn-in and no thinning of the remaining 5000 iterations; thus,
4 × 5000 = 20,000 sampled values of the parameters were retained in each case. I generally
omit reporting R̂, m_eff, and trace plots in the subsequent examples in this chapter, because,
unless indicated to the contrary, all of the diagnostics are satisfactory.
40 A careful reader may notice that centering the explanatory variables at their sample
25.3. LINEAR AND GENERALIZED LINEAR MODELS 45
• The explanatory variables x1 and x2 are also percentages, and I set the
prior distributions of the corresponding regression coefficients to β1 ∼
N (0, 1) and β2 ∼ N (0, 1), effectively restricting these coefficients to the
range between −3 and 3. A 3-percent change in prestige for a 1-percent
increase in income or education would be a very large effect.
• Finally, I specified loge σε ∼ N (0, 1.5²), which constrains the error stan-
dard deviation to lie approximately between e^(−3×1.5) ≈ 0.01 and e^(3×1.5) ≈
90. Recall that σε is on the same percentage scale as Y .
Prior SDs                          −3     −2     −1      1      2      3
Centered Intercept (α∗)             5     20     35     65     80     95
Income Coefficient (β1)            −3     −2     −1      1      2      3
Education Coefficient (β2)         −3     −2     −1      1      2      3
Error Standard Deviation^a (σε)  0.01   0.05   0.22    4.5  20.09     90
a Showing e^(Z×1.5), where Z is −3, −2, −1, 1, 2, 3.
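The implied-scale table can be reproduced with a few lines of Python (a sketch I've added for illustration; the chapter itself presents no code). The coefficient rows are just the prior standard deviations times Z, and the error-SD row maps the prior on the log scale back to the SD scale. The intercept prior, N(50, 15²), is an inference from the table's entries, since the exact specification for α∗ falls outside this excerpt.

```python
import math

# Reproduce the implied-scale table at Z = -3, ..., 3 prior SDs.
# error-SD row: log_e(sigma_eps) ~ N(0, 1.5^2) mapped back via e^(1.5 Z);
# intercept row: consistent with alpha* ~ N(50, 15^2) (inferred from the table).
Z = [-3, -2, -1, 1, 2, 3]
intercept = [50 + 15 * z for z in Z]        # centered-intercept row
coefficients = [1.0 * z for z in Z]         # beta1, beta2 rows (SD = 1)
error_sd = [math.exp(1.5 * z) for z in Z]   # error-SD row
```

Running this recovers the tabled values (e.g., e^(−4.5) ≈ 0.01 and e^(4.5) ≈ 90 for the error standard deviation).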
unlikely—to sample a value outside the range 0 to 100. Moreover, the normal linear model,
whether estimated by least-squares or by Bayesian methods, doesn’t constrain Y to lie between
0 and 100.
Figure 25.18 Vaguely informative priors for the parameters of Duncan's occupational prestige
regression model: (a) prior for the intercept α∗ of the centered model; (b) common
prior for the regression coefficients β1 and β2 of income and education; (c) and (d)
prior for the error standard deviation σε.
• Ironically, the justification for vague priors like the ones employed here
often appeals to frequentist properties of the resulting estimators, such as
robustness—the vague priors are not very restrictive but they do rule out
wild estimates of the regression coefficients—and by shrinking regression
coefficients towards 0 may improve the mean-squared error of estimates,
a process termed regularization.
that is, the posterior distribution is only approximated by MCMC—but even if the posterior
produced by MCMC were exact, the MLE would correspond to the posterior mode, not to
the posterior mean, and the latter is used for the Bayesian point estimates reported here.
The normal linear regression model provides a probability model for the data
from which the likelihood can be calculated:

L(α, β1, . . . , βk, σε²) = Π_{i=1..n} (2πσε²)^(−1/2) exp[ −(Yi − µi)² / (2σε²) ]

where µi = α + β1 xi1 + · · · + βk xik.
Table 25.2 Maximum-Likelihood and Bayesian Estimates for Cowles and Davis’s Logistic
Regression.
Table 25.3 Summaries of Vaguely Informative Prior Distributions for the Parameters of
Cowles and Davis’s Logistic Regression
Prior SDs, Z                         −3      −2      −1       1       2       3
π = 1/[1 + e^(−Z×SD(α))]           .047     .12     .27     .73     .88     .95
e^(Z×SD(β1))                      0.050    0.13    0.37     2.7     7.4      20
e^(24×Z×SD(βj)), j = 2, 3, 4      0.027   0.091    0.30     3.3      11      37
certainly suitable for a probability.45 Also see Table 25.3 for the implied
probabilities of volunteering at ±1, 2, and 3 prior standard deviations, and
Figure 25.19 (a) for a graph of the prior for α.
Figure 25.19 Vaguely informative priors for the coefficients of the logistic-regression model fit to
Cowles and Davis’s data on volunteering for a psychological experiment.
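Table 25.3's entries can likewise be recomputed directly (a sketch added for illustration). The prior SDs here are reverse-engineered from the tabled values themselves — SD(α) = 1, SD(β1) = 1, and SD(βj) = 0.05 for j = 2, 3, 4, with the last scaled by 24, apparently the range of the corresponding explanatory variables — so treat them as inferences, not as the chapter's stated specification.

```python
import math

# Recompute Table 25.3 at Z = -3, ..., 3 prior SDs; SD values inferred from
# the table (assumed: SD(alpha) = 1, SD(beta1) = 1, SD(beta_j) = 0.05).
Z = [-3, -2, -1, 1, 2, 3]
pi = [1 / (1 + math.exp(-z * 1.0)) for z in Z]   # implied P(volunteering)
odds_b1 = [math.exp(z * 1.0) for z in Z]         # multiplicative odds effect
odds_bj = [math.exp(24 * z * 0.05) for z in Z]   # odds effect over a 24-point range
```

The inverse-logit row shows why a N(0, 1) prior on the intercept is "certainly suitable for a probability": three prior SDs in either direction span roughly .05 to .95.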
ηi = α + β1 xi1 + · · · + βk xik
µi = g⁻¹(ηi)
Yi |xi1, . . . , xik ∼ p(µi, φ)
Figure 25.20 Complete separation for a binary response variable; the filled points represent
Y = 1 (“successes”) and the hollow points Y = 0 (“failures”).
Table 25.4 Maximum-Likelihood and Bayesian Estimates of a Logistic Regression Model with
Complete Separation of the Data (see Figure 25.20).
efficient method for computing the resulting estimates. Firth’s estimates for the
data in Figure 25.20 are shown in Table 25.4, along with similar estimates (but
smaller posterior standard deviations, labeled “standard errors” in the table)
produced by using diffuse normal priors for the logistic-regression coefficients:
α ∼ N (0, 20²)
β1 ∼ N (0, 20²)
β2 ∼ N (0, 20²)
Firth (1993) showed that substantial bias reduction in estimating the logistic
regression model under difficult circumstances can be achieved by employing the
Jeffreys prior. An especially problematic data pattern for logistic regression to
which Firth’s estimator is applicable is complete separation, where a linear func-
tion of the regressors in the model partitions the data into disjoint regions of 0s
and 1s. In this case, maximum-likelihood estimation of the logistic regression coef-
ficients produces one or more infinite estimates, while Firth’s method yields finite
estimates.
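A small simulation makes the pathology concrete (my own sketch, not from the chapter; an L2 penalty stands in only loosely for Firth's Jeffreys-prior correction). Under complete separation, the maximum-likelihood coefficients keep growing without bound as the optimizer runs, whereas even a mild penalty pins them at finite values:

```python
import numpy as np

# Completely separated data: the sign of x1 + x2 perfectly predicts y,
# with a gap around the separating line so the margin is not vanishingly small.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 2))
X = X[np.abs(X.sum(axis=1)) > 0.3][:40]
y = (X.sum(axis=1) > 0).astype(float)

def fit_logistic(X, y, penalty=0.0, iters=2000, lr=0.5):
    """Logistic regression by gradient ascent, optionally with an L2 penalty
    (a rough stand-in for Firth's penalization, not the same estimator)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    b = np.zeros(Xd.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-np.clip(Xd @ b, -30, 30)))
        b += lr * (Xd.T @ (y - p) / len(y) - penalty * b)
    return b

# Under separation the ML coefficients never stop growing with more iterations...
b_short, b_long = fit_logistic(X, y, iters=2000), fit_logistic(X, y, iters=8000)
# ...while the penalized coefficients converge to finite values.
p_short, p_long = (fit_logistic(X, y, penalty=0.01, iters=n) for n in (2000, 8000))
```

Comparing the coefficient norms at 2,000 and 8,000 iterations shows continued divergence in the ML fit and stability in the penalized fit.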
25.4. MIXED-EFFECTS MODELS 55

• the Yij are the values of the response, log-exercise, for individual i on occasion j,
and are normally and independently distributed with expectation µij and common
error variance σε²;

• x1i is a dummy regressor coded 1 for patients and 0 for control subjects;

distinction between fixed effects, which are conceptualized as parameters with fixed though
unknown values, and random effects (e.g., individual-specific regression coefficients), which
are random variables whose variance and covariance components are parameters. Bayesians
treat all unknown quantities as random variables. A Bayesian, therefore, might emphasize
the application, and term such a model "hierarchical," "multilevel," or "longitudinal" rather
than calling it a "mixed-effects" model. The key point, however, is that the random effects are
expressed in terms of more fundamental parameters, the variance and covariance components,
and so I continue to find the distinction between fixed and random effects useful.

48 If you're unfamiliar with matrix notation, just think of b_i as a list of the two person-
specific regression coefficients for individual i, and Ψ as a table of the variance-covariance
components (see Equation 25.31).
• B0i is the deviation of the intercept for individual i from the fixed-effects
intercept β0 , for the control group, or from β0 + β1 , for the patient group;
and
• B1i is the deviation of the age slope for individual i from the fixed-effects
slope β2 , for the control group, or from β2 + β3 , for the patient group.
Prior SDs −3 −2 −1 1 2 3
Hours of exercise 0.353 0.5 0.707 1.41 2 2.83
49 The ML estimates in Table 25.5 are similar, but not identical, to the REML estimates
given in the table on page 721; note as well that the latter shows the estimated covariance
ψ01 of the random-effect intercepts and slopes rather than their correlation ρ01 . Standard
errors of variance and covariance components are of limited usefulness and are not shown.
• Setting SD(β1) = 0.5 constrains the intercept for the patient group to
differ from that of the control group by a multiplicative factor of from
2^(−3×0.5) ≈ 1/3 to 2^(3×0.5) ≈ 3.
• Similarly, setting SD(β2 ) = 0.5 constrains the age slope in the control
group to a decline or increase in exercise by a factor of at most 3 per year.
The table for β0 , shown above, is also applicable to β1 and β2 , both of
which set the prior standard deviation to 0.5, treating the numbers given
in the table as multiplicative effects.
• Finally, setting SD(β3) = 1 allows the age slope in the patient group to
be smaller or larger than that in the control group by a factor of at most
2³ = 8; more generally, checking at ±1, 2, and 3 prior SDs,
Prior SDs               −3     −2     −1      1     2     3
Multiplicative factor  1/8    1/4    1/2      2     4     8
The standard deviations of the normal priors for the random-effect log stan-
dard deviations, loge ψ0 and loge ψ1 , and for the error log standard deviation,
loge σε , are harder to understand because the response is itself on the log-base-2
scale. I’ve taken these prior standard deviations as 0.4, which implies the follow-
ing multiplicative constraint on the original hours-per-week scale of the response:
from 2^exp(−3×0.4) ≈ 1.2 to 2^exp(3×0.4) ≈ 10, which is quite broad—ranging over
nearly an order of magnitude. More generally,
Prior SDs −3 −2 −1 1 2 3
Multiplicative factor 1.23 1.37 1.59 2.81 4.68 9.99
The correlation ρ between the random-effect intercept B0 and slope B1 also
merits special treatment because, as a correlation, it’s bounded between −1
and 1, and thus isn’t naturally modeled with a normal prior. R. A. Fisher’s
z-transformation of the correlation (introduced in Fisher, 1915),

z(ρ) ≡ (1/2) loge[(1 + ρ)/(1 − ρ)] = arctanh(ρ)
maps ρ to (−∞, ∞) and serves to normalize its distribution. I then used the prior
z(ρ01) ∼ N (0, 0.6²), which effectively constrains ρ01 between tanh(−3 × 0.6) ≈
−.95 and tanh(3 × 0.6) ≈ .95.50 Checking more generally at ±1, 2, and 3 prior
standard deviations,
50∗ This trick works because there are only two sets of random-effect coefficients in the
model, the B0i s and the B1i s. More generally, the random-effect covariance matrix Ψ can be
larger than 2 × 2 but must be positive-definite. It consequently requires a parametrization
and prior distribution that preserve positive-definiteness. See the references at the end of the
chapter for further discussion, in particular, the LKJ prior for the correlation matrix of the
random effects described by Gelman et al. (2013, pages 578, 584).
You may be unfamiliar with the hyperbolic tangent function, tanh, and its inverse, the
arctanh function, which is used for Fisher’s z-transformation. The hyperbolic tangent of x is
defined as
tanh(x) ≡ (e^(2x) − 1)/(e^(2x) + 1)
and arctanh(y) = x when tanh(x) = y.
Prior SDs     −3      −2      −1       1      2      3
ρ01         −.947   −.834   −.537    .537   .834   .947
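The z-transformation and its inverse are easy to verify numerically (a sketch I've added; function names are mine). Mapping ±1, 2, and 3 prior SDs of z(ρ01) ∼ N(0, 0.6²) back through the hyperbolic tangent reproduces the row of the table above:

```python
import math

# Fisher's z and its inverse, as defined in the footnote.
def z_transform(rho):
    return 0.5 * math.log((1 + rho) / (1 - rho))          # = arctanh(rho)

def inv_z(z):
    return (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)  # = tanh(z)

# +/- 1, 2, 3 prior SDs of z(rho01) ~ N(0, 0.6^2), mapped back to the rho scale
bounds = [round(inv_z(z * 0.6), 3) for z in (-3, -2, -1, 1, 2, 3)]
```

The round trip inv_z(z_transform(ρ)) = ρ confirms the two functions are inverses.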
The prior distributions for the fixed effects, error standard deviation, random-
effect standard deviations, and random-effect correlation are graphed in Fig-
ure 25.21.
The ML estimates and the Bayesian estimates using flat priors are very
similar to one another (see Table 25.5), as is to be expected. The Bayesian
estimates produced by the vaguely informative priors just discussed differ only
slightly from the other two sets of estimates.
I could take this example in several directions. In Section 23.4, for example,
I considered different structures for the random effects in the model along with
the possibility of serially correlated intra-individual errors. Here I’ll pursue
another issue in the Davis et al. data: the relatively large number of 0 responses
(about 12% overall). To this point, I’ve dealt with the 0s by adding 5 minutes
to each exercise value prior to taking logs to reduce the positive skew in the
distribution of the response. I’ll instead now specify a so-called hurdle model ,
which takes the 0s explicitly into account, and which is similar in spirit to the
zero-inflated Poisson and negative-binomial models for overdispersed count data
discussed in Section 15.2.1.51
51 I don’t want to imply, however, that a hurdle model requires a Bayesian approach—it’s
possible, with proper software (and such software exists), to fit a mixed-effects hurdle model
by maximum likelihood.
Figure 25.21 Vaguely informative priors for the parameters of the simple mixed-effects model fit
to Davis et al.'s data on exercise and eating disorders.
Table 25.5
Bayes, Vague Priors −0.279 −0.325 0.0634 0.239 1.454 0.191 −0.301 1.233
(0.163) (0.207) (0.0320) (0.040)
a For Bayesian estimates, posterior standard deviation.
b Cf., the REML estimates on page 721.
where
• the dichotomous response Nij = 0 if exercise is 0 for individual i on
occasion j and Nij = 1 if exercise is nonzero;
• ηij is the linear predictor for the logit part of the model;
• the ξs are fixed effects in the logit part of the model;
• the Zs are random-effects coefficients in the logit part of the model;
• the elements of Ψz are variance and covariance components for the random
effects in the logit part of the model; and
• Yij , µij , the βs, the Bs, and Ψb (replacing Ψ) are defined as in Equa-
tions 25.30 (on page 55), but only for nonzero exercise, where Yij is now
hours of exercise without the necessity of adding a start to the exercise
values to avoid the log of 0.
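The two-part structure of the hurdle model can be sketched as a log-likelihood computation for a single observation (an illustrative Python sketch; the names and signature are mine, and the chapter's actual estimation is done by HMC over the full hierarchical model):

```python
import math

def hurdle_loglik(y, eta, mu, sigma):
    """Log-likelihood contribution of one exercise measurement under the
    two-part hurdle model: eta is the linear predictor of the logit part,
    mu and sigma the mean and SD of log2(hours) given nonzero exercise."""
    p_nonzero = 1 / (1 + math.exp(-eta))   # P(N_ij = 1) from the logit part
    if y == 0:
        return math.log(1 - p_nonzero)     # the hurdle is not cleared
    # Nonzero hours: no start value is needed before taking logs, since y > 0
    return (math.log(p_nonzero)
            - 0.5 * math.log(2 * math.pi * sigma ** 2)
            - (math.log2(y) - mu) ** 2 / (2 * sigma ** 2))
```

A zero observation contributes only through the logit part; a nonzero observation contributes both the probability of clearing the hurdle and a normal density for its log2 hours.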
I’ll use the same priors as before for the linear part of the model, along with
the following vague priors for the logit part of the model:
• ξ0 is the average logit of nonzero exercise for the control subjects at age
8. I know that overall about 1/8 of exercise measurements are 0s and thus
about 7/8 are nonzero, but the proportion of 0s is likely greater at age 8 than
52 As before, if you're unfamiliar with matrix notation, just think of z_i and b_i as lists
of individual-specific regression coefficients and Ψz and Ψb as tables of the variance and
covariance components for the corresponding random effects.
overall.53 To use the data to help specify the prior is problematic from
a Bayesian perspective,54 however, and so I’ll disregard this information,
and simply specify the normal prior ξ0 ∼ N (0, 1). Then three prior stan-
dard deviations correspond to logits between −3 and 3, which translate
to probabilities of nonzero exercise between approximately .05 and .95, a
broad prior range. More generally, we have the following probabilities of
nonzero exercise at ±1, 2, and 3 prior SDs:
Prior SDs −3 −2 −1 1 2 3
π .047 .119 .269 .731 .881 .953
Prior SDs −3 −2 −1 1 2 3
relative odds 0.22 0.37 0.61 1.65 2.7 4.5
• ξ2 is the average age slope of nonzero exercise for control subjects on the
logit scale. Using the prior ξ2 ∼ N (0, 0.5²) therefore permits the odds of
nonzero exercise to either increase or decrease by at most a factor of 5 per
year (and the relative odds at ±1, 2, and 3 prior SDs are as in the table
for ξ1 immediately above).
• ξ3 is the average difference in the age slope of nonzero exercise between
eating-disordered and control subjects on the logit scale. Once more us-
ing the prior ξ3 ∼ N (0, 0.5²) constrains the odds of nonzero exercise to
increase or decrease in eating-disordered subjects up to 5 times more slowly
or rapidly in comparison to control subjects.55 Again, the relative odds
at ±1, 2, and 3 prior SDs are as in the table for ξ1.
• I'll use the same prior distributions for the variance-covariance components
of the random effects in the logit part of the model as in the linear part of
the model, thus taking loge ψZj ∼ N (0, 0.4²) for j = 0, 1 and z(ρZ0 Z1) ∼
N (0, 0.6²).
The priors for the fixed-effect coefficients in the logit part of the hurdle model
are shown in Figure 25.22.
As usual, I estimated the posterior distribution of the parameters by Hamil-
tonian Monte Carlo, using four independent Markov chains of 10,000 iterations
53 Of course, we don’t have to speculate because we have the data. I invite the reader to
Figure 25.22 Vaguely informative priors for the fixed-effect intercept (a) ξ0 and (b) coefficients
ξ1, ξ2, and ξ3 in the logit part of the hurdle model fit to Davis et al.'s data on
exercise and eating disorders.
each, the first half of which were discarded as warm-up, leaving m = 20,000 sam-
pled values of each parameter. The hurdle model, as specified, proved difficult
to estimate, with two of the parameters having unacceptably small estimated
effective sample sizes, m̂_eff = 346 for ψZ0 and m̂_eff = 252 for ρZ0 Z1.
I encountered a similar problem in Section 23.4, where I couldn’t fit a mixed
model to the Davis et al. data that had random intercepts, random slopes, and
serially correlated errors. After all, there are few measurements per individual
(about 4, on average) to estimate the variance and covariance components, along
with parameters modeling serial dependence in the errors.
In the current context, tightening the priors for the variance-covariance com-
ponents of the logit part of the model might make it possible to estimate the
hurdle model.56 I decided instead to simplify the random-effects structure by
eliminating random intercepts from the logit part of the model, retaining ran-
dom slopes; that reduces the variance-covariance components for this part of the
model to the single standard deviation ψZ1 . The resulting parameter estimates
(posterior means and standard deviations) are shown in Table 25.6.
56 Fiddling with a prior in light of an unsatisfactory posterior could also be regarded from a
Bayesian point of view as cheating; see Section 25.5 for further discussion, and Exercise 25.7.
Table 25.6 Posterior Means and Standard Deviations for the Mixed-Effects Hurdle Model fit
to the Davis et al. Data.
Figure 25.23 Average exercise (displayed as points) as a function of age and group, based on the
mixed-effects hurdle model fit to the Davis et al. data. The bars around the points
show 95% central credible intervals.
for the effects. Compare this graph to Figure 23.10 (page 725), which is based on
a simpler mixed-effects model for log2 (exercise + 5 minutes) fit by REML (and
which evaluates the fitted model between ages 8 and 16 rather than between 8
and 18).
versity in Toronto, whose knowledge of the foundations of statistical inference vastly exceeds
mine. Of course, Georges isn’t responsible for my remarks here.
58 Indeed, it’s my impression that often less thought is given to the formulation of vaguely
the data to construct a prior distribution. Empirical Bayes methods are potentially applicable
to situations in which parameters are hierarchically related, as is the case in the mixed effects
models considered in Section 25.4, where, for example, the distributions of individual-level
regression coefficients depend on variance-covariance components, which are termed hyperpa-
rameters. The justification of empirical Bayes methods typically appeals to reduced mean-
squared error of estimation in comparison to classical estimators, a frequentist criterion (see
on-line Appendix Section D5.2). Empirical Bayes estimators may be thought of as approx-
imations to Bayesian estimators. Efron and Morris (1977) and Casella (1985) provide nice
introductions to empirical Bayes methods. Also see the discussion in footnote 40 (page 44) of
conditioning on the sample distribution of the explanatory variables.
Exercises
Please find data analysis exercises and data sets for this chapter on the website
for the book.
Exercise 25.1. Return to the application of Bayes’s theorem described at the
end of Section 25.1.1 (on page 3). On the basis of the information supplied,
calculate the probability that you have antibodies to the virus given a positive
test, Pr(A|P ). Compare this probability to the probability of a positive test
given antibodies Pr(P |A). Are you surprised by the difference between these
two probabilities? Can you explain the source of the difference in simple terms?
In developing your explanation, it might help to consider what Pr(A|P ) would
be in a population in which no one had antibodies—that is, for which Pr(A) = 0.
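The base-rate effect at issue in this exercise is easy to see numerically. The Python sketch below applies Bayes's theorem to a hypothetical screening test; the prevalence, sensitivity, and specificity used are illustrative assumptions, not the figures supplied in Section 25.1.1.

```python
# Bayes's theorem for a diagnostic test: Pr(A|P) from Pr(P|A), Pr(P|not-A), and Pr(A).
# The numbers below are hypothetical, chosen only to illustrate the base-rate effect.

def posterior_positive(prevalence, sensitivity, specificity):
    """Pr(antibodies | positive test) via Bayes's theorem."""
    # Law of total probability: Pr(P) = Pr(P|A)Pr(A) + Pr(P|not-A)Pr(not-A)
    pr_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    return sensitivity * prevalence / pr_positive

# A test with 99% sensitivity and 95% specificity in a population
# where only 1% have antibodies:
ppv = posterior_positive(prevalence=0.01, sensitivity=0.99, specificity=0.95)
print(ppv)  # about 0.167: far below Pr(P|A) = 0.99

# If no one has antibodies, the posterior is 0 no matter how good the test is:
print(posterior_positive(prevalence=0.0, sensitivity=0.99, specificity=0.95))
```

Because the disease is rare in this hypothetical population, most positive tests come from the large pool of antibody-free individuals, which is the intuition the exercise asks you to articulate.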
Exercise 25.2. The preliminary example of Bayesian inference in Section 25.1.2
describes a situation in which two biased coins, one with the probability of a
head Pr(H) = .3 and the other with Pr(H) = .8, are loose in a drawer and
you don’t know which is which. Suppose instead that these two biased coins
are mixed in with eight fair coins for which Pr(H) = .5. As in Section 25.1.2,
you choose one coin from the drawer and flip it 10 times, observing 7 heads
in the 10 flips. Let H1 represent the hypothesis that you picked the coin with
Pr(H) = .3, H2 the hypothesis that you picked the coin with Pr(H) = .8, and
H3 the hypothesis that you picked one of the coins with Pr(H) = .5. What is
a reasonable set of prior probabilities for these hypotheses? Find the likelihood
of the data under each hypothesis, and then compute the posterior probabilities
for the three hypotheses. What do you conclude?
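The mechanics of such a calculation can be sketched in a few lines of Python. The prior below is the uniform-over-coins prior suggested by the drawer setup (an assumption you should justify or revise in your answer); the likelihoods are binomial.

```python
from math import comb

# Binomial likelihood of h = 7 heads in n = 10 flips under each hypothesized Pr(H),
# then Bayes's theorem over the three hypotheses.
n, h = 10, 7
p_head = [0.3, 0.8, 0.5]   # H1, H2, H3
prior = [0.1, 0.1, 0.8]    # 1, 1, and 8 of the 10 coins, assumed equally likely to be drawn

likelihood = [comb(n, h) * p**h * (1 - p)**(n - h) for p in p_head]
evidence = sum(L * pr for L, pr in zip(likelihood, prior))
posterior = [L * pr / evidence for L, pr in zip(likelihood, prior)]
print([round(p, 3) for p in posterior])  # H3 (a fair coin) remains most probable
```

Note how the large prior probability of having drawn a fair coin dominates the moderately unusual data.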
Exercise 25.3. Figure 25.2 (page 10) summarizes the example in Section 25.1.5,
where a coin is flipped n = 10 times, producing h = 7 heads, and the resulting
Bernoulli likelihood is combined with a beta prior to produce a beta posterior
distribution. Figure 25.2 shows the posteriors for π, the probability of a head
on an individual flip, produced by the flat Beta(a = 1, b = 1) prior and the
informative Beta(a = 16, b = 16) prior.
(a) What happens to the posterior distribution of π for both of these priors
as the sample size grows, successively taking on the values n = 10, 100,
and 1000, while the observed proportion of heads remains at .7? For
each of the two priors, graph the resulting posterior distributions, show-
ing the posterior modes along with 95% central posterior intervals for
π. Comment on the results.
(b) Now let n = 10, as in the text, but vary the observed number of heads
h = 0, 1, 2, . . . , 10; use the informative Beta(a = 16, b = 16) prior. How
does the posterior mode of π change with the number of heads?
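For part (b), the beta-binomial conjugacy makes the computation immediate: a Beta($a$, $b$) prior combined with $h$ heads in $n$ flips yields a Beta($a + h$, $b + n - h$) posterior, whose mode is $(a + h - 1)/(a + b + n - 2)$. A minimal Python sketch using only the standard library:

```python
# Posterior mode of pi under a Beta(a, b) prior after h heads in n flips.
# Beta-binomial conjugacy gives a Beta(a + h, b + n - h) posterior.

def posterior_mode(a, b, n, h):
    return (a + h - 1) / (a + b + n - 2)

a = b = 16   # the informative prior from Figure 25.2
n = 10
modes = [posterior_mode(a, b, n, h) for h in range(n + 1)]
print(modes)  # rises linearly from 0.375 (h = 0) to 0.625 (h = 10)
```

With $a = b = 16$, the prior carries the weight of roughly 30 pseudo-flips, so 10 real flips move the mode only modestly: even $h = 10$ heads leaves the posterior mode at just 0.625.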
$$
p_\beta(\beta) = p_\alpha(\alpha) \left| \frac{d\alpha}{d\beta} \right|
$$

where $p_\beta(\beta)$ is the prior density for $\beta$ and $|d\alpha/d\beta|$ is the Jacobian of the transformation from $\alpha$ to $\beta$.$^{60}$ It turns out that the Fisher information,$^{61}$

$$
\mathcal{I}_\alpha(\alpha) = -E\left[ \frac{d^2 \log_e L(\alpha)}{d\alpha^2} \right]
$$
60 See on-line Appendix Section D.1.3 for an explanation of the Jacobian of a transformation of a random variable. The formula in the appendix is for a vector random variable but simplifies to the result given here when the random variable, α, is a scalar.
61 See on-line Appendix Section D.6.2.
62 The result in the text includes the normalizing constant 1/π.
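The change-of-variables rule for densities can be checked numerically. The toy illustration below (my own, not part of the text's derivation) takes $\alpha$ on $(0, 1)$ with the assumed density $p_\alpha(\alpha) = 2\alpha$, transforms to $\beta = \alpha^2$, and confirms that the rule returns the Uniform(0, 1) density for $\beta$.

```python
from math import sqrt

# If alpha has density p_alpha(a) = 2a on (0, 1) and beta = alpha**2, then
#   p_beta(b) = p_alpha(sqrt(b)) * |d alpha / d beta| = 2*sqrt(b) * 1/(2*sqrt(b)) = 1,
# i.e., beta is Uniform(0, 1).

def p_alpha(a):
    return 2.0 * a            # assumed density of alpha on (0, 1)

def p_beta(b):
    a = sqrt(b)               # inverse transformation alpha = sqrt(beta)
    jac = 1.0 / (2.0 * a)     # |d alpha / d beta| evaluated at beta = b
    return p_alpha(a) * jac

vals = [p_beta(b / 10) for b in range(1, 10)]
print(vals)  # each value is 1.0 (up to floating-point rounding)
```

The Jacobian exactly cancels the stretching of the density under the transformation, which is the content of the formula in the text.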
Because $p_\mu(\mu) = N(\mu_0, \sigma_0^2)$ is a conjugate prior, the posterior is also normal, and takes the form

$$
p_{\mu|\text{data}}(\mu|\text{data}) = \frac{1}{(2\pi\sigma_1^2)^{1/2}} \exp\left[ -\frac{(\mu - \mu_1)^2}{2\sigma_1^2} \right]
$$

Here, $\bar{y} = \frac{1}{n}\sum y_i$ is the familiar sample mean of $y$.$^{64}$
distribution.
64 Although setting up this result is straightforward (just multiply the formulas for $p_\mu(\mu)$ and $p_{\text{data}|\mu}(\text{data}|\mu)$, and note that multiplying two exponentials is equivalent to the exponential of the sum of their exponents), putting the result in the form given here requires nontrivial algebra. See, for example, Box and Tiao (1973, Appendix A1.1).
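A small numerical sketch may help fix the conjugate-normal result. The code below uses the standard precision-weighted formulas for $\mu_1$ and $\sigma_1^2$; these are the textbook-standard forms of the conjugate update (stated here on general knowledge, with the data variance assumed known), with notation chosen to match the text.

```python
# Normal-normal conjugacy with known data variance sigma2:
#   prior  mu ~ N(mu0, sigma0_2)
#   data   y_1, ..., y_n ~ N(mu, sigma2)
# The posterior is N(mu1, sigma1_2), where the posterior mean is a
# precision-weighted average of the prior mean and the sample mean.

def normal_posterior(mu0, sigma0_2, sigma2, ybar, n):
    prior_prec = 1.0 / sigma0_2
    data_prec = n / sigma2
    sigma1_2 = 1.0 / (prior_prec + data_prec)
    mu1 = sigma1_2 * (prior_prec * mu0 + data_prec * ybar)
    return mu1, sigma1_2

# Example: prior N(0, 1), four observations with mean 1, sigma2 = 1.
mu1, sigma1_2 = normal_posterior(mu0=0.0, sigma0_2=1.0, sigma2=1.0, ybar=1.0, n=4)
print(mu1, sigma1_2)  # 0.8 0.2
```

With only four observations, the posterior mean 0.8 sits between the prior mean 0 and the sample mean 1, pulled four-fifths of the way toward the data.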
(a) Draw a scatterplot matrix for the four measured variables, marking the
points by iris species. What do you conclude about the ability of the
four variables to distinguish among the species?
(b) Perform a binary logistic regression of species setosa versus the other
two species as the response on the four measured variables as predictors,
fitting the model by maximum likelihood. Are you able to fit the model?
Try again using Bayesian logistic regression, either via Firth’s bias-
reduced logistic regression or directly specifying a vague prior. What
do you conclude?
(c) Dichotomizing iris species is artificial, and we can instead perform a polytomous logistic regression (see Section 14.2.1) with the three species as
the response. Try to fit the polytomous logit model by maximum likeli-
hood. Are you successful? Try again using Bayesian polytomous logistic
regression with a vague prior. Alternatively, Kosmidis and Firth (2011)
extend Firth’s bias-reduced binary logistic regression to polytomous lo-
gistic regression; use their method to fit the polytomous logit model to
the iris data.
Exercise 25.7. Return to the mixed-effects hurdle model fit to Davis et al.’s
data on eating disorders and exercise, the results of which are summarized in
Table 25.6 (page 64). I encountered an obstacle in fitting this model that led
me to remove the random-effect intercepts from the model, thereby eliminating
the parameters ψZ0 and ρZ0 Z1 . As I indicated (see footnote 56 on page 63),
a conceptually problematic alternative to simplifying the random effects might
be to tighten the priors for some of the parameters. Experiment with this
approach. Are you able to obtain useful estimates? If so, do the estimates of
the fixed effects differ substantially from those reported in Table 25.6?
Summary
• Named after the Reverend Thomas Bayes, an 18th-century English math-
ematician, Bayes’s theorem, which follows from elementary probability
theory, states that for events A and B,
$$
\Pr(A|B) = \frac{\Pr(B|A)\Pr(A)}{\Pr(B)}
$$
• For data $D$ and $k$ disjoint hypotheses $H_1, \ldots, H_k$ whose prior probabilities sum to 1, Bayes's theorem takes the form
$$
\Pr(H_i|D) = \frac{\Pr(D|H_i)\Pr(H_i)}{\sum_{j=1}^{k} \Pr(D|H_j)\Pr(H_j)}
$$
• For a continuous parameter vector $\alpha$ with prior density $p(\alpha)$,
$$
p(\alpha|D) = \frac{p(D|\alpha)\,p(\alpha)}{\int_{\mathcal{A}} p(D|\alpha)\,p(\alpha)\,d\alpha}
$$
where $\mathcal{A}$ is the set of all values of the parameter vector $\alpha$ (i.e., the multidimensional parameter space).
• The generalized linear model provides a probability model for the data from which the likelihood can be calculated:
$$
\eta_i = \alpha + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}
$$
$$
\mu_i = g^{-1}(\eta_i)
$$
$$
Y_i \,|\, x_{i1}, \ldots, x_{ik} \sim p(\mu_i, \phi)
$$
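The three lines of this specification map directly onto code. A minimal Python sketch for the Bernoulli case (logit link, so $g^{-1}$ is the inverse logit; the coefficient values and data are made up for illustration):

```python
from math import exp, log

# GLM structure: linear predictor -> inverse link -> conditional distribution.
alpha, beta1 = -1.0, 2.0   # hypothetical coefficient values

def inv_logit(eta):
    return 1.0 / (1.0 + exp(-eta))

def log_likelihood(data):
    """Bernoulli log-likelihood for observations (x, y) with y in {0, 1}."""
    total = 0.0
    for x, y in data:
        eta = alpha + beta1 * x        # eta_i = alpha + beta_1 * x_i1
        mu = inv_logit(eta)            # mu_i = g^{-1}(eta_i)
        total += y * log(mu) + (1 - y) * log(1 - mu)  # Y_i ~ Bernoulli(mu_i)
    return total

data = [(0.0, 0), (0.5, 0), (1.0, 1), (1.5, 1)]
print(log_likelihood(data))
```

In a Bayesian analysis this likelihood would be combined with priors on $\alpha$ and $\beta_1$ to yield the posterior, exactly as described in the chapter.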
Recommended Reading
Books and articles on Bayesian statistics are typically written by committed
Bayesians, and so I suggest that you maintain an open mind with respect to
the case that Bayesians often press against classical approaches to statistical
inference. There is a very large literature on modern Bayesian methods; the few
sources given here are therefore highly selected.
• The treatise on Bayesian data analysis by Gelman et al. (2013) is a wide-ranging and thorough treatment of the subject by some of the leading
figures in contemporary Bayesian statistics. The book covers in greater
detail all of the topics included in this chapter (basics of Bayesian in-
ference, MCMC methods, linear models, generalized linear models, and
mixed-effects models) and more, and almost all of the book should be
accessible to readers of the starred sections of the current text.
• McElreath (2020) presents a much gentler introduction to Bayesian meth-
ods, but, like Gelman et al. (2013), takes up the topics in this chapter,
usually more thoroughly than I do, along with others, such as causal in-
ference. Depending on your taste, you may find the author’s style either
engaging or irritating, but his exposition is almost always very clear and
carefully thought-out. The computing in the text tends towards the id-
iosyncratic, with a compelling justification: Rather than using standard
black-box Bayesian software such as Stan (see below), McElreath guides
the reader through the nuts and bolts of Bayesian computations, rendering
the process transparent.
• Gelman and Hill (2007), and its partial successor Gelman et al. (2021),
are intended for a second course in statistics. Both books treat linear and
generalized linear models from a Bayesian perspective, while Gelman and
Hill (2007) (as the title of their book implies) additionally cover hierar-
chical (i.e., mixed-effects) regression models. A projected second updated
volume supplementing Gelman et al. (2021) will also focus on hierarchical
models. The older text by Gelman and Hill (2007) primarily uses R and
BUGS for computing, while Gelman et al. (2021) use R and Stan.
• The documentation for Stan (Carpenter et al., 2017), available at https:
//mc-stan.org/users/documentation/ and including a user’s guide, lan-
guage reference manual, and more, provides valuable information not only
on using the Stan software but also on current practices in Bayesian data
analysis more generally.
• Thomas and Tu (2021), which appeared after this chapter was written, is
a generally accessible introduction to Hamiltonian Monte Carlo. In addition
to explaining how the Metropolis-Hastings and HMC algorithms work,
the authors develop applications to linear regression, logistic regression,
and Poisson regression with random intercepts. The article is associated
with the hmclearn R package for exploring HMC. The implementation of
various regression models in this package, and in the article, is weakened
by a cavalier treatment of prior distributions, which specifies independent normal priors with a common prior standard deviation for the regression coefficients.
• For what Bayesian statistics was like in the era prior to the use of MCMC,
see Box and Tiao (1973), who present a lucid treatment of the basics of
Bayesian inference along with a range of applications, including to regres-
sion models. Although they are Bayesians, the authors don’t hesitate to
use the term “random effect models.”
References for Chapter 25
E. Anderson. The irises of the Gaspé Peninsula. Bulletin of the American Iris
Society, 59:2–5, 1935.
W. K. Hastings. Monte Carlo sampling methods using Markov chains and their
applications. Biometrika, 57:97–109, 1970.
I. Kosmidis and D. Firth. Multinomial logit bias reduction via the Poisson
log-linear model. Biometrika, 98:755–759, 2011.
R. M. Neal. Bayesian Learning for Neural Networks. Springer, New York, 1996.