7.4 - Bayesian Estimation - 2
Basic Theory
The General Method
Suppose again that we have an observable random variable X for an experiment, taking values in a set S. Suppose
also that the distribution of X depends on a parameter θ taking values in a parameter space T. Of course, our data variable
X is almost always vector-valued, so that typically S ⊆ R^n for some n ∈ N_+. Depending on the nature of the sample
space S, the distribution of X may be discrete or continuous. The parameter θ may also be vector-valued, so that
typically T ⊆ R^k for some k ∈ N_+.
In Bayesian analysis, named for the famous Thomas Bayes, we model the deterministic, but unknown parameter θ with a
random variable Θ that has a specified distribution on the parameter space T . Depending on the nature of the
parameter space, this distribution may also be either discrete or continuous. It is called the prior distribution of Θ and is
intended to reflect our knowledge of the parameter θ , before we gather data. After observing X = x ∈ S , we then use
Bayes' theorem to compute the conditional distribution of Θ given X = x. This distribution is called the posterior
distribution of Θ, and is an updated distribution, given the information in the data. Here is the mathematical
description, stated in terms of probability density functions.
Suppose that the prior distribution of Θ on T has probability density function h, and that given Θ = θ ∈ T, the
conditional probability density function of X on S is f(⋅ ∣ θ). Then the probability density function of the posterior
distribution of Θ given X = x ∈ S is
h(θ ∣ x) = h(θ) f(x ∣ θ) / f(x),  θ ∈ T  (7.4.1)
where the function in the denominator is defined as follows, in the discrete and continuous cases, respectively:
f(x) = ∑_{θ∈T} h(θ) f(x ∣ θ),  x ∈ S
f(x) = ∫_T h(θ) f(x ∣ θ) dθ,  x ∈ S
Proof
For x ∈ S , note that f (x) is simply the normalizing constant for the function θ ↦ h(θ)f (x ∣ θ) . It may not be necessary
to explicitly compute f (x), if one can recognize the functional form of θ ↦ h(θ)f (x ∣ θ) as that of a known
distribution. This will indeed be the case in several of the examples explored below.
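As a concrete illustration of (7.4.1), here is a minimal Python sketch (NumPy assumed; the grid, prior, and data are arbitrary illustrative choices) that applies Bayes' theorem on a discrete grid of parameter values, where the normalizing constant f(x) is just a sum over the grid.

```python
import numpy as np

# Minimal sketch: posterior of a Bernoulli success probability on a discrete grid.
# The grid plays the role of the parameter space T; `prior` is the pdf h on the grid.
theta = np.linspace(0.01, 0.99, 99)        # grid of candidate values of theta
prior = np.ones_like(theta) / theta.size   # uniform (non-informative) prior h(theta)

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])     # observed Bernoulli data X = x

# Likelihood f(x | theta) at each grid point: product of Bernoulli densities.
likelihood = theta ** x.sum() * (1 - theta) ** (x.size - x.sum())

# Bayes' theorem: posterior is prior times likelihood, divided by the
# normalizing constant f(x) = sum over the grid of h(theta) f(x | theta).
unnormalized = prior * likelihood
posterior = unnormalized / unnormalized.sum()

print("posterior mean:", np.sum(theta * posterior))
```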
If the parameter space T has finite measure c (counting measure in the discrete case or Lebesgue measure in the
continuous case), then one possible prior distribution is the uniform distribution on T , with probability density
function h(θ) = 1/c for θ ∈ T . This distribution reflects no prior knowledge about the parameter, and so is called the
non-informative prior distribution.
Random Samples
Of course, an important and essential special case occurs when X = (X_1, X_2, …, X_n) is a random sample of size n
from the distribution of a basic variable X. Specifically, suppose that X takes values in a set R and has probability
density function g(⋅ ∣ θ) for a given θ ∈ T. In this case, S = R^n and the probability density function f(⋅ ∣ θ) of X given
θ is
f(x ∣ θ) = g(x_1 ∣ θ) g(x_2 ∣ θ) ⋯ g(x_n ∣ θ),  x = (x_1, x_2, …, x_n) ∈ S
Real Parameters
Suppose that θ is a real-valued parameter, so that T ⊆ R. Here is our main definition.
The Bayesian estimator of θ based on X is U = E(Θ ∣ X).
1. If Θ has a discrete distribution on T then E(Θ ∣ X) = ∑_{θ∈T} θ h(θ ∣ X)
2. If Θ has a continuous distribution on T then E(Θ ∣ X) = ∫_T θ h(θ ∣ X) dθ
Recall that E(Θ ∣ X) is a function of X and, among all functions of X , is closest to Θ in the mean square sense. Of
course, once we collect the data and observe X = x, the Bayesian estimate of θ is E(Θ ∣ X = x). As always, the term
estimator refers to a random variable, before the data are collected, and the term estimate refers to an observed value of
the random variable after the data are collected. The definitions of bias and mean square error are as before, but now
conditioned on Θ = θ ∈ T .
As before, bias(U ∣ θ) = E(U ∣ θ) − θ and mse(U ∣ θ) = var(U ∣ θ) + bias²(U ∣ θ). Suppose now that we observe the
data sequentially, and let U_n denote the Bayesian estimator of θ based on the sample X_n = (X_1, X_2, …, X_n) for each
n ∈ N_+. Again, the most common case is when we are sampling from a distribution, so that
the sequence is independent and identically distributed (given θ ). We have the natural asymptotic properties that we
have seen before.
Often we cannot construct unbiased Bayesian estimators, but we do hope that our estimators are at least asymptotically
unbiased and consistent. It turns out that the sequence of Bayesian estimators (U_n : n ∈ N_+) is a martingale. The theory of
martingales provides some powerful tools for studying these estimators.
From the Bayesian perspective, the posterior distribution of Θ given the data X = x is of primary importance. Point
estimates of θ derived from this distribution are of secondary importance. In particular, the mean square error function
u ↦ E[(Θ − u)² ∣ X = x], minimized as we have noted at E(Θ ∣ X = x), is not the only loss function that can be used.
(Although it's the only one that we consider.) Another possible loss function, among many, is the mean absolute error
function u ↦ E(|Θ − u| ∣ X = x), which we know is minimized at the median(s) of the posterior distribution.
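To see the two loss functions side by side, here is a small Python sketch (illustrative only; the discrete posterior used here is an arbitrary choice) that checks numerically that the expected squared error is minimized near the posterior mean, while the expected absolute error is minimized near a posterior median.

```python
import numpy as np

# Minimal sketch: compare squared-error and absolute-error loss on a discrete posterior.
theta = np.linspace(0.01, 0.99, 99)
posterior = theta ** 6 * (1 - theta) ** 2          # unnormalized, beta(7, 3)-shaped posterior
posterior /= posterior.sum()

candidates = np.linspace(0.01, 0.99, 99)           # candidate point estimates u
sq_loss = [(posterior * (theta - u) ** 2).sum() for u in candidates]
abs_loss = [(posterior * np.abs(theta - u)).sum() for u in candidates]

post_mean = (theta * posterior).sum()
post_median = theta[np.searchsorted(posterior.cumsum(), 0.5)]

print("argmin squared loss:", candidates[np.argmin(sq_loss)], "posterior mean:", post_mean)
print("argmin absolute loss:", candidates[np.argmin(abs_loss)], "posterior median:", post_median)
```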
Conjugate Families
Often, the prior distribution of Θ is itself a member of a parametric family, with the parameters specified to reflect our
prior knowledge of θ . In many important special cases, the parametric family can be chosen so that the posterior
distribution of Θ given X = x belongs to the same family for each x ∈ S . In such a case, the family of distributions of
Θ is said to be conjugate to the family of distributions of X . Conjugate families are nice from a computational point of
view, since we can often compute the posterior distribution through a simple formula involving the parameters of the
family, without having to use Bayes' theorem directly. Similarly, in the case that the parameter is real valued, we can
often compute the Bayesian estimator through a simple formula involving the parameters of the conjugate family.
Special Distributions
The Bernoulli Distribution
Suppose that X = (X_1, X_2, …) is a sequence of independent variables, each having the Bernoulli distribution with
unknown success parameter p ∈ (0, 1). In short, X is a sequence of Bernoulli trials, given p. In the usual language of
reliability, X_i = 1 means success on trial i and X_i = 0 means failure on trial i. Recall that given p, the Bernoulli
distribution has probability density function g(x ∣ p) = p^x (1 − p)^{1−x} for x ∈ {0, 1}.
Note that the number of successes in the first n trials is Y_n = ∑_{i=1}^n X_i. Given p, the random variable Y_n has the binomial
distribution with parameters n and p.
Suppose now that we model p with a random variable P having a prior beta distribution with left parameter a ∈ (0, ∞)
and right parameter b ∈ (0, ∞), where a and b are chosen to reflect our prior knowledge of p. Thus the prior distribution
of P has probability density function
h(p) = [1 / B(a, b)] p^{a−1} (1 − p)^{b−1},  p ∈ (0, 1)  (7.4.7)
and has mean a/(a + b). For example, if we know nothing about p, we might let a = b = 1, so that the prior
distribution is uniform on the parameter space (0, 1) (the non-informative prior). On the other hand, if we believe that
p is about 2/3, we might let a = 4 and b = 2, so that the prior distribution is unimodal, with mean 2/3. As a random
process, the sequence X with p randomized by P , is known as the beta-Bernoulli process, and is very interesting on its
own, outside of the context of Bayesian estimation.
The posterior distribution of P given X_n = (X_1, X_2, …, X_n) is beta with left parameter a + Y_n and right parameter b + (n − Y_n).
Proof
Thus, the beta distribution is conjugate to the Bernoulli distribution. Note also that the posterior distribution depends
on the data vector X_n only through the number of successes Y_n. This is true because Y_n is a sufficient statistic for p. In
particular, note that the left beta parameter is increased by the number of successes Y_n and the right beta parameter is
increased by the number of failures n − Y_n. The Bayesian estimator of p based on X_n is
U_n = (a + Y_n) / (a + b + n)  (7.4.10)
Proof
In the beta coin experiment, set n = 20 and p = 0.3, and set a = 4 and b = 2. Run the simulation 100 times and note
the estimate of p and the shape and location of the posterior probability density function of p on each run.
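The beta coin experiment is an interactive app, but the conjugate update itself is easy to reproduce. The following Python sketch (NumPy assumed; the random seed and the use of a single simulated run are illustrative choices) uses the settings above and computes the posterior parameters and the estimate U_n = (a + Y_n)/(a + b + n).

```python
import numpy as np

# Minimal sketch of one run of the beta coin experiment: n = 20, p = 0.3, a = 4, b = 2.
rng = np.random.default_rng(0)
n, p, a, b = 20, 0.3, 4.0, 2.0

x = rng.binomial(1, p, size=n)     # the Bernoulli trials
y = x.sum()                        # number of successes Y_n

# Conjugate update: posterior of P is beta(a + Y_n, b + (n - Y_n)).
post_a, post_b = a + y, b + (n - y)
u_n = post_a / (post_a + post_b)   # Bayesian estimate U_n = (a + Y_n)/(a + b + n)

print("successes:", y, "posterior: beta(", post_a, ",", post_b, ")", "estimate U_n:", u_n)
```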
For n ∈ N_+,
bias(U_n ∣ p) = [a(1 − p) − b p] / (a + b + n),  p ∈ (0, 1)  (7.4.11)
Note also that we cannot choose a and b to make Un unbiased, since such a choice would involve the true value of p,
which we do not know.
In the beta coin experiment, vary the parameters and note the change in the bias. Now set n = 20 and p = 0.8, and
set a = 2 and b = 6. Run the simulation 1000 times. Note the estimate of p and the shape and location of the
posterior probability density function of p on each update. Compare the empirical bias to the true bias.
For n ∈ N_+,
mse(U_n ∣ p) = {p[n − 2a(a + b)] + p²[(a + b)² − n] + a²} / (a + b + n)²,  p ∈ (0, 1)  (7.4.13)
The sequence (U_n : n ∈ N_+) is mean-square consistent.
Proof
In the beta coin experiment, vary the parameters and note the change in the mean square error. Now set n = 10
and p = 0.7, and set a = b = 1 . Run the simulation 1000 times. Note the estimate of p and the shape and location of
the posterior probability density function of p on each update. Compare the empirical mean square error to the true
mean square error.
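The comparison asked for in this exercise can also be made in code. The sketch below (NumPy assumed; the number of runs and the seed are arbitrary illustrative choices) estimates the bias and mean square error of U_n by simulation with the same settings and compares them with the closed-form expressions (7.4.11) and (7.4.13).

```python
import numpy as np

# Minimal sketch: Monte Carlo check of the bias and mean square error formulas.
rng = np.random.default_rng(1)
n, p, a, b = 10, 0.7, 1.0, 1.0
runs = 100_000

y = rng.binomial(n, p, size=runs)          # Y_n for each simulated run
u = (a + y) / (a + b + n)                  # Bayesian estimates U_n

emp_bias, emp_mse = (u - p).mean(), ((u - p) ** 2).mean()
true_bias = (a * (1 - p) - b * p) / (a + b + n)
true_mse = (p * (n - 2 * a * (a + b)) + p ** 2 * ((a + b) ** 2 - n) + a ** 2) / (a + b + n) ** 2

print("bias:", emp_bias, "vs", true_bias)
print("mse: ", emp_mse, "vs", true_mse)
```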
Interestingly, we can choose a and b so that U_n has mean square error that is independent of the unknown parameter p:
if a = b = √n/2, then
mse(U_n ∣ p) = n / [4(n + √n)²],  p ∈ (0, 1)  (7.4.16)
In the beta coin experiment, set n = 36 and a = b = 3 . Vary p and note that the mean square error does not change.
Now set p = 0.8 and run the simulation 1000 times. Note the estimate of p and the shape and location of the
posterior probability density function on each update. Compare the empirical bias and mean square error to the
true values.
Recall that the method of moments estimator and the maximum likelihood estimator of p (on the interval (0, 1)) is the
sample mean (the proportion of successes):
M_n = Y_n / n = (1/n) ∑_{i=1}^n X_i  (7.4.17)
This estimator is unbiased and has mean square error p(1 − p)/n. To see the connection between the estimators, note
from (6) that
U_n = [(a + b)/(a + b + n)] · [a/(a + b)] + [n/(a + b + n)] · M_n  (7.4.18)
So U_n is a weighted average of a/(a + b) (the mean of the prior distribution) and M_n (the maximum likelihood
estimator).
Suppose now that the parameter space is {1/2, 1}. This setup corresponds to the tossing of a coin that is either fair or two-headed, but we
don't know which. We model p with a random variable P that has the prior probability density function h given by
h(1) = a, h(1/2) = 1 − a, where a ∈ (0, 1) is chosen to reflect our prior knowledge of the probability that the coin is
two-headed. If we are completely ignorant, we might let a = 1/2 (the non-informative prior). If we think the coin is
more likely to be fair, we would choose a closer to 0.
The posterior probability density function of P given X_n = (X_1, X_2, …, X_n) is given by
1. h(1 ∣ X_n) = 2^n a / [2^n a + (1 − a)] if Y_n = n, and h(1 ∣ X_n) = 0 if Y_n < n
2. h(1/2 ∣ X_n) = (1 − a) / [2^n a + (1 − a)] if Y_n = n, and h(1/2 ∣ X_n) = 1 if Y_n < n
Proof
Now let
p_n = [2^{n+1} a + (1 − a)] / [2^{n+1} a + 2(1 − a)]  (7.4.23)
The Bayesian estimator U_n of p is given by
1. U_n = p_n if Y_n = n
2. U_n = 1/2 if Y_n < n
Proof
If we observe Y_n < n then U_n gives the correct answer 1/2. This certainly makes sense since we know that we do not
have the two-headed coin. On the other hand, if we observe Y_n = n then we are not certain which coin we have, and
the Bayesian estimate p_n is not even in the parameter space! But note that p_n → 1 as n → ∞ exponentially fast. Next
let's compute the bias and mean-square error for a given p ∈ {1/2, 1}.
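Here is a small Python sketch (NumPy assumed; the number of tosses, the prior weight a, and the seed are arbitrary illustrative choices) of the fair-versus-two-headed coin computation, using the posterior density and the estimator described above.

```python
import numpy as np

# Minimal sketch: posterior and Bayesian estimate for the fair vs. two-headed coin.
rng = np.random.default_rng(2)
n, a = 8, 0.5                       # n tosses; prior probability a that the coin is two-headed
true_p = 0.5                        # the coin actually used (1/2 = fair, 1 = two-headed)

y = rng.binomial(n, true_p)         # number of heads Y_n
if y < n:
    post_two_headed = 0.0           # a single tail rules out the two-headed coin
    estimate = 0.5
else:
    post_two_headed = 2 ** n * a / (2 ** n * a + (1 - a))
    estimate = (2 ** (n + 1) * a + (1 - a)) / (2 ** (n + 1) * a + 2 * (1 - a))   # p_n

print("heads:", y, "P(p = 1 | data):", post_two_headed, "estimate U_n:", estimate)
```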
For n ∈ N_+,
1. bias(U_n ∣ 1) = p_n − 1
2. bias(U_n ∣ 1/2) = (1/2)^n (p_n − 1/2)
Thus U_n is negatively biased given p = 1 and positively biased given p = 1/2; in both cases the bias depends on a
through p_n and converges to 0 as n → ∞.
For n ∈ N_+,
1. mse(U_n ∣ 1) = (p_n − 1)²
2. mse(U_n ∣ 1/2) = (1/2)^n (p_n − 1/2)²
The Geometric Distribution
Suppose that X = (X_1, X_2, …) is a sequence of independent random variables, each having the geometric distribution
on N_+ with unknown success parameter p ∈ (0, 1). Recall that these variables can be interpreted as the number of trials
between successive successes in a sequence of Bernoulli trials. Given p, the geometric distribution has probability
density function
g(x ∣ p) = p(1 − p)^{x−1},  x ∈ N_+  (7.4.29)
Once again for n ∈ N_+, let Y_n = ∑_{i=1}^n X_i. In this setting, Y_n is the trial number of the nth success, and given p, has the
negative binomial distribution with parameters n and p. As before, suppose that we model p with a random variable P
having a prior beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞).
The posterior distribution of P given X_n = (X_1, X_2, …, X_n) is beta with left parameter a + n and right parameter
b + (Y_n − n).
Proof
Thus, the beta distribution is conjugate to the geometric distribution. Moreover, note that in the posterior beta
distribution, the left parameter is increased by the number of successes n while the right parameter is increased by the
number of failures Y_n − n, just as in the Bernoulli model. In particular, the posterior left parameter is deterministic and
depends on the data only through the sample size n.
The Bayesian estimator of p based on X_n is
V_n = (a + n) / (a + b + Y_n)  (7.4.32)
Proof
Recall that the method of moments estimator of p, and the maximum likelihood estimator of p on the interval (0, 1), are
both W_n = 1/M_n = n/Y_n. To see the connection between the estimators, note from (19) that
1/V_n = [a/(a + n)] · [(a + b)/a] + [n/(a + n)] · (1/W_n)  (7.4.33)
So 1/V_n (the reciprocal of the Bayesian estimator) is a weighted average of (a + b)/a (the reciprocal of the mean of the
prior distribution) and 1/W_n (the reciprocal of the maximum likelihood estimator).
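As with the Bernoulli model, the conjugate update is simple to carry out in code. The Python sketch below (NumPy assumed; the parameter values and seed are arbitrary illustrative choices) simulates geometric data and computes the posterior parameters, the Bayesian estimator V_n, and the maximum likelihood estimator W_n.

```python
import numpy as np

# Minimal sketch: beta prior with geometric data.
rng = np.random.default_rng(3)
p, n, a, b = 0.3, 50, 2.0, 2.0

x = rng.geometric(p, size=n)            # geometric samples on {1, 2, ...}
y = x.sum()                             # Y_n, the trial number of the nth success

post_a, post_b = a + n, b + (y - n)     # posterior is beta(a + n, b + (Y_n - n))
v_n = (a + n) / (a + b + y)             # Bayesian estimator V_n = (a + n)/(a + b + Y_n)
w_n = n / y                             # maximum likelihood estimator W_n = n / Y_n

print("posterior: beta(", post_a, ",", post_b, ")", "V_n:", v_n, "W_n:", w_n)
```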
The Poisson Distribution
Suppose that X = (X_1, X_2, …) is a sequence of independent random variables, each having the Poisson distribution with
unknown parameter λ ∈ (0, ∞). Recall that the Poisson distribution is often used to model the number of “random points” in a
region of time or space and is studied in more detail in the chapter on the Poisson Process. The distribution is named
for the inimitable Simeon Poisson and given λ, has probability density function
g(x ∣ λ) = e^{−λ} λ^x / x!,  x ∈ N  (7.4.34)
Once again, for n ∈ N_+, let Y_n = ∑_{i=1}^n X_i. Given λ, the random variable Y_n also has a Poisson distribution, but with
parameter nλ.
Suppose now that we model λ with a random variable Λ having a prior gamma distribution with shape parameter
k ∈ (0, ∞) and rate parameter r ∈ (0, ∞). As usual, k and r are chosen to reflect our prior knowledge of λ. Thus the
prior probability density function of Λ is h(λ) = [r^k / Γ(k)] λ^{k−1} e^{−rλ} for λ ∈ (0, ∞), and the mean is k/r. The scale
parameter of the gamma distribution is b = 1/r, but the formulas will work out nicer if we use the rate parameter.
The posterior distribution of Λ given X_n = (X_1, X_2, …, X_n) is gamma with shape parameter k + Y_n and rate
parameter r + n.
Proof
It follows that the gamma distribution is conjugate to the Poisson distribution. Note that the posterior rate parameter is
deterministic and depends on the data only through the sample size n.
The Bayesian estimator of λ based on X_n is V_n = (k + Y_n) / (r + n).
Proof
Since V_n is a linear function of Y_n, and we know the distribution of Y_n given λ ∈ (0, ∞), we can compute the bias and
mean square error of V_n directly.
For n ∈ N_+,
bias(V_n ∣ λ) = (k − rλ) / (r + n),  λ ∈ (0, ∞)  (7.4.40)
Note that, as before, we cannot choose k and r to make V_n unbiased, without knowledge of λ.
For n ∈ N_+,
mse(V_n ∣ λ) = [nλ + (k − rλ)²] / (r + n)²,  λ ∈ (0, ∞)  (7.4.42)
Recall that the method of moments estimator of λ and the maximum likelihood estimator of λ on the interval (0, ∞)
are both M_n = Y_n/n, the sample mean. This estimator is unbiased and has mean square error λ/n. To see the
connection between the estimators, note that
V_n = [r/(r + n)] · (k/r) + [n/(r + n)] · M_n
So V_n is a weighted average of k/r (the mean of the prior distribution) and M_n (the maximum likelihood estimator).
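The gamma-Poisson update and the weighted-average identity can be checked directly; here is a minimal Python sketch (NumPy assumed; the parameter values and seed are arbitrary illustrative choices).

```python
import numpy as np

# Minimal sketch: gamma prior with Poisson data, and the weighted-average identity.
rng = np.random.default_rng(4)
lam, n, k, r = 2.5, 40, 3.0, 1.0

x = rng.poisson(lam, size=n)
y = x.sum()                                       # Y_n

post_shape, post_rate = k + y, r + n              # posterior is gamma(k + Y_n, r + n)
v_n = post_shape / post_rate                      # Bayesian estimator V_n
m_n = y / n                                       # sample mean (the MLE)

# V_n is a weighted average of the prior mean k/r and the MLE M_n.
check = (r / (r + n)) * (k / r) + (n / (r + n)) * m_n
print("V_n:", v_n, "weighted average:", check)
```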
The Normal Distribution
Suppose that X = (X_1, X_2, …) is a sequence of independent random variables, each having the normal distribution
with unknown mean μ ∈ R but known variance σ² ∈ (0, ∞). Of course, the normal distribution plays an especially
important role in statistics, in part because of the central limit theorem. The normal distribution is widely used to
model physical quantities subject to numerous small, random errors. In many statistical applications, the variance of
the normal distribution is more stable than the mean, so the assumption that the variance is known is not entirely
artificial. Recall that the normal probability density function (given μ) is
g(x ∣ μ) = [1 / (σ √(2π))] exp[−(1/2) ((x − μ)/σ)²],  x ∈ R  (7.4.45)
Again, for n ∈ N_+ let Y_n = ∑_{i=1}^n X_i. Recall that Y_n also has a normal distribution (given μ) but with mean nμ and
variance nσ².
Suppose now that μ is modeled by a random variable Ψ that has a prior normal distribution with mean a ∈ R and
variance b² ∈ (0, ∞). As usual, a and b are chosen to reflect our prior knowledge of μ. An interesting special case is
when we take b = σ, so that the variance of the prior distribution of Ψ is the same as the variance of the underlying
sampling distribution.
The posterior distribution of Ψ given X_n = (X_1, X_2, …, X_n) is normal with mean (Y_n b² + a σ²) / (n b² + σ²) and
variance σ² b² / (n b² + σ²).
Proof
Therefore, the normal distribution is conjugate to the normal distribution with unknown mean and known variance.
Note that the posterior variance is deterministic, and depends on the data only through the sample size n. In the special
case that b = σ, the posterior distribution of Ψ given X_n is normal with mean (Y_n + a)/(n + 1) and variance
σ²/(n + 1).
The Bayesian estimator of μ based on X_n is U_n = (Y_n b² + a σ²) / (n b² + σ²).
Proof
Note that U_n = (Y_n + a)/(n + 1) in the special case that b = σ.
For n ∈ N_+,
bias(U_n ∣ μ) = σ²(a − μ) / (σ² + n b²),  μ ∈ R  (7.4.56)
When b = σ, bias(U_n ∣ μ) = (a − μ)/(n + 1).
For n ∈ N_+,
mse(U_n ∣ μ) = [n σ² b⁴ + σ⁴ (a − μ)²] / (σ² + n b²)²,  μ ∈ R  (7.4.58)
When b = σ, mse(U_n ∣ μ) = [nσ² + (a − μ)²]/(n + 1)². Recall that the method of moments estimator of μ and the
maximum likelihood estimator of μ on R are both M_n = Y_n/n, the sample mean. This estimator is unbiased and has
mean square error var(M_n) = σ²/n. To see the connection between the estimators, note from (25) that
U_n = [σ²/(n b² + σ²)] · a + [n b²/(n b² + σ²)] · M_n  (7.4.60)
So U_n is a weighted average of a (the mean of the prior distribution) and M_n (the maximum likelihood estimator).
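Here is a minimal Python sketch of the normal-normal update (NumPy assumed; the parameter values and seed are arbitrary illustrative choices), which also verifies the weighted-average form (7.4.60) of U_n.

```python
import numpy as np

# Minimal sketch: normal prior for the mean, with known variance sigma^2.
rng = np.random.default_rng(5)
mu, sigma, n = 10.0, 2.0, 25          # true mean, known std deviation, sample size
a, b = 8.0, 3.0                       # prior mean and prior std deviation of Psi

x = rng.normal(mu, sigma, size=n)
y = x.sum()                           # Y_n
m_n = y / n                           # sample mean (the MLE)

# Posterior of Psi is normal with the mean and variance given above.
post_var = sigma ** 2 * b ** 2 / (n * b ** 2 + sigma ** 2)
u_n = (y * b ** 2 + a * sigma ** 2) / (n * b ** 2 + sigma ** 2)   # posterior mean = U_n

# U_n as a weighted average of the prior mean a and the sample mean M_n, as in (7.4.60).
check = (sigma ** 2 / (n * b ** 2 + sigma ** 2)) * a + (n * b ** 2 / (n * b ** 2 + sigma ** 2)) * m_n
print("U_n:", u_n, "weighted average:", check, "posterior sd:", post_var ** 0.5)
```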
The Beta Distribution
Suppose that X = (X_1, X_2, …) is a sequence of independent random variables, each having the beta distribution with
unknown left shape parameter a ∈ (0, ∞) and right shape parameter b = 1. The beta distribution is widely used to
model random proportions and probabilities and other variables that take values in bounded intervals (scaled to take
values in (0, 1)). Recall that the probability density function (given a) is
g(x ∣ a) = a x^{a−1},  x ∈ (0, 1)  (7.4.61)
Suppose now that a is modeled by a random variable A that has a prior gamma distribution with shape parameter
k ∈ (0, ∞) and rate parameter r ∈ (0, ∞). As usual, k and r are chosen to reflect our prior knowledge of a. Thus the
mean of the prior distribution of A is k/r.
The posterior distribution of A given X_n = (X_1, X_2, …, X_n) is gamma, with shape parameter k + n and rate
parameter r − ln(X_1 X_2 ⋯ X_n).
Proof
Thus, the gamma distribution is conjugate to the beta distribution with unknown left parameter and right parameter 1.
Note that the posterior shape parameter is deterministic and depends on the data only through the sample size n.
The Bayesian estimator of a based on X_n is
U_n = (k + n) / [r − ln(X_1 X_2 ⋯ X_n)]  (7.4.65)
Proof
Given the complicated structure, the bias and mean square error of U_n given a ∈ (0, ∞) would be difficult to compute
explicitly. Recall that the maximum likelihood estimator of a is W_n = −n / ln(X_1 X_2 ⋯ X_n). To see the connection
between the estimators, note that
1/U_n = [k/(k + n)] · (r/k) + [n/(k + n)] · (1/W_n)
So 1/U_n (the reciprocal of the Bayesian estimator) is a weighted average of r/k (the reciprocal of the mean of the prior
distribution) and 1/W_n (the reciprocal of the maximum likelihood estimator).
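The update for the beta model with right parameter 1 involves only the sample size and the log of the product of the data. The Python sketch below (NumPy assumed; the parameter values and seed are arbitrary illustrative choices) computes U_n and the maximum likelihood estimator W_n from simulated data.

```python
import numpy as np

# Minimal sketch: gamma prior for the left shape parameter of a beta(a, 1) distribution.
rng = np.random.default_rng(6)
a_true, n, k, r = 3.0, 60, 2.0, 1.0

x = rng.beta(a_true, 1.0, size=n)            # beta(a, 1) samples in (0, 1)
log_prod = np.log(x).sum()                   # ln(X_1 X_2 ... X_n), which is negative

post_shape, post_rate = k + n, r - log_prod  # posterior is gamma(k + n, r - ln(X_1 ... X_n))
u_n = post_shape / post_rate                 # Bayesian estimator U_n
w_n = -n / log_prod                          # maximum likelihood estimator W_n

print("U_n:", u_n, "W_n:", w_n)
```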
The Pareto Distribution
Suppose that X = (X_1, X_2, …) is a sequence of independent random variables, each having the Pareto distribution
with unknown shape parameter a ∈ (0, ∞) and scale parameter b = 1. The Pareto distribution is used to model certain
financial variables and other variables with heavy-tailed distributions, and is named for Vilfredo Pareto. Recall that the
probability density function (given a) is
g(x ∣ a) = a / x^{a+1},  x ∈ [1, ∞)  (7.4.67)
Suppose now that a is modeled by a random variable A that has a prior gamma distribution with shape parameter
k ∈ (0, ∞) and rate parameter r ∈ (0, ∞). As usual, k and r are chosen to reflect our prior knowledge of a. Thus the
mean of the prior distribution of A is k/r.
The posterior distribution of A given X_n = (X_1, X_2, …, X_n) is gamma with shape parameter k + n and rate
parameter r + ln(X_1 X_2 ⋯ X_n).
Proof
Thus, the gamma distribution is conjugate to the Pareto distribution with unknown shape parameter. Note that the
posterior shape parameter is deterministic and depends on the data only through the sample size n. The Bayesian
estimator of a based on X_n is
U_n = (k + n) / [r + ln(X_1 X_2 ⋯ X_n)]  (7.4.71)
Proof
Given the complicated structure, the bias and mean square error of U_n given a ∈ (0, ∞) would be difficult to compute
explicitly. Recall that the maximum likelihood estimator of a is W_n = n / ln(X_1 X_2 ⋯ X_n). To see the connection
between the estimators, note that
1/U_n = [k/(k + n)] · (r/k) + [n/(k + n)] · (1/W_n)
So 1/U_n (the reciprocal of the Bayesian estimator) is a weighted average of r/k (the reciprocal of the mean of the prior
distribution) and 1/W_n (the reciprocal of the maximum likelihood estimator).
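The Pareto case works the same way. The Python sketch below (NumPy assumed; note that NumPy's pareto sampler returns values on [0, ∞), so 1 is added to obtain support [1, ∞); the parameter values and seed are arbitrary illustrative choices) computes U_n and checks the weighted-average identity for 1/U_n.

```python
import numpy as np

# Minimal sketch: gamma prior for the shape parameter of a Pareto distribution with scale 1.
rng = np.random.default_rng(7)
a_true, n, k, r = 2.0, 80, 2.0, 1.0

x = 1.0 + rng.pareto(a_true, size=n)        # shift so that the support is [1, inf)
log_prod = np.log(x).sum()                  # ln(X_1 X_2 ... X_n), positive here since X_i >= 1

u_n = (k + n) / (r + log_prod)              # Bayesian estimator U_n
w_n = n / log_prod                          # maximum likelihood estimator W_n

# The reciprocal of U_n is a weighted average of r/k and 1/W_n.
check = (k / (k + n)) * (r / k) + (n / (k + n)) * (1.0 / w_n)
print("1/U_n:", 1.0 / u_n, "weighted average:", check)
```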
This page titled 7.4: Bayesian Estimation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform.