

7.4: Bayesian Estimation

Basic Theory
The General Method
Suppose again that we have an observable random variable X for an experiment, that takes values in a set S. Suppose also that the distribution of X depends on a parameter θ taking values in a parameter space T. Of course, our data variable X is almost always vector-valued, so that typically S ⊆ R^n for some n ∈ N_+. Depending on the nature of the sample space S, the distribution of X may be discrete or continuous. The parameter θ may also be vector-valued, so that typically T ⊆ R^k for some k ∈ N_+.

In Bayesian analysis, named for the famous Thomas Bayes, we model the deterministic, but unknown parameter θ with a
random variable Θ that has a specified distribution on the parameter space T . Depending on the nature of the
parameter space, this distribution may also be either discrete or continuous. It is called the prior distribution of Θ and is
intended to reflect our knowledge of the parameter θ , before we gather data. After observing X = x ∈ S , we then use
Bayes' theorem to compute the conditional distribution of Θ given X = x. This distribution is called the posterior
distribution of Θ, and is an updated distribution, given the information in the data. Here is the mathematical
description, stated in terms of probability density functions.

Suppose that the prior distribution of Θ on T has probability density function h, and that given Θ = θ ∈ T , the
conditional probability density function of X on S is f (⋅ ∣ θ). Then the probability density function of the posterior
distribution of Θ given X = x ∈ S is

h(θ ∣ x) = h(θ) f(x ∣ θ) / f(x),   θ ∈ T   (7.4.1)

where the function in the denominator is defined as follows, in the discrete and continuous cases, respectively:

f(x) = ∑_{θ ∈ T} h(θ) f(x ∣ θ),   x ∈ S

f(x) = ∫_T h(θ) f(x ∣ θ) dθ,   x ∈ S

Proof

For x ∈ S , note that f (x) is simply the normalizing constant for the function θ ↦ h(θ)f (x ∣ θ) . It may not be necessary
to explicitly compute f (x), if one can recognize the functional form of θ ↦ h(θ)f (x ∣ θ) as that of a known
distribution. This will indeed be the case in several of the examples explored below.
If the parameter space T has finite measure c (counting measure in the discrete case or Lebesgue measure in the
continuous case), then one possible prior distribution is the uniform distribution on T , with probability density
function h(θ) = 1/c for θ ∈ T . This distribution reflects no prior knowledge about the parameter, and so is called the
non-informative prior distribution.
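To make the general method concrete, here is a minimal Python sketch of the posterior computation when the parameter space T is discrete: the candidate parameter values, the uniform prior, and the single Bernoulli observation are all hypothetical, chosen only for illustration.

```python
# Posterior update on a discrete parameter space via Bayes' theorem.
# All numerical values are hypothetical.
import numpy as np

theta = np.array([0.2, 0.5, 0.8])       # candidate parameter values (the set T)
prior = np.array([1/3, 1/3, 1/3])       # h(theta): non-informative prior on T

def likelihood(x, theta):
    # Bernoulli density f(x | theta) for x in {0, 1}
    return theta**x * (1 - theta)**(1 - x)

x = 1                                    # a single observed data point
unnormalized = prior * likelihood(x, theta)   # h(theta) f(x | theta)
f_x = unnormalized.sum()                 # normalizing constant f(x)
posterior = unnormalized / f_x           # h(theta | x)

print(posterior)                         # [0.1333..., 0.3333..., 0.5333...]
```

As the text notes, computing f(x) explicitly is often unnecessary when the functional form of θ ↦ h(θ) f(x ∣ θ) is recognizable; here it is done directly because T is a small finite set.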

Random Samples
Of course, an important and essential special case occurs when X = (X_1, X_2, …, X_n) is a random sample of size n from the distribution of a basic variable X. Specifically, suppose that X takes values in a set R and has probability density function g(⋅ ∣ θ) for a given θ ∈ T. In this case, S = R^n and the probability density function f(⋅ ∣ θ) of X given θ is

f(x_1, x_2, …, x_n ∣ θ) = g(x_1 ∣ θ) g(x_2 ∣ θ) ⋯ g(x_n ∣ θ),   (x_1, x_2, …, x_n) ∈ S   (7.4.3)
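As a small illustration of (7.4.3), here is a Python sketch that computes the joint density of a sample as the product of the individual densities; the Bernoulli form of g and the data values are hypothetical.

```python
# Joint density of a random sample as a product of single-observation densities.
# The choice of g (Bernoulli) and the data are hypothetical.
import numpy as np

def g(x, theta):
    # density of a single Bernoulli observation given theta
    return theta**x * (1 - theta)**(1 - x)

def f(x_vec, theta):
    # joint density f(x_1, ..., x_n | theta) = product of g(x_i | theta)
    return np.prod([g(x, theta) for x in x_vec])

print(f([1, 0, 1, 1], 0.6))   # 0.6 * 0.4 * 0.6 * 0.6 = 0.0864
```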

Real Parameters
Suppose that θ is a real-valued parameter, so that T ⊆ R . Here is our main definition.

The conditional expected value E(Θ ∣ X) is the Bayesian estimator of θ .


1. If Θ has a discrete distribution on T then

E(Θ ∣ X = x) = ∑_{θ ∈ T} θ h(θ ∣ x),   x ∈ S   (7.4.4)

2. If Θ has a continuous distribution on T then

E(Θ ∣ X = x) = ∫_T θ h(θ ∣ x) dθ,   x ∈ S   (7.4.5)

Recall that E(Θ ∣ X) is a function of X and, among all functions of X , is closest to Θ in the mean square sense. Of
course, once we collect the data and observe X = x, the Bayesian estimate of θ is E(Θ ∣ X = x). As always, the term
estimator refers to a random variable, before the data are collected, and the term estimate refers to an observed value of
the random variable after the data are collected. The definitions of bias and mean square error are as before, but now
conditioned on Θ = θ ∈ T .

Suppose that U is the Bayes estimator of θ .


1. The bias of U is bias(U ∣ θ) = E(U − θ ∣ Θ = θ) for θ ∈ T.
2. The mean square error of U is mse(U ∣ θ) = E[(U − θ)^2 ∣ Θ = θ] for θ ∈ T.

As before, bias(U ∣ θ) = E(U ∣ θ) − θ and mse(U ∣ θ) = var(U ∣ θ) + bias^2(U ∣ θ). Suppose now that we observe the random variables (X_1, X_2, X_3, …) sequentially, and we compute the Bayes estimator U_n of θ based on (X_1, X_2, …, X_n) for each n ∈ N_+. Again, the most common case is when we are sampling from a distribution, so that the sequence is independent and identically distributed (given θ). We have the natural asymptotic properties that we have seen before.

Let U = (U_n : n ∈ N_+) be the sequence of Bayes estimators of θ as above.

1. U is asymptotically unbiased if bias(U_n ∣ θ) → 0 as n → ∞ for each θ ∈ T.
2. U is mean-square consistent if mse(U_n ∣ θ) → 0 as n → ∞ for each θ ∈ T.

Often we cannot construct unbiased Bayesian estimators, but we do hope that our estimators are at least asymptotically
unbiased and consistent. It turns out that the sequence of Bayesian estimators U is a martingale. The theory of
martingales provides some powerful tools for studying these estimators.
From the Bayesian perspective, the posterior distribution of Θ given the data X = x is of primary importance. Point
estimates of θ derived from this distribution are of secondary importance. In particular, the mean square error function u ↦ E[(Θ − u)^2 ∣ X = x], minimized as we have noted at E(Θ ∣ X = x), is not the only loss function that can be used. (Although it's the only one that we consider.) Another possible loss function, among many, is the mean absolute error function u ↦ E(|Θ − u| ∣ X = x), which we know is minimized at the median(s) of the posterior distribution.
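To make the two loss functions concrete, here is a short Python sketch that computes the posterior mean (the Bayesian estimate under mean square error) and the posterior median (the estimate under mean absolute error) for a hypothetical Beta(5, 3) posterior, using SciPy.

```python
# Posterior mean vs. posterior median for a hypothetical continuous posterior on (0, 1).
from scipy.stats import beta
from scipy.integrate import quad

post = beta(5, 3)                         # hypothetical posterior h(theta | x)

# posterior mean: integral of theta * h(theta | x) dtheta over (0, 1)
mean, _ = quad(lambda t: t * post.pdf(t), 0, 1)

# posterior median: minimizes the mean absolute error
median = post.ppf(0.5)

print(mean, post.mean(), median)          # ~0.625, 0.625, ~0.636
```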

Conjugate Families
Often, the prior distribution of Θ is itself a member of a parametric family, with the parameters specified to reflect our
prior knowledge of θ . In many important special cases, the parametric family can be chosen so that the posterior
distribution of Θ given X = x belongs to the same family for each x ∈ S . In such a case, the family of distributions of
Θ is said to be conjugate to the family of distributions of X . Conjugate families are nice from a computational point of

view, since we can often compute the posterior distribution through a simple formula involving the parameters of the
family, without having to use Bayes' theorem directly. Similarly, in the case that the parameter is real valued, we can
often compute the Bayesian estimator through a simple formula involving the parameters of the conjugate family.

Special Distributions
The Bernoulli Distribution
Suppose that X = (X_1, X_2, …) is a sequence of independent variables, each having the Bernoulli distribution with unknown success parameter p ∈ (0, 1). In short, X is a sequence of Bernoulli trials, given p. In the usual language of reliability, X_i = 1 means success on trial i and X_i = 0 means failure on trial i. Recall that given p, the Bernoulli distribution has probability density function

g(x ∣ p) = p^x (1 − p)^(1−x),   x ∈ {0, 1}   (7.4.6)

Note that the number of successes in the first n trials is Y_n = ∑_{i=1}^n X_i. Given p, the random variable Y_n has the binomial distribution with parameters n and p.


Suppose now that we model p with a random variable P that has a prior beta distribution with left parameter
a ∈ (0, ∞) and right parameter b ∈ (0, ∞) , where a and b are chosen to reflect our initial information about p. So P has

probability density function

h(p) = [1 / B(a, b)] p^(a−1) (1 − p)^(b−1),   p ∈ (0, 1)   (7.4.7)

and has mean a/(a + b). For example, if we know nothing about p, we might let a = b = 1, so that the prior distribution is uniform on the parameter space (0, 1) (the non-informative prior). On the other hand, if we believe that p is about 2/3, we might let a = 4 and b = 2, so that the prior distribution is unimodal, with mean 2/3. As a random process, the sequence X, with p randomized by P, is known as the beta-Bernoulli process, and is very interesting on its own, outside of the context of Bayesian estimation.

For n ∈ N_+, the posterior distribution of P given X_n = (X_1, X_2, …, X_n) is beta with left parameter a + Y_n and right parameter b + (n − Y_n).

Proof

Thus, the beta distribution is conjugate to the Bernoulli distribution. Note also that the posterior distribution depends on the data vector X_n only through the number of successes Y_n. This is true because Y_n is a sufficient statistic for p. In particular, note that the left beta parameter is increased by the number of successes Y_n and the right beta parameter is increased by the number of failures n − Y_n.

The Bayesian estimator of p given X_n is

U_n = (a + Y_n) / (a + b + n)   (7.4.10)

Proof
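Here is a minimal Python sketch of the beta-Bernoulli update and the estimator U_n = (a + Y_n)/(a + b + n); the prior parameters and the data are hypothetical, and SciPy's beta distribution is used only as a convenience.

```python
# Beta-Bernoulli conjugate update with hypothetical prior parameters and data.
import numpy as np
from scipy.stats import beta

a, b = 4.0, 2.0                           # prior Beta(a, b), mean a/(a+b) = 2/3
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])    # hypothetical Bernoulli trials
n, y = len(x), x.sum()                    # sample size and number of successes

posterior = beta(a + y, b + (n - y))      # Beta(a + Y_n, b + n - Y_n)
u_n = (a + y) / (a + b + n)               # Bayesian estimate = posterior mean

print(u_n, posterior.mean())              # both 10/14 = 0.714...
```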

In the beta coin experiment, set n = 20 and p = 0.3, and set a = 4 and b = 2. Run the simulation 100 times and note
the estimate of p and the shape and location of the posterior probability density function of p on each run.

Next let's compute the bias and mean-square error functions.

For n ∈ N_+,

bias(U_n ∣ p) = [a(1 − p) − bp] / (a + b + n),   p ∈ (0, 1)   (7.4.11)

The sequence U = (U_n : n ∈ N_+) is asymptotically unbiased.


Proof

Note also that we cannot choose a and b to make U_n unbiased, since such a choice would involve the true value of p,
which we do not know.

In the beta coin experiment, vary the parameters and note the change in the bias. Now set n = 20 and p = 0.8, and
set a = 2 and b = 6. Run the simulation 1000 times. Note the estimate of p and the shape and location of the
posterior probability density function of p on each update. Compare the empirical bias to the true bias.

For n ∈ N_+,

mse(U_n ∣ p) = {p[n − 2a(a + b)] + p^2[(a + b)^2 − n] + a^2} / (a + b + n)^2,   p ∈ (0, 1)   (7.4.13)

The sequence (U_n : n ∈ N_+) is mean-square consistent.
Proof

In the beta coin experiment, vary the parameters and note the change in the mean square error. Now set n = 10
and p = 0.7, and set a = b = 1 . Run the simulation 1000 times. Note the estimate of p and the shape and location of
the posterior probability density function of p on each update. Compare the empirical mean square error to the true
mean square error.

Interestingly, we can choose a and b so that U_n has mean square error that is independent of the unknown parameter p:

Let n ∈ N_+ and let a = b = √n / 2. Then

mse(U_n ∣ p) = n / [4(n + √n)^2],   p ∈ (0, 1)   (7.4.16)

In the beta coin experiment, set n = 36 and a = b = 3 . Vary p and note that the mean square error does not change.
Now set p = 0.8 and run the simulation 1000 times. Note the estimate of p and the shape and location of the
posterior probability density function on each update. Compare the empirical bias and mean square error to the
true values.

Recall that the method of moments estimator and the maximum likelihood estimator of p (on the interval (0, 1)) are both the sample mean (the proportion of successes):

M_n = Y_n / n = (1/n) ∑_{i=1}^n X_i   (7.4.17)

This estimator has mean square error mse(M_n ∣ p) = p(1 − p)/n. To see the connection between the estimators, note from (7.4.10) that

U_n = [(a + b)/(a + b + n)] · [a/(a + b)] + [n/(a + b + n)] · M_n   (7.4.18)

So U_n is a weighted average of a/(a + b) (the mean of the prior distribution) and M_n (the maximum likelihood estimator).
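A quick numeric check of the weighted-average identity (7.4.18), with hypothetical values of a, b, n, and Y_n:

```python
# Verify that U_n equals the weighted average of the prior mean and the sample mean.
# All values are hypothetical.
a, b, n, y = 4.0, 2.0, 20, 7
m_n = y / n                                           # sample mean M_n
u_n = (a + y) / (a + b + n)                           # Bayesian estimate U_n
weighted = (a + b)/(a + b + n) * (a/(a + b)) + n/(a + b + n) * m_n
print(u_n, weighted)                                  # both 11/26 = 0.4230...
```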

Another Bernoulli Distribution


Bayesian estimation, like other forms of parametric estimation, depends critically on the parameter space. Suppose again that (X_1, X_2, …) is a sequence of Bernoulli trials, given the unknown success parameter p, but suppose now that the parameter space is {1/2, 1}. This setup corresponds to the tossing of a coin that is either fair or two-headed, but we don't know which. We model p with a random variable P that has the prior probability density function h given by h(1) = a, h(1/2) = 1 − a, where a ∈ (0, 1) is chosen to reflect our prior knowledge of the probability that the coin is two-headed. If we are completely ignorant, we might let a = 1/2 (the non-informative prior). If we think the coin is more likely to be two-headed, we might let a = 3/4. Again let Y_n = ∑_{i=1}^n X_i for n ∈ N_+.

The posterior distribution of P given X_n = (X_1, X_2, …, X_n) is

1. h(1 ∣ X_n) = 2^n a / [2^n a + (1 − a)] if Y_n = n, and h(1 ∣ X_n) = 0 if Y_n < n
2. h(1/2 ∣ X_n) = (1 − a) / [2^n a + (1 − a)] if Y_n = n, and h(1/2 ∣ X_n) = 1 if Y_n < n

Proof

Now let

p_n = [2^(n+1) a + (1 − a)] / [2^(n+1) a + 2(1 − a)]   (7.4.23)

The Bayes estimator of p given X_n is the statistic U_n defined by

1. U_n = p_n if Y_n = n
2. U_n = 1/2 if Y_n < n

Proof
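Here is a Python sketch of the two-point parameter space {1/2, 1}: the posterior probability that the coin is two-headed and the Bayes estimate U_n, with a hypothetical prior weight a.

```python
# Two-point prior on {1/2, 1}.  If a tail has been seen, the posterior puts all its
# mass on p = 1/2; if all n tosses are heads, Bayes' theorem gives the weights below,
# and the Bayes estimate is the posterior mean p_n.  The prior weight a is hypothetical.

def posterior_two_headed(n, a, all_heads):
    """h(1 | X_n): posterior probability that the coin is two-headed."""
    if not all_heads:
        return 0.0
    return 2**n * a / (2**n * a + (1 - a))

def bayes_estimate(n, a, all_heads):
    """U_n: 1/2 if a tail was seen, otherwise the posterior mean p_n."""
    if not all_heads:
        return 0.5
    h1 = posterior_two_headed(n, a, True)
    return 1.0 * h1 + 0.5 * (1 - h1)

a = 0.5                                    # non-informative prior
for n in (1, 5, 10):
    print(n, bayes_estimate(n, a, True))   # approaches 1 exponentially fast
```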

If we observe Y_n < n then U_n gives the correct answer 1/2. This certainly makes sense since we know that we do not have the two-headed coin. On the other hand, if we observe Y_n = n then we are not certain which coin we have, and the Bayesian estimate p_n is not even in the parameter space! But note that p_n → 1 as n → ∞ exponentially fast. Next let's compute the bias and mean-square error for a given p ∈ {1/2, 1}.

For n ∈ N_+,

1. bias(U_n ∣ 1) = p_n − 1
2. bias(U_n ∣ 1/2) = (1/2)^n (p_n − 1/2)

The sequence of estimators (U_n : n ∈ N_+) is asymptotically unbiased.
Proof

If p = 1, the estimator U_n is negatively biased; we noted this earlier. If p = 1/2, then U_n is positively biased for sufficiently large n (depending on a).

For n ∈ N_+,

1. mse(U_n ∣ 1) = (p_n − 1)^2
2. mse(U_n ∣ 1/2) = (1/2)^n (p_n − 1/2)^2

The sequence of estimators U = (U_n : n ∈ N_+) is mean-square consistent.


Proof

The Geometric Distribution


Suppose that X = (X_1, X_2, …) is a sequence of independent random variables, each having the geometric distribution on N_+ with unknown success parameter p ∈ (0, 1). Recall that these variables can be interpreted as the number of trials between successive successes in a sequence of Bernoulli trials. Given p, the geometric distribution has probability density function

g(x ∣ p) = p (1 − p)^(x−1),   x ∈ N_+   (7.4.29)

Once again, for n ∈ N_+, let Y_n = ∑_{i=1}^n X_i. In this setting, Y_n is the trial number of the nth success, and given p, Y_n has the negative binomial distribution with parameters n and p.


Suppose now that we model p with a random variable P having a prior beta distribution with left parameter
a ∈ (0, ∞) and right parameter b ∈ (0, ∞) . As usual, a and b are chosen to reflect our prior knowledge of p.

The posterior distribution of P given X_n = (X_1, X_2, …, X_n) is beta with left parameter a + n and right parameter b + (Y_n − n).

Proof

Thus, the beta distribution is conjugate to the geometric distribution. Moreover, note that in the posterior beta
distribution, the left parameter is increased by the number of successes n while the right parameter is increased by the
number of failures Y_n − n, just as in the Bernoulli model. In particular, the posterior left parameter is deterministic and
depends on the data only through the sample size n.

The Bayesian estimator of p based on X_n is

V_n = (a + n) / (a + b + Y_n)   (7.4.32)

Proof
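Here is a minimal Python sketch of the beta update for geometric sampling and the estimator V_n = (a + n)/(a + b + Y_n); the prior parameters and the observations are hypothetical.

```python
# Beta update for geometric sampling, with hypothetical prior parameters and data.
import numpy as np
from scipy.stats import beta

a, b = 1.0, 1.0                           # uniform (non-informative) prior on (0, 1)
x = np.array([3, 1, 5, 2, 4])             # hypothetical geometric observations
n, y = len(x), x.sum()                    # number of successes and trial count Y_n

posterior = beta(a + n, b + (y - n))      # Beta(a + n, b + Y_n - n)
v_n = (a + n) / (a + b + y)               # Bayesian estimate of p

print(v_n, posterior.mean())              # both 6/17 = 0.3529...
```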

Recall that the method of moments estimator of p and the maximum likelihood estimator of p on the interval (0, 1) are both W_n = 1/M_n = n/Y_n. To see the connection between the estimators, note from (7.4.32) that

1/V_n = [a/(a + n)] · [(a + b)/a] + [n/(a + n)] · (1/W_n)   (7.4.33)

So 1/V_n (the reciprocal of the Bayesian estimator) is a weighted average of (a + b)/a (the reciprocal of the mean of the prior distribution) and 1/W_n (the reciprocal of the maximum likelihood estimator).

The Poisson Distribution


Suppose that X = (X_1, X_2, …) is a sequence of random variables, each having the Poisson distribution with unknown parameter λ ∈ (0, ∞). Recall that the Poisson distribution is often used to model the number of “random points” in a region of time or space and is studied in more detail in the chapter on the Poisson Process. The distribution is named for the inimitable Simeon Poisson and, given λ, has probability density function

g(x ∣ λ) = e^(−λ) λ^x / x!,   x ∈ N   (7.4.34)

Once again, for n ∈ N_+, let Y_n = ∑_{i=1}^n X_i. Given λ, the random variable Y_n also has a Poisson distribution, but with parameter nλ.
Suppose now that we model λ with a random variable Λ having a prior gamma distribution with shape parameter
k ∈ (0, ∞) and rate parameter r ∈ (0, ∞). As usual k and r are chosen to reflect our prior knowledge of λ . Thus the

prior probability density function of Λ is


h(λ) = [r^k / Γ(k)] λ^(k−1) e^(−rλ),   λ ∈ (0, ∞)   (7.4.35)

and the mean is k/r. The scale parameter of the gamma distribution is b = 1/r, but the formulas will work out nicer if
we use the rate parameter.

The posterior distribution of Λ given X_n = (X_1, X_2, …, X_n) is gamma with shape parameter k + Y_n and rate parameter r + n.
Proof

It follows that the gamma distribution is conjugate to the Poisson distribution. Note that the posterior rate parameter is
deterministic and depends on the data only through the sample size n.

The Bayesian estimator of λ based on X_n = (X_1, X_2, …, X_n) is

V_n = (k + Y_n) / (r + n)   (7.4.39)

Proof
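Here is a minimal Python sketch of the gamma-Poisson update and the estimator V_n = (k + Y_n)/(r + n); the prior parameters and the counts are hypothetical, and note that SciPy parameterizes the gamma distribution by shape and scale = 1/rate.

```python
# Gamma-Poisson conjugate update with hypothetical prior parameters and data.
import numpy as np
from scipy.stats import gamma

k, r = 2.0, 1.0                           # prior Gamma(shape k, rate r), mean k/r = 2
x = np.array([3, 5, 2, 4, 6, 3])          # hypothetical Poisson counts
n, y = len(x), x.sum()

# scipy uses shape and scale, where scale = 1/rate
posterior = gamma(a=k + y, scale=1.0 / (r + n))
v_n = (k + y) / (r + n)                   # Bayesian estimate of lambda

print(v_n, posterior.mean())              # both 25/7 = 3.571...
```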

Since V_n is a linear function of Y_n, and we know the distribution of Y_n given λ ∈ (0, ∞), we can compute the bias and mean-square error functions.

For n ∈ N_+,

bias(V_n ∣ λ) = (k − rλ) / (r + n),   λ ∈ (0, ∞)   (7.4.40)

The sequence of estimators V = (V_n : n ∈ N_+) is asymptotically unbiased.


Proof

Note that, as before, we cannot choose k and r to make V_n unbiased without knowledge of λ.

For n ∈ N_+,

mse(V_n ∣ λ) = [nλ + (k − rλ)^2] / (r + n)^2,   λ ∈ (0, ∞)   (7.4.42)

The sequence of estimators V = (V_n : n ∈ N_+) is mean-square consistent.


Proof

Recall that the method of moments estimator of λ and the maximum likelihood estimator of λ on the interval (0, ∞) are both M_n = Y_n/n, the sample mean. This estimator is unbiased and has mean square error λ/n. To see the connection between the estimators, note from (7.4.39) that

V_n = [r/(r + n)] · (k/r) + [n/(r + n)] · M_n   (7.4.44)

So V_n is a weighted average of k/r (the mean of the prior distribution) and M_n (the maximum likelihood estimator).

The Normal Distribution


Suppose that X = (X_1, X_2, …) is a sequence of independent random variables, each having the normal distribution with unknown mean μ ∈ R but known variance σ^2 ∈ (0, ∞). Of course, the normal distribution plays an especially important role in statistics, in part because of the central limit theorem. The normal distribution is widely used to model physical quantities subject to numerous small, random errors. In many statistical applications, the variance of the normal distribution is more stable than the mean, so the assumption that the variance is known is not entirely artificial. Recall that the normal probability density function (given μ) is

g(x ∣ μ) = [1 / √(2πσ^2)] exp[−(1/2) ((x − μ)/σ)^2],   x ∈ R   (7.4.45)

Again, for n ∈ N_+, let Y_n = ∑_{i=1}^n X_i. Recall that Y_n also has a normal distribution (given μ) but with mean nμ and variance nσ^2.

Suppose now that μ is modeled by a random variable Ψ that has a prior normal distribution with mean a ∈ R and variance b^2 ∈ (0, ∞). As usual, a and b are chosen to reflect our prior knowledge of μ. An interesting special case is when we take b = σ, so the variance of the prior distribution of Ψ is the same as the variance of the underlying sampling distribution.

For n ∈ N_+, the posterior distribution of Ψ given X_n = (X_1, X_2, …, X_n) is normal with mean and variance given by

E(Ψ ∣ X_n) = (Y_n b^2 + a σ^2) / (n b^2 + σ^2)   (7.4.46)

var(Ψ ∣ X_n) = (σ^2 b^2) / (n b^2 + σ^2)   (7.4.47)

Proof

Therefore, the normal distribution is conjugate to the normal distribution with unknown mean and known variance. Note that the posterior variance is deterministic, and depends on the data only through the sample size n. In the special case that b = σ, the posterior distribution of Ψ given X_n is normal with mean (Y_n + a)/(n + 1) and variance σ^2/(n + 1).

The Bayesian estimator of μ is

U_n = (Y_n b^2 + a σ^2) / (n b^2 + σ^2)   (7.4.55)

Proof
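Here is a minimal Python sketch of the normal-normal update, computing the posterior mean (7.4.46) and posterior variance (7.4.47) directly from the formulas; the prior parameters, the known σ, and the data are hypothetical.

```python
# Normal-normal update with known sampling variance sigma^2.
# Prior parameters, sigma, and data are hypothetical.
import numpy as np

a, b = 10.0, 2.0                          # prior mean and standard deviation of mu
sigma = 3.0                               # known sampling standard deviation
x = np.array([12.1, 9.4, 11.3, 10.8, 13.0])   # hypothetical observations
n, y = len(x), x.sum()                    # sample size and Y_n

post_mean = (y * b**2 + a * sigma**2) / (n * b**2 + sigma**2)   # Bayesian estimate U_n
post_var = (sigma**2 * b**2) / (n * b**2 + sigma**2)            # posterior variance

print(post_mean, post_var)                # ~10.91, ~1.24
```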

Note that U_n = (Y_n + a)/(n + 1) in the special case that b = σ.

For n ∈ N_+,

bias(U_n ∣ μ) = σ^2 (a − μ) / (σ^2 + n b^2),   μ ∈ R   (7.4.56)

The sequence of estimators U = (U_n : n ∈ N_+) is asymptotically unbiased.


Proof

When b = σ, bias(U_n ∣ μ) = (a − μ)/(n + 1).

For n ∈ N_+,

mse(U_n ∣ μ) = [n σ^2 b^4 + σ^4 (a − μ)^2] / (σ^2 + n b^2)^2,   μ ∈ R   (7.4.58)

The sequence of estimators U = (U_n : n ∈ N_+) is mean-square consistent.


Proof

When b = σ, mse(U_n ∣ μ) = [nσ^2 + (a − μ)^2]/(n + 1)^2. Recall that the method of moments estimator of μ and the maximum likelihood estimator of μ on R are both M_n = Y_n/n, the sample mean. This estimator is unbiased and has mean square error var(M_n) = σ^2/n. To see the connection between the estimators, note from (7.4.55) that

U_n = [σ^2/(n b^2 + σ^2)] a + [n b^2/(n b^2 + σ^2)] M_n   (7.4.60)

So U_n is a weighted average of a (the mean of the prior distribution) and M_n (the maximum likelihood estimator).

The Beta Distribution


Suppose that X = (X_1, X_2, …) is a sequence of independent random variables each having the beta distribution with unknown left shape parameter a ∈ (0, ∞) and right shape parameter b = 1. The beta distribution is widely used to model random proportions and probabilities and other variables that take values in bounded intervals (scaled to take values in (0, 1)). Recall that the probability density function (given a) is

g(x ∣ a) = a x^(a−1),   x ∈ (0, 1)   (7.4.61)

Suppose now that a is modeled by a random variable A that has a prior gamma distribution with shape parameter
k ∈ (0, ∞) and rate parameter r ∈ (0, ∞). As usual, k and r are chosen to reflect our prior knowledge of a . Thus the

prior probability density function of A is


h(a) = [r^k / Γ(k)] a^(k−1) e^(−ra),   a ∈ (0, ∞)   (7.4.62)

The mean of the prior distribution is k/r.

The posterior distribution of A given X_n = (X_1, X_2, …, X_n) is gamma, with shape parameter k + n and rate parameter r − ln(X_1 X_2 ⋯ X_n).

Proof

Thus, the gamma distribution is conjugate to the beta distribution with unknown left parameter and right parameter 1.
Note that the posterior shape parameter is deterministic and depends on the data only through the sample size n.

The Bayesian estimator of a based on X_n is

U_n = (k + n) / [r − ln(X_1 X_2 ⋯ X_n)]   (7.4.65)

Proof
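Here is a minimal Python sketch of the gamma update for beta(a, 1) sampling and the estimator U_n in (7.4.65), with the maximum likelihood estimator W_n shown for comparison; the prior parameters and the data are hypothetical.

```python
# Gamma update for beta(a, 1) sampling, with hypothetical prior parameters and data.
import numpy as np

k, r = 2.0, 1.0                           # prior Gamma(shape k, rate r)
x = np.array([0.7, 0.9, 0.5, 0.8, 0.6])   # hypothetical observations in (0, 1)
n = len(x)
log_prod = np.log(x).sum()                # ln(X_1 X_2 ... X_n), negative here

u_n = (k + n) / (r - log_prod)            # Bayesian estimate of a
w_n = -n / log_prod                       # maximum likelihood estimate, for comparison

print(u_n, w_n)                           # ~2.42 and ~2.65
```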

Given the complicated structure, the bias and mean square error of U_n given a ∈ (0, ∞) would be difficult to compute explicitly. Recall that the maximum likelihood estimator of a is W_n = −n / ln(X_1 X_2 ⋯ X_n). To see the connection between the estimators, note from (7.4.65) that

1/U_n = [k/(k + n)] · (r/k) + [n/(k + n)] · (1/W_n)   (7.4.66)

So 1/U_n (the reciprocal of the Bayesian estimator) is a weighted average of r/k (the reciprocal of the mean of the prior distribution) and 1/W_n (the reciprocal of the maximum likelihood estimator).

The Pareto Distribution


Suppose that X = (X_1, X_2, …) is a sequence of independent random variables each having the Pareto distribution with unknown shape parameter a ∈ (0, ∞) and scale parameter b = 1. The Pareto distribution is used to model certain financial variables and other variables with heavy-tailed distributions, and is named for Vilfredo Pareto. Recall that the probability density function (given a) is

g(x ∣ a) = a / x^(a+1),   x ∈ [1, ∞)   (7.4.67)

Suppose now that a is modeled by a random variable A that has a prior gamma distribution with shape parameter
k ∈ (0, ∞) and rate parameter r ∈ (0, ∞). As usual, k and r are chosen to reflect our prior knowledge of a . Thus the

prior probability density function of A is


h(a) = [r^k / Γ(k)] a^(k−1) e^(−ra),   a ∈ (0, ∞)   (7.4.68)

For n ∈ N_+, the posterior distribution of A given X_n = (X_1, X_2, …, X_n) is gamma, with shape parameter k + n and rate parameter r + ln(X_1 X_2 ⋯ X_n).

Proof

Thus, the gamma distribution is conjugate to the Pareto distribution with unknown shape parameter. Note that the
posterior shape parameter is deterministic and depends on the data only through the sample size n.

The Bayesian estimator of a based on X_n is

U_n = (k + n) / [r + ln(X_1 X_2 ⋯ X_n)]   (7.4.71)

Proof
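Here is a minimal Python sketch of the gamma update for Pareto sampling and the estimator U_n in (7.4.71), again with the maximum likelihood estimator W_n for comparison; the prior parameters and the data are hypothetical.

```python
# Gamma update for Pareto(a) sampling with scale parameter 1.
# Prior parameters and data are hypothetical.
import numpy as np

k, r = 2.0, 1.0                           # prior Gamma(shape k, rate r)
x = np.array([1.5, 2.2, 1.1, 3.0, 1.8])   # hypothetical observations in [1, inf)
n = len(x)
log_prod = np.log(x).sum()                # ln(X_1 X_2 ... X_n), positive here

u_n = (k + n) / (r + log_prod)            # Bayesian estimate of a
w_n = n / log_prod                        # maximum likelihood estimate, for comparison

print(u_n, w_n)                           # ~1.76 and ~1.68
```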

Given the complicated structure, the bias and mean square error of U_n given a ∈ (0, ∞) would be difficult to compute explicitly. Recall that the maximum likelihood estimator of a is W_n = n / ln(X_1 X_2 ⋯ X_n). To see the connection between the estimators, note from (7.4.71) that

1/U_n = [k/(k + n)] · (r/k) + [n/(k + n)] · (1/W_n)   (7.4.72)

So 1/U_n (the reciprocal of the Bayesian estimator) is a weighted average of r/k (the reciprocal of the mean of the prior distribution) and 1/W_n (the reciprocal of the maximum likelihood estimator).

This page titled 7.4: Bayesian Estimation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle
Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform.

