Stat 200: Introduction to Statistical Inference Autumn 2018/19
Lecture 3: The method of moments
Lecturer: Art B. Owen October 2
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They
are meant as a memory aid for students who took stat 200 at Stanford University. They may be distributed
outside this class only with the permission of the instructor. Also, Stanford University holds the copyright.
Abstract
These notes are mnemonics about what was covered in class. They don’t replace being present or
reading the book. Reading ahead in the book is very effective.
In reviewing probability we emphasized moments. Turns out there’s a method for using them.
3.1 Preamble
Now we do statistics. I used the example of an MD working in a neighborhood who sees cholesterol numbers
X1 , X2 , . . . , Xn for n patients. That is a fine record of the past but what does it tell about future patients?
We let Y_i = 1_{X_i > 240} be an indicator variable for a patient with a worrisome cholesterol level.
Here is how we usually work in statistics. We consider first that the data we got are a random sample from
some distribution F . We suppose next that F belongs to a known parametric family such as the normal
distributions N(µ, σ²) with PDF

    f(x; µ, σ) = (1/(σ√(2π))) exp(−(1/2)((x − µ)/σ)²),    x ∈ R,

or Poisson with probability mass function (PMF) p(x; λ) = e^{−λ}λ^x/x! for x = 0, 1, 2, . . . . In general X ∼ F
with PDF f(x; θ) (PMFs are similar).
The past data are independent and identically distributed (IID) from this distribution. So are the future
values and they’re independent of the past ones. The only thing we don’t know is θ. If we knew θ we would
know the distribution of these random variables completely.
At a high level, here are the tasks this framework brings:
1. find some estimate of θ of the form θ̂ = T(X_1, . . . , X_n), so that θ̂ is a function of the data (called a ‘statistic’). It is therefore subject to the laws of probability.
2. what good (or bad) properties does our estimate have? Mainly, in what ways is it close to the true θ?
3. given two estimates can we decide which is better?
4. in this framework, do we think that some particular value θ0 is compatible with the data we have
seen? We will then test a hypothesis that θ = θ0 . Often we set things up so that θ0 = 0 is meaningful
scientifically and then we want to test whether the true θ could really be 0.
5. we may also want to test whether the data are compatible with our chosen distribution family f (x; θ).
These are ‘goodness of fit tests’. Maybe we were wrong about that assumption.
6. We often want a confidence interval. Let L(X1 , . . . , Xn ) and U (X1 , . . . , Xn ) be two statistics. They
form a 99% confidence interval for θ if
    Pr(L(X_1, . . . , X_n) ≤ θ ≤ U(X_1, . . . , X_n)) = 0.99.
Those are the main tasks we will look at: estimation, testing, and forming confidence intervals. We usually
work with IID Xi . Some of the methods and problems we consider can extend beyond the case of IID Xi .
In this class (covered in the prior lecture notes) we reviewed Rice chapter 5, the law of large numbers (LLN),
and the central limit theorem (CLT).
3.2 Method of moments
Let’s think of our friend the MD with Y_i ∈ {0, 1} for i = 1, . . . , n. Since the cholesterol levels X_i are IID, the Y_i
are IID Bern(p) for some unknown p. Suppose that of n = 200 patients there were 15 with high cholesterol.
I.e., ∑_{i=1}^n Y_i = 15.
In the method of moments we find E(Y; p) under our parametric model and equate it to Ȳ = (1/n)∑_{i=1}^n Y_i.
We estimate that the population mean equals the sample mean. The Bern(p) distribution has mean p. The
data have mean 15/200 = 0.075. Equating them gives us p̂ = .075 as our estimate of p. This almost seems
too easy. We have 7.5% high cholesterol in the sample so we guess it is 7.5% in the population. Then again
without other information, why would one estimate it to be say 7% or 8% when the sample proportion was
7.5%?
In this case, np̂ ∼ Bin(n, p). So we almost know the exact distribution of p̂. The sticking point that we will
have to get around later is that the distribution of our estimate p̂ depends on the true p which we don’t
know. We face many seemingly circular arguments like this in statistics.
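A minimal numerical sketch of this estimate (hypothetical data constructed to match the counts above, 15 out of n = 200):

import numpy as np

# Hypothetical indicator data matching the example: 15 of n = 200 patients
# have Y_i = 1 (cholesterol above 240) and the rest have Y_i = 0.
n = 200
y = np.zeros(n)
y[:15] = 1.0

# Method of moments: set the model mean E(Y; p) = p equal to the sample mean.
p_hat = y.mean()
print(p_hat)   # 0.075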
3.2.1 Normal data
Now suppose that Xi ∼ N (µ, σ 2 ). Then E(X) = µ so we use µ̂ = X̄ as before. But we still need an estimate
of σ 2 . We get that by using two moments, solving
    E(X) = X̄   and   E(X²) = \overline{X²}.

Here \overline{X²} = (1/n) ∑_{i=1}^n X_i² and more generally \overline{X^r} = (1/n) ∑_{i=1}^n X_i^r. Rice’s notation is µ_r = E(X^r) and
µ̂_r = \overline{X^r}. The new equation is E(X²) = µ² + σ². Solving the two moment equations gives µ̂ = X̄ as before
and

    σ̂² = (1/n) ∑_{i=1}^n (X_i − X̄)²,   i.e.,   σ̂ = √((1/n) ∑_{i=1}^n (X_i − X̄)²).
Many students will already be familiar with a different estimator
    s² = (1/(n − 1)) ∑_{i=1}^n (X_i − X̄)².
This will come up for us later.
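A short sketch of these moment estimates on simulated data (the choice of µ = 220 and σ = 30 below is arbitrary, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=220.0, scale=30.0, size=200)   # simulated cholesterol-like data

# Method of moments: solve E(X) = Xbar and E(X^2) = mean of the X_i^2.
mu_hat = x.mean()
sigma2_hat = np.mean(x**2) - mu_hat**2            # equals np.mean((x - mu_hat)**2)

# The familiar alternative with the n - 1 divisor.
s2 = x.var(ddof=1)

print(mu_hat, sigma2_hat, s2)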
If X and the X_i ∼ N(µ, σ²) then

    Pr(X > 240) = Pr((X − µ)/σ > (240 − µ)/σ) = Pr(N(0, 1) > (240 − µ)/σ) = 1 − Φ((240 − µ)/σ)
where Φ is the CDF of the N (0, 1) distribution. The MD could also estimate the fraction of high cholesterol
patients by 1 − Φ((240 − µ̂)/σ̂). Later we will look at criteria for choosing from among two or more different
estimators.
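Continuing the simulated-data sketch above (same hypothetical data), the plug-in estimate looks like this:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=220.0, scale=30.0, size=200)   # the same simulated data as before

mu_hat = x.mean()
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))   # method of moments sigma-hat

# Plug the estimates into 1 - Phi((240 - mu)/sigma).
prob_high = 1.0 - norm.cdf((240.0 - mu_hat) / sigma_hat)
print(prob_high)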
3.2.2 Gamma data
The Gamma distribution with shape α > 0 is denoted Gam(α) (in these notes). It has PDF f(x; α) =
x^{α−1}e^{−x}/Γ(α) for 0 < x < ∞, where Γ is the Gamma function (see lecture 2). If Y ∼ Gam(α) and X = Y/λ
for λ > 0 then X has the Gam(α, λ) distribution. This is the definition of Gam(α, λ). The original is then
Gam(α, 1).
Your probability background should enable you to do these two things:
1. show that the PDF of Gam(α, λ) is f(x; α, λ) = λ f(λx; α), and
2. evaluate this PDF.
In class we worked out that the method of moments estimates are
    α̂ = X̄²/σ̂²   and   λ̂ = X̄/σ̂²,

where σ̂² is the same as for the normal case.
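A sketch of these Gamma moment estimates on simulated data (the true α = 2 and λ = 0.5 are arbitrary illustrative values; note that numpy's gamma sampler is parametrized by a scale equal to 1/λ):

import numpy as np

rng = np.random.default_rng(1)
alpha_true, lam_true = 2.0, 0.5
x = rng.gamma(shape=alpha_true, scale=1.0 / lam_true, size=10_000)

xbar = x.mean()
sigma2_hat = np.mean((x - xbar) ** 2)   # the same sigma-hat squared as in the normal case

alpha_hat = xbar**2 / sigma2_hat
lam_hat = xbar / sigma2_hat
print(alpha_hat, lam_hat)               # should be roughly 2 and 0.5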
3.3 How well does M.O.M. work?
Let θ be the true parameter value and θ̂ be our estimate of it. When we want to keep track of the sample
size n we write θ̂ as θ̂n . A very mild requirement is that θ̂n should converge to θ as n → ∞. An estimator
that cannot get the right answer on unbounded sample sizes is problematic. We want Pr(|θ̂_n − θ| > ε) → 0.
We want that for all ε > 0 and for all θ too. We have a set Θ (the capital Greek letter theta) containing all possible
values of θ. Now our estimator is consistent if

    lim_{n→∞} Pr(|θ̂_n − θ| > ε; θ) = 0,   for all ε > 0 and all θ ∈ Θ.
Our Θ is the set of all possible values that θ could take. Often Θ = R^r for some r ≥ 1. Other times only
certain values of θ are possible. For N(µ, σ²),

    Θ = {(µ, σ) | µ ∈ R, σ > 0},

and the Gamma family has θ = (α, λ) ∈ Θ where

    Θ = {(α, λ) | 0 < α < ∞, 0 < λ < ∞}.
Later Rice uses Θ (the capital of the Greek letter θ) to denote a random variable that takes the value θ. We will
keep the two usages separate.
If θ = E(X) and θ̂n = X̄ based on n IID Xi then θ̂n is automatically consistent by the law of large numbers.
That was easy.
Now suppose that θ = g(E(X)) and θ̂_n = g(X̄). If g is continuous at E(X) then for any ε > 0 there exists a
δ > 0 such that |X̄ − E(X)| < δ implies that |g(X̄) − g(E(X))| < ε. This is from the definition of continuity.
That means

    Pr(|θ̂_n − θ| > ε) = Pr(|g(X̄) − g(E(X))| > ε) ≤ Pr(|X̄ − E(X)| ≥ δ) → 0

by the LLN. So θ̂_n is consistent when g is continuous at E(X). If g is continuous everywhere then the
moment estimator θ̂_n is consistent for every θ ∈ Θ.
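A small simulation illustrating this consistency, with an arbitrary illustrative choice g(t) = t² and exponential data (neither comes from the notes):

import numpy as np

rng = np.random.default_rng(2)
g = lambda t: t**2           # a continuous g, chosen only for illustration
mu = 3.0                     # E(X); the true theta is g(mu) = 9
theta = g(mu)

for n in (10, 100, 10_000, 1_000_000):
    xbar = rng.exponential(scale=mu, size=n).mean()
    print(n, abs(g(xbar) - theta))   # the error tends to shrink as n grows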
3.3.1 Delta method variance
We will use a Taylor expansion in order to apply the CLT to the method of moments. Rice uses a Taylor
expansion on the method of maximum likelihood so we might as well add that in for the method of moments
too (to be consistent).
If θ̂ = X̄ then a CLT for X̄ immediately gives one for θ̂. The same would happen for a linear function
θ̂ = a + bX̄: if X̄ ≈ N(µ, σ²/n) then a + bX̄ ≈ N(a + bµ, b²σ²/n).
If X̄ is the average of n IID random variables having mean µ and variance σ 2 then X̄ converges to µ = E(X)
(by the LLN). Let’s make a Taylor approximation to g(X̄) at µ:
    g(X̄) = g(µ) + (X̄ − µ)g′(µ) + (1/2)(X̄ − µ)²g″(µ) + · · · .
Therefore
    g(X̄) − g(µ) = (X̄ − µ)g′(µ) + (1/2)(X̄ − µ)²g″(µ) + · · · .
Now E((X̄ − µ)²) = σ²/n so the typical size of |X̄ − µ| is about σ/√n. Higher powers of |X̄ − µ| are then
relatively negligible. So we can work with the approximations

    g(X̄) − g(µ) ≈ (X̄ − µ)g′(µ) ≈ N(0, σ²g′(µ)²/n).
We wrote an infinite Taylor expansion but we could also terminate it using g″(µ*) where µ* is somewhere
between X̄ and µ. A more advanced course would take more care about conditions on g than we do here.
There is one situation where the argument above goes wrong. If g′(µ) = 0 then the first term (X̄ − µ)g′(µ)
is no longer dominant. We would then have to find the lowest order derivative of g that is not zero at µ.
Our delta method approximation is then

    θ̂_n = g(X̄) ≈ N(θ, σ²g′(µ)²/n)

if the X_i are independent with mean µ and variance σ², and θ = g(E(X)) = g(µ) for a function g that is smooth
and has g′(µ) ≠ 0.
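A sketch checking this approximation by simulation: compare the empirical variance of g(X̄) over many repeated samples with σ²g′(µ)²/n. The choice g(t) = log t (so g′(µ) = 1/µ) and the numbers below are arbitrary, for illustration only.

import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 5.0, 2.0, 500, 20_000

# reps independent samples of size n, one xbar per sample
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

g = np.log                          # g(t) = log t, so g'(mu) = 1/mu
empirical_var = g(xbar).var()
delta_var = sigma**2 * (1.0 / mu) ** 2 / n

print(empirical_var, delta_var)     # the two should nearly agree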
3.3.2 Problems and advantages of moments
Maybe θ has r = 3 parameters in it. Then we form and solve the equations E(X^k) = \overline{X^k} for k = 1, 2, 3. We
could be out of luck if E(X³) = ∞. Then the method of moments would not deliver estimates for us.
We could have X with a PDF f (x; θ) where θ = E(X) is known to satisfy θ > 0. Yet we might get θ̂ = X̄ < 0.
In some settings we can get a negative estimate σ̂² for a variance. We probably knew that σ̂² would not be
exactly right, but getting σ̂² < 0 is still embarrassing.
We might have a parameter that must be an integer. Suppose your puppy got 0 or 1 or 2 copies of a certain
gene from its parents. Call that value θ. Now you get some random variables that are (say) Poi(10 × θ).
After a bit of algebra you find that the method of moments estimate is θ̂ = X̄/10, but it will typically not be exactly 0, 1, or 2.
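A sketch of this example (the true value θ = 1 and the ad hoc rounding step are illustrative assumptions, not part of the method itself):

import numpy as np

rng = np.random.default_rng(4)
theta_true = 1                                   # true number of gene copies, in {0, 1, 2}
x = rng.poisson(lam=10 * theta_true, size=50)

theta_hat = x.mean() / 10                        # e.g. 0.97: almost never exactly 0, 1 or 2
theta_rounded = min((0, 1, 2), key=lambda t: abs(t - theta_hat))
print(theta_hat, theta_rounded)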
Now some advantages. Suppose that n = 10^11. Then θ̂ will require a lot of summing. You can however
spread those sums over tens of thousands of computers all running in parallel (if you have them). More
complicated estimates can be harder to parallelize.
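A toy sketch of that point: the sums behind the moment estimates can be accumulated chunk by chunk and then combined, so each chunk could live on a different machine (here they are just array slices on one machine):

import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=1_000_000)

# Each chunk only has to report (sum, sum of squares, count) back to a central node.
chunks = np.array_split(x, 100)
s1 = sum(c.sum() for c in chunks)
s2 = sum((c**2).sum() for c in chunks)
n = sum(c.size for c in chunks)

mu_hat = s1 / n
sigma2_hat = s2 / n - mu_hat**2
print(mu_hat, sigma2_hat)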
A second advantage. Suppose that you really want θ = E(X) or Var(X) or both and you made an estimate
assuming that the X_i have a PDF of the parametric form f(x; θ). If you were wrong about that f, your X̄ will
still be consistent for E(X).