Decision Theory
(Largely parametric) Estimation
Prof. Dr. Matei Demetrescu
1 / 36
Stronger assumptions may actually help...
We've seen that
■ some estimators don't require any assumptions except the ones we
make about the sampling scheme, yet
■ others manage to estimate parameters on the basis of minimal
additional assumptions.
Neither the plug-in estimator(s) nor the method-of-moments approach makes any optimality/efficiency claim.
This may, however, be achieved in a more parametric setup!
2 / 36
Today's outline
(Largely parametric) Estimation
1 The MLE
2 MLE (asymptotic) properties
3 Bayesian inference
4 Up next
3 / 36
The MLE
Moving on to...
1 The MLE
2 MLE (asymptotic) properties
3 Bayesian inference
4 Up next
4 / 36
The MLE
R. A. Fisher: Likelihood estimation
The idea is to pick as estimates those parameter values that have most plausibly generated the observed sample, as quantified by the likelihood
$L(\theta; x) := f_{\vec X}(x; \theta).$
Formally, for a given sample x, our maximum likelihood estimate is
$\hat\theta = \arg\max_{\theta \in \Theta} L(\theta; x).$
Given random sampling (and measurability), we end up with the ML estimator
$\hat\theta = \arg\max_{\theta \in \Theta} L(\theta; \vec X) = \arg\max_{\theta \in \Theta} \log L(\theta; \vec X).$
5 / 36
The MLE
Different samples, different outcomes
[Figure: the likelihood (left) and the log-likelihood (right) as functions of mu, each drawn for several different samples.]
6 / 36
The MLE
Implementation of the MLE
For smooth log-likelihoods with the maximum in the interior of Θ,
■ Solve the first-order conditions (f.o.c.),
$\nabla \log L(\hat\theta; \vec X) = \nabla \ell(\hat\theta; \vec X) = \frac{\partial \ell(\hat\theta; \vec X)}{\partial \theta} = 0,$
where $\ell := \log L$ denotes the log-likelihood;
■ More often than not this needs to be done numerically.
■ The 2nd-order condition is a negative definite Hessian $\frac{\partial^2 \ell(\hat\theta; \vec X)}{\partial \theta\, \partial \theta'}$.
Even without interior solutions or smoothness of ℓ, a value θ̂ that solves
$\arg\max_{\theta \in \Theta} \ell(\theta; \vec X)$
is an ML estimate, no matter how it is obtained.
7 / 36
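To make the numerical route concrete, here is a minimal sketch (not part of the original slides) that maximizes the exponential log-likelihood from the next slide with a generic optimizer; the simulated data and the use of scipy are assumptions for illustration only.

```python
# Minimal sketch: numerical maximization of an exponential log-likelihood.
# The data are simulated for illustration; the closed-form MLE is the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=200)   # assumed sample, true theta = 2

def neg_loglik(theta):
    # ell(theta; x) = -n*log(theta) - sum(x)/theta, so we minimize its negative
    return len(x) * np.log(theta) + x.sum() / theta

res = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded")
print(res.x, x.mean())   # numerical maximizer vs. the closed-form MLE X-bar
```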
The MLE
A hopefully rare situation
Take e.g. Θ = [0, ∞).
With no stationary points, we need to check the behavior at the boundaries!
(Should in fact do that anyway.)
In this case, θ̂ = 0.
[Figure: a log-likelihood on Θ = [0, ∞) without interior stationary points, attaining its maximum at the boundary θ̂ = 0.]
8 / 36
The MLE
Two very different cases
Example (Smooth... )
Let $\vec X = (X_1, \dots, X_n)$ be an iid sample from an exponential population,
$f(x; \theta) = \frac{1}{\theta} \exp\!\left(-\frac{x}{\theta}\right), \quad x \in (0, \infty),\ \theta \in \Theta = (0, \infty).$
The MLE of θ is $\bar X_n$.
Example (... and not so smooth)
Let $\vec X = (X_1, \dots, X_n)$ be an iid sample from a uniform population,
$f(x; a, b) = \frac{1}{b-a}\, \mathbb{I}_{[a,b]}(x), \quad -\infty < a < b < \infty.$
The MLEs are $\hat a = X_{[1]}$ and $\hat b = X_{[n]}$.
9 / 36
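For the smooth case, the standard derivation (a worked step added here, following the f.o.c. recipe above) runs as follows:
$\ell(\theta; \vec x) = \sum_{i=1}^n \log f(x_i; \theta) = -n\log\theta - \frac{1}{\theta}\sum_{i=1}^n x_i, \qquad \frac{\partial \ell}{\partial \theta} = -\frac{n}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^n x_i = 0 \;\Longrightarrow\; \hat\theta = \frac{1}{n}\sum_{i=1}^n x_i = \bar X_n,$
and the second derivative at $\hat\theta$ equals $n/\hat\theta^2 - 2n\hat\theta/\hat\theta^3 = -n/\hat\theta^2 < 0$, so this is indeed a maximum. In the uniform case the likelihood $(b-a)^{-n}$ is not differentiable in (a, b); it increases as the interval [a, b] shrinks, subject to covering all observations, which is why the extreme order statistics are the MLEs.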
The MLE
Concentrating out parameters
Let $\theta = (\theta_1', \theta_2')'$; $\theta_1$ is of interest and $\theta_2$ is a nuisance parameter...
Example (Normal population)
Let $\theta = (\mu, \sigma^2)'$, with $\sigma^2$ the nuisance parameter. Then
$\ell(\mu, \sigma^2; x) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2,$
leading to the f.o.c.
$\nabla \ell = \begin{pmatrix} \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu) \\ -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (X_i - \mu)^2 \end{pmatrix} = 0.$
We may solve the f.o.c. for µ independently of σ²!
10 / 36
The MLE
Profile likelihood
Example (continued)
In fact, one may compute $\sigma^2(\mu) = \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2$ and plug it into ℓ to obtain the concentrated or profile likelihood
$\ell = -\frac{n}{2}\log 2\pi - \frac{n}{2} - \frac{n}{2}\log\!\left(\frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2\right),$
which is still optimized at $\hat\mu = \bar X$.
■ Even if we can't directly solve the f.o.c. for θ₁ only,
■ we may still solve for $\theta_2 = f(\theta_1)$ and plug this expression into ℓ to obtain the profile log-likelihood.
11 / 36
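A minimal numerical sketch of the profile-likelihood idea (not from the slides; the data and the search grid are assumptions): concentrate σ² out and maximize over µ only.

```python
# Minimal sketch: profile log-likelihood for mu in a normal model,
# with sigma^2 concentrated out as sigma^2(mu) = mean((x - mu)^2).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=500)   # assumed sample
n = len(x)

def profile_loglik(mu):
    s2 = np.mean((x - mu) ** 2)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * n - 0.5 * n * np.log(s2)

grid = np.linspace(x.mean() - 1.0, x.mean() + 1.0, 2001)
mu_hat = grid[np.argmax([profile_loglik(m) for m in grid])]
print(mu_hat, x.mean())   # the profile maximizer agrees (numerically) with X-bar
```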
The MLE
Conditional likelihood I
Conditioning on a (sufficient) statistic S for θ₂ may also eliminate the
nuisance parameter.
A simple and intuitive form of this conditioning is available for
regression models with stochastic regressors.
Let f(x, y; θ) be the joint density of X and Y (both scalar for simplicity).
■ The likelihood is $L(\theta) = \prod_{i=1}^n f(x_i, y_i; \theta)$
■ ... which can be a complicated expression (even in log form).
But say you're only interested in the dependence of Y on X ...
12 / 36
The MLE
Conditional likelihood II
Express the joint pdf as
$f(x, y; \theta) = f_{Y|X}(y, x; \theta_1) \cdot f_X(x; \theta_2),$
leading to a log-likelihood
$\ell = \sum_{i=1}^n \log f_{Y|X}(Y_i, X_i; \theta_1) + \sum_{i=1}^n \log f_X(X_i; \theta_2),$
where maximizations w.r.t. θ1 and θ2 are entirely independent!
If the marginal behavior of X is not of interest, just specify $f_{Y|X}$!
(Or just the regression function, if not interested in more.)
13 / 36
The MLE
ML as MM
Let $X_i \overset{iid}{\sim} f(x; \theta)$ s.t. $\ln L(\theta; \vec X) = \sum_{i=1}^n \ln f(X_i; \theta)$.
Regularity conditions assumed, we have (in the continuous case)
$\mathrm{E}\!\left(\frac{\partial \ln f(X_i; \theta)}{\partial \theta}\right) = 0.$
Hence,
$\mathrm{E}_{\theta_0}\!\left(g(X_i, \theta_0)\right) = \mathrm{E}_{\theta_0}\!\left(\frac{\partial \ln f(X_i; \theta_0)}{\partial \theta}\right) = 0,$
which are moment conditions with sample counterparts
$\frac{1}{n}\sum_{i=1}^n g(X_i, \theta) = \frac{1}{n}\sum_{i=1}^n \frac{\partial \ln f(X_i; \theta)}{\partial \theta} = \frac{1}{n}\,\frac{\partial \ln L(\theta; \vec X)}{\partial \theta} = 0.$
14 / 36
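The zero-mean property of the score follows from differentiating the identity $\int f(x; \theta)\,dx = 1$ under the regularity conditions that allow interchanging differentiation and integration (a worked step added for clarity):
$0 = \frac{\partial}{\partial \theta}\int f(x; \theta)\,dx = \int \frac{\partial f(x; \theta)}{\partial \theta}\,dx = \int \frac{\partial \ln f(x; \theta)}{\partial \theta}\, f(x; \theta)\,dx = \mathrm{E}_{\theta}\!\left(\frac{\partial \ln f(X_i; \theta)}{\partial \theta}\right).$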
The MLE
Quasi likelihood
Often, we do not really know what the population distribution looks like.
■ But you still have an iid sample $X_1, \dots, X_n$, and
■ you have assumed something, say that X ∼ g (x, θ).
(Perhaps even knowing it's not the right density.)
This allows you to set up a likelihood,
$L(\theta; X_1, \dots, X_n) = f_{\vec X}(X_1, \dots, X_n; \theta) = \prod_{i=1}^n g(X_i, \theta).$
And you can easily compute the resulting ML estimator.
But if actually $X_i \sim f(x)$, how does θ̂ behave?
15 / 36
The MLE
Some simple linear regression
Example
Assume $Y|X = x \sim N(\alpha + \beta x, \sigma^2)$. The conditional likelihood is then all you have,
$L(\alpha, \beta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{Y_i - \alpha - \beta X_i}{\sigma}\right)^2},$
leading to the log-likelihood
$\ell(\alpha, \beta) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \alpha - \beta X_i)^2.$
16 / 36
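Since the first two terms do not depend on (α, β) and σ² > 0 only scales the last term, maximizing ℓ over (α, β) reduces to minimizing the residual sum of squares; spelled out as a worked step (added here, anticipating the next slide):
$\arg\max_{\alpha, \beta}\ \ell(\alpha, \beta) = \arg\min_{\alpha, \beta}\ \sum_{i=1}^n (Y_i - \alpha - \beta X_i)^2,$
i.e. the Gaussian (conditional) MLE of (α, β) coincides with the least-squares estimator, whatever the value of σ².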
The MLE
Quasi-ML
Maximizing the Gaussian (conditional) likelihood is equivalent to
minimizing the residual sum of squares. But Least Squares is nice!
So Gaussian Quasi-ML also behaves nicely (in semiparametric ways):
■ The important part is that the parameters of interest are still identified!
■ Identification: the expected gradient of the quasi-log-likelihood still has the
true parameters as zeros, which amounts to
■ ... correctly specifying the conditional mean for the regression case.
From this perspective, Quasi-ML can be interpreted as a
method-of-moments estimator too.
17 / 36
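A minimal simulation sketch of this robustness claim (not from the slides; the data-generating process, the t-distributed errors, and all names are assumptions): the Gaussian quasi-ML estimates of (α, β) coincide with OLS and still recover the conditional-mean parameters.

```python
# Minimal sketch: Gaussian quasi-ML for (alpha, beta) equals OLS, even with
# non-Gaussian errors, as long as the conditional mean is correctly specified.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y = 0.5 + 1.5 * x + rng.standard_t(df=5, size=n)   # assumed non-Gaussian errors

def neg_quasi_loglik(params, sigma2=1.0):
    alpha, beta = params
    resid = y - alpha - beta * x
    # Gaussian log-likelihood with sigma^2 fixed; only the RSS term matters for (alpha, beta)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(resid**2) / sigma2

qml = minimize(neg_quasi_loglik, x0=np.zeros(2)).x
ols = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)[0]
print(qml, ols)   # both should be close to the true values (0.5, 1.5)
```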
MLE (asymptotic) properties
Moving on to...
1 The MLE
2 MLE (asymptotic) properties
3 Bayesian inference
4 Up next
18 / 36
MLE (asymptotic) properties
Sometimes we're (not) lucky
In many cases, MLEs are explicit functions of the sample, $\hat\theta = \theta(\vec X)$, and we may analyze them directly:
■ either derive the exact sampling distribution, or
■ provide some asymptotic approximation (using WLLN, CLT, delta
method).
Otherwise, asymptotics are the only chance:
■ Argue first that the ML estimator is consistent, and
■ (for smooth likelihoods) linearize the f.o.c. to establish normality.
We don't deal with this here (but see perhaps the additional material on Moodle).
19 / 36
MLE (asymptotic) properties
Need to talk about identification
Note that
■ ... identification is a blurred concept in the nonparametric case.
■ This changes in the (fully specified) parametric case!
Definition (model-based, global identification)
A parametric model is identified if and only if, for any sample $\vec X$,
$f_{\vec X}(\vec X \,|\, \theta_1) = f_{\vec X}(\vec X \,|\, \theta_2)$ implies $\theta_1 = \theta_2$.
If identification is given, then the MLE must be unique.
20 / 36
MLE (asymptotic) properties
The asymptotic take
If consistency is given, then
Proposition
Let θ = θ0 . Regularity conditions assumed, it then holds
$\sqrt{n}\,(\hat\theta - \theta_0) \overset{d}{\to} N\!\left(0,\ \frac{\mathrm{E}\!\left[\left(\partial \ln f(X_i; \theta_0)/\partial \theta\right)^2\right]}{\left[\mathrm{E}\!\left(\partial^2 \ln f(X_i; \theta_0)/\partial \theta^2\right)\right]^2}\right).$
21 / 36
MLE (asymptotic) properties
The information equality
Under additional regularity conditions,
$\mathrm{E}\!\left[\left(\frac{\partial \ln f(X_i; \theta)}{\partial \theta}\right)^2\right] = -\mathrm{E}\!\left(\frac{\partial^2 \ln f(X_i; \theta)}{\partial \theta^2}\right) = -I(\theta),$
so that
$\sqrt{n}\,(\hat\theta - \theta_0) \overset{d}{\to} N\!\left(0,\ -I^{-1}(\theta_0)\right) \equiv N\!\left(0,\ \left[-\mathrm{E}\!\left(\frac{\partial^2 \ln f(X_i; \theta)}{\partial \theta^2}\right)\right]^{-1}\right).$
In such cases the asymptotic variance of the MLE attains the so-called Cramér-Rao lower bound: the MLE is asymptotically efficient.
22 / 36
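As a concrete check (continuing the exponential example from earlier, added here): with $\ln f(x; \theta) = -\ln\theta - x/\theta$,
$\frac{\partial^2 \ln f(X_i; \theta)}{\partial \theta^2} = \frac{1}{\theta^2} - \frac{2X_i}{\theta^3}, \qquad -\mathrm{E}\!\left(\frac{\partial^2 \ln f(X_i; \theta_0)}{\partial \theta^2}\right) = -\frac{1}{\theta_0^2} + \frac{2\theta_0}{\theta_0^3} = \frac{1}{\theta_0^2},$
so that $\sqrt{n}\,(\hat\theta - \theta_0) = \sqrt{n}\,(\bar X_n - \theta_0) \overset{d}{\to} N(0, \theta_0^2)$, in line with the CLT, since $\mathrm{Var}(X_i) = \theta_0^2$ for the exponential distribution.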
MLE (asymptotic) properties
Estimating standard errors
Consistent estimates for the variance of the asymptotic distribution of
MLEs can be obtained by
$\widehat{\mathrm{E}}\!\left(\frac{\partial^2 \ln f(X_i; \theta_0)}{\partial \theta^2}\right) = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \ln f(X_i; \hat\theta)}{\partial \theta^2},$
$\widehat{\mathrm{E}}\!\left[\left(\frac{\partial \ln f(X_i; \theta_0)}{\partial \theta}\right)^2\right] = \frac{1}{n}\sum_{i=1}^n \left(\frac{\partial \ln f(X_i; \hat\theta)}{\partial \theta}\right)^2.$
This delivers the so-called asymptotic standard errors of MLEs.
23 / 36
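A minimal sketch of both estimators for the exponential MLE (not from the slides; the data are simulated for illustration), where the analytic asymptotic standard error is $\hat\theta/\sqrt{n}$:

```python
# Minimal sketch: Hessian-based and outer-product standard errors for the
# exponential MLE theta_hat = mean(x); both should be close to theta_hat/sqrt(n).
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=1000)            # assumed sample
n, theta_hat = len(x), x.mean()

score = -1.0 / theta_hat + x / theta_hat**2          # d ln f / d theta at theta_hat
hess = 1.0 / theta_hat**2 - 2.0 * x / theta_hat**3   # d^2 ln f / d theta^2 at theta_hat

se_hessian = np.sqrt(1.0 / (-hess.mean() * n))       # based on -E[Hessian]
se_opg = np.sqrt(1.0 / (np.mean(score**2) * n))      # based on E[score^2]
print(se_hessian, se_opg, theta_hat / np.sqrt(n))
```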
Bayesian inference
Moving on to...
1 The MLE
2 MLE (asymptotic) properties
3 Bayesian inference
4 Up next
24 / 36
Bayesian inference
A change of perspective
So far, we assumed that
■ our sample $\vec X$ is random with some joint distribution/pdf $f_{\vec X}(x; \theta)$, where θ is some fixed, though unknown, parameter.
■ The data are used to obtain an estimate for θ, and
■ the sampling distribution of the estimator describes the uncertainty,
■ which was analyzed in a (fictional?) repeated sampling framework.
In Bayesian statistics, θ itself is regarded as a random variable.
A random θ captures the uncertainty about the parameter,
... and allows for (different) inference from data!
25 / 36
Bayesian inference
Why would we want random parameters?
If θ is (regarded as) random, then $\vec X$ and θ have a joint distribution.
Moreover, conditional distributions have nice interpretations:
■ $f_{\vec X | \theta}$ is the distribution of the sample given specific parameter values (i.e. the data generating process)
■ $f_{\theta | \vec X}$ is the distribution of the parameters given the sample.
The so-called posterior distribution $f_{\theta | \vec X}$ then conveys all the information available about θ.
The goal of Bayesian statistics is to provide $f_{\theta | \vec X}$.
26 / 36
Bayesian inference
Setup
Assume that $\vec X = (X_1, \dots, X_n)$ is an iid sample with joint pdf $f_{\vec X}(x|\theta)$, defining the sample likelihood.
We have $f_{\vec X}(x|\theta) = L(\theta; x)$, but would like $f_{\theta|\vec X}$!
To get there, we need another ingredient:
■ Any information we have about θ before observing the data $\vec X$ is represented by the so-called prior density, denoted by f(θ).
■ This is a density which we need to specify in such a way that it reflects our prior knowledge or beliefs (or both).
27 / 36
Bayesian inference
The Bayes theorem
If we have a prior, then we may resort to Bayes' theorem.
The posterior density is then
$f(\theta|x) = \frac{f(\theta, x)}{f(x)} = \frac{f_{\vec X}(x|\theta)\, f(\theta)}{f(x)} = \frac{f_{\vec X}(x|\theta)\, f(\theta)}{\int f_{\vec X}(x|\theta)\, f(\theta)\, d\theta}.$
The posterior combines
■ the prior knowledge about θ summarized by the prior f (θ) with
■ the information about θ contained in the data $\vec X$, in the likelihood $L(\theta; \vec X)$, given as $f_{\vec X}(x|\theta)$ evaluated at $\vec X$.
28 / 36
Bayesian inference
Almost like learning
The posterior
■ represents our revised knowledge/beliefs about the distribution of θ after seeing the data $\vec X$
■ is obtained as a mixture of the prior information and current
information (data!)
■ can serve as the prior when the next body of data becomes available.
Essentially, we continuously update our knowledge about θ.
One other advantage over classical statistics is that
■ once you agree on prior and model,
■ you always get the same answer from the data.
29 / 36
Bayesian inference
Focus on the essential
In this setting, f (x) is referred to as the marginal data density, which
does not involve the parameter of interest. With ∝ meaning `is
proportional to',
$\underbrace{f(\theta|x)}_{\text{posterior}} \;\propto\; \underbrace{f_{\vec X}(x|\theta)}_{\text{likelihood}} \cdot \underbrace{f(\theta)}_{\text{prior}}.$
Note that the product $L(\theta; \vec X)\, f(\theta)$ does not define a proper density. It represents a so-called density kernel for the posterior density of θ.
■ Sometimes the kernel suffices (e.g. when finding the posterior mode).
■ At other times we need to compute the normalizing constant f(x), or find some other workaround.
30 / 36
Bayesian inference
The normal-normal example
Example
Let $\vec X = (X_1, \dots, X_n)$ be an iid sample from a $N(\mu, \sigma^2)$ population.
■ The variance σ 2 is assumed to be known.
■ Say the prior information can be represented by a µ ∼ N (m, v 2 )
prior distribution, where the prior parameters (m, v 2 ) are also known.
The posterior $f(\mu|\vec X)$ is then a $N(\mu_*, \sigma_*^2)$ density with
$\sigma_*^2 = \frac{\sigma^2 v^2}{n v^2 + \sigma^2}, \qquad \mu_* = \frac{n \bar X v^2 + m \sigma^2}{n v^2 + \sigma^2}, \qquad \text{where } \bar X = \tfrac{1}{n}\sum_i X_i.$
31 / 36
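A compressed derivation via the posterior kernel (a worked step added here): collecting the terms in µ in the exponent,
$f(\mu | \vec x) \;\propto\; \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right) \exp\!\left(-\frac{(\mu - m)^2}{2v^2}\right) \;\propto\; \exp\!\left(-\frac{1}{2}\left[\left(\frac{n}{\sigma^2} + \frac{1}{v^2}\right)\mu^2 - 2\left(\frac{n\bar x}{\sigma^2} + \frac{m}{v^2}\right)\mu\right]\right),$
which is the kernel of a normal density with precision $\frac{n}{\sigma^2} + \frac{1}{v^2} = \frac{n v^2 + \sigma^2}{\sigma^2 v^2}$ and mean $\left(\frac{n\bar x}{\sigma^2} + \frac{m}{v^2}\right)\Big/\left(\frac{n}{\sigma^2} + \frac{1}{v^2}\right) = \frac{n \bar x v^2 + m \sigma^2}{n v^2 + \sigma^2}$, matching $\sigma_*^2$ and $\mu_*$ above.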
Bayesian inference
The normal-normal example cont'd
[Figure: four panels of prior vs. posterior densities for µ: "Typical prior vs. posterior", "Likelihood dominance", "Flat likelihood", and "Data-prior conflict".]
32 / 36
Bayesian inference
Modelling choices
One problem with the Bayes approach is its dependence on a prior. (The
likelihood is also a matter of choice, actually, but we ignore that for now.)
■ While there are cases where a particular prior is well-justified,
■ the prior may seem entirely arbitrary in others.
Good Bayesians make choices transparent, and assess sensitivity of the
results w.r.t. the choices they make.
33 / 36
Bayesian inference
In practice
Applied researchers may consider
■ Priors chosen for computational convenience (e.g. conjugate priors);
■ So-called non-informative priors (e.g. uniform distributions);
■ Empirical Bayes (estimate the prior from data);
■ Hierarchical priors; or just
■ Other people's priors.
(For more interesting stuff, take a class dedicated to Bayesian Statistics.)
34 / 36
Up next
Moving on to...
1 The MLE
2 MLE (asymptotic) properties
3 Bayesian inference
4 Up next
35 / 36
Up next
Coming up
Review of testing
36 / 36