
Decision Theory

(Largely parametric) Estimation

Prof. Dr. Matei Demetrescu

1 / 36
Stronger assumptions may actually help...
We've seen that

■ some estimators don't require any assumptions except the ones we make about the sampling scheme, yet
■ others manage to estimate parameters on the basis of minimal additional assumptions.

Neither the plug-in estimator(s) nor the method-of-moments approach makes any optimality/efficiency claim.

This may however be achieved in a more parametric setup!

2 / 36
Today's outline
(Largely parametric) Estimation
1 The MLE

2 MLE (asymptotic) properties

3 Bayesian inference

4 Up next

3 / 36
The MLE

Moving on to...
1 The MLE

2 MLE (asymptotic) properties

3 Bayesian inference

4 Up next

4 / 36
The MLE

R. A. Fisher: Likelihood estimation


The idea is to pick those parameter values as estimates that have most
plausibly generated the observed sample, as quantified by the likelihood
$$L(\theta; x) := f_{\vec X}(x; \theta).$$
Formally, our Maximum Likelihood estimate is, for a given sample x,
$$\hat\theta = \arg\max_{\theta \in \Theta} L(\theta; x).$$
Given random sampling (and measurability) we end up with the ML estimator
$$\hat\theta = \arg\max_{\theta \in \Theta} L(\theta; \vec X) = \arg\max_{\theta \in \Theta} \log L(\theta; \vec X).$$
5 / 36
The MLE

Different samples, different outcomes

[Figure: likelihood (left) and log-likelihood (right) as functions of µ over (−2, 2), for different samples.]

6 / 36
The MLE

Implementation of the MLE


For smooth log-likelihoods with the maximum in the interior of Θ,

■ Solve the first-order conditions (f.o.c.),
$$\nabla \log L(\hat\theta; \vec X) = \nabla \ell(\hat\theta; \vec X) = \frac{\partial \ell(\hat\theta; \vec X)}{\partial \theta} = 0;$$
■ More often than not this needs to be done numerically (see the sketch below).
■ The 2nd-order condition is a negative definite Hessian $\frac{\partial^2 \ell(\hat\theta; \vec X)}{\partial \theta \, \partial \theta'}$.

Even without interior solutions or smoothness of ℓ, a value θ̂ that solves
$\arg\max_{\theta \in \Theta} \ell(\theta; \vec X)$ is a ML estimate, no matter how it is obtained.
7 / 36
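A minimal numerical sketch (an illustration, not part of the slides): maximize a log-likelihood by minimizing its negative with a generic optimizer. The exponential model, the simulated data, and the choice of Nelder-Mead are assumptions made for this example; the closed-form MLE (the sample mean, see the example slide below) serves as a check.

```python
# Minimal sketch: maximizing a log-likelihood numerically (illustration only).
# Assumed model: iid exponential data with density f(x; theta) = (1/theta) exp(-x/theta).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=200)    # simulated sample, true theta = 2

def negative_log_likelihood(theta, data):
    theta = theta[0]
    if theta <= 0:                           # stay inside Theta = (0, infinity)
        return np.inf
    return -np.sum(-np.log(theta) - data / theta)

# Numerical maximization = minimization of the negative log-likelihood.
result = minimize(negative_log_likelihood, x0=[1.0], args=(x,), method="Nelder-Mead")
theta_hat = result.x[0]

print(theta_hat, x.mean())                   # for this model the MLE is the sample mean
```

The same pattern carries over to models without a closed-form solution; only the `negative_log_likelihood` function changes.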
The MLE

A hopefully rare situation

Take e.g. Θ = [0, ∞).

With no stationary points, we need to check the behavior at the boundaries!
(Should in fact do that anyway.)

In this case, θ̂ = 0.

[Figure: a log-likelihood that decreases monotonically over [0, 10], so the maximum lies at the boundary.]

8 / 36
The MLE

Two very different cases

Example (Smooth... )
Let X⃗ = (X1 , ..., Xn ) be an iid sample from an exponential population,
$$f(x; \theta) = \tfrac{1}{\theta}\exp\left(-\tfrac{x}{\theta}\right), \quad x \in (0, \infty),\ \theta \in \Theta = (0, \infty).$$
The MLE of θ is X̄n .

Example (... and not so smooth)
Let X⃗ = (X1 , ..., Xn ) be an iid sample from a uniform population,
$$f(x; a, b) = \tfrac{1}{b-a}\,\mathbb{I}_{[a,b]}(x), \quad -\infty < a < b < \infty.$$
The MLEs are â = X[1] and b̂ = X[n] .
9 / 36
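As a quick numerical check (simulation code is an illustration, not from the slides), both closed-form MLEs are one-liners:

```python
# Minimal sketch: the closed-form MLEs of both examples on simulated data.
import numpy as np

rng = np.random.default_rng(0)

# Smooth case: exponential(theta), the MLE is the sample mean.
x_exp = rng.exponential(scale=3.0, size=500)
theta_hat = x_exp.mean()

# Non-smooth case: uniform(a, b), the MLEs are the extreme order statistics.
x_uni = rng.uniform(low=-1.0, high=2.0, size=500)
a_hat, b_hat = x_uni.min(), x_uni.max()

print(theta_hat, a_hat, b_hat)   # close to 3, -1, 2 for this sample size
```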
The MLE

Concentrating out parameters

Let θ = (θ₁′, θ₂′)′; θ₁ is of interest and θ₂ is a nuisance ...

Example (Normal population)
Let θ = (µ, σ²)′, with σ² the nuisance parameter. Then
$$\ell\left(\mu, \sigma^2; x\right) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2,$$
leading to the f.o.c.
$$\nabla\ell = \begin{pmatrix} \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu) \\[6pt] -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (X_i - \mu)^2 \end{pmatrix} = 0.$$

We may solve the f.o.c. for µ independently of σ²!


10 / 36
The MLE

Profile likelihood
Example (continued)

In fact, one may compute $\sigma^2(\mu) = \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2$ and plug it into ℓ
to obtain the concentrated or profile likelihood
$$\ell(\mu) = -\frac{n}{2}\log 2\pi - \frac{n}{2} - \frac{n}{2}\log\left(\frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2\right),$$
which is still optimized at µ̂ = X̄.

■ Even if we can't directly solve the f.o.c. for θ₁ only,
■ we may still solve for θ₂ = f(θ₁), and plug this expression into ℓ to obtain the profile log-likelihood (see the sketch below).


11 / 36
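A minimal sketch of the profile idea (illustration only; the data and the grid are assumptions): concentrate σ² out, evaluate the profile log-likelihood over a grid of µ values, and confirm that the maximizer is the sample mean.

```python
# Minimal sketch: the profile log-likelihood of mu for a normal sample,
# with sigma^2 concentrated out (illustration only).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=1.5, scale=2.0, size=300)
n = x.size

def profile_loglik(mu):
    sigma2_mu = np.mean((x - mu) ** 2)          # sigma^2(mu) plugged back in
    return -n / 2 * np.log(2 * np.pi) - n / 2 - n / 2 * np.log(sigma2_mu)

grid = np.linspace(x.mean() - 1, x.mean() + 1, 2001)
mu_hat = grid[np.argmax([profile_loglik(m) for m in grid])]

print(mu_hat, x.mean())   # the profile maximizer is (numerically) the sample mean
```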
The MLE

Conditional likelihood I
Conditioning on a (sufficient) statistic S for θ₂ may also eliminate the nuisance parameter.

A simple (and intuitive) form of this conditioning is available for regression models with stochastic regressors.

Let f(x, y; θ) be the joint density of X and Y (both scalar for simplicity).

■ The likelihood is $L(\theta) = \prod_{i=1}^n f(x_i, y_i; \theta)$
■ ... which can be a complicated expression (even in log form).

But say you're only interested in the dependence of Y on X ...

12 / 36
The MLE

Conditional likelihood II
Express the joint pdf as
$$f(x, y; \theta) = f_{Y|X}(y, x; \theta_1) \cdot f_X(x; \theta_2),$$
leading to a log-likelihood
$$\ell = \sum_{i=1}^n \log f_{Y|X}(Y_i, X_i; \theta_1) + \sum_{i=1}^n \log f_X(X_i; \theta_2),$$
where maximizations w.r.t. θ₁ and θ₂ are entirely independent!

If the marginal behavior of X is not of interest, just specify fY|X !
(Or just the regression function if not interested in more.)
13 / 36
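A minimal sketch of the factorization (the toy model below, X ∼ N(m, 1) and Y|X ∼ N(a + bX, 1) with unit variances, is an assumption made for illustration): the log-likelihood splits into a conditional part in θ₁ = (a, b) and a marginal part in θ₂ = m, and each part can be maximized on its own.

```python
# Minimal sketch: the joint log-likelihood separates into a conditional and a
# marginal part, which can be maximized separately (assumed toy model).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, size=400)                 # marginal: X ~ N(m, 1)
y = 0.5 + 2.0 * x + rng.normal(size=400)          # conditional: Y | X ~ N(a + bX, 1)

def neg_conditional(theta1):                      # depends on theta_1 = (a, b) only
    a, b = theta1
    return -np.sum(norm.logpdf(y, loc=a + b * x, scale=1.0))

def neg_marginal(theta2):                         # depends on theta_2 = m only
    (m,) = theta2
    return -np.sum(norm.logpdf(x, loc=m, scale=1.0))

a_hat, b_hat = minimize(neg_conditional, x0=[0.0, 0.0], method="Nelder-Mead").x
m_hat = minimize(neg_marginal, x0=[0.0], method="Nelder-Mead").x[0]
print(a_hat, b_hat, m_hat)
```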
The MLE

ML as MM
Let $X_i \overset{iid}{\sim} f(x; \theta)$ s.t. $\ln L(\theta; \vec X) = \sum_{i=1}^n \ln f(X_i; \theta)$.
Regularity conditions assumed, we have (in the continuous case)
$$\operatorname{E}\!\left[\frac{\partial \ln f(X_i; \theta)}{\partial\theta}\right] = 0.$$
Hence,
$$\operatorname{E}_{\theta_0}\!\left(g(X_i, \theta_0)\right) = \operatorname{E}_{\theta_0}\!\left[\frac{\partial \ln f(X_i; \theta_0)}{\partial\theta}\right] = 0,$$
which are moment conditions with sample counterparts
$$\frac{1}{n}\sum_{i=1}^n g(X_i, \theta) = \frac{1}{n}\sum_{i=1}^n \frac{\partial \ln f(X_i; \theta)}{\partial\theta} = \frac{1}{n}\frac{\partial \ln L(\theta; \vec X)}{\partial\theta} = 0.$$
14 / 36
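A minimal sketch (illustration only, exponential model assumed): the score is a moment function whose sample average is approximately zero at the true parameter and exactly zero at the MLE θ̂ = X̄.

```python
# Minimal sketch: the score of the exponential model as a moment condition.
import numpy as np

rng = np.random.default_rng(3)
theta0 = 2.0
x = rng.exponential(scale=theta0, size=100_000)

def score(theta, data):
    # d/dtheta ln f(x; theta) for f(x; theta) = (1/theta) exp(-x/theta)
    return -1.0 / theta + data / theta**2

theta_hat = x.mean()
print(score(theta0, x).mean())     # close to 0 (population moment condition)
print(score(theta_hat, x).mean())  # exactly 0 up to rounding (sample counterpart)
```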
The MLE

Quasi likelihood
Often, we do not really know what the population distribution looks like.

■ But you still have an iid sample X 1 , . . . , X n , and
■ you have assumed something, say that X ∼ g(x, θ).
(Perhaps even knowing it's not the right density.)

This allows you to set up a likelihood,
$$L(\theta; X_1, \ldots, X_n) = f_{\vec X}(X_1, \ldots, X_n; \theta) = \prod_{i=1}^n g(X_i, \theta).$$
And you can easily compute the resulting ML estimator.

But if actually X i ∼ f(x), what does θ̂ behave like?


15 / 36
The MLE

Some simple linear regression


Example
Assume Y|X = x ∼ N(α + βx, σ²). The conditional likelihood is then all you have,
$$L(\alpha, \beta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{Y_i - \alpha - \beta X_i}{\sigma}\right)^2},$$
leading to the log-likelihood
$$\ell(\alpha, \beta) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \alpha - \beta X_i)^2.$$

16 / 36
The MLE

Quasi-ML
Maximizing the Gaussian (conditional) likelihood is equivalent to minimizing the residual sum of squares (see the sketch below). But Least Squares is nice!

So Gaussian Quasi-ML also behaves nicely (in semiparametric ways):

■ The important part is that the parameters of interest are still identified!
■ Identification: the expected gradient of the quasi-log-likelihood still has the true parameters as zeros, which amounts to
■ ... correctly specifying the conditional mean for the regression case.

From this perspective, Quasi-ML can be interpreted as a method-of-moments estimator too.


17 / 36
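A minimal sketch (illustration only; the data-generating process with Laplace errors is an assumption): maximizing the Gaussian quasi-likelihood numerically returns the same (α, β) as least squares, even though the errors are not Gaussian here.

```python
# Minimal sketch: Gaussian quasi-ML for a linear regression equals least squares.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = rng.normal(size=500)
y = 1.0 + 0.7 * x + rng.laplace(scale=1.0, size=500)    # non-Gaussian errors on purpose

def neg_gaussian_quasi_loglik(params):
    alpha, beta, log_sigma2 = params
    sigma2 = np.exp(log_sigma2)                          # keep sigma^2 > 0
    resid = y - alpha - beta * x
    return 0.5 * (len(y) * np.log(2 * np.pi * sigma2) + np.sum(resid**2) / sigma2)

opt = minimize(neg_gaussian_quasi_loglik, x0=[0.0, 0.0, 0.0],
               method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 10_000})
qmle = opt.x[:2]

# Ordinary least squares for comparison.
X = np.column_stack([np.ones_like(x), x])
ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print(qmle, ols)                                         # (numerically) the same alpha, beta
```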
MLE (asymptotic) properties

Moving on to...
1 The MLE

2 MLE (asymptotic) properties

3 Bayesian inference

4 Up next

18 / 36
MLE (asymptotic) properties

Sometimes we're (not) lucky


In many cases, MLEs are explicit functions of the sample, $\hat\theta = \theta(\vec X)$, and we may analyze them directly:

■ either derive the exact sampling distribution, or
■ provide some asymptotic approximation (using WLLN, CLT, delta method).

Otherwise, asymptotics are the only chance:

■ Argue first that the ML estimator is consistent, and
■ (for smooth likelihoods) linearize the f.o.c. to establish normality.

We don't deal with this here (but see perhaps the additional material in moodle).
19 / 36
MLE (asymptotic) properties

Need to talk about identification

Note that

■ ... identification is a blurred concept in the nonparametric case.
■ This changes in the (fully specified) parametric case!

Definition (model-based, global identification)
A parametric model is identified if and only if, for any sample X⃗,
$f_{\vec X}\left(\vec X \mid \theta_1\right) = f_{\vec X}\left(\vec X \mid \theta_2\right)$ implies $\theta_1 = \theta_2$.

If identification is given then the MLE must be unique.


20 / 36
MLE (asymptotic) properties

The asymptotic take


If consistency is given, then

Proposition
Let θ = θ0 . Regularity conditions assumed, it then holds
$$\sqrt{n}\,(\hat\theta - \theta_0) \overset{d}{\to} N\!\left(0,\; \frac{\operatorname{E}\!\left[\left(\partial \ln f(X_i; \theta_0)/\partial\theta\right)^2\right]}{\left(\operatorname{E}\!\left[\partial^2 \ln f(X_i; \theta_0)/\partial\theta^2\right]\right)^2}\right).$$

21 / 36
MLE (asymptotic) properties

The information equality


Under additional regularity conditions,
$$\operatorname{E}\!\left[\left(\frac{\partial \ln f(X_i; \theta)}{\partial\theta}\right)^2\right] = -\operatorname{E}\!\left[\frac{\partial^2 \ln f(X_i; \theta)}{\partial\theta^2}\right] = -I(\theta),$$
so that
$$\sqrt{n}\,(\hat\theta - \theta_0) \overset{d}{\to} N\!\left(0;\, -I^{-1}(\theta_0)\right) \equiv N\!\left(0;\, \left(\operatorname{E}\!\left[\left(\frac{\partial \ln f(X_i; \theta_0)}{\partial\theta}\right)^2\right]\right)^{-1}\right).$$

In such cases the asymptotic variance of the MLE attains the so-called Cramér-Rao lower bound: the MLE is asymptotically efficient.


22 / 36
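A minimal simulation sketch (illustration only, exponential model assumed): for f(x; θ) = (1/θ) exp(−x/θ), both sides of the information equality equal 1/θ², which the sample averages reproduce.

```python
# Minimal sketch: checking the information equality by simulation for the
# exponential model, where both sides equal 1/theta^2.
import numpy as np

rng = np.random.default_rng(5)
theta0 = 2.0
x = rng.exponential(scale=theta0, size=1_000_000)

score = -1.0 / theta0 + x / theta0**2                # d/dtheta ln f(x; theta0)
hessian = 1.0 / theta0**2 - 2.0 * x / theta0**3      # d^2/dtheta^2 ln f(x; theta0)

print(np.mean(score**2), -np.mean(hessian), 1 / theta0**2)   # all approximately equal
```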
MLE (asymptotic) properties

Estimating standard errors


Consistent estimates for the variance of the asymptotic distribution of MLEs can be obtained by
$$\widehat{\operatorname{E}}\!\left[\frac{\partial^2 \ln f(X_i; \theta_0)}{\partial\theta^2}\right] = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2 \ln f(X_i; \hat\theta)}{\partial\theta^2}, \qquad
\widehat{\operatorname{E}}\!\left[\left(\frac{\partial \ln f(X_i; \theta_0)}{\partial\theta}\right)^2\right] = \frac{1}{n}\sum_{i=1}^n \left(\frac{\partial \ln f(X_i; \hat\theta)}{\partial\theta}\right)^2.$$

This delivers the so-called asymptotic standard errors of MLEs.


23 / 36
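A minimal sketch (illustration only, exponential model assumed): the two plug-in variance estimates, based on the average Hessian and the average squared score evaluated at θ̂, and the resulting asymptotic standard errors.

```python
# Minimal sketch: two plug-in estimates of the asymptotic variance for the
# exponential MLE theta_hat = X_bar, and the implied standard errors.
import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(scale=2.0, size=400)
n = x.size
theta_hat = x.mean()

score = -1.0 / theta_hat + x / theta_hat**2
hessian = 1.0 / theta_hat**2 - 2.0 * x / theta_hat**3

var_hessian = -1.0 / np.mean(hessian)          # -(average Hessian)^(-1)
var_opg = 1.0 / np.mean(score**2)              # (average squared score)^(-1)

print(np.sqrt(var_hessian / n), np.sqrt(var_opg / n))   # two versions of the standard error
```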
Bayesian inference

Moving on to...
1 The MLE

2 MLE (asymptotic) properties

3 Bayesian inference

4 Up next

24 / 36
Bayesian inference

A change of perspective
So far, we assumed that

■ our sample X⃗ is random with some joint distribution/pdf $f_{\vec X}(x; \theta)$, where θ is some fixed, though unknown, parameter.
■ The data are used to obtain an estimate for θ, and
■ the sampling distribution of the estimator describes the uncertainty,
■ which was analyzed in a (fictional?) repeated sampling framework.

In Bayesian statistics, θ itself is regarded as a random variable.

A random θ captures the uncertainty about the parameter,
... and allows for (different) inference from data!


25 / 36
Bayesian inference

Why would we want random parameters?


If θ is (regarded as) random, then X⃗ and θ have a joint distribution.
Moreover, conditional distributions have nice interpretations:

■ $f\left(\vec X \mid \theta\right)$ is the distribution of the sample given specific parameter values (i.e. the data generating process);
■ $f\left(\theta \mid \vec X\right)$ is the distribution of the parameters given the sample.

The so-called posterior distribution $f\left(\theta \mid \vec X\right)$ then conveys all the available information about θ.

The goal of Bayesian statistics is to provide $f\left(\theta \mid \vec X\right)$.
26 / 36
Bayesian inference

Setup
Assume that X⃗ = (X 1 , ..., X n ) is an iid sample from the joint pdf $f_{\vec X}(x \mid \theta)$, defining the sample likelihood.

We have $f_{\vec X}(x \mid \theta) = L(\theta; x)$, but would like $f\left(\theta \mid \vec X\right)$!

To get there, we need another ingredient:

■ Any information we have about θ before observing the data X⃗ is represented by the so-called prior density, denoted by f(θ).
■ This is a density which we need to specify in such a way that it reflects our prior knowledge or beliefs (or both).

27 / 36
Bayesian inference

The Bayes theorem


If we have a prior, then we may resort to Bayes' theorem.
The posterior density is then
$$f(\theta \mid x) = \frac{f(\theta, x)}{f(x)} = \frac{f_{\vec X}(x \mid \theta)\, f(\theta)}{f(x)} = \frac{f_{\vec X}(x \mid \theta)\, f(\theta)}{\int f_{\vec X}(x \mid \theta)\, f(\theta)\, d\theta}.$$

The posterior combines

■ the prior knowledge about θ, summarized by the prior f(θ), with
■ the information about θ contained in the data X⃗, entering through the likelihood $L\left(\theta; \vec X\right)$ given as $f_{\vec X}(x \mid \theta)$ evaluated at X⃗.
28 / 36
Bayesian inference

Almost like learning


The posterior

■ represents our revised knowledge/beliefs about the distribution of θ after seeing the data X⃗,
■ is obtained as a mixture of the prior information and current information (data!),
■ is available to be the prior when the next body of data is available.

Essentially, we continuously update our knowledge about θ.

One other advantage over classical statistics is that

■ once you agree on prior and model,
■ you always get the same answer from the data.
29 / 36


Bayesian inference

Focus on the essential


In this setting, f(x) is referred to as the marginal data density, which does not involve the parameter of interest. With ∝ meaning 'is proportional to',
$$\underbrace{f(\theta \mid x)}_{\text{posterior}} \;\propto\; \underbrace{f_{\vec X}(x \mid \theta)}_{\text{likelihood}} \cdot \underbrace{f(\theta)}_{\text{prior}}.$$

Note that the product $L\left(\theta; \vec X\right) f(\theta)$ does not define a proper density. It represents a so-called density kernel for the posterior density of θ.

■ Sometimes the kernel suffices (e.g. when finding the posterior mode).
■ Other times we need to compute the normalizing constant, or find some other workaround (see the sketch below).


30 / 36
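A minimal sketch of one such workaround (illustration only; the N(µ, 1) likelihood with a Laplace prior is an assumed toy setup): evaluate the posterior kernel on a grid and normalize it numerically.

```python
# Minimal sketch: when the normalizing constant has no closed form, the posterior
# kernel can be normalized numerically on a grid (assumed toy setup).
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=0.8, scale=1.0, size=50)     # iid N(mu, 1) data

mu_grid = np.linspace(-3, 3, 2001)

def log_kernel(mu):
    log_likelihood = -0.5 * np.sum((x[:, None] - mu[None, :]) ** 2, axis=0)
    log_prior = -np.abs(mu)                     # Laplace(0, 1) prior, up to a constant
    return log_likelihood + log_prior

log_k = log_kernel(mu_grid)
kernel = np.exp(log_k - log_k.max())            # subtract the max for numerical stability
d_mu = mu_grid[1] - mu_grid[0]
posterior = kernel / (kernel.sum() * d_mu)      # normalize so it integrates to (about) one

posterior_mean = np.sum(mu_grid * posterior) * d_mu
posterior_mode = mu_grid[np.argmax(posterior)]
print(posterior_mean, posterior_mode)
```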
Bayesian inference

The normal-normal example


Example

Let X⃗ = (X1 , ..., Xn ) be an iid sample from a N(µ, σ²) population.

■ The variance σ² is assumed to be known.
■ Say the prior information can be represented by a µ ∼ N(m, v²) prior distribution, where the prior parameters (m, v²) are also known.

The posterior f(µ|X⃗) is then a N(µ∗ , σ∗²) density with
$$\sigma_*^2 = \frac{\sigma^2 v^2}{n v^2 + \sigma^2}, \qquad \mu_* = \frac{n \bar X v^2 + m \sigma^2}{n v^2 + \sigma^2}, \quad \text{where } \bar X = \frac{1}{n}\sum_i X_i.$$
31 / 36
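A minimal sketch of the update formulas from the example (the concrete numbers are assumptions chosen for illustration):

```python
# Minimal sketch: the normal-normal posterior update with known sigma^2 and
# known prior parameters (m, v^2).
import numpy as np

rng = np.random.default_rng(8)
sigma2 = 4.0                      # known data variance
m, v2 = 0.0, 1.0                  # prior: mu ~ N(m, v2)
x = rng.normal(loc=1.2, scale=np.sqrt(sigma2), size=25)

n, xbar = x.size, x.mean()
sigma2_post = sigma2 * v2 / (n * v2 + sigma2)
mu_post = (n * xbar * v2 + m * sigma2) / (n * v2 + sigma2)

print(mu_post, sigma2_post)       # the posterior mean shrinks xbar towards the prior mean m
```

As n grows, the posterior mean moves towards X̄ and the posterior variance shrinks, which corresponds to the "Likelihood dominance" panel on the next slide.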
Bayesian inference

The normal-normal example cont'd

[Figure: prior, likelihood and posterior densities for µ in four panels: "Typical prior vs. posterior", "Likelihood dominance", "Flat likelihood", and "Data-prior conflict".]

32 / 36
Bayesian inference

Modelling choices
One problem with the Bayes approach is its dependence on a prior. (The likelihood is also a matter of choice, actually, but we ignore that for now.)

■ While there are cases where a particular prior is well-justified,
■ the prior may seem entirely arbitrary in others.

Good Bayesians make their choices transparent, and assess the sensitivity of the results w.r.t. the choices they make.

33 / 36
Bayesian inference

In practice
Applied researchers may consider

■ Priors chosen for computational convenience (e.g. conjugate priors);

■ So-called non-informative priors (e.g. uniform distributions);

■ Empirical Bayes (estimate the prior from data);

■ Hierarchical priors; or just

■ Other people's priors.

(For more interesting stuff, take a class dedicated to Bayesian Statistics.)

34 / 36
Up next

Moving on to...
1 The MLE

2 MLE (asymptotic) properties

3 Bayesian inference

4 Up next

35 / 36
Up next

Coming up

Review of testing

36 / 36
