STA732
Statistical Inference
Lecture 23: Large-Sample Theory for Likelihood Ratio Tests
Yuansi Chen
Spring 2022
Duke University
https://www2.stat.duke.edu/courses/Spring22/sta732.01/
Recap from Lecture 22
1. Canonical linear model
$$Z = \begin{pmatrix} Z_0 \\ Z_1 \\ Z_r \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_0 \\ \mu_1 \\ 0 \end{pmatrix}, \; \sigma^2 \mathbb{I}_n \right)$$
• $\sigma^2$ known, $d_1 = 1$, $Z$-test: $Z_1/\sigma$
• $\sigma^2$ unknown, $d_1 = 1$, $t$-test: $Z_1/\hat{\sigma}$
• $\sigma^2$ known, $d_1 \ge 1$, $\chi^2$-test: $\|Z_1\|_2^2 / \sigma^2$
• $\sigma^2$ unknown, $d_1 \ge 1$, $F$-test: $(\|Z_1\|_2^2 / d_1) / \hat{\sigma}^2$
2. General linear model: find an orthogonal matrix $Q$ such that $Q^\top Y$ follows the canonical linear model
Goal of Lecture 23
1. Wald test
2. Score test
3. Generalized likelihood ratio test
Chap. 17.1-3 of Keener or 12.4 in Lehmann and Romano
Review the asymptotics of MLE
Setup
$X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} p_\theta(x)$, where $p_\theta(\cdot)$ is “regular” enough (check the conditions in Thm 9.14 of Keener)
Consistency of MLE on compact Ω
Define
$$W_i(\theta) = \ell_1(\theta; X_i) - \ell_1(\theta_0; X_i), \qquad \bar{W}_n = \frac{1}{n} \sum_{i=1}^n W_i$$
We know that
$$\mathbb{E} W_i(\theta) = -\mathcal{D}_{\mathrm{KL}}(\theta_0 \,\|\, \theta) \le 0,$$
with equality iff $P_\theta = P_{\theta_0}$.
Consistency result
If the model is identifiable and $W_i$ is a continuous random function, then
• $\|\bar{W}_n - \mathbb{E}\bar{W}_n\|_\infty \overset{p}{\to} 0$ on compact $\Omega$.
• Then $\hat{\theta}_n \overset{p}{\to} \theta_0$ (convergence of the argmax requires the uniform convergence result in Thm 9.4 of Keener).
Asymptotic distribution of MLE
MLE satisfies
$$0 = \nabla\ell_n(\hat{\theta}_n) = \nabla\ell_n(\theta_0) + \nabla^2\ell_n(\tilde{\theta}_n)(\hat{\theta}_n - \theta_0).$$
Then
$$\sqrt{n}(\hat{\theta}_n - \theta_0) = \left( -\frac{1}{n} \nabla^2\ell_n(\tilde{\theta}_n) \right)^{-1} \left( \frac{1}{\sqrt{n}} \nabla\ell_n(\theta_0) \right)$$
• $\left( -\frac{1}{n} \nabla^2\ell_n(\tilde{\theta}_n) \right)^{-1} \overset{p}{\to} I_1(\theta_0)^{-1}$ (convergence of a random function evaluated at a random point requires the uniform convergence result in Thm 9.4 of Keener!)
• $\frac{1}{\sqrt{n}} \nabla\ell_n(\theta_0) \Rightarrow \mathcal{N}(0, I_1(\theta_0))$ (CLT)
By Slutsky’s thm, $\sqrt{n}(\hat{\theta}_n - \theta_0) \Rightarrow \mathcal{N}(0, I_1(\theta_0)^{-1})$
$$\sqrt{n}(\hat{\theta}_n - \theta_0) \Rightarrow \mathcal{N}(0, I_1(\theta_0)^{-1})$$
We can use the asymptotic distribution to compute confidence
regions!
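For instance, a quick simulation sketch (not from the slides; the Exponential model and all variable names are illustrative assumptions): for $X_i \overset{\text{i.i.d.}}{\sim} \mathrm{Exp}(\theta)$ in the rate parameterization, $\hat{\theta}_n = 1/\bar{X}$ and $I_1(\theta) = 1/\theta^2$, so $\sqrt{n}(\hat{\theta}_n - \theta_0)$ should look like $\mathcal{N}(0, \theta_0^2)$.

```python
import numpy as np

# Sanity-check MLE asymptotics in an Exponential(theta) model (rate
# parameterization): theta_hat = 1/Xbar and I_1(theta) = 1/theta^2, so
# sqrt(n)(theta_hat - theta_0) should be approximately N(0, theta_0^2).
rng = np.random.default_rng(0)
theta0, n, reps = 2.0, 2000, 5000

samples = rng.exponential(scale=1 / theta0, size=(reps, n))
theta_hat = 1 / samples.mean(axis=1)        # MLE in each replication
z = np.sqrt(n) * (theta_hat - theta0)

print(z.mean())   # should be close to 0
print(z.var())    # should be close to theta0**2 = 4
```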
Wald test
Intuition for Wald-type confidence regions (1)
Assume we have an estimator $\hat{I}_n \succeq 0$ such that
$$\frac{1}{n} \hat{I}_n \overset{p}{\to} I_1(\theta_0)$$
Then we can use it as a plug-in estimate for $I_1(\theta_0)$ in the asymptotic distribution:
Since $\sqrt{n}(\hat{\theta}_n - \theta_0) \Rightarrow \mathcal{N}(0, I_1(\theta_0)^{-1})$,
then $(I_1(\theta_0))^{1/2} \sqrt{n}(\hat{\theta}_n - \theta_0) \Rightarrow \mathcal{N}(0, \mathbb{I}_d)$,
and by Slutsky’s thm,
$$\hat{I}_n^{1/2}(\hat{\theta}_n - \theta_0) \Rightarrow \mathcal{N}(0, \mathbb{I}_d)$$
Intuition for Wald-type confidence regions (2)
Under the null hypothesis 𝐻0 ∶ 𝜃 = 𝜃0 , we have
$$\left\| \hat{I}_n^{1/2}(\hat{\theta}_n - \theta_0) \right\|_2^2 \Rightarrow \chi^2_d$$
We can construct a test that rejects for large values of $\| \hat{I}_n^{1/2}(\hat{\theta}_n - \theta_0) \|_2^2$:
$$\phi = 1\left\{ \left\| \hat{I}_n^{1/2}(\hat{\theta}_n - \theta_0) \right\|_2^2 > \chi^2_d(\alpha) \right\}$$
Remark
• The test might not have the correct level; it only has asymptotic level $\alpha$.
• The confidence region is the ellipsoid
$$\hat{\theta}_n + \hat{I}_n^{-1/2} \, \mathbb{B}\left(0, \sqrt{\chi^2_d(\alpha)}\right)$$
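A minimal sketch of the rejection rule in code (an illustrative assumption, not from the slides; here `I_hat` is the estimated total information and $\chi^2_d(\alpha)$ denotes the upper-$\alpha$ quantile):

```python
import numpy as np
from scipy.stats import chi2

def wald_test(theta_hat, I_hat, theta0, alpha=0.05):
    """Wald test of H0: theta = theta0.

    I_hat: estimated total Fisher information (d x d, positive definite).
    Returns the statistic ||I_hat^{1/2}(theta_hat - theta0)||_2^2 and the
    reject decision at asymptotic level alpha.
    """
    diff = np.asarray(theta_hat) - np.asarray(theta0)
    stat = float(diff @ I_hat @ diff)            # quadratic form, no sqrt needed
    reject = stat > chi2.ppf(1 - alpha, df=diff.size)
    return stat, reject
```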
Two options for 𝐼𝑛̂
1. $I_n(\hat{\theta}_n)$, obtained by plugging the MLE into the Fisher information:
$$\hat{I}_n = I_n(\hat{\theta}_n) = \mathrm{Var}_\theta\left( \nabla\ell_n(\theta; X) \right) \Big|_{\theta = \hat{\theta}_n}$$
2. Observed Fisher information:
$$\hat{I}_n = -\nabla^2 \ell_n(\hat{\theta}_n; X)$$
Remark:
Both satisfy $\frac{1}{n} \hat{I}_n \overset{p}{\to} I_1(\theta_0)$ in the “regular” i.i.d. setting.
Wald interval for $\theta_j$
Since $\sqrt{n}(\hat{\theta}_n - \theta_0) \Rightarrow \mathcal{N}(0, I_1(\theta_0)^{-1})$,
multiplying by the $j$-th standard basis vector gives
$$\sqrt{n}(\hat{\theta}_{n,j} - \theta_{0,j}) \Rightarrow \mathcal{N}\left(0, (I_1(\theta_0)^{-1})_{jj}\right)$$
Using $\frac{1}{n} \hat{I}_n$ as a plug-in estimate for $I_1(\theta_0)$, we obtain the univariate interval
$$C_j = \hat{\theta}_{n,j} \pm \sqrt{(\hat{I}_n^{-1})_{jj}} \cdot z_{\alpha/2}$$
The glm function in R uses the above intervals, with $\hat{I}_n = -\nabla^2\ell_n(\hat{\theta}_n)$.
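A Python analogue using statsmodels (a sketch assuming statsmodels is available; the simulated design and coefficients are made up for illustration), whose reported standard errors and confidence intervals are exactly these Wald intervals:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))   # design with intercept
beta_true = np.array([0.5, 1.0, -1.0])
p = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, p)                         # logistic-regression data

res = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(res.params)       # MLE beta_hat
print(res.bse)          # sqrt((I_n_hat^{-1})_{jj})
print(res.conf_int())   # Wald intervals beta_hat_j +/- z_{alpha/2} * se_j
```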
Confidence ellipsoid for 𝜃0,𝑆
We want a confidence ellipsoid for $\theta_{0,S} = (\theta_{0,j})_{j \in S}$, $|S| = k$.
We have
$$\sqrt{n}\left( \hat{\theta}_{n,S} - \theta_{0,S} \right) \Rightarrow \mathcal{N}\left(0, (I_1(\theta_0)^{-1})_{SS}\right)$$
Then the confidence ellipsoid is
$$\hat{\theta}_{n,S} + \left( (\hat{I}_n^{-1})_{SS} \right)^{1/2} \mathbb{B}\left(0, \sqrt{\chi^2_k(\alpha)}\right)$$
Example: generalized linear model with fixed design
Suppose $x_1, \ldots, x_n \in \mathbb{R}^d$ are fixed and
$$Y_i \overset{\text{ind.}}{\sim} p_{\eta_i}(y_i) = e^{\eta_i y_i - A(\eta_i)} h(y_i), \qquad \eta_i = \beta^\top x_i$$
Link function
Let $\mu_i(\beta) = \mathbb{E}_\beta Y_i$. If $f(\mu_i) = \beta^\top x_i$, then $f$ is called the link function.
Common examples
• Logistic regression: $Y_i \overset{\text{ind.}}{\sim} \mathrm{Bernoulli}\left( \frac{e^{x_i^\top \beta}}{1 + e^{x_i^\top \beta}} \right)$
• Poisson log-linear model: $Y_i \overset{\text{ind.}}{\sim} \mathrm{Poisson}\left(e^{x_i^\top \beta}\right)$
Confidence interval in generalized linear model
$$\ell_n(\beta; Y) = \sum_{i=1}^n \left[ (x_i^\top \beta) y_i - A(x_i^\top \beta) + \log h(y_i) \right]$$
$$\nabla\ell_n(\beta; Y) = \sum_{i=1}^n \left[ y_i x_i - A'(x_i^\top \beta) x_i \right] = \sum_{i=1}^n \left( y_i - \mu_i(\beta) \right) x_i$$
$$-\nabla^2\ell_n(\beta; Y) = \sum_{i=1}^n A''(x_i^\top \beta)\, x_i x_i^\top = \sum_{i=1}^n \mathrm{Var}_\beta(y_i)\, x_i x_i^\top = \mathrm{Var}_\beta\left( \nabla\ell_n(\beta; Y) \right)$$
In a GLM, $-\nabla^2\ell_n(\beta; Y)$ is not random: it does not depend on $Y$.
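A minimal sketch (not from the slides; the function and variable names are illustrative) of Newton's method for logistic regression, implementing the score $\sum_i (y_i - \mu_i)\,x_i$ and information $\sum_i A''(x_i^\top\beta)\, x_i x_i^\top$ derived above, where $A(\eta) = \log(1 + e^\eta)$ so $A''(\eta) = \mu(1 - \mu)$:

```python
import numpy as np

def logistic_mle(X, y, n_iter=25):
    """Newton's method for logistic regression; X is (n, d), y is 0/1."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-X @ beta))             # mu_i(beta) = A'(x_i^T beta)
        score = X.T @ (y - mu)                       # sum_i (y_i - mu_i) x_i
        info = (X * (mu * (1 - mu))[:, None]).T @ X  # sum_i A''(eta_i) x_i x_i^T
        beta = beta + np.linalg.solve(info, score)   # Newton step
    return beta, info                                # MLE and information at MLE

# usage sketch: se = np.sqrt(np.diag(np.linalg.inv(info))) gives Wald SEs
```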
Can estimate $\hat{I}_n$ by plugging in the MLE
Applying our asymptotic result directly (or doing the Taylor expansion from scratch),
$$\hat{I}_n^{1/2}(\hat{\theta}_n - \theta_0) \Rightarrow \mathcal{N}(0, \mathbb{I}_d)$$
Pros and cons of Wald test
Advantages
• Easy to invert, simple confidence regions
• Asymptotically correct level
Disadvantages
• Have to compute MLE
• Depends on parameterization
• Relies on second order Taylor expansion of ℓ𝑛
• Need MLE to be consistent
• Confidence region might go outside of Ω
Score test
Intuition for score test
Testing 𝐻0 ∶ 𝜃 = 𝜃0 vs. 𝐻1 ∶ 𝜃 ≠ 𝜃0
We can bypass the quadratic approximation by using the score as the test statistic:
$$\frac{1}{\sqrt{n}} \nabla\ell_n(\theta_0) \Rightarrow \mathcal{N}(0, I_1(\theta_0))$$
Score test
Reject $H_0: \theta = \theta_0$ if
$$\left\| I_n(\theta_0)^{-1/2} \nabla\ell_n(\theta_0) \right\|_2^2 \ge \chi^2_d(\alpha)$$
If $d = 1$, we can just use a $Z$-test instead.
Advantages of the score test
• No quadratic approximation
• No MLE
The disadvantage is that it might not be easy to invert the test.
Score test is invariant to reparameterization
Assume $d = 1$, $\theta = g(\xi)$ with $g'(\xi) > 0$, and $q_\xi(x) = p_{g(\xi)}(x)$.
Show that the two test statistics are the same a.s. (a sketch of the computation follows below).
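Sketch of the chain-rule computation (filling in the step the exercise asks for): writing $\tilde{\ell}_n(\xi) = \ell_n(g(\xi))$ for the log-likelihood in the $\xi$-parameterization,
$$\tilde{\ell}_n'(\xi) = g'(\xi)\, \ell_n'(g(\xi)), \qquad \tilde{I}_n(\xi) = g'(\xi)^2\, I_n(g(\xi)),$$
so with $\theta_0 = g(\xi_0)$ and $g'(\xi_0) > 0$, the factors of $g'(\xi_0)$ cancel:
$$\tilde{I}_n(\xi_0)^{-1/2}\, \tilde{\ell}_n'(\xi_0) = \frac{g'(\xi_0)}{g'(\xi_0)}\, I_n(\theta_0)^{-1/2}\, \ell_n'(\theta_0) = I_n(\theta_0)^{-1/2}\, \ell_n'(\theta_0).$$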
Example 1: 𝑠-parameter exponential family
Suppose $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} p_\eta(x) = \exp(\eta^\top T(x) - A(\eta)) h(x)$. Derive the score test for $H_0: \eta = \eta_0$ (a sketch follows below).
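Sketch of the derivation (filling in the steps, using the standard exponential-family identities $\mathbb{E}_\eta T = \nabla A(\eta)$ and $\mathrm{Var}_\eta T = \nabla^2 A(\eta)$): with $\bar{T} = \frac{1}{n}\sum_{i=1}^n T(X_i)$,
$$\nabla\ell_n(\eta_0) = \sum_{i=1}^n \left( T(X_i) - \nabla A(\eta_0) \right) = n\left( \bar{T} - \nabla A(\eta_0) \right), \qquad I_n(\eta_0) = n \nabla^2 A(\eta_0),$$
so the score test rejects when
$$n \left( \bar{T} - \nabla A(\eta_0) \right)^\top \left( \nabla^2 A(\eta_0) \right)^{-1} \left( \bar{T} - \nabla A(\eta_0) \right) \ge \chi^2_s(\alpha).$$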
Example 2: Pearson 𝜒2 test
Suppose $N = (N_1, \ldots, N_d) \sim \mathrm{Multinom}(n, (\pi_1, \ldots, \pi_d))$, with density
$$\frac{n!\, \pi_1^{N_1} \cdots \pi_d^{N_d}}{N_1! \cdots N_d!}\, 1_{\{\sum_i N_i = n\}}$$
Note that since $\sum_{j=1}^d \pi_j = 1$, this is a full-rank $(d-1)$-parameter exponential family, with the possible parameterization
$$\pi_j = \begin{cases} \dfrac{1}{1 + \sum_{k>1} e^{\eta_k}} & j = 1 \\[6pt] \dfrac{e^{\eta_j}}{1 + \sum_{k>1} e^{\eta_k}} & j > 1 \end{cases}$$
Derive the score test (a sketch follows below).
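Sketch (filling in the computation): in the $\eta$-parameterization, the score at the null $\pi_0$ has components $N_j - n\pi_{0,j}$ for $j > 1$, and $I_n(\eta_0) = n\left( \mathrm{diag}(\pi_{0,-1}) - \pi_{0,-1}\pi_{0,-1}^\top \right)$. Carrying out the quadratic form, one can check that the score statistic simplifies to Pearson’s $\chi^2$ statistic
$$\sum_{j=1}^d \frac{(N_j - n\pi_{0,j})^2}{n\pi_{0,j}} \Rightarrow \chi^2_{d-1}.$$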
Generalized likelihood ratio test
GLRT in simple vs composite two-sided testing
Testing 𝐻0 ∶ 𝜃 = 𝜃0 vs. 𝐻1 ∶ 𝜃 ≠ 𝜃0
Taylor expansion around $\hat{\theta}_n$ gives
$$\ell_n(\theta_0) - \ell_n(\hat{\theta}_n) = \nabla\ell_n(\hat{\theta}_n)^\top(\theta_0 - \hat{\theta}_n) + \frac{1}{2}(\theta_0 - \hat{\theta}_n)^\top \nabla^2\ell_n(\tilde{\theta}_n)(\theta_0 - \hat{\theta}_n)$$
$$= 0 - \frac{1}{2} \left\| \left( -\frac{1}{n}\nabla^2\ell_n(\tilde{\theta}_n) \right)^{1/2} \left( \sqrt{n}(\theta_0 - \hat{\theta}_n) \right) \right\|_2^2 \Rightarrow -\frac{1}{2}\chi^2_d$$
why?
Test statistic in GLRT
$$2\log(\lambda) = 2\left( \ell_n(\hat{\theta}_n) - \ell_n(\theta_0) \right) \Rightarrow \chi^2_d$$
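A simulation sketch of this convergence (not from the slides; the Poisson model and parameter values are illustrative): for Poisson($\theta$) data, $2\log(\lambda) = 2\left( S\log(\hat{\theta}_n/\theta_0) - n(\hat{\theta}_n - \theta_0) \right)$ with $S = \sum_i X_i$ and $\hat{\theta}_n = \bar{X}$, and its rejection rate at the $\chi^2_1$ cutoff should be close to $\alpha$ under $H_0$:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
theta0, n, reps = 3.0, 500, 2000

x = rng.poisson(theta0, size=(reps, n))
theta_hat = x.mean(axis=1)                     # Poisson MLE
# 2 log(lambda); the log(x_i!) terms cancel in the difference
llr = 2 * (x.sum(axis=1) * np.log(theta_hat / theta0) - n * (theta_hat - theta0))
print(np.mean(llr > chi2.ppf(0.95, df=1)))     # should be close to 0.05
```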
GLRT in composite vs composite
Testing 𝐻0 ∶ 𝜃 ∈ Ω0 vs. 𝐻1 ∶ 𝜃 ∈ Ω\Ω0
The generalized likelihood ratio is
$$\lambda = \frac{\sup_{\Omega_1} L(\theta)}{\sup_{\Omega_0} L(\theta)}$$
The test statistic is
$$2\log(\lambda) = 2\left( \ell_n(\hat{\theta}_n) - \ell_n(\hat{\theta}_0) \right)$$
where $\hat{\theta}_0 = \arg\max_{\theta \in \Omega_0} \ell_n(\theta)$
Asymptotic distribution of $2\log(\lambda)$
Asymptotic distribution of $2\log(\lambda)$, see 17.2 of Keener
Assume $\Omega = \mathbb{R}^d$, $\Omega_0$ is a $d_0$-dimensional subspace, $\theta_0$ lies in the interior of $\Omega_0$, $\hat{\theta}_n$ is consistent, and $p_\theta(\cdot)$ is “regular” (as in the asymptotics of the MLE). Then
$$2\log(\lambda) = 2\left( \ell_n(\hat{\theta}_n) - \ell_n(\hat{\theta}_0) \right) \Rightarrow \chi^2_{d - d_0}$$
where $\hat{\theta}_0 = \arg\max_{\theta \in \Omega_0} \ell_n(\theta)$
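A concrete sanity check (an illustrative example, not from the slides): let $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\theta, \mathbb{I}_2)$ with $\Omega = \mathbb{R}^2$ and $\Omega_0 = \{\theta : \theta_2 = 0\}$, so $d = 2$, $d_0 = 1$. Here
$$\ell_n(\theta) = -\frac{n}{2}\|\bar{X} - \theta\|_2^2 + \text{const}, \qquad \hat{\theta}_n = \bar{X}, \qquad \hat{\theta}_0 = (\bar{X}_1, 0),$$
so $2\log(\lambda) = n\bar{X}_2^2$, and under $H_0$, $\sqrt{n}\bar{X}_2 \sim \mathcal{N}(0, 1)$, giving $2\log(\lambda) \sim \chi^2_1 = \chi^2_{d-d_0}$ exactly.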
Intuition for the asymptotic distribution
(See rigorous derivation in 17.2 Keener)
Assume $\theta_0 = 0$ and $I_1(0) = \mathbb{I}_d$ (after reparameterization). Then
• $\hat{\theta}_n \approx \mathcal{N}(\theta_0, \frac{1}{n}\mathbb{I}_d)$
• locally, $-\nabla^2\ell_n(\theta) \approx n\mathbb{I}_d$ near $\theta_0$
• $\ell_n(\theta) - \ell_n(\hat{\theta}_n) \approx -\frac{n}{2} \|\theta - \hat{\theta}_n\|_2^2$
• $\hat{\theta}_0 \approx \arg\min_{\theta \in \Omega_0} \|\theta - \hat{\theta}_n\|_2^2 = \mathrm{Proj}_{\Omega_0}(\hat{\theta}_n)$
• $2\left( \ell_n(\hat{\theta}_n) - \ell_n(\hat{\theta}_0) \right) \approx n \left\| \hat{\theta}_n - \mathrm{Proj}_{\Omega_0}(\hat{\theta}_n) \right\|_2^2 \Rightarrow \chi^2_{d - d_0}$
Asymptotic equivalence of the three tests
How close are the three tests asymptotically?
• Wald test: $\left\| \hat{I}_n^{1/2}(\hat{\theta}_n - \theta_0) \right\|_2^2$
• Score test: $\left\| I_n(\theta_0)^{-1/2} \nabla\ell_n(\theta_0) \right\|_2^2$
• GLRT: $2\left( \ell_n(\hat{\theta}_n) - \ell_n(\theta_0) \right)$
For large $n$, all are related to
$$\left\| I_n(\theta_0)^{1/2}(\hat{\theta}_n - \theta_0) \right\|_2^2$$
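Sketch of why (filling in the step, under $H_0$ and the usual regularity): the MLE equation and a Taylor expansion give
$$\nabla\ell_n(\theta_0) = -\nabla^2\ell_n(\tilde{\theta}_n)(\theta_0 - \hat{\theta}_n) \approx I_n(\theta_0)(\hat{\theta}_n - \theta_0),$$
so the score statistic $\approx \left\| I_n(\theta_0)^{1/2}(\hat{\theta}_n - \theta_0) \right\|_2^2$; a second-order expansion of $\ell_n$ around $\hat{\theta}_n$ gives $2(\ell_n(\hat{\theta}_n) - \ell_n(\theta_0)) \approx (\hat{\theta}_n - \theta_0)^\top I_n(\theta_0)(\hat{\theta}_n - \theta_0)$, the same quantity; and the Wald statistic replaces $I_n(\theta_0)$ by $\hat{I}_n$.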
Summary
• Wald test: test statistic based on quadratic approx
• Score test: test statistic using score
• Generalized likelihood ratio test: 2 log(𝜆)
We intuitively derived its asymptotic distribution
Read Page 362 of Keener for strengths and weaknesses
What is next?
• Final review
Thank you for attending
See you on Wednesday in Old Chem 025