Advanced Econometrics I
Jürgen Meinecke
Lecture 3 of 12
Research School of Economics, Australian National University
Roadmap
  Ordinary Least Squares Estimation
     Basic Asymptotic Theory (part 2 of 2)
     Asymptotic Distribution of the OLS Estimator
     Asymptotic Variance Estimation
Let there be a probability space (Ω, ℱ , 𝑃)
  • Ω is the outcome space
  • ℱ collects events from Ω
  • 𝑃 is a probability measure on ℱ
Example (Only Looks Like Rolling a Die)
  • Ω = {1, 2, 3, 4, 5, 6}
  • ℱ = {{1, 3, 5} , {2, 4, 6} , Ω, ∅}
  • Consider all 𝐴 ∈ ℱ:
        𝑃(𝐴) = 0    if 𝐴 = ∅
        𝑃(𝐴) = 1/2  if 𝐴 = {1, 3, 5}
        𝑃(𝐴) = 1/2  if 𝐴 = {2, 4, 6}
        𝑃(𝐴) = 1    if 𝐴 = Ω
Notice that 𝑃({2}) is not specified
Definition (Random Variable—first attempt)
A random variable on (Ω, ℱ ) is a function 𝑍 ∶ Ω → R.
Example
    𝑋(𝜔) = 18 if 𝜔 even,  24 if 𝜔 odd
Induced probability Pr(𝑋 = 18) ∶= 𝑃({2, 4, 6}) = 1/2
Instead of writing Pr(𝑋 = 18) I will use 𝑃(𝑋 = 18)
Example
    𝑌(𝜔) = 2 if 𝜔 = 6,  7 if 𝜔 = 1
Induced probability Pr(𝑌 = 2) ∶= 𝑃({6}) = ?
The event {6} is not assigned a probability
Of course we have a reasonable suspicion that 𝑃({6}) should equal
1/6, but strictly speaking it was not defined two slides earlier
So we have to treat 𝑃({6}) as unknown
To make sure that our random variables are not ill-defined like this,
we need to rule out such situations
Here’s a more robust definition
Definition (Random Variable—second and final attempt)
A random variable on (Ω, ℱ ) is a function 𝑍 ∶ Ω → R such that
    {𝜔 ∈ Ω ∶ 𝑍(𝜔) ∈ 𝐵} ∈ ℱ       for all 𝐵 ∈ ℬ(R).
ℬ(R) is the 𝜎-algebra generated by the closed intervals [𝑎, 𝑏], for
𝑎, 𝑏 ∈ R
Intuition: ℬ(R) describes all events that can be created out of all
the points on the real line
ℬ(R) is a rich set containing pretty much every subset of R that we
will ever deal with (including intervals and points)
I don’t need you to understand all intricacies here
Bottom line is:
Each 𝐵 ∈ ℬ(R) gets pulled back to an element of ℱ for which
probabilities are well-defined
Using this more robust definition, 𝑌 is not a random variable
To see this, pick subsets from ℬ(R):
  • for 𝐵 = {2}: {𝜔 ∈ Ω ∶ 𝑌(𝜔) = 2} = {6} ∉ ℱ
  • the same problem arises for 𝐵 = {7}
The problem here is that 𝑌 is not ℱ -measurable
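On this finite outcome space we can verify measurability by brute force. Below is a minimal Python sketch of the two examples (the slides leave 𝑌 unspecified outside {1, 6}, so the 7-elsewhere choice is just a placeholder):

```python
# Brute-force measurability check on the finite example space
Omega = {1, 2, 3, 4, 5, 6}
F = [set(), {1, 3, 5}, {2, 4, 6}, Omega]   # the sigma-algebra from the example

def is_measurable(fn, values):
    # every preimage {w in Omega : fn(w) = b} must be an event in F
    return all({w for w in Omega if fn(w) == b} in F for b in values)

X = lambda w: 18 if w % 2 == 0 else 24
Y = lambda w: 2 if w == 6 else 7           # placeholder value outside {1, 6}

print(is_measurable(X, {18, 24}))          # True:  preimages are {2, 4, 6} and {1, 3, 5}
print(is_measurable(Y, {2, 7}))            # False: the preimage of 2 is {6}, not in F
```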
Definition (Distribution or Law)
Given a random variable 𝑍 on a probability space (Ω, ℱ , 𝑃), the
distribution or law of the random variable is the probability
measure defined by
    𝜇(𝐵) ∶= 𝑃(𝑍 ∈ 𝐵),       𝐵 ∈ ℬ(R).
We say that 𝜇 is the distribution of 𝑍, or ℒ(𝑍) is the law of 𝑍.
Definition (Distribution Function)
The distribution function of a random variable 𝑍 is defined by
   𝐹(𝑧) ∶= 𝜇((−∞, 𝑧]) = 𝑃(𝑍 ≤ 𝑧),      𝑧 ∈ R.
𝐹 is also referred to as cumulative distribution function or cdf.
There is a one-to-one correspondence between distributions and cdfs
So we use them interchangeably
Definition (Weak Convergence)
Let 𝐹 be a distribution function, and {𝐹𝑁 } be a sequence of
distribution functions. Then 𝐹𝑁 converges weakly to 𝐹 if
lim𝑁→∞ 𝐹𝑁 (𝑧) = 𝐹(𝑧) for each 𝑧 at which 𝐹 is continuous.
We write 𝐹𝑁 →ʷ 𝐹.
Equivalently we could say 𝜇𝑁 →ʷ 𝜇 for weak convergence
Definition (Convergence in Distribution)
Let 𝑍 be a random variable, and {𝑍𝑁 } be a sequence of random
variables. Then 𝑍𝑁 converges in distribution or law to 𝑍 if 𝐹𝑁 →ʷ 𝐹.
We write 𝑍𝑁 →ᵈ 𝑍.
Now we turn to a few practical results that will help us soon when we
derive the asymptotic distribution of 𝛽̂OLS
Theorem (Continuous Mapping Theorem)
If 𝑍𝑁 →ᵈ 𝑍 then 𝑔(𝑍𝑁) →ᵈ 𝑔(𝑍) for continuous 𝑔.
Corollary
If 𝑍𝑁 →ᵈ 𝑁(0, Ω) then
    𝐴𝑍𝑁 →ᵈ 𝑁(0, 𝐴Ω𝐴′)
    (𝐴 + o𝑝(1))𝑍𝑁 →ᵈ 𝑁(0, 𝐴Ω𝐴′),
and since 𝑍 ∼ 𝑁(0, Ω) ⇒ 𝑍′Ω⁻¹𝑍 ∼ 𝜒²(dim(𝑍)),
    𝑍′𝑁 Ω⁻¹ 𝑍𝑁 →ᵈ 𝜒²(dim(𝑍𝑁))
    𝑍′𝑁 (Ω + o𝑝(1))⁻¹ 𝑍𝑁 →ᵈ 𝜒²(dim(𝑍𝑁)).
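A quick simulation sketch of the last implication (my own toy setup, assuming iid exponential components so that 𝜇𝑍 = (1, 1)′ and Ω = 𝐼₂):

```python
# The quadratic form Z_N' Omega^{-1} Z_N is approximately chi-squared(dim(Z_N))
import numpy as np

rng = np.random.default_rng(0)
N, reps = 500, 5_000
stats = np.empty(reps)
for r in range(reps):
    Z = rng.exponential(scale=1.0, size=(N, 2))   # iid vectors, mean (1, 1)', variance I_2
    ZN = np.sqrt(N) * (Z.mean(axis=0) - 1.0)      # sqrt(N)(Zbar_N - mu_Z), approx N(0, I_2)
    stats[r] = ZN @ ZN                            # Omega = I_2, so this is Z_N' Omega^{-1} Z_N
print(stats.mean(), stats.var())                  # chi-squared(2) has mean 2 and variance 4
```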
Another important result for the sample average 𝑍̄𝑁 ∶= ∑ᵢ₌₁ᴺ 𝑍𝑖/𝑁.
Theorem (Central Limit Theorem (CLT))
Let 𝑍1, 𝑍2, … be a sequence of independent and identically
distributed random vectors with E‖𝑍𝑖‖² < ∞. Then
    √𝑁 (𝑍̄𝑁 − 𝜇𝑍) →ᵈ N(0, E((𝑍𝑖 − 𝜇𝑍)(𝑍𝑖 − 𝜇𝑍)′)),
where 𝜇𝑍 ∶= E𝑍𝑖.
Notice:
  • ‖𝑧‖ ∶= √(𝑧′𝑧) is the Euclidean norm here
  • E‖𝑍𝑖‖² < ∞ is an economical way of saying that all components
    of 𝑍𝑖 have finite means, variances, and covariances
The CLT is a remarkable result
From the WLLN we know that (𝑍̄𝑁 − 𝜇𝑍) →ᵖ 0
At the same time √𝑁 → ∞
Yet their product converges to a normal distribution!
The restrictions imposed by the CLT don’t seem very strong
For example, it does not matter what distribution the 𝑍𝑖 come from
(as long as E‖𝑍𝑖‖² < ∞)
The centered sample average, scaled by √𝑁, converges to a normal
distribution
Conventional terminology with regard to the result
    √𝑁 (𝑍̄𝑁 − 𝜇𝑍) →ᵈ N(0, Ω)
where Ω ∶= E((𝑍𝑖 − 𝜇𝑍)(𝑍𝑖 − 𝜇𝑍)′)
  • 𝑍̄𝑁 is asymptotically normally distributed
  • The large sample distribution of 𝑍̄𝑁 is normal
  • Ω is the asymptotic variance of √𝑁 (𝑍̄𝑁 − 𝜇𝑍)
  • Ω/𝑁 is the asymptotic variance of 𝑍̄𝑁
Primitive usage
  • when the sample size 𝑁 is large yet finite
  • the sample average 𝑍̄𝑁 almost has a normal distribution
  • around the population mean 𝜇𝑍
  • with variance Ω/𝑁
  • irrespective of the underlying distribution of the 𝑍1 , 𝑍2 , …
Practical meaning of CLT: for large sample sizes
    𝑍̄𝑁 ∼ᵃᵖᵖʳᵒˣ 𝑁(𝜇𝑍, Ω/𝑁)
Let’s sketch the proof for a scalar version of the CLT, where E𝑍𝑖 = 𝜇𝑍
and Var 𝑍𝑖 = 𝜎²𝑍
We know from undergrad that E𝑍̄𝑁 = 𝜇𝑍 and Var 𝑍̄𝑁 = 𝜎²𝑍/𝑁,
therefore the CLT says that
    √𝑁 (𝑍̄𝑁 − 𝜇𝑍) →ᵈ N(0, 𝜎²𝑍)
or, equivalently,
    √𝑁 (𝑍̄𝑁 − 𝜇𝑍)/𝜎𝑍 →ᵈ N(0, 1)
To prove this, we need a new concept
Definition (Moment Generating Function)
Let 𝑍 be a random variable. The moment generating function (mgf)
of 𝑍 is given by 𝑀𝑍(𝑡) = E(exp(𝑡𝑍)), where 𝑡 ∈ R.
Fun facts about the mgf
  • The curvature of the mgf at zero describes all moments:
    (𝑑ᵏ𝑀𝑍/𝑑𝑡ᵏ)(0) = E𝑍ᵏ
    the 𝑘th derivative evaluated at zero is equal to the 𝑘th moment
    (hence that name; see the sketch after this list)
  • not every random variable has a well-defined mgf
    (there exists a generalization, called the characteristic function,
    that overcomes this problem; the mgf is slightly less general
    but easier to work with)
  • for random variables whose mgf exists:
    two random variables have identical distributions if and only if
    their mgfs are the same
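As a small sanity check of the first fun fact, we can differentiate the mgf of a standard normal (see the lemma below) symbolically:

```python
# k-th derivative of the N(0,1) mgf at zero recovers the k-th moment
import sympy as sp

t = sp.symbols('t')
M = sp.exp(t**2 / 2)                       # mgf of the standard normal, see lemma below
for k in range(1, 7):
    print(k, sp.diff(M, t, k).subs(t, 0))  # prints 0, 1, 0, 3, 0, 15
```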
The mgf can be a useful device for establishing limiting distributions
Lemma (Curtiss’ Continuity Theorem)
Let 𝑀𝑍 (𝑡) be the mgf of 𝑍 and let 𝑀𝑍𝑁 (𝑡) be the mgf of 𝑍𝑁 .
If lim𝑁→∞ 𝑀𝑍𝑁(𝑡) = 𝑀𝑍(𝑡) for every 𝑡, then 𝑍𝑁 →ᵈ 𝑍.
This is based on Lévy’s Continuity Theorem (1937)
We’re interested in showing √𝑁(𝑍̄𝑁 − 𝜇𝑍)/𝜎𝑍 →ᵈ N(0, 1)
Let’s consider the mgf of 𝑍̃𝑁 ∶= √𝑁(𝑍̄𝑁 − 𝜇𝑍)/𝜎𝑍
and show that its limit is equal to the mgf of a 𝑁(0, 1)
Wait! What is the mgf of the standard normal distribution?
Lemma
The mgf of the standard normal distribution is 𝑡 ↦ exp(𝑡²/2).
(Proof: see assignment)
Notice 𝑍̃𝑁 ∶= √𝑁(𝑍̄𝑁 − 𝜇𝑍)/𝜎𝑍 = (∑ᵢ₌₁ᴺ 𝑍𝑖 − 𝑁𝜇𝑍)/(𝜎𝑍√𝑁)

    𝑀𝑍̃𝑁(𝑡) = E(exp(𝑡𝑍̃𝑁)) = E(exp(𝑡 ∑ᵢ₌₁ᴺ (𝑍𝑖 − 𝜇𝑍)/(𝜎𝑍√𝑁)))
            = E(exp(𝑡(𝑍1 − 𝜇𝑍)/(𝜎𝑍√𝑁)) ⋅ exp(𝑡(𝑍2 − 𝜇𝑍)/(𝜎𝑍√𝑁)) ⋯ exp(𝑡(𝑍𝑁 − 𝜇𝑍)/(𝜎𝑍√𝑁)))
            = E(exp(𝑡(𝑍1 − 𝜇𝑍)/(𝜎𝑍√𝑁))) ⋯ E(exp(𝑡(𝑍𝑁 − 𝜇𝑍)/(𝜎𝑍√𝑁)))    (independence)
            = (E(exp(𝑡(𝑍1 − 𝜇𝑍)/(𝜎𝑍√𝑁))))ᴺ                              (identical distribution)
            = 𝑚𝑍1(𝑡/(𝜎𝑍√𝑁))ᴺ,
where we define 𝑚𝑍1(𝑡) ∶= E(exp(𝑡(𝑍1 − 𝜇𝑍)))
Copy and paste last line: 𝑚𝑍1(𝑡) ∶= E(exp(𝑡(𝑍1 − 𝜇𝑍)))
Notice that
  • 𝑚𝑍1(0) = 1
  • 𝑚′𝑍1(0) = E(𝑍1 − 𝜇𝑍) = 0
  • 𝑚″𝑍1(0) = E(𝑍1 − 𝜇𝑍)² = 𝜎²𝑍
Applying a second order Taylor approximation (at zero):
    𝑚𝑍1(𝑡) ≈ 𝑚𝑍1(0) + 𝑚′𝑍1(0) ⋅ 𝑡 + (1/2) 𝑚″𝑍1(0) ⋅ 𝑡² = 1 + (1/2) 𝜎²𝑍 ⋅ 𝑡²
and therefore,
    𝑚𝑍1(𝑡/(𝜎𝑍√𝑁)) ≈ 1 + 𝜎²𝑍 ⋅ 𝑡²/(2𝜎²𝑍𝑁) = 1 + (𝑡²/2)/𝑁
Connecting the dots:
    𝑀𝑍̃𝑁(𝑡) = 𝑚𝑍1(𝑡/(𝜎𝑍√𝑁))ᴺ ≈ (1 + (𝑡²/2)/𝑁)ᴺ
And finally, to evaluate the limit use this result:
Lemma
lim𝑁→∞ (1 + 𝑐/𝑁)ᴺ = 𝑒ᶜ.
It follows that
    lim𝑁→∞ 𝑀𝑍̃𝑁(𝑡) = lim𝑁→∞ (1 + (𝑡²/2)/𝑁)ᴺ = exp(𝑡²/2),
which is the mgf of a standard normal distribution
It follows that 𝑍̃𝑁 →ᵈ 𝑁(0, 1)
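A quick numerical check of this limit for a fixed 𝑡:

```python
# (1 + (t^2/2)/N)^N approaches exp(t^2/2) as N grows
import numpy as np

t = 1.5
for N in [10, 100, 1_000, 10_000]:
    print(N, (1 + (t**2 / 2) / N) ** N)    # approaches exp(1.125)
print(np.exp(t**2 / 2))                    # about 3.0802
```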
Illustration of CLT
The underlying distribution of 𝑍1 , … , 𝑍𝑁 is exponential
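A minimal simulation sketch in the spirit of this illustration (assuming rate-1 exponential draws, so 𝜇𝑍 = 𝜎𝑍 = 1):

```python
# Histograms of standardized sample means versus the N(0,1) density
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
reps = 10_000
for N in (5, 50, 500):
    Z = rng.exponential(scale=1.0, size=(reps, N))   # mu_Z = 1, sigma_Z = 1
    Ztilde = np.sqrt(N) * (Z.mean(axis=1) - 1.0)     # sqrt(N)(Zbar_N - mu_Z)/sigma_Z
    plt.hist(Ztilde, bins=60, density=True, alpha=0.4, label=f"N = {N}")

x = np.linspace(-4, 4, 200)
plt.plot(x, np.exp(-x**2 / 2) / np.sqrt(2 * np.pi), "k", label="N(0,1) density")
plt.legend()
plt.show()
```

Even for the heavily skewed exponential distribution, the histograms move toward the standard normal density as 𝑁 grows.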
Roadmap
  Ordinary Least Squares Estimation
     Basic Asymptotic Theory (part 2 of 2)
     Asymptotic Distribution of the OLS Estimator
     Asymptotic Variance Estimation
We know that 𝛽̂OLS ∈ 𝐿²
We would like to know the exact distribution of 𝛽̂OLS for finite
samples (so-called small sample distribution)
Remember
    𝛽̂OLS = 𝛽∗ + (∑ᵢ₌₁ᴺ 𝑋𝑖𝑋𝑖′)⁻¹ ∑ᵢ₌₁ᴺ 𝑋𝑖𝑢𝑖
    𝛽∗ = E(𝑋𝑖𝑋𝑖′)⁻¹ E(𝑋𝑖𝑌𝑖)
We suspect that 𝛽̂OLS|𝑋𝑖 ∼ N(⋅, ⋅) if 𝑢𝑖 ∼ N(⋅, ⋅)
In the absence of such a restrictive assumption, we are unable to
determine the exact distribution of 𝛽̂OLS
We approximate the exact distribution by the asymptotic distribution
Our hope is that the asymptotic (aka large sample) distribution is a
good approximation
The CLT will be our main tool in deriving the asymptotic distribution
of 𝛽̂OLS
Big picture: we already know that 𝛽̂OLS − 𝛽∗ = o𝑝(1)
From what I said earlier, we may suspect that √𝑁(𝛽̂OLS − 𝛽∗) could
converge to a normal distribution
To derive this result, let’s recall the following representation of the
OLS estimator from last week:
    𝛽̂OLS = 𝛽∗ + ((1/𝑁) ∑ᵢ₌₁ᴺ 𝑋𝑖𝑋𝑖′)⁻¹ ((1/𝑁) ∑ᵢ₌₁ᴺ 𝑋𝑖𝑢𝑖)
Let’s re-arrange terms …
Copy and paste, for convenience:
    𝛽̂OLS = 𝛽∗ + ((1/𝑁) ∑ᵢ₌₁ᴺ 𝑋𝑖𝑋𝑖′)⁻¹ ((1/𝑁) ∑ᵢ₌₁ᴺ 𝑋𝑖𝑢𝑖)
Then isolating √𝑁(𝛽̂OLS − 𝛽∗):
    √𝑁 (𝛽̂OLS − 𝛽∗) = ((1/𝑁) ∑ᵢ₌₁ᴺ 𝑋𝑖𝑋𝑖′)⁻¹ (√𝑁 ((1/𝑁) ∑ᵢ₌₁ᴺ 𝑋𝑖𝑢𝑖))
Can you see how the CLT can now be applied to the second factor on
the rhs?
Let’s break the rhs up again into its bits and pieces
We’ve already shown last week that, given E(𝑋𝑖𝑋𝑖′) < ∞,
    ((1/𝑁) ∑ᵢ₌₁ᴺ 𝑋𝑖𝑋𝑖′)⁻¹ = E(𝑋𝑖𝑋𝑖′)⁻¹ + o𝑝(1) = O𝑝(1)
For the second factor on the rhs, we know that E((1/𝑁) ∑ᵢ₌₁ᴺ 𝑋𝑖𝑢𝑖) = 0,
so applying the CLT is easy:
    √𝑁 ((1/𝑁) ∑ᵢ₌₁ᴺ 𝑋𝑖𝑢𝑖) →ᵈ N(0, E(𝑢²𝑖 𝑋𝑖𝑋𝑖′))
Using our tools from basic asymptotic theory (part 2):
Proposition (Asymptotic Distribution of OLS Estimator)
    √𝑁 (𝛽̂OLS − 𝛽∗) = ((1/𝑁) ∑ᵢ₌₁ᴺ 𝑋𝑖𝑋𝑖′)⁻¹ ((1/√𝑁) ∑ᵢ₌₁ᴺ 𝑋𝑖𝑢𝑖) →ᵈ N(0, Ω),
where Ω ∶= E(𝑋𝑖𝑋𝑖′)⁻¹ E(𝑢²𝑖 𝑋𝑖𝑋𝑖′) E(𝑋𝑖𝑋𝑖′)⁻¹.
Ω is the asymptotic variance of √𝑁 (𝛽̂OLS − 𝛽∗)
Ω/𝑁 is the asymptotic variance of 𝛽̂OLS
We take this to mean that 𝛽̂OLS has an approximate normal
distribution with mean 𝛽∗ and variance Ω/𝑁
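A simulation sketch of the proposition (my own toy design: a constant plus one uniform regressor, heteroskedastic errors); the empirical variance of √𝑁(𝛽̂OLS − 𝛽∗) across replications should match the sandwich Ω:

```python
import numpy as np

rng = np.random.default_rng(1)
N, reps = 1_000, 2_000
beta_star = np.array([1.0, 2.0])

def draw(n):
    x = rng.uniform(0.0, 2.0, n)
    X = np.column_stack([np.ones(n), x])     # constant plus one regressor
    u = (0.5 + x) * rng.standard_normal(n)   # heteroskedastic errors with E(u|X) = 0
    return X, u

draws = np.empty((reps, 2))
for r in range(reps):
    X, u = draw(N)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ (X @ beta_star + u))  # OLS
    draws[r] = np.sqrt(N) * (beta_hat - beta_star)

X, u = draw(1_000_000)                       # approximate population moments
A_inv = np.linalg.inv(X.T @ X / len(u))      # E(X_i X_i')^{-1}
B = (X * (u**2)[:, None]).T @ X / len(u)     # E(u_i^2 X_i X_i')
print(np.cov(draws, rowvar=False))           # empirical variance
print(A_inv @ B @ A_inv)                     # sandwich Omega; the two should be close
```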
Roadmap
  Ordinary Least Squares Estimation
     Basic Asymptotic Theory (part 2 of 2)
     Asymptotic Distribution of the OLS Estimator
     Asymptotic Variance Estimation
The asymptotic variance of √𝑁(𝛽̂OLS − 𝛽∗) is
    Ω ∶= E(𝑋𝑖𝑋𝑖′)⁻¹ E(𝑢²𝑖 𝑋𝑖𝑋𝑖′) E(𝑋𝑖𝑋𝑖′)⁻¹
The rhs is a function of unobserved population moments
How would we estimate Ω?
Clearly, we estimate E(𝑋𝑖𝑋𝑖′) by (1/𝑁) ∑ᵢ₌₁ᴺ 𝑋𝑖𝑋𝑖′
But what about E(𝑢²𝑖 𝑋𝑖𝑋𝑖′)?
We don’t know 𝑢𝑖
If we observed 𝑢𝑖 then we would surely use (1/𝑁) ∑ᵢ₌₁ᴺ 𝑢²𝑖 𝑋𝑖𝑋𝑖′
That would be an unbiased variance estimator
But we don’t observe the errors 𝑢𝑖; instead we “observe” the
residuals 𝑢̂𝑖 ∶= 𝑌𝑖 − 𝑋𝑖′𝛽̂OLS
So how about using (1/𝑁) ∑ᵢ₌₁ᴺ 𝑢̂²𝑖 𝑋𝑖𝑋𝑖′ to estimate the middle piece?
While this is in principle the right idea, it results in a biased variance
estimator
Let’s try to understand the source of this bias
First some new tools
Let 𝑀𝑋 ∶= 𝐼𝑁 − 𝑃𝑋 with 𝑃𝑋 ∶= 𝑋(𝑋′𝑋)⁻¹𝑋′
Then 𝑢̂ = 𝑀𝑋 𝑢
Cool facts about 𝑀𝑋:
𝑀′𝑋 = 𝑀𝑋 (symmetric) and 𝑀𝑋𝑀𝑋 = 𝑀𝑋 (idempotent)
The trace of a 𝐾 × 𝐾 matrix is the sum of its diagonal elements:
tr 𝐴 ∶= ∑ᵢ₌₁ᴷ 𝑎𝑖𝑖
Savvy tricks: tr(𝐴𝐵) = tr(𝐵𝐴) and tr(𝐴 + 𝐵) = tr 𝐴 + tr 𝐵
Then
    𝜎̂²𝑢 ∶= ∑ᵢ₌₁ᴺ 𝑢̂²𝑖/𝑁 = tr(𝑢̂𝑢̂′)/𝑁 = tr(𝑢̂′𝑢̂)/𝑁 = tr((𝑀𝑋𝑢)′(𝑀𝑋𝑢))/𝑁
           = tr(𝑢′𝑀′𝑋𝑀𝑋𝑢)/𝑁 = tr(𝑢′𝑀𝑋𝑢)/𝑁 = tr(𝑀𝑋𝑢𝑢′)/𝑁
Aside: dim 𝑀𝑋 = 𝑁 × 𝑁 and dim(𝑢𝑢′ ) = 𝑁 × 𝑁
Now studying the conditional expectation:
    E(𝜎̂²𝑢|𝑋) = E(tr(𝑀𝑋𝑢𝑢′)|𝑋)/𝑁
             = tr(E(𝑀𝑋𝑢𝑢′|𝑋))/𝑁
             = tr(𝑀𝑋 E(𝑢𝑢′|𝑋))/𝑁
             = 𝜎²𝑢 ⋅ tr(𝑀𝑋)/𝑁
             = 𝜎²𝑢 (𝑁 − 𝐾)/𝑁
             < 𝜎²𝑢,
where in the fourth equality we simplified our lives by setting
E(𝑢𝑢′|𝑋) = 𝜎²𝑢 𝐼𝑁 (conditional homoskedasticity)
(The fifth equality will be justified in Assignment 3)
Big picture: 𝜎̂²𝑢 is downward biased, which is not good
Confidence intervals based on 𝜎̂²𝑢 would be too narrow
Statistical inference based on 𝜎̂²𝑢 would be too optimistic
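A quick simulation of the bias result, assuming conditional homoskedasticity as above (with 𝑁 = 30, 𝐾 = 5, 𝜎²𝑢 = 4, so E(𝜎̂²𝑢|𝑋) should be about 4 ⋅ 25/30 ≈ 3.33):

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, sigma2, reps = 30, 5, 4.0, 20_000
X = rng.standard_normal((N, K))                      # hold X fixed across replications
M_X = np.eye(N) - X @ np.linalg.inv(X.T @ X) @ X.T   # annihilator matrix M_X

estimates = np.empty(reps)
for r in range(reps):
    u = np.sqrt(sigma2) * rng.standard_normal(N)     # homoskedastic errors
    u_hat = M_X @ u                                  # residuals u_hat = M_X u
    estimates[r] = (u_hat @ u_hat) / N               # sigma_hat^2

print(estimates.mean(), sigma2 * (N - K) / N)        # both close to 3.33
```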
There is an easy fix!
Use 𝑠²𝑢 ∶= (𝑁/(𝑁 − 𝐾)) 𝜎̂²𝑢 = (1/(𝑁 − 𝐾)) ∑ᵢ₌₁ᴺ 𝑢̂²𝑖 instead
Obviously 𝑠²𝑢 will be unbiased
I’m not particularly concerned about this bias
That’s because 𝑁 should be a much larger number than 𝐾
The whole idea of using asymptotic approximations to finite sample
distributions is to let 𝑁 → ∞ while 𝐾 is fixed
In other words, lim𝑁→∞ 𝜎̂²𝑢 = lim𝑁→∞ 𝑠²𝑢
(the asymptotic bias is the same)
Combining things, we propose the following asymptotic variance
estimator
Definition (Asymptotic Variance Estimator)
    Ω̂ = ((1/𝑁) ∑ᵢ₌₁ᴺ 𝑋𝑖𝑋𝑖′)⁻¹ ((1/(𝑁 − 𝐾)) ∑ᵢ₌₁ᴺ 𝑢̂²𝑖 𝑋𝑖𝑋𝑖′) ((1/𝑁) ∑ᵢ₌₁ᴺ 𝑋𝑖𝑋𝑖′)⁻¹
Stata calculates Ω̂ when you type something like
     regress lwage schooling experience, robust
Textbooks call Ω̂ the heteroskedasticity robust variance estimator
The standard errors derived from Ω̂ are sometimes referred to as
Eicker-Huber-White standard errors
(or some subset permutation of these names)
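A minimal Python sketch of Ω̂ as defined above, and the resulting standard errors (the simulated lwage/schooling/experience data are made up for illustration):

```python
import numpy as np

def ehw_standard_errors(X, Y):
    N, K = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)      # OLS coefficients
    u_hat = Y - X @ beta_hat                          # residuals
    A_inv = np.linalg.inv(X.T @ X / N)                # ((1/N) sum X_i X_i')^{-1}
    B = (X * (u_hat**2)[:, None]).T @ X / (N - K)     # (1/(N-K)) sum u_hat_i^2 X_i X_i'
    Omega_hat = A_inv @ B @ A_inv
    return beta_hat, np.sqrt(np.diag(Omega_hat) / N)  # avar(beta_hat) = Omega_hat/N

rng = np.random.default_rng(3)
N = 500
schooling = rng.uniform(8, 20, N)
experience = rng.uniform(0, 30, N)
X = np.column_stack([np.ones(N), schooling, experience])
lwage = X @ np.array([0.5, 0.08, 0.01]) + (0.2 + 0.01 * schooling) * rng.standard_normal(N)
print(ehw_standard_errors(X, lwage))
```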
Notice: Wooldridge, on page 61, proposes this version
Definition (Asymptotic Variance Estimator)
    Ω̂Wooldridge = ((1/𝑁) ∑ᵢ₌₁ᴺ 𝑋𝑖𝑋𝑖′)⁻¹ ((1/𝑁) ∑ᵢ₌₁ᴺ 𝑢̂²𝑖 𝑋𝑖𝑋𝑖′) ((1/𝑁) ∑ᵢ₌₁ᴺ 𝑋𝑖𝑋𝑖′)⁻¹
This is NOT what Stata implements
(to the best of my knowledge)
But from what I said earlier, the difference is tiny:
Ω̂ = (𝑁/(𝑁 − 𝐾)) Ω̂Wooldridge, and 𝑁/(𝑁 − 𝐾) → 1 as 𝑁 → ∞
Asymptotically they are all identical
(because 𝐾 is a finite number)