Lecture: IV and 2SLS Estimators (Wooldridge’s book chapter 15)
Endogeneity
• The endogeneity issue arises when the key regressor is correlated with the error term:
cov(x, u) ≠ 0 (Endogeneity) (1)
• This can happen when (i) there are omitted variables; (ii) there is reverse causation or
simultaneity; (iii) there is measurement error
• In the presence of endogeneity, the OLS estimator is biased (inconsistent)
β̂ →_p β + bias (2)
or equivalently, causal effect cannot be identified
cov(y, x) = β cov(x, x) + cov(x, u) (3)
β = [cov(y, x) − cov(x, u)] / cov(x, x) ≠ cov(y, x) / cov(x, x) (4)
• The primary goal of econometrics is to resolve the endogeneity (identification) issue
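To make the bias concrete, below is a minimal Stata simulation sketch; all variable names and parameter values are invented for illustration. Here OLS converges to β + cov(x, u)/var(x) = 2 + 1/3 ≈ 2.33, while IV recovers the true slope of 2.
clear
set seed 12345
set obs 10000
generate z = rnormal()            // instrument: exogenous and relevant
generate u = rnormal()            // structural error
generate x = z + u + rnormal()    // endogenous regressor: cov(x, u) = 1 > 0
generate y = 1 + 2*x + u          // true slope is 2
regress y x                       // OLS: slope near 2.33 (biased upward)
ivregress 2sls y (x = z)          // IV/2SLS: slope near 2 (consistent)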
Instrumental Variables (IV) Can Help
• If there is endogeneity, the 2SLS or IV estimator based on valid IVs is consistent (β can be
identified)
• An IV is valid if (i) it is uncorrelated with the error term (exogeneity); (ii) it is correlated with
the key regressor (relevance); (iii) it has no direct effect on y, i.e., it is excluded from the
structural form (exclusion)
• A valid IV is hard to find
• For instance, the number of rainy days is a valid IV for watching TV if (1) it is
uncorrelated with the autism gene (exogeneity); (2) it is correlated with watching TV
(relevance); (3) it has no direct effect on developing autism (exclusion). Notice
that it is allowed to affect autism indirectly through watching TV.
Structural Form vs Reduced Form
Consider a linear model in which x2 is assumed to be exogenous
y = β1 x1 + β2 x2 + u (5)
• We are interested in estimating β1, which measures the marginal effect of x1 on y
• This is a reduced form if x1 is also exogenous. OLS can be applied to the reduced form
• This is a structural form if x1 is endogenous. Most economic models are structural
forms. OLS becomes biased, so instead we may need to find an IV.
IVs
• x2 cannot be used as an IV. It satisfies exogeneity, and maybe relevance, but it does not
satisfy exclusion
• A valid IV should be an exogenous variable that matters for x1 (relevance) but affects y
only indirectly, through its effect on x1 (exclusion)
• β1 is just-identified if there is only one IV (excluded exogenous variable). In this case,
the 2SLS estimator is also called the IV estimator.
• β1 is over-identified if there are multiple IVs.
• β1 is under-identified if there is no excluded exogenous variable.
• For instance, we have over-identification if we know both the number of rainy days and the
number of snowy days. If only one is known, we have just-identification.
Apple Story
• You can think of x1 as a partially rotten apple consisting of two parts: the bad
endogenous part (correlated with u) and the good exogenous part (uncorrelated with u)
• OLS is bad since it uses the whole apple
• IV estimation is good because the IV is used as a knife to remove the endogenous part, and
only the exogenous part is used in the estimation.
• When people ask about your identification strategy, typically they wonder how the bad
part of the apple is removed or how the good part is isolated
• We hope the good part is big, i.e., the IV and x1 are not weakly related
• It is a good idea to use more IVs (over-identification) to isolate a bigger exogenous part of
the apple
Big Picture
[Figure: the structural model drawn as a box containing y, x1, x2, and u; the instrument z sits outside the box, linked to x1 but not to u or y]
• The box defines the structural model in which y depends on x1 , x2 and u.
• x1 is the variable of interest, whose marginal (causal) effect on y we want to quantify.
However, x1 is endogenous because it is linked to u. OLS is biased because of the
x1–u link.
• To solve the endogeneity or identification issue, we need help—an IV variable z which
is outside the box (exclusion), is related to x1 (relevance), and is unrelated to u
(exogeneity)
• Notice that x2 is exogenous because there is no link between x2 and u. x2 cannot be used
as an IV because it is inside the box (fails exclusion). Instead, x2 is called a control
variable (included exogenous variable)
• Critical thinking: what if we do not control for x2 ? (Hint: think about the potential link
between z and x2 )
• You need to draw and justify this big picture if you decide to use IV methodology
Stata
Suppose there are two valid IVs z1 and z2. The Stata command for the 2SLS estimator is
ivreg y (x1 = z1 z2) x2, first
In newer versions of Stata, the equivalent command is ivregress 2sls y (x1 = z1 z2) x2, first.
• It is important to control for x2, which makes the exogeneity condition more likely to
hold for z1 and z2
• The option first reports the first-stage regression, which regresses x1 onto z1, z2, and x2.
The residual of the first-stage regression is the bad part of the apple and can be used to
implement the Hausman test. The weak-IV test is just the F statistic for testing that the
coefficients of z1 and z2 are jointly zero. The fitted value of the first-stage regression is the
good part of the apple, and it is the instrument used in the second stage
• We obtain the 2SLS estimator by regressing y onto the first-stage fitted value and x2 using
OLS (the second stage). The ivreg command does all of this for you; the sketch below
replicates it by hand
• Important: z1, z2 are excluded exogenous variables, while x2 is an included exogenous
variable (control variable).
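The two stages can be replicated by hand; here is a sketch using the same (hypothetical) variable names:
regress x1 z1 z2 x2      // first stage
test z1 z2               // weak-IV check: F statistic for the excluded IVs
predict x1hat, xb        // good part of the apple (fitted value)
predict vhat, residuals  // bad part of the apple (residual, kept for the Hausman test)
regress y x1hat x2       // second stage: reproduces the 2SLS point estimates
One caution: the standard errors from the manual second stage are wrong because they are computed from the wrong residuals (they use x1hat instead of x1); ivreg reports the correct ones.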
Three Little Pigs Story
• Recall the first-stage regression
x = c1 z1 + c2 z2 + · · · + cm zm + included exogenous variables + v
• (Hausman Test): The null hypothesis is that the regressor is exogenous (so OLS is good
and IV is not needed). We run the first-stage regression and save the residual v̂. Then
run an auxiliary regression y = xβ + d v̂ and test H0 : d = 0. A small p-value indicates that
the regressor is endogenous and IV is needed.
• (Stock-Yogo Test): The null hypothesis is that c1 = c2 = . . . = cm = 0, meaning that the
IVs are irrelevant (weak IVs). We reject the null hypothesis if the F statistic exceeds 10
• (Over-identification or Sargan's J Test): The key coefficient is over-identified if the
number of IVs exceeds the number of endogenous regressors by q > 0. In that case we
can test the null hypothesis that all IVs are exogenous. We run the auxiliary regression
û_{2SLS} = a1 z1 + a2 z2 + · · · + am zm + included exogenous regressors
and compute nR² ∼ χ²(q). A big nR² leads to rejection, meaning at least one IV is endogenous.
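A sketch of all three tests in Stata, assuming one endogenous regressor x1 and two IVs z1, z2 (so q = 2 − 1 = 1); variable names are hypothetical:
* Hausman test of regressor exogeneity
regress x1 z1 z2 x2
predict vhat, residuals
regress y x1 x2 vhat
test vhat                   // small p-value: x1 is endogenous, IV is needed
* Weak-IV test (first-stage F)
regress x1 z1 z2 x2
test z1 z2                  // rule of thumb: IVs are not weak if F > 10
* Sargan's J test of over-identification
ivregress 2sls y (x1 = z1 z2) x2
predict uhat, residuals
regress uhat z1 z2 x2
display "nR2 = " e(N)*e(r2) ", p-value = " chi2tail(1, e(N)*e(r2))
After ivregress 2sls, the built-in postestimation commands estat endogenous, estat firststage, and estat overid report essentially the same three diagnostics.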
Math (Optional)
Consider a simple regression y = β0 + β1 x + u, where x is endogenous: cov(x, u) ≠ 0. To
derive a formula for the IV estimator, assume there is only one excluded exogenous variable
z satisfying cov(z, u) = 0 and cov(x, z) ≠ 0. It follows that
cov(y, z) = cov(β0 + β1 x + u, z) = β1 cov(x, z) (6)
β1^{IV} = cov(y, z) / cov(x, z) (7)
Some old school people want to rewrite it as
β1^{IV} = [cov(y, z)/var(z)] / [cov(x, z)/var(z)] = reduced-form OLS estimate / first-stage OLS estimate (8)
When there are multiple instrumental variables, the IV estimator is called the 2SLS estimator
β1^{2SLS} = cov(y, x̂) / cov(x, x̂) = cov(y, x̂) / var(x̂) = OLS estimate of regressing y onto x̂ (9)
where x̂ is the fitted value of regressing x onto the multiple IVs (the first-stage
regression).
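Equation (8) can be verified numerically; a sketch assuming hypothetical variables y, x, z (e.g., from the simulation earlier):
regress y z                  // reduced form
scalar b_reduced = _b[z]
regress x z                  // first stage
scalar b_first = _b[z]
display "beta1_IV by hand = " b_reduced/b_first
ivregress 2sls y (x = z)     // reports the same slope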
(Optional) Matrix Algebra I
• Let X be the matrix of regressors in the structural form, X = (x1, x2). Note x1 is
endogenous while x2 is exogenous
• Let Z be the matrix of all exogenous variables, Z = (z1, z2, x2). Note x2 is the included
exogenous variable, while z1, z2 are excluded exogenous variables
• Define the projection matrix P = Z(Z′Z)^{−1}Z′. The fitted value of the first stage is X̂ = PX.
Note X̂ is exogenous, and the fitted value for x2 is itself
• The second stage uses X̂ as regressors and applies OLS
β̂^{2SLS} = (X̂′X̂)^{−1} X̂′Y (10)
= (X′PX)^{−1} X′PY (11)
= [X′Z(Z′Z)^{−1}Z′X]^{−1} X′Z(Z′Z)^{−1}Z′Y (12)
where we use the fact that P is symmetric and idempotent: P′ = P and PP = P
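The projection-matrix formula can be computed directly in Mata. A minimal sketch, assuming (hypothetical) variables y, x1, x2, z1, z2 in memory and adding a constant; forming the n × n matrix P explicitly is fine for teaching but wasteful in large samples:
mata:
    Y = st_data(., "y")
    n = rows(Y)
    X = (st_data(., ("x1", "x2")), J(n, 1, 1))        // structural regressors + constant
    Z = (st_data(., ("z1", "z2", "x2")), J(n, 1, 1))  // all exogenous variables + constant
    P = Z * invsym(Z' * Z) * Z'                       // projection matrix
    Xhat = P * X                                      // first-stage fitted values
    b = invsym(X' * P * X) * (X' * P * Y)             // 2SLS coefficients, equation (11)
    b
end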
(Optional) Matrix Algebra II
• It follows that
β̂^{2SLS} = β + (X′PX)^{−1} X′PU
so β̂^{2SLS} is consistent if the IVs are valid
• The variance-covariance matrix for β̂^{2SLS} is (assuming homoskedasticity)
var-cov(β̂^{2SLS}) = σ² (X′PX)^{−1}
• The CLT implies that in large samples
β̂^{2SLS} ∼ N(β, σ² (X′PX)^{−1})
and a Wald statistic can be constructed to test H0 : Rβ = r
Wald Test = (Rβ̂^{2SLS} − r)′ [R σ² (X′PX)^{−1} R′]^{−1} (Rβ̂^{2SLS} − r)
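Continuing the Mata sketch above (the Mata workspace persists across mata: ... end blocks), the homoskedastic variance-covariance matrix and standard errors:
mata:
    e  = Y - X * b                   // 2SLS residuals (use X, not Xhat)
    s2 = (e' * e) / (n - cols(X))    // estimate of sigma^2
    V  = s2 * invsym(X' * P * X)     // var-cov matrix under homoskedasticity
    (b, sqrt(diagonal(V)))           // coefficients and standard errors
end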
No Free Lunch (trade-off between consistency and efficiency)
Recall that the first-stage regression is basically a decomposition
x = x̂ + r̂ (13)
which implies the following decomposition of the total sum of squares (TSS)
TSS = ESS + RSS, TSS ≥ ESS (14)
or in this case, loosely speaking, we have
X′X ≥ X′PX, (X′X)^{−1} ≤ (X′PX)^{−1}, var-cov(β̂^{OLS}) ≤ var-cov(β̂^{2SLS}) (15)
In words, the IV estimator is less efficient than the OLS estimator: it has a bigger variance (and
a smaller t value). Intuitively, this is because only part of the apple is eaten.
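The efficiency loss is easy to see on simulated data such as the earlier sketch: the 2SLS standard error is noticeably larger than the OLS one.
regress y x               // biased but precise (small standard error)
ivregress 2sls y (x = z)  // consistent but less precise (larger standard error)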
(Optional) Matrix Algebra III
It is straightforward to account for heteroskedasticity. The robust variance-covariance matrix
for β̂ 2SLS allowing for heteroskedasticity is
robust var-cov(β̂^{2SLS}) = (X′PX)^{−1} X′PΩPX (X′PX)^{−1}
where Ω = E(UU′). To estimate the meat in the middle of that sandwich, use
X′PΩ̂PX = X̂′Ω̂X̂ = Σ_{i=1}^{n} û_i² x̂_i x̂_i′
where û denotes the 2SLS residual
û = Y − X β̂^{2SLS}
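Continuing the same Mata sketch, the robust sandwich can be estimated as follows:
mata:
    u    = Y - X * b                    // 2SLS residuals
    meat = quadcross(Xhat, u:^2, Xhat)  // sum of uhat_i^2 * xhat_i * xhat_i'
    A    = invsym(X' * P * X)           // bread of the sandwich
    Vr   = A * meat * A                 // robust var-cov matrix
    (b, sqrt(diagonal(Vr)))             // coefficients and robust standard errors
end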
(Optional) GMM Estimator
If the errors are not IID (i.e., there is heteroskedasticity or serial correlation), a more efficient
estimator is the generalized method of moments (GMM) estimator
β̂^{GMM} = (X̂′Ω^{−1}X̂)^{−1} X̂′Ω^{−1}Y (16)
= (X′PΩ^{−1}PX)^{−1} X′PΩ^{−1}PY (17)
Basically, GMM combines GLS with IV estimation.
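As a rough illustration of formula (16) only, take Ω̂ = diag(û²) from a first-step 2SLS (pure heteroskedasticity case). This naive plug-in can be unstable when residuals are near zero; in practice one would use a packaged routine such as Stata's ivregress gmm. Continuing the Mata sketch:
mata:
    u  = Y - X * b                  // first-step 2SLS residuals
    w  = 1 :/ (u:^2)                // diagonal of Omega-hat inverse
    bg = invsym(quadcross(Xhat, w, Xhat)) * quadcross(Xhat, w, Y)
    bg                              // second-step GMM/GLS-style estimate
end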