Advanced Econometric Methods I:
Lecture notes on Weak Instruments
November 20, 2020
We illustrate the weak instruments problem, discuss how to detect weak instruments, and present weak-instrument robust inference methods.
1 Motivation
Instrument exogeneity and instrument relevance are two crucial requirements in empirical
analysis using instrumental variables (IV) regressions (simple IV, 2SLS and GMM). In instrumental variables regression, the instruments are called weak if their correlation with the
endogenous regressors, conditional on any controls, is close to zero. When this correlation
is sufficiently small, conventional approximations to the distribution of GMM and other IV
estimators are generally unreliable.
In particular, GMM estimators can be badly biased, while t and Wald tests may fail to
control size and conventional GMM confidence intervals may cover the true parameter value
far less often than we intend. A recognition of this problem has led to a great deal of work
on econometric methods applicable to models with weak instruments.
We concentrate on detection of weak instruments and weak-instrument robust inference.
The problem of detection is relevant because weak-instrument robust methods can be more
complicated to use than standard GMM methods, so if instruments are plausibly strong it
is convenient to report GMM estimates and standard errors. If instruments are weak, on
the other hand, then practitioners are advised to use weak instrument robust methods for
inference, the second topic of these notes. We refer to Andrews, Stock & Sun (2019) for a
recent review of the literature.
The remainder of this note is organized as follows. In the next section we consider a
simple model that illustrates the problem. We provide a theoretical analysis and simulation
evidence. Then we formulate the problem in a more general setting and discuss detecting
weak instruments and weak instrument robust methods.
2 Illustrative example
Consider the simple IV model for scalar variables
$$y_i = z_i \delta + \varepsilon_i$$
$$z_i = x_i \beta + u_i$$
where $\{(y_i, x_i, z_i)\}$ is an iid sample with $E(x_i^2) > 0$ and $E(z_i^2) > 0$, and
$$\begin{pmatrix} \varepsilon_i \\ u_i \end{pmatrix} \Big| x_i \overset{iid}{\sim} \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \Omega \right), \qquad \Omega = \begin{pmatrix} \sigma_\varepsilon^2 & \rho \\ \rho & \sigma_u^2 \end{pmatrix}.$$
In this simple model there is endogeneity if ρ ≠ 0, and the rank condition $E(z_i x_i) \neq 0$ simplifies to
$$E(z_i x_i) = E(x_i^2)\,\beta,$$
which implies that we need β ≠ 0. The simple IV estimator is given by
$$\hat\delta^{IV} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i z_i}\,.$$
From standard theory (see Hayashi, Chapter 3) we have
$$\sqrt{n}(\hat\delta^{IV} - \delta) \xrightarrow{d} N\!\left(0, \frac{\sigma_\varepsilon^2}{\beta^2 E(x_i^2)}\right), \qquad t_\delta = \frac{\sqrt{n}(\hat\delta^{IV} - \delta)}{\sqrt{\widehat{Avar}(\hat\delta^{IV})}} \xrightarrow{d} N(0, 1),$$
where $\widehat{Avar}(\hat\delta^{IV}) = n\,\hat\sigma_\varepsilon^2 \left(\sum_{i=1}^n x_i^2\right)\left(\sum_{i=1}^n x_i z_i\right)^{-2}$ and $\hat\sigma_\varepsilon^2 = \frac{1}{n}\sum_{i=1}^n (y_i - z_i \hat\delta^{IV})^2$.
Recall that the goal of the asymptotic analysis was to get a good approximation to the finite sample behavior of the estimator δ̂^IV. In particular, these limiting distributions are only useful for testing if they lead to good finite sample properties, such as negligible bias and correct rejection probabilities of tests in large samples.
2.1 Simulation evidence
To investigate the quality of the standard asymptotic theory we conduct the following simulation experiment. We simulate S = 1000 data sets of length n = 1000 from the simple IV model. We take δ = 1, x_i ∼ NID(0, 1), the errors ε_i and u_i are drawn from the normal distribution with unit variances and correlation ρ = 0.5, and we vary
$$\beta \in \{0,\ 0.2,\ 0.4,\ 0.6,\ 0.8,\ 1\}.$$
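To make the experiment concrete, here is a minimal NumPy sketch of this Monte Carlo design (the code and names such as simulate_iv are ours; the original notes contain no code):

```python
import numpy as np

def simulate_iv(n=1000, beta=0.0, delta=1.0, rho=0.5, rng=None):
    """One draw from the simple IV model; returns (delta_hat, t_stat)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(n)
    cov = [[1.0, rho], [rho, 1.0]]                      # unit variances, correlation rho
    eps, u = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    z = beta * x + u
    y = delta * z + eps
    delta_hat = (x @ y) / (x @ z)
    sigma2_eps = np.mean((y - z * delta_hat) ** 2)
    avar_hat = n * sigma2_eps * (x @ x) / (x @ z) ** 2  # Avar of sqrt(n)(delta_hat - delta)
    t_stat = np.sqrt(n) * (delta_hat - delta) / np.sqrt(avar_hat)
    return delta_hat, t_stat

rng = np.random.default_rng(0)
for beta in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    draws = np.array([simulate_iv(beta=beta, rng=rng) for _ in range(1000)])
    print(f"beta={beta:.1f}: median(delta_hat)={np.median(draws[:, 0]):.3f}, "
          f"P(|t| > 1.96)={np.mean(np.abs(draws[:, 1]) > 1.96):.3f}")
```

For small β the printed rejection rates should drift far from the nominal 5% and the median away from δ = 1, in line with the figures below.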
In Figure 1 we show the empirical distribution of δ̂^IV. While our theory implies that $\sqrt{n}(\hat\delta^{IV} - \delta) \overset{a}{\sim} N(0, Avar(\hat\delta^{IV}))$, we find that for small β the empirical distributions are far from normal. In particular, the distribution is not centered at one, hence there is an asymptotic bias, and the tails of the distribution are much heavier than those of the normal distribution.
In Figure 2 we show the empirical distribution of the t-statistic t_δ. Here we should have that $t_\delta \overset{a}{\sim} N(0, 1)$ under H₀: δ = 1, but instead we find again that the distribution for small β is far from normal.
Note that this is not a small sample problem as n = 1000. Also, note that apart from the
small β, the simulation experiment is very stylized with independent sampling and normal
errors. Figures 1 and 2 become considerably uglier when allowing for heteroskedasticity and
serial correlation.
Nelson & Startz (1990a,b) and Bound, Jaeger & Baker (1995) were the first to conduct this kind of simulation experiment. The latter were motivated by the study of Angrist & Krueger, which used quarter of birth as an instrument for education. Bound, Jaeger & Baker (1995) mimicked the study but replaced the instrument by a randomly drawn instrument. With this irrelevant instrument they obtained similar results, thus demonstrating the problem of weak instruments.
[Six histogram panels, one per β ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}.]
Figure 1: Empirical distribution of the δ̂^IV statistic for different instrument strengths.
[Six histogram panels, one per β ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}.]
Figure 2: Empirical distribution of the t_δ statistic for different instrument strengths.
2.2 Extreme case β = 0
To get a feeling for the problem we consider the extreme scenario where β = 0. This is the
case where the instrument is completely irrelevant. It follows that
$$\hat\delta^{IV} - \delta = \frac{\sum_{i=1}^n x_i \varepsilon_i}{\sum_{i=1}^n x_i z_i} = \frac{\sum_{i=1}^n x_i \varepsilon_i}{\sum_{i=1}^n x_i u_i}\,,$$
as when β = 0 we have z_i = u_i. To obtain a non-degenerate limit we rescale by 1/√n and take the limit²
$$\hat\delta^{IV} - \delta = \frac{\frac{1}{\sqrt{n}} \sum_{i=1}^n x_i \varepsilon_i}{\frac{1}{\sqrt{n}} \sum_{i=1}^n x_i u_i} \;\xrightarrow{d}\; \frac{\xi_\varepsilon}{\xi_u}\,, \qquad (1)$$
²Note that
$$\frac{1}{\sqrt{n}} \begin{pmatrix} \sum_{i=1}^n x_i \varepsilon_i \\ \sum_{i=1}^n x_i u_i \end{pmatrix} \xrightarrow{d} N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \mathrm{Var}(x_i \varepsilon_i) & \mathrm{Cov}(x_i \varepsilon_i, x_i u_i) \\ \mathrm{Cov}(x_i \varepsilon_i, x_i u_i) & \mathrm{Var}(x_i u_i) \end{pmatrix}\right) = N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, E(x_i^2) \begin{pmatrix} \sigma_\varepsilon^2 & \rho \\ \rho & \sigma_u^2 \end{pmatrix}\right)$$
The result then follows from Lemma 2.3(b) in Hayashi.
where $(\xi_\varepsilon, \xi_u)' \sim N(0, \Omega)$. If we define $\kappa = \rho/\sigma_u^2$ and write $\xi_\varepsilon = \kappa \xi_u + \xi$, we obtain
$$\hat\delta^{IV} - \delta \;\xrightarrow{d}\; \kappa + \frac{\xi}{\xi_u}\,.$$
This implies that if the instrument is irrelevant:

(i) δ̂^IV is inconsistent (expected, as δ is not identified);

(ii) δ̂^IV is centered around δ + κ, as ξ/ξu is symmetric (note that κ is exactly the probability limit of the inconsistent OLS estimator of δ);

(iii) asymptotically, δ̂^IV has heavy tails, as ξ/ξu is Cauchy distributed.
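Points (ii) and (iii) can be verified directly from the projection decomposition; a short derivation (our own working, using only properties of the bivariate normal):
$$\kappa = \frac{\mathrm{Cov}(\xi_\varepsilon, \xi_u)}{\mathrm{Var}(\xi_u)} = \frac{\rho}{\sigma_u^2}, \qquad \xi = \xi_\varepsilon - \kappa\,\xi_u \sim N\!\left(0,\ \sigma_\varepsilon^2 - \frac{\rho^2}{\sigma_u^2}\right), \qquad \mathrm{Cov}(\xi, \xi_u) = \rho - \kappa\,\sigma_u^2 = 0.$$
Hence ξ and ξu are independent mean-zero normals, so ξ/ξu is a (scaled) Cauchy random variable: it is symmetric around zero, which gives the centering of δ̂^IV at δ + κ, and its tails are so heavy that its mean does not exist.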
What we have here is an example of non-uniform asymptotics. If the instrument is relevant, E(z_i x_i) ≠ 0, then as the sample size increases (n → ∞) the estimator δ̂^IV is consistent and asymptotically normal. If E(z_i x_i) = 0, as shown above, the asymptotics break down. So a correlation of zero is a point of discontinuity of the asymptotics. That is, the limit of √n(δ̂^IV − δ) depends on the value of E(z_i x_i), which here acts as a nuisance parameter (a parameter that we do not care about per se, but which affects the distribution). This means that the convergence of √n(δ̂^IV − δ) to the normal distribution is not uniform with respect to the nuisance parameter: if E(z_i x_i) ≠ 0 but very small, the convergence is slow and a larger sample is required for the normal approximation to be accurate. One may hope that another asymptotic embedding will provide a better asymptotic approximation.
3 Weak instrument asymptotics
To get a better asymptotic approximation – that is uniform across the strength of the
instruments – Staiger & Stock (1997) introduced so-called weak instrument asymptotics.
The main idea is to model β as a parameter that slowly drifts to zero when the sample size
increases. This mimics more accurately empirical settings where β is small.
We illustrate this approach in a slightly more general setting than before. We consider the case where there is one endogenous variable and K instruments, and we retain the iid and homoskedasticity assumptions. The model then becomes
$$y_i = z_i \delta + \varepsilon_i$$
$$z_i = x_i'\beta + u_i$$
The two-stage least squares estimator is given by (see Hayashi, Section 3.8)
$$\hat\delta^{2SLS} = (Z'P_X Z)^{-1} Z'P_X y$$
where we use the notation $y = (y_1, \ldots, y_n)'$, $Z = (z_1, \ldots, z_n)'$, $X = (x_1, \ldots, x_n)'$ and $P_X = X(X'X)^{-1}X'$. To model the idea that β is small we use the drifting parameter set-up³
$$\beta = c/\sqrt{n}\,,$$
where c is a fixed K × 1 vector. This drifting parameter ensures that the F-statistic for testing H₀: β = 0 does not diverge under the alternative when n → ∞ (note that W = KF as we have homoskedasticity; see Hayashi, page 128).⁴ This is a key difference, as under the standard asymptotics of GMM theory we implicitly assume that the F-statistic diverges to infinity under the alternative. Instead, we now have that
$$K \times F = \hat\beta'(X'X)\hat\beta/\hat\sigma_u^2 = \beta'(X'X)\beta/\hat\sigma_u^2 + 2\beta'X'u/\hat\sigma_u^2 + u'X(X'X)^{-1}X'u/\hat\sigma_u^2 \;\xrightarrow{d}\; \chi^2(E(\mu^2), K)$$
where χ²(E(µ²), K) denotes the non-central χ² distribution with K degrees of freedom and non-centrality parameter E(µ²), and where µ² is known as the concentration parameter,
$$\mu^2 = \beta'X'X\beta/\sigma_u^2\,;$$
note that E(µ²) is constant as n → ∞.

³We used this as well when we studied the power of the asymptotic t-test. The objective here is similar in the sense that we use the drifting parameters to obtain a more relevant asymptotic approximation. Other examples in the literature that use this approach include the unit root literature, e.g. Bobkoski (1983), Cavanagh (1985), Chan and Wei (1987), and Phillips (1987).

⁴I use the F-statistic on purpose here, as this is the statistic that the weak instrument literature uses.
Intuitively, this asymptotic approximation implies that under the alternative of a non-
zero β the F -statistic does not diverge. This corresponds better to the finite sample behavior
of the first stage F -statistic as in practice we often do not find very large values of F which
would justify the conventional asymptotic set-up.
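A small simulation (our own sketch, in the spirit of Section 2.1 but with K instruments) illustrates that under β = c/√n the first-stage F-statistic stays bounded as n grows, with E(F) ≈ 1 + µ²/K:

```python
import numpy as np

def first_stage_F(n, c0, K=3, rng=None):
    """First-stage F for testing beta = 0 when beta = c/sqrt(n) with c = (c0, ..., c0)'."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, K))
    u = rng.standard_normal(n)
    z = X @ (np.full(K, c0) / np.sqrt(n)) + u
    beta_hat = np.linalg.solve(X.T @ X, X.T @ z)
    resid = z - X @ beta_hat
    sigma2_u = resid @ resid / (n - K)
    return beta_hat @ (X.T @ X) @ beta_hat / (K * sigma2_u)

rng = np.random.default_rng(1)
for n in [100, 1000, 10000]:
    mean_F = np.mean([first_stage_F(n, c0=2.0, rng=rng) for _ in range(500)])
    # mu^2 -> c'E(x_i x_i')c / sigma_u^2 = K * c0^2, so E(F) should settle near 1 + c0^2 = 5
    print(f"n={n}: mean F = {mean_F:.2f}")
```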
Next, we exploit the drifting parameter approach to derive the limiting distribution of
δ̂ 2SLS . We first consider
$$Z'P_X Z = (X\beta + U)'X(X'X)^{-1}X'(X\beta + U) = \left(\frac{1}{\sqrt{n}}X'(X\beta + U)\right)'\left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{\sqrt{n}}X'(X\beta + U)\right)$$
Substituting β = c/√n and factoring the middle matrix,
$$= \left(\left(\frac{1}{n}X'X\right)^{1/2}c + \left(\frac{1}{n}X'X\right)^{-1/2}\frac{1}{\sqrt{n}}X'U\right)'\left(\left(\frac{1}{n}X'X\right)^{1/2}c + \left(\frac{1}{n}X'X\right)^{-1/2}\frac{1}{\sqrt{n}}X'U\right) \;\xrightarrow{d}\; (\lambda + \zeta_u)'(\lambda + \zeta_u)$$
where $\lambda = E(x_i x_i')^{1/2} c$ and $(\zeta_\varepsilon', \zeta_u')' \sim N(0, \Omega \otimes I_K)$.⁵ Similarly, we have that
$$Z'P_X \varepsilon = (X\beta + U)'X(X'X)^{-1}X'\varepsilon = \left(\left(\frac{1}{n}X'X\right)^{1/2}c + \left(\frac{1}{n}X'X\right)^{-1/2}\frac{1}{\sqrt{n}}X'U\right)'\left(\frac{1}{n}X'X\right)^{-1/2}\frac{1}{\sqrt{n}}X'\varepsilon \;\xrightarrow{d}\; (\lambda + \zeta_u)'\zeta_\varepsilon$$
Combining the two results gives
$$\hat\delta^{2SLS} - \delta \;\xrightarrow{d}\; \frac{(\lambda + \zeta_u)'\zeta_\varepsilon}{(\lambda + \zeta_u)'(\lambda + \zeta_u)}$$
This expression covers both the weakly identified and strongly identified cases.
• If λ'λ = 0, we are back to the extreme case where the instruments are irrelevant, and we get
$$\hat\delta^{2SLS} - \delta \;\xrightarrow{d}\; \frac{\zeta_u'\zeta_\varepsilon}{\zeta_u'\zeta_u}\,;$$
note that if K = 1 this is the same as what we derived in equation (1).

• If λ'λ → ∞ we are back to the strong instrument case, as then⁶
$$\hat\delta^{2SLS} - \delta \;\xrightarrow{p}\; 0$$
and also
$$\sqrt{\lambda'\lambda}\,(\hat\delta^{2SLS} - \delta) \;\xrightarrow{d}\; N(0, \sigma_\varepsilon^2)$$

⁵Note that
$$\Omega \otimes I_K = \begin{pmatrix} \sigma_\varepsilon^2 I_K & \rho I_K \\ \rho I_K & \sigma_u^2 I_K \end{pmatrix}$$
⁶Formally, I use here that if $X_n \xrightarrow{d} c$ where c is a constant, then $X_n \xrightarrow{p} c$.

Since under the drifting parameter sequence the concentration parameter satisfies $\mu^2 \xrightarrow{p} \lambda'\lambda/\sigma_u^2$, we can think of µ² as playing the role of the effective sample size.
If µ² is large, the conventional normal approximation applies; if µ² is small, we instead converge to a mixture of normal random variables.
The proposed weak IV asymptotics yield good approximations to sampling distributions uniformly in µ² for moderate or large n. Under this nesting, estimators are not consistent and not asymptotically normal, test statistics (including the J-test) do not have their standard distributions, and confidence intervals do not have correct coverage.
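One way to see the weak-instrument limit at work is to simulate δ̂^2SLS under β = c/√n and compare it with draws from the limiting ratio derived above (a sketch with our own naming; since E(x_i x_i') = I_K here, λ = c):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K, delta, rho = 5000, 3, 1.0, 0.5
c = np.full(K, 1.0)                                  # lambda = c because E(x_i x_i') = I_K
Omega = np.array([[1.0, rho], [rho, 1.0]])
reps = 2000
est, lim = np.zeros(reps), np.zeros(reps)
for r in range(reps):
    X = rng.standard_normal((n, K))
    eps, u = rng.multivariate_normal([0.0, 0.0], Omega, size=n).T
    z = X @ (c / np.sqrt(n)) + u
    y = delta * z + eps
    Pz = X @ np.linalg.solve(X.T @ X, X.T @ z)       # P_X z
    est[r] = (Pz @ y) / (Pz @ z) - delta             # 2SLS estimation error
    zeta_eps, zeta_u = rng.multivariate_normal([0.0, 0.0], Omega, size=K).T
    lim[r] = ((c + zeta_u) @ zeta_eps) / ((c + zeta_u) @ (c + zeta_u))
print(np.quantile(est, [0.1, 0.5, 0.9]).round(3))    # finite-sample quantiles
print(np.quantile(lim, [0.1, 0.5, 0.9]).round(3))    # weak-IV limit quantiles
```

The two sets of quantiles should be close, illustrating that the drifting-parameter limit tracks the finite-sample distribution where the conventional normal approximation does not.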
But note that because µ² is unknown and cannot be consistently estimated, these limiting distributions cannot be used directly in practice to obtain a correct distribution for δ̂^2SLS.
4 Detecting weak instruments
The previous section suggests that if µ2 is large enough we could use our conventional GMM
theory. This raises the question of how to detect weak instruments in applications. A
natural initial suggestion is to test the hypothesis that the first-stage coefficient vector is equal to zero, β = 0.
As noted in Stock & Yogo (2005), however, conventional methods for inference on δ are
unreliable not only for β = 0, but also for β in a neighborhood of zero. Hence, we may reject
that β = 0 even when conventional inference procedures are unreliable. To overcome this
issue, we need formal procedures for detecting weak instruments, rather than tests for total
non-identification.
The weak instrument asymptotics above show that the first-stage F-statistic is distributed as a non-central χ² with a non-centrality parameter directly related to the concentration parameter µ². As a result, the first-stage F-statistic can serve as an indicator of the value of µ².
Idea: look at the first-stage F-statistic, since it is an indicator of µ², and choose a cut-off that guarantees either (i) a relative bias of less than x% (for estimation), or (ii) that a nominal level-α test over-rejects by less than x% (for testing and confidence sets).
(i) By relative bias we mean the following: the maximum bias of 2SLS is no more than
x% of the bias of OLS. Stock and Yogo (2005) provide a critical value table for a 5% test for different levels of bias and different K. The critical value for the tolerance level of 5% ranges
from 13 to 21 as K ranges from 3 to 30.
(ii) Besides bias, one might also want to know how far the usual t-statistic based on 2SLS is from N(0, 1). This determines how much over-rejection the 2SLS-based t-test has. Using similar logic, Stock and Yogo also proposed an F-statistic-based test of the null hypothesis H₀: a 5% t-test over-rejects by at most x%, against the alternative that it over-rejects by more. The critical values for this hypothesis are also tabulated in Stock and Yogo (2005). For x = 5, the critical value ranges from 16 to 86 (!) as K ranges from 1 to 30. (This probably means the 2SLS-based t-test usually doesn't work...)
Implementation:

• The first-stage F-statistic is the statistic for testing β = 0 in the first-stage regression
$$z_i = x_i'\beta + u_i$$

• We know that E(F) ≈ 1 + E(µ²)/K in large samples (the mean of χ²(E(µ²), K) is K + E(µ²), so dividing by K gives the result), so we can estimate µ²/K by F − 1.

• Compare the resulting estimate of µ²/K with the cut-offs (tables in Stock and Yogo); a code sketch follows this list. Note that the cut-offs are far higher than the conventional critical values for the F-test: instruments are weak more often than a pre-test of β = 0 would detect.
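A minimal sketch of this recipe (our own code; the Stock and Yogo cut-off tables themselves are not reproduced here):

```python
import numpy as np

def first_stage_diagnostics(z, X):
    """Homoskedastic first-stage F for z = X beta + u, plus the implied estimate of mu^2/K."""
    n, K = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ z)
    resid = z - X @ beta_hat
    sigma2_u = resid @ resid / (n - K)
    F = beta_hat @ (X.T @ X) @ beta_hat / (K * sigma2_u)
    return F, F - 1.0                                # estimate mu^2/K by F - 1

# hypothetical usage with simulated data
rng = np.random.default_rng(2)
n, K = 1000, 3
X = rng.standard_normal((n, K))
z = X @ np.full(K, 0.1) + rng.standard_normal(n)
F, mu2_over_K = first_stage_diagnostics(z, X)
print(f"F = {F:.2f}, estimated mu^2/K = {mu2_over_K:.2f}")  # compare with the Stock-Yogo cut-offs
```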
The formal calculations of Staiger & Stock and Stock & Yogo imply a somewhat crude
rule of thumb that is typically used in applied work: when F < 10 we have weak instruments,
when F > 10 we have strong instruments and conventional GMM methods can be used.
The procedure described above works only for a single endogenous variable z_i. If the regression has more than one endogenous (instrumented) regressor, the analog of the first-stage F-statistic is based on the matrix of first-stage coefficients and a test of the rank of that matrix; see Cragg and Donald (1993) for more details.
Caution! This test for weak IV assumes a homoskedastic setting! In heteroskedastic settings researchers often report the robust Wald statistic for judging instrument strength.⁸
In settings with a single endogenous variable the Kleibergen & Paap (2007) Wald statistic
is equivalent to the Wald statistic of Hayashi (divided by K) for testing β = 0, while in
settings with multiple endogenous regressors it is a robust analog of the Cragg & Donald
(1993) statistic. There seems to be no theoretical justification for the use of the Wald
statistic to gauge instrument strength in non-homoskedastic settings; see Andrews, Stock & Sun (2019) for more discussion.

⁸Use of F-statistics in non-homoskedastic settings is built into common statistical software. When run without assuming homoskedastic errors, the popular ivreg2 command in Stata automatically reports the Kleibergen & Paap (2007) Wald statistic for testing that β has reduced rank, along with critical values based on Stock & Yogo (2005) (Baum et al., 2007), though the output warns users about Stock & Yogo (2005)'s homoskedasticity assumption.
As an alternative, Montiel Olea & Pflueger (2013) propose a test for weak instruments based on the effective first-stage F-statistic,
$$F^{eff} = F\,\frac{K\,\hat\sigma_u^2}{\mathrm{Trace}\!\left(\widehat{Avar}(\hat\beta)\,\frac{1}{n}\sum_{i=1}^n x_i x_i'\right)}$$
The F^eff statistic allows for heteroskedasticity, and for K = 1 it can be used with the Stock & Yogo (2005) critical values based on t-test size (the mean of the IV estimator does not exist when K = 1).⁹

⁹This statistic can also be used under serial correlation. This would require adjusting the estimator for Avar(β̂), which is something that will be discussed in Advanced Econometrics III.
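For concreteness, here is a sketch of how F^eff could be computed with an HC0-type robust variance estimator for β̂ (the choice of robust estimator and all names are our own assumptions, not the authors' exact implementation):

```python
import numpy as np

def effective_F(z, X):
    """Effective first-stage F with an HC0-robust estimate of Avar(sqrt(n) beta_hat)."""
    n, K = X.shape
    XtX = X.T @ X
    Q = XtX / n                                      # (1/n) sum_i x_i x_i'
    beta_hat = np.linalg.solve(XtX, X.T @ z)
    resid = z - X @ beta_hat
    sigma2_u = resid @ resid / (n - K)
    F = beta_hat @ XtX @ beta_hat / (K * sigma2_u)   # conventional homoskedastic F
    meat = (X * (resid ** 2)[:, None]).T @ X / n     # (1/n) sum_i u_i^2 x_i x_i'
    Q_inv = np.linalg.inv(Q)
    avar_hat = Q_inv @ meat @ Q_inv                  # robust sandwich estimator
    return F * K * sigma2_u / np.trace(avar_hat @ Q)
```

Under homoskedasticity the trace term reduces to K σ̂u², so F^eff collapses back to the usual F.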
5 Weak instrument robust methods
As an alternative it is also possible to use methods that are robust to weak instruments.
That is, instead of using the GMM estimator (or 2SLS/IV estimators) for inference we can
consider different statistics that are robust to the weak instruments problem. The statistics
that are currently known are not estimators for δ, but rather test statistics for H0 : δ = δ̄.
Tests robust to weak instruments are supposed to maintain the correct size no matter whether instruments are weak or strong. This can be achieved in two ways: using
statistics whose distributions do not depend on µ², or conditioning on sufficient statistics for µ². The problem of robust inference is fully solved for the case of one endogenous variable; it is still an open question for the case of more than one endogenous variable.
There are two widely known statistics whose distributions do not depend on µ²: the Anderson–Rubin (AR) statistic and the Lagrange multiplier (LM) statistic.
AR test: Consider again the IV model with scalar z_i and a K × 1 vector of instruments x_i:
$$y_i = z_i \delta + \varepsilon_i$$
$$z_i = x_i'\beta + u_i$$
We test the hypothesis H₀: δ = δ̄. Under the null, y_i − z_i δ̄ is equal to ε_i and is uncorrelated with x_i (due to exogeneity of the instruments). The suggested test statistic is
$$AR = \frac{(y - Z\bar\delta)'P_X(y - Z\bar\delta)}{(y - Z\bar\delta)'M_X(y - Z\bar\delta)/(n - K)}$$
The distribution of AR does not depend on µ², as asymptotically
$$AR \xrightarrow{d} \chi^2(K)$$
The formula may remind you of the J-test for over-identifying restrictions: it would be the J-test if one were to plug in δ̂^2SLS.
In the more general situation with more than one endogenous variable and/or included exogenous regressors, the AR statistic is the F-statistic for testing that all coefficients on X are zero in the regression of y − Zδ̄ on the instruments X. Note that one tests all coefficients δ simultaneously (as a set) when there is more than one endogenous regressor.
AR confidence set: One can construct a confidence set robust to weak instruments by inverting the AR test, that is, by finding all δ̄ which are not rejected by the data. In this case, it is the set
$$\{\bar\delta : AR(\bar\delta) < \chi^2_{1-\alpha}(K)\}$$
The nice thing about this procedure is that solving for the confidence set is equivalent to solving a quadratic inequality in δ̄. This confidence set can be empty with positive probability (caution!).
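A grid-inversion sketch of the AR confidence set (our own code; an exact implementation would instead solve the quadratic inequality in δ̄ analytically):

```python
import numpy as np
from scipy.stats import chi2

def ar_confidence_set(y, z, X, alpha=0.05, grid=None):
    """Collect all candidate delta values not rejected by the AR test."""
    n, K = X.shape
    grid = np.linspace(-5.0, 5.0, 2001) if grid is None else grid
    P = X @ np.linalg.solve(X.T @ X, X.T)            # projection onto the instruments
    crit = chi2.ppf(1.0 - alpha, df=K)
    accepted = []
    for d in grid:
        e = y - z * d
        ePe = e @ P @ e
        ar = ePe / ((e @ e - ePe) / (n - K))         # e'M_X e = e'e - e'P_X e
        if ar < crit:
            accepted.append(d)
    return np.array(accepted)                        # empty array: empty confidence set
```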
LM test: The LM test formula can be found in Kleibergen (2002). It has a χ2 (1)
distribution irrespective of the strength of the instruments. The problem with this test,
though, is that it has non-monotonic power and tends to produce wider confidence sets than
the CLR test described below.
Conditional tests: The idea comes from Moreira (2003). He suggested evaluating any test statistic conditional on a sufficient statistic for µ², call it Q_n. By the definition of sufficiency, the conditional (on Q_n) distribution of any statistic does not depend on µ². So, instead of using fixed critical values, one uses critical values that depend on the realization of Q_n (and are therefore random), q_{1−α}(Q_n). Moreira also showed that any test that has exact size α for all values of the nuisance parameter µ² is a conditional test. The CLR (conditional likelihood ratio) test is a conditional test based on the LR statistic. The CLR is preferable since it possesses certain optimality properties.
6 Final comments
Weak instruments is an asymptotic problem or, better put, a problem of non-uniformity of the classical GMM asymptotics. As a result, the bootstrap, Edgeworth expansions, and subsampling are not appropriate solutions.
7 References
• Anderson T, Rubin H. 1949. Estimation of the parameters of a single equation in a complete system of stochastic equations. Annals of Mathematical Statistics 20:46-63
• Bound J, Jaeger D, Baker R. 1995. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association 90:443-450
• Cragg J, Donald S. 1993. Testing identifiability and specification in instrumental
variable models. Econometric Theory 9:222-240
• Kleibergen F. 2002. Pivotal statistics for testing structural parameters in instrumental
variables regression. Econometrica 70:1781-1803
• Kleibergen F, Paap R. 2007. Generalized reduced rank tests using the singular value
decomposition. Journal of Econometrics 133:97-126
• Montiel Olea J, Pflueger C. 2013. A robust test for weak instruments. Journal of
Business and Economic Statistics 31:358-369
• Moreira M. 2003. A conditional likelihood ratio test for structural models. Econometrica 71:1027-1048
• Nelson C, Startz R. 1990a. The distribution of the instrumental variable estimator and its t-ratio when the instrument is a poor one. Journal of Business 63:S125-S140
• Nelson C, Startz R. 1990b. Some further results on the exact small sample properties
of the instrumental variable estimator. Econometrica 58:967-976
• Staiger D, Stock J. 1997. Instrumental variables regression with weak instruments.
Econometrica 65:557-586
• Stock J, Yogo M. 2005. Testing for weak instruments in linear IV regression. In: Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg. Cambridge University Press, 80-108