THE QUARTERLY JOURNAL OF ECONOMICS
Vol. 134, May 2019, Issue 2
CHANNELING FISHER: RANDOMIZATION TESTS AND THE
STATISTICAL INSIGNIFICANCE OF SEEMINGLY
SIGNIFICANT EXPERIMENTAL RESULTS∗
ALWYN YOUNG
I follow R. A. Fisher's The Design of Experiments (1935), using randomization statistical inference to test the null hypothesis of no treatment effects in a comprehensive sample of 53 experimental papers drawn from the journals of the American Economic Association. In the average paper, randomization tests of the significance of individual treatment effects find 13% to 22% fewer significant results than are found using authors' methods. In joint tests of multiple treatment effects appearing together in tables, randomization tests yield 33% to 49% fewer statistically significant results than conventional tests. Bootstrap and jackknife methods support and confirm the randomization results. JEL Codes: C12, C90.
I. INTRODUCTION
In contemporary economics, randomized experiments are seen as solving the problem of endogeneity, allowing for the identification and estimation of causal effects.
∗ I am grateful to Larry Katz, Alan Manning, David McKenzie, Ben Olken, Steve Pischke, Jonathan de Quidt, Eric Verhoogen, and anonymous referees for helpful comments; to Ho Veng-Si for numerous conversations; and to the following scholars (and by extension their coauthors) who, displaying the highest standards of academic integrity and openness, generously answered questions about their randomization methods and data files: Lori Beaman, James Berry, Yan Chen, Maurice Doyon, Pascaline Dupas, Hanming Fang, Xavier Giné, Jessica Goldberg, Dean Karlan, Victor Lavy, Sherry Xin Li, Leigh L. Linden, George Loewenstein, Erzo F. P. Luttmer, Karen Macours, Jeremy Magruder, Michel André Maréchal, Susanne Neckerman, Nikos Nikiforakis, Rohini Pande, Michael Keith Price, Jonathan Robinson, Dan-Olof Rooth, Jeremy Tobacman, Christian Vossler, Roberto A. Weber, and Homa Zarghamee.
© The Author(s) 2018. Published by Oxford University Press on behalf of President and Fellows of Harvard College. All rights reserved. For Permissions, please email: journals.permissions@oup.com
The Quarterly Journal of Economics (2019), 557–598. doi:10.1093/qje/qjy029. Advance Access publication on November 21, 2018.
Randomization has an additional strength: it allows for the construction of tests that are exact, that is, with a distribution that is known no matter what the sample size or the characteristics of the errors. Randomized experiments, however, rarely make use of such methods, largely presenting only conventional econometric tests using asymptotically accurate clustered/robust covariance estimates. In this article, I apply randomization tests to 53 randomized experiments, using them to construct counterparts to conventional tests of the significance of individual treatment effects, as well as tests of the combined significance of multiple treatment effects appearing together within regressions or in tables presented by authors. In tests of individual treatment effects, randomization tests on average reduce the number of significant results relative to those found by authors by 22% and 13% at the .01 and .05 levels, respectively. The reduction in rates of statistical significance is greater in higher-dimensional tests. In joint tests of all treatment effects appearing together in tables, for example, randomization inference on average produces 49% and 33% fewer .01 and .05 significant results, respectively, than comparable conventional tests based on clustered/robust covariance estimates. Bootstrap and jackknife methods validate the randomization results, producing substantial reductions in rates of statistical significance relative to authors' methods.
The discrepancy between the results reported in journals and those found in this article can be traced to leverage, a measure of the degree to which individual observations on right-hand-side variables take on extreme values and are influential. A concentration of leverage in a few observations makes coefficients and standard errors extremely volatile, as their value becomes dependent on the realization of a small number of residuals, generating t-statistic distributions with much larger tail probabilities than recognized by putative degrees of freedom and producing sizable size distortions. I find that the discrepancies between authors' results and those based on randomization, bootstrap, or jackknife inference are largely limited to the papers and regressions with concentrated leverage. The results presented by most authors in the first table of their main analysis are generally robust to the use of alternative inference procedures, but as the data are explored, through the subdivision of the sample or the interaction of treatment measures with nontreatment covariates, leverage becomes concentrated in a few observations and large discrepancies appear between authors' results and those found using alternative methods. In sum, regression design is systematically worse in some papers, and systematically deteriorates within papers as authors explore their data, producing less reliable inference using conventional procedures.
Joint and multiple testing is not a prominent feature of experimental papers (or of any other field in economics), but arguably it should be. In the average paper in my sample, 60% of regressions contain more than one reported treatment effect.1 When a .01 significant result is found, on average there are 4.0 reported treatment effects (and additional unreported coefficients on treatment measures), but only 1.6 of these are significant. Despite this, only two papers report any F-tests of the joint significance of all treatment effects within a regression. Similarly, when a table reports a .01 significant result, on average there are 21.2 reported treatment effects and only 5.0 of these are significant, but no paper provides combined tests of significance at the table level. Authors explore their data, independently and at the urging of seminar participants and referees, interacting treatment with participant covariates within regressions and varying specifications and samples across columns in tables. Readers need assistance in evaluating the evidence presented to them in its entirety. Increases in dimensionality, however, magnify the woes brought on by concentrated leverage, as inaccuracies in the estimation of high-dimensional covariance matrices and extreme tail probabilities translate into much greater size distortions. One of the central arguments of this article is that randomization provides virtually the only reliable approach to accurate inference in high-dimensional joint- and multiple-testing procedures, as even other computationally intensive methods, such as the bootstrap, perform poorly in such situations.

1. In this article, I use the term regression to refer broadly to an estimation procedure involving dependent and independent variables that produces coefficients and standard error estimates. Most of these are ordinary least squares (OLS).
Randomization tests have what some consider a major weakness: they provide exact tests, but only of sharp nulls, that is, nulls that specify a precise treatment effect for each participant. Thus, in testing the null of no treatment effects, randomization inference does not test whether the average treatment effect is zero, but rather whether the treatment effect is zero for each and every participant. This null is not unreasonable, despite its apparent stringency, as it merely states that the experimental treatment is irrelevant, a benchmark arguably worth examining and (hopefully) rejecting. The problem is that randomization tests are not necessarily robust to deviations away from sharp nulls, as in the presence of unaccounted-for heterogeneity in treatment effects they can have substantial size distortions (Chung and Romano 2013; Bugni, Canay, and Shaikh 2017). This is an important concern, but not one that necessarily invalidates this article's results or its advocacy of randomization methods. First, confirmation from bootstrap and jackknife results, which test average treatment effects, and the systematic concentration of differences in high-leverage settings support the interpretation that the discrepancies between randomization results and authors' methods have more to do with size distortions in the latter than in the former. Second, average treatment effects intrinsically generate treatment-dependent heteroskedasticity, which renders conventional tests inaccurate in finite samples as well. Although robust covariance estimates have asymptotically correct size, asymptotic accuracy in the face of average treatment effects is equally a feature of randomization inference, provided treatment is balanced or appropriately studentized statistics are used in the analysis (Janssen 1997; Chung and Romano 2013; Bugni, Canay, and Shaikh 2017). I provide simulations suggesting that in the face of heterogeneous treatment effects, t-statistic-based randomization tests provide size that is much more accurate than clustered/robust methods. Moreover, in high-dimensional tests randomization tests appear to provide the only basis for accurate inference, if only of sharp nulls.
This article takes well-known issues and explores them in a broad practical sample. Consideration of whether randomization inference yields different results than conventional inference is not new. Lehmann (1959) showed that in a simple test of binary treatment, a randomization t-test has an asymptotic distribution equal to that of the conventional t-test; Imbens and Wooldridge (2009) found little difference between randomization and conventional tests for binary treatment in a sample of eight program evaluations. The tendency of White's (1980) robust covariance matrix to produce rejection rates higher than nominal size was quickly recognized by MacKinnon and White (1985), while Chesher and Jewitt (1987) and Chesher (1989) traced the bias and volatility of these standard error estimates to leverage. This article links these literatures, finding that randomization and conventional results are very similar in the low-leverage situations examined in earlier papers but differ substantially, both in individual results and in average rejection rates, in high-leverage conditions, where clustered/robust procedures produce large size distortions. Several recent papers (Anderson 2008; Heckman et al. 2010; Lee and Shaikh 2014; List, Shaikh, and Xu 2016) have explored the robustness of individually significant results to step-down randomization multiple-testing procedures in a few experiments. This article, in contrast, emphasizes the differences between randomization and conventional results in joint and multiple testing and shows how increases in dimensionality multiply the problems and inaccuracies of inexact inferential procedures, making randomization inference a nearly essential tool in these methods. It also highlights the practical value of joint tests as an alternative approach with different power properties than multiple-testing procedures.
The article proceeds as follows. Section II explains the criteria used to select the 53-paper sample, which includes every paper revealed by a keyword search on the American Economic Association (AEA) website that provides data and code and allows for randomization inference. Section III provides background information in the form of a thumbnail review of the role of leverage in generating volatile coefficients and standard error estimates, the logic and methods of randomization inference, and the different emphases of joint- and multiple-testing procedures. Section IV uses Monte Carlos to illustrate how unbalanced leverage produces size distortions using clustered/robust techniques, the comparative robustness of t-statistic-based randomization tests to deviations away from sharp nulls, and the expansion of inaccuracies in high-dimensional testing. Section V provides the analysis of the sample, producing the results mentioned above, and Section VI concludes with some suggestions for improved practice.
The results of this research are anonymized, as the objective of this article is to improve methods, not to target individual results. Thus, no information can be provided in the article, public use files, or private discussion regarding the results for particular papers. For the sake of transparency, I provide code that shows how each paper was analyzed, but readers eager to know how a particular paper fared will have to execute this code themselves. A Stata ado file, available on my website, calculates randomization p-values for most Stata estimation commands, allowing users to call for randomization tests in their own research.
II. THE SAMPLE
My sample is based on a search on www.aeaweb.org using the abstract and title keywords "random" and "experiment," restricted to the American Economic Review (AER), American Economic Journal (AEJ): Applied Economics, and AEJ: Microeconomics, which yielded papers up through the March 2014 issue of the AER. I then dropped papers that:

• did not provide public use data files and Stata-compatible do-file code;
• were not randomized experiments;
• did not have data on participant characteristics; or
• had no regressions that could be analyzed using randomization inference.
Public use data files are necessary to perform any analysis, and I had prior experience with Stata, which is by far the most popular program in this literature. My definition of a randomized experiment excluded natural experiments (e.g., those based on an administrative legal change) but included laboratory experiments (i.e., experiments taking place in universities or research centers or recruiting their subjects from such populations). The sessional treatment of laboratory experiments is not generally explicitly randomized, but when queried, laboratory experimenters indicated that they believed treatment was implicitly randomized through the random arrival of participants to different sessions. The requirement that the experiment contain data on participant characteristics was designed to produce a sample that used mainstream multivariate regression techniques with estimated coefficients and standard errors.
Not every regression presented in papers based on randomized experiments can be analyzed using randomization inference. To allow for randomization inference, the regression must contain a common outcome observed under different treatment conditions. This is often not the case. For example, if participants are randomly given different roles and the potential action space differs for the two roles (e.g., in the dictator-recipient game), there is no common outcome between the two groups that can be examined. In other cases, participants under different treatment regimes do have common outcomes, but authors evaluate each treatment regime using a separate regression, without using any explicit inferential procedure to compare differences. One could, of course, develop appropriate conventional and randomization tests by stacking the regressions, but this involves an interpretation of the authors' intent in presenting the side-by-side regressions, which could lead to disputes. I make it a point to adhere to the precise specification presented in tables.
Within papers, regressions were selected if they allow for and do not already use randomization inference and:2

• appear in a table and involve a coefficient estimate and standard error; and
• pertain to treatment effects and not to an analysis of randomization balance, sample attrition, or nonexperimental cohorts;

while tests were done on the null that:

• randomized treatment has no effect, but participant characteristics or other nonrandomized treatment conditions might have an influence.

2. One paper used randomization inference throughout and was dropped, and five others present some randomization-based exact (e.g., Wilcoxon rank sum) tests.
In many tables, means are presented without standard errors or p-values, that is, without any attempt at statistical inference. I do not test these. Variations on regressions presented in tables are often discussed in the surrounding text, but interpreting the specification correctly without the aid of the supplementary information presented in tables is extremely difficult because there are often substantial do-file inaccuracies. Consequently, I limited myself to specifications presented in tables. Papers often include tables devoted to an analysis of randomization balance or sample attrition, with the intent of showing that treatment was uncorrelated with either. I do not include any of these in my analysis. Similarly, I drop regressions projecting the behavior of nontreatment cohorts on treatment measures, which are typically used by authors to reinforce the internal validity of the experiment. In difference-in-difference equations, I only test the treatment coefficients associated with differences during the treatment period. I test, universally, the null of no randomized treatment effect, including treatment interactions with other covariates, while allowing participant characteristics or nonrandomized experimental treatment to influence behavior. For example, in the regression

(1)   y = \alpha + \beta_T T + \beta_{age}\, age + \beta_{T*age}\, T{*}age + \beta_{convex}\, convex + \beta_{T*convex}\, T{*}convex + \varepsilon,

where T is a randomized treatment measure, age is participant age, and convex is a nonrandomized payment scheme introduced in later rounds of an experiment, I rerandomize the allocation of T, repeatedly recalculating T*age and T*convex, and use the distribution of test statistics to test the null that T, T*age, and T*convex have 0 effects on all participants.
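To make these rerandomization mechanics concrete, the following is a minimal Python sketch (the article's own implementation is a Stata ado file; the data-generating process and all variable names here are hypothetical). It reallocates T, rebuilds the interactions T*age and T*convex, and compares joint Wald statistics across draws, using the conservative add-one p-value rather than the tie-breaking formula introduced in Section III.B.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data for a specification like equation (1)
n = 400
age = rng.integers(18, 65, size=n).astype(float)
convex = (rng.random(n) < 0.5).astype(float)      # nonrandomized payment scheme
T = np.zeros(n)
T[rng.choice(n, size=n // 2, replace=False)] = 1.0
y = 1.0 + 0.03 * age + 0.5 * convex + rng.standard_normal(n)  # null true: T irrelevant

def joint_wald(T, y, age, convex):
    """OLS Wald statistic for the joint null that T, T*age, T*convex all have zero effects."""
    X = np.column_stack([np.ones(n), T, age, T * age, convex, T * convex])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    meat = (X * (e ** 2)[:, None]).T @ X          # HC1 robust "meat"
    V = XtX_inv @ meat @ XtX_inv * n / (n - X.shape[1])
    idx = [1, 3, 5]                               # positions of T, T*age, T*convex
    b = beta[idx]
    return b @ np.linalg.inv(V[np.ix_(idx, idx)]) @ b

# Under the sharp null of zero effects, y is unchanged when T is reallocated;
# the interactions are recomputed inside joint_wald for each draw.
w_obs = joint_wald(T, y, age, convex)
draws = [joint_wald(rng.permutation(T), y, age, convex) for _ in range(2000)]
p = (1 + sum(w >= w_obs for w in draws)) / (1 + len(draws))
print(f"randomization p-value for the joint null: {p:.3f}")
```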
I was able to analyze almost all papers and regressions that met the sample selection guidelines described above. The do-files of papers are often inaccurate, producing regressions that are different from those reported in the published paper, but an analysis of the public use data files generally allows one to arrive at a specification that reproduces the published results within a small margin of error on coefficients and standard errors. There are only a handful of regressions, in four papers, that could not be reproduced and included in the sample. To permute the randomization outcomes of a paper, one needs information on stratification (if any was used) and the code and methods that produced complicated treatment measures distributed across different data files. I have called on a large number of authors, who have generously answered questions and provided code to identify strata, create treatment measures, and link data files. Knowing no more than that I was working on a paper on experiments, these authors displayed an extraordinary degree of scientific openness and integrity. Only two papers, and an additional segment from another, were dropped from my sample because the authors could not provide the information necessary to rerandomize treatment outcomes.
Table I summarizes the characteristics of my final sample, after reduction based on the criteria described above. I examine 53 papers, 14 of which are laboratory experiments and 39 of which are field experiments. Twenty-seven of the papers appeared in the AER, 21 in the AEJ: Applied Economics, and 5 in the AEJ: Microeconomics. The number of tables reporting estimates and standard errors for treatment effects that I am able to analyze using randomization inference varies substantially across papers, with 17 papers having only one or two such tables and 19 presenting five to eight. The number of coefficients reported in these tables varies even more, with one paper reporting 260 treatment coefficients and another only 2. I deal with the heterogeneity in the number of treatment results by adopting the convention of always reporting the average across papers of the within-paper average measure, so each paper, regardless of the number of coefficients, regressions, or tables, carries an equal weight in summary statistics. Although most papers report all of the treatment effects in their estimating equations, some do not, and the number of such unreported auxiliary coefficients ranges from 1 to 48 in seven papers to 76 to 744 in five papers. To avoid the distracting charge that I tested irrelevant treatment characteristics, I restrict the analysis that follows to reported coefficients. Results that include unreported treatment effects, in the Online Appendix, exhibit very much the same patterns.

TABLE I
CHARACTERISTICS OF THE SAMPLE

                 53 papers                      Treatment coefficients          1,780 regressions
Location    Journal    Tables         Reported        Unreported        Method        Covariance
39 Field    27 AER     17: 1–2        17: 2–30        41: 0             0.67 OLS      0.25 default
14 Lab      26 AEJ     17: 3–4        18: 32–80       7: 1–48           0.22 MLE      0.70 cl/robust
                       19: 5–8        18: 90–260      5: 76–744         0.11 other    0.04 bootstrap
                                                                                      0.02 other

Notes. For papers, numbers reported are the number of papers by characteristic. For regressions, numbers reported are the average across papers of the share of regressions within each paper with the noted characteristic.
My sample contains 1,780 regressions, broadly defined as a self-contained estimation procedure with dependent and independent variables that produces coefficient estimates and standard errors. In the average paper, 67% of these are OLS regressions (including weighted), 22% involve maximum likelihood estimation (MLE; mostly discrete choice models), and the remaining 11% include handfuls of quantile regressions, two-step Heckman models, and other methods. In the typical paper, 25% of regressions make use of Stata's default (i.e., homoskedastic) covariance matrix calculation, 70% avail themselves of clustered/robust estimates of covariance, 4% use the bootstrap, and the remaining 2% use hc2/hc3-type corrections of clustered/robust covariance estimates. In 171 regressions in 12 papers (8 lab, 4 field), treatment is applied to groups, but the authors do not cluster or systematically cluster at a lower level of aggregation. This is not best practice, as correlations between the residuals of individuals playing games together in a lab or living in the same geographical region are quite likely. By clustering below the treatment level, these authors treat the grouping of observations in laboratory sessions or geographical areas as nominal. In implementing randomization, bootstrap, and jackknife inference in this article, I defer to this judgment, randomizing and sampling at the level at which they clustered (or didn't), treating the actual treatment grouping as irrelevant. Results with randomization and sampling at the treatment level, reported in the Online Appendix, find far fewer significant treatment effects.3

3. In another three papers, the authors generally cluster at treatment level but fail to cluster some regressions. I randomize these at the treatment level so as to calculate the joint distribution of coefficients across equations. In three papers where the authors cluster across treatment groupings, I rerandomize at the treatment level.

FIGURE I
Sensitivity to Outliers
III. ISSUES AND METHODS

III.A. Problems of Conventional Inference in Practical Applications
One of the central characteristics of my sample is its remarkable sensitivity to outliers. Figure I, Panel A plots the maximum and minimum coefficient p-values, using authors' methods, found when one deletes one cluster or observation at a time from each regression in my sample against the p-value found with the full sample.4 With the removal of just one observation, 35% of .01-significant reported results in the average paper can be rendered insignificant at that level. Conversely, 16% of .01-insignificant reported results can be found to be significant at that level. Figure I, Panel B graphs the difference between these maximum and minimum p-values against the number of clusters/observations in the regression. In the average paper the mean difference between the maximum and minimum delete-one p-values is .23. To be sure, the problem is more acute in smaller samples, but surprising sensitivity can be found in samples with 1,000 clusters or observations and even in those with more than 50,000 observations.
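The delete-one exercise behind Figure I is easy to sketch. Below is an illustrative Python fragment on simulated data (the data-generating process is hypothetical, and HC1 robust standard errors stand in for authors' varied methods); it recomputes the treatment p-value with each observation removed and reports the resulting range.

```python
import numpy as np
from scipy import stats

def treatment_pvalue(X, y, j=1):
    """Two-sided p-value for coefficient j, using an HC1 robust standard error."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    V = XtX_inv @ ((X * (e ** 2)[:, None]).T @ X) @ XtX_inv * n / (n - k)
    return 2 * stats.t.sf(abs(beta[j] / np.sqrt(V[j, j])), df=n - k)

rng = np.random.default_rng(1)
n = 100
T = (rng.random(n) < 0.25).astype(float)            # unbalanced 25/75 treatment
y = 0.4 * T + (1 + 2 * T) * rng.standard_normal(n)  # heteroskedastic outcome
X = np.column_stack([np.ones(n), T])

p_full = treatment_pvalue(X, y)
p_del = [treatment_pvalue(np.delete(X, i, axis=0), np.delete(y, i)) for i in range(n)]
print(f"full sample p = {p_full:.3f}; delete-one range = [{min(p_del):.3f}, {max(p_del):.3f}]")
```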
A few simple formulas identify the sources of delete-one sensitivity. In OLS regressions, which make up much of my sample, the coefficient estimate with observation i removed, β̂_∼i, is related to the coefficient estimate from the full sample, β̂, through the formula:

(2)   \hat\beta_{\sim i} = \hat\beta - \frac{\tilde x_i\,\hat\varepsilon_i}{\bigl(\sum_i \tilde x_i^2\bigr)(1 - h_{ii})},

where x̃_i denotes the ith residual from the projection of independent variable x on the other regressors in the n × k matrix of regressors X, ε̂_i the ith residual of the full regression, and h_ii, commonly known as leverage, is the ith diagonal element of the hat matrix H = X(X'X)^{-1}X'.5 The robust variance estimate can be expressed as:

(3)   \frac{1}{\sum_i \tilde x_i^2}\sum_i \tilde h_{ii}\,\frac{n}{n-k}\,\hat\varepsilon_i^2, \qquad \text{where } \tilde h_{ii} = \frac{\tilde x_i^2}{\sum_i \tilde x_i^2}.
4. Where authors cluster, I delete clusters; otherwise I delete individual observations.
5. So called because ŷ = Hy. The formula for the deletion of vector i of clustered observations is \hat\beta_{\sim i} = \hat\beta - \frac{\tilde x_i'(I_i - H_{ii})^{-1}\hat\varepsilon_i}{\tilde x'\tilde x}. When the coefficient on a variable is determined by an individual observation or cluster, h_ii equals 1 or (in the cluster case) I_i − H_ii is singular. In this case, the delete-i formula for the remaining coefficients calculates H_ii using the residuals of the remaining regressors projected on the variable in question.
h̃_ii might be called coefficient leverage, because it is the ith diagonal element of the hat matrix H̃ = x̃(x̃'x̃)^{-1}x̃' for the partitioned regression. As seen in expressions (2) and (3), when coefficient leverage is concentrated in a few observations, coefficient and standard error estimates, depending on the realization of residuals, are potentially sensitive to the deletion of those observations.

Sensitivity to a change in the sample is an indication that results are dependent on the realizations of a small set of disturbances. In non-i.i.d. settings, this translates into inaccurate inference for a given sample, the object of interest in this article. The summation in expression (3) is a weighted average, as h̃_ii varies between 0 and 1 and sums to 1 across all observations. With concentrated leverage, robust standard error estimates depend heavily on a small set of stochastic disturbances and become intrinsically more volatile, producing t-statistic distributions that are more dispersed than recognized by nominal degrees of freedom. When the effects of right-hand-side variables are heterogeneous, the residuals have a heteroskedastic variance that is increasing in the magnitude of the regressor. This makes the robust standard error even more volatile, as it now places a disproportionate weight on disproportionately volatile residuals. Concentrated leverage also shrinks estimated residuals, as coefficient estimates respond to the realization of the disturbance, so a heavy weight is placed on residuals that are biased toward 0, biasing the standard error estimate in the same direction.6 Thus, the estimates of the volatility of coefficients and of the volatility of the standard error are both biased downward, producing t-statistic distributions with underappreciated tail probabilities.

6. As an extreme example, when coefficient leverage for observation i is 1, ŷ_i = y_i, the estimated residual for i is always 0, and the robust standard error estimate for the coefficient is 0 as well.
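A short Python sketch (hypothetical simulated data) shows expressions (2) and (3) at work: it computes leverage h_ii and coefficient leverage h̃_ii, and verifies the delete-one formula in (2) against brute-force re-estimation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.exponential(size=n)                  # skewed regressor -> concentrated leverage
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # h_ii: diagonal of X(X'X)^{-1}X'

# x-tilde: residual of x projected on the other regressor (here, the constant)
x_t = x - x.mean()
h_coef = x_t ** 2 / (x_t ** 2).sum()         # coefficient leverage, sums to 1
print(f"largest coefficient-leverage share: {h_coef.max():.3f}")

# Delete-one slope from equation (2) versus re-estimating without observation i
i = int(h_coef.argmax())
b_formula = beta[1] - x_t[i] * e[i] / ((x_t ** 2).sum() * (1 - h[i]))
b_direct = np.linalg.lstsq(np.delete(X, i, axis=0), np.delete(y, i), rcond=None)[0][1]
print(f"delete-one slope: formula {b_formula:.6f}, re-estimate {b_direct:.6f}")
```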
Table II reports the total coefficient leverage accounted for by the clusters or observations with the largest leverage in my sample. I calculate the observation-level shares h̃_ii, sum across observations within clusters if the regression is clustered, and then report the average across papers of the within-paper mean share of the cluster/observation with the largest coefficient leverage ("max") and the total leverage share accounted for by the largest 1st, 5th, and 10th percentiles of the distribution. I include measures for non-OLS regressions in these averages as well, because all of these contain a linear x_i'β term and leverage plays a similar role in their standard error estimates.7
TABLE II
SHARES OF COEFFICIENT LEVERAGE FOR REPORTED TREATMENT EFFECTS

                                       With authors' covariates       Without covariates
                                       Max    1%     5%     10%       Max    1%     5%     10%
All 53 papers                          0.058  0.091  0.216  0.338     0.057  0.089  0.217  0.345
Low leverage (18 papers)               0.008  0.043  0.167  0.290     0.007  0.038  0.159  0.281
Medium leverage (17 papers)            0.030  0.070  0.200  0.322     0.029  0.070  0.207  0.338
High leverage (18 papers)              0.134  0.158  0.279  0.402     0.132  0.157  0.283  0.414
First table (45 papers)                0.031  0.064  0.183  0.302     0.027  0.058  0.173  0.293
Other tables (45 papers)               0.049  0.085  0.215  0.341     0.050  0.084  0.214  0.342
With interactions (29 papers)          0.054  0.109  0.268  0.411     0.062  0.120  0.303  0.462
Without interactions (29 papers)       0.036  0.065  0.176  0.297     0.028  0.053  0.150  0.264

Notes. Figures are the average across papers of the within-paper average measure for reported coefficients. Max, 1%, 5%, and 10% = cumulative leverage share of clusters/observations with the largest leverage, ranging from the observation with the maximum through the 1st, 5th, and 10th percentiles of the distribution; "without covariates" = leverage shares with nontreatment covariates other than the constant term excluded from the regression.
As shown in the table, in the mean paper the largest cluster/observation has a leverage share of 0.058, while the top 1st and 10th percentiles account for 0.091 and 0.338 of total leverage, respectively. These shares vary substantially by paper. Dividing the sample into thirds based on the average within-paper share of the maximal cluster/observation, one sees that in the low-leverage third the average share of this cluster/observation is 0.008, whereas in the high-leverage third it is 0.134 (with a mean as high as 0.335 in one paper).
Table II also compares the concentration of leverage in the first table where authors present their main results against later tables, in papers that have more than one table reporting treatment effects.8 Leverage is more concentrated in later tables, as authors examine subsets of the sample or interact treatment with nontreatment covariates. Specifically comparing coefficients appearing in regressions where treatment is interacted with covariates against those where it is not, in papers that contain both types of regressions, we see that regressions with interactions have more concentrated leverage. The presence of nontreatment covariates in the regression per se, however, does not have a very strong effect on coefficient leverage, as the table shows by recalculating treatment coefficient leverage shares with nontreatment covariates excluded from the regression (but treatment interactions with covariates retained). This is to be expected if covariates are largely orthogonal to treatment.

7. Thus, the often-used robust covariance matrix for maximum likelihood models can be re-expressed as ARA', where, with D_1 and D_2 denoting diagonal matrices of the observation-level 1st and 2nd derivatives of the ln-likelihood with respect to x_i'β, R = (X'X)^{-1}(X'D_1 D_1 X)(X'X)^{-1} and A = (−X'D_2 X)^{-1}X'X. With D_1 serving as the residual, leverage plays the same role in determining the elements of R as it does in the OLS covariance estimate.
8. Main results are identified as the first section with this title (a frequent feature) or a title describing an outcome of the experiment (e.g., "Does treatment-name affect interesting-outcome?").
A few examples illustrate how regression design can lead to concentrated leverage. Binary treatment applied 50/50 to the whole sample, with otherwise only a constant term in the regression, produces uniform leverage. Apply three binary treatments and control each to one-quarter of the population, and in a joint regression with a constant term each treatment arm concentrates the entirety of leverage in one-half of the observations. The clustered/robust covariance estimate is now based on only half of the residuals and consequently has a volatility (degrees of freedom) consistent with half the sample size. As is often done, run the regression using only one of the three treatment measures as a right-hand-side variable, so that binary treatment in the regression is applied in 25/75 proportions, and one-quarter of observations account for three-quarters of leverage. Apply 50/50 binary treatment, and create a second treatment measure by interacting it with a participant characteristic that rises uniformly in even discrete increments within treatment and control, and one-fifth of observations account for about three-fifths of coefficient leverage for the binary treatment measure (even without the nontreatment characteristic in the regression). Seemingly innocuous adjustments in regression design away from the binary 50/50 baseline generate substantially unbalanced leverage, producing clustered/robust covariance estimates and t-statistics that are much more dispersed than recognized.
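The 25/75 case in this paragraph is easy to verify numerically; a short check in Python (arbitrary sample size):

```python
import numpy as np

n = 1000
T = np.zeros(n)
T[:250] = 1.0                            # treatment applied to one-quarter of the sample

T_t = T - T.mean()                       # projection residual on the constant term
shares = T_t ** 2 / (T_t ** 2).sum()     # coefficient leverage shares, summing to 1
print(f"leverage share of the treated quarter: {shares[:250].sum():.3f}")  # 0.750
```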
III.B. Randomization Statistical Inference
Randomization inference provides exact tests of sharp (i.e., precise) hypotheses no matter what the sample size, regression design, or characteristics of the disturbance term. The typical experimental regression can be described as y_E = T_E β_t + Xβ_x + ε, where y_E is the n × 1 vector of experimental outcomes, T_E an n × t matrix of treatment variables (including possibly interactions with nontreatment covariates), and X an n × k matrix of nonrandomized covariates. Conventional econometrics describes the statistical distribution of the estimated βs as coming from the stochastic draw of the disturbance term ε, and possibly the regressors, from a population distribution. In contrast, in randomization inference the motivating thought experiment is that, given the sample of experimental participants, the only stochastic element determining the realization of outcomes is the randomized allocation of treatment. For each participant, the observed outcome y_i is conceived as a determinate function of the treatment t_i allocated to that participant, y_i(t_i). Consequently, the known universe of potential treatment allocations determines the statistical distribution of the estimated βs and can be used to test sharp hypotheses that precisely specify the treatment effect for each participant, because sharp hypotheses of this sort allow the calculation of what outcomes would have been for any potential random allocation of treatment. Consider, for example, the null hypothesis that the treatment effects in the equation above equal β_0 for all participants. Under this null, the outcome vector that would have been observed had the treatment allocation been T_S rather than T_E is given by y_S = y_E − T_E β_0 + T_S β_0, and this value, along with T_S and the invariant characteristics X, can be used to calculate estimation outcomes under treatment allocation T_S.9
An exact test of a sharp null is constructed by calculating possible realizations of a test statistic and rejecting if the observed realization in the experiment itself is extreme enough. In the typical experiment there is a finite set Ω of equally probable potential treatment allocations T_S. Let f(T_E) denote a test statistic calculated using the treatment applied in the experiment and f(T_S) the known (under the sharp null) value the statistic would have taken if the treatment allocation had been T_S. If the total number of potential treatment allocations in Ω is M, the p-value of the experiment's test statistic is given by:

(4)   \text{randomization } p\text{-value} = \frac{1}{M}\sum_{S=1}^{M} I_S({>}T_E) + U \cdot \frac{1}{M}\sum_{S=1}^{M} I_S({=}T_E),
where I_S(>T_E) and I_S(=T_E) are indicator functions for f(T_S) > f(T_E) and f(T_S) = f(T_E), respectively, and U is a random variable drawn from the uniform distribution. In words, the p-value of the randomization test equals the fraction of potential outcomes that have a more extreme test statistic, added to the fraction that have an equal test statistic times a uniformly distributed random number. In the Online Appendix, I prove that this p-value is uniformly distributed, that is, the test is exact with a rejection probability equal to the nominal level of the test.

9. Imbens and Rubin (2015) provide a thorough presentation of inference using randomized experiments, contrasting and exploring the Fisherian potential outcomes and Neymanian population sampling approaches.
Calculating equation (4), evaluating f(T_S) for all possible treatment realizations in Ω, is generally impractical. However, under the null, random sampling with replacement from Ω allows the calculation of an equally exact p-value, provided the original treatment result is automatically counted as a tie with itself. Specifically, with N additional draws (beyond the original treatment) from Ω, the p-value of the experimental result is given by:

(5)   \text{sampling randomization } p\text{-value} = \frac{1}{N+1}\sum_{S=1}^{N} I_S({>}T_E) + U \cdot \frac{1}{N+1}\Bigl(1 + \sum_{S=1}^{N} I_S({=}T_E)\Bigr).

In the Online Appendix, I show that this p-value is uniformly distributed regardless of the number of draws N used in its evaluation.10 This establishes that size always equals nominal value, even though the full distribution of randomization outcomes is not calculated. However, provided it is a concave function of the nominal size of the test, power is increasing in N (Jockel 1986). Intuitively, as the number of draws increases, the procedure is better able to identify what constitutes an outlier outcome in the distribution of the test statistic f(Ω). In my analysis, I use 10,000 draws to evaluate equation (5). When compared with results calculated with fewer draws, I find no appreciable change in rejection rates beyond 2,000 draws.

10. The proof is a simple extension of Jockel's (1986) result for nominal size equal to a multiple of 1/(N + 1). It generalizes to treatment allocations that are not equally probable by simply duplicating each treatment outcome in Ω according to its relative frequency, so that each element in Ω becomes equally likely.
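A minimal Python sketch of the sampling p-value in equation (5), assuming the test statistic f has been evaluated at the experimental allocation and at N sampled reallocations; the difference-in-means example below is a hypothetical illustration under the sharp null of zero effects, so the outcome vector is unchanged when treatment is reallocated.

```python
import numpy as np

def sampling_randomization_pvalue(f_obs, f_draws, rng):
    """Equation (5): the original treatment counts as a tie with itself,
    and ties are broken with a uniform random variable U."""
    f_draws = np.asarray(f_draws)
    N = len(f_draws)
    greater = (f_draws > f_obs).sum()
    equal = (f_draws == f_obs).sum()
    U = rng.random()
    return (greater + U * (1 + equal)) / (N + 1)

rng = np.random.default_rng(3)
n = 40
T = np.zeros(n)
T[rng.choice(n, size=20, replace=False)] = 1.0
y = rng.standard_normal(n)                      # sharp null of zero effects holds

f = lambda t: abs(y[t == 1].mean() - y[t == 0].mean())
draws = [f(rng.permutation(T)) for _ in range(10000)]
print(f"p-value: {sampling_randomization_pvalue(f(T), draws, rng):.4f}")
```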
One drawback of randomization inference, easily missed in the short presentation above, is that in equations with multiple treatment measures the p-value of the null for one coefficient generally depends on the null assumed for other treatment measures, as these nulls influence the outcome y_S that would have been observed for treatment allocation T_S. It is possible in some multitreatment cases to calculate p-values for individual treatment measures that do not depend on the null for other treatments by considering a subset of the universe of potential randomization allocations that holds other treatments constant.11 Such calculations, however, must be undertaken with care, as there are many environments where it is not possible to conceive of holding one treatment measure constant while varying another.12 In the results reported in this article, I always test the null that all treatment effects are 0, and all reported p-values for joint or individual test statistics are under that joint null. In the Online Appendix I calculate, where possible, alternative p-values for individual treatment effects in multitreatment equations that do not depend on the null for other treatment measures. On average, the results are less favorable to my sample (i.e., reject less often and produce bigger p-value changes).
I make use of two randomization-based test statistics, which find counterparts in commonly used bootstrap tests. The first is based on a comparison of the Wald statistics of the conventional tests of the significance of treatment effects, as given by β̂_t(T_S)'V(β̂_t(T_S))^{-1}β̂_t(T_S), where β̂_t and V(β̂_t) are the regression's treatment coefficients and the estimated covariance matrix of those coefficients. This method in effect calculates the probability that

(6)   \hat\beta_t(T_S)' V(\hat\beta_t(T_S))^{-1} \hat\beta_t(T_S) \;\geq\; \hat\beta_t(T_E)' V(\hat\beta_t(T_E))^{-1} \hat\beta_t(T_E).

I use the notation (T_S) to emphasize that both the coefficients and the covariance matrix are calculated for each realization of the random draw T_S from Ω. This test might be called the randomization-t, as in the univariate case it reduces to a comparison of squared t-statistics. It corresponds to bootstrap tests based on the percentiles of Wald statistics.

11. Consider the case with control and two mutually exclusive treatment regimes denoted by the dummy variables T_1 and T_2. Holding the allocation of T_2 constant (for example), one can rerandomize T_1 across those who received T_1 or control, modifying y for the hypothesized effects of T_1 only, and calculate a p-value for the effect of T_1 that does not depend on the null for T_2.
12. Consider, for example, the case of treatment interactions with covariates (which arises frequently in my sample), as in the equation y = α + β_T T + β_{age} age + β_{T*age} T*age + ε. It is not possible to rerandomize T holding T*age constant, or to change T*age while holding T constant, so there is no way to calculate a p-value for either effect without taking a stand on the null for the other.
An alternative test of no treatment effects is to compare the relative values of β̂_t(T_S)'V(β̂_t(Ω))^{-1}β̂_t(T_S), where V(β̂_t(Ω)) is the covariance of β̂_t across the universe of potential treatment draws in Ω. In this case, a fixed covariance matrix is used to evaluate the coefficients produced by each randomized draw T_S from Ω, calculating the probability that

(7)   \hat\beta_t(T_S)' V(\hat\beta_t(\Omega))^{-1} \hat\beta_t(T_S) \;\geq\; \hat\beta_t(T_E)' V(\hat\beta_t(\Omega))^{-1} \hat\beta_t(T_E).

In the univariate case, this reduces to the square of the coefficients divided by a common variance and, after eliminating the common denominator, a simple comparison of squared coefficients. Hence, I refer to this as the randomization-c. It corresponds to bootstrap tests which use the distribution of bootstrapped coefficients to calculate the covariance matrix. In the analysis of my sample, I use 10,000 randomization draws to approximate V(β̂_t(Ω)).
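A sketch of the two statistics in a univariate case, on hypothetical data: with a single treatment measure, the randomization-t of inequality (6) compares squared robust t-statistics draw by draw, while the randomization-c of inequality (7) reduces to a comparison of squared coefficients.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 2000
T = np.zeros(n)
T[: n // 2] = 1.0
y = rng.standard_normal(n)                    # sharp null of zero effects holds

def coef_and_t(t, y):
    """Slope and HC1 robust t-statistic from regressing y on a constant and t."""
    X = np.column_stack([np.ones(len(y)), t])
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    V = XtX_inv @ ((X * (e ** 2)[:, None]).T @ X) @ XtX_inv * len(y) / (len(y) - 2)
    return b[1], b[1] / np.sqrt(V[1, 1])

b_E, t_E = coef_and_t(T, y)
stats = np.array([coef_and_t(rng.permutation(T), y) for _ in range(reps)])

p_t = (stats[:, 1] ** 2 >= t_E ** 2).mean()   # randomization-t, inequality (6)
p_c = (stats[:, 0] ** 2 >= b_E ** 2).mean()   # randomization-c, inequality (7)
print(f"randomization-t p = {p_t:.3f}, randomization-c p = {p_c:.3f}")
```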
Although in principle all randomization test statistics are equally valid, in practice I find the randomization-t to be superior to the -c. First, when jointly testing more than one treatment effect, the -c relies on a sampling approximation of the coefficient covariance matrix. Consequently, the comparison in inequality (7) is not, strictly speaking, a draw-by-draw comparison of f(T_S) to f(T_E), and the assumptions underlying the proof that equation (5) is exact do not hold. In fact, in simulations (further below) I find statistically significant deviations from nominal size of the -c in joint tests of true sharp nulls. Second, when the sharp null is false and heterogeneity in treatment effects exists, the randomization-c performs very poorly, even in tests of individual treatment effects, but the randomization-t does quite well, as shown later. The greater robustness of the randomization-t to an error in the underlying assumptions is clearly a desirable feature. That said, in the actual analysis of my sample, results using the randomization-c and -t are very similar.
III.C. Joint- versus Multiple-Hypothesis Testing
I use joint- and multiple-testing procedures to test the null that all treatment effects reported together in regressions or tables are 0. The two approaches provide power against different alternatives, as illustrated in Figure II, which considers the case of testing the significance of two coefficients whose distribution is known to be normal and independent of each other.13 The rectangle in the figure is the acceptance region for the two coefficients tested individually with a multiple-testing adjustment to critical values, whereas the oval is the Wald acceptance region for the joint significance of the two coefficients. In the multiple-testing framework, to keep the probability of one or more Type I errors across the two tests at level α, one could select a size η for each test such that 1 − (1 − η)² = α. The probability of no rejections under the null, given by the integral of the probability density inside the rectangle, then equals 1 − α. The integral of the probability density inside the Wald ellipse is also 1 − α. The Wald ellipse, however, is the translation-invariant procedure that minimizes the area such that the probability of falling in the acceptance region is 1 − α.14 It achieves this, relative to the multiple-testing rectangle, by dropping corners where the probability of two extreme outcomes is low and increasing the acceptance region along the axes. Consequently, the joint test has greater power to reject in favor of alternatives within quadrants, whereas multiple testing has greater power to reject when alternatives lie on the axes. In the analysis of the experimental sample further below, I find that joint testing produces rejection rates that are generally slightly greater than those found using multiple-testing procedures; that is, while articles emphasize the extreme effects of individual statistically significant treatment measures, evidence in favor of the relevance of treatment is at least as often found in the modest effects of multiple aspects of treatment.

FIGURE II
Acceptance Regions for Joint and Multiple Testing with Independent Estimates

13. A version of this diagram can be found in Savin (1984).
14. A procedure is translation invariant if, after adding a constant to both the point estimate and the null, one remains in the confidence region. Stein (1962) provides examples of procedures that do not satisfy this requirement but produce smaller confidence regions.
Multiple testing is an evolving literature. The classical Bonferroni method evaluates each test at the α/N level, which, based on Boole's inequality, ensures that the probability of a Type I error in N tests is less than or equal to α for any correlation between the test statistics. For values of α such as 0.01 or 0.05, the gap between α/N and the p-value cutoff η = 1 − (1 − α)^{1/N} that would be appropriate if the test statistics were known to be independent, as in the example above, is minuscule. Nevertheless, because Bonferroni's method does not make use of information on the covariance of p-values, it can be quite conservative. For example, if the p-values of individual tests are perfectly correlated under the null, then α is the αth percentile of their minimum and hence provides an α probability of a Type I error when applied to all tests. In recognition of this, Westfall and Young (1993) suggested using bootstrap or randomization inference to calculate the joint distribution of p-values and then using the αth percentile of the minimum as the cutoff value. In the analysis of the sample below, I find that Westfall and Young's procedure yields substantially higher rejection rates than Bonferroni's method in table-level tests of treatment effects, as coefficients appearing in different columns of tables are often highly (if not perfectly) correlated.
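A sketch of the single-step Westfall-Young min-p test just described, in Python. The toy example takes the perfectly correlated limiting case discussed above, where the resampled cutoff approaches α itself rather than Bonferroni's α/N.

```python
import numpy as np

def westfall_young_minp(p_obs, p_draws, alpha=0.05):
    """Reject the combined null if the smallest observed p-value lies below the
    alpha-th percentile of the minimum p-value across draws. p_draws is an
    (N draws) x (K tests) array of p-values computed under the null, e.g.,
    from rerandomized treatment allocations."""
    cutoff = np.quantile(p_draws.min(axis=1), alpha)
    return p_obs.min() <= cutoff, cutoff

rng = np.random.default_rng(5)
K, N = 10, 10000
common = rng.random(N)                        # one uniform p-value shared by all K tests
p_draws = np.tile(common[:, None], (1, K))    # perfectly correlated under the null
reject, cutoff = westfall_young_minp(rng.random(K), p_draws)
print(f"min-p cutoff = {cutoff:.4f} versus Bonferroni cutoff = {0.05 / K:.4f}")
```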
While joint testing produces a single 0/1 decision, multiple testing allows for further tests, as following an initial rejection one can step down through the remaining tests using less demanding cutoffs (e.g., Holm 1979; Westfall and Young 1993). Step-down procedures of this sort require either "subset pivotality" (Westfall and Young 1993), that is, that the multivariate distribution of p-values for subsets of hypotheses does not depend on the truth of other hypotheses, or more generally, that critical values are weakly monotonic in subsets of hypotheses (Romano and Wolf 2005). Both conditions trivially hold when authors kindly project a different dependent variable on a single treatment measure in each column of a table. This rarely occurs. Within equations, treatment measures are interacted with covariates, making the calculation of a randomization distribution without an operational null on each treatment measure impossible, as noted earlier. Across columns of tables, the same dependent variable is usually projected on slightly different specifications or subsamples, making the existence of nonzero effects in one specification and a sharp null in another logically impossible.15 However, the null that every aspect of treatment has zero effects everywhere on everyone can always be tested.

15. As examples: (i) having rejected the null of zero effects for women, it is not possible to consider a sharp null of zero in an equation that combines men and women; (ii) having rejected the null of zero effects in the projection of an outcome on treatment and covariates, it is not possible to then consider a sharp null of zero in the projection of the outcome on treatment alone.
I use joint- and multiple-testing procedures in this article to highlight the relevance of randomization inference in these settings, as the size distortions of inexact methods are much larger in higher-dimensional joint tests and in evaluating extreme tail probabilities. In multiple testing I restrict attention to the existence of any rejection, as this initial test can be applied to any group of results in my sample. Alternative multiple-testing procedures start with the same initial Bonferroni or Westfall-Young cutoff, and hence their initial decisions are subsumed in those results.16 The existence of any rejection in multiple testing also produces a result equivalent to the joint test, that is, a statement that the combined null is false, allowing a comparison of the two methods and of the evidentiary value of traditionally emphasized treatment effects on axes against that provided by the combinations of treatment effects found within quadrants.

16. Thus, for example, control of the false discovery rate at rate α using Benjamini and Hochberg's (1995) step-down procedure imposes a rejection criterion of α/N for the first step.
IV. MONTE CARLOS
In this section, I use simulations with balanced and unbalanced regression design and fixed and heterogeneous treatment effects to compare rejection rates of true nulls using clustered/robust covariance estimates to results obtained using randomization inference, as well as those found using the bootstrap and jackknife. For randomization inference and the bootstrap I use the randomized and bootstrapped distribution of coefficients and robust t-statistics to evaluate the p-value, that is, the -t and -c methods described earlier in inequalities (6) and (7). The bootstrap-t is generally considered superior to the -c because its rejection probabilities converge more rapidly asymptotically to nominal size (Hall 1992). For OLS regressions, the jackknife substitutes ε̃_i, the residual for observation i when the delete-i coefficient estimates are used to calculate predicted values for that observation, for the [n/(n − k)]^{1/2}-adjusted estimated residual used in the clustered/robust formula (3) earlier, which has the disadvantage of being biased toward 0 in high-leverage observations. It is equivalent to the hc3 finite-sample correction of clustered/robust covariance estimates, which appears to provide better inference in finite samples (MacKinnon and White 1985).
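As a sketch of this substitution in the unclustered OLS case (hypothetical data; the article also applies the idea at the cluster level), the fragment below contrasts the hc1 variance of expression (3) with the hc3/jackknife-type version that uses the delete-i residual ε̂_i/(1 − h_ii):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
x = rng.exponential(size=n)                   # concentrated leverage
X = np.column_stack([np.ones(n), x])
y = 0.5 * x + rng.standard_normal(n)

XtX_inv = np.linalg.inv(X.T @ X)
e = y - X @ (XtX_inv @ X.T @ y)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverage h_ii

# hc1: residuals scaled by [n/(n-k)]^(1/2); hc3: delete-i residuals e_i/(1-h_ii)
V_hc1 = XtX_inv @ ((X * (e ** 2)[:, None]).T @ X) @ XtX_inv * n / (n - 2)
V_hc3 = XtX_inv @ ((X * ((e / (1 - h)) ** 2)[:, None]).T @ X) @ XtX_inv
print(f"hc1 slope se: {np.sqrt(V_hc1[1, 1]):.4f}, hc3 slope se: {np.sqrt(V_hc3[1, 1]):.4f}")
```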
Table III reports size at the .05 level of the different methods in tests of individual coefficients. Panel A uses the data-generating process y_i = α + t_i β_i + ε_i, with ε_i distributed i.i.d. standard normal and t_i a 0/1 treatment measure administered in a balanced (50/50) or unbalanced (10/90) fashion. For β_i, I consider both fixed treatment effects, with β_i = β for all observations, and heterogeneous treatment effects, with β_i distributed i.i.d. standard normal or i.i.d. chi2. Panel B uses the data-generating process y_i = α + t_i β_i + t_i x_i γ_i + x_i + ε_i, where ε_i is again distributed i.i.d. normal and t_i is a 0/1 treatment measure administered in a balanced (50/50) fashion. Treatment interacts with a participant characteristic x_i, which is distributed i.i.d. exponential with mean 1. Once again, the parameters β_i and γ_i are either fixed or distributed i.i.d. normal or chi2. Sample sizes range from 20 to 2,000. In each case, I use OLS to estimate average treatment effects in a specification that follows the data-generating process. With 10,000 simulations there is a 0.99 probability of estimated size lying between 0.044 and 0.056 if the rejection probability is actually 0.05.
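For the clustered/robust column of Panel A alone, the size calculation can be sketched in a few lines of Python (hypothetical seed and a reduced number of replications; the article uses 10,000 simulations and the full battery of methods):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, sims = 20, 2000
T = np.zeros(n)
T[:2] = 1.0                                   # unbalanced 10/90 design

def robust_pvalue(t, y):
    X = np.column_stack([np.ones(len(y)), t])
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    V = XtX_inv @ ((X * (e ** 2)[:, None]).T @ X) @ XtX_inv * len(y) / (len(y) - 2)
    return 2 * stats.t.sf(abs(b[1] / np.sqrt(V[1, 1])), df=len(y) - 2)

# Size under the true null (beta = 0): the fraction of robust p-values below .05
rejections = sum(robust_pvalue(T, rng.standard_normal(n)) < 0.05 for _ in range(sims))
print(f"robust rejection rate at the .05 level: {rejections / sims:.3f}")
```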
Two patterns emerge in Table III. First, with evenly dis-
tributed leverage, all methods, with the exception perhaps of the
bootstrap-c, do reasonably well. This is apparent in the left side of
Panel A, where leverage is evenly distributed in all samples, but
also in the rapid convergence of size to nominal value with un-
balanced regression design in the right side of Panel A, where the
maximal leverage of a single observation falls from 0.45 to 0.045 to
0.0045 with the increase in the sample size. Things proceed much
less smoothly in Panel B, where the exponentially distributed co-
variate ensures that the maximal observation leverage share re-
CHANNELING FISHER 579
TABLE III
SIZE AT THE 0.05 LEVEL IN 10,000 SIMULATIONS
(REJECTION RATES IN TESTS OF THE TRUE MEAN OF THE DATA-GENERATING PROCESS)
                  Balanced regression design                    Unbalanced regression design
Robust  Rand-t  Rand-c  Boot-t  Boot-c  J-knife       Robust  Rand-t  Rand-c  Boot-t  Boot-c  J-knife
  (1)     (2)     (3)     (4)     (5)     (6)           (1)     (2)     (3)     (4)     (5)     (6)
Panel A: Tests of effects of binary treatment (t_i) given the data-generating process y_i = α + t_i ∗ β_i + ε_i
Fixed treatment effects: β_i = β, ε_i ∼ standard normal
20 0.048 0.048 0.048 0.039 0.068 0.044 0.241 0.046 0.048 0.000 0.108 0.140
200 0.048 0.048 0.048 0.050 0.051 0.048 0.067 0.051 0.050 0.049 0.063 0.057
2,000 0.049 0.049 0.049 0.050 0.050 0.049 0.053 0.052 0.052 0.051 0.053 0.052
Heterogeneous treatment effects: β_i ∼ standard normal, ε_i ∼ standard normal
20 0.052 0.052 0.052 0.040 0.073 0.046 0.283 0.089 0.129 0.000 0.129 0.172
200 0.053 0.052 0.052 0.053 0.055 0.052 0.064 0.051 0.131 0.045 0.060 0.055
2,000 0.049 0.048 0.048 0.048 0.048 0.049 0.052 0.052 0.137 0.051 0.052 0.051
Heterogeneous treatment effects: β_i ∼ chi2, ε_i ∼ standard normal
20 0.060 0.062 0.062 0.046 0.082 0.055 0.290 0.091 0.144 0.000 0.131 0.174
200 0.054 0.055 0.055 0.051 0.055 0.053 0.083 0.065 0.189 0.056 0.079 0.071
2,000 0.045 0.045 0.045 0.045 0.045 0.045 0.054 0.052 0.195 0.051 0.054 0.053
Panel B: Tests given the data-generating process y_i = α + t_i ∗ β_i + t_i ∗ x_i ∗ γ_i + x_i + ε_i
     Of coefficient on binary treatment (t_i)          Of coefficient on interaction (t_i ∗ x_i)
Fixed treatment effects: β_i = β, γ_i = γ, ε_i ∼ standard normal
20 0.064 0.051 0.052 0.025 0.049 0.039 0.114 0.052 0.054 0.037 0.021 0.042
200 0.053 0.049 0.049 0.051 0.052 0.048 0.064 0.049 0.050 0.054 0.051 0.049
2,000 0.051 0.050 0.050 0.051 0.051 0.051 0.052 0.049 0.050 0.050 0.051 0.049
Heterogeneous treatment effects: β_i, γ_i, & ε_i ∼ standard normal
20 0.077 0.055 0.056 0.028 0.052 0.038 0.205 0.074 0.067 0.068 0.042 0.065
200 0.071 0.055 0.056 0.052 0.066 0.053 0.101 0.066 0.072 0.058 0.087 0.072
2,000 0.054 0.053 0.054 0.047 0.056 0.051 0.058 0.053 0.056 0.048 0.058 0.053
Heterogeneous treatment effects: β_i & γ_i ∼ chi2, ε_i ∼ standard normal
20 0.072 0.057 0.055 0.022 0.046 0.034 0.205 0.076 0.071 0.073 0.038 0.074
200 0.069 0.059 0.060 0.052 0.066 0.054 0.134 0.105 0.107 0.100 0.117 0.107
2,000 0.064 0.061 0.062 0.058 0.065 0.061 0.077 0.074 0.072 0.063 0.077 0.074
Note. 20, 200, 2,000 = number of observations in the regression; randomization and bootstrap test statistics
evaluated using 1,000 draws with -t versions using robust standard error estimates; clustered/robust tests
evaluated using conventional n – k degrees of freedom; jackknife tests evaluated using n – 1 degrees of
freedom.
Two patterns emerge in Table III. First, with evenly distributed
leverage, all methods, with the exception perhaps of the
bootstrap-c, do reasonably well. This is apparent in the left side of
Panel A, where leverage is evenly distributed in all samples, but
also in the rapid convergence of size to nominal value with
unbalanced regression design in the right side of Panel A, where the
maximal leverage of a single observation falls from 0.45 to 0.045 to
0.0045 with the increase in the sample size. Things proceed much
less smoothly in Panel B, where the exponentially distributed
covariate ensures that the maximal observation leverage share
remains above 0.029 (and as high as 0.11) more than one-quarter
of the time even in samples with 2,000 observations. Asymptotic
theorems rely on an averaging that is precluded when estimation
places a heavy weight on a small number of observations, so
regression design rather than the crude number of observations is
probably a better guide to the quality of inference based on these
theorems.
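The maximal leverage figures quoted above follow directly from the share of a single observation in the demeaned treatment regressor's sum of squares; a quick check, assuming this coefficient-leverage definition:

    import numpy as np

    def max_coef_leverage(n, p_treat=0.10):
        # Share of the largest observation in the sum of squares of
        # the demeaned treatment dummy, i.e., max_i t_dm_i^2 / sum(t_dm^2).
        t = np.zeros(n)
        t[: round(p_treat * n)] = 1.0
        t_dm = t - t.mean()
        return (t_dm ** 2).max() / (t_dm ** 2).sum()

    for n in (20, 200, 2000):
        print(n, max_coef_leverage(n))   # 0.45, 0.045, 0.0045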
Second, Table III shows that the randomization-t is much
more robust than the -c to deviations from the sharp null.
When heterogeneous treatment effects not accounted for in
the randomization test are introduced, rejection rates using
the randomization-c rise well above nominal value, but the
randomization-t continues to do well, with rejection probabilities
that are closer to nominal size than any method other than the
bootstrap-t. The intuition for this result is fairly simple. Hetero-
geneous treatment effects introduce heteroskedasticity correlated
with extreme values of the regressors, making coefficient esti-
mates more volatile. When treatment is rerandomized with a
sharp null adjustment to the dependent variable equal to the
mean treatment effect of the data-generating process, the av-
erage treatment effect is retained, but the correlation between
the residual and the treatment regressor is broken, so the coef-
ficient estimate becomes much less volatile. When the deviation
of the original coefficient estimate from the null is compared to
this randomized distribution of coefficients, it appears to be an
outlier, generating a randomization-c rejection probability well
in excess of nominal size, as shown in the table. In contrast,
the randomization-t adjusts the initial coefficient deviation from
the null using its large robust standard error estimate and all
the later, less volatile, coefficient deviations from the null us-
ing the much smaller robust standard error estimates that arise
when heteroskedasticity is no longer correlated with the regres-
sors. By implicitly taking into account how rerandomization re-
duces the correlation between heteroskedastic residuals and the
treatment regressor, the randomization-t adjusts for how reran-
domization reduces the dispersion of coefficient estimates around
the null.
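A minimal sketch of the randomization-t procedure this paragraph describes, for a test of the sharp null of zero treatment effect for every observation (so no adjustment of the dependent variable is needed), with rerandomization by simple permutation; for a nonzero sharp null one would subtract the hypothesized effect from treated outcomes before permuting:

    import numpy as np

    def robust_t(y, t):
        # OLS t-statistic on a 0/1 treatment with an HC1 robust SE.
        n = len(y)
        X = np.column_stack([np.ones(n), t])
        XtX_inv = np.linalg.inv(X.T @ X)
        b = XtX_inv @ (X.T @ y)
        e = y - X @ b
        meat = X.T @ (X * (e ** 2)[:, None])
        V = (n / (n - 2)) * XtX_inv @ meat @ XtX_inv
        return b[1] / np.sqrt(V[1, 1])

    def randomization_t_pvalue(y, t, draws=1000, seed=0):
        # Under the sharp null y is unchanged by reassignment, so each
        # rerandomized |t| is an equally likely draw from the test
        # statistic's exact distribution.
        rng = np.random.default_rng(seed)
        t_obs = abs(robust_t(y, t))
        exceed = sum(abs(robust_t(y, rng.permutation(t))) >= t_obs
                     for _ in range(draws))
        return (exceed + 1) / (draws + 1)   # count the observed draw

The randomization-c variant replaces robust_t with the raw coefficient deviation, which is what loses the self-correction just described.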
Table IV evaluates rejection rates of the different methods
in joint and multiple tests of the significance of treatment
effects in 10 independent equations of the form used in Table III,
Panel A. The top panel reports the frequency with which at least
one true null is rejected using the Bonferroni multiple-testing
critical value of 0.05/10 = 0.005. The bottom panel reports the
frequency with which the joint null that all 10 coefficients equal the
data-generating value is rejected, using White’s (1982) extension
of the robust covariance method to estimate the covariance of
treatment coefficients across equations for the conventional Wald
statistic and the randomization and bootstrap estimates of its
distribution.
TABLE IV
SIZE AT THE 0.05 LEVEL IN 10,000 SIMULATIONS OF JOINT AND MULTIPLE TESTS
(10 INDEPENDENT EQUATIONS WITH THE DATA-GENERATING PROCESS OF TABLE III, PANEL A)
Balanced regression design Unbalanced regression design
Robust Rand-t Rand-c Boot-t Boot-c J-knife Robust Rand-t Rand-c Boot-t Boot-c J-knife
(1) (2) (3) (4) (5) (6) (1) (2) (3) (4) (5) (6)
Panel A: Bonferroni tests: probability of a rejection of any of the 10 true nulls
Fixed treatment effects: β_i = β, ε_i ∼ standard normal
20 0.056 0.055 0.055 0.026 0.128 0.049 0.682 0.051 0.045 0.000 0.065 0.489
200 0.047 0.047 0.047 0.046 0.053 0.046 0.098 0.051 0.053 0.039 0.081 0.080
2,000 0.052 0.052 0.052 0.051 0.051 0.052 0.052 0.048 0.048 0.047 0.052 0.051
Average treatment effects: β_i ∼ standard normal, ε_i ∼ standard normal
20 0.053 0.052 0.052 0.023 0.126 0.046 0.813 0.143 0.193 0.000 0.094 0.611
200 0.052 0.051 0.051 0.050 0.056 0.051 0.106 0.056 0.277 0.041 0.090 0.085
2,000 0.047 0.047 0.047 0.045 0.044 0.046 0.051 0.049 0.275 0.048 0.050 0.050
Average treatment effects: β_i ∼ chi2, ε_i ∼ standard normal
20 0.078 0.078 0.078 0.037 0.164 0.068 0.821 0.157 0.230 0.000 0.093 0.621
200 0.057 0.059 0.059 0.051 0.061 0.057 0.173 0.106 0.436 0.078 0.143 0.143
2,000 0.051 0.051 0.051 0.051 0.052 0.051 0.073 0.068 0.478 0.060 0.071 0.071
Panel B: Joint tests: probability of rejecting the jointly true null
Fixed treatment effects: β_i = β, γ_i = γ, ε_i ∼ standard normal
20 0.594 0.053 0.060 0.000 0.595 0.378 0.998 0.051 0.058 0.000 0.000 0.995
200 0.071 0.049 0.054 0.045 0.078 0.060 0.386 0.048 0.055 0.007 0.346 0.324
2,000 0.053 0.048 0.054 0.049 0.057 0.052 0.069 0.048 0.054 0.046 0.071 0.065
Average treatment effects: β_i, γ_i, & ε_i ∼ standard normal
20 0.615 0.064 0.074 0.000 0.617 0.404 1.00 0.217 0.217 0.000 0.000 1.00
200 0.076 0.050 0.056 0.046 0.084 0.064 0.457 0.087 0.385 0.002 0.411 0.392
2,000 0.050 0.047 0.051 0.048 0.054 0.049 0.068 0.051 0.392 0.047 0.070 0.064
Average treatment effects: β_i and γ_i ∼ chi2, ε_i ∼ standard normal
20 0.671 0.094 0.106 0.000 0.672 0.466 1.00 0.316 0.315 0.000 0.000 1.00
200 0.089 0.063 0.069 0.046 0.093 0.075 0.536 0.141 0.617 0.002 0.487 0.473
2,000 0.051 0.050 0.056 0.048 0.058 0.050 0.086 0.068 0.644 0.052 0.088 0.083
Notes. Covariance matrix for the joint Wald test using robust, randomization-t, and bootstrap-t methods
calculated following White (1982). Robust standard error used for -t based distributions in multiple tests.
The most notable difference in the pattern of results, relative to Table III, is the magnitude of the size distortions. In
small samples the robust and jackknife approaches have rejection
probabilities approaching 1.0, particularly in joint tests, whereas
bootstrap rejection probabilities range from 0 to well above 0.05
but are rarely near 0.05. Even with perfectly balanced leverage,
in small samples joint and multiple tests often have rejection
probabilities that are well outside the 0.99 probability interval
for an exact test.17
Size distortions increase with dimensionality in joint and
multiple tests for different reasons. In the case of multiple tests,
the problem is that a change in the thickness of the tails of a dis-
tribution generally results in a proportionally greater deviation at
more extreme tail values. Thus, a test that has twice (or one-half)
the nominal rejection probability at the .05 level will typically
have more than twice (less than one-half) the nominal rejection
probability at the .005 level. Consequently, as N increases and the
Bonferroni cutoff α/N falls, the probability of a rejection across any
of the N coefficients will deviate further from its nominal value,
so small size distortions in the test of one coefficient become large
size distortions in the test of N.
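The amplification is easy to verify with any thicker-tailed alternative; here, as an illustrative choice, the true distribution of the test statistic is taken to be a t with 8 degrees of freedom evaluated against normal critical values:

    from scipy import stats

    for alpha in (0.05, 0.005):
        z = stats.norm.ppf(1 - alpha / 2)     # nominal two-sided cutoff
        actual = 2 * stats.t.sf(z, df=8)      # true rejection probability
        print(alpha, actual, actual / alpha)  # the ratio grows as alpha falls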
In the case of joint tests, intuition can be found by noting that
the Wald statistic is actually the maximum squared t-statistic that
can be found by searching over all possible linear combinations w
of the estimated coefficients (Anderson 2003), that is:
(8)    β̂′V̂⁻¹β̂ = max_w (w′β̂)²/(w′V̂w).

When the covariance estimate equals the true covariance matrix
V times a scalar error, that is, V̂ = V(σ̂²/σ²), as is the
case with homoskedastic errors and covariance estimates, this
search is actually very limited and produces a variable with
a chi2 or F distribution.18 However, when V̂ is no sim-
ple scalar multiple of the true covariance V, the search
17. As noted earlier, in joint tests the randomization-c is no longer exact
even in tests of sharp nulls, as the covariance matrix in the calculation of the
distribution of test statistics in inequality (7) is only an approximation to the
covariance matrix across all possible randomization draws. This is clearly seen in
the rejection probabilities of 0.060 and 0.058 in samples of 20 in Panel B’s analysis
of joint tests of fixed treatment effects.
18. Employing the transformations w̃ = V^(1/2)w and β̃ = V^(−1/2)β̂, plus the
normalization w̃′w̃ = 1:

    max_w (w′β̂)²/(w′V̂w) = max_w̃ (w̃′β̃)²/(w̃′V^(−1/2)V̂V^(−1/2)w̃) = β̃′β̃/(σ̂²/σ²).

The last equality follows because the denominator reduces to σ̂²/σ² no matter what
the w̃ such that w̃′w̃ = 1, while the maximum of the numerator across w̃ equals
β̃′β̃, which is typically an independent chi2 variable with k degrees of freedom
(dof). Thus, the maximum is either distributed chi2 with k dof (when asymptotically
σ̂² = σ²) or else equals k times an F_{k,n−k} (when the denominator is (n − k)⁻¹
times a chi2 variable with n − k dof). However, when V̂ ≠ V(σ̂²/σ²) the search
possibilities in the denominator clearly expand.
possibilities expand, allowing for much larger tail outcomes. This
systematically produces rejection probabilities much greater than
size in clustered/robust joint tests.19 At the same time, if the boot-
strapped or randomized distribution of V̂ is even slightly misrep-
resentative of its true distribution, the two methods can greatly
over- or understate the search possibilities in the original proce-
dure, producing large size distortions of their own. Thinking of a
joint test as a maximization problem, I believe, provides some in-
tuition for why errors in approximating the distribution increase
with the dimensionality of the test.
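Equation (8) itself is easy to confirm numerically: the quadratic form β̂′V̂⁻¹β̂ bounds the squared t-statistic of every linear combination and is attained at w = V̂⁻¹β̂ (a small check with arbitrary values):

    import numpy as np

    rng = np.random.default_rng(0)
    k = 4
    beta = rng.standard_normal(k)
    A = rng.standard_normal((k, k))
    V = A @ A.T + k * np.eye(k)      # a positive-definite covariance estimate

    wald = beta @ np.linalg.solve(V, beta)
    t2 = lambda w: (w @ beta) ** 2 / (w @ V @ w)

    draws = (t2(rng.standard_normal(k)) for _ in range(100_000))
    print(max(draws) <= wald + 1e-12)                       # True: never exceeded
    print(np.isclose(t2(np.linalg.solve(V, beta)), wald))   # True: attained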
[FIGURE III. Randomization-t versus Conventional p-Values. Each figure shows 3,000 paired p-values, 1,000 for each of the treatment effect data-generating processes (fixed, normal, and chi2) in the indicated table panel with N observations.]

Figure III graphs randomization-t p-values against those
found using conventional techniques. In each panel I take the first
1,000 results from each of the three data-generating processes
for parameters (fixed, normal, and chi2 ), comparing results with
small (N = 20) and large (N = 2,000) samples. Panel A graphs
p-values from the balanced regression design of the upper left
panel of Table III, where robust p-values are nearly exact in both
19. Young (2018a) provides further evidence of this for the case of F-tests of
coefficients in a single regression.
small and large samples, showing that randomization and con-
ventional p-values are almost identical in both cases. Panel B
graphs the p-values of the lower right panel of Table III, where
robust methods have positive size distortions in small samples.
In small samples, randomization p-values are concentrated above
conventional results, with particularly large gaps for statistically
significant results, but in large samples the two types of results
are once again almost identical. Panel C graphs the joint tests
of the lower left panel of Table IV, where robust methods pro-
duce large size distortions in small samples but have accurate
size in large samples. In small samples the pattern of random-
ization p-values lying above robust results, particularly for small
conventional p-values, is accentuated, but once again differences
all but disappear in large samples.20
Panels A–C of Figure III might lead to the conclusion
that randomization and conventional p-values agree in large
samples or when both p-values are nearly exact. Panel D shows
this is not the case by examining conventional inference with
the default homoskedastic covariance estimate for the highly
leveraged coefficients tested in the lower left panel of Table III.
With samples of 20 observations, despite the fact that errors are
heteroskedastic in two-thirds of the simulations (β_i distributed
chi2 or normal), conventional and randomization-t inference
using the homoskedastic covariance estimate produce rejection
rates that are very close to nominal value (i.e., .048 and .051
at the .05 level, respectively). Nevertheless, randomization and
conventional p-values are scattered above and below each other.21
As the sample size increases, the default covariance estimate
results in a growing rejection probability for the conventional test
(0.080 at the .05 level), but no change in randomization rejection
rates, so randomization p-values end up systematically above the
conventional results. The pattern that does emerge from these
simulations is that randomization and conventional p-values are
quite close when maximal leverage, either through regression
design or the effects of sample size, is relatively small and conven-
tional and randomization inference are exact or very nearly so.
20. Figures for bootstrap-t and jackknife p-values compared with robust
p-values show the same patterns.
21. This is not an artifact of the use of the homoskedastic covariance estimate
under heteroskedastic conditions. The dispersion of p-values in the case of fixed
treatment effects, where both methods are exact, is similar.
Beyond size, there is the question of power. In the Online
Appendix I vary the mean treatment effect of the data-generating
processes in the upper panel of Table III and calculate the
frequency with which randomization-t and conventional robust
inference reject the incorrect null of zero average or sharp treat-
ment effects. When both methods have size near nominal value,
their power is virtually identical. When conventional robust
inference has large size distortions, that is, in small samples
with unbalanced regression design, randomization inference has
substantially lower power. This is to be expected, as a tendency
to reject too often becomes a valuable feature when the null
is actually false. However, from the point of view of Bayesian
updating between nulls and alternatives, it is the ratio of power to
size that matters, and here randomization inference dominates,
with ratios of power to size that are above (and as much as two
to three times) those offered by robust inference when the latter
has positive size distortions.
To conclude, Tables III and IV show the clear advantages
of randomization inference, particularly randomization inference
using the randomization-t. When the sharp null is true, random-
ization inference is exact no matter what the characteristics of
the regression. Moreover, the fact that randomization inference is
superior to all other methods when the sharp null is true does
not imply that it is inferior to all other methods when the sharp
null is false. When unrecognized heteroge-
neous treatment effects are present, the randomization-t test of
the sharp fixed null produces rejection probabilities that are often
quite close to nominal value, and in fact closer than most other
testing procedures. In the case of high-dimensional multiple- and
joint-testing problems, it is arguably the only basis to construct
reliable tests in small samples, albeit only of sharp nulls.
V. RESULTS
This section applies the testing procedures described above
to the 53 papers in my sample. As the number of coefficients,
regressions, and tables varies greatly by paper, reported results
are the average across papers of within-paper rejection rates, so
that each paper carries an equal weight in summary statistics. All
randomization tests are based upon the distribution of t and Wald
statistics, which, as noted above, are more robust to deviations
away from sharp nulls in favor of heterogeneous treatment effects.
TABLE V
INDIVIDUAL STATISTICAL SIGNIFICANCE OF REPORTED TREATMENT EFFECTS
                   All papers       Low leverage     Medium leverage    High leverage
                   (53 papers)      (18 papers)      (17 papers)        (18 papers)
                   .01     .05      .01     .05      .01     .05        .01     .05
Authors’ p-value 0.216 0.354 0.199 0.310 0.164 0.313 0.283 0.437
Randomization-t 0.78 0.87 0.96 0.98 0.79 0.96 0.65 0.74
Bootstrap-t 0.79 0.84 0.99 0.98 0.87 0.89 0.60 0.70
Jackknife 0.78 0.83 0.95 0.89 0.87 0.91 0.61 0.73
                   First table      Other tables     With interactions  No interactions
                   (45 papers)      (45 papers)      (29 papers)        (29 papers)
                   .01     .05      .01     .05      .01     .05        .01     .05
Authors’ p-value 0.303 0.446 0.188 0.338 0.148 0.292 0.310 0.450
Randomization-t 0.82 0.97 0.81 0.84 0.76 0.82 0.87 0.97
Bootstrap-t 0.85 0.91 0.90 0.80 0.86 0.80 0.87 0.88
Jackknife 0.91 0.94 0.81 0.79 0.80 0.83 0.93 0.89
Notes. Based on 4,044 reported treatment coefficients. .01, .05 = level of the test. Top row reports average
across papers of within-paper fraction of significant results evaluated using authors’ methods; values in lower
rows are average fraction of significant results evaluated using indicated method divided by the top row.
Randomization and bootstrap use 10,000 iterations to calculate p-values based on the distribution of squared
t-statistics (calculated using authors’ methods); interactions refers to coefficients in regressions that interact
treatment with participant characteristics or other nontreatment covariates; first/other table and with/no
interactions comparisons involve fewer than 53 papers because not all papers have both types of regressions.
Corresponding tests based on the distribution of coefficients
alone produce very similar results and are reported in the Online
Appendix. Reported bootstrap tests are also based on the distri-
bution of t and Wald statistics, which asymptotically and in simu-
lation produce more accurate size. Results using the bootstrapped
distribution of coefficients are reported in the Online Appendix
and have systematically higher rejection rates. To avoid contro-
versy, I restrict the tests to treatment effects authors report in
tables, rather than the unreported and arguably less important
coefficients on other treatment measures in the same regressions.
Results based on all treatment measures are reported in the On-
line Appendix and, with a few noted exceptions, exhibit similar
patterns. Details on the methods used to implement the random-
ization, bootstrap, and jackknife tests are given in the Online Ap-
pendix. Variations on these methods (also reported there) produce
results that are less favorable to authors and conventional tests.
Table V tests the statistical significance of individual treat-
ment effects. The top row in each panel reports the average across
papers of the fraction of coefficients that are statistically signif-
icant using authors’ methods (rounded to three decimal places),
and lower rows report the ratio of the same measure calculated
using alternative procedures to the figure in the top row (rounded
to two decimal places for contrast). In the upper left panel we see
that using authors’ methods in the typical paper 21.6% and 35.4%
of reported coefficients on treatment measures are significant at
the .01 and .05 levels, respectively, but that the average number of
significant treatment effects found using the randomization dis-
tribution of the t-statistic is only 78% and 87% as large at the
two levels. Jackknife and t-statistic-based bootstrap significance
rates agree with the randomization results at the .01 level and
find somewhat lower relative rates of significance (83% to 84%) at
the .05 level.
Table V divides the sample into low, medium, and high lever-
age groups based on the average share of the cluster or observa-
tion with the greatest coefficient leverage, as described in Table II.
As shown, the largest difference between the three methods and
authors’ results is found in high-leverage papers, where on aver-
age randomization techniques find only 65% and 74% as many
significant results at the .01 and .05 levels, respectively. Differ-
ences in rejection rates in the one-third of papers with the lowest
average leverage are minimal. Jackknife and bootstrap results
follow this pattern as well. The lower panel of the table compares
treatment effects appearing in first tables to those in other ta-
bles, and those in regressions with treatment interactions with
covariates against those without, in papers that have both types
of coefficients. Results in first tables and in regressions with-
out interactions tend to be more robust to alternative methods,
with randomization rejection rates at the .05 level, in particu-
lar, coming in close (97%) to those found using authors’ methods.
Regressions (reported in the Online Appendix) of conventional-versus-randomization
differences in significance on dummies for a first-table
regression or one without interactions, as well as on the number of
observations, find that adding maximal coefficient leverage
to the regression generally moves the coefficients on these
measures substantively toward 0, while leaving the coefficient on
leverage largely unchanged. Regression design is systematically
worse in some papers and deteriorates within papers as authors
explore their data using subsamples and interactions with
covariates; this, rather than placement in a first table or a regression
without covariates per se, appears to be the determinant of differences
between authors’ results and those found using randomization
methods.
Table VI tests the null that all reported treatment effects in
regressions with more than one reported treatment coefficient are
zero using joint- and multiple-testing methods.
TABLE VI
JOINT STATISTICAL SIGNIFICANCE OF REPORTED TREATMENT EFFECTS (REGRESSION LEVEL)
(REGRESSIONS WITH MORE THAN ONE REPORTED TREATMENT COEFFICIENT)
All papers Low leverage Medium leverage High leverage First table Other tables
(47 papers) (16 papers) (16 papers) (15 papers) (29 papers) (29 papers)
.01 .05 .01 .05 .01 .05 .01 .05 .01 .05 .01 .05
Significant coef. 0.431 0.643 0.353 0.596 0.450 0.607 0.495 0.731 0.469 0.620 0.413 0.584
Panel A: Joint test based on F and Wald statistics
Authors’ method 0.438 0.546 0.435 0.508 0.392 0.539 0.490 0.595 0.383 0.528 0.400 0.473
Randomization-t 0.76 0.83 1.01 1.00 0.90 0.94 0.42 0.58 0.84 0.92 0.84 0.86
Bootstrap-t 0.72 0.81 0.96 0.96 0.84 0.91 0.39 0.57 0.93 0.84 0.74 0.81
Jackknife 0.90 0.88 0.98 0.96 0.97 0.91 0.76 0.79 0.98 0.96 0.90 0.92
Panel B: Presence of at least one significant measure in multiple testing based on Bonferroni (B) and Westfall–Young (WY) methods
Authors’ p-value (B) 0.335 0.494 0.274 0.426 0.322 0.526 0.415 0.533 0.340 0.501 0.306 0.442
Randomization-t (B) 0.73 0.85 0.99 1.02 0.59 0.88 0.66 0.67 0.81 0.96 0.78 0.82
Bootstrap-t (B) 0.76 0.88 1.06 1.03 0.80 0.87 0.51 0.76 1.01 0.90 0.86 0.85
Jackknife (B) 0.80 0.85 1.03 0.93 0.78 0.94 0.64 0.68 0.98 0.97 0.85 0.84
Randomization-t (WY) 0.76 0.89 1.00 1.08 0.68 0.92 0.66 0.70 0.78 0.96 0.85 0.89
Bootstrap-t (WY) 0.77 0.92 1.07 1.03 0.80 0.93 0.52 0.81 1.01 0.94 0.87 0.87
Notes. Unless otherwise noted, as in Table V. Based on 922 regressions with multiple reported treatment effects. Significant coef. = presence of any coefficient in the regression
with an authors’ p-value below the indicated level; top row in each panel reports average across papers of the fraction of tests within each paper rejecting the combined null using
authors’ methods; lower rows report the same number calculated using alternative methods divided by the top row. Because the sample is restricted to regressions with more
than one reported treatment coefficient, the number of papers appearing in each group differs from Table V, as indicated. Randomization and bootstrap tests are based on the
distribution of the Wald statistic.
The average number of reported treatment effects tested in such
regressions in a paper ranges from 2 to 17.5, with a median of 3.0 and mean of
3.7. The top row of the table records (using three decimal places)
the average fraction of regressions which find at least one indi-
vidually .01 or .05 significant treatment coefficient using authors’
methods. Below this I report (also using three decimal places)
the average fraction of regressions which, again using authors’
covariance calculation methods, either reject the combined null of
zero effects directly in a joint F/Wald test (Panel A) or implicitly by
having a minimum coefficient p-value that lies below the Bonfer-
roni multiple-testing-adjusted cutoff (Panel B). As expected, the
Bonferroni adjustment reduces the average number of significant
results, as the movement from an α to an α/N p-value cutoff raises
the average critical value of the t- or z-statistic for .01 signifi-
cance from ±2.6 to ±3.0 in the average paper. Joint tests expand
the critical region on any given axis further than multiple-testing
procedures; in the case of my sample to an average .01 t- or z-
critical value of ±3.5 in the average paper. Despite this, joint
tests have systematically higher rejection rates, in the sample as
a whole and in every subsample examined in the table, as evidence
against the irrelevance of treatment is found not in extreme coef-
ficient values along the axes but in a combination of moderate val-
ues within quadrants. Although Wald ellipses expand acceptance
regions along the axes (the area that receives all of the attention
in the published discussion of individually significant coefficients),
they do so to tighten the rejection region within quadrants, and
this may yield otherwise underappreciated evidence against the
null that experimental treatment is irrelevant. In a similar vein,
when these tests are expanded to all coefficients, not merely those
reported, rejection rates in joint and multiple tests actually rise
slightly, despite the increase in critical values, as evidence against
the null is found in treatment measures authors did not empha-
size (Online Appendix).
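The quoted movement in critical values is just the normal quantile evaluated at the Bonferroni-adjusted cutoff; with the sample's mean of 3.7 tested coefficients per regression (a sketch, not the paper's code):

    from scipy import stats

    alpha, N = 0.01, 3.7                        # mean coefficients per test
    plain = stats.norm.ppf(1 - alpha / 2)       # approximately 2.6
    bonf = stats.norm.ppf(1 - alpha / (2 * N))  # approximately 3.0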
Panels A and B of Table VI also report (using two decimal
places for contrast) the average rejection rates of tests based on
randomization, bootstrap, and jackknife techniques expressed
as a ratio of the average rejection rate of the corresponding
test using authors’ methods. The relative reduction in rejection
rates using randomization techniques is slightly greater than in
Table V’s analysis of coefficients and is especially pronounced in
high-leverage papers, where, in joint tests, randomization tests
find only 42% and 58% as many significant results as authors’
methods. This may be a consequence of the greater size distor-
tions of clustered/robust methods in higher-dimensional tests,
especially in high-leverage situations, discussed earlier. In joint
tests bootstrap and jackknife results are alternately somewhat
more and less pessimistic than those based on randomization
inference, but both show similar patterns, with differences with
conventional results concentrated in higher-leverage subsamples.
Westfall–Young randomization and bootstrap measures raise
rejection rates relative to Bonferroni-based results, as should
be expected, as they calculate the joint distribution of p-values
avoiding the “worst-case scenario” assumptions of the Bonferroni
cutoffs.22 Levels and patterns of relative rejection rates are
quite similar when the tests are expanded to include unreported
treatment effects (Online Appendix).
Table VII reports results for joint tests of reported treatment
effects appearing together in tables. The results presented in
tables usually revolve around a theme, typically the exploration of
alternative specifications in the projection of one or more related
outcomes of interest on treatment, treatment interactions with
covariates, and treatment subsamples. The presence of both sig-
nificant and insignificant coefficients in these tables calls for
summary statistics evaluating the results in their entirety, and
Table VII tries to provide these. For the purpose of Wald statistics
in joint tests, I estimate the joint covariance of coefficients across
equations using White’s (1982) formula.23 Calculation of this co-
variance estimate reveals that, unknown to readers (and possibly
authors as well), coefficients presented in tables are often perfectly
collinear, as the manipulation of variables and samples eventually
produces results that simply repeat earlier information.24
22. Although a conventional equivalent of the Westfall–Young multiple-testing
procedure could be calculated using the covariance estimates and assumed nor-
mality of coefficients, I report the Westfall–Young randomization and bootstrap
rejection rates as a ratio of the conventional Bonferroni results to facilitate a
comparison with the absolute rejection rates of the randomization and bootstrap
Bonferroni tests, which are also normalized by the conventional Bonferroni results.
23. Because White’s theory is based on maximum likelihood estimation, this is
the one place where I modify authors’ specifications, using the maximum-likelihood
representation of their estimating equation where it exists. Differences in individ-
ual coefficient estimates are 0 or minimal. Some estimation methods (e.g., quantile
regressions) have no maximum-likelihood representation and are not included in
the tests. In the few cases where the number of clusters does not exceed the number
of treatment effects, I restrict the table-level joint test to the subset of coefficients
that Stata does not drop when it inverts the covariance matrix.
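A sketch of the idea behind the cross-equation Wald tests, using stacked per-observation OLS influence functions to estimate the joint robust covariance of coefficients from equations sharing the same observations; the paper works from the maximum-likelihood representation following White (1982), so this moment-based version is an illustrative stand-in and ignores clustering:

    import numpy as np

    def joint_wald(Xs, ys, sel):
        # Xs, ys: design matrices and outcomes, one pair per equation,
        #         all with the same row (observation) order.
        # sel:    (equation, column) pairs for the coefficients tested
        #         jointly against zero.
        infl, coefs = [], []
        for X, y in zip(Xs, ys):
            XtX_inv = np.linalg.inv(X.T @ X)
            b = XtX_inv @ (X.T @ y)
            e = y - X @ b
            infl.append((X * e[:, None]) @ XtX_inv)   # per-obs influence
            coefs.append(b)
        psi = np.column_stack([infl[eq][:, j] for eq, j in sel])
        V = psi.T @ psi           # robust covariance, incl. cross-equation terms
        beta = np.array([coefs[eq][j] for eq, j in sel])
        return beta @ np.linalg.solve(V, beta)

Because the influence functions are stacked observation by observation, coefficients from different equations estimated on the same sample acquire the cross-equation covariance the joint test requires.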
TABLE VII
JOINT STATISTICAL SIGNIFICANCE OF REPORTED TREATMENT EFFECTS (TABLE LEVEL)
All papers Low leverage Medium leverage High leverage First table Other tables
(53 papers) (18 papers) (17 papers) (18 papers) (45 papers) (45 papers)
.01 .05 .01 .05 .01 .05 .01 .05 .01 .05 .01 .05
Significant coef. 0.662 0.818 0.617 0.788 0.602 0.753 0.764 0.908 0.711 0.889 0.630 0.786
Panel A: Joint test based on Wald statistics
Conventional 0.493 0.622 0.337 0.487 0.431 0.522 0.706 0.850 0.422 0.556 0.483 0.606
Randomization-t 0.51 0.67 0.92 1.00 0.40 0.74 0.38 0.45 0.79 0.80 0.62 0.78
Bootstrap-t 0.21 0.33 0.33 0.51 0.18 0.41 0.18 0.19 0.38 0.45 0.23 0.40
Jackknife 0.77 0.84 0.92 0.86 0.76 0.91 0.71 0.78 0.89 0.84 0.81 0.84
Panel B: Presence of at least one significant measure in multiple testing based on Bonferroni (B) and Westfall–Young (WY) methods
Authors’ p-value (B) 0.377 0.542 0.329 0.489 0.275 0.475 0.521 0.659 0.400 0.556 0.349 0.491
Randomization-t (B) 0.61 0.81 0.88 0.98 0.63 0.75 0.43 0.72 0.78 1.00 0.69 0.84
Bootstrap-t (B) 0.69 0.78 1.00 1.06 0.62 0.73 0.54 0.62 0.78 0.92 0.79 0.82
Jackknife (B) 0.71 0.85 1.00 0.97 0.66 0.87 0.56 0.73 0.89 1.00 0.74 0.84
Randomization-t (WY) 0.77 0.91 1.18 1.06 0.67 0.96 0.55 0.77 1.00 1.12 0.79 0.92
Bootstrap-t (WY) 0.79 0.87 1.20 1.09 0.73 0.91 0.55 0.68 0.89 1.00 0.89 0.95
Notes. Unless otherwise noted, as in Table VI. Based on 198 tables. Comparisons for first tables limited to papers with these and other tables.
These
linear combinations are dropped in the case of joint tests, as they
are implicitly subsumed in the joint test of the zero effects of
the remaining coefficients. I retain them in the multiple-testing
calculations based on individual p-values, however, because they
provide a nice illustration of the advantages of Westfall–Young
procedures in environments with strongly correlated coefficients.
The discrepancies between the rejection rates found using
different methods in the joint tests of Table VII are much larger
than those found in the preceding tables. Randomization tests
show only 51% as many significant results at the .01 level as clus-
tered/robust joint tests, while the bootstrap finds merely 21% as
many significant results, and the jackknife does more to validate
clustered/robust methods with 77% as many significant results.
The number of treatment effects reported in tables ranges from 2
to 96, with the average table in a paper having a 53-paper mean of
19 and median of 17. As found in the Monte Carlos earlier, in high-
dimensional joint tests of this type, clustered/robust and jackknife
methods appear to have rejection probabilities much greater than
nominal size, while Wald-based bootstraps grossly underreject.
The results in Table VI are consistent with this pattern. Random-
ization inference based on Wald statistics in exact tests of sharp
nulls is arguably the only credible basis for tests of this sort.
Turning to the Bonferroni multiple tests in Table VII, which
rely only on the accuracy of covariance estimates for individual
coefficients, the agreement between methods is better here, with
significance rates of 61% to 81% of conventional results using
randomization inference and 69% to 85% using the
bootstrap or jackknife. These are larger proportional reductions
in significance rates than in any of the preceding tables. As shown
in Section IV, size distortions grow in multiple testing as propor-
tional deviations from the nominal size are greater at the more
extreme tail cutoffs used in Bonferroni testing. This again argues
in favor of the use of randomization inference, as this is the only
basis to ensure accurate size, at least for tests of sharp nulls, at
.001 or .0001 levels.
24. Excluding the tables where the number of tested treatment effects is
greater than or equal to the number of clusters, in the average paper 14% of
tables contain perfectly collinear results. In such tables, on average one-fifth of
the reported results are collinear with the remaining four-fifths.
Table VII highlights the advantages of Westfall–Young
methods, using randomization or bootstrap inference to calculate
the joint distribution of p-values within tables rather than
adopting the conservative Bonferroni cutoff. Switching from
the Bonferroni cutoff to the Westfall–Young calculation raises
the relative number of significant randomization-t results by
fully one-quarter (from 61% to 77%) at the .01 level and by
one-eighth (from 81% to 91%) at the .05 level. This reflects the
extensive repetition of information in tables, in the form of minor
changes in the specification of right-hand-side variables or highly
correlated or collinear left-hand-side variables. Westfall–Young
methods incorporate this, raising the critical p-value for a finding
of a significant result. This is a nice feature, as it allows the
exploration of the data and presentation of correlated information
without unduly penalizing authors. Critical p-values only become
more demanding to the degree that new specifications add un-
correlated (i.e., new) information on the significance of treatment
measures.
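A sketch of the single-step version of this calculation (the paper implements the step-down variant of Westfall and Young 1993; the simpler single-step form below conveys the mechanics):

    import numpy as np

    def westfall_young(p_obs, p_draws):
        # p_obs:   (N,) observed per-coefficient p-values
        # p_draws: (R, N) p-values from R rerandomizations under the null
        # Adjusted p-value: the chance that the minimum p-value across
        # the family is as small as the observed one. Correlated or
        # collinear results shift the min-p draws upward, so the cutoff
        # is less punishing than Bonferroni's min(1, N * p).
        min_p = p_draws.min(axis=1)
        return np.array([(min_p <= p).mean() for p in p_obs])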
Finally, it is worth noting that Table VII reinforces Ta-
ble VI’s message that evidence against the null of experimental
irrelevance can often be found within quadrants rather than
along axes. Although in this case the average t- or z-statistic for
rejection along an axis in the average paper rises from 3.4 in
the Bonferroni test to 5.5 in the joint test, rejection rates in the
conventional joint test are still greater than in the conventional
Bonferroni test. Randomization inference produces a greater pro-
portional reduction relative to conventional results in joint tests,
but the average absolute rejection rates of the randomization-t
in joint tests within papers at the .01 and .05 levels (25.1%
and 41.9%, respectively) are comparable to those found using
randomization-t Bonferroni tests (23.1% and 43.9%), although
with Westfall–Young methods rejection rates are higher (28.9%
and 49.5%). In a similar vein, conventional and randomization
rejection rates are actually slightly higher once treatment effects
that were not reported by authors are included (Online Appendix).
The preceding presentation is frequentist, in keeping with
the emphasis on “starred” significant results in journals. In this
context, all that matters are the 0/1 significance rates reported
above, as a p-value of .011 is no more significant at the .01 level
than a p-value of .11. Seminar participants and referees, however,
often ask whether p-value changes are substantial, presumably
reflecting quasi-Bayesian calculations involving the likelihood of
outcomes under different hypotheses.25

[FIGURE IV. Randomization-t versus Conventional p-Values in Tests of Reported Treatment Effects. Multiple testing p-value = min(1, N∗pmin), where pmin is the minimum p-value in the N individual tests. Joint tests for tables calculated using White’s (1982) clustered/robust joint covariance estimate for multiple maximum likelihood equations.]

To this end, Figure IV
graphs the randomization-t p-values against the conventional
p-values for the tests discussed above. As can be seen, there are
often very substantial differences, concentrated in particular
in tests with conventionally statistically significant results.
These patterns are consistent with those found earlier in Monte
Carlos for unbalanced regression design, where conventional
methods have sizable size distortions. Table VIII focuses on the
average within-paper distribution of randomization p-values for
conventional results that are statistically significant at the .01
or .05 levels. As can be seen, in tests at the coefficient level in
the average paper about two-thirds of changes in significance
merely bump the p-value into the next category (0.160/0.248
and 0.101/0.147). In contrast, in tests at the table level, the
biggest movement by far is into p-values in excess of .20, which
account for anything from 30% to 60% of all changes in .01
or .05 statistical significance in the average paper. The gaps
between randomization and conventional results are greatest in
high-dimensional tests where, as seen in Section IV, conventional
clustered/robust tests have gross size distortions.
In the presentation above I focused on those methods that
produce the results most favorable to authors.
25. Similarly, in a frequentist world a failure to reject doesn’t confirm the null,
but in a Bayesian world large p-values increase its posterior probability, which
might explain why authors emphasize the statistical insignificance of treatment
coefficients in regressions related to randomization balance and sample attrition.
TABLE VIII
DISTRIBUTION OF RANDOMIZATION p-VALUES
FOR CONVENTIONALLY SIGNIFICANT RESULTS
Individual Joint Multiple Joint Multiple
treatment tests testing tests testing
effects (regression) (regression) (table) (table)
.01 .05 .01 .05 .01 .05 .01 .05 .01 .05
< .01 0.752 ↓ 0.747 ↓ 0.699 ↓ 0.477 ↓ 0.635 ↓
.01–.05 0.160 0.853 0.139 0.820 0.199 0.813 0.194 0.656 0.226 0.775
.05–.10 0.068 0.101 0.033 0.075 0.044 0.099 0.071 0.083 0.006 0.077
.10–.20 0.014 0.029 0.063 0.078 0.015 0.020 0.040 0.051 0.033 0.037
> .20 0.005 0.017 0.017 0.027 0.042 0.068 0.217 0.210 0.100 0.111
Notes. Reported figures are the average across papers of the within-paper distribution of randomization-t
p-values when conventional tests register a significant result at the level specified; (↓) included in the category
below; multiple-testing p-values are the Bonferroni adjusted minimum, as in Figure IV.
This is of greatest relevance in the evaluation of the impact of randomization
inference on p-values. For example, in the 12 papers in my sam-
ple where authors systematically clustered below treatment level,
I follow their lead and treat the grouping of treatment in lab
sessions and geographic neighborhoods as nominal, rerandom-
izing across observations rather than treatment groupings. If
instead I were to cluster both the conventional and random-
ization tests at treatment level, then in the average paper in
this group the fraction of .01 conventionally individually sig-
nificant treatment effects that have individual randomization
p-values in excess of .10 rises from 0.0% to 18.6%. Similarly,
when the randomization-c is applied or when randomization-t
“conditional” p-values that do not rely on a joint zero null for all
treatment effects are calculated, 7.3% and 4.7%, respectively, of
.01 conventionally individually significant results in the average
paper have individual randomization p-values in excess of .10 (see
the Online Appendix). Authors deserve the benefit of the doubt,
but the minimization of differences in the tables above should not
be read as a guide to expected differences in randomization and
conventional results, particularly in highly leveraged estimation.
VI. CONCLUSION
If there is one message in this article, it is that there is
value added in paying attention to regression and experimental
design. Balanced designs lead to uniform leverage with con-
ventional results that are less sensitive to outliers and less
subject to size distortions with clustered/robust covariance
estimates, and produce p-values that are nearly identical across
clustered/robust, randomization, bootstrap, or jackknife pro-
cedures. Regressions with multiple treatments and treatment
interactions with participant characteristics generate concen-
trated leverage, producing coefficients and clustered/robust
standard errors that depend heavily on a limited set of obser-
vations and have a volatility typically greater than standard
error and degrees-of-freedom estimates suggest. Rather than
maintain the fiction that identification and inference come from
the full sample, more accurate results might be achieved by
breaking the experiment and regression into groups based on
treatment regime or participant characteristics, each with a
balanced treatment design.
Consideration of experimental and regression design can
also play a role in multiple testing. Although the tests used in
this article can evaluate the general question of treatment rele-
vance, more discerning results can be achieved if regressions are
designed in a fashion that allows step-down procedures to control
the Type I error or false discovery rate (e.g., Holm 1979; Westfall
and Young 1993; Benjamini and Hochberg 1995). Practically
speaking, to allow for this, tests have to be set up in a fashion that
allows subset pivotality (Westfall and Young 1993), where the
distribution of each randomization test statistic is independent
of the nulls for other treatment results. Dividing regressions into
mutually exclusive samples is a trivially easy way to ensure this.
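For reference, the Holm (1979) step-down adjustment mentioned here takes only a few lines (a generic sketch, not tied to any particular data):

    import numpy as np

    def holm(p):
        # Multiply the j-th smallest p-value by (N - j + 1), then enforce
        # monotonicity so adjusted p-values never decrease down the list.
        p = np.asarray(p, dtype=float)
        order = np.argsort(p)
        N = len(p)
        adj = np.minimum(1.0, p[order] * (N - np.arange(N)))
        adj = np.maximum.accumulate(adj)
        out = np.empty(N)
        out[order] = adj
        return out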
In sum, if the exploration of the effects of multiple treatment
regimes and the differing effects of treatment in population
subgroups is indeed the original intent of an experiment, then it
is best to build this into the experimental design by stratifying
the application of treatment to ensure balanced regression design
for each treatment/control pair and population subgroup. This
will allow accurate inference for individual coefficients and
enable the application of multiple-testing procedures to control
error rates across families of hypotheses.
That said, there will obviously be occasions where outlier
treatment values arise through factors beyond experimenters’
control or a conscious attempt to achieve greater power by expand-
ing the range of treatment variation. In such highly leveraged
circumstances, as well as in the case of high-dimensional joint
and multiple testing, randomization tests of sharp nulls provide
a means to construct tests with credible finite-sample rejection
probabilities.
LONDON SCHOOL OF ECONOMICS
SUPPLEMENTARY MATERIAL
An Online Appendix for this article can be found at The Quar-
terly Journal of Economics online. Data and code replicating tables
and figures in this article can be found in Young (2018b), in the
Harvard Dataverse, doi:10.7910/DVN/JX6HCJ.
REFERENCES
Anderson, Michael L., “Multiple Inference and Gender Differences in the Effects of
Early Intervention: A Reevaluation of the Abecedarian, Perry Preschool and
Early Training Projects,” Journal of the American Statistical Association, 103
(2008), 1481–1495.
Anderson, Theodore W., An Introduction to Multivariate Statistical Analysis, 3rd
ed. (New Jersey: John Wiley & Sons, 2003).
Benjamini, Yoav, and Yosef Hochberg, “Controlling the False Discovery Rate: A
Practical and Powerful Approach to Multiple Testing,” Journal of the Royal
Statistical Society, Series B (Methodological), 57 (1995), 289–300.
Bugni, Federico A., Ivan A. Canay, and Azeem M. Shaikh, “Inference under
Covariate-Adaptive Randomization,” Unpublished manuscript, Duke Univer-
sity, Northwestern University, and University of Chicago, 2017.
Chesher, Andrew, “Hajek Inequalities, Measures of Leverage and the Size of Het-
eroskedasticity Robust Wald Tests,” Econometrica, 57 (1989), 971–977.
Chesher, Andrew, and Ian Jewitt, “The Bias of a Heteroskedasticity Consistent
Covariance Matrix Estimator,” Econometrica, 55 (1987), 1217–1222.
Chung, Eun Yi, and Joseph P. Romano, “Exact and Asymptotically Robust Permu-
tation Tests,” Annals of Statistics, 41 (2013), 484–507.
Fisher, Ronald A., The Design of Experiments (Edinburgh: Oliver and Boyd, 1935).
Hall, Peter, The Bootstrap and Edgeworth Expansion (New York: Springer, 1992).
Heckman, James, Seong Hyeok Moon, Rodrigo Pinto, Peter Savelyev, and Adam
Yavitz, “Analyzing Social Experiments as Implemented: A Reexamination of
the Evidence from the HighScope Perry Preschool Program,” Quantitative
Economics, 1 (2010), 1–46.
Holm, Sture, “A Simple Sequentially Rejective Multiple Test Procedure,” Scandi-
navian Journal of Statistics, 6 (1979), 65–70.
Imbens, Guido W., and Donald B. Rubin, Causal Inference for Statistics, Social,
and Biomedical Sciences: An Introduction (New York: Cambridge University
Press, 2015).
Imbens, Guido W., and Jeffrey M. Wooldridge, “Recent Developments in the Econo-
metrics of Program Evaluation,” Journal of Economic Literature, 47 (2009),
5–86.
Janssen, Arnold, “Studentized Permutation Tests for Non-i.i.d. Hypotheses and
the Generalized Behrens-Fisher Problem,” Statistics & Probability Letters, 36
(1997), 9–21.
Jockel, Karl-Heinz, “Finite Sample Properties and Asymptotic Efficiency of Monte
Carlo Tests,” Annals of Statistics, 14 (1986), 336–347.
Lee, Soohyung, and Azeem M. Shaikh, “Multiple Testing and Heterogeneous Treat-
ment Effects: Re-Evaluating the Effect of Progresa on School Enrollment,”
Journal of Applied Econometrics, 29 (2014), 612–626.
598 THE QUARTERLY JOURNAL OF ECONOMICS
Lehmann, E. L., Testing Statistical Hypotheses (New York: John Wiley & Sons,
1959).
List, John A., Azeem M. Shaikh, and Yang Xu, “Multiple Hypothesis Testing in
Experimental Economics,” Unpublished manuscript, 2016.
MacKinnon, James G., and Halbert White, “Some Heteroskedasticity-Consistent
Covariance Matrix Estimators with Improved Finite Sample Properties,” Jour-
nal of Econometrics, 29 (1985), 305–325.
Romano, Joseph P., and Michael Wolf, “Exact and Approximate Stepdown Methods
for Multiple Hypothesis Testing,” Journal of the American Statistical Associ-
ation, 100 (2005), 94–108.
Savin, N. E., “Multiple Hypothesis Testing,” in Handbook of Econometrics, vol. 2,
Zvi Griliches and Michael D. Intriligator, eds (Amsterdam: North-Holland,
1984).
Stein, C. M., “Confidence Sets for the Mean of a Multivariate Normal Distribution,”
Journal of the Royal Statistical Society, Series B (Methodological), 24 (1962),
265–296.
Westfall, Peter H., and S. Stanley Young, Resampling-Based Multiple-Testing:
Examples and Methods for P-value Adjustment (New York: John Wiley &
Sons, 1993).
White, Halbert, “A Heteroskedasticity-Consistent Covariance Matrix Estimator
and a Direct Test for Heteroskedasticity,” Econometrica, 48 (1980), 817–838.
———, “Maximum Likelihood Estimation of Misspecified Models,” Econometrica,
50 (1982), 1–25.
Young, Alwyn, “Consistency without Inference: Instrumental Variables in Prac-
tical Application,” Unpublished manuscript, London School of Economics,
2018a.
———, “Replication Data for: ‘Channelling Fisher: Randomization Tests and the
Statistical Insignificance of Seemingly Significant Experimental Results’,”
Harvard Dataverse (2018b), doi:10.7910/DVN/JX6HCJ.