
Publisher: Taylor & Francis

Journal of the American Statistical Association


Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/uasa20

Missing Data, Imputation, and the Bootstrap

Bradley Efron, Stanford University, Stanford, CA 94305

Published online: 27 Feb 2012.

To cite this article: Bradley Efron (1994) Missing Data, Imputation, and the Bootstrap, Journal of the American Statistical Association, 89:426, 463-475.

To link to this article: http://dx.doi.org/10.1080/01621459.1994.10476768

Missing Data, Imputation, and the Bootstrap
Bradley EFRON*

Missing data refers to a class of problems made difficult by the absence of some portions of a familiar data structure. For example,
a regression problem might have some missing values in the predictor vectors. This article concerns nonparametric approaches to
assessing the accuracy of an estimator in a missing data situation. Three main topics are discussed: bootstrap methods for missing
data, these methods' relationship to the theory of multiple imputation, and computationally efficient ways of executing them. The
simplest form of nonparametric bootstrap confidence interval turns out to give convenient and accurate answers. There are interesting
practical and theoretical differences between bootstrap methods and the multiple imputation approach, as well as some useful
similarities.
KEY WORDS: Bayesian bootstrap; Bootstrap confidence intervals; Data augmentation; Ignorable nonresponse; Nonparametric
MLE.

1. INTRODUCTION

Missing data refers to a class of problems made difficult by the absence of some part of a familiar data structure. In a correlation problem, for example, some of the data pairs might be missing one of the measurements. A substantial theory of imputation has been developed during the past 15 years to efficiently estimate a parameter of interest θ in a missing data situation and to assess the variability of the estimate θ̂. This theory, as developed by Rubin (1987), Tanner and Wong (1987), and several other authors mentioned in Section 3, is based on an appealing Bayesian analysis of the missing data structure. In this article the bootstrap, a frequentist device, is brought to bear on missing data problems, with a particular emphasis on nonparametric situations. There are interesting practical and theoretical differences between the bootstrap and imputation approaches, as well as some similarities.

The left panel of Table 1 presents a simple example of a missing data situation. Twenty-two students have each taken five exams, labeled A, B, C, D, E, but some of the A and the E scores, marked "?", have been lost. We suppose that if there were no missing data, then we would consider the rows of the matrix to be a random sample of size n = 22 from an unknown five-dimensional probability distribution F, and that our goal is to estimate θ, the maximum eigenvalue of the covariance matrix of F,

    θ = maximum eigenvalue of Σ,    (1.1)

where Σ is the covariance matrix of a single vector drawn from F.

The right panel of Table 1 shows an imputed data set in which the missing student scores have been replaced by estimates obtained from a two-way linear model. Parameter estimates μ̂, α̂_i, and β̂_j were obtained by minimizing the sum of squares

    Σ_{i=1}^{22} Σ_{j=1}^{5} (o_ij - μ - α_i - β_j)²,    (1.2)

the sum being over the observed numerical scores o_ij in the left panel, those having a numerical value and not a question mark. Then the imputed values

    x̂_ij = μ̂ + α̂_i + β̂_j    (1.3)

were used to replace the missing scores o_ij = ?, of course setting x̂_ij = o_ij for the observed numerical scores. This is not necessarily a good imputation scheme, as will be discussed, but it is typical of "best-fit" imputation methods often used in practical situations (see Little and Rubin 1987, chap. 2). For now we will take it as a given part of the example under discussion.

The imputed data set consists of n = 22 rows (say x̂_i for i = 1, 2, ..., n), from which we can calculate an empirical covariance matrix

    Σ̂ = (1/n) Σ_{i=1}^{n} (x̂_i - μ̂)(x̂_i - μ̂)',  with μ̂ = (1/n) Σ_{i=1}^{n} x̂_i.    (1.4)

The maximum eigenvalue of Σ̂ is an obvious estimate θ̂ for the parameter of interest θ, in this case giving

    θ̂ = maximum eigenvalue of Σ̂ = 633.2.    (1.5)

We could just as well divide by n - 1 instead of n in defining Σ̂, but (1.4) makes θ̂ equal the normal theory maximum likelihood estimate of θ, which is handy for later comparisons.

The calculation of θ̂ illustrates the traditional purpose of imputation: to replace missing values with numbers so that familiar statistical methods can be used. Here the familiar method is the covariance calculation (1.4). The question remains, "How good an estimate of θ is θ̂?" This article concerns bootstrap approaches to answering this question, particularly in nonparametric situations.

The simplest nonparametric bootstrap approach, sometimes called just the nonparametric bootstrap in what follows, begins by writing the observed data set as

    o = (o_1, o_2, ..., o_n).    (1.6)

Here o_i is the ith row of the matrix in the left panel of Table 1, including the question marks; for example, o_1 = (?, 63, 65, 70, 63). A nonparametric bootstrap sample,

    o* = (o*_1, o*_2, ..., o*_n),    (1.7)

is obtained by drawing the o*_i randomly and with replacement from the set {o_1, o_2, ..., o_n} (see Efron and Tibshirani 1986).

* Bradley Efron is Professor of Statistics, Stanford University, Stanford, CA 94305. I am grateful to Paul Holland for several helpful discussions concerning missing data.

© 1994 American Statistical Association, Journal of the American Statistical Association, June 1994, Vol. 89, No. 426, Theory and Methods.

Table 1. The Student Score Data

                Observed Data o                 Imputed Data x̂
Student    A    B    C    D    E          A      B    C    D    E
   1       ?   63   65   70   63        56.21   63   65   70   63
   2      53   61   72   64   73        53      61   72   64   73
   3      51   67   65   65    ?        51      67   65   65   58.94
   4       ?   69   53   53   53        47.96   69   53   53   53
   5       ?   69   61   55   45        48.46   69   61   55   45
   6       ?   49   62   63   62        49.96   49   62   63   62
   7      44   61   52   62    ?        44      61   52   62   51.69
   8      49   41   61   49    ?        49      41   61   49   46.94
   9      30   69   50   52   45        30      69   50   52   45
  10       ?   59   51   45   51        42.46   59   51   45   51
  11       ?   40   56   54    ?        39.54   40   56   54   44.33
  12      42   60   54   49    ?        42      60   54   49   48.19
  13       ?   63   53   54    ?        46.21   63   53   54   50.99
  14       ?   55   59   53    ?        45.21   55   59   53   49.99
  15       ?   49   45   48    ?        36.87   49   45   48   41.66
  16      17   53   57   43   51        17      53   57   43   51
  17      39   46   46   32    ?        39      46   46   32   37.69
  18      48   38   41   44   33        48      38   41   44   33
  19      46   40   47   29    ?        46      40   47   29   37.44
  20      30   34   43   46   18        30      34   43   46   18
  21       ?   30   32   35   21        20.46   30   32   35   21
  22       ?   26   15   20    ?         9.87   26   15   20   14.66

NOTE: Left panel: 22 students have each taken 5 exams, labeled A, B, C, D, and E. Some of the scores for A and E, indicated by "?", are missing. Right panel: The missing data have been imputed from a two-way additive model. The full data set, taken from Mardia, Kent, and Bibby (1979), appears in table 1 of Efron (1992a).

Then we can use the two-way fitting algorithm (1.2), (1.3) to impute the missing values in o* (giving, say, x̂*), and compute the bootstrap covariance matrix

    Σ̂* = (1/n) Σ_{i=1}^{n} (x̂*_i - μ̂*)(x̂*_i - μ̂*)'.    (1.8)

Finally, we calculate θ̂*, the maximum eigenvalue of Σ̂*, a bootstrap replication of θ̂.

A computer program drew 2,200 independent bootstrap samples o* and evaluated θ̂* for each one. Figure 1 shows the histogram of the 2,200 θ̂* values, the histogram being notably long-tailed to the right. These have empirical standard deviation 212.0, which by definition is the bootstrap estimate of standard error for θ̂. The average θ̂* value is 610.3, giving a bootstrap bias estimate -22.9 = 610.3 - θ̂.

The number of bootstrap samples, 2,200, is ten times that needed for estimating the standard error of θ̂ (see Efron 1987, sec. 9). But we need that many for the more delicate task of forming an approximate confidence interval for θ. Row 1 of Table 2 gives α-level approximate confidence limits with α = .025, .05, ..., .975, using the bootstrap confidence interval method called BCa by Efron (1987), explained in Section 2. The 90% central interval, for example, runs from the .05 to the .95 limit, θ ∈ [379, 1,164]. Notice that this interval extends more than twice as far to the right of θ̂ = 633.2 as to the left, reflecting the asymmetry of the bootstrap histogram as well as a bias correction.

Section 2 discusses the logic of the nonparametric bootstrap method (1.6), (1.7). More elaborate bootstraps are available, as will be discussed, but the simple method has much to recommend it. It is nonparametric, is applicable to any kind of imputation procedure, and requires no knowledge of the missing-data mechanism. Its main practical disadvantage is the computational expense of the 2,000 or so bootstrap replications required for reasonable numerical accuracy.

Row 2 of Table 2 is based on an analytic approximation to the BCa confidence limits that requires no Monte Carlo replications. This method, called ABC by DiCiccio and Efron (1992), is discussed in Section 5. The computational burden is only a few percentage points of that for BCa, a savings that can be crucial when using imputation methods more elaborate than (1.3).

Imputation method (1.2), (1.3) is suspect here because it imputes only best-fit values and ignores residual variability. It seems likely that x̂ will have a smaller empirical covariance matrix than the original unobservable data set x. We can avoid this kind of bias by assuming a parametric model and estimating θ by maximum likelihood. A multivariate normal model applied to the observed data o in Table 1 gives maximum likelihood estimate (MLE) θ̂ = 631.3, not much different from the previous estimate 633.2. The MLE was obtained using a variant of Dempster, Laird, and Rubin's (1977) EM algorithm.

The simple nonparametric bootstrap analysis described above can be applied just as well to the normal-theory MLE as to any other estimate. The computational economy of the ABC method is essential here because of difficulties in calculating θ̂. Row 4 of Table 2 shows the ABC limits for θ based on the MLE, these being somewhat wider than the limits based on (1.2)-(1.5). The delta method estimate of standard error, discussed in Section 5, was 253.9, about 15% bigger than the corresponding standard error for θ̂ given by (1.2)-(1.5). This is a price we pay for reducing bias.

Figure 1. Histogram of 2,200 Nonparametric Bootstrap Replications of the Maximum Eigenvalue θ̂, Using Estimates Based on the Two-Way Fitting Algorithm (1.2), (1.3). This histogram has mean 610.3 and standard deviation σ̂ = 212.0, the bootstrap estimate of standard error. The first row of Table 2 gives approximate confidence limits for θ based on this histogram.
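To make the resampling recipe (1.6)-(1.8) concrete, here is a minimal sketch in R (the open-source dialect of the S language used for the Appendix program). It assumes the observed scores of Table 1 are held in a 22 x 5 matrix named scores, with NA standing for the question marks; the object and function names are illustrative and not part of the article.

# Nonparametric bootstrap of (1.6)-(1.8): resample whole rows, question
# marks and all, re-impute by the two-way additive fit (1.2)-(1.3), and
# recompute the maximum eigenvalue of the covariance matrix (1.4).

impute.twoway <- function(y) {
  obs <- !is.na(y)
  dat <- data.frame(score = y[obs],
                    row = factor(row(y)[obs], levels = 1:nrow(y)),
                    col = factor(col(y)[obs], levels = 1:ncol(y)))
  fit <- lm(score ~ row + col, data = dat)          # least squares fit of (1.2)
  grid <- expand.grid(row = factor(1:nrow(y), levels = 1:nrow(y)),
                      col = factor(1:ncol(y), levels = 1:ncol(y)))
  yhat <- matrix(predict(fit, newdata = grid), nrow = nrow(y))
  y[!obs] <- yhat[!obs]                             # imputation (1.3)
  y
}

max.eig <- function(y) {
  n <- nrow(y)
  sigma <- cov(y) * (n - 1) / n                     # divisor n, as in (1.4)
  max(eigen(sigma, symmetric = TRUE)$values)
}

theta.hat <- max.eig(impute.twoway(scores))         # 633.2 in the article

theta.star <- replicate(2200, {
  o.star <- scores[sample(nrow(scores), replace = TRUE), ]
  max.eig(impute.twoway(o.star))                    # one bootstrap replication
})
sd(theta.star)                                      # bootstrap standard error
mean(theta.star) - theta.hat                        # bootstrap bias estimate

The same loop works unchanged for any other choice of imputation rule or estimator; only impute.twoway and max.eig need to be swapped out.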

Table 2. Approximate Nonparametric Confidence Limits for the Maximum Eigenvalue θ, (1.1), Given the Observed Data o in Table 1

Confidence limit α:        .025   .050   .100   .160    .840    .900    .950    .975
1. BCa:                     341    379    429    478     966   1,059   1,164   1,253
2. ABC:                     340    379    430    476     946   1,046   1,172   1,295
3. Full-mechanism BCa:      349    387    439    490     970   1,074   1,213   1,300
4. ABC for MLE:             289    353    409    458   1,014   1,135   1,307   1,474
5. Multiple imputation:     345    382    428    468     864     946   1,063   1,177

NOTE: Row 1: Nonparametric BCa method based on the 2,200 bootstrap replications of Figure 1. Row 2: An analytic approximation to row 1 called ABC, requiring no Monte Carlo replications. Row 3: A more elaborate bootstrap confidence method discussed in Section 2. Row 4: ABC limits for θ based on the normal theory MLE θ̂ instead of the best-fit imputation (1.3). Row 5: Multiple imputation, or data augmentation, limits. The five methods are explained in Sections 2, 3, and 5.

Shorter confidence intervals are not necessarily better confidence intervals, but in this case the normal theory MLE seems a little unrobust; see the end of Section 5.

Based on ideas suggested by the EM algorithm, Rubin proposed a theory of multiple imputation for assessing the variability of an estimator θ̂ obtained in a missing data situation. Some good references are Rubin (1987), Rubin and Schenker (1986), Tanner (1991), and others listed in Section 4. Tanner and Wong (1987) gave a neat computational description of the ideas involved, using the term data augmentation. The multiple imputation or data augmentation approach, described in Section 4, is quite different from the bootstrap approach of Table 2. It is based on Bayesian rather than frequentist theory. Nevertheless, Section 5 shows that bootstrap methods can be useful for implementing data augmentation. Row 5 of Table 2 refers to approximate confidence limits based on a data augmentation scheme. Section 6 briefly summarizes the advantages and disadvantages of the different approaches.

This article concerns nonparametric error estimates for missing data problems. There is an important practical difference between the parametric and nonparametric situations. It is natural in a parametric framework to estimate θ by its MLE, θ̂. We know that θ̂ has asymptotically optimal properties for estimating θ, in terms of bias, variance, or more general confidence statements. The only problem is to make such statements on the basis of a partially missing data set.

The choice of an estimator θ̂ is usually not so clear in a nonparametric setting. In our maximum eigenvalue problem, we might have considered three different estimators: (1) the best-fit estimator (1.2)-(1.5); (2) the Buck estimator (Buck 1960; Little 1983), in which x̂_ij in (1.3) is replaced by a linear regression predictor based on the complete cases and the elements of Σ̂ in (1.4) are augmented by the addition of residual covariances (to add variability back into the imputed values x̂_ij); and (3) the normal-theory MLE.

The best-fit estimator could easily be biased downward. The normal-theory estimate, which is nearly optimal in a normal sampling framework, seems to be overly variable for this data set. The Buck estimator is appealing here, being of intermediate complexity between the best-fit and normal-theory estimators. We could test the appeal by a nonparametric bootstrap analysis like that in Figure 1. The analysis might show that the Buck estimator is no more variable than the best-fit estimator, in which case it would be preferable in terms of smaller bias.

The point is this: In nonparametric situations it is useful to assess the statistical properties of a variety of estimators. The nonparametric bootstrap has the advantage of applying in a simple way to any estimator θ̂. Of course, this does not obviate the need to sensibly select θ̂. It does mean that the statistician can choose θ̂ using the usual variety of exploratory tools, including experience and intuition, with the assurance of being able to obtain reasonable estimates of θ̂'s variability.

The nonparametric and parametric situations converge when we have categorical data. In this case there is a nonparametric MLE θ̂, which is asymptotically optimal in terms of bias, variance, and so on. Now it is possible to directly compare the nonparametric bootstrap with multiple imputation. This comparison is made in Section 3.

2. NONPARAMETRIC BOOTSTRAP METHODS

This section discusses the logic of the simple nonparametric bootstrap method that produced Figure 1. For comparison, we also describe a different nonparametric bootstrap for missing data problems. The different structure of the two bootstraps manifests itself in what we need to assume about the missing data mechanism. Near the end of the section we briefly describe the BCa system of approximate confidence intervals used in Table 2.

Figure 2 diagrams a missing data problem and its nonparametric bootstrap analysis. F is a population of units X_j,

    F = {X_j, j = 1, 2, ..., N},    (2.1)

with N possibly infinite. A missing data mechanism, say O_j = c(X_j), results in a population G of partially concealed objects

    G = {O_j = c(X_j), j = 1, 2, ..., N}.    (2.2)

In the example of Section 1, X_j is a vector of five scores for one student and O_j is the same vector with some or perhaps none of the numerical values concealed by question marks. We wish to infer the value of a parameter of the population F,

    θ_F = s(F).    (2.3)

In our example s(F) is the maximum eigenvalue of the covariance matrix corresponding to F. A random sample of size n is obtained from G, o = (o_1, o_2, ..., o_n), as on the left side of Table 1.
[Figure 2 diagram:  F --conceal--> G --sample--> o --infer--> Ĝ,  with θ_F = s(F), θ = t(G), θ̂ = t(Ĝ);  bootstrap:  Ĝ --sample--> o* --infer--> Ĝ*,  θ̂* = t(Ĝ*)]

Figure 2. Diagram of Nonparametric Bootstrap Applied to a Missing Data Problem. The individual members of a population of interest F are partially concealed according to a missing-data mechanism, giving a population G of partially concealed objects; o is a random sample of size n from G; Ĝ is the empirical distribution corresponding to o; and statistic θ̂ = t(Ĝ) estimates parameter θ = t(G), which is intended to be a good approximation to the actual parameter of interest θ_F = s(F). The bootstrap sampling and inference procedures duplicate those actually used, giving bootstrap replications θ̂* = t(Ĝ*). The BCa and ABC methods, described later, give good approximate confidence intervals for θ based on the bootstrap replications.

The nonparametric inference step estimates G by the empirical distribution Ĝ of o,

    Ĝ: probability 1/n on o_i for i = 1, 2, ..., n.    (2.4)

We use some function of Ĝ, say

    θ̂ = t(Ĝ),    (2.5)

as an estimate of θ_F; for example, the best-fit imputation estimate (1.2)-(1.5).

The nonparametric bootstrap procedure repeats the actual sampling, inference, and estimation steps, but beginning with Ĝ instead of G. The advantage of course is that Ĝ, unlike G, is known, so that we can carry out an unlimited number of bootstrap replications. Each replication involves drawing a bootstrap sample o* from Ĝ, as at (1.7), forming the empirical distribution Ĝ* corresponding to the o*, and calculating θ̂* = t(Ĝ*).

A principal advantage of the nonparametric bootstrap method is that it does not depend on the missing-data mechanism. The process of concealment does not affect the method, which conceptually begins with G rather than F. This is also a potentially serious disadvantage. Nothing in Figure 2 connects the parameter of interest, θ_F = s(F), to the parameter being estimated, θ = t(G).

Some choices of t(G) are better than others for reducing the possible bias of θ̂ as an estimate of θ_F. Row 4 of Table 2 refers to the normal-theory MLE estimator, for which t(Ĝ) can be described as follows: assume that each student's score vector x_i is a random draw from a five-dimensional normal distribution

    x_i ~ N_5(μ, Σ),    (2.6)

some components of which have been concealed by the missing-data mechanism to give the observed vector o_i, i = 1, 2, ..., n. We estimate μ and Σ by normal-theory maximum likelihood and then take θ̂ to be the maximum eigenvalue of Σ̂.

Under some circumstances this choice of t(·) will be Fisher consistent for estimating the maximum eigenvalue, in the sense that

    θ = t(G) = s(F) = θ_F.    (2.7)

The circumstances are that F is multivariate normal and that the missing data mechanism does not affect the likelihood function. In other words, we assume that each o_i has the appropriate marginal normal distribution obtained from (2.6). Rubin (1987) called this latter assumption ignorable nonresponse.

Fisher consistency says that in large samples our estimate will tend toward the correct answer. This is an obviously desirable property, but it may be costly to insist upon it. In the example of Table 2 it seems to cost an extra 15%. This situation resembles the robust estimation of a population mean, where there is a trade-off between the bias and variance of possible estimators. Even if we completely trust the normality assumption (2.6), ignorable nonresponse is unverifiable, and is questionable in many realistic circumstances. It fails if the missing data depends on the missing values; for example, if the question marks in Table 1 occur more frequently at extreme values of A and E.

The estimation problem becomes simpler if we are dealing with categorical data points x_i, rather than with the five-dimensional vector-valued points of the eigenvalue example. In the categorical case it is easy to write down the nonparametric maximum likelihood estimator θ̂ = t(Ĝ). Fisher consistency (2.7) will now hold true for all distributions F, as long as we have ignorable nonresponse. The objection that the nonparametric bootstrap is estimating the wrong parameter is now totally removed, though possibly still at the expense of excess variability for θ̂ in reasonable-sized samples. The categorical case is discussed in Section 3, where it is used as a basis of comparison between the nonparametric bootstrap and multiple imputation.

Other bootstrap methods can be applied to missing-data problems. The full-mechanism bootstrap diagrammed in Figure 3 begins more directly than the nonparametric bootstrap of Figure 2. The original, unobservable, data set x = (x_1, x_2, ..., x_n) is assumed to be obtained by random sampling from F. The missing-data mechanism is applied to the components of x, o_i = c(x_i), giving the observed data o. Some method of inference, necessarily more complicated than (2.4), gives an estimate F̂ for F based on o. The parameter of interest, θ = s(F), is then estimated by θ̂ = s(F̂).
[Figure 3 diagram:  actual:  F --sample--> x --conceal--> o --infer--> F̂,  θ = s(F), θ̂ = s(F̂);  bootstrap:  F̂ --sample--> x* --conceal--> o* --infer--> F̂*,  θ̂* = s(F̂*)]

Figure 3. Full-Mechanism Bootstrap. x is a random sample from the population of interest F; members of x are partially concealed by the missing-data mechanism to give o; F̂ is an estimate of F based on o; and the parameter of interest, θ = s(F), is estimated by θ̂ = s(F̂). The bootstrap sampling, missing-data, and inference procedures are supposed to duplicate those that actually occurred. This requires specification of the missing-data mechanism.
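A minimal sketch of this full-mechanism bootstrap, under the simple choices used for row 3 of Table 2 and described in the paragraphs that follow (F̂ taken to be the best-fit imputation x̂, and the same matrix positions concealed as in o). It reuses the assumed scores matrix and the impute.twoway and max.eig functions from the sketch following (1.8); these names are illustrative only.

# Full-mechanism bootstrap of Figure 3 (a sketch, assuming the data are
# missing completely at random, so the same positions can be re-concealed).
x.hat   <- impute.twoway(scores)      # F.hat: empirical distribution of x.hat
missing <- is.na(scores)              # the original pattern of question marks

full.mech.rep <- function() {
  x.star <- x.hat[sample(nrow(x.hat), replace = TRUE), ]   # sample from F.hat
  o.star <- x.star
  o.star[missing] <- NA               # conceal the same positions as in o
  max.eig(impute.twoway(o.star))      # re-impute and re-estimate theta
}
theta.star.fm <- replicate(2500, full.mech.rep())
sd(theta.star.fm)                     # roughly 220, as reported in the text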

The full-mechanism bootstrap repeats the sampling, missing-data, and inference processes to yield bootstrap replications θ̂* = s(F̂*).

A total of 2,500 full-mechanism bootstrap replications θ̂* were obtained for the maximum eigenvalue problem of the Introduction. The inference method was taken to be as simple as possible: F̂ equaled the empirical distribution of the best-fit imputation x̂ based on (1.2), (1.3), putting probability 1/n on each vector x̂_i. (A more ambitious scheme might have obtained F̂ by multiple imputations from o, as discussed in Sec. 4.) The bootstrap samples x* = (x*_1, ..., x*_n) were drawn by simple random sampling from F̂ (i.e., from {x̂_1, x̂_2, ..., x̂_n}), and then o* was obtained by concealing the same elements of x* as those concealed by question marks on the left side of Table 1. For example, if x*_1 were selected to be x̂_22, then o*_1 = (?, 26, 15, 20, 14.66). Best-fit imputation (1.2), (1.3) applied to o* gave x̂*_i, then F̂*, and finally θ̂*.

The histogram of the 2,500 θ̂* values looked much the same as Figure 1, giving bootstrap standard error estimate σ̂ = 219.8 and bias estimate -23.2, compared with 212.0 and -22.9 previously. The BCa confidence limits were also much the same as before, as seen in row 3 of Table 2.

The full-mechanism bootstrap uses the same function s(·) to define and estimate the parameter of interest, avoiding the possibility of definitional bias that we worried about earlier. However, a similar difficulty arises with the more complicated inference required in going from o to F̂ rather than from o to Ĝ.

There is a serious disadvantage: In Figure 3 we need to specify the missing-data mechanism o*_i = c(x*_i) in order to obtain the bootstrap replications θ̂*. In other words, we need to say what we would have observed in situations other than the one that actually occurred. Simply concealing the same elements as in o is allowable if the missing-data mechanism is what Little and Rubin (1987) called missing completely at random, but this is usually an unrealistic assumption. It is a much stronger assumption than ignorability.

On the other hand, if the missing data mechanism is nonignorable, then we may need to model it anyway, in which case the full-mechanism bootstrap becomes practical. Suppose that we believed that the observed first test scores in Table 1, the o_i1, were missing with a probability that depended on the actual score x_i1 through a model (2.8) with parameters λ_0 and λ_1. Past experience or a preliminary data analysis might provide rough estimates of λ_0 and λ_1. Then we could use (2.8) to compute o* from x* in Figure 3, and so carry through the full-mechanism bootstrap.

Bootstrap theory aims to provide statistical inferences from a bootstrap histogram like Figure 1. The BCa method used in Table 2 is a way of forming highly accurate approximate confidence intervals for the parameter

    θ = t(G)

from θ̂ = t(Ĝ) and bootstrap replications θ̂*. It applies to both parametric and nonparametric situations (see Efron 1987). Typically the method is second-order accurate; if θ̂(α) is the endpoint of a one-sided BCa interval of intended level α, then

    prob{θ < θ̂(α)} = α + O_p(1/n)    (2.9)

as the sample size n → ∞ (see Remark D). This compares with α + O_p(1/√n) if we use the standard interval θ̂ + z^(α)σ̂, where σ̂ is an estimated standard error for θ̂ and z^(α) is the 100αth percentile of a N(0, 1) distribution,

    z^(α) = Φ^{-1}(α),    (2.10)

Φ being the standard normal cdf. Remark C extends the BCa method to the K-sample problem.

The BCa interval endpoints are computed from the percentiles of the bootstrap histogram. Let Ĥ be the empirical cdf of the bootstrap replications, say B of them (B = 2,200 in Fig. 1):

    Ĥ(t) = #{θ̂* < t}/B.    (2.11)

Then Ĥ^{-1}(α) is the 100αth bootstrap percentile. The α-level BCa endpoint θ̂(α) is defined by

    θ̂(α) = Ĥ^{-1}( Φ( z_0 + (z_0 + z^(α)) / (1 - a(z_0 + z^(α))) ) ),    (2.12)

where z_0 is the bias-correction constant

    z_0 = Φ^{-1}(Ĥ(θ̂)).    (2.13)

The computation of the acceleration constant a in (2.12) is described in Section 5 at (5.5).

If z_0 and a equal 0 and Ĥ is the normal distribution N(θ̂, σ̂²), then θ̂(α) equals the standard confidence limit θ̂ + z^(α)σ̂. Otherwise, formula (2.12) corrects all of the second-order errors made by the standard intervals (see Efron 1987, sec. 2).
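The mapping (2.11)-(2.13) from a bootstrap histogram to BCa endpoints takes only a few lines. A minimal sketch, assuming the replications are in theta.star, the point estimate in theta.hat, and the acceleration constant a has already been computed as in (5.5):

bca.limits <- function(theta.star, theta.hat, a,
                       alpha = c(.025, .05, .10, .16, .84, .90, .95, .975)) {
  B  <- length(theta.star)
  z0 <- qnorm(sum(theta.star < theta.hat) / B)        # bias correction, (2.13)
  za <- qnorm(alpha)                                  # z^(alpha), (2.10)
  adj <- pnorm(z0 + (z0 + za) / (1 - a * (z0 + za)))  # adjusted level, (2.12)
  quantile(theta.star, adj, names = FALSE)            # H-inverse of adjusted level
}
# e.g. row 1 of Table 2 is approximately bca.limits(theta.star, theta.hat, a)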
Formula (2.12) looks complicated, but it is easy to apply. The hard part is computing the 2,000 bootstrap replications necessary to give the formula sufficient accuracy. The ABC algorithm of Section 5 uses analytic approximations in place of Monte Carlo replications. Typically it requires only a few percent as much computational effort as BCa.

Remark A. The full-mechanism and nonparametric bootstrap methods are identical in the important special case where we observe censored data from a survival analysis (see Efron 1981a).

Remark B. The basic idea of the nonparametric bootstrap of Figure 2 is to consider the o_i, including the missing values, as randomly sampled points from a population G. This idea has been used before in the sample survey literature, though typically with methods like balanced repeated replications or the jackknife, rather than the bootstrap. Fay (1986, sec. 4) used the jackknife on a complicated categorical data problem with missing data. Greenwood's formula for the variance of a survival curve is a delta method version of the same idea. Meng and Rubin (1989) suggested the possibility of a bootstrap approach to missing-data problems. Laird and Louis' (1987) "Type I" and "Type II-III" bootstraps are examples, respectively, of the nonparametric and full-mechanism methods.

Remark C. The nonparametric bootstrap of Figure 2 can also be applied to K-sample problems

    (G_1, G_2, ..., G_K) → (o_1, o_2, ..., o_K),    (2.14)

where the o_k are independent random samples of size n_k obtained from populations G_k, the parameter of interest being θ = t(G_1, G_2, ..., G_K). The empirical distributions Ĝ_k corresponding to the o_k give independent bootstrap samples o*_k of size n_k, from which we calculate Ĝ*_k and finally θ̂* = t(Ĝ*_1, Ĝ*_2, ..., Ĝ*_K). The BCa intervals are calculated from (2.12), (2.13) as before, with the acceleration a obtained as in Remark H of Section 5.

Remark D. DiCiccio and Efron (1992) showed that the BCa and ABC intervals are second-order accurate if the data set is obtained by random sampling from a multiparameter exponential family and θ is a smooth function of the expectation vector of the family. A discretization argument applies this result to the situation in Figure 2. We suppose that the sample space of the o_i can be discretized to a finite number of outcomes, say L of them. For the examples in Table 1, each of the five coordinates of an o_i vector takes its value in the 102-element set {?, 0, 1, 2, ..., 100}, so we can take L = 102^5. The multiparameter exponential family referred to above is the L-category family of multinomial distributions. See the comments on finite sample spaces of Efron (1987, sec. 8).

3. MULTIPLE IMPUTATION

Best-fit imputation, as illustrated on the right side of Table 1, conveys a false sense of accuracy if the imputed values are interpreted as ordinary observations. Rubin (1987, 1978) proposed drawing multiple random imputations of the missing data rather than a single best-fit imputation. Variability of results between the randomly imputed data sets can then be used to assess the true accuracy of an estimate θ̂. The variability calculation is carried out by means of a Bayesian updating scheme, quite different in concept from the bootstrap method of Section 2. This section briefly reviews multiple imputation, following the development of Tanner and Wong (1987). Section 4 presents a third bootstrap method for missing data, based on multiple imputation. At first we will use parametric notation, which is more natural in the Bayesian framework. Then we will discuss categorical data, for which it is easier to make a nonparametric comparison of multiple imputation with the bootstrap.

3.1 Data Augmentation

Let o indicate the observed data set and let x indicate any complete data set consonant with o. In the example of Table 1, x could be any 22 × 5 matrix of numbers agreeing with o at all of its numerical entries. The actual complete data set x giving rise to o, which would have been observed if there were no missing data, is assumed to be sampled from a parametric family with density function f_η(x), with η being an unknown p-dimensional parameter vector. Starting with a prior density ξ_0(η) on η, Bayes's theorem would give hypothetical posterior density ξ(η|x) if x were observed and, more concretely, the actual posterior density ξ(η|o) having observed o. A standard probability calculation relates ξ(η|o) to ξ(η|x):

    ξ(η|o) = ∫ ξ(η|x) f(x|o) dx,    (3.1)

where f(x|o) is the predictive density of x given o, the conditional density integrating out η,

    f(x|o) = ∫ f_η(x|o) ξ(η|o) dη.    (3.2)

The integral in (3.1) is taken over all x consonant with o.

Result (3.1), the data augmentation identity, can be stated as follows. The posterior density of η, given the observed data o, is the average posterior density of η based on a complete data set x. The average is taken over the predictive density of x given o. In a typical missing-data problem, computing ξ(η|x) is easy but computing ξ(η|o) is difficult. If we can sample from the predictive density f(x|o), then (3.1) gives a practical way of approximating ξ(η|o):

    ξ̂(η|o) = (1/M) Σ_{m=1}^{M} ξ(η|x^(m)),    (3.3)

where x^(1), x^(2), ..., x^(M) are the multiple imputations, that is, independent draws from f(x|o). This argument has a circular look, because we need to know ξ(η|o) to calculate f(x|o) in (3.2). Tanner and Wong (1987) investigated an iterative algorithm related to Gibbs sampling for actually carrying out (3.3). Noniterative approximations are available, as discussed later.
Most often, inferences are desired for some real-valued function (or functions) of η,

    θ = s(η),    (3.4)

like the maximum eigenvalue in Section 1, rather than for the entire vector η. The marginal posterior densities of θ, say π(θ|o) and π(θ|x), are related by a marginalized version of (3.1),

    π(θ|o) = ∫ π(θ|x) f(x|o) dx,    (3.5)

with f(x|o) still being defined by (3.2).

The most obvious difficulty in applying (3.3)-(3.5) is the generation of imputations x^(m) from the predictive density f(x|o). A simple approach, called "poor man's data augmentation" by Wei and Tanner (1990), is to sample the x^(m) from f_η̂(x|o), with η set equal to the MLE η̂:

    x^(m) ~ f_η̂(x|o),  independently for m = 1, 2, ..., M.    (3.6)

This could also be called a conditional parametric bootstrap sample. In many situations (3.6) is quite satisfactory, though it can underestimate variability if there is too much missing data.

A better approximation to the predictive density is often used. Each imputation is drawn from its own bootstrapped choice of the parameter vector η:

    x^(m) ~ f_η̂*(m)(x|o),  independently for m = 1, 2, ..., M.    (3.7)

A bootstrap parameter vector η̂* is the MLE for η based on a bootstrap sample o*, (1.7). Rubin (1981) and Efron (1982, sec. 10.6) pointed out that the bootstrap distribution of η̂* is quite close to the nonparametric Bayes posterior distribution of η given o, starting from a vague Dirichlet prior for G, (2.2). The difference between (3.6) and (3.7) is the difference between a first-level and a second-level bootstrap sample. A nice application of (3.7) was presented by Heitjan and Little (1991). Little and Rubin (1987, sec. 12.4) call (3.7) a proper imputation method, as opposed to the improper method (3.6), which underestimates variability. They used the name approximate Bayesian bootstrap for procedures like (3.7).

The data augmentation identity requires the correct specification of f_η(x|o) in (3.2). In practice this means making some assumption about the missing-data mechanism, such as ignorable nonresponse, as discussed next.
3.2 Categorical Data

To compare multiple imputation with the nonparametric bootstrap, we need to have a nonparametric interpretation of (3.1)-(3.7). This is easy to do in the case of categorical data, where the original population units X_j in (2.1) have only a finite number of possible values, say X_j ∈ X = {X(1), X(2), ..., X(L)}. Similarly, the partially concealed objects O_j = c(X_j) produced by the missing-data mechanism are limited to a finite set O = {O(1), O(2), ..., O(K)}. Each O(k) is a subset of X, so that observing O_j = o_j means that the unobserved X_j = x_j lies in the subset o_j. In what follows we will let δ(x, o) indicate whether or not x is among the values contained in o:

    δ(x, o) = 1 if x ∈ o,  = 0 if x ∉ o    (x ∈ X, o ∈ O).    (3.8)

As a simple example, suppose that a human population is categorized by sex and handedness. There are L = 4 original population values:

    X(1) = (male, left),  X(2) = (male, right),  X(3) = (female, left),  and  X(4) = (female, right).    (3.9)

If there are difficulties in ascertaining handedness, we will need K = 6 states in O: O(k) = X(k) for k = 1, 2, 3, 4 and the two additional states

    O(5) = (male, ?)  and  O(6) = (female, ?).    (3.10)

In this case O(6) corresponds to {X(3), X(4)}, so that observing O_j = O(6) means that x_j is either (female, left) or (female, right).

The missing-data mechanism can be described by the conditional probability density of O_j given X_j, say

    c(o|x) = prob{O_j = o | X_j = x}    (for x ∈ X, o ∈ O).    (3.11)

Because X_j is a member of O_j, only those o containing x have positive probability:

    c(o|x) = 0 if δ(x, o) = 0.    (3.12)

The populations F and G in (2.1), (2.2) correspond to densities f = (f_1, f_2, ..., f_L) and g = (g_1, g_2, ..., g_K) on X and O respectively. For example, f_3 is the proportion of units in F with X_j equalling X(3). The two densities are related by g_o = Σ_x c(o|x) f_x, so f determines g. Letting C be the L × K matrix with entries c(o|x),

    g = fC.    (3.13)

In our nonparametric setting f plays the role of the parameter vector η in (3.1)-(3.7).

Bayes's rule gives d_f(x|o), the conditional probability density of x given o:

    d_f(x|o) = f_x c(o|x)/g_o.    (3.14)

The L × K matrix D_f = (d_f(x|o)) inverts relationship (3.13),

    f = g D_f'.    (3.15)

Because D_f depends on f, solving for f in (3.15) is not the same as using the linear inversion f = gC'(CC')^{-1}. (We are assuming that C does not depend on f. If it did, we would be in the more difficult situation where the missing-data mechanism itself provides information about f.)

A random sample o = (o_1, o_2, ..., o_n) from G gives empirical probabilities ĝ = (ĝ_1, ĝ_2, ..., ĝ_K),

    ĝ_k = #{o_i = O(k)}/n.    (3.16)

The MLE f̂ of f is obtained from an empirical version of (3.15),

    f̂ = ĝ D_f̂'.    (3.17)

This is the self-consistency property of the MLE as explained by Dempster et al. (1977).

The trouble with using (3.17) to find the MLE f̂ is that D_f depends on the missing-data density c(o|x), which is usually unknown. This is where ignorability comes in. Let p_f(x|o) indicate the "obvious" conditional density of x given o:

    p_f(x|o) = δ(x, o) f_x / Σ_{x'} δ(x', o) f_{x'}.    (3.18)

p_f(x|o) puts probabilities proportional to f_x on each x in o. Suppose that the selection mechanism of o_i given x_i was predetermined in the following sense: First a disjoint partition P_i of X was defined, then x_i was selected according to density f, and finally o_i was chosen to be that member of P_i into which x_i fell. In this case p_f(x_i|o_i) would equal the actual conditional density d_f(x_i|o_i).

Ignorability implies that we can ignore the missing-data mechanism and set d_f(x|o) equal to p_f(x|o) (see Rubin 1986, sec. 7). Letting P_f be the L × K matrix (p_f(x|o)), (3.18) becomes f = gP_f', and (3.15) takes on the more practical form

    f̂ = ĝ P_f̂'.    (3.19)

We will write f̂ = MLE(ĝ) to indicate the mapping from ĝ to f̂ implied by (3.19), ignoring questions of uniqueness and existence.

The nonparametric MLE for a parameter θ = s(f) is

    θ̂ = t(ĝ) = s(MLE(ĝ)).    (3.20)

If (3.20) is the function t used in Figure 2, then we know we are estimating the correct parameter, because θ_F = s(f) = t(g) = θ as in (2.7). Also, θ̂ is asymptotically efficient. This choice of t puts the nonparametric bootstrap on the same footing as multiple imputation.

Figure 4 is a comparison of the nonparametric bootstrap with multiple imputation in the nonparametric categorical data problem. The top line shows the computational steps involved in the nonparametric bootstrap. The resampling step o → o* is the simple bootstrap defined at (1.7). Here o and o* could just as well be labeled Ĝ and Ĝ* as in Figure 2, or equivalently, ĝ and ĝ*. The bootstrap replication θ̂* = t(Ĝ*) in Figure 2 is given here by θ̂* = s(MLE(ĝ*)) as in (3.20).

The first three steps of the multiple imputation algorithm implement the approximate Bayesian bootstrap (3.7). Conditional resampling is done according to the conditional densities (3.18),

    x**_i | o_i ~ p_f̂*(· | o_i)    (i = 1, 2, ..., n).    (3.21)

Having selected f̂*, the sampling in (3.21) is conditionally independent for i = 1, 2, ..., n. The completed data set x** = (x**_1, x**_2, ..., x**_n) is what we called x^(m) in (3.7). The double star notation emphasizes the two levels of bootstrap sampling involved.

A basic assumption of the multiple imputation approach is that inferences for θ would be easy to make if there were no missing data. We assume in Figure 4 that given x**, a completed data set, it is easy to calculate a posterior density ξ(f|x**) and marginalize ξ to the appropriate posterior density π(θ|x**) for θ. In contrast, the nonparametric bootstrap uses the replications θ̂* to directly make inferences about θ. The crucial marginalization step, from f to θ, is handled automatically by the bootstrap confidence algorithm BCa or ABC.

Marginalization is a major difficulty in applying Bayesian methods to high-dimensional problems, even without missing data. In genuine Bayesian situations there can be no argument with inferences based on the posterior density π(θ|o). However multiple imputation is often applied in an objectivist framework, beginning, perhaps implicitly, with some form of uninformative prior. This can be tricky ground, where an apparently innocuous prior on the full parameter vector leads to unexpected biases for θ (see Bernardo and Berger 1991 and Tibshirani 1989). Section 4 concerns an easy and accurate marginalization technique, designed to simplify the use of multiple imputation.

Figure 4 shows that the nonparametric bootstrap is simpler and more direct than multiple imputation. On the other hand, multiple imputation can be applied to parametric problems and can incorporate Bayesian information (Lazzeroni, Schenker, and Taylor 1990). It more graphically conveys the effect of the missing data, as will be seen in Figure 5. The two methods are compared further in Section 6.

Remark F. Theoretically we could use the nonparametric MLE (3.20) to estimate θ in the maximum eigenvalue problem, by discretizing the data in Table 1 as we did in Remark D. This is not at all practical, given only n = 22 points in a five-dimensional space. Ad hoc estimators like those in Section 1 arise in nonparametric problems because of the impracticality of full nonparametric maximum likelihood estimation.

[Figure 4 diagram:  nonparametric bootstrap:  o --resample--> o* --MLE--> f̂*,  θ̂* = s(f̂*);  multiple imputation:  o --resample--> o* --MLE--> f̂* --conditional resample--> x** --Bayes--> ξ(f|x**) --> π(θ|x**)]

Figure 4. Comparison of Nonparametric Bootstrap With Multiple Imputation for Categorical Data. Resample indicates a nonparametric bootstrap sample (1.7); MLE is the nonparametric maximum likelihood estimator for f, (3.19); the parameter of interest is θ = s(f); and conditional resamples are obtained by the approximate Bayesian bootstrap method (3.21). Multiple imputation assumes that given the completed data set x**, it is easy to compute the Bayes posterior density ξ(f|x**) for f and then marginalize ξ to the posterior density π(θ|x**) for θ.
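The categorical calculations above are small enough to spell out. Here is a sketch for the sex-by-handedness example (3.9)-(3.10), iterating the self-consistency relation (3.18)-(3.19) to get the nonparametric MLE f̂; the counts used for ĝ are hypothetical, purely for illustration.

delta <- cbind(diag(4),            # O(1)-O(4): fully classified states
               c(1, 1, 0, 0),      # O(5) = (male, ?)   = {X(1), X(2)}
               c(0, 0, 1, 1))      # O(6) = (female, ?) = {X(3), X(4)}

mle.categorical <- function(g.hat, delta, iterations = 200) {
  f <- rep(1 / nrow(delta), nrow(delta))       # start from the uniform density
  for (it in 1:iterations) {
    P <- delta * f                             # delta(x, o) * f_x, as in (3.18)
    P <- sweep(P, 2, colSums(P), "/")          # p_f(x | o): columns sum to 1
    f <- as.vector(P %*% g.hat)                # empirical self-consistency (3.19)
  }
  f
}

counts <- c(20, 60, 15, 85, 40, 30)            # hypothetical counts of O(1)-O(6)
g.hat  <- counts / sum(counts)                 # empirical probabilities (3.16)
f.hat  <- mle.categorical(g.hat, delta)        # nonparametric MLE of f
theta.hat <- f.hat[1] + f.hat[3]               # e.g. theta = P(left-handed), (3.20)

The conditional resampling step (3.21) would then draw each x**_i from the column of the final p_f̂*(·|o) matrix corresponding to the observed state o_i.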
Figure 5. Multiple-Imputation Bootstrap for the Maximum Eigenvalue Problem of Section 1. The left panel shows ABC confidence densities π_†(θ|x^(m)) for 25 imputed data sets x^(m), m = 1, 2, ..., 25. In the right panel, the solid line is π̄_†(θ|o), (4.3), the average of π_†(θ|x^(m)) for m = 1, 2, ..., 50, and the dashed line is π̂_nonpar(θ|o), the ABC density for the nonparametric bootstrap of Section 2; π̄_†(θ|o) has a shorter upper tail than π̂_nonpar(θ|o).
4. MULTIPLE-IMPUTATION BOOTSTRAP

Suppose now that we have satisfactorily solved the problem of sampling from the predictive density f(x|o) to obtain imputations x^(1), x^(2), ..., x^(M) and wish to construct good approximate confidence intervals for a parameter of interest θ. The data augmentation identity (3.5) suggests using the percentiles of the estimated posterior density

    π̂(θ|o) = (1/M) Σ_{m=1}^{M} π(θ|x^(m))    (4.1)

as the endpoints for such intervals. But to do so requires that the complete-data posterior density π(θ|x) enjoy good confidence properties. Choosing an appropriate uninformative prior for a high-dimensional vector parameter can be a difficult problem (see Berger and Bernardo 1991 and Tibshirani 1989). This section uses a bootstrap-based method developed by Efron (1992a) to avoid such choices and to simplify the calculation of (4.1).

Let θ_x(α) indicate the α-level endpoint of an exact or approximate system of confidence intervals for θ based on data x. The confidence density for θ given x is defined to be

    π_†(θ|x) = dα/dθ_x(α),    (4.2)

the derivative of the confidence level α with respect to the interval endpoint θ_x(α). This density assigns probability .01 to θ lying between the .90 and .91 confidence limits, and so on. By definition, the 100αth percentile of π_†(θ|x) is θ_x(α), so π_†(θ|x) is just another way of describing the function θ_x(α). But the confidence density is convenient for use in (4.1), giving

    π̄_†(θ|o) = (1/M) Σ_{m=1}^{M} π_†(θ|x^(m)).    (4.3)

Confidence densities can be thought of as a way to automatically marginalize a high-dimensional posterior distribution to a single parameter of interest. If θ_x(α) represents a second-order accurate confidence interval endpoint, then π_†(θ|x^(m)) will be a good approximation to the posterior density for θ, having begun with an appropriate uninformative prior for the full parameter vector (see Efron 1993). The ABC method described in Section 5 yields a very simple construction for π_†(θ|x^(m)). We will call the construction of multiple-imputation confidence intervals via (4.3) and the ABC method the multiple-imputation bootstrap. This method has an objectivist Bayesian rationale, like the Bayesian bootstrap with which it begins.

Figure 5 shows the application of the multiple-imputation bootstrap to the maximum eigenvalue problem of Section 1. M = 50 multiple imputations x^(m) were drawn using the approximate Bayesian bootstrap (3.7). The construction of π_†(θ|x^(m)), described in Section 5, was only partially nonparametric; the family f_η(x) used in (3.7) was the five-dimensional normal (2.6). Less parametric ways of drawing the x^(m) seemed impractical, given the small sample size and the five-dimensional sample space.

The left panel of Figure 5 shows the ABC densities π_†(θ|x^(m)) for m = 1, 2, ..., 25. The solid curve in the right panel is the average of all 50 densities, π̄_†(θ|o), (4.3). Row 5 of Table 2 comprises the appropriate percentiles of π̄_†(θ|o).

The ABC method can also be used as a computationally efficient way to implement the nonparametric bootstrap of Section 2. The dashed curve in the right panel of Figure 5 is π̂_nonpar(θ|o), the ABC confidence density appropriate to the nonparametric bootstrap. Row 2 of Table 2 comprises the percentiles of π̂_nonpar(θ|o).

In this case the multiple imputation confidence limits are too short in the upper tail. Section 5 traces the difficulty to an overly influential student in Table 1, combined with the normal-theory imputations.
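In code, the averaging step (4.3) is just a mean of M density curves on a common grid. A sketch, assuming imputations is a list of M completed data sets from (3.7) and abc.density is a hypothetical helper returning the ABC confidence density (5.9) of one completed data set evaluated on theta.grid:

theta.grid <- seq(200, 1600, length.out = 400)
dens <- sapply(imputations, abc.density, theta.grid = theta.grid)  # one column per x^(m)
pi.bar <- rowMeans(dens)                       # averaged confidence density, (4.3)
# read off approximate confidence limits as percentiles of the averaged density
cdf <- cumsum(pi.bar) / sum(pi.bar)
sapply(c(.05, .95), function(a) theta.grid[which.min(abs(cdf - a))])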
5. THE ABC ALGORITHM

The main disadvantage of the BCa method is the large number of bootstrap replications required. This computational burden can often be avoided by using analytical expansions in place of the bootstrap Monte Carlo replications. DiCiccio and Efron (1992) developed an algorithm called ABC, standing for "approximate bootstrap confidence" intervals, that uses numerical second derivatives to accurately approximate the endpoints of the BCa intervals. The development in that paper is mainly for parametric exponential family problems. Here the algorithm is adapted to nonparametric problems, actually simplifying the calculations. The algorithm is given as an S function in the Appendix.

Given a bootstrap sample o* = (o*_1, o*_2, ..., o*_n), (1.7), let P*_i indicate the proportion of times o_i is represented in o*:

    P*_i = #{o*_j = o_i}/n    (i = 1, 2, ..., n).    (5.1)

With the data o fixed, we can think of a bootstrap replication θ̂* = t(Ĝ*) in Figure 2 as a function of the resampling vector P* = (P*_1, P*_2, ..., P*_n), say θ̂* = T(P*). The resampling vector takes its value in the simplex

    S_n = {P: P_i ≥ 0, Σ P_i = 1}.    (5.2)

The resampled statistic θ̂* = T(P*) can be thought of as a function on the simplex, forming a resampling surface over S_n, as in figure 6.1 of Efron (1982). The geometry of the resampling surface determines the bootstrap confidence intervals for θ. In the BCa method the surface is explored by evaluating T(P*) for some 2,000 random choices of P*.

The ABC algorithm approximates the BCa interval endpoints by exploring the local geometry of the resampling surface, its slopes and curvatures, near the central point of the simplex, P^0 = (1/n, ..., 1/n). This is done using numerical derivatives instead of Monte Carlo, enormously reducing the computational burden. This tactic fails for unsmooth statistics like the sample median, but it has worked well for a large number of examples in DiCiccio and Efron (1992), and also here for the maximum eigenvalue problem. The ABC intervals were proven to be second-order accurate for smooth statistics by DiCiccio and Efron.

Statistical error estimates based on derivatives are familiar from delta method or influence function calculations. For example, the nonparametric delta method estimate of standard error is

    σ̂ = [Σ_{i=1}^{n} t̂_i² / n²]^{1/2}.    (5.3)

The vector t̂ = (t̂_1, t̂_2, ..., t̂_n) is the empirical influence function,

    t̂_i = lim_{ε→0} [T((1 - ε)P^0 + ε e_i) - T(P^0)] / ε,    (5.4)

where T(P^0) = t(Ĝ) = θ̂ and e_i is the ith coordinate vector (0, 0, ..., 0, 1, 0, ..., 0). Having calculated t̂, the acceleration a in the BCa formula (2.12) is given by

    a = (1/6) Σ_{i=1}^{n} t̂_i³ / (Σ_{i=1}^{n} t̂_i²)^{3/2}    (5.5)

(Efron 1987, sec. 7).

Definition (2.13) for z_0 looks simple compared to (5.5), but usually a is easier to estimate than z_0. The ABC algorithm quickly computes a en route to the confidence interval limits. The algorithm uses an analytic approximation for z_0 that is often more accurate than (2.13) even for very large numbers of bootstrap replications. In the maximum eigenvalue analysis of Figure 1, the algorithm gave (z_0, a) = (.190, .099).

In addition to the MLE θ̂, the standard intervals θ̂ ± z^(α)σ̂ require only the calculation of σ̂, often taken to be (5.3). The ABC intervals require three more constants, (a, z_0, c_q). Besides the acceleration a and the bias correction z_0, we need the quadratic coefficient

    c_q,    (5.6)

a second-difference measure of the curvature of T(P) at P^0 along the direction of the empirical influence vector t̂; the exact definition is given by DiCiccio and Efron (1992). Computationally the ABC algorithm is only slightly more ambitious than a delta method analysis of standard error and bias for θ̂. It gives considerably more information, though, in the form of second-order accurate approximate confidence limits for θ. The definitions of a, z_0, and c_q were motivated and explained by DiCiccio and Efron (1992). The Appendix presents a nonparametric version of the ABC algorithm, written in the language S of Becker, Chambers, and Wilks (1988).

The ABC endpoints require a total of 2n + 2 + k recomputations of T(P), with k being the number of endpoints desired. This amounts to 54 recomputations in rows 2 or 4 of Table 2, compared to some 2,000 recomputations for BCa. The number can be further reduced by grouping the data points o_i, say into pairs.

The ABC algorithm requires that the statistic of interest be expressed in the resampling form θ̂* = T(P*). In the maximum eigenvalue example, calculations (1.2)-(1.5) are carried through with weight P*_i on o_i, rather than weight 1/n. We minimize Σ_i Σ_j P*_i [o_ij - (μ + α_i + β_j)]² rather than (1.2), impute x̂*_ij = μ̂* + α̂*_i + β̂*_j for the missing elements of o*, and calculate the weighted covariance matrix

    Σ̂* = Σ_{i=1}^{n} P*_i (x̂*_i - μ̂*)(x̂*_i - μ̂*)'    (μ̂* = Σ_{i=1}^{n} P*_i x̂*_i),    (5.7)

rather than (1.4); then θ̂* = T(P*) is the maximum eigenvalue of Σ̂*. Usually the form of T(P*) is obvious. Doubtful cases can be resolved by remembering that when nP* is a vector of integers, say (N*_1, N*_2, ..., N*_n), then T(P*) is the value of θ̂ applying to a sample of N*_1 copies of o_1, N*_2 copies of o_2, and so on. Remark I concerns T(P*) for θ̂ the normal-theory MLE of θ.

It turns out to be easy to compute π_†(θ|x) if θ_x(α) is the endpoint of the ABC interval (Efron 1993). For a given complete data set x, let (θ̂, σ̂, a, z_0, c_q) be the five numbers required for the ABC endpoints. These numbers are calculated by the program abcnon in the Appendix.
It turns out to be easy to compute $\pi^\dagger(\theta \mid \mathbf{x})$ if $\hat\theta_{\mathbf{x}}(\alpha)$ is the endpoint of the ABC interval (Efron 1993). For a given complete data set $\mathbf{x}$, let $(\hat\theta, \hat\sigma, a, z_0, c_q)$ be the five numbers required for the ABC endpoints. These numbers are calculated by the program abcnon in the Appendix. Let $\lambda$ and $w$ be defined as functions of $\theta$ in the following way:

$$\lambda = \frac{2(\theta - \hat\theta)/\hat\sigma}{1 + \big[1 + 4c_q(\theta - \hat\theta)/\hat\sigma\big]^{1/2}}, \qquad w = \frac{2\lambda}{(1 + 2a\lambda) + (1 + 4a\lambda)^{1/2}}. \qquad (5.8)$$

Then

$$\pi^\dagger(\theta \mid \mathbf{x}) = \varphi(w - z_0)\,\frac{dw}{d\theta} \qquad (5.9)$$

(see Remark G and Sec. 3).

Figure 5 was constructed by applying (4.3), (5.9) to the maximum eigenvalue problem of the Introduction. $M = 50$ multiple imputations $\mathbf{x}^{(m)}$ were constructed using the approximate Bayesian bootstrap (3.7). For each one, a nonparametric bootstrap sample $\mathbf{o}^{*(m)}$ gave $\hat\mu^{*(m)}$ and $\hat\Sigma^{*(m)}$ as in (1.7), (1.8); then the missing components in the original data set $\mathbf{o}$ of Table 1 were filled in by sampling from the conditional normal distribution with expectation vector $\hat\mu^{*(m)}$ and covariance matrix $\hat\Sigma^{*(m)}$, say

$$x_i^{(m)} \sim N\big(\hat\mu^{*(m)}, \hat\Sigma^{*(m)}\big) \ \text{conditioned on the observed components of } o_i, \quad \text{independently for } i = 1, 2, \ldots, 22. \qquad (5.10)$$

Poor man's data augmentation (3.6) gave results similar to (5.10) in this example.
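The sampling step in (5.10) uses only the standard conditional-normal formulas. A minimal S sketch, with illustrative names (impute.row, o.row) not taken from the paper, is:

  # Sketch of the imputation step (5.10): draw the missing entries of one row
  # from the conditional normal given its observed entries, under N(mu, Sigma).
  impute.row <- function(o.row, mu, Sigma) {
    miss <- is.na(o.row)
    if (!any(miss)) return(o.row)
    obs <- !miss
    # conditional mean and covariance of the missing block given the observed block
    S.mo <- Sigma[miss, obs, drop = FALSE]
    S.oo <- Sigma[obs, obs, drop = FALSE]
    cond.mean <- mu[miss] + S.mo %*% solve(S.oo, o.row[obs] - mu[obs])
    cond.var  <- Sigma[miss, miss, drop = FALSE] - S.mo %*% solve(S.oo, t(S.mo))
    # one normal draw for the missing entries
    R <- chol(cond.var)
    o.row[miss] <- cond.mean + t(R) %*% rnorm(sum(miss))
    o.row
  }

Applying such a function to each of the 22 rows, with (mu, Sigma) set to $(\hat\mu^{*(m)}, \hat\Sigma^{*(m)})$, yields one imputed data set $\mathbf{x}^{(m)}$.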
Each imputed data set $\mathbf{x}^{(m)}$ gave the five ABC numbers $(\hat\theta^{(m)}, \hat\sigma^{(m)}, a^{(m)}, z_0^{(m)}, c_q^{(m)})$ obtained from the program abcnon in the Appendix, and these gave the confidence density $\pi^\dagger(\theta \mid \mathbf{x}^{(m)})$ based on $\mathbf{x}^{(m)}$, (5.6). The appropriate resampling function $T(\mathbf{P}^*)$ for the $m$th case, called "tt(P)" in the program, is defined as follows. Let $\hat\Sigma^*$ be the weighted covariance matrix

$$\hat\Sigma^* = \sum_{i=1}^{n} P_i^*\,\big(x_i^{(m)} - \hat\mu^*\big)\big(x_i^{(m)} - \hat\mu^*\big)' \qquad \Big(\hat\mu^* = \sum_{i=1}^{n} P_i^*\,x_i^{(m)}\Big); \qquad (5.11)$$

then $\hat\theta^* = T(\mathbf{P}^*)$ is the maximum eigenvalue of $\hat\Sigma^*$.
The curve $\pi^\dagger_{\mathrm{nonpar}}(\theta \mid \mathbf{o})$ in Figure 5 was obtained by applying abcnon directly to the nonparametric bootstrap. The function $\hat\theta^* = T(\mathbf{P}^*)$ was now defined to be the maximum eigenvalue of $\hat\Sigma^*$ in (5.7). The five numbers $(\hat\theta, \hat\sigma, a, z_0, c_q)$ were obtained from abcnon, giving $\pi^\dagger_{\mathrm{nonpar}}(\theta \mid \mathbf{o})$ from (5.8), (5.9). Notice that in (5.7) the vectors $\hat x_i^*$ vary as functions of $\mathbf{P}^*$, whereas in (5.11) the $x_i^{(m)}$ do not.

Table 2 shows that the multiple imputation intervals are somewhat too short in the upper direction. The multiple imputation standard error estimate (Rubin and Schenker 1986), which does not involve (5.9), is similarly small:

$$\hat\sigma_{\mathrm{mult}} = [35{,}150 + 2{,}897]^{1/2} = 195.1, \qquad (5.12)$$

compared to the direct delta method estimate $\hat\sigma = 220.0$, obtained by applying (5.3) to $T(\mathbf{P}^*)$ defined from (5.7).
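The Rubin-Schenker estimate combines the average complete-data variance with the between-imputation variance inflated by $(1 + 1/M)$. A minimal S sketch of the combining rule, with illustrative names, is given below; identifying the two components of (5.12) with these terms is an assumption of this illustration (the average complete-data standard error of about 187.5 noted below satisfies $187.5^2 \approx 35{,}156$, consistent with the first component).

  # Sketch of the multiple imputation combining rule behind (5.12).
  # theta.m are the M complete-data point estimates and sig.m the
  # complete-data standard errors; names are illustrative.
  mi.se <- function(theta.m, sig.m) {
    M <- length(theta.m)
    within  <- mean(sig.m^2)            # average complete-data variance
    between <- var(theta.m)             # variance of the M point estimates
    sqrt(within + (1 + 1/M)*between)    # combined standard error
  }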
There is no gold standard by which to judge Table 2, but the multiple imputation intervals are even 10% shorter than the intervals based on complete data for the 22 students (Efron 1992a, table 1). (The complete data intervals are about 5% shorter than those in row 1 or 2 of Table 2 here, which is consistent with having 10% more data.)

A possible source of the difficulty is the normal-theory imputation (5.10). The imputed data sets $\mathbf{x}^{(m)}$ were centered away from $\hat{\mathbf{x}}$ in a region where the complete-data delta method standard errors were noticeably smaller; $\hat\sigma(\mathbf{x}^{(m)})$ averaged 187.5, compared to either $\hat\sigma(\hat{\mathbf{x}}) = 219.8$ or $\hat\sigma(\mathbf{x}) = 214.4$ for the actual complete data set $\mathbf{x}$.

The twenty-second student in Table 1 has by far the greatest influence on the maximum eigenvalue estimate $\hat\theta = 633.2$. His empirical influence was $t_{22} = 4{,}041.8$, compared to the next-largest values $t_{21} = 1{,}435.5$, $t_2 = 819.6$, $t_5 = -623.8, \ldots$. The two missing values in $o_{22}$ were imputed in a noticeably different way by (5.10) as compared to (1.3), averaging (29.92, 18.44) over the 50 normal-theory imputations compared to the best-fit imputation (9.87, 14.66), seen from $\hat{\mathbf{x}}$ on the right side of Table 1.

Figure 5 was recomputed after changing $x_{22}^{(m)}$ to $\hat x_{22}$ in all 50 imputations. Now $\pi^\dagger(\theta \mid \mathbf{o})$ was in better agreement with $\pi^\dagger_{\mathrm{nonpar}}(\theta \mid \mathbf{o})$; in fact, it was slightly longer-tailed to the right, and $\hat\sigma(\mathbf{x}^{(m)})$ now averaged 226.7.

Remark G. Formula (5.9) is actually the confidence density applying to the simpler approximate confidence interval method called ABC$_q$, by DiCiccio and Efron (1992a). It is not much more difficult to compute the genuine ABC confidence density (Efron 1992, eq. 7.3), but the difference was not important here.

Remark H. It is possible to write down a K-sample version of the ABC algorithm, but the one-sample program given in the Appendix also handles the K-sample case, in the following way. The K-sample bootstrap replication $\hat\theta^* = t(\hat G_1^*, \hat G_2^*, \ldots, \hat G_K^*)$ has a resampling representation

$$\hat\theta^* = T(\mathbf{P}_1^*, \mathbf{P}_2^*, \ldots, \mathbf{P}_K^*), \qquad (5.15)$$

where $\mathbf{P}_k^*$ is the $k$th resampling vector, $P_{ki}^* = \#\{o_{kj}^* = o_{ki}\}/n_k$, as in (5.1). Let $\mathbf{P}^*$ be a single long resampling vector of length $\sum_{k=1}^{K} n_k$,

$$\mathbf{P}^* = \big(P_{11}^*, \ldots, P_{1n_1}^*,\, P_{21}^*, \ldots, P_{2n_2}^*,\, \ldots,\, P_{K1}^*, \ldots, P_{Kn_K}^*\big), \qquad (5.16)$$

and define the one-sample statistic

$$S(\mathbf{P}^*) = T\big(\tilde{\mathbf{P}}_1^*, \ldots, \tilde{\mathbf{P}}_K^*\big), \quad \text{where } \tilde{\mathbf{P}}_k^* = \big(P_{k1}^*, P_{k2}^*, \ldots, P_{kn_k}^*\big)\Big/\sum_{j=1}^{n_k} P_{kj}^*. \qquad (5.17)$$

Then it can be shown that applying the one-sample abc algorithm of the Appendix to $S(\mathbf{P}^*)$ gives exactly the same confidence intervals as applying the appropriate K-sample abc program to (5.15). The acceleration $a$ required for the BC$_a$ intervals is obtained by applying (5.4), (5.5) to $S(\mathbf{P}^*)$.
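A minimal S sketch of this reduction, with illustrative names (make.S, T.ksamp, nk) that are not from the paper, shows how a K-sample statistic can be wrapped as the one-sample statistic $S(\mathbf{P}^*)$ of (5.17):

  # Sketch of Remark H: wrap a K-sample statistic T(P1, ..., PK) as a
  # function of one long weight vector P*.  nk is the vector of sample
  # sizes (n_1, ..., n_K); T.ksamp takes a list of K normalized weight vectors.
  make.S <- function(T.ksamp, nk) {
    ends <- cumsum(nk)
    starts <- ends - nk + 1
    function(Pstar) {
      Plist <- vector("list", length(nk))
      for (k in seq(along = nk)) {
        Pk <- Pstar[starts[k]:ends[k]]
        Plist[[k]] <- Pk/sum(Pk)        # renormalize within sample k, as in (5.17)
      }
      T.ksamp(Plist)
    }
  }

The resulting function can be handed to abcnon as tt, with $\mathbf{P}^*$ of length $\sum_k n_k$.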
6. SUMMARY

Three bootstrap methods for missing-data problems have been presented: nonparametric, full mechanism, and multiple imputation. Here is a brief summary of their advantages and drawbacks.

Nonparametric Bootstrap. This is the easiest of the three methods to apply, both conceptually and, if the ABC algorithm is used, computationally. It requires no knowledge of the concealment mechanism leading to the observed pattern of missing data. The method applies just as well to ad hoc estimators like (1.2)-(1.5) and to MLE's. This can be convenient and efficient, as in the maximum eigenvalue problem, but opens the possibility of definitional bias. In missing-data problems this approach is limited to nonparametric settings, there being no obvious parametric analogue of Figure 2. The method is also limited to simple random sampling situations, or multisample situations as in Remark B. The BC$_a$ or ABC confidence intervals obtained from the nonparametric bootstrap are second-order accurate. If only a standard error is required, a bootstrap estimate can be obtained from just a couple hundred bootstrap replications.
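A minimal S sketch of such a standard error calculation follows; the placeholder impute.and.estimate stands for whatever imputation-plus-estimation rule defines $\hat\theta$, for example (1.2)-(1.5) in the maximum eigenvalue problem, and is an illustrative name rather than a function defined in the paper.

  # Sketch: simple nonparametric bootstrap standard error for a missing-data
  # estimator.  Each bootstrap sample resamples the incomplete rows of o,
  # then re-imputes and re-estimates.
  boot.se <- function(o, impute.and.estimate, B = 200) {
    n <- nrow(o)
    theta.star <- numeric(B)
    for (b in 1:B) {
      ostar <- o[sample(n, n, replace = TRUE), , drop = FALSE]
      theta.star[b] <- impute.and.estimate(ostar)
    }
    sqrt(var(theta.star))
  }

A couple hundred replications suffice for a standard error; BC$_a$ or ABC intervals require either the abcnon route or on the order of 2,000 replications.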
tained from just a couple hundred bootstrap replications. w <- z0 + qnorm(alpha)
Full-Mechanism Bootstrap. This is the approach that ~~~~-<~~~s:,,at*~w~~~rm(
alpha)
most closely resembles bootstrap methods for problems abc <- seq(a1pha)
for ( i in seq (alpha)) abc [ i 1 <- tt (PO + lambda[ 11 delta)
without missing data. It can be applied to parametric or non-
Downloaded by [New York University] at 18:56 21 October 2014

limsc- cbind(a1pha. abc, stan)


parametric problems and to data situations more compli- XOutput in list form.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
cated than simple random sampling. It avoids the problem list ( lims=lims. stats=c (to,sighat,bhat ) . cons= ( c (a.20.cq) 1 , t . =t . )

of definitional bias and can even be used to assess the defi- [Received July 1992. Received June 1993.1
nitional bias in estimators like (1.2)-( 1.5). There is no
equivalent of the ABC algorithm for reducing the compu- REFERENCES
tational burden. Nor is there a simple formula like (5.5) for
the constant a used in the BC, method (though using a based Becker, R. A,, Chambers, J. M., and Wilks, A. R. (1988), The N e w s Lan-
on ( 5 . 5 ) seems to give reasonable results). guage, Pacific Grove, CA: Wadsworth and Brooks/Cole.
J. o.,and Bemardo, J. M. (1991), ‘‘On the Development of the
The full-mechanism bootstrap requires knowledge ofthe Berger, Reference Prior Method,” Proceedings of the 4th Valencia International
concealment mechanism x --* o in Figure 3. But it is some- Meeting on ~~~~~i~~statistics,
times of considerable interest, and even necessary to model Buck, S . F. (1960), “A Method of Estimation in Missing Values in Multi-
the concealment mechanism (see Rubin 1987, chap. 61, in variate Data Suitable for Use With an Electronic Computer,” Journal of
the Royal Statistical Society, Ser. B, 22, 302-306.
which case this is a less serious disadvantage. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Like-
~ ~ Imputation
l ~ i Bootstrap.
~ l ~ basic data augmen- lihood Estimation From Incomplete Data Via the EM Algorithm” (with
discussion), Journal ofthe Royal Statistical Society, Ser. B, 39, 1-38.
tation identity (3.1) is ideal for handling missing data Prob- DiCiccio, T. J., and Efron, B. (1992), “More Accurate Confidence Intervals
lems for which there is a genuine Bayes prior. Its application in Exponential Families,” Biometrika, 79.
to confidence intervals by of confidence densities Efron, B. (1 98 la), “Censored Data and the Boostrap,” Journal of the Amer-
ican Statistical Association, 76, 3 12-3 19.
(4.3), (5.9) is computationally straightforward once the (1981b), “Nonparametric Estimates of Standard Error: The Jack-
problem of Sampling from the predictive density f(x 10) iS knife, the Bootstrap, and Other Resampling Methods,” Biometrika, 68,
solved. Here we require knowing the conditional density of 589-599.
(1982), “The Jackknife, the Bootstrap, and Other Resampling Plans,”
x given 0 , but not of o given x as with the full-mechanism SIAM cBMs-NsF Monograph, 38.
bootstrap. Sampling methods like (3.6) or (3.7) are reasonable -( I 987), “Better Bootstrap Confidence Intervals and Bootstrap A p
surrogates for the predictive density. However the maximum proximations,” Journal ofthe American Statistical Association, 82, 17 I-
185.
eigenvalue example suggests that (4.3), (5.9) may be uncom- ( 1992), “Jackknife-After-BootstrapStandard Errors and Influence
fortably vulnerable to failures in the Parametric assumptions. Functions,” Journal ofthe Royal Statistical Society, Ser. B, 54, 83-127.
The multiple imputation bootstrap can be applied to -(1993), “Bayes and Likelihood Calculations From Confidence In-
parametric problems and to arbitrarily complicated data tervals,”Biometrika, 80, 3-26.
(1992b), “Six Questions Raised by the Bootstrap,” in Bootstrap
structures. Each multiple imputation x ( ~uses ) exactly the proceedings Volume, ed. R. hpage, New York: John Wiley.
same set of observed data, with only the imputed numbers Efron, B., and Stein, C. (1981), “The Jackknife Estimate of Variance,” The
Annals ofstatistics, 9, 586-596.
varying’ so that the are better conditioned On O ’ The Efron, B., and Tibshirani, R. J. (1986), “Bootstrap Methods for Standard
method fits in well with the EM algorithm, which is often ~rrors,Confidence Intervals, and Other Measures of Statistical Accuracy,”
used to find MLE’s in missing data situations. Results like Statistical science, I , 54-77.
~i~~~~5 give an assessment of how much the missing data Fay, R. E. (19861, “Causal Models for Patterns of Nonresponse,” Journal
of the American Statistical Association, 8 1, 354-365.
is affecting our answer. Asymptotic Properties ofthe Hafiigan, 0.F., and Little, R. J. A. (1991), “Multiple Imputation for the
imputation bootstrap, like second-order accuracy, have not Fatal Accident Reporting System,” Applied Statistics, 40, 13-29.
yet been investigated. Laird, N., and Lewis, T. A. (1987), “Empirical Bayes Confidence Intervals
Based on Bootstrap Samples” (with discussion), Journal ofthe American
APPENDIX: A NONPARAMETRIC ABC PROGRAM Statistical Association, 82, 739-757.
Lazzeroni, L. C., Schenker, N., and Taylor, J. M. G. (1990), “Robustness
The program abcnon, Written in the language s of kxker et al. of Multiple-lmputation Techniques to Model Misspecification,” in Pro-
(1988), evaluates the ABC intervals described in Section 3; t t ( P ) ceedings of the Survey Research Methods Section, American Statistical
is the resampling function T(P* ). Association.
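As an illustration of how abcnon might be called (not part of the original listing), suppose xhat is a completed data matrix and tt.maxeig is the weighted-covariance sketch given in Section 5; the toy matrix below is only a stand-in for the completed score data.

  xhat <- matrix(rnorm(22*5), 22, 5)      # stand-in for a completed 22 x 5 data matrix
  tt <- function(P) tt.maxeig(P, xhat)    # resampling form with xhat held fixed, as in (5.11)
  out <- abcnon(xhat, tt)
  out$lims                                # columns: alpha, ABC endpoint, standard endpoint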
[Received July 1992. Revised June 1993.]

REFERENCES

Becker, R. A., Chambers, J. M., and Wilks, A. R. (1988), The New S Language, Pacific Grove, CA: Wadsworth and Brooks/Cole.
Berger, J. O., and Bernardo, J. M. (1991), "On the Development of the Reference Prior Method," Proceedings of the 4th Valencia International Meeting on Bayesian Statistics.
Buck, S. F. (1960), "A Method of Estimation of Missing Values in Multivariate Data Suitable for Use With an Electronic Computer," Journal of the Royal Statistical Society, Ser. B, 22, 302-306.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood Estimation From Incomplete Data Via the EM Algorithm" (with discussion), Journal of the Royal Statistical Society, Ser. B, 39, 1-38.
DiCiccio, T. J., and Efron, B. (1992), "More Accurate Confidence Intervals in Exponential Families," Biometrika, 79.
Efron, B. (1981a), "Censored Data and the Bootstrap," Journal of the American Statistical Association, 76, 312-319.
Efron, B. (1981b), "Nonparametric Estimates of Standard Error: The Jackknife, the Bootstrap, and Other Resampling Methods," Biometrika, 68, 589-599.
Efron, B. (1982), "The Jackknife, the Bootstrap, and Other Resampling Plans," SIAM CBMS-NSF Monograph, 38.
Efron, B. (1987), "Better Bootstrap Confidence Intervals and Bootstrap Approximations," Journal of the American Statistical Association, 82, 171-185.
Efron, B. (1992a), "Jackknife-After-Bootstrap Standard Errors and Influence Functions," Journal of the Royal Statistical Society, Ser. B, 54, 83-127.
Efron, B. (1992b), "Six Questions Raised by the Bootstrap," in Bootstrap Proceedings Volume, ed. R. LePage, New York: John Wiley.
Efron, B. (1993), "Bayes and Likelihood Calculations From Confidence Intervals," Biometrika, 80, 3-26.
Efron, B., and Stein, C. (1981), "The Jackknife Estimate of Variance," The Annals of Statistics, 9, 586-596.
Efron, B., and Tibshirani, R. J. (1986), "Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy," Statistical Science, 1, 54-77.
Fay, R. E. (1986), "Causal Models for Patterns of Nonresponse," Journal of the American Statistical Association, 81, 354-365.
Heitjan, D. F., and Little, R. J. A. (1991), "Multiple Imputation for the Fatal Accident Reporting System," Applied Statistics, 40, 13-29.
Laird, N. M., and Louis, T. A. (1987), "Empirical Bayes Confidence Intervals Based on Bootstrap Samples" (with discussion), Journal of the American Statistical Association, 82, 739-757.
Lazzeroni, L. C., Schenker, N., and Taylor, J. M. G. (1990), "Robustness of Multiple-Imputation Techniques to Model Misspecification," in Proceedings of the Survey Research Methods Section, American Statistical Association.
Little, R. J. A. (1983), "The Ignorable Case," in Incomplete Data in Sample Surveys, Vol. 2, Part VI, New York: Academic Press, pp. 341-382.
Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: John Wiley.
Louis, T. A. (1982), "Finding the Observed Information Matrix When Using the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 44, 226-233.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979), Multivariate Analysis, New York: Academic Press.
Rubin, D. B. (1976), "Inference and Missing Data," Biometrika, 63, 581-592.
Rubin, D. B. (1978), "Multiple Imputations in Sample Surveys - A Phenomenological Bayesian Approach to Nonresponse," Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 20-34.
Rubin, D. B. (1981), "The Bayesian Bootstrap," The Annals of Statistics, 9, 130-134.
Rubin, D. B. (1987), Multiple Imputation for Nonresponse in Surveys, New York: John Wiley.
Rubin, D. B., and Schenker, N. (1986), "Multiple Imputation for Interval Estimation From Simple Random Samples With Ignorable Nonresponse," Journal of the American Statistical Association, 81, 366-374.
Tanner, M. A. (1991), Tools for Statistical Inference - Observed Data and Data Augmentation Schemes, New York: Springer-Verlag.
Tanner, M. A., and Wong, W. H. (1987), "The Calculation of Posterior Distributions by Data Augmentation," Journal of the American Statistical Association, 82, 805-811.
Tibshirani, R. (1989), "Noninformative Priors for One Parameter of Many," Biometrika, 76, 604-608.
Wei, G. C. G., and Tanner, M. A. (1990), "A Monte Carlo Implementation of the EM Algorithm and the Poor Man's Data Augmentation Algorithms," Journal of the American Statistical Association, 85, 699-704.

COMMENT
Donald B. RUBIN*

Efron’s article is an interesting addition to the existing 1. CONFIDENCE VALIDITY WITHOUT


literature attempting to handle missing-data problems A WELL-DEFINED POPULATION ESTIMAND?
through frequentist resampling techniques. Other recent
Efron seems to depart from long-standing statistical tra-
contributions include those of Rao and Shao (1992) on jack-
dition, dating back to Neyman’s (1934) introduction of con-
knife methods following single hot-deck imputation, which
fidence intervals, by being willing to ascribe confidence va-
has been a survey practitioner’s tool for bootstrap imputation
lidity to an interval estimate without requiring that the
for the better part of this century, and Fay (1993) on variants
interval cover a well-defined population quantity (i.e., the
of such techniques, based on single and multiple imputation.
estimand) over repeated samples with probability at least as
Despite the fact that it is a pleasure to have Efron’s excep-
great as the stated confidence coefficient. That is, he seems
tional technical adroitness and creativity brought to bear on
willing to find a consistent estimate of standard error for a
the problem of missing data, several points of contention
statistic that is not consistent for a well-defined estimand
are raised by his article. Specifically, we have at least the
and claim confidence validity for the resulting random in-
following questions to consider:
terval.
1. Can we have confidence validity for an interval estimate In particular, consider Efron’s motivating example with
without a well-defined estimand? 5 variables, 22 students, and 22 missing values, where 6’
2. Can we claim distribution-free validity for a procedure is the maximum eigenvalue of the population variance-
whose operating characteristics vary critically with the covariance matrix and 8 is the maximum eigenvalue of the
underlying distributional assumptions? sample variance-covariance matrix of the data set with its
3. Within the missing-data context, should the responsi- missing values imputed by some method. Efron can be read
bilities and capabilities of the imputer and the ultimate as implying that confidence intervals based on the nonpara-
user be assumed to be the same? metric bootstrap are valid whether or not the imputation
4. Is it acceptable to have strong hidden assumptions when method tracks the actual mechanism that created the missing
creating imputations? data:
5. Was multiple imputation an outgrowth of EM or data More elaborate bootstraps are available, as will be discussed, but
augmentation? the simple method has much to recommend it. It is nonpara-
metric, applicable to any kind of imputation procedure, and re-
I believe that the answer to each of these questions is “no,” quires no knowledge of the missing-data mechanism. Its main
whereas it seems that Efron’s answers would all be “yes,” practical disadvantage is the computational expense of the 2,000
or so bootstrap replications required for reasonable numerical
and so he and I should be able to provide a lively exchange accuracy.. . .
of views for JASA’s readers.

* Donald B. Rubin is Professor and Chairman, Department of Statistics, Harvard University, Cambridge, MA 02138.
