Efficient Estimation from Right-Censored Data when
Failure Indicators are Missing at Random
Mark J. van der Laan
University of California, Berkeley
Ian W. McKeague
Florida State University
May 15, 2001
Abstract
The Kaplan-Meier estimator of a survival function is well known to be asymptotically efficient when cause of failure is always observed. It has been an open
problem, however, to find an efficient estimator when failure indicators are missing
at random. Lo (1991) showed that nonparametric maximum likelihood estimators
are inconsistent, and this has led to several proposals of ad hoc estimators, none of
which are efficient. We now introduce a sieved-nonparametric maximum likelihood
estimator, and show that it is efficient. Our approach is related to the estimation
of a bivariate survival function from bivariate right-censored data.
1 Introduction
Suppose that we wish to estimate a survival distribution based on right-censored data.
When cause of failure is always observed, the method of nonparametric maximum likelihood leads to the well-studied Kaplan-Meier (1958) estimator, which has many desirable
properties, including asymptotic efficiency (Wellner, 1982). In this paper we address the
problem of finding an asymptotically efficient estimator when cause-of-failure information
is missing for some individuals.
Cause-of-failure information can be missing for a number of reasons. For example, in
epidemiological studies relevant death certificate information can be missing, or autopsy
results and hospital case notes can be inconclusive. In such cases it is not possible to
determine whether mortality is due to the cause of interest or due to extraneous causes.
In a study of the reporting of motorcycle injury fatalities occurring in Connecticut in
1987, Lapidus et al. (1994) found that 40% of death certificates were missing some or
AMS 1991 subject classifications. Primary: 62G07; secondary: 62F12.
Key words and phrases. Kaplan-Meier estimator, incomplete data, self-consistency, bivariate censorship, influence curve.
all of the required information. A study of mortality patterns among young people in
the Netherlands (Bijlsma, 1994) found that 9% of cases had "ill-defined symptoms, signs
and conditions," and over 90% of those were registered as cause unknown, mainly due
to missing death certificates of people who had died abroad.
Let T be the survival time of interest, let C be a censoring time which is independent
of T, and let ξ be a Bernoulli random variable which is allowed to depend on (T, C) in
a way to be specified in a moment. Let X = T ∧ C and δ = I(X = T). If (X, δ) is
always observed, i.e., we have classical right-censored data on T, then we can estimate the
survival function of T using the Kaplan-Meier estimator. In the problem studied here,
however, the failure indicator δ is missing if ξ happens to be 0. That is, we have n i.i.d.
observations on Y = (X, ξδ, ξ).
Our goal is to efficiently estimate the survival function S_T of T under the assumption

    P(ξ = 1 | X, δ) = P(ξ = 1 | X),    (1)

i.e., that ξ and δ are conditionally independent given X. This assumption places our problem
in the framework of missing at random, introduced by Rubin (1976), or, more generally,
coarsening at random (CAR); see Heitjan and Rubin (1991), Jacobsen and Keiding
(1995), and Gill, van der Laan and Robins (1997). Coarsening is a sampling mechanism in
which instead of observing a random quantity of interest one is only able to observe that
it takes a value in some possibly randomly determined set of values. CAR isolates those
situations in which the coarsening mechanism can be ignored when making inferences.
Our approach is to find the nonparametric maximum likelihood estimator (NPMLE)
of S_T based on reduced data produced by a discretization of X. In this way we repair
the usual NPMLE, which is inconsistent for estimating S_T. The proposed estimator is
found by noticing that our problem can be considered as a special case of nonparametric
estimation of a bivariate distribution from bivariate right-censored data. Indeed, the
coarsening mechanism acting on (X, δ) amounts to right-censorship of δ: observation of
Y is equivalent to observation of (X, δ̃), where δ̃ = ξ(2δ − 1). We are then able
to use van der Laan's (1996b) efficient sieved-NPMLE of a bivariate distribution function
to estimate the distribution F of (X, δ). This estimator reduces to a simple and explicit
form F_n in our case. Finally, using the fact that S_T is a simple functional Φ of F, we
construct the proposed estimator Ŝ_T = Φ(F_n).
Many authors have studied our problem under the stronger assumption that ξ and
(X, δ) are completely independent, i.e., that P(ξ = 1 | X, δ) does not depend on (X, δ).
The failure indicators are then said to be missing completely at random (MCAR), cf.
Little and Rubin (1987). MCAR can be checked from observation of (X, ξδ, ξ) given that
CAR is in effect, and it allows the use of relatively simple estimators. For example,
the survival distribution can be consistently estimated under MCAR by simply ignoring
the missing data (cases with ξ = 0) and applying the Kaplan-Meier estimator to the
complete data. However, this complete-case estimator is highly inefficient if there is a
significant degree of missingness. The first attempt to improve upon the complete-case
estimator was made by Dinse (1982), who used the EM algorithm to obtain an NPMLE. Lo
(1991) showed that there are infinitely many NPMLEs and some of them are inconsistent.
He constructed two alternative estimators, one of which is consistent and asymptotically
normal. Gijbels, Lin and Ying (1993) and McKeague and Subramanian (1996) have
proposed further improvements.
The stronger assumption of MCAR becomes a drawback when seeking a fully efficient
estimator. For an estimator to be efficient, it must not be based on any model assumptions
beyond minimal CAR, because the function π(x) = P(ξ = 1 | X = x), which specifies
the coarsening mechanism under CAR, factors out of the likelihood (i.e., the likelihood
factors into a part which depends only on the distribution functions of T and C, and
a part which depends only on π) and can be ignored as far as efficiency is concerned.
Hence π(x) should be completely unspecified, and, from (2), efficient estimation of F or
S_T must involve nonparametric estimation of π(x), at least implicitly. This puts us in
the setting of an ill-posed inverse problem (cf. O'Sullivan, 1986), so that some kind of
regularization procedure (e.g., kernel smoothing, method of sieves, etc.) is needed; as in
density estimation or nonparametric regression, direct NPMLE is not successful.
Although the CAR assumption itself is fairly strong, it is the minimal condition on
the coarsening mechanism under which the survival distribution is identifiable from observations on (X, ξδ, ξ). Indeed, the independence between the survival and censoring
mechanisms ensures that the distribution of T is identifiable (see (3) and (4)) from the
distribution of (X, δ), which can be expressed as

    F(dx, δ) ≡ P(X ∈ dx, δ) = P(ξ = 1, X ∈ dx, δ) / P(ξ = 1 | X ∈ dx, δ),    δ ∈ {0, 1}.    (2)

In general, the numerator in (2) is identifiable but the denominator is not,
because δ is unobserved unless ξ = 1. The CAR assumption is precisely what is needed
to make the denominator identifiable, and it implies that F will be identifiable if π(x) is
bounded away from zero.
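The role of CAR in (2) can be seen in a toy computation. The sketch below (plain Python; the numbers and names are ours, purely hypothetical) builds a joint law for (X, δ, ξ) satisfying CAR and checks that the ratio in (2) recovers F(x, δ):

```python
# Hypothetical discrete law F of (X, delta) on support {1, 2},
# with a CAR missingness probability pi(x) = P(xi = 1 | X = x).
F = {(1, 0): 0.1, (1, 1): 0.3, (2, 0): 0.4, (2, 1): 0.2}
pi = {1: 0.8, 2: 0.5}

# Joint law of (X, delta, xi) induced by the CAR mechanism.
joint = {(x, d, s): F[(x, d)] * (pi[x] if s == 1 else 1.0 - pi[x])
         for (x, d) in F for s in (0, 1)}

for (x, d), f in F.items():
    numer = joint[(x, d, 1)]                       # P(xi=1, X=x, delta=d)
    p_x = sum(joint[(x, dd, s)] for dd in (0, 1) for s in (0, 1))
    denom = sum(joint[(x, dd, 1)] for dd in (0, 1)) / p_x  # P(xi=1 | X=x)
    assert abs(numer / denom - f) < 1e-12          # identity (2) holds
```

Because π depends on x only (and not on δ), the conditional probability in the denominator is the same for δ = 0 and δ = 1, which is exactly what makes the ratio identifiable from the observed data.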
The CAR assumption can of course be violated in practice; e.g., in the motorcycle
injury fatalities example, the relevant death certificate information is more likely to be
missing when death is due to motorcycle injuries than when it is due to other (less
specific) causes. However, CAR cannot be checked from data on (X, ξδ, ξ) alone, cf. the
assumption of independence between T and C in the classical right-censored data model.
To judge whether CAR is in effect it would be necessary to have data on the coarsening
mechanism itself, that is, data on (X, δ) when ξ = 0. In the motorcycle example, such
data are available through police accident reports (Lapidus et al., 1994), and the survival
distribution could be identified through estimation of F. In the present paper we shall
restrict attention to the situation where only (X, ξδ, ξ) is observed and CAR is in effect.
The paper is organized as follows. The proposed estimator Ŝ_T is constructed in Section
2, and shown to be asymptotically efficient in Section 3. An alternative approach to the
problem, based on some general results of Robins and Rotnitzky (1992), is discussed
in Section 4. Some numerical results assessing the performance of Ŝ_T are presented in
Section 5.
2 The proposed estimator

2.1 Special case of bivariate right-censored data
By the well-known product-integral representation of the survival function S_T, on which
the Kaplan-Meier estimator is based, we have

    S_T(t) = ∏_{(0,t]} (1 − Λ_T(dx)),    (3)

where

    Λ_T(dx) = F(dx, 1) / S_X(x−).    (4)
Let D[0, τ] be the space of cadlag real-valued functions on [0, τ], endowed with
the supremum norm. The equations (3) and (4) define S_T as a mapping Φ : (D[0, τ])² →
D[0, τ] from (F(·, 1), F(·, 0)) to S_T:

    S_T(t) = Φ(F)(t).    (5)

Gill and Johansen (1990) (for the product-integral mapping) and Gill (1989) proved that Φ
is compactly differentiable in the sense required by the functional delta-method (see Gill,
1989). Hence if we construct an efficient estimator of the bivariate distribution F(x, δ),
then plugging this estimator into (5) provides us with an efficient estimator of S_T. Here
we use the result that a compactly differentiable functional of an efficient estimator is
efficient (van der Vaart, 1991).
Another well-known fact concerning the univariate right-censored data model is that
for any bivariate distribution F(dx, δ) there exist independent random variables T and C
such that (X = T ∧ C, δ = I(T < C)) ∼ F (see, e.g., Bickel, Klaassen, Ritov and Wellner,
1993). In other words, F is completely unspecified. Hence the problem is to estimate F
nonparametrically using the i.i.d. data on (X, ξδ, ξ).
If we define C1 = 2 if ξ = 1 and C1 = −1 if ξ = 0, then observing

    (X, ξδ, ξ) is equivalent to observing (X, δ ∧ C1, I(δ ∧ C1 = δ)).

In other words, δ is right-censored by the discrete random variable C1. This shows
that estimating S_T indeed comes down to estimating the bivariate distribution of (X, δ),
δ ∈ {0, 1}, where X is always uncensored, but δ is right-censored.
Estimation of a bivariate survival function based on bivariate right-censored data
is an extensively studied topic. The NPMLE for this problem is inconsistent due to
the fact that the lines induced by the singly-censored observations do not contain any
uncensored observations for continuous data. This lack of interaction with the uncensored
observations implies that the self-consistency equation (Efron, 1967) for the NPMLE has a
wide class of solutions. We refer to Pruitt (1991) and van der Laan (1996a,b) for discussion
on inconsistency of NPMLE in missing data models where the induced regions contain no
uncensored observations. The inconsistency of the NPMLE has led to many proposals of
4
ad hoc estimators, but these are not useful to us since they invariably require independence
of X and C1 (i.e., MCAR would be needed). Here we do not give a description of the
literature, but refer to the later developments in Bickel, Klaassen, Ritov and Wellner
(1993) or van der Laan (1996a,b).
In van der Laan (1996b) an asymptotically efficient estimator of the bivariate survival
function is proposed which is an NPMLE based on reduced data, in the sense that the
uncensored components of the singly censored observations are interval-censored by a
given grid partition. It was shown that for a fixed grid (one that does not depend on n) this
estimator is asymptotically efficient for the reduced data, and that if the width of the grid
converges to zero slowly enough with n, then the estimator is efficient. Moreover, this
estimator is suited to our problem since it is applicable under any CAR bivariate censoring
mechanism. It turns out that this (in general implicit) estimator simplifies to a very
simple form in our special case. Here we propose this estimator.
2.2 The reduced-data NPMLE of F

Let 0 = a_0 < a_1 < ... < a_k = τ be a partition of the interval [0, τ], and set a_{k+1} = ∞.
Define the discretized version X_d of X by:

    X_d ≡ { a_j  if X ∈ (a_j, a_{j+1}] and ξ = 0,
            X    if ξ = 1.

In other words, if δ is observed, then X is unchanged, but if δ is missing (ξ = 0), then X
is interval-censored in the sense that we only observe that X ∈ (a_j, a_{j+1}]. Our estimator
of F will be the NPMLE based on the reduced data (X_d, ξδ, ξ).
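The discretization is elementary to implement. A minimal sketch (plain Python; the function name and grid values are ours, for illustration only):

```python
import bisect

def discretize(x, xi, grid):
    """X_d of Section 2.2: keep X when delta is observed (xi = 1);
    otherwise report only the left endpoint a_j of the cell
    (a_j, a_{j+1}] containing X.  grid = [a_1, ..., a_k], a_0 = 0."""
    if xi == 1:
        return x
    edges = [0.0] + list(grid)
    j = bisect.bisect_left(edges, x) - 1   # a_j < x <= a_{j+1}
    return edges[j]

# Observed delta: X unchanged; missing delta: only the cell is kept.
assert discretize(0.37, 1, [0.25, 0.5, 0.75]) == 0.37
assert discretize(0.37, 0, [0.25, 0.5, 0.75]) == 0.25
```

Points beyond the last grid point a_k = τ fall in the final cell (a_k, ∞), matching the convention a_{k+1} = ∞ above.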
Let E(x) ≡ (a_j, a_{j+1}], where a_j is such that x ∈ (a_j, a_{j+1}]. Let R_j ≡ (a_j, a_{j+1}] ×
{0, 1} be the region for (X, δ) implied by an observation (X_d = a_j, ξ = 0) with a
missing failure indicator. As in van der Laan (1996b), we restrict the NPMLE of F to
be discrete, with pointmasses at all complete observations (X_i, δ_i) and at one (or more)
artificially chosen points in each R_j that contains no complete observations. Of course, if
the partition does not depend on n, then as n → ∞ all R_j contain complete observations
with probability tending to 1.
Let F_n be the NPMLE and denote its X-marginal distribution by F_{X,n}. Also, let
f_n(x, δ) = F_n({x}, δ) be the density of F_n with respect to the counting measure on the
above-mentioned support points. The self-consistency equation for f_n is

    f_n(x, δ) = E_{f_n} [ (1/n) Σ_{i=1}^n I(X_{d,i} = x, δ_i = δ) | reduced data ],    (6)

which can be written (cf. (7.4) in Efron, 1967) as

    f_n(x, δ) = (1/n) Σ_{i=1}^n I(X_i = x, δ_i = δ, ξ_i = 1)
                + (1/n) Σ_{i=1}^n I(X_i ∈ E(x), ξ_i = 0) f_n(x, δ) / F_{X,n}(E(x)).    (7)
The way to read a self-consistency equation is that each observation gets mass 1/n, which
has to be redistributed over its induced region for (X, δ) according to the estimate (based
on f_n itself) of the conditional distribution over this region. So a point (x, δ) gets mass 1/n
from each complete observation at (x, δ), and it gets mass 1/n times f_n(x, δ)/F_{X,n}(E(x))
from each incomplete observation with an implied region R_j containing (x, δ).
[Figure 1 here: five observations in a cell (a_j, a_{j+1}], with pointmasses 5/2n at the complete observations with δ = 0 and δ = 1.]
Figure 1: Pointmasses for the reduced data NPMLE of F .
At first sight it seems that (7) is not easily solvable. In this special case, however,
F_{X,n}(E(x)) is known, so we can obtain an explicit solution: incomplete observations with
different X_d do not interact, in the sense that their implied regions R_j are disjoint.
Hence the mass given to a region R_j is just 1/n times the number of observations with
X ∈ (a_j, a_{j+1}]; other observations cannot give any mass to R_j. Thus F_{X,n}(E(x)) equals
this fraction, where R_j = E(x) × {0, 1}.
Denote the marginal distribution of X by P_X. Let P_0 and P_1 be the sub-distributions
of (X, δ) with ξ = 0 and ξ = 1, respectively. The marginal sub-distributions of the X-components
of P_0 and P_1 are written P_{0,X} and P_{1,X}. A subscript n added to any of these
(sub-)distributions will indicate that we are referring to the empirical counterpart.
Note that P_{X,n}(E(x)) is the fraction of the X_i falling in E(x), so that P_{X,n} ∘ E is the empirical
distribution of the discretized X. Thus we have F_{X,n}(E(x)) = P_{X,n}(E(x)). Also, the first
term on the right-hand side of (7) is just P_{1,n}({x}, δ). For each (x, δ) corresponding to a
complete observation (X_i = x, δ_i = δ) we can explicitly solve for f_n(x, δ), which provides
us with:

    f_n(x, δ) = [P_{X,n}(E(x)) / P_{1,X,n}(E(x))] P_{1,n}({x}, δ).    (8)
The mass f_n(x, δ) at the artificially chosen points in those R_j that contain no complete
observations is determined only by F_{X,n}((a_j, a_{j+1}]) = P_{0,n}((a_j, a_{j+1}]); so incomplete
observations with X_d = a_j where R_j does not contain any complete observations can
redistribute their mass in an arbitrary manner over R_j. The latter fact is exactly the reason
why the interval censoring of the observations with missing failure indicators is essential
for estimation of F: regions implied by incomplete observations should contain complete
observations with probability tending to 1.
A simple example is helpful for understanding f_n. Figure 1 displays five observations
in an interval, three of which have missing failure indicators. The combined mass of these
three points (3/n) is redistributed to the two complete observations, each of which will have
a mass of 1/n + 3/2n = 5/2n. This agrees with the answer obtained from (8).
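The redistribution in Figure 1 can be reproduced directly from (8). A short sketch (plain Python; the data values are ours, hypothetical): five observations in a single grid cell, three with missing failure indicators, so each of the two complete observations ends up with mass 5/2n:

```python
# Five observations in one grid cell (a_j, a_{j+1}] = (0, 1]; n = 5.
# (X, delta, xi): xi = 1 means delta is observed, xi = 0 means missing.
obs = [(0.2, 1, 1), (0.4, None, 0), (0.5, None, 0),
       (0.7, 0, 1), (0.9, None, 0)]
n = len(obs)

in_cell = [o for o in obs if 0.0 < o[0] <= 1.0]
PX_n = len(in_cell) / n                            # P_{X,n}(E(x)) = 1
P1X_n = sum(1 for o in in_cell if o[2] == 1) / n   # P_{1,X,n}(E(x)) = 2/5

# (8): each complete observation carries mass
#   (P_{X,n}(E)/P_{1,X,n}(E)) * (1/n) = (1/(2/5)) * (1/5) = 1/2 = 5/(2n).
for x, d, xi in obs:
    if xi == 1:
        f = (PX_n / P1X_n) * (1 / n)
        assert abs(f - 5 / (2 * n)) < 1e-12
```

The two complete observations together absorb the total cell mass 5/n, exactly as in the figure.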
2.3 The estimator of the survival function of T

The estimator (8) provides us with an estimator of F(dx, 1) and of S_X(x) = P(X > x).
Hence substitution of (8) into (5) provides us with our proposal for estimating S_T(t):

    Ŝ_T(t) ≡ ∏_{(0,t]} (1 − F_n(dx, 1) / S_{X,n}(x−)),    (9)

where S_{X,n} is the survival function corresponding to F_{X,n}. Notice that if P(ξ = 1) = 1,
then Ŝ_T is just the Kaplan-Meier estimator, as it should be.
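For concreteness, the whole estimator fits in a few lines. The sketch below (plain Python; the function and variable names are ours, and it assumes every occupied grid cell contains at least one complete observation, so no artificial support points are needed) computes f_n by (8) and then the product-limit (9); with all ξ_i = 1 it reproduces the Kaplan-Meier estimator, as noted above:

```python
import bisect
from collections import defaultdict

def npmle_survival(data, grid, t):
    """Reduced-data NPMLE of S_T(t), eq. (9), from observations
    (x, xid, xi) with xid = xi*delta.  grid = [a_1, ..., a_k]; a_0 = 0.
    Assumes each occupied cell contains a complete observation."""
    n = len(data)
    edges = [0.0] + sorted(grid) + [float("inf")]
    cell = lambda x: bisect.bisect_left(edges, x) - 1  # x in (edges[j], edges[j+1]]

    # Per-cell counts of all observations and of complete ones.
    n_all, n_cmp = defaultdict(int), defaultdict(int)
    for x, xid, xi in data:
        n_all[cell(x)] += 1
        if xi == 1:
            n_cmp[cell(x)] += 1

    # f_n of (8): each complete observation (x, delta) carries mass
    # (1/n) * P_{X,n}(E(x)) / P_{1,X,n}(E(x)); the incomplete mass of a
    # cell is pushed onto the complete observations of that cell.
    mass = defaultdict(float)                  # (x, delta) -> f_n(x, delta)
    for x, xid, xi in data:
        if xi == 1:
            j = cell(x)
            mass[(x, xid)] += n_all[j] / (n_cmp[j] * n)

    # Product-limit (9) over the support points of f_n.
    points = sorted({x for x, _ in mass})
    S, prod = 1.0, 1.0                         # S tracks S_{X,n}(x-)
    for x in points:
        if x > t:
            break
        f1 = mass.get((x, 1), 0.0)
        if f1 > 0.0:
            prod *= 1.0 - f1 / S
        S -= mass.get((x, 0), 0.0) + mass.get((x, 1), 0.0)
    return prod
```

For example, with four complete observations (ξ ≡ 1) the cell ratios are all 1, every point gets mass 1/n, and the function returns the usual Kaplan-Meier product-limit value.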
3 Analysis of the estimator and its influence curve

We will first show that (8) indeed defines a sensible estimator. We need a slightly stronger
CAR assumption than the minimal-CAR assumption (1) in order that F be identifiable
from the discretized data; we assume

    P(ξ = 1 | X, δ) = P(ξ = 1 | X_D),    (10)

where X_D = a_j if X ∈ (a_j, a_{j+1}]. Applying this condition to the denominator in (2), we
obtain

    F(dx, δ) = [P_X(E(x)) / P_{1,X}(E(x))] P_1(dx, δ),    (11)

which shows that (8) will provide us with a consistent estimator of F.
Now regard (11) as defining a map Φ_1 : (D[0, τ])³ → (D[0, τ])² from the distributions
(P_1(x, 1), P_1(x, 0), P_0(x)), i.e., the distributions that determine the law of the observation (X_d, ξδ, ξ),
to the distributions (F(x, 1), F(x, 0)). Then

    S_T = Φ(Φ_1(P_1(·, 1), P_1(·, 0), P_0))

and

    Ŝ_T = Φ(Φ_1(P_{1,n}(·, 1), P_{1,n}(·, 0), P_{0,n})).

The functional delta-method (Gill, 1989) tells us that for proving weak convergence of
√n(Ŝ_T − S_T), as random elements of D[0, τ], to a Gaussian process it suffices to prove
compact differentiability of Φ and Φ_1. The compact differentiability of Φ has already
been established [see Gill and Johansen (1990) and Gill (1989)], and the compact differentiability
of Φ_1 only requires compact differentiability of the mapping (F, G) → ∫ F dG.
The latter has been proved in Gill (1989). The assumptions needed here are that the
denominators are bounded away from zero; for Φ this means that S_X(τ) > 0, and for Φ_1
it means that P_{1,X}((a_j, a_{j+1}]) > 0 for all j = 0, 1, ..., k.
As shown in van der Laan (1996b, Theorem 5.1), the estimator F_n of F is efficient for
the reduced data. Hence the compact differentiability of Φ and van der Vaart's (1991)
result imply that Ŝ_T is efficient for the reduced data. This proves the following theorem.

Theorem 3.1 Let the partition 0 = a_0 < a_1 < ... < a_k = τ be such that P(X ∈
(a_j, a_{j+1}], ξ = 1) > 0 for j = 0, 1, ..., k − 1 and P(X > τ) > 0. Also assume that
P(ξ = 1 | X, δ) = P(ξ = 1 | X_D). Then √n(Ŝ_T − S_T) converges weakly as a sequence
of random elements of D[0, τ] to a Gaussian process. Moreover, Ŝ_T(t) is asymptotically
efficient for the reduced data (X_d, ξδ, ξ).

Here X_D and X_d can be chosen arbitrarily close to X. Of course, if π depends on the full
X, then the estimator will still be efficient if the mesh of the partition converges to zero
at a rate which is neither too slow nor too quick (cf. van der Laan, 1996b, Theorem 5.1),
but Ŝ_T would be inconsistent for a fixed partition. On the other hand, if π depends on X
only through X_D (for some fixed partition), and the mesh of the partition converges to
zero slowly enough as n → ∞, then Ŝ_T(t) is asymptotically efficient for the original data
(cf. van der Laan, 1996b, Theorem 5.1). In particular, this holds if ξ is independent of X.
3.1 The influence curve of the estimator

The compact differentiability of Φ ∘ Φ_1 implies (see Gill, 1989) that Ŝ_T(t) is asymptotically
linear:

    Ŝ_T(t) − S_T(t) = (1/n) Σ_{i=1}^n IC_t(Y_i) + o_P(1/√n),

where the i.i.d. random variables IC_t(Y_i) are just the derivative dΦ ∘ dΦ_1(P) of Φ ∘ Φ_1
at P ≡ (P_1(·, 1), P_1(·, 0), P_0), applied to the empirical distribution of P based on one
observation Y_i = (X_i, ξ_i δ_i, ξ_i), and evaluated at t. Here IC_t is called the influence curve
of Ŝ_T(t). We have that √n(Ŝ_T(t) − S_T(t)) is asymptotically normal with mean zero and
variance equal to the variance of IC_t(Y). Hence an estimator of IC_t will lead to an
estimate of the asymptotic variance of Ŝ_T(t) and a (pointwise) confidence interval for
S_T(t) in the usual fashion.
Determining the influence curve comes down simply to finding the derivatives (linear
approximations) of Φ and Φ_1, neglecting all second-order terms, and substituting the
linearization of Φ_1 into the linearization of Φ. Here it means that we need to find the
linearization of F_n(dx, 1) − F(dx, 1) and the corresponding linearization of S_{X,n}(x) − S_X(x),
and substitute these into the linearization of Ŝ_T(t) − S_T(t), the product-integral mapping,
in terms of F_n(dx, 1) − F(dx, 1) and S_{X,n}(x) − S_X(x).
By using telescoping (see Gill, van der Laan and Wellner, 1995) it follows that:

    F_n(dx, δ) − F(dx, δ) = P_{1,n}(dx, δ) [P_{X,n}(E(x)) / P_{1,X,n}(E(x))] − P_1(dx, δ) [P_X(E(x)) / P_{1,X}(E(x))]
        ≈ [P_X(E(x)) / P_{1,X}(E(x))] (P_{1,n} − P_1)(dx, δ)
          + (P_{X,n} − P_X)(E(x)) [P_1(dx, δ) / P_{1,X}(E(x))]
          − (P_{1,X,n} − P_{1,X})(E(x)) [P_X(E(x)) P_1(dx, δ) / P_{1,X}(E(x))²].
This provides us with the linearization of F_n(dx, δ) − F(dx, δ) in terms of the empirical
distribution of the data. We have

    P_X(E(x)) / P_{1,X}(E(x)) = 1 / P(ξ = 1 | X ∈ E(x)),

    P_1(dx, δ) / P_{1,X}(E(x)) = F(dx, δ) / F_X(E(x)),

    P_X(E(x)) P_1(dx, δ) / P_{1,X}(E(x))² = [1 / P(ξ = 1 | X ∈ E(x))] [F(dx, δ) / F_X(E(x))].
For notational convenience we define π_D(x) ≡ P(ξ = 1 | X ∈ E(x)). Substitution of these
expressions into the linearization of F_n(dx, δ) − F(dx, δ) provides us with:

    F_n(dx, δ) − F(dx, δ) ≈ [1/π_D(x)] (P_{1,n} − P_1)(dx, δ)
        + (P_{X,n} − P_X)(E(x)) [F(dx, δ) / F_X(E(x))]
        − [1/π_D(x)] (P_{1,X,n} − P_{1,X})(E(x)) [F(dx, δ) / F_X(E(x))].    (12)
The linearization of F_{X,n}(dx) − F_X(dx) is now simply obtained by adding the linearizations
of F_n(dx, 1) − F(dx, 1) and F_n(dx, 0) − F(dx, 0). Hence

    S_{X,n}(x) − S_X(x) ≈ ∫_x^∞ [1/π_D(s)] (P_{1,X,n} − P_{1,X})(ds)
        + ∫_x^∞ (P_{X,n} − P_X)(E(s)) [F_X(ds) / F_X(E(s))]
        − ∫_x^∞ [1/π_D(s)] (P_{1,X,n} − P_{1,X})(E(s)) [F_X(ds) / F_X(E(s))].    (13)
Now, it remains to find the linearization of Ŝ_T − S_T in terms of the linearizations (12) and
(13). The linearization of the product integral Λ_T → ∏(1 − dΛ_T) = S_T is given in Gill
and Johansen (1990) and follows directly from the Duhamel equation. The linearization
of (F(dx, ·), S_X(x)) → F(dx, 1)/S_X(x−) = Λ_T(dx) is trivial. We have:

    Ŝ_T(t) − S_T(t) ≈ −S_T(t) ∫_0^t [1/(1 − Λ_T({x}))]
        ( [F_n(dx, 1) − F(dx, 1)] / S_X(x−) − [(S_{X,n} − S_X)(x−) / S_X(x−)²] F(dx, 1) ).    (14)

Substitute now the linearizations (12) and (13) for F_n(dx, 1) − F(dx, 1) and S_{X,n} − S_X
in (14) to obtain:
    Ŝ_T(t) − S_T(t) ≈ −S_T(t) ∫_0^t [1/(1 − Λ_T({x}))] (1/S_X(x−))
        { (P_{X,n} − P_X)(E(x)) [F(dx, 1) / F_X(E(x))]
          + (P_{1,n} − P_1)(dx, 1)/π_D(x) − (P_{1,X,n} − P_{1,X})(E(x)) [F(dx, 1) / (π_D(x) F_X(E(x)))] }
      + S_T(t) ∫_0^t [1/(1 − Λ_T({x}))] [F(dx, 1) / S_X(x−)²]
        { ∫_x^∞ [1/π_D(s)] (P_{1,X,n} − P_{1,X})(ds)
          + ∫_x^∞ [ (P_{X,n} − P_X)(E(s)) − (P_{1,X,n} − P_{1,X})(E(s))/π_D(s) ] [F_X(ds) / F_X(E(s))] }.    (15)
If we have only one observation (i.e., n = 1), then P_{1,n}(dx, δ) = I(X ∈ dx, δ = δ, ξ = 1),
P_{X,n}(E(x)) = I(X ∈ E(x)) and P_{1,X,n}(E(x)) = I(X ∈ E(x), ξ = 1). Substitution of these
indicators into (15) provides us with the influence curve of Ŝ_T(t):

    IC_t(Y)/S_T(t) = − I(X ≤ t, δ = 1, ξ = 1) / [(1 − Λ_T({X})) S_X(X−) π_D(X)]
      + ∫_0^t [1/(1 − Λ_T({x}))] [ I(X ∈ E(x), ξ = 1)/π_D(x) − I(X ∈ E(x)) ] [F(dx, 1) / (S_X(x−) F_X(E(x)))]
      + ∫_0^t [1/(1 − Λ_T({x}))] [ I(X > x, ξ = 1)/π_D(X) ] [F(dx, 1) / S_X(x−)²]
      + ∫_0^t [1/(1 − Λ_T({x}))] ( ∫_x^∞ [ I(X ∈ E(s)) − I(X ∈ E(s), ξ = 1)/π_D(s) ] [F_X(ds) / F_X(E(s))] ) [F(dx, 1) / S_X(x−)²].    (16)
Estimation of IC_t requires estimation of S_T and F, and is simply carried out by plugging in
our proposals, together with an estimate of π_D(x). If it is known that ξ is independent of X,
then one simply estimates P(ξ = 1) by the fraction of the (X_i, δ_i) that are completely observed.
If dependence between ξ and X is expected, then one estimates π_D(x) by the fraction of
completely observed (X_i, δ_i) among those with X_i ∈ E(x).
3.2 The efficient influence curve

The influence curve (16) is the influence curve of Ŝ_T for a fixed grid. If we let the mesh of
the grid converge to zero slowly with n, then Ŝ_T is efficient; so it is asymptotically linear
with influence curve equal to the efficient influence curve. Hence the efficient influence
curve must be the limit of (16) as max_i |a_i − a_{i−1}| → 0. If the mesh of the grid converges
to zero, then an integral with integrand I(X ∈ E(x)) only integrates over an infinitesimal
interval (a_i, a_{i+1}], and hence

    lim ∫_0^t I(X ∈ E(x)) F(dx, 1) / (S_X(x−) F_X(E(x))) = [I(X ≤ t) / S_X(X−)] (dF(·, 1)/dF_X)(X),
where

    k(x) ≡ (dF(·, 1)/dF_X)(x) = P(T ∈ dx, C > x) / P(T ∧ C ∈ dx) = Λ_T(dx) / (Λ_T(dx) + Λ_C(dx)).

Furthermore, we have that π_D(X) → π(X) and ∫ I(X ∈ E(x)) F_X(dx)/F_X(E(x)) → 1.
Hence the efficient influence curve is given by:
    IC_t(Y)/S_T(t) = [1/(1 − Λ_T({X}))] [k(X)/S_X(X−)] [ I(X ≤ t, ξ = 1)/π(X) − I(X ≤ t) ]
      − I(X ≤ t, δ = 1, ξ = 1) / [(1 − Λ_T({X})) S_X(X−) π(X)]
      + [I(ξ = 1)/π(X)] ∫_0^{t∧X} [1/(1 − Λ_T({x}))] F(dx, 1)/S_X(x−)²
      + [1 − I(ξ = 1)/π(X)] ∫_0^{t∧X} [1/(1 − Λ_T({x}))] F(dx, 1)/S_X(x−)².    (17)
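As a sanity check (ours, not from the original), note what happens when every failure indicator is observed, so that ξ ≡ 1 and π ≡ 1: the term involving k(X) cancels, the last two integrals combine, and (17) collapses to the classical influence curve of the Kaplan-Meier estimator,

```latex
\frac{IC_t(Y)}{S_T(t)}
  \;=\; -\,\frac{I(X \le t,\ \delta = 1)}{\bigl(1-\Lambda_T(\{X\})\bigr)\,S_X(X-)}
  \;+\; \int_0^{t \wedge X} \frac{1}{1-\Lambda_T(\{x\})}\,
        \frac{F(dx,1)}{S_X(x-)^2},
```

in agreement with the remark following (9) that Ŝ_T reduces to the Kaplan-Meier estimator when P(ξ = 1) = 1.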
Estimation of the efficient influence curve requires nonparametric estimation of the density
k, and hence requires smoothing, which explains why a standard NPMLE is not consistent
for this problem. Moreover, if π depends fully on X, then it also requires nonparametric
estimation of the binary regression function π(X). Although we do not pursue it here,
the efficient influence curve can be used to construct one-step efficient estimators. If
the efficient influence curve is estimated consistently, then the one-step estimator will be
efficient (see van der Laan, 1996a, Corollary 2.2).
Gill, van der Laan and Robins (1997) have shown for general nonparametric models
under minimal CAR that there exists only one influence curve; in our case the weakest
possible CAR assumption is (1), and (17) is the unique influence curve. Consequently,
any inefficient estimator cannot be asymptotically linear. This explains why inefficient
estimators can only be constructed under stronger assumptions than just minimal CAR;
the missing failure indicator model is an interesting example (another one is the bivariate
censoring model) where many estimators have been proposed, all being inconsistent under
minimal CAR.
4 An alternative approach

Results of Robins and Rotnitzky (1992) make it possible to construct an alternative
efficient estimator for S_T using the general theory of semiparametric efficiency bounds
[see, e.g., Newey (1991) and Bickel et al. (1993)].
Suppose we observe a random vector Y having distribution P ∈ {P_θ}, which is identified
by an unknown (possibly infinite-dimensional) parameter θ. Let L_0^2(P) denote the
Hilbert space of P-square-integrable functions with mean zero. Consider a smooth one-dimensional
(SOD) submodel {P_ε} ⊂ {P_θ} passing through P and having score function
k(Y) ∈ L_0^2(P) at ε = 0; see Bickel et al.'s (1993) definition of a regular parametric submodel.
The tangent space T(P) is the L_0^2(P)-closure of the linear span of all such score
functions k. For example, if nothing is known about P, then P_ε(dy) = (1 + εk(y))P(dy)
is a SOD submodel for any bounded function k with mean zero (provided ε is sufficiently
small), so T(P) is seen to be the whole of L_0^2(P) in this case.
Let μ = μ(θ) = μ(P_θ) be a real parameter that is pathwise differentiable at P: there
exists g ∈ L_0^2(P) such that lim_{ε→0} [μ(P_ε) − μ(P)]/ε = ⟨g, k⟩ for any SOD submodel {P_ε}
with score function k, where ⟨·, ·⟩ is the inner product in L_0^2(P). The function g is called
a gradient (or influence curve) for μ; the projection IC* of any gradient onto the tangent
space is unique and is known as the canonical gradient (or efficient influence curve). The
supremum of the Cramér-Rao bounds for all SOD submodels (the information bound) is
given by the second moment of IC*(Y).
In the present context, we observe Y = (X, ξδ, ξ) and the distribution P of Y is
identified by θ = (F, π). Consider the parameter μ = μ(F) = F(x, δ) for given x and
δ ∈ {0, 1}. In the full model that only assumes a CAR missingness process, we have
T(P) = L_0^2(P); see Gill, van der Laan and Robins (1997). Thus, any gradient for μ is
necessarily its canonical gradient. The canonical gradient of μ can be found as a gradient
in the submodel with π known, minus its projection on the space of missingness scores (i.e.,
the tangent space for the submodel in which only π is unknown). Robins and Rotnitzky
(1992, Theorem 4.2) provide closed-form expressions for the projection onto a space of
missingness scores when the missingness process is monotone. In our case, however, the
missingness process is very simple (it acts on only one component of the complete data
vector), so it is easy to find the projection without reference to this general theorem; see
the Appendix.
The following proposition expresses the canonical gradient of μ explicitly in terms of
the functions π(x) and p(x) = P(δ = δ | ξ = 1, X = x), the conditional probability that the
failure indicator takes the given value δ.

Proposition 4.1 Suppose the CAR assumption (1) holds and π(X) is bounded away from
zero. Then the canonical gradient of μ is

    IC*(Y) = [ξ/π(X)] [I(X ≤ x, δ = δ) − μ] − [I(X ≤ x) p(X) − μ] [ξ/π(X) − 1].
This result can be used to obtain a closed-form efficient estimator μ̂ of μ by solving the
estimating equation Σ_{i=1}^n ÎC*(Y_i) = 0, where ÎC* is a plug-in estimate of IC* in which π
and p are replaced by suitable estimates π̂ and p̂. The solution is

    μ̂ = n⁻¹ Σ_{i=1}^n π̂(X_i)⁻¹ I(X_i ≤ x) { ξ_i I(δ_i = δ) − [ξ_i − π̂(X_i)] p̂(X_i) },

which is a special case of an estimator that has been studied by Robins and Ritov (1997,
Sections 7 and 8). Under our assumption that P(ξ = 1 | X) = P(ξ = 1 | X_D), a natural
estimator of π(x) = π(x_D) is the empirical proportion of subjects with ξ = 1 among the
subjects whose discretized value of X equals x_D. A kernel estimator can be used for p(x).
The above representation of the efficient influence curve for μ implies that if one
estimates p inconsistently, then μ̂ is still consistent and asymptotically linear. This built-in
protection against misspecification of p allows the construction of estimators of μ that
are efficient at a chosen submodel for p and are always consistent and asymptotically
linear. Such estimators typically have a better finite-sample performance at the chosen
submodel.
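The closed-form solution and its robustness to misspecification of p can be checked in a few lines. The sketch below (plain Python; the names and the toy MCAR design are ours, not from the paper) computes μ̂ with the true π but a deliberately wrong constant p̂, and still recovers μ = F(1, 1) in a moderate-size simulation:

```python
import math, random

def mu_hat(data, x, d, pi_hat, p_hat):
    """Closed-form estimate of mu = F(x, d) = P(X <= x, delta = d) from
    observations (X, delta-or-None, xi); pi_hat, p_hat are functions."""
    n = len(data)
    total = 0.0
    for X, delta, xi in data:
        if X <= x:
            ind = 1.0 if (xi == 1 and delta == d) else 0.0  # xi*I(delta=d)
            total += (ind - (xi - pi_hat(X)) * p_hat(X)) / pi_hat(X)
    return total / n

# Toy MCAR simulation (design as in Section 5, 30% censoring):
# T ~ Exp(1.4), C ~ Exp(0.6), so X ~ Exp(2) and P(delta = 1) = 0.7,
# independent of X; hence mu = P(X <= 1, delta = 1) = 0.7*(1 - e^{-2}).
random.seed(1)
n, pi0 = 40000, 0.6
data = []
for _ in range(n):
    T, C = random.expovariate(1.4), random.expovariate(0.6)
    X, delta = min(T, C), int(T <= C)
    xi = int(random.random() < pi0)          # MCAR missingness
    data.append((X, delta if xi else None, xi))

mu_true = 0.7 * (1.0 - math.exp(-2.0))
# Deliberately wrong constant p_hat; correct pi_hat: still consistent.
est = mu_hat(data, 1.0, 1, lambda x: pi0, lambda x: 0.123)
assert abs(est - mu_true) < 0.02
```

With the true π plugged in, the term [ξ_i − π̂(X_i)]p̂(X_i) has conditional mean zero whatever p̂ is, which is the protection described above.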
The estimator μ̂ = F̂(x, δ) could be used in place of F_n(x, δ) in the product-integral
formula (9) to provide an alternative efficient estimator of S_T(t) having the efficient
influence curve given by (17). Another way of computing (17) would be to apply the results
of Robins and Rotnitzky (1992) directly to the known influence curve of Ŝ_T(t) in the case
of complete data (X, δ). This would lead to yet another efficient estimator of S_T(t) in
terms of π̂ and p̂, along the lines used to construct μ̂.
The benefit of the Robins-Rotnitzky approach is that the computation of the efficient
influence curve is relatively simple, and it leads directly to an efficient estimator via standard
estimating-equation technology, without the need for the artificial reduced-data model.
Our approach, on the other hand, provides an understanding of the role of nonparametric
maximum likelihood through the link to the bivariate censoring model. Moreover, the
reduced-data NPMLE has the attractive property that it solves the efficient estimating
equation at every π, so a consistent estimator of π is not required. Our reduced-data
model plays a similar role to the data reduction inherent in the kernel estimator p̂ used
in the Robins-Rotnitzky approach.
5 Numerical results

We now report the results of a small simulation study comparing the performance of the
proposed estimator with that of Lo's (1991) estimator. The comparison is made in terms
of mean integrated squared error (MISE).
The survival time T and the censoring time C are taken to be exponentially distributed
with parameters 1.4 and 0.6 for 30% censoring, and 0.6 and 1.4 for 70% censoring, respectively.
The sample size is set at n = 100. The coarsening mechanism is MCAR (required
for Lo's estimator to be consistent), and the probability π(x) ≡ π that a failure indicator
is non-missing is taken as 0.1, 0.2 or 0.3. The partition consists of k points on a regular
grid, with k = 10, 30 or 50.
Two artificial points are used in each region R_j having no complete observations: (x_j, 0)
and (x_j, 1), where x_j is the midpoint of the interval (a_j, a_{j+1}]. The mass redistributed to
these points is divided according to the proportions of censored and uncensored observations
in the complete data, respectively.
The results are given in Table 1. The proposed estimator improves considerably upon Lo's estimator, the greatest gains being obtained when the proportion of missing failure indicators is high (π = 0.1). This was to be expected, of course, since the strength of our approach comes from the way it handles the missing data. Note also that although the performance of Ŝ_T is relatively insensitive to the choice of the partition when k ≥ 30, it is significantly degraded under the coarsest grid (k = 10). The best performance is obtained using the finest grid, irrespective of the degree of censorship.
The effect of the partition is also readily seen by comparing Figures 1-3, which give plots of Ŝ_T based on simulated data, for different values of k. The closest fit to the underlying survival function (exponential with parameter 1.4) clearly corresponds to the finest partition. Figure 1 illustrates how the proposed estimator can adapt more efficiently to the missing data than Lo's estimator; see especially the region between t = 0.1 and t = 0.5. Figures 2 and 3 show how Ŝ_T deteriorates as the partition becomes coarser.
Table 1. Mean integrated squared error of the proposed estimator Ŝ_T and of Lo's estimator under the MCAR model P(ξ = 1 | X, Δ) ≡ π.

                       30% Censoring                 70% Censoring
              π = 0.1  π = 0.2  π = 0.3     π = 0.1  π = 0.2  π = 0.3
Ŝ_T (k = 50)    1.105    0.620    0.469       1.659    0.918    0.695
Ŝ_T (k = 30)    1.194    0.623    0.478       1.647    0.972    0.747
Ŝ_T (k = 10)    1.215    0.705    0.515       2.007    1.200    0.866
Lo              1.932    0.916    0.621       2.482    1.253    0.849
Ratio           1.75     1.48     1.32        1.50     1.36     1.22

NOTE: The MISE is expressed in units of 0.01 and is calculated over the interval [0, 1]. Lo refers to the second estimator of Lo (1991). Ratio refers to the ratio of the MISE of Lo's estimator to the MISE of Ŝ_T (k = 50). Each MISE is based on 10,000 samples.
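The MISE figures in Table 1 are Monte Carlo averages of the integrated squared error over [0, 1]; the computation can be sketched generically as follows (our own sketch, with a trapezoidal approximation to the integral):

```python
import numpy as np

def ise(est, truth, t):
    """Integrated squared error of an estimate over the grid t (trapezoidal rule)."""
    sq = (est - truth) ** 2
    return float(np.sum(0.5 * (sq[1:] + sq[:-1]) * np.diff(t)))

def mise(estimates, true_fn, t):
    """Monte Carlo MISE: average ISE over repeated simulated estimates,
    each given as an array of survival-function values on the grid t."""
    truth = true_fn(t)
    return float(np.mean([ise(e, truth, t) for e in estimates]))
```

In the setting of Table 1, `estimates` would hold 10,000 simulated realizations of a survival estimator evaluated on a grid over [0, 1], and `true_fn` the exponential survival function.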
[Insert Figures 1-3 about here]
Figure 1. The estimator Ŝ_T compared with Lo's estimator for simulated data. The data were generated using the MCAR model in Table 1 with 30% censoring and π = 0.2. The partition uses k = 100 equispaced points over the interval [0, 2]. Sample size n = 100. The data points with missing failure indicators are plotted with a distinct symbol and the remaining data points by +.
Figure 2. The estimator Ŝ_T based on k = 60 grid points over the interval [0, 2]. See the caption for Figure 1.
Figure 3. The estimator Ŝ_T based on k = 20 grid points over the interval [0, 2]. Same data as in Figure 1.
Appendix
Proof of Proposition 4.1. The main step of the proof is to find the canonical gradient of Φ in the submodel M(F) in which only F is unknown (π known). This submodel has tangent space

T(F) = sp{E(h(X, Δ) | Y) : Eh(X, Δ) = 0},

formed as the closed linear span of the conditional expectations of all complete data scores h(X, Δ) given the observed data Y. Also, the orthogonal complement of T(F) is given by

T(π) = sp{φ(Y) : E(φ(Y) | X, Δ) = 0},
which is the tangent space in the submodel in which only π is unknown (F known); see Robins and Rotnitzky (1992) and Gill, van der Laan and Robins (1997).
Note that a gradient for Φ in the submodel M(F) is

IC_0(Y) = (ξ / π(X)) (I(X ≤ x, Δ = δ) − Φ).

To see this, consider a SOD submodel {P_{F_ε, π}} ⊂ M(F) with score function k(Y) = E(h(X, Δ) | Y), where h is a bounded complete data score function and F_ε(du, δ) = (1 + ε h(u, δ)) F(du, δ). Then

lim_{ε→0} (Φ(F_ε) − Φ(F)) / ε = E(h(X, Δ) I(X ≤ x, Δ = δ))
                              = E(h(X, Δ) IC_0(Y))
                              = ⟨IC_0, k⟩,

where the second equality above uses E(IC_0(Y) | X, Δ) = I(X ≤ x, Δ = δ) − Φ, which is a consequence of the CAR assumption (under which E(ξ | X, Δ) = π(X)).
The canonical gradient of Φ in the submodel M(F) is the projection of IC_0 on T(F), which can be expressed as IC_0 − Π_nu(IC_0), where Π_nu is the projection on the nuisance tangent space T(π), the orthogonal complement of T(F). For any function φ(Y) ∈ L²_0(P) we have

Π_nu(φ) = E(φ(Y) | ξ, X) − E(φ(Y) | X).    (18)

Indeed, the right hand side is a function of Y with conditional mean zero, given the complete data, so it belongs to T(π). Also, φ − Π_nu(φ) is orthogonal to T(π), which can be seen by first taking the conditional expectation given (X, Δ) and then the conditional expectation given X, establishing (18). The application of (18) to IC_0 is straightforward, resulting in the expression (17) for the efficient influence curve IC.
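For completeness, the two-step conditioning argument can be written out (the decomposition and the notation a, b, c, d, ā are ours). Any observed-data function can be written φ(Y) = ξ a(X, Δ) + (1 − ξ) b(X), and any ψ ∈ T(π) has the form ψ = ξ c(X) + (1 − ξ) d(X) with π(X) c(X) + (1 − π(X)) d(X) = 0, since E(ψ | X, Δ) = 0 forces c to be free of Δ. Writing ā(X) = E(a(X, Δ) | X),

```latex
\begin{align*}
\varphi - \Pi_{\mathrm{nu}}(\varphi)
  &= \xi\{a(X,\Delta) - \bar a(X)\} + \pi(X)\bar a(X) + (1-\pi(X))\,b(X),\\
E\{(\varphi - \Pi_{\mathrm{nu}}(\varphi))\,\psi \mid X,\Delta\}
  &= \pi(X)\,c(X)\{a(X,\Delta) - \bar a(X)\}
     \quad\text{(using } E(\psi\mid X,\Delta)=0 \text{ and } \xi \perp \Delta \mid X\text{)},\\
E\bigl[\pi(X)\,c(X)\{a(X,\Delta) - \bar a(X)\} \bigm| X\bigr] &= 0,
\end{align*}
```

so that E{(φ − Π_nu(φ)) ψ} = 0 for every ψ ∈ T(π), confirming that Π_nu is the projection onto T(π).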
The final step is to show that the canonical gradient of Φ in the submodel M(F) is also a gradient for Φ in the full model. Let k ∈ L²_0(P) be bounded and consider the SOD submodel P_ε(dy) = P_{F_ε, π_ε}(dy) = (1 + ε k(y)) P(dy). The score k can be expressed uniquely in the form k = k_1 + k_2, where k_1 ∈ T(F) and k_2 ∈ T(π). Then {F_ε} defines a SOD submodel for M(F) having score k_1, because k_1 does not depend on {π_ε} and k_2 does not depend on {F_ε}. Thus, using the fact that Φ does not depend on π, we have

lim_{ε→0} (Φ(P_ε) − Φ(P)) / ε = lim_{ε→0} (Φ(F_ε) − Φ(F)) / ε = ⟨IC, k_1⟩ = ⟨IC, k⟩,

where the last equality holds because IC ∈ T(F) is orthogonal to k_2 ∈ T(π), as required.
Acknowledgements. We thank Jamie Robins for pointing out the alternative approach
based on Robins and Rotnitzky (1992). We also thank the Editor, Larry Brown, and the
Associate Editor for helpful comments.
References
Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore.
Bijlsma, F. (1994). Mortality among young people: causes and background. Ned. Tijdschr. Geneeskd. 138 2439–2443. (In Dutch.)
Dinse, G. E. (1982). Nonparametric estimation for partially-complete time and type of failure data. Biometrics 38 417–431.
Efron, B. (1967). The two sample problem with censored data. In Proc. Fifth Berkeley Symp. Math. Statist. Probab. 4 831–853.
Gijbels, I., Lin, D. Y. and Ying, Z. (1993). Non- and semi-parametric analysis of failure time data with missing failure indicators. Preprint.
Gill, R. D. (1989). Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1). Scand. J. Statist. 16 97–124.
Gill, R. D. and Johansen, S. (1990). A survey of product-integration with a view toward application in survival analysis. Ann. Statist. 18 1501–1555.
Gill, R. D., van der Laan, M. J. and Robins, J. M. (1997). Coarsening at random. To appear in Proceedings of the First Seattle Symposium on Biostatistics (D.-Y. Lin, ed.), Springer, New York.
Gill, R. D., van der Laan, M. J. and Wellner, J. A. (1995). Inefficient estimators of the bivariate survival function for three models. Ann. Inst. Henri Poincaré 31 545–597.
Heitjan, D. F. and Rubin, D. B. (1991). Ignorability and coarse data. Ann. Statist. 19 2244–2253.
Jacobsen, M. and Keiding, N. (1995). Coarsening at random in general sample spaces and random censoring in continuous time. Ann. Statist. 23 774–786.
Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist. Assoc. 53 457–481.
Van der Laan, M. J. (1996a). Efficient and Inefficient Estimation in Semiparametric Models. CWI-tract 114, Centre for Mathematics and Computer Science, Amsterdam.
Van der Laan, M. J. (1996b). Efficient estimation in the bivariate censoring model and repairing NPMLE. Ann. Statist. 24 596–627.
Lapidus, G., Braddock, M., Schwartz, R., Banco, L., and Jacobs, L. (1994). Accuracy of fatal motorcycle-injury reporting on death certificates. Accid. Anal. Prev. 26 535–542.
Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York.
Lo, S.-H. (1991). Estimating a survival function with incomplete cause-of-death data. J. Multivariate Anal. 39 217–235.
McKeague, I. W. and Subramanian, S. (1996). Product-limit estimators and Cox regression with missing censoring indicators. Preprint.
Newey, W. K. (1990). Semiparametric efficiency bounds. J. Applied Econometrics 5 99–135.
O'Sullivan, F. (1986). A statistical perspective on ill-posed inverse problems. Statist. Sci. 1 502–527.
Pruitt, R. C. (1991). On negative mass assigned by the bivariate Kaplan-Meier estimator. Ann. Statist. 19 443–453.
Robins, J. M. and Ritov, Y. (1997). Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semi-parametric models. Statistics in Medicine 16 285–319.
Robins, J. M. and Rotnitzky, A. (1992). Recovery of information and adjustment for dependent censoring using surrogate markers. In AIDS Epidemiology: Methodological Issues (N. Jewell, K. Dietz, V. Farewell, eds.), 279–331, Birkhäuser, Boston.
Rubin, D. B. (1976). Inference and missing data. Biometrika 63 581–590.
Van der Vaart, A. W. (1991). Efficiency and Hadamard differentiable functionals. Scand. J. Statist. 18 63–75.
Wellner, J. A. (1982). Asymptotic optimality of the product limit estimator. Ann. Statist. 10 595–602.