Statistical Inference Based on Pooled Data: A
Moment-Based Estimating Equation Approach
Howard D. Bondell
Department of Statistics, Rutgers University, Piscataway, NJ 08854, U.S.A.
Aiyi Liu∗ and Enrique F. Schisterman
Division of Epidemiology, Statistics and Prevention Research, National Institute of
Child Health and Human Development, Department of Health and Human Services,
6100 Executive Blvd., Bethesda, MD 20852, U.S.A.
∗ email: liua@mail.nih.gov
Summary. We consider statistical inference on parameters of a distribution when only
pooled data are observed. A moment-based estimating equation approach is proposed
to deal with situations where likelihood functions based on pooled data are difficult
to work with. We outline the method to obtain estimates and test statistics of the
parameters of interest in the general setting. We demonstrate the approach on the
family of distributions generated by the Box-Cox transformation model, and in the
process, construct tests for goodness of fit based on the pooled data.
Key words: Pooling biospecimens; Set-based observations; Moments; Box-Cox trans-
formation; Goodness of fit; Lognormal distribution.
1. Introduction
The classical approach to conducting statistical inference on an unknown vector of
parameters θ characterizing a distribution Fθ of a random variable X is based on a ran-
dom sample of size np (with both n and p being integers) from Fθ , say, X1 , . . . , Xnp ,
independent and identically distributed according to Fθ . Suppose that, in order to
reduce the cost of the study, the subjects that yield the np samples are randomly
grouped into n sets, each of size p. Subsequently, instead of observing each indi-
vidual X, the average of the Xs in each set is observed, yielding n observations,
$X_j^* = \sum_{i=p(j-1)+1}^{jp} X_i / p$, $j = 1, \ldots, n$, which are often called set-based observations.
The X ∗ s are also independent and identically distributed, but each following the dis-
tribution Fθ∗ of the average of p random draws from Fθ . We are concerned with inference
on θ based on the pooled data X1∗ , . . . , Xn∗ .
The data framework described above, often called group testing after the practice of testing for a microorganism in pooled biospecimens, appeared initially in the context of screening with dichotomous outcomes (Dorfman, 1943; Sobel and Groll, 1959), and was later
further developed by Gastwirth and Johnson (1994) and Litvak, Tu and Pagano (1994)
in screening for HIV contamination, Sobel and Elashoff (1975), Chen and Swallow (1990), Farrington
(1992), Hughes-Oliver and Swallow (1994), and Tu, Litvak and Pagano (1995) in esti-
mating population prevalence, and Barcellos et al. (1997) in localizing disease genes.
Recently Weinberg and Umbach (1999) proposed a set-based logistic model to explore
the association between a disease and exposures when only the pooled exposure values
are available. Faraggi, Reiser and Schisterman (2003) and Liu and Schisterman (2003)
considered evaluation of diagnostic biomarkers based on pooled specimens whose mea-
surements are assumed to follow normal or gamma distributions. Other areas where
pooling biospecimens has been found useful include gene microarray experiments where
mRNA samples are often pooled across subjects (Jin et al., 2001; Enard et al., 2002;
Kendziorski et al., 2003).
Although the strategy of pooling specimens has been used in practice, methods
for the analysis of set-based data from such experiments have not been fully developed in the literature, except for certain special cases. This is, perhaps, partly
because for a general distribution, Fθ , the likelihood methods based on set-based data
may not be feasible, since the distribution of the averages involves the convolution of p random variables distributed as Fθ. The purpose of the present paper is to develop a general
methodology for reasonably efficient estimation and testing under a broad class of
distributional assumptions, including the family of distributions generated by the Box-
Cox transformation model. The setting we consider is one in which Fθ possesses, and is fully determined by, its first several moments, which may be estimated by converting estimates of the moments of the set-based random variable X*.
The paper is arranged as follows. In §2 we shall describe the method under the
assumption that Fθ can be parameterized by no more than its first three moments.
We obtain estimating equations to yield estimates of θ. The method can be readily
extended to distributions with more than three parameters, but we omit the details
at the present time. In §3 we derive the large sample distribution of the estimates.
We then apply the method in §4 to the family of distributions generated by the Box-
Cox transformation model, which is an extremely versatile class of distributions that
includes the normal, lognormal, and (non-central) χ2 distributions as special cases. As
an extension, we discuss several procedures to test goodness of fit based on the pooled
data. The methods are exemplified in §5 using data from a study of the effects of oxidative stress and antioxidants on cardiovascular disease in upstate New York. Some comments
and discussions appear in §6.
2. Estimating Equations Based on the Moments: A General Method
Our aim is to conduct inference on a k-parameter vector, θ, that characterizes the
distribution Fθ of interest, based on the set-based observations X ∗ s. Except for certain
special cases, the likelihood function based on X ∗ s will be extremely difficult to derive.
Alternatively, we can obtain and connect the moments of X* with those of X, and construct inference based on the moment estimates. For this purpose, we assume that Fθ has at least 2k moments, of which the first k will be used to estimate θ and the rest to estimate the variance.
We illustrate the method by assuming that θ is at most three-dimensional. Define
$\mu_1 = E(X)$, $\mu_1^* = E(X^*)$, $\mu_r = E\{(X - \mu_1)^r\}$, and $\mu_r^* = E\{(X^* - \mu_1^*)^r\}$ for $r > 1$; these are all functions of θ. Then it is straightforward to show that the first three central moments of $X^*$, the mean of p independent, identically distributed copies of X, satisfy
$$\mu_1^* = \mu_1, \qquad \mu_2^* = \mu_2/p, \qquad \mu_3^* = \mu_3/p^2. \tag{1}$$
Replacing the left-hand sides of the equations by their corresponding set-based
sample moments and solving for θ will then yield an estimator of θ based on which
inference can be conducted. Putting the above into an estimating equation framework
allows us to conveniently utilize the general asymptotic theory on estimating equations
(See Serfling, 1980, and §3 below). Define
$$\Psi(w; \theta) = \begin{pmatrix} w - \mu_1(\theta) \\ p\{w - \mu_1(\theta)\}^2 - \mu_2(\theta) \\ p^2\{w - \mu_1(\theta)\}^3 - \mu_3(\theta) \end{pmatrix}, \tag{2}$$
then a consistent estimator, θ̃, of θ is the solution to the equation:
$$n^{-1} \sum_{j=1}^{n} \Psi(X_j^*; \tilde\theta) = 0, \tag{3}$$
or equivalently $\tilde\theta = \arg\min_\theta \, n^{-1} \sum_{j=1}^{n} \Psi(X_j^*; \theta)^T \Psi(X_j^*; \theta)$. Note that if θ is of dimension k < 3, we only require the first k components of Ψ.
We have restricted our attention to at most three parameters, which are sufficient for most practical needs. If more parameters are required, the approach described above can be readily extended to accommodate them, though the higher-order moments have less simple formulas (see the next section).
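To fix ideas, a minimal Python sketch of this estimating-equation recipe is given below. The function names, the user-supplied map `central_moments`, and the use of `scipy.optimize.least_squares` to solve (3) are our illustrative choices, not part of the method itself.

```python
import numpy as np
from scipy.optimize import least_squares

def psi(w, theta, p, central_moments):
    """Estimating function Psi(w; theta) of (2).

    `central_moments(theta)` is assumed to return (mu1, mu2, mu3): the mean
    and the second and third central moments of F_theta.
    """
    mu1, mu2, mu3 = central_moments(theta)
    d = w - mu1
    return np.array([d, p * d**2 - mu2, p**2 * d**3 - mu3])

def estimate_theta(x_star, p, central_moments, theta0):
    """Solve the estimating equation (3): drive the average of Psi over the
    pooled observations to zero by minimising its squared norm."""
    x_star = np.asarray(x_star, dtype=float)

    def mean_psi(theta):
        return np.mean([psi(w, theta, p, central_moments) for w in x_star],
                       axis=0)

    return least_squares(mean_psi, theta0).x
```

For a family with fewer than three parameters, only the corresponding leading components of Ψ would be kept, as noted above.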
3. Distribution Theory for Estimates
Clearly, there is no simple exact distribution theory for the estimator θ̃, since it will
depend on the distribution Fθ∗ of X ∗ , which, as mentioned earlier, may not be feasible
to work with, unless the original distribution is of a particularly convenient form, such
as a normal or gamma. Here we derive asymptotic theory for θ̃, on which statistical
inference may be based.
In order to obtain the asymptotic variance of the estimator, we also need the next
three higher-order moments. Straightforward manipulation leads to the following rela-
tionships (for p > 1):
$$\mu_4^* = p^{-3}\{\mu_4 + 3(p-1)\mu_2^2\}, \qquad \mu_5^* = p^{-4}\{\mu_5 + 10(p-1)\mu_3\mu_2\},$$
$$\mu_6^* = p^{-5}\{\mu_6 + 15(p-1)\mu_4\mu_2 + 10(p-1)\mu_3^2 + 15(p-1)(p-2)\mu_2^3\}, \tag{4}$$
again, all are functions of θ.
Following standard asymptotic theory of estimating equations (for example, Serfling, 1980, Chapter 7), we can show, via Taylor expansion, that $n^{1/2}(\tilde\theta - \theta) \xrightarrow{d} N_3(0, \Sigma)$, where $\Sigma = ABA^T$, with
$$A^{-1} = -E_\theta\left\{ \frac{\partial}{\partial \theta} \Psi(X^*; \theta) \right\}, \qquad B = E_\theta\left\{ \Psi(X^*; \theta)\, \Psi(X^*; \theta)^T \right\}. \tag{5}$$
For the particular form of the estimating equations (3), we can explicitly express
the matrices A and B in terms of the central moments of the original distribution, by
using the moment relationships given in (1) and (4). Putting $\theta = (\theta_1, \theta_2, \theta_3)^T$, we then have
$$A^{-1} = \begin{pmatrix} \dfrac{\partial \mu_1}{\partial \theta_1} & \dfrac{\partial \mu_1}{\partial \theta_2} & \dfrac{\partial \mu_1}{\partial \theta_3} \\[6pt] \dfrac{\partial \mu_2}{\partial \theta_1} & \dfrac{\partial \mu_2}{\partial \theta_2} & \dfrac{\partial \mu_2}{\partial \theta_3} \\[6pt] 3p\mu_2 \dfrac{\partial \mu_1}{\partial \theta_1} + \dfrac{\partial \mu_3}{\partial \theta_1} & 3p\mu_2 \dfrac{\partial \mu_1}{\partial \theta_2} + \dfrac{\partial \mu_3}{\partial \theta_2} & 3p\mu_2 \dfrac{\partial \mu_1}{\partial \theta_3} + \dfrac{\partial \mu_3}{\partial \theta_3} \end{pmatrix} \tag{6}$$
and
$$B = \begin{pmatrix} \mu_2/p & \mu_3/p & p^2\mu_4^* \\ \mu_3/p & p^2\mu_4^* - \mu_2^2 & p^3\mu_5^* - \mu_3\mu_2 \\ p^2\mu_4^* & p^3\mu_5^* - \mu_3\mu_2 & p^4\mu_6^* - \mu_3^2 \end{pmatrix}. \tag{7}$$
We will need estimates of Σ to construct confidence intervals and test statistics for
hypotheses regarding a function of θ. Here we propose two approaches to constructing estimates of A and B, which then yield estimates of Σ. One approach is to “plug in” θ̃ for θ in expressions (6) and (7) to obtain estimates of A and B.
Alternatively, we may obtain “semi-empirical” estimates of A and B using a two-stage strategy: first replace θ by θ̃ in (5), and then estimate the resulting functions by their empirical sample means, giving
$$\tilde{A}^{-1} = -n^{-1} \sum_{j=1}^{n} \frac{\partial}{\partial \theta} \Psi(X_j^*; \tilde\theta), \qquad \tilde{B} = n^{-1} \sum_{j=1}^{n} \Psi(X_j^*; \tilde\theta)\, \Psi(X_j^*; \tilde\theta)^T. \tag{8}$$
Both approaches lead to consistent estimators of Σ. We may then conduct approximate inference on any function of θ, using the asymptotic normality of the statistics. For example, a 100(1 − α) per cent confidence interval for a linear function $l^T\theta$ is $l^T\tilde\theta \pm z_{1-\alpha/2}\,(l^T \tilde{A} \tilde{B} \tilde{A}^T l / n)^{1/2}$, where $z_{1-\alpha/2}$ denotes the 1 − α/2 quantile of the standard normal distribution.
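Continuing the sketch begun in §2, the semi-empirical estimates (8) and the confidence interval above might be computed as follows. Here `psi` is the estimating function with the pooling size already bound in, and `dpsi_dtheta`, assumed supplied by the user (for instance via numerical differentiation), returns the Jacobian of Ψ with respect to θ; both names are ours.

```python
import numpy as np
from scipy.stats import norm

def sandwich_cov(x_star, theta_t, psi, dpsi_dtheta):
    """Semi-empirical estimates (8) of A and B, combined into the
    sandwich estimate of Sigma = A B A^T."""
    # psi(w, theta) -> length-3 vector; dpsi_dtheta(w, theta) -> 3x3 Jacobian
    A_inv = -np.mean([dpsi_dtheta(w, theta_t) for w in x_star], axis=0)
    B = np.mean([np.outer(psi(w, theta_t), psi(w, theta_t))
                 for w in x_star], axis=0)
    A = np.linalg.inv(A_inv)
    return A @ B @ A.T

def ci_linear(l, theta_t, sigma, n, alpha=0.05):
    """100(1 - alpha)% confidence interval for the linear function l^T theta."""
    se = np.sqrt(l @ sigma @ l / n)
    z = norm.ppf(1 - alpha / 2)
    return l @ theta_t - z * se, l @ theta_t + z * se
```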
4. The Box-Cox Transformation Family
4.1 Preliminaries
In their seminal paper, Box and Cox (1964) developed a widely used transformation family for the linear regression model:
$$Y = \begin{cases} (X^\lambda - 1)/\lambda, & \lambda \neq 0, \\ \log X, & \lambda = 0, \end{cases} \tag{9}$$
where λ is a real-valued, unknown parameter, and X is the original data, assumed to
be strictly positive (in order that all real λ yield real values).
Based on this transformation, a diverse family of distributions for X is generated.
The Box-Cox power transformation assumes that there is some member of the power
family of transformations such that when applied to the data, the transformed data
are normally distributed. Hence the original data can take on a wide range of possible
distributions, and in most practical situations, there exists some member of this family
of distributions that is a reasonable model for the data generating mechanism. Three
important special cases of this family are the normal (λ = 1), lognormal (λ = 0), and (non-central) χ² (λ = 1/2) distributions.
We now assume that Y follows a normal distribution with mean µ and variance σ². We would like to conduct inference on the parameter vector θ = (µ, σ², λ)^T based on observing the X*s, which, as before, are averages of p random observations of X.
4.2 Inference
4.2.1 Estimation of parameters
The standard approach to inference in the Box-Cox model is via maximum likelihood, although other approaches have also been proposed (see Sakia, 1992, for a review).
However, when only pooled data are observed, these methods are extremely difficult or
impossible to carry out, except in special cases. We shall instead estimate the three pa-
rameters based on the pooled data via the moment-based estimating equation method
described in §2 and §3.
To proceed, we obtain expressions for the central moments of the distribution of X,
the transformed normal. Once the expressions are obtained, it is then computation-
ally straightforward to derive the estimator and its estimated large sample covariance
matrix.
It should be pointed out that the claim that Y has a normal distribution cannot hold exactly for λ ≠ 0, since Y is bounded below by −1/λ when λ > 0 and above by −1/λ when λ < 0. In general practice this truncation effect is assumed to be negligible (and usually is), but it must be accounted for in our expressions for the moments.
We first deal with λ ≥ 0, in which case all moments of X exist. For λ > 0, define $U = X^\lambda = \lambda Y + 1$, whose density is a truncated (at zero) normal density:
$$f(u; \mu, \sigma, \lambda) = \frac{1}{|\lambda|\sigma\,\Phi(\delta)} \, \phi\!\left( \frac{u}{|\lambda|\sigma} - \delta \right), \quad u > 0, \tag{10}$$
where $\delta = (\lambda\mu + 1)/(|\lambda|\sigma)$, and $\phi$ and $\Phi$ are the standard normal density and distribution functions, respectively. Note that the absolute value sign is used in order to unify the density expressions for both λ > 0 and λ < 0 (see below).
The moments of X as functions of µ, σ, and λ are thus given by
$$\mu_1(\mu, \sigma, \lambda) = \frac{1}{|\lambda|\sigma\,\Phi(\delta)} \int_0^\infty u^{1/\lambda} \, \phi\!\left( \frac{u}{|\lambda|\sigma} - \delta \right) du, \tag{11}$$
$$\mu_r(\mu, \sigma, \lambda) = \frac{1}{|\lambda|\sigma\,\Phi(\delta)} \int_0^\infty \left\{ u^{1/\lambda} - \mu_1(\mu, \sigma, \lambda) \right\}^r \phi\!\left( \frac{u}{|\lambda|\sigma} - \delta \right) du, \quad r > 1. \tag{12}$$
Note that as λ → 0, the moments converge to the moments of the lognormal
distribution, for which known formulas are available (See, for example, Aitchison and
Brown, 1957), and thus explicit expressions for the estimates of µ and σ 2 based on the
pooled data may be obtained.
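Since (11)-(12) rarely admit closed forms, they can be evaluated by numerical quadrature. The sketch below, for moderate λ > 0 and with function names of our choosing, uses `scipy.integrate.quad`.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def boxcox_central_moments(mu, sigma, lam, rmax=3):
    """Central moments (11)-(12) of X by numerical quadrature, for lam > 0.
    The lognormal limit (lam = 0) has the closed-form moments cited above."""
    s = abs(lam) * sigma
    delta = (lam * mu + 1) / s
    const = 1.0 / (s * norm.cdf(delta))

    def dens(u):                       # truncated normal density (10)
        return const * norm.pdf(u / s - delta)

    mu1, _ = quad(lambda u: u**(1.0 / lam) * dens(u), 0, np.inf)
    moments = [mu1]
    for r in range(2, rmax + 1):
        mr, _ = quad(lambda u: (u**(1.0 / lam) - mu1)**r * dens(u), 0, np.inf)
        moments.append(mr)
    return moments                     # [mu_1, mu_2, ..., mu_rmax]
```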
For the case λ < 0, we notice from (11)-(12) that if X is bounded from above then all moments exist; otherwise, only the moments of order r < −λ exist. To ensure feasible execution of the proposed moment-based procedure, we assume that $X \le x_0$ for some $x_0 > 0$. If we define $U = X^\lambda - x_0^\lambda = \lambda Y + 1 - x_0^\lambda$, then the density of U and the moments of X are still given by (10)-(12), respectively, but with $\delta = (\lambda\mu + 1 - x_0^\lambda)/(|\lambda|\sigma)$ and with $u$ in the integrand replaced by $u + x_0^\lambda$.
Using these moment formulas and the estimating equations (3), we can obtain estimates of θ = (µ, σ², λ)^T. The derivatives required for the asymptotic distribution given by (5)-(7) may be computed by differentiating under the integral sign, or numerically using the estimated parameters. We may then plug in the empirical moment estimates to obtain the estimated asymptotic covariance matrix in order to construct confidence intervals and test statistics.
Note that if we assume λ to be known, then only the first two moments, µ1 and µ2, are needed to obtain estimates of µ and σ², and the next two higher moments, µ3 and µ4, to derive the asymptotic variance. See §5 for an explicit example in the lognormal case (λ = 0).
4.2.2 Computation and inference regarding µ and σ²
Often we are interested only in inference on µ and σ², the transformation parameter being regarded as a nuisance. We adopt the following convenient approach to conduct the inference.
Write µ(λ) and σ²(λ) to denote the dependence on the transformation scale. For fixed λ, we obtain (µ̃(λ), σ̃²(λ)) by using only the first two moment relationships. The estimate λ̃ is then found via a grid search using the third moment equation. Our limited simulation results show that the third moment equation is monotone in λ, when considered as a function of (µ̃(λ), σ̃²(λ), λ), and hence the grid search should yield a unique estimate λ̃ of λ.
Once λ̃ is determined from the data, we then proceed, just as in the standard data-transformation situation, as if λ were known (to be λ̃), to estimate µ and σ² and the asymptotic variance, with the last row and column of $A^{-1}$ and B in (6) and (7) removed.
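A rough sketch of this two-stage computation follows; the grid, the starting values, and the use of `least_squares` for the inner step are arbitrary implementation choices of ours.

```python
import numpy as np
from scipy.optimize import least_squares

def profile_grid_search(x_star, p, lam_grid, central_moments):
    """Estimate (mu, sigma^2, lambda): for each lambda on the grid, solve the
    first two moment equations for (mu, sigma), then pick the lambda whose
    implied third moment best matches its pooled-sample estimate.

    `central_moments(mu, sigma, lam)` is assumed to return (mu1, mu2, mu3).
    """
    x_star = np.asarray(x_star, dtype=float)
    m1 = x_star.mean()
    m2 = p * np.mean((x_star - m1)**2)       # estimate of mu_2, from (1)
    m3 = p**2 * np.mean((x_star - m1)**3)    # estimate of mu_3, from (1)

    best = None
    for lam in lam_grid:
        def eqs(par):
            mu1, mu2, _ = central_moments(par[0], par[1], lam)
            return [mu1 - m1, mu2 - m2]
        # lognormal-motivated starting value (data assumed positive)
        mu_l, sig_l = least_squares(eqs, x0=[np.log(m1), 0.5],
                                    bounds=([-np.inf, 1e-8],
                                            [np.inf, np.inf])).x
        gap = abs(central_moments(mu_l, sig_l, lam)[2] - m3)
        if best is None or gap < best[0]:
            best = (gap, mu_l, sig_l**2, lam)
    return best[1], best[2], best[3]          # (mu~, sigma^2~, lambda~)
```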
The appropriateness of such “conditional” inference has been a subject of much
debate in the literature (Bickel and Doksum, 1981; Box and Cox, 1982; Doksum and
Wong, 1983; Hinkley and Runger, 1984, among others). Since λ̃ is a consistent estimate of λ under the Box-Cox model, the asymptotic equivalence of the “conditional” transformed two-sample t-statistic with the “unconditional” one, as in Doksum and Wong (1983), would hold for the t-statistic based on the moment estimates as well.
Since formulas for the asymptotic variances are available for both the λ known, and
the λ estimated situations, the appropriateness of treating λ as known for problems
other than those treated by Doksum and Wong (1983) deserves further investigation.
4.3 Testing Goodness-of-fit
When inference procedures depend heavily on the distributional assumptions, it is
important to justify these assumptions before conducting the inference. Below we
propose several goodness-of-fit tests concerning the distribution of X based on the
pooled data X ∗ .
4.3.1 One-sample test
One standard approach to testing goodness-of-fit is to imbed the distribution in ques-
tion into a larger family of distributions indexed by one or more additional parameters
and then test hypotheses regarding these parameters. Since the Box-Cox family is a diverse family that includes many important special cases, a natural extension of the estimation problem is to test goodness of fit to a hypothesized distribution based on the pooled data.
Using the estimate of λ obtained from the estimating equations (3) with θ = (µ, σ², λ)^T and the moments given in (11) and (12), we may test the fit of the underlying data to a desired distribution of X, based solely on the pooled data. For example, testing for goodness of fit to a lognormal distribution based on the pooled data is accomplished simply by testing H0: λ = 0 vs. H1: λ ≠ 0, using the asymptotically standard normal test statistic Z = λ̃/s, where s is the estimated standard error of λ̃ based on the asymptotic distribution derived in §3.
4.3.2 Two-sample extension
The Box-Cox transformation family has been used in receiver operating characteristic (ROC) curve analysis to evaluate the accuracy of a medical diagnostic test or biomarker that yields continuous outcomes (Faraggi and Reiser, 2002; Zou and Hall,
2000, 2002). A key assumption to warrant the use of the Box-Cox transformation
theory in such analysis is that the transformation parameter λ be the same for both
diseased and non-diseased outcomes. Below we propose a method of testing this key
assumption based on the pooled data.
We observe two independent pooled samples from np diseased and mq nondiseased subjects, $X_j^* = \sum_{i=(j-1)p+1}^{jp} X_i / p$ ($j = 1, \ldots, n$) and $Y_k^* = \sum_{i=(k-1)q+1}^{kq} Y_i / q$ ($k = 1, \ldots, m$). The individual Xs and Y s are not observed. We assume that for certain unknown parameters $\lambda_X$ and $\lambda_Y$, $(X^{\lambda_X} - 1)/\lambda_X$ and $(Y^{\lambda_Y} - 1)/\lambda_Y$ both follow (truncated) normal distributions. The null hypothesis to be tested in this situation is $H_0: \lambda_X - \lambda_Y = 0$ vs. $H_1: \lambda_X - \lambda_Y \neq 0$.
We may test this hypothesis in the following manner. Let $\tilde\lambda_X$ and $\tilde\lambda_Y$ be the estimates of $\lambda_X$ and $\lambda_Y$, respectively, obtained by solving the estimating equations (3), and $s_X^2$ and $s_Y^2$ be their estimated variances derived from the methods described in §3. We then reject $H_0$ at significance level α if $|\tilde\lambda_X - \tilde\lambda_Y| > z_{1-\alpha/2}\, s$, where $s = (s_X^2 + s_Y^2)^{1/2}$.
If we do not reject the null hypothesis, we may then feel comfortable combining the two estimates into a weighted estimate of the common value of λ, using the common estimate as the true λ to estimate µ and σ² for each group, and then proceeding with the analysis as proposed by Zou and Hall (2000, 2002).
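Once the estimates and standard errors from §3 are in hand, the two-sample test reduces to a few lines; a sketch (function name ours):

```python
from scipy.stats import norm

def test_common_lambda(lam_x, se_x, lam_y, se_y, alpha=0.05):
    """Test H0: lambda_X = lambda_Y from the pooled-data estimates and their
    standard errors; returns the Z statistic, p-value, and reject flag."""
    s = (se_x**2 + se_y**2) ** 0.5
    z = (lam_x - lam_y) / s
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value, abs(z) > norm.ppf(1 - alpha / 2)
```

The same construction with `lam_y = 0` and `se_y = 0` gives the one-sample test of §4.3.1.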
4.3.3 General goodness-of-fit
The goodness-of-fit test based on an estimate of λ in §4.3.1 is technically valid only
under the assumption that the true distribution is actually a member of the Box-Cox
family. We may also adapt other readily available techniques to test the distributional
assumptions without assuming membership in the Box-Cox family.
We will assume that the two distributions, Fθ of X and Fθ∗ of X ∗ , are uniquely
determined by each other, which then implies that testing the hypothesis that the
unobserved data X follow the distribution Fθ is equivalent to testing the hypothesis
that the observed pooled data X ∗ follow the distribution Fθ∗ . While there are certain
exceptions to this uniqueness characterization, we suspect that it holds for most of the
distributions actually used in practice. We comment on this further in §6.
One simple method is to draw a Q-Q plot of the pooled data versus a hypothesized distribution F* of X*. Suppose we want to test the hypothesis that a distribution
Fθ generates the unobserved data from which the pooled data are observed. Using
the moment based technique we have developed in the previous sections, we obtain an
estimator θ̃ of θ based on the pooled data Xj∗ , (j = 1, . . . , n).
If the quantiles of F* are difficult to compute, which is the case in general, we may generate a large number of observations from Fθ̃ and group them into sets of size p, yielding a large sample from Fθ̃*. We then plot the quantiles of this large-sample empirical distribution against the quantiles of the empirical distribution of the observed data, and check for linearity in the plot.
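A sketch of this simulated Q-Q construction is given below; the sampler `draw_x` and the plotting-position convention are our assumptions.

```python
import numpy as np

def pooled_qq_points(x_star, draw_x, theta_t, p, big=100_000):
    """Quantile pairs for a Q-Q plot of the pooled data against a large
    simulated sample from F*_{theta~}.

    `draw_x(theta, size)` is assumed to draw individual observations from
    F_theta; pooling in sets of size p then mimics F*_theta."""
    sim = draw_x(theta_t, big * p).reshape(big, p).mean(axis=1)
    n = len(x_star)
    probs = (np.arange(1, n + 1) - 0.5) / n        # plotting positions
    return np.quantile(sim, probs), np.sort(np.asarray(x_star))
```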
Another approach would be to use a formal goodness-of-fit hypothesis test. One of the most common goodness-of-fit tests of a parametric assumption is the Kolmogorov-Smirnov type test, based on the statistic
$$D = \sup_x |F_n^*(x) - F_{\tilde\theta}^*(x)|,$$
the largest difference between the empirical and theoretical cumulative distribution functions. The distribution of D under the null hypothesis that the data follow the hypothesized distribution is complicated by the fact that we are using an estimate of θ, and not the true parameter. Hence, the critical regions of the standard Kolmogorov-Smirnov test are not valid.
Based on the results of Romano (1988), the following bootstrap method determines critical values that yield tests with asymptotically correct significance levels.
Step 1. Based on the estimate θ̃, generate a random sample of size np from Fθ̃, and then group it into sets of size p to obtain a pooled sample.
Step 2. Compute the empirical distribution, Fn*, of this sample.
Step 3. Generate the empirical distribution Fθ̃** based on a large sample grouped into sets of size p, as in the Q-Q plot.
Step 4. Calculate $D^* = \sup_x |F_n^*(x) - F_{\tilde\theta}^{**}(x)|$ for this sample.
Step 5. Repeat a large number of times, and use the frequency distribution of D* as the null distribution of D to find the critical region.
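Steps 1-5 might be coded as follows; the sampler `draw_x`, the number of bootstrap replicates, and the reference-sample size are our choices.

```python
import numpy as np

def ks_distance(sample, ref):
    """Sup-distance between the empirical CDFs of two samples."""
    grid = np.sort(np.concatenate([sample, ref]))
    f1 = np.searchsorted(np.sort(sample), grid, side='right') / len(sample)
    f2 = np.searchsorted(np.sort(ref), grid, side='right') / len(ref)
    return np.abs(f1 - f2).max()

def bootstrap_critical_value(draw_x, theta_t, n, p, B=999, big=100_000,
                             alpha=0.05):
    """Simulate the null distribution of D and return the level-alpha
    critical value.

    `draw_x(theta, size)` is assumed to draw individual observations
    from F_theta."""
    ref = draw_x(theta_t, big * p).reshape(big, p).mean(axis=1)     # Step 3
    d_star = np.empty(B)
    for b in range(B):
        pooled = draw_x(theta_t, n * p).reshape(n, p).mean(axis=1)  # Steps 1-2
        d_star[b] = ks_distance(pooled, ref)                        # Step 4
    return np.quantile(d_star, 1 - alpha)                           # Step 5
```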
5. An Example
A population-based sample of randomly selected residents of New York State’s Erie
and Niagara counties, 35 to 79 years of age, was the focus of this investigation. The
New York State Department of Motor Vehicles drivers’ license rolls were utilized as the
sampling frame for adults between the ages of 35 and 65, whereas the elderly sample (age 65 to 79) was randomly selected from the Health Care Financing Administration
database.
A total of 72 men and women were selected for the analyses. Personal history of my-
ocardial infarction and angina pectoris was ascertained by self-reported questionnaire.
Participants were asked if they had been diagnosed with angina pectoris confirmed by
angiogram or with myocardial infarction. Medical charts were reviewed by a physician for outcome verification, and participants with verified diagnoses were classified as having coronary heart disease. Partici-
pants provided a 12-hour fasting blood specimen for biochemical analysis. A number
of parameters were examined in fresh blood samples, including routine Vitamin E lev-
els. We assume that the distribution of Vitamin E concentrations is a member of the
Box-Cox family. Figures 1(a) and 1(b) show the normal Q-Q plots for the original and log-transformed data, respectively.
*** (Insert Figure 1 here) ***
Figure 1. Normal Q-Q plot of Vitamin E concentration.
A lognormal distribution appears to be a reasonable fit to the data. Based on the full (un-pooled) data, the standard Kolmogorov-Smirnov test rejects the normal assumption (p-value = 0.0023) but not the lognormal assumption (p-value = 0.5).
A 95% confidence interval for λ based on the maximum likelihood estimate, which is
obtainable when full data are available, is found to be (−0.6924, 0.3334); this interval contains zero, further supporting the lognormal assumption.
We now randomly group the subjects into groups of 2 and take the average as the
pooled observation. Based on this pooled sample, the moment-based estimate of λ
is 0.1096 with a standard error of 0.5565, yielding a confidence interval of (-0.9811,
1.2003). This again indicates lognormality. Moreover, the simulated Q-Q plot of the pooled data (Figure 2) also supports the conclusion that the data are lognormally distributed. We will therefore proceed under the lognormal assumption.
*** (Insert Figure 2 here) ***
Figure 2. Lognormal Q-Q plot of Vitamin E concentration with pooled data.
For the lognormal distribution, the four required central moments are given by (Aitchison and Brown, 1957):
$$\mu_1 = \exp\left(\mu + \tfrac{1}{2}\sigma^2\right), \qquad \mu_2 = \mu_1^2 \omega^2,$$
$$\mu_3 = \mu_1^3 \omega^4 (\omega^2 + 3), \qquad \mu_4 = \mu_1^4 \omega^4 (\omega^8 + 6\omega^6 + 15\omega^4 + 16\omega^2 + 3),$$
where $\omega^2 = \exp(\sigma^2) - 1$. Inverting the first two relations, we find that
$$\mu = \log\left\{ \mu_1^2 (\mu_1^2 + \mu_2)^{-1/2} \right\}, \qquad \sigma^2 = \log\left( 1 + \mu_1^{-2}\mu_2 \right).$$
We thus obtain (µ̃, σ̃²) by replacing µ1 and µ2 with their respective moment estimates
$$\tilde\mu_1 = \frac{1}{n} \sum_{i=1}^{n} X_i^*, \qquad \tilde\mu_2 = \frac{p}{n} \sum_{i=1}^{n} (X_i^* - \tilde\mu_1)^2,$$
where the X*s are the pooled observations. Using the explicit formulas for the moments and the asymptotic variance, some straightforward but tedious algebra yields the asymptotic variances and covariance:
$$n \operatorname{Var}(\tilde\mu) \doteq (4p\gamma^2)^{-1} \left\{ \gamma^6 - 8\gamma^4 + 16\gamma^3 + (2p - 11)\gamma^2 - 4(p - 1)\gamma + 2(p - 1) \right\},$$
$$n \operatorname{Var}(\tilde\sigma^2) \doteq (p\gamma^2)^{-1} \left\{ \gamma^6 - 4\gamma^4 + 4\gamma^3 + (2p - 3)\gamma^2 - 4(p - 1)\gamma + 2(p - 1) \right\},$$
$$n \operatorname{Cov}(\tilde\mu, \tilde\sigma^2) \doteq -(2p\gamma^2)^{-1} \left\{ \gamma^6 - 6\gamma^4 + 8\gamma^3 + (2p - 5)\gamma^2 - 4(p - 1)\gamma + 2(p - 1) \right\},$$
where $\gamma = 1 + \omega^2$.
Notice that the above formulas depend only on σ², the variance of the underlying normal distribution, along with the pooling size p. We may plug in γ̃ = exp(σ̃²) to yield consistent estimates of the variances and covariance, and thus construct test statistics and confidence intervals.
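As a sketch, these closed-form lognormal computations take only a few lines (function name ours); applied to the pooled Vitamin E values, this computation should reproduce the corresponding rows of Table 1 below.

```python
import numpy as np

def lognormal_pooled_fit(x_star, p):
    """Moment estimates of (mu, sigma^2) from pooled lognormal data, with
    standard errors from the plug-in asymptotic variance formulas above."""
    x_star = np.asarray(x_star, dtype=float)
    n = len(x_star)
    m1 = x_star.mean()
    m2 = p * np.mean((x_star - m1)**2)            # estimate of mu_2, from (1)
    mu_t = np.log(m1**2 / np.sqrt(m1**2 + m2))
    sig2_t = np.log(1 + m2 / m1**2)
    g = np.exp(sig2_t)                            # plug-in gamma~ = exp(sigma^2~)
    tail = -4*(p - 1)*g + 2*(p - 1)               # terms shared by all three formulas
    var_mu = (g**6 - 8*g**4 + 16*g**3 + (2*p - 11)*g**2 + tail) / (4*p*g**2*n)
    var_s2 = (g**6 - 4*g**4 + 4*g**3 + (2*p - 3)*g**2 + tail) / (p*g**2*n)
    return mu_t, sig2_t, np.sqrt(var_mu), np.sqrt(var_s2)
```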
Applying these formulas we obtain the estimates of (µ, σ²) and their estimated standard errors, for both the pooled and un-pooled data. The results are presented in Table 1. For comparison, estimates based on pooled data with group sizes of 3 and 4 are also given in the table.
Table 1
Estimate (± standard error) of the lognormal mean µ and variance σ².

p    n     µ̃                   σ̃²
1    72    2.6421 (±0.0498)    0.1733 (±0.0373)
2    36    2.6408 (±0.0519)    0.1757 (±0.0465)
3    24    2.6396 (±0.0540)    0.1781 (±0.0545)
4    18    2.6405 (±0.0554)    0.1764 (±0.0603)
To evaluate the performance of these estimates, we also computed the maximum
likelihood estimates µ̂ of µ and σ̂² of σ², using the un-pooled data. It turned out that µ̂ = 2.6466 and σ̂² = 0.1609, with standard errors 0.0473 and 0.0270, respectively.
We observe that, due to the small value of σ², which is common for lognormal data, there is not a great deal of efficiency loss in the moment-based estimates as compared with the maximum likelihood estimates, especially in the estimation of µ. In addition, there is only a small loss of efficiency as we pool the data. This small efficiency loss is in agreement with previous studies on the merits of pooling data, under normal and gamma assumptions, to reduce costs associated with bioassays; see Faraggi et al. (2003), Liu and Schisterman (2003) and Weinberg and Umbach (1999).
6. Discussion
6.1 Comments on Goodness-of-fit Test
In the goodness-of-fit testing problem, we are testing a hypothesis regarding the under-
lying distribution of the unpooled data. It is implicitly assumed that the distribution
of the convolution is in one-to-one correspondence with the distribution of the indi-
vidual observations. This is true under general regularity conditions. For example, a
nonvanishing characteristic function is a sufficient condition for this one-to-one corre-
spondence. For other characterization conditions see Prokhorov and Ushakov (2002)
and the references therein. Regardless of the uniqueness of the characterization, the
type I error of the test is unaffected. However, the test will be unable to detect the dif-
ference between any two original distributions that may yield the identical convolution
distribution. An additional question then arises, if the characterization is not unique,
how different are the generating distributions with respect to an underlying distance
such as Kolmogorov-Smirnoff, or (symmetrized) Kullback-Leibler?
6.2 Other Comments and Further Directions
In this paper we proposed inference on pooled data under parametric assumptions
on the individual observations. We also suggested methods to test these parametric
assumptions. One of the problems inherent in this type of set-based data is created by
the central limit theorem. The pooled data tend to a normal distribution as the pooling
size increases, and even for small to moderate pooling sizes, much of the skewness (and
higher moments) of the original distribution is lost in the set-based distribution. While
the loss in variability is linear in the pooling size, this loss of skewness is quadratic, as can be seen from the moment relationships (1). This hampers the ability to detect differences in distributional shape even for modest pooling sizes.
To our knowledge, the current paper is the first to present a general methodology
for dealing with set-based data under a broad class of parametric distributional as-
sumptions. As the pooling of data becomes a more common procedure, particularly in
the area of evaluation of disease biomarkers, more research on methods to deal with
this form of data needs to be done. For example, under a parametric assumption,
we may be able to use Edgeworth expansions to write out an approximate likelihood
function for the set-based data and proceed via likelihood methods. The accuracy of
inference based on these approximations would be of interest.
Non-parametric methods for set-based data may also be appropriate. A possible alternative to the parametric models proposed in this paper would be an approach based on density deconvolution, which again appears to be technically and computationally challenging.
Acknowledgements
The authors thank W. Jack Hall and Kai F. Yu for helpful discussion and suggestions.
References
Aitchison, J. and Brown, J. A. C. (1957). The Lognormal Distribution. Cambridge: Cambridge University Press.
Barcellos, L. F., Klitz, W., Field, L. L., Tobias, R., Bowcock, A. M., Wilson, R., Nelson, M. P., Nagatomi, J. and Thomson, G. (1997). Association mapping of disease loci by use of a pooled DNA genomic screen. American Journal of Human Genetics 61, 734-47.
Bickel, P. J. and Doksum, K. A. (1981). An analysis of transformations revisited.
Journal of the American Statistical Association 76, 296-311.
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. Journal of the
Royal Statistical Society B 26, 211-52.
Box, G. E. P. and Cox, D. R. (1982). An analysis of transformations revisited, rebutted.
Journal of the American Statistical Association 77, 209-10.
Chen, C. L. and Swallow, W. H. (1990). Using group testing to estimate a proportion,
and to test the binomial model. Biometrics 46, 1035-46.
Doksum, K. A. and Wong, C-W. (1983). Statistical tests based on transformed data.
Journal of the American Statistical Association 78, 411-7.
Dorfman, R. (1943). The detection of defective members of large populations. Annals
of Mathematical Statistics 14, 436-40.
Enard, W., Khaitovich, P., Klose, J., Zollner, S., Heissig, F., Giavalisco, P., Nieselt-
Struwe, K., Muchmore, E., Varki, A., Ravid, R., Doxiadis, G. M., Bontrop, R. E.
and Paabo, S. (2002). Intra- and interspecific variation in primate gene expression
patterns. Science 296, 340-3.
Faraggi, D. and Reiser, B. (2002). Estimation of the area under the ROC curve.
Statistics in Medicine 21, 3093-106.
Faraggi, D., Reiser, B. and Schisterman, E. F. (2003). ROC curve analysis for biomark-
ers based on pooled assessments. Statistics in Medicine 22, 2515-27.
Farrington, C. (1992). Estimating prevalence by group testing using generalized linear
models. Statistics in Medicine 11, 1591-7.
Gastwirth, J. and Johnson, W. (1994). Screening with cost-effective quality control:
Potential applications to HIV and drug testing. Journal of the American Statistical
Association 89, 972-81.
Hinkley, D. V. and Runger, G. (1984). The analysis of transformed data. Journal of
the American Statistical Association 79, 302-20.
Hughes-Oliver, J. M. and Swallow, W. H. (1994). A two-stage adaptive group-testing
procedure for estimating small proportions. Journal of the American Statistical
Association 89, 982-93.
Jin, W., Riley, R. M., Wolfinger, R. D., White, K. P., Passador-Gurgel, G. and Gibson,
G. (2001). The contributions of sex, genotype and age to transcriptional variance
in Drosophila melanogaster. Nature Genetics 29, 389-95.
Kendziorski, C. M., Zhang, Y., Lan, H. and Attie, A. D. (2003). The efficiency of pooling mRNA in microarray experiments. Biostatistics 4, 465-77.
Litvak, E., Tu, X. M. and Pagano, M. (1994). Screening for the presence of a disease by pooling sera samples. Journal of the American Statistical Association 89, 424-34.
Liu, A. and Schisterman, E. F. (2003). Comparison of diagnostic accuracy of biomark-
ers with pooled assessments. Biometrical Journal 45, 631-44.
Prokhorov, A. V. and Ushakov, N. G. (2002). On the problem of reconstructing a
summands distribution by the distribution of their sum. Theory of Probability and
its Applications 46, 420-30.
Romano, J. P. (1988). A bootstrap revival of some nonparametric distance tests.
Journal of the American Statistical Association 83, 698-708.
Sakia, R. M. (1992). The Box-Cox transformation technique: a review. The Statisti-
cian 41, 169-78.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. New York:
Wiley.
Sobel, M. and Groll, P. (1959). Group testing to eliminate efficiently all defectives in
a binomial sample. The Bell System Technical Journal 38, 1179-252.
Sobel, M. and Elashoff, R. (1975). Group testing with a new goal: Estimation.
Biometrika 62, 181-93.
Tu, X. M., Litvak, E. and Pagano, M. (1995). On the informativeness and accuracy
of pooled testing in estimating prevalence of a rare disease: Application to HIV
screening. Biometrika 82, 287-97.
Weinberg, C. R. and Umbach, D. M. (1999). Using pooled exposure assessment to improve
efficiency in case-control studies. Biometrics 55, 718-26.
Zou, K. H. and Hall, W. J. (2000). Two transformation models for estimating an ROC
curve derived from continuous data. Journal of Applied Statistics 27, 621-31.
Zou, K. H. and Hall, W. J. (2002). Semiparametric and parametric transformation
models for comparing diagnostic markers with paired design. Journal of Applied
Statistics 29, 803-16.
[Figure 1 appears here: normal Q-Q plots of Vitamin E concentration. Panel (a) plots VitaminE and panel (b) plots log(VitaminE), each against Quantiles of Standard Normal.]

[Figure 2 appears here: Q-Q plot of the pooled VitaminE data against Quantiles of Lognormal Average.]