Statistical Inference Based on Pooled Data: A
Moment-Based Estimating Equation Approach
Howard D. Bondell
Department of Statistics, Rutgers University, Piscataway, NJ 08854, U.S.A.
Aiyi Liu∗ and Enrique F. Schisterman
Division of Epidemiology, Statistics and Prevention Research, National Institute of
Child Health and Human Development, Department of Health and Human Services,
6100 Executive Blvd., Bethesda, MD 20852, U.S.A.
∗ email: liua@mail.nih.gov
Summary. We consider statistical inference on parameters of a distribution when only
pooled data are observed. A moment-based estimating equation approach is proposed
to deal with situations where likelihood functions based on pooled data are difficult
to work with. We outline the method to obtain estimates and test statistics of the
parameters of interest in the general setting. We demonstrate the approach on the
family of distributions generated by the Box-Cox transformation model, and in the
process, construct tests for goodness of fit based on the pooled data.
Key words: Pooling biospecimens; Set-based observations; Moments; Box-Cox trans-
formation; Goodness of fit; Lognormal distribution.
1. Introduction
The classical approach to conducting statistical inference on an unknown vector of
parameters θ characterizing a distribution Fθ of a random variable X is based on a ran-
dom sample of size np (with both n and p being integers) from Fθ , say, X1 , . . . , Xnp ,
independent and identically distributed according to Fθ . Suppose that, in order to
reduce the cost of the study, the subjects that yield the np samples are randomly
grouped into n sets, each of size p. Subsequently, instead of observing each indi-
vidual X, the average of the Xs in each set is observed, yielding n observations,
$X_j^* = \sum_{i=p(j-1)+1}^{jp} X_i / p$, $j = 1, \ldots, n$, which are often called set-based observations.
The X ∗ s are also independent and identically distributed, but each following the dis-
tribution Fθ∗ of the average of p random draws from Fθ . We are concerned with inference
on θ based on the pooled data X1∗ , . . . , Xn∗ .
The data framework described above, often called group testing after the practice of testing for a microorganism in pooled biospecimens, appeared initially in the context of screening with dichotomous outcomes (Dorfman, 1943; Sobel and Groll, 1959), and was later
further developed by Gastwirth and Johnson (1994) and Litvak, Tu and Pagano (1994)
in screening for HIV contamination, Sobel and Elashoff (1975), Chen and Swallow (1990), Farrington
(1992), Hughes-Oliver and Swallow (1994), and Tu, Litvak and Pagano (1995) in esti-
mating population prevalence, and Barcellos et al. (1997) in localizing disease genes.
Recently Weinberg and Umbach (1999) proposed a set-based logistic model to explore
the association between a disease and exposures when only the pooled exposure values
are available. Faraggi, Reiser and Schisterman (2003) and Liu and Schisterman (2003)
considered evaluation of diagnostic biomarkers based on pooled specimens whose mea-
surements are assumed to follow normal or gamma distributions. Other areas where
pooling biospecimens has been found useful include gene microarray experiments where
mRNA samples are often pooled across subjects (Jin et al., 2001; Enard et al., 2002;
Kendziorski et al., 2003).
Although the strategy of pooling specimens has been used in practice, methods
for the analysis of set-based data from such experiments have not been fully developed in the literature, except for certain special cases. This is, perhaps, partly
because for a general distribution, Fθ , the likelihood methods based on set-based data
may not be feasible, since the distribution of the averages involves the convolution of p random variables distributed as Fθ. The purpose of the present paper is to develop a general
methodology for reasonably efficient estimation and testing under a broad class of
distributional assumptions, including the family of distributions generated by the Box-
Cox transformation model. The setting we consider is one in which Fθ possesses, and is fully determined by, its first several moments, which may be estimated by converting estimates of the moments of the set-based random variable X*.
The paper is arranged as follows. In §2 we shall describe the method under the
assumption that Fθ can be parameterized by no more than its first three moments.
We obtain estimating equations to yield estimates of θ. The method can be readily
extended to distributions with more than three parameters, but we omit the details
at the present time. In §3 we derive the large sample distribution of the estimates.
We then apply the method in §4 to the family of distributions generated by the Box-
Cox transformation model, which is an extremely versatile class of distributions that
includes the normal, lognormal, and (non-central) χ2 distributions as special cases. As
an extension, we discuss several procedures to test goodness of fit based on the pooled
data. The methods are exemplified in §5 using data from a study of the effects of oxidative stress and antioxidants on cardiovascular disease in upstate New York. Some comments
and discussions appear in §6.
2. Estimating Equations Based on the Moments: A General Method
Our aim is to conduct inference on a k-parameter vector, θ, that characterizes the
distribution Fθ of interest, based on the set-based observations X ∗ s. Except for certain
special cases, the likelihood function based on X ∗ s will be extremely difficult to derive.
Alternatively, we can obtain and connect the moments of X* with those of X, and construct inference based on the moment estimates. For this purpose, we assume that Fθ has at least 2k moments, of which the first k will be used to estimate θ and the rest to estimate the variance.
We illustrate the method by assuming that θ is at most three-dimensional. Define
$\mu_1 = E(X)$, $\mu_1^* = E(X^*)$, $\mu_r = E\{(X - \mu_1)^r\}$, and $\mu_r^* = E\{(X^* - \mu_1^*)^r\}$ for $r > 1$; these are all functions of θ. Then it is straightforward to show that the first three central moments of $X^*$, the mean of p independent, identically distributed copies of X, satisfy
$$\mu_1^* = \mu_1, \qquad \mu_2^* = \mu_2/p, \qquad \mu_3^* = \mu_3/p^2. \tag{1}$$
Replacing the left-hand sides of the equations by their corresponding set-based
sample moments and solving for θ will then yield an estimator of θ based on which
inference can be conducted. Putting the above into an estimating equation framework
allows us to conveniently utilize the general asymptotic theory on estimating equations
(See Serfling, 1980, and §3 below). Define
$$\Psi(w; \theta) = \begin{pmatrix} w - \mu_1(\theta) \\ p\{w - \mu_1(\theta)\}^2 - \mu_2(\theta) \\ p^2\{w - \mu_1(\theta)\}^3 - \mu_3(\theta) \end{pmatrix}, \tag{2}$$
then a consistent estimator, θ̃, of θ is the solution to the equation:
$$n^{-1} \sum_{j=1}^{n} \Psi(X_j^*; \tilde\theta) = 0, \tag{3}$$
or equivalently $\tilde\theta = \arg\min_\theta \, n^{-1} \sum_{j=1}^{n} \Psi(X_j^*; \theta)^T \Psi(X_j^*; \theta)$. Note that if θ is of dimension k < 3, we only require the first k components of Ψ.
We have restricted our attention to at most three parameters, which are sufficient for most practical needs. If more parameters are required, the approach described above can be readily extended to accommodate them, though the higher-order moments have less simple formulas (see the next section).
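To fix ideas, a minimal Python sketch of this estimating-equation recipe is given below. The function names, the user-supplied map `central_moments`, and the use of `scipy.optimize.least_squares` to solve (3) are our illustrative choices, not part of the method itself.

```python
import numpy as np
from scipy.optimize import least_squares

def psi(w, theta, p, central_moments):
    """Estimating function Psi(w; theta) of (2).

    `central_moments(theta)` is assumed to return (mu1, mu2, mu3): the mean
    and the second and third central moments of F_theta.
    """
    mu1, mu2, mu3 = central_moments(theta)
    d = w - mu1
    return np.array([d, p * d**2 - mu2, p**2 * d**3 - mu3])

def estimate_theta(x_star, p, central_moments, theta0):
    """Solve the estimating equation (3): drive the average of Psi over the
    pooled observations to zero by minimising its squared norm."""
    x_star = np.asarray(x_star, dtype=float)

    def mean_psi(theta):
        return np.mean([psi(w, theta, p, central_moments) for w in x_star],
                       axis=0)

    return least_squares(mean_psi, theta0).x
```

For a family with fewer than three parameters, only the corresponding leading components of Ψ would be kept, as noted above.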
3. Distribution Theory for Estimates
Clearly, there is no simple exact distribution theory for the estimator θ̃, since it will
depend on the distribution Fθ∗ of X ∗ , which, as mentioned earlier, may not be feasible
to work with, unless the original distribution is of a particularly convenient form, such
as a normal or gamma. Here we derive asymptotic theory for θ̃, on which statistical
inference may be based.
In order to obtain the asymptotic variance of the estimator, we also need the next
three higher-order moments. Straightforward manipulation leads to the following rela-
tionships (for p > 1):
$$\mu_4^* = p^{-3}\{\mu_4 + 3(p-1)\mu_2^2\}, \qquad \mu_5^* = p^{-4}\{\mu_5 + 10(p-1)\mu_3\mu_2\},$$
$$\mu_6^* = p^{-5}\{\mu_6 + 15(p-1)\mu_4\mu_2 + 10(p-1)\mu_3^2 + 15(p-1)(p-2)\mu_2^3\}, \tag{4}$$
again, all are functions of θ.
Following standard asymptotic theory of estimating equations (for example, Serfling, 1980, Chapter 7), we can show, via Taylor expansion, that $n^{1/2}(\tilde\theta - \theta) \xrightarrow{d} N_3(0, \Sigma)$, where $\Sigma = ABA^T$, with
$$A^{-1} = -E_\theta\left\{ \frac{\partial}{\partial \theta} \Psi(X^*; \theta) \right\}, \qquad B = E_\theta\left\{ \Psi(X^*; \theta)\, \Psi(X^*; \theta)^T \right\}. \tag{5}$$
For the particular form of the estimating equations (3), we can explicitly express
the matrices A and B in terms of the central moments of the original distribution, by
using the moment relationships given in (1) and (4). Putting $\theta = (\theta_1, \theta_2, \theta_3)^T$, we then have
$$A^{-1} = \begin{pmatrix} \dfrac{\partial \mu_1}{\partial \theta_1} & \dfrac{\partial \mu_1}{\partial \theta_2} & \dfrac{\partial \mu_1}{\partial \theta_3} \\[6pt] \dfrac{\partial \mu_2}{\partial \theta_1} & \dfrac{\partial \mu_2}{\partial \theta_2} & \dfrac{\partial \mu_2}{\partial \theta_3} \\[6pt] 3p\mu_2 \dfrac{\partial \mu_1}{\partial \theta_1} + \dfrac{\partial \mu_3}{\partial \theta_1} & 3p\mu_2 \dfrac{\partial \mu_1}{\partial \theta_2} + \dfrac{\partial \mu_3}{\partial \theta_2} & 3p\mu_2 \dfrac{\partial \mu_1}{\partial \theta_3} + \dfrac{\partial \mu_3}{\partial \theta_3} \end{pmatrix} \tag{6}$$
and
$$B = \begin{pmatrix} \mu_2/p & \mu_3/p & p^2\mu_4^* \\ \mu_3/p & p^2\mu_4^* - \mu_2^2 & p^3\mu_5^* - \mu_3\mu_2 \\ p^2\mu_4^* & p^3\mu_5^* - \mu_3\mu_2 & p^4\mu_6^* - \mu_3^2 \end{pmatrix}. \tag{7}$$
We will need estimates of Σ to construct confidence intervals and test statistics for
hypotheses regarding a function of θ. Here we propose two approaches to constructing estimates of A and B, which then yield estimates of Σ. One approach is to “plug in” θ̃ for θ in expressions (6) and (7) to obtain estimates of A and B.
Alternatively, we may obtain “semi-empirical” estimates of A and B using a two-stage strategy: first replace θ by θ̃ in (5), and then estimate the resulting functions by their empirical sample means, giving
$$\tilde{A}^{-1} = -n^{-1} \sum_{j=1}^{n} \frac{\partial}{\partial \theta} \Psi(X_j^*; \tilde\theta), \qquad \tilde{B} = n^{-1} \sum_{j=1}^{n} \Psi(X_j^*; \tilde\theta)\, \Psi(X_j^*; \tilde\theta)^T. \tag{8}$$
Both approaches lead to consistent estimators of Σ. We may then conduct approximate inference on any function of θ, using the asymptotic normality of the statistics. For example, a 100(1 − α) per cent confidence interval for a linear function $l^T\theta$ is $l^T\tilde\theta \pm z_{1-\alpha/2}\,(l^T \tilde{A} \tilde{B} \tilde{A}^T l / n)^{1/2}$, where $z_{1-\alpha/2}$ denotes the 1 − α/2 quantile of the standard normal distribution.
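Continuing the sketch begun in §2, the semi-empirical estimates (8) and the confidence interval above might be computed as follows. Here `psi` is the estimating function with the pooling size already bound in, and `dpsi_dtheta`, assumed supplied by the user (for instance via numerical differentiation), returns the Jacobian of Ψ with respect to θ; both names are ours.

```python
import numpy as np
from scipy.stats import norm

def sandwich_cov(x_star, theta_t, psi, dpsi_dtheta):
    """Semi-empirical estimates (8) of A and B, combined into the
    sandwich estimate of Sigma = A B A^T."""
    # psi(w, theta) -> length-3 vector; dpsi_dtheta(w, theta) -> 3x3 Jacobian
    A_inv = -np.mean([dpsi_dtheta(w, theta_t) for w in x_star], axis=0)
    B = np.mean([np.outer(psi(w, theta_t), psi(w, theta_t))
                 for w in x_star], axis=0)
    A = np.linalg.inv(A_inv)
    return A @ B @ A.T

def ci_linear(l, theta_t, sigma, n, alpha=0.05):
    """100(1 - alpha)% confidence interval for the linear function l^T theta."""
    se = np.sqrt(l @ sigma @ l / n)
    z = norm.ppf(1 - alpha / 2)
    return l @ theta_t - z * se, l @ theta_t + z * se
```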
4. The Box-Cox Transformation Family
4.1 Preliminaries
In their seminal paper, Box and Cox (1964) developed a widely used transformation family for the linear regression model:
$$Y = \begin{cases} (X^\lambda - 1)/\lambda, & \lambda \neq 0, \\ \log X, & \lambda = 0, \end{cases} \tag{9}$$
where λ is a real-valued, unknown parameter, and X is the original data, assumed to
be strictly positive (in order that all real λ yield real values).
Based on this transformation, a diverse family of distributions for X is generated.
The Box-Cox power transformation assumes that there is some member of the power
family of transformations such that when applied to the data, the transformed data
are normally distributed. Hence the original data can take on a wide range of possible
distributions, and in most practical situations, there exists some member of this family
of distributions that is a reasonable model for the data generating mechanism. Three
important special cases of this family are the normal (λ = 1), lognormal (λ = 0), and (non-central) χ² (λ = 1/2) distributions.
We now assume that Y follows a normal distribution with mean µ and variance σ². We would like to conduct inference on the parameter vector θ = (µ, σ², λ)^T based on observing the X*s, which, as before, are averages of p random observations of X.
4.2 Inference
4.2.1 Estimation of parameters
The standard approach to inference in the Box-Cox model is via maximum likelihood, although other approaches have also been proposed (see Sakia, 1992, for a review).
However, when only pooled data are observed, these methods are extremely difficult or
impossible to carry out, except in special cases. We shall instead estimate the three pa-
rameters based on the pooled data via the moment-based estimating equation method
described in §2 and §3.
To proceed, we obtain expressions for the central moments of the distribution of X,
the transformed normal. Once the expressions are obtained, it is then computation-
ally straightforward to derive the estimator and its estimated large sample covariance
matrix.
It should be pointed out that the claim that Y has a normal distribution cannot hold exactly for λ ≠ 0, since Y is bounded below by −1/λ when λ > 0 and above by −1/λ when λ < 0. In general practice this truncation effect is assumed to be negligible (and usually is), but it must be accounted for in our expressions for the moments.
We first deal with λ ≥ 0, in which case all moments of X exist. For λ > 0, define $U = X^\lambda = \lambda Y + 1$, whose density is a truncated (at zero) normal density:
$$f(u; \mu, \sigma, \lambda) = \frac{1}{|\lambda|\sigma\,\Phi(\delta)} \, \phi\!\left( \frac{u}{|\lambda|\sigma} - \delta \right), \quad u > 0, \tag{10}$$
where $\delta = (\lambda\mu + 1)/(|\lambda|\sigma)$, and $\phi$ and $\Phi$ are the standard normal density and distribution functions, respectively. Note that the absolute value sign is used in order to unify the density expressions for both λ > 0 and λ < 0 (see below).
The moments of X as functions of µ, σ, and λ are thus given by
$$\mu_1(\mu, \sigma, \lambda) = \frac{1}{|\lambda|\sigma\,\Phi(\delta)} \int_0^\infty u^{1/\lambda} \, \phi\!\left( \frac{u}{|\lambda|\sigma} - \delta \right) du, \tag{11}$$
$$\mu_r(\mu, \sigma, \lambda) = \frac{1}{|\lambda|\sigma\,\Phi(\delta)} \int_0^\infty \left\{ u^{1/\lambda} - \mu_1(\mu, \sigma, \lambda) \right\}^r \phi\!\left( \frac{u}{|\lambda|\sigma} - \delta \right) du, \quad r > 1. \tag{12}$$
Note that as λ → 0, the moments converge to the moments of the lognormal
distribution, for which known formulas are available (See, for example, Aitchison and
Brown, 1957), and thus explicit expressions for the estimates of µ and σ 2 based on the
pooled data may be obtained.
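Since (11)-(12) rarely admit closed forms, they can be evaluated by numerical quadrature. The sketch below, for moderate λ > 0 and with function names of our choosing, uses `scipy.integrate.quad`.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def boxcox_central_moments(mu, sigma, lam, rmax=3):
    """Central moments (11)-(12) of X by numerical quadrature, for lam > 0.
    The lognormal limit (lam = 0) has the closed-form moments cited above."""
    s = abs(lam) * sigma
    delta = (lam * mu + 1) / s
    const = 1.0 / (s * norm.cdf(delta))

    def dens(u):                       # truncated normal density (10)
        return const * norm.pdf(u / s - delta)

    mu1, _ = quad(lambda u: u**(1.0 / lam) * dens(u), 0, np.inf)
    moments = [mu1]
    for r in range(2, rmax + 1):
        mr, _ = quad(lambda u: (u**(1.0 / lam) - mu1)**r * dens(u), 0, np.inf)
        moments.append(mr)
    return moments                     # [mu_1, mu_2, ..., mu_rmax]
```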
For the case λ < 0, we notice from (11)-(12) that if X is bounded from above then all moments exist; otherwise, only the moments of order r < −λ exist. To ensure feasible execution of the proposed moment-based procedure, we assume that $X \le x_0$ for some $x_0 > 0$. If we define $U = X^\lambda - x_0^\lambda = \lambda Y + 1 - x_0^\lambda$, then the density of U and the moments of X are still given by (10)-(12), respectively, but with $\delta = (\lambda\mu + 1 - x_0^\lambda)/(|\lambda|\sigma)$ and with $u$ in the integrand replaced by $u + x_0^\lambda$.
Using these moment formulas and the estimating equations (3), we can obtain estimates of θ = (µ, σ², λ)^T. The derivatives required for the asymptotic distribution given by (5)-(7) may be computed by differentiating under the integral sign, or numerically using the estimated parameters. We may then plug in the empirical moment estimates to obtain the estimated asymptotic covariance matrix in order to construct confidence intervals and test statistics.
Note that if we assume λ to be known, then only the first two moments, µ1 and µ2, are needed to obtain estimates of µ and σ², and the next two higher moments, µ3 and µ4, to derive the asymptotic variance. See §5 for an explicit example in the lognormal case (λ = 0).
4.2.2 Computation and inference regarding µ and σ²
Often we are interested only in inference on µ and σ², the transformation parameter being regarded as a nuisance. We adopt the following convenient approach to conduct the inference.
Write µ(λ) and σ²(λ) to denote the dependence on the transformation scale. For fixed λ, we obtain (µ̃(λ), σ̃²(λ)) by using only the first two moment relationships. The estimate λ̃ is then found via a grid search using the third moment equation. Our limited simulation results show that the third moment equation is monotone in λ, when considered as a function of (µ̃(λ), σ̃²(λ), λ), and hence the grid search should yield a unique estimate λ̃ of λ.
Once λ̃ is determined from the data, we then proceed, just as in the standard data-transformation situation, as if λ were known (to be λ̃), to estimate µ and σ² and the asymptotic variance, with the last row and column of $A^{-1}$ and B in (6) and (7) removed.
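A rough sketch of this two-stage computation follows; the grid, the starting values, and the use of `least_squares` for the inner step are arbitrary implementation choices of ours.

```python
import numpy as np
from scipy.optimize import least_squares

def profile_grid_search(x_star, p, lam_grid, central_moments):
    """Estimate (mu, sigma^2, lambda): for each lambda on the grid, solve the
    first two moment equations for (mu, sigma), then pick the lambda whose
    implied third moment best matches its pooled-sample estimate.

    `central_moments(mu, sigma, lam)` is assumed to return (mu1, mu2, mu3).
    """
    x_star = np.asarray(x_star, dtype=float)
    m1 = x_star.mean()
    m2 = p * np.mean((x_star - m1)**2)       # estimate of mu_2, from (1)
    m3 = p**2 * np.mean((x_star - m1)**3)    # estimate of mu_3, from (1)

    best = None
    for lam in lam_grid:
        def eqs(par):
            mu1, mu2, _ = central_moments(par[0], par[1], lam)
            return [mu1 - m1, mu2 - m2]
        # lognormal-motivated starting value (data assumed positive)
        mu_l, sig_l = least_squares(eqs, x0=[np.log(m1), 0.5],
                                    bounds=([-np.inf, 1e-8],
                                            [np.inf, np.inf])).x
        gap = abs(central_moments(mu_l, sig_l, lam)[2] - m3)
        if best is None or gap < best[0]:
            best = (gap, mu_l, sig_l**2, lam)
    return best[1], best[2], best[3]          # (mu~, sigma^2~, lambda~)
```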
The appropriateness of such “conditional” inference has been a subject of much
debate in the literature (Bickel and Doksum, 1981; Box and Cox, 1982; Doksum and
Wong, 1983; Hinkley and Runger, 1984, among others). Since λ̃ is a consistent estimate of λ under the Box-Cox model, the asymptotic equivalence of the “conditional” transformed two-sample t-statistic with the “unconditional” one, as in Doksum and Wong (1983), would hold for the t-statistic based on the moment estimates as well.
Since formulas for the asymptotic variances are available for both the λ known, and
the λ estimated situations, the appropriateness of treating λ as known for problems
other than those treated by Doksum and Wong (1983) deserves further investigation.
4.3 Testing Goodness-of-fit
When inference procedures depend heavily on the distributional assumptions, it is
important to justify these assumptions before conducting the inference. Below we
propose several goodness-of-fit tests concerning the distribution of X based on the
pooled data X ∗ .
4.3.1 One-sample test
One standard approach to testing goodness-of-fit is to imbed the distribution in ques-
tion into a larger family of distributions indexed by one or more additional parameters
and then test hypotheses regarding these parameters. Since the Box-Cox family is a diverse family that includes many important special cases, a natural extension of the estimation problem is to test goodness of fit to a hypothesized distribution based on the pooled data.
Using the estimate of λ obtained from the estimating equations (3) with θ = (µ, σ², λ)^T and the moments given in (11) and (12), we may test the fit of the underlying data to a desired distribution of X, based solely on the pooled data. For example, testing for goodness of fit to a lognormal distribution based on the pooled data is accomplished simply by testing H0: λ = 0 vs. H1: λ ≠ 0, using the asymptotically standard normal test statistic Z = λ̃/s, where s is the estimated standard error of λ̃ based on the asymptotic distribution derived in §3.
4.3.2 Two-sample extension
The Box-Cox transformation family has been used in receiver operating characteristic (ROC) curve analysis to evaluate the accuracy of a medical diagnostic test or biomarker that yields continuous outcomes (Faraggi and Reiser, 2002; Zou and Hall,
2000, 2002). A key assumption to warrant the use of the Box-Cox transformation
theory in such analysis is that the transformation parameter λ be the same for both
diseased and non-diseased outcomes. Below we propose a method of testing this key
assumption based on the pooled data.
We observe two independent pooled samples from np diseased and mq nondiseased subjects, $X_j^* = \sum_{i=(j-1)p+1}^{jp} X_i / p$ ($j = 1, \ldots, n$) and $Y_k^* = \sum_{i=(k-1)q+1}^{kq} Y_i / q$ ($k = 1, \ldots, m$). The individual Xs and Y s are not observed. We assume that for certain unknown parameters $\lambda_X$ and $\lambda_Y$, $(X^{\lambda_X} - 1)/\lambda_X$ and $(Y^{\lambda_Y} - 1)/\lambda_Y$ both follow (truncated) normal distributions. The null hypothesis to be tested in this situation is $H_0: \lambda_X - \lambda_Y = 0$ vs. $H_1: \lambda_X - \lambda_Y \neq 0$.
We may test this hypothesis in the following manner. Let $\tilde\lambda_X$ and $\tilde\lambda_Y$ be the estimates of $\lambda_X$ and $\lambda_Y$, respectively, obtained by solving the estimating equations (3), and $s_X^2$ and $s_Y^2$ be their estimated variances derived from the methods described in §3. We then reject $H_0$ at significance level α if $|\tilde\lambda_X - \tilde\lambda_Y| > z_{1-\alpha/2}\, s$, where $s = (s_X^2 + s_Y^2)^{1/2}$.
If we do not reject the null hypothesis, we may then feel comfortable combining the two estimates into a weighted estimate of the common value of λ, using the common estimate as the true λ to estimate µ and σ² for each group, and then proceeding with the analysis as proposed by Zou and Hall (2000, 2002).
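Once the estimates and standard errors from §3 are in hand, the two-sample test reduces to a few lines; a sketch (function name ours):

```python
from scipy.stats import norm

def test_common_lambda(lam_x, se_x, lam_y, se_y, alpha=0.05):
    """Test H0: lambda_X = lambda_Y from the pooled-data estimates and their
    standard errors; returns the Z statistic, p-value, and reject flag."""
    s = (se_x**2 + se_y**2) ** 0.5
    z = (lam_x - lam_y) / s
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value, abs(z) > norm.ppf(1 - alpha / 2)
```

The same construction with `lam_y = 0` and `se_y = 0` gives the one-sample test of §4.3.1.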
4.3.3 General goodness-of-fit
The goodness-of-fit test based on an estimate of λ in §4.3.1 is technically valid only
under the assumption that the true distribution is actually a member of the Box-Cox
family. We may also adapt other readily available techniques to test the distributional
assumptions without assuming membership in the Box-Cox family.
We will assume that the two distributions, Fθ of X and Fθ∗ of X ∗ , are uniquely
determined by each other, which then implies that testing the hypothesis that the
unobserved data X follow the distribution Fθ is equivalent to testing the hypothesis
that the observed pooled data X ∗ follow the distribution Fθ∗ . While there are certain
exceptions to this uniqueness characterization, we suspect that it holds for most of the
distributions actually used in practice. We comment on this further in §6.
One simple method is to draw a Q-Q plot of the pooled data versus a hypothesized distribution F* of X*. Suppose we want to test the hypothesis that a distribution
Fθ generates the unobserved data from which the pooled data are observed. Using
the moment based technique we have developed in the previous sections, we obtain an
estimator θ̃ of θ based on the pooled data Xj∗ , (j = 1, . . . , n).
If the quantiles of F* are difficult to compute, which is the case in general, we may generate a large number of observations from Fθ̃ and group them into sets of size p, yielding a large sample from Fθ̃*. We then plot the quantiles of this large-sample empirical distribution against the quantiles of the empirical distribution of the observed data, and check for linearity in the plot.
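A sketch of this simulated Q-Q construction is given below; the sampler `draw_x` and the plotting-position convention are our assumptions.

```python
import numpy as np

def pooled_qq_points(x_star, draw_x, theta_t, p, big=100_000):
    """Quantile pairs for a Q-Q plot of the pooled data against a large
    simulated sample from F*_{theta~}.

    `draw_x(theta, size)` is assumed to draw individual observations from
    F_theta; pooling in sets of size p then mimics F*_theta."""
    sim = draw_x(theta_t, big * p).reshape(big, p).mean(axis=1)
    n = len(x_star)
    probs = (np.arange(1, n + 1) - 0.5) / n        # plotting positions
    return np.quantile(sim, probs), np.sort(np.asarray(x_star))
```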
Another approach would be to use a formal goodness-of-fit hypothesis test. One of the most common goodness-of-fit tests of a parametric assumption is the Kolmogorov-Smirnov type test, based on the statistic
$$D = \sup_x |F_n^*(x) - F_{\tilde\theta}^*(x)|,$$
the largest difference between the empirical and theoretical cumulative distribution functions. The distribution of D under the null hypothesis that the data follow the hypothesized distribution is complicated by the fact that we are using an estimate of θ, and not the true parameter. Hence, the critical regions of the standard Kolmogorov-Smirnov test are not valid.
Based on the results of Romano (1988), the following bootstrap method determines critical values that yield tests with asymptotically correct significance levels.
Step 1. Based on the estimate θ̃, generate a random sample of size np from Fθ̃, and then group it into sets of size p to obtain a pooled sample.
Step 2. Compute the empirical distribution, Fn*, of this sample.
Step 3. Generate the empirical distribution Fθ̃** based on a large sample grouped into sets of size p, as in the Q-Q plot.
Step 4. Calculate $D^* = \sup_x |F_n^*(x) - F_{\tilde\theta}^{**}(x)|$ for this sample.
Step 5. Repeat a large number of times, and use the frequency distribution of D* as the null distribution of D to find the critical region.
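Steps 1-5 might be coded as follows; the sampler `draw_x`, the number of bootstrap replicates, and the reference-sample size are our choices.

```python
import numpy as np

def ks_distance(sample, ref):
    """Sup-distance between the empirical CDFs of two samples."""
    grid = np.sort(np.concatenate([sample, ref]))
    f1 = np.searchsorted(np.sort(sample), grid, side='right') / len(sample)
    f2 = np.searchsorted(np.sort(ref), grid, side='right') / len(ref)
    return np.abs(f1 - f2).max()

def bootstrap_critical_value(draw_x, theta_t, n, p, B=999, big=100_000,
                             alpha=0.05):
    """Simulate the null distribution of D and return the level-alpha
    critical value.

    `draw_x(theta, size)` is assumed to draw individual observations
    from F_theta."""
    ref = draw_x(theta_t, big * p).reshape(big, p).mean(axis=1)     # Step 3
    d_star = np.empty(B)
    for b in range(B):
        pooled = draw_x(theta_t, n * p).reshape(n, p).mean(axis=1)  # Steps 1-2
        d_star[b] = ks_distance(pooled, ref)                        # Step 4
    return np.quantile(d_star, 1 - alpha)                           # Step 5
```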
5. An Example
A population-based sample of randomly selected residents of New York State’s Erie
and Niagara counties, 35 to 79 years of age, was the focus of this investigation. The
New York State Department of Motor Vehicles drivers’ license rolls were utilized as the
sampling frame for adults between the ages of 35 and 65, whereas the elderly sample (age 65 to 79) was randomly selected from the Health Care Financing Administration
database.
A total of 72 men and women were selected for the analyses. Personal history of my-
ocardial infarction and angina pectoris was ascertained by self-reported questionnaire.
Participants were asked if they had been diagnosed with angina pectoris confirmed by
angiogram or with myocardial infarction. Medical charts were reviewed by a physician for outcome verification, and participants with verified diagnoses were classified as having coronary heart disease. Partici-
pants provided a 12-hour fasting blood specimen for biochemical analysis. A number
of parameters were examined in fresh blood samples, including routine Vitamin E lev-
els. We assume that the distribution of Vitamin E concentrations is a member of the
Box-Cox family. Figures 1(a) and 1(b) show the normal Q-Q plots for the original and log-transformed data, respectively.
*** (Insert Figure 1 here) ***
Figure 1. Normal Q-Q plot of Vitamin E concentration.
A lognormal distribution appears to be a reasonable fit to the data. Based on the full (un-pooled) data, the standard Kolmogorov-Smirnov test rejects the normal assumption (p-value = 0.0023) but not the lognormal assumption (p-value = 0.5).
A 95% confidence interval for λ based on the maximum likelihood estimate, which is
obtainable when full data are available, is found to be (−0.6924, 0.3334); this interval contains zero, further supporting the lognormal assumption.
We now randomly group the subjects into groups of 2 and take the average as the
pooled observation. Based on this pooled sample, the moment-based estimate of λ
is 0.1096 with a standard error of 0.5565, yielding a confidence interval of (-0.9811,
1.2003). This again indicates lognormality. Moreover, the simulated Q-Q plot of the pooled data (Figure 2) also supports the conclusion that the data are lognormally distributed. We will therefore proceed under the lognormal assumption.
*** (Insert Figure 2 here) ***
Figure 2. Lognormal Q-Q plot of Vitamin E concentration with pooled data.
For the lognormal distribution, the four required central moments are given by (Aitchison and Brown, 1957):
$$\mu_1 = \exp\left(\mu + \tfrac{1}{2}\sigma^2\right), \qquad \mu_2 = \mu_1^2 \omega^2,$$
$$\mu_3 = \mu_1^3 \omega^4 (\omega^2 + 3), \qquad \mu_4 = \mu_1^4 \omega^4 (\omega^8 + 6\omega^6 + 15\omega^4 + 16\omega^2 + 3),$$
where $\omega^2 = \exp(\sigma^2) - 1$. Inverting the first two relations, we find that
$$\mu = \log\left\{ \mu_1^2 (\mu_1^2 + \mu_2)^{-1/2} \right\}, \qquad \sigma^2 = \log\left( 1 + \mu_1^{-2}\mu_2 \right).$$
We thus obtain (µ̃, σ̃²) by replacing µ1 and µ2 with their respective moment estimates
$$\tilde\mu_1 = \frac{1}{n} \sum_{i=1}^{n} X_i^*, \qquad \tilde\mu_2 = \frac{p}{n} \sum_{i=1}^{n} (X_i^* - \tilde\mu_1)^2,$$
where the X*s are the pooled observations. Using the explicit formulas for the moments and the asymptotic variance, some straightforward but tedious algebra yields the asymptotic variances and covariance:
$$n \operatorname{Var}(\tilde\mu) \doteq (4p\gamma^2)^{-1} \left\{ \gamma^6 - 8\gamma^4 + 16\gamma^3 + (2p - 11)\gamma^2 - 4(p - 1)\gamma + 2(p - 1) \right\},$$
$$n \operatorname{Var}(\tilde\sigma^2) \doteq (p\gamma^2)^{-1} \left\{ \gamma^6 - 4\gamma^4 + 4\gamma^3 + (2p - 3)\gamma^2 - 4(p - 1)\gamma + 2(p - 1) \right\},$$
$$n \operatorname{Cov}(\tilde\mu, \tilde\sigma^2) \doteq -(2p\gamma^2)^{-1} \left\{ \gamma^6 - 6\gamma^4 + 8\gamma^3 + (2p - 5)\gamma^2 - 4(p - 1)\gamma + 2(p - 1) \right\},$$
where $\gamma = 1 + \omega^2$.
Notice that the above formulas depend only on σ², the variance of the underlying normal distribution, along with the pooling size p. We may plug in γ̃ = exp(σ̃²) to yield consistent estimates of the variances and covariance, and thus construct test statistics and confidence intervals.
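As a sketch, these closed-form lognormal computations take only a few lines (function name ours); applied to the pooled Vitamin E values, this computation should reproduce the corresponding rows of Table 1 below.

```python
import numpy as np

def lognormal_pooled_fit(x_star, p):
    """Moment estimates of (mu, sigma^2) from pooled lognormal data, with
    standard errors from the plug-in asymptotic variance formulas above."""
    x_star = np.asarray(x_star, dtype=float)
    n = len(x_star)
    m1 = x_star.mean()
    m2 = p * np.mean((x_star - m1)**2)            # estimate of mu_2, from (1)
    mu_t = np.log(m1**2 / np.sqrt(m1**2 + m2))
    sig2_t = np.log(1 + m2 / m1**2)
    g = np.exp(sig2_t)                            # plug-in gamma~ = exp(sigma^2~)
    tail = -4*(p - 1)*g + 2*(p - 1)               # terms shared by all three formulas
    var_mu = (g**6 - 8*g**4 + 16*g**3 + (2*p - 11)*g**2 + tail) / (4*p*g**2*n)
    var_s2 = (g**6 - 4*g**4 + 4*g**3 + (2*p - 3)*g**2 + tail) / (p*g**2*n)
    return mu_t, sig2_t, np.sqrt(var_mu), np.sqrt(var_s2)
```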
Applying these formulas we obtain the estimates of (µ, σ²) and their estimated standard errors, for both the pooled and un-pooled data. The results are presented in Table 1. For comparison, estimates based on pooled data with group sizes of 3 and 4 are also given in the table.
Table 1
Estimate (± standard error) of the lognormal mean µ and variance σ².

p    n     µ̃                   σ̃²
1    72    2.6421 (±0.0498)    0.1733 (±0.0373)
2    36    2.6408 (±0.0519)    0.1757 (±0.0465)
3    24    2.6396 (±0.0540)    0.1781 (±0.0545)
4    18    2.6405 (±0.0554)    0.1764 (±0.0603)
To evaluate the performance of these estimates, we also computed the maximum
likelihood estimates µ̂ of µ and σ̂² of σ², using the un-pooled data. It turned out that µ̂ = 2.6466 and σ̂² = 0.1609, with standard errors 0.0473 and 0.0270, respectively.
We observe that, due to the small value of σ², which is common for lognormal data, there is not a great deal of efficiency loss in the moment-based estimates as compared with the maximum likelihood estimates, especially in the estimation of µ. In addition, there is only a small loss of efficiency as we pool the data. This small efficiency loss is in agreement with previous studies on the merits of pooling data, under normal and gamma assumptions, to reduce costs associated with bioassays; see Faraggi et al. (2003), Liu and Schisterman (2003) and Weinberg and Umbach (1999).
6. Discussion
6.1 Comments on Goodness-of-fit Test
In the goodness-of-fit testing problem, we are testing a hypothesis regarding the under-
lying distribution of the unpooled data. It is implicitly assumed that the distribution
of the convolution is in one-to-one correspondence with the distribution of the indi-
vidual observations. This is true under general regularity conditions. For example, a
nonvanishing characteristic function is a sufficient condition for this one-to-one corre-
spondence. For other characterization conditions see Prokhorov and Ushakov (2002)
and the references therein. Regardless of the uniqueness of the characterization, the
type I error of the test is unaffected. However, the test will be unable to detect the dif-
ference between any two original distributions that may yield the identical convolution
distribution. An additional question then arises, if the characterization is not unique,
how different are the generating distributions with respect to an underlying distance
such as Kolmogorov-Smirnoff, or (symmetrized) Kullback-Leibler?
6.2 Other Comments and Further Directions
In this paper we proposed inference on pooled data under parametric assumptions
on the individual observations. We also suggested methods to test these parametric
assumptions. One of the problems inherent in this type of set-based data is created by
the central limit theorem. The pooled data tend to a normal distribution as the pooling
size increases, and even for small to moderate pooling sizes, much of the skewness (and
higher moments) of the original distribution is lost in the set-based distribution. While
the loss in variability is linear in the pooling size, this loss of skewness is quadratic, as can be seen from the moment relationships (1). This hampers the ability to detect differences in distributional shape even for modest pooling sizes.
To our knowledge, the current paper is the first to present a general methodology
for dealing with set-based data under a broad class of parametric distributional as-
sumptions. As the pooling of data becomes a more common procedure, particularly in
the area of evaluation of disease biomarkers, more research on methods to deal with
this form of data needs to be done. For example, under a parametric assumption,
we may be able to use Edgeworth expansions to write out an approximate likelihood
function for the set-based data and proceed via likelihood methods. The accuracy of
inference based on these approximations would be of interest.
Non-parametric methods for set-based data may also be appropriate. A possible alternative to the parametric models proposed in this paper would be an approach based on density deconvolution, which again appears to be technically and computationally challenging.
Acknowledgements
The authors thank W. Jack Hall and Kai F. Yu for helpful discussion and suggestions.
References
Aitchison, J. and Brown, J. A. C. (1957). The Lognormal Distribution. Cambridge: Cambridge University Press.
Barcellos, L. F., Klitz, W., Field, L. L., Tobias, R., Bowcock, A. M., Wilson, R., Nelson, M. P., Nagatomi, J. and Thomson, G. (1997). Association mapping of disease loci by use of a pooled DNA genomic screen. American Journal of Human Genetics 61, 734-47.
Bickel, P. J. and Doksum, K. A. (1981). An analysis of transformations revisited.
Journal of the American Statistical Association 76, 296-311.
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. Journal of the
Royal Statistical Society B 26, 211-52.
Box, G. E. P. and Cox, D. R. (1982). An analysis of transformations revisited, rebutted.
Journal of the American Statistical Association 77, 209-10.
Chen, C. L. and Swallow, W. H. (1990). Using group testing to estimate a proportion,
and to test the binomial model. Biometrics 46, 1035-46.
Doksum, K. A. and Wong, C-W. (1983). Statistical tests based on transformed data.
Journal of the American Statistical Association 78, 411-7.
Dorfman, R. (1943). The detection of defective members of large populations. Annals
of Mathematical Statistics 14, 436-40.
Enard, W., Khaitovich, P., Klose, J., Zollner, S., Heissig, F., Giavalisco, P., Nieselt-
Struwe, K., Muchmore, E., Varki, A., Ravid, R., Doxiadis, G. M., Bontrop, R. E.
and Paabo, S. (2002). Intra- and interspecific variation in primate gene expression
patterns. Science 296, 340-3.
Faraggi, D. and Reiser, B. (2002). Estimation of the area under the ROC curve.
Statistics in Medicine 21, 3093-106.
Faraggi, D., Reiser, B. and Schisterman, E. F. (2003). ROC curve analysis for biomark-
ers based on pooled assessments. Statistics in Medicine 22, 2515-27.
Farrington, C. (1992). Estimating prevalence by group testing using generalized linear
models. Statistics in Medicine 11, 1591-7.
Gastwirth, J. and Johnson, W. (1994). Screening with cost-effective quality control:
Potential applications to HIV and drug testing. Journal of the American Statistical
Association 89, 972-81.
Hinkley, D. V. and Runger, G. (1984). The analysis of transformed data. Journal of
the American Statistical Association 79, 302-20.
Hughes-Oliver, J. M. and Swallow, W. H. (1994). A two-stage adaptive group-testing
procedure for estimating small proportions. Journal of the American Statistical
Association 89, 982-93.
Jin, W., Riley, R. M., Wolfinger, R. D., White, K. P., Passador-Gurgel, G. and Gibson,
G. (2001). The contributions of sex, genotype and age to transcriptional variance
in Drosophila melanogaster. Nature Genetics 29, 389-95.
Kendziorski, C. M., Zhang, Y., Lan, H. and Attie, A. D. (2003). The efficiency of pooling mRNA in microarray experiments. Biostatistics 4, 465-77.
Litvak, E., Tu, X. M. and Pagano, M. (1994). Screening for the presence of a disease by pooling sera samples. Journal of the American Statistical Association 89, 424-34.
Liu, A. and Schisterman, E. F. (2003). Comparison of diagnostic accuracy of biomark-
ers with pooled assessments. Biometrical Journal 45, 631-44.
Prokhorov, A. V. and Ushakov, N. G. (2002). On the problem of reconstructing a
summands distribution by the distribution of their sum. Theory of Probability and
its Applications 46, 420-30.
Romano, J. P. (1988). A bootstrap revival of some nonparametric distance tests.
Journal of the American Statistical Association 83, 698-708.
Sakia, R. M. (1992). The Box-Cox transformation technique: a review. The Statisti-
cian 41, 169-78.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. New York:
Wiley.
Sobel, M. and Groll, P. (1959). Group testing to eliminate efficiently all defectives in
a binomial sample. The Bell System Technical Journal 38, 1179-252.
Sobel, M. and Elashoff, R. (1975). Group testing with a new goal: Estimation.
Biometrika 62, 181-93.
Tu, X. M., Litvak, E. and Pagano, M. (1995). On the informativeness and accuracy
of pooled testing in estimating prevalence of a rare disease: Application to HIV
screening. Biometrika 82, 287-97.
Weinberg, C. R. and Umbach, D. M. (1999). Using pooled exposure assessment to improve
efficiency in case-control studies. Biometrics 55, 718-26.
Zou, K. H. and Hall, W. J. (2000). Two transformation models for estimating an ROC
curve derived from continuous data. Journal of Applied Statistics 27, 621-31.
Zou, K. H. and Hall, W. J. (2002). Semiparametric and parametric transformation
models for comparing diagnostic markers with paired design. Journal of Applied
Statistics 29, 803-16.
[Figure 1 appears here: normal Q-Q plots of Vitamin E concentration. Panel (a) plots VitaminE and panel (b) plots log(VitaminE), each against Quantiles of Standard Normal.]

[Figure 2 appears here: Q-Q plot of the pooled VitaminE data against Quantiles of Lognormal Average.]