
Statistical Science

2017, Vol. 32, No. 2, 293–312


DOI: 10.1214/16-STS584
© Institute of Mathematical Statistics, 2017

Combining Survey Data with Other Data Sources
Sharon L. Lohr and Trivellore E. Raghunathan

Abstract. Collecting data using probability samples can be expensive, and response rates for many household surveys are decreasing. The increasing availability of large data sources opens new opportunities for statisticians to use the information in survey data more efficiently by combining survey data with information from these other sources. We review some of the work done to date on statistical methods for combining information from multiple data sources, discuss the limitations and challenges for different methods that have been proposed, and describe research that is needed for combining survey estimates.
Key words and phrases: Hierarchical models, imputation, multiple frame
survey, probability sample, record linkage, small area estimation.

Sharon L. Lohr is Vice President, Westat, 1600 Research Boulevard, Rockville, Maryland 20850, USA (e-mail: sharonlohr@westat.com). Trivellore E. Raghunathan is Director of Survey Research Center, Institute for Social Research, and Professor of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan 48106, USA (e-mail: teraghu@umich.edu).

1. INTRODUCTION

How can we collect data that give accurate and timely estimates of quantities of interest, and assess the suitability of those estimates for answering research and policy questions? Probability sampling theory was developed beginning in the 1920s and 1930s (Neyman, 1934; Duncan and Shelton, 1992) to provide methods for collecting information efficiently and assessing the error arising from sampling. The early books and papers on probability sampling contrasted it with “judgment sampling,” in which selection of units depends on an interviewer’s or expert’s judgment, and with “convenience sampling,” in which the sample consists of whatever units are conveniently at hand. Many of the probability surveys in current use were launched decades ago, and when launched were often the only reliable source of information on the topic studied.

Probability samples are often tailored to answer the research and policy questions of interest but face a number of challenges. Response rates are decreasing worldwide and the response rate for a typical telephone survey is now less than 10% (Kohut et al., 2012), far from the 95% response rate for mail surveys thought to be achievable by Deming (1950), page 35. Even high-quality face-to-face surveys such as the U.S. National Health Interview Survey (NHIS) have declining response rates, and the NHIS household response rate decreased from 92% in 1997 to 70% in 2015 (National Center for Health Statistics, 2016), with additional nonresponse occurring among individual persons within sampled households. Investigations to date have not found strong relationships between the response rate and bias, at least for some statistics (Groves, 2006), but the declining response rates have contributed to higher costs for data collection. The increased expense of conducting probability samples limits the sample sizes. Hence, reliable estimates for subpopulations of interest may require multiple years of data, if they can be calculated at all, and the estimates may be out of date when they are produced.

Parallel to these developments in the probability sampling arena, large amounts of data are now available in many forms. Traditional administrative sources such as the U.S. Decennial Census, tax records, or lists of recipients of social services continue to be available. Road cameras and satellites provide streams of information about traffic patterns and movements. Electronic health records contain the medical history and diagnosed conditions of large parts of the population. Police agencies post lists of crimes reported to them, sometimes within a day of the reporting. Social media
such as Facebook and Twitter capture expressed sentiments of the participants, and internet search engines track trending search items. Cellular telephone records provide locations of individuals and details of call locations and durations. Credit card records and shopper loyalty cards capture information on financial transactions. Web crawling software gathers information from web pages. Much of this information can be gathered faster and cheaper than data from a probability sample. The large sample sizes of these data sets can provide finer detail on subpopulations than a typical probability sample. Citro (2014) has emphasized the need to rely on multiple data sources, not just data from traditional probability samples, for producing statistics.

The field of statistics now faces opportunities (and, of course, challenges) in developing methods and frameworks to combine survey and nonsurvey data sources to produce estimates, while maintaining a probabilistic framework for drawing inferences of high quality and rigor. Such developments are important because the data sources differ in their quality and suitability for answering research questions, and many of the inexpensive data sources provide convenience samples. The set of income tax records gives a census of the entities filing taxes in a country; however, some entries in tax returns may be incorrect and the records do not include unreported income or nonfiling entities. The tax records also do not contain information on behavioral variables that may be of interest to researchers. Persons without health insurance are underrepresented in electronic health records. Social media capture the expressed views of persons who use the platform, but do not represent nonusers. Administrative records and large convenient data sets might not have the information needed for statistical purposes.

We review statistical methods that have been proposed for combining information from multiple probability samples and other sources to answer research and societal questions. All sources have advantages and deficiencies, and it is desired to leverage the advantages and reduce the deficiencies as much as possible. This goal accords with Deming’s (1950), page 2, holistic view of sampling: “Sampling is not mere substitution of a partial coverage for a total coverage. Sampling is the science and art of controlling and measuring the reliability of useful statistical information through the theory of probability.” We summarize each method, highlight its potential gains and drawbacks, and assess the work done to date with respect to the goals of (1) increasing the precision, timeliness and granularity of estimates and (2) providing accurate estimates of uncertainty.

Probability samples have long used information from other sources whenever possible. Stratification and balanced sampling use auxiliary information in the design, while poststratification and regression estimation use auxiliary information to improve precision of estimates and to attempt to compensate for nonresponse and undercoverage. Section 2 briefly reviews these methods and establishes notation.

Sometimes the information from a survey can be augmented through linking individual records from the survey respondents with other data sets, as described in Section 3. Such linkage requires record identifiers that can be used to match records across sources. Record linkage can be thought of as imputing the auxiliary information from the linked records. Even when records are not linked, models developed on a high-quality data source can be used to impute information for responses of interest in other sources, and these methods are described in Section 4.

In other situations, individual records cannot or should not be linked because of insufficient information, privacy concerns, or lack of overlap among data sources. Many data sources report aggregate statistics and do not release individual records. In these situations, summary statistics can be calculated separately from each source and then combined, often by taking a weighted average of the summary statistics. Section 5 summarizes multiple frame survey methods used to aggregate estimates across data sources.

Sections 6 and 7 describe hierarchical models that can be used to combine estimates across studies. Small area (also called small domain) estimation methods borrow strength from administrative data to obtain estimates in subpopulations where the sample size from the probability sample is too small to produce reliable estimates. Many small area methods combine the data from the survey with predictions from a regression model using covariates from the administrative data, often using a hierarchical model in which the deviation of an area mean from the overall mean is represented by a random effect. Hierarchical models are also used for combining data sources, where the individual records from each data source are nested within data sources.

There are many potential advantages to using multiple data sources. These include being able to obtain information on more parts of the population with finer detail on subpopulations. Using administrative or sensor data can result in substantial cost savings. An additional advantage is being able to use multiple sources in survey design. Section 8 discusses using multiple data
sources to improve sampling frames and make the design of the entire data collection effort more efficient. At the same time, there are also challenges for combining the information. Section 9 describes some of the statistical research needed for combining data sources. We recommend a modular approach, in which different methods may be used with different subpopulations, reflecting the availability of information.

2. MULTIPLE DATA SOURCES IN DESIGN AND CALIBRATION

Most probability samples use information from multiple data sources in the design and estimation as part of standard survey practice. The sampling frame may be constructed using information from a census, and variables in the frame can be used to stratify the sample and determine selection probabilities. A university conducting a survey of its students would have demographic information and information on major and academic performance for every student. Using the frame information in design allows better control of the sample, for example, by specifying a predetermined number of students from each academic division.

A probability sampling design assigns a probability $P(S)$ to each potential sample $S$ that can be selected from the finite population, and these probabilities serve as the basis for inference. The probability that unit $i$ is included in the sample is $\pi_i = P(i \in S)$, and the design weight is $d_i = 1/\pi_i$. Unit $i$ in the sample is considered to represent $d_i$ units in the population, so that the population total of a characteristic $y$ can be estimated by $\sum_{i \in S} d_i y_i$.

Calibration and poststratification, reviewed in Särndal (2007) and Brick (2013), use information from an external data source in the estimation. A vector of auxiliary variables $\mathbf{x}_i$ is known for each unit, $i$, in the sample, and the external data source is assumed to provide the exact value of the population totals for those variables, denoted $\mathbf{X}$. These control totals will be known if the sampling frame has the value of $\mathbf{x}_i$ for every unit in the population, as in a survey of university students, or may alternatively be obtained from an independent external source such as a population census. Calibration constructs adjusted weights $w_i$ that satisfy the calibration constraints $\sum_{i \in S} w_i \mathbf{x}_i = \mathbf{X}$ while minimizing a distance function between the adjusted weights $w_i$ and the design weights $d_i$. Poststratification is a special case of calibration, in which the auxiliary variables are indicators for poststrata such as combinations of age, race, and sex. After the poststratification, the survey estimate for the number of persons in each age/race/sex cell is forced to agree with the control total for that cell.

Calibration, or other weight adjustment methods such as raking (Deville, Särndal and Sautory, 1993) or inverse propensity weighting (Rosenbaum and Rubin, 1983; Lee and Valliant, 2009; Valliant and Dever, 2011), are often used to adjust for nonresponse or undercoverage. The calibration constraints require that estimated population totals for the $\mathbf{x}$ variables, using the respondents to the survey, equal the external control totals $\mathbf{X}$: the calibration removes the bias in the calibration variables. It is hoped that the calibration will remove bias for other variables, too, but that hope is sometimes unfounded. Kohut et al. (2012), for example, found that estimates of civic engagement from low-response-rate surveys are higher than corresponding estimates from high-quality surveys, indicating that the weighting adjustments do not remove bias for these variables in the low-response-rate surveys. Calibration and other weighting adjustments are also sometimes used to attempt to adjust for bias from convenience samples (Baker et al., 2013). In this case, in the absence of known inclusion probabilities, the initial design weights are set to 1 and all weight variation comes from the calibration. Again, it is hoped that calibration removes the self-selection bias, although there is evidence that calibration may be less successful in reducing bias for nonprobability samples (Yeager et al., 2011).

The advantages of using external data sources in design and estimation are well known. Stratification almost always increases precision and allows better control of sample sizes for subpopulations. When the response rate is 100 percent, calibration also usually increases precision. When there is nonresponse, calibration and other weight adjustment methods remove or reduce bias in the $\mathbf{x}$ variables used in the calibration, and it is hoped that they reduce nonresponse bias in other variables as well.

These methods also have disadvantages if the external data sources have errors. A frame constructed from a data source that omits some of the population will have undercoverage. If that same frame is used to provide the control totals for the calibration, then the weight adjustments for the undercovered subpopulation will be too small. Control totals from independent sources may also have undercoverage or other errors. The NHIS, which asks respondents about their cellular and landline telephone usage, is often used to calibrate dual frame telephone surveys, discussed in Section 5. Yet the NHIS is itself a sample with sampling
error and potential nonresponse bias, and the errors in the calibration totals introduce additional uncertainty into the estimates. Renssen and Nieuwenbroek (1997) discussed calibrating two surveys to each other using variables common to both surveys.

When there is nonresponse, the properties of estimators calculated using the calibration weights depend on how well the calibration model captures the structure of the population or the response mechanism. Most published survey estimates report standard errors that are calculated under the very strong assumption that the calibration has removed all of the bias. If that assumption is wrong, then the standard errors understate the uncertainty of the estimates.

3. COMBINING INFORMATION FROM INDIVIDUAL RECORDS

In some cases, data records for individuals can be composited from different sources. This can be done to reduce burden for survey respondents, to fill data gaps, or to check accuracy of information. Record linkage, also known as data matching or entity resolution, merges records from different sources that are believed to belong to the same entity such as a person, household, or business. We give two recent examples.

The Canadian Income Survey informed respondents that Statistics Canada planned to combine the household’s survey information with tax data (Statistics Canada, 2014). The questionnaire for the survey could therefore omit many of the income questions that had been in previous surveys, reducing the length of the questionnaire and allowing deeper exploration of other topics such as employment, housing, and disability. The information from tax returns was also used to adjust for nonresponse through calibration. This is an example of exact or deterministic record linkage (DRL), so called not because the method is always error-free but because the linked records agree on a set of characteristics (in this case, tax identification number) that is deemed to determine unique linkage.

Zolas et al. (2015) combined data from university administrative records on graduate students who received research funding with confidential survey information housed at the Census Bureau. Lacking a unique identifier across all sources, they used probabilistic record linkage (PRL, Fellegi and Sunter, 1969) to link persons in the university databases with Social Security Administration records and Census Bureau information by name, address and date of birth. This linkage then allowed the researchers to study the employment outcomes of the graduate students in the university databases. PRL methods typically calculate a similarity score for pairs of prospective matches using the pattern of agreements, disagreements, and near-agreements among the variables used in linking. A record from source A is linked with a record from source B if the similarity score exceeds a predetermined threshold. A comprehensive review of PRL methods is beyond the scope of this paper, and we refer the reader to the books of Herzog, Scheuren and Winkler (2007), Christen (2012), and Harron, Goldstein and Dibben (2016) for details of how similarity scores may be calculated.

False matches or missed matches can occur in either DRL or PRL when the linkage variables do not uniquely identify entities. Records may have typographical errors or variations (Robert may be the same person as Bob), be out of date, or have insufficient information for unique linkage (multiple persons may have the same name and date of birth, or date of birth information may be missing). Zolas et al. (2015) failed to match 20% of the doctoral recipients in the participating universities. Even small amounts of error in linkage can bias results (Bohensky et al., 2010): for example, graduate students who cannot be linked may be less likely to have found employment. Winkler (2014) reviewed recent research on accounting for linkage error in statistical analyses of linked data.

Bayesian record linkage methods calculate the posterior probability that two records match. The uncertainty about the linkage in the posterior distribution can then be propagated in other analyses. Steorts, Hall and Fienberg (2016) reviewed Bayesian linkage research and considered a formulation in which records from each data set are linked to latent “true” individuals. Under the assumption that the data sets are conditionally independent given the latent individuals, they calculated the posterior distributions of linkages with the latent individuals, which then allowed computation of linkage probabilities among the different data sets that preserve transitivity (i.e., if A matches B and B matches C, then A matches C).

Record linkage can be thought of as a form of imputation, in which the data fields from source B fill in those missing fields for the linked record in source A (Goldstein, Harron and Wade, 2012). In the Canadian Income Survey, the tax records supply the information on income that is no longer collected in the questionnaire.

Statistical matching, sometimes called data fusion, may be done when individual records cannot be linked.
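
As an aside on the probabilistic record linkage described earlier in this section, the following Python sketch illustrates how a similarity score built from agreement patterns and a threshold rule might fit together. It is only an illustration under stated assumptions: the field names, the m- and u-probabilities, the half-weight treatment of near-agreement, and the threshold of 8 are all hypothetical and are not taken from any of the studies cited here. The agreement and disagreement weights use the log-likelihood-ratio form commonly associated with Fellegi and Sunter (1969).

```python
import math
from difflib import SequenceMatcher

# Hypothetical m- and u-probabilities: P(field agrees | true match) and
# P(field agrees | non-match). Real applications estimate these from data.
M_PROB = {"name": 0.95, "dob": 0.98, "zip": 0.90}
U_PROB = {"name": 0.01, "dob": 0.005, "zip": 0.05}


def field_agreement(field, a, b):
    """Return 1.0 (agree), 0.5 (near-agree, names only), 0.0 (disagree),
    or None when either value is missing."""
    if a is None or b is None:
        return None  # a missing value contributes no evidence either way
    if a == b:
        return 1.0
    if field == "name" and SequenceMatcher(None, a.lower(), b.lower()).ratio() >= 0.8:
        return 0.5
    return 0.0


def similarity_score(rec_a, rec_b):
    """Sum of log-likelihood-ratio weights over the linking fields."""
    score = 0.0
    for field in M_PROB:
        level = field_agreement(field, rec_a.get(field), rec_b.get(field))
        if level is None:
            continue
        agree_w = math.log2(M_PROB[field] / U_PROB[field])
        disagree_w = math.log2((1 - M_PROB[field]) / (1 - U_PROB[field]))
        # Assumption: near-agreement is weighted halfway between the extremes.
        score += level * agree_w + (1 - level) * disagree_w
    return score


def link(source_a, source_b, threshold=8.0):
    """Link each record in A to its best-scoring record in B, but only if
    that score exceeds the (hypothetical) threshold."""
    links = []
    for i, rec_a in enumerate(source_a):
        scored = [(similarity_score(rec_a, rec_b), j)
                  for j, rec_b in enumerate(source_b)]
        best_score, best_j = max(scored)
        if best_score > threshold:
            links.append((i, best_j))
    return links


# Toy records, invented for illustration only.
source_a = [{"name": "Robert Smith", "dob": "1980-02-01", "zip": "20850"},
            {"name": "Ann Lee", "dob": "1975-07-12", "zip": "48106"}]
source_b = [{"name": "Ann Lee", "dob": "1975-07-12", "zip": "48106"},
            {"name": "Robert Smyth", "dob": "1980-02-01", "zip": "20850"},
            {"name": "Bob Jones", "dob": "1990-01-01", "zip": "10001"}]

links = link(source_a, source_b)  # list of (index in A, index in B) pairs
```

In this toy example the misspelled surname still links because the date of birth and postal code agree, whereas a nickname such as Bob for Robert would not be caught by string similarity alone. In practice the probabilities would be estimated from the data, and the threshold would be chosen to balance false matches against missed matches.
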
Records, or groups of records, from source B are matched to similar records from source A using variables common to both sources such as demographic information. For example, source A might have information on heart disease for one set of persons and source B might have information on nutritional intake for a different set of persons, but both sources have information on each person’s age, sex, race, ethnicity, and education. By matching records from source A with records from source B that have similar age, sex, race, ethnicity, and education, the analyst can explore relationships between nutritional intake and heart disease. Correlational relationships between the demographic variables and nutritional intake in source B, and between the demographic variables and heart disease in source A, are used to make inferences about the relationship between nutritional intake and health characteristics (Rodgers, 1984; Moriarity and Scheuren, 2001). Of course, such an analysis requires strong assumptions to be made about the comparability of the data sets and the nature of the relationship between nutritional intake and heart disease. A second type of data fusion involves using information from one source to impute variables into another source (Rässler, 2002) and this will be discussed in more detail in Section 4.

When records can be linked across sources with a high degree of accuracy, the linked data sets can provide information on many more variables than would be available from any of the data sources by themselves, and this allows researchers to explore multivariate relationships among these extra variables. Record linkage methods can also be used to augment the number of records in the combined data set, if records that cannot be linked are deemed to be separate entities.

However, it is often difficult to link records accurately, especially when there is little identifying information in the data files. The creation of linked databases also raises concerns about privacy and informed consent, and these issues will be discussed further in Section 9.

4. IMPUTATION

Combining information from multiple data sources naturally fits within the missing data framework given that not all variables are typically measured in every data set. Thus, a standard missing data pattern is obtained when the data sets are concatenated. In addition, many variables in each data set may also be subject to item missing data. Given this scenario, it is not surprising that an imputation-based approach offers a distinct advantage in creating estimates based on combining information from multiple data sources.

In this approach, variables that are missing from a data source are “filled in,” or imputed. Many techniques are available for filling in the missing values (Durrant, 2009; Andridge and Little, 2010; Carpenter and Kenward, 2012), and the goal of all of these methods is to use information available in the survey and other sources to accurately predict missing items. Most of the applications of imputation for combining information across sources have relied on multivariate models to predict and then impute the missing items. Models developed on one data source may be used to impute missing variables in other sources. Alternatively, all records may be concatenated into one large data set and all missing items in the concatenated data may be imputed using one multivariate model or a sequence of regression models.

There are many advantages to being able to impute the missing items. The primary advantages of imputation are the abilities to augment the amount of information available for analysis, and to produce data sets without “holes” in them. Suppose that Survey A provides data on x and y, Study B provides data on y and z, and administrative data provide information about x and z. An imputation model making use of the bivariate relationships estimable from the individual sources can provide information about the relationship among all three variables. Clearly, combining data from these sources provides a means for inferring beyond the scope of each individual study.

Kim and Rao (2012) imputed a variable of interest y in a large survey that does not measure y directly but that does measure covariates x. A second, smaller, survey measures both x and y, and a regression model predicting y as a function of x is fit to the data in this survey. That model is then applied to the x variables in the large survey to obtain imputed values for y. These imputed values are then used together with the weights from the large survey to estimate the population total for y; the standard error of the estimate depends on the sampling variability from the large survey and on the lack of fit of the regression model. This model has the strong assumptions that the x variables for the two surveys measure the same quantity (i.e., there is no measurement error due to mode effects or other sources of incomparability discussed below), and that the regression model developed on data from the small survey applies to the large survey.

Gelman, King and Liu (1998) used multiple imputation to combine information from a series of cross-sectional surveys where some questions are not asked
in some surveys. The particular problem involved combining data from 51 election polls conducted during the six months prior to the election. The goal was to assess the changes in vote intentions over time for different subgroups based on gender, age, party affiliations, etc. Using a hierarchical model to incorporate study differences, a fully Bayesian approach was used to draw values from the posterior predictive distribution of not asked or not answered items conditional on the observed data.

Raghunathan (2006) and Schenker, Raghunathan and Bondarenko (2010) used multiple imputation to correct for possible bias in self-reports of health conditions (such as diabetes, hypertension, or hyperlipidemia) in the NHIS using data from the National Health and Nutrition Examination Survey (NHANES), which collects data using both self-reports and clinical measures. An added advantage of this approach is that the national estimates of undiagnosed health conditions borrow strength from both surveys.

Another example is given in He, Landrum and Zaslavsky (2014), where data from surveys, medical records, Medicare claim data, and cancer registries were combined to study hospice use in terminal cancer patients. All data sources had missing data and the multiple imputation relied on observed data from all sources.

There are a number of challenges when implementing a multiple imputation approach for combining information from multiple survey data sources. For example, surveys usually involve stratification, clustering and weighting for selection and nonresponse. Though each survey may represent the same or a similar population, the complex survey design differences have to be taken into consideration in deriving the combined estimates. The recent work of Dong, Elliott and Raghunathan (2014a, 2014b) proposed “uncomplexing” the survey data by simulating populations from each survey’s data and then combining them using the superpopulation modeling framework. Zhou, Elliott and Raghunathan (2015) extended the approach to the case in which variables in each survey’s data are subject to item missing values.

Estimates based on combining information from multiple data sources are subject to errors due to incomparability as well as issues in modeling of those errors. Early references to address the issue of comparability in pooling data are Bancroft (1944) and Mosteller (1948). The latter is perhaps the first to discuss the bias-variance trade-off in pooling data and lay out conditions for deciding whether to pool or not. Here, we raise five potential sources of incomparability that need to be considered. These are raised in the context of imputation, but also apply to other methods for combining data sources.

A first source of potential incomparability is the differences in the types of respondents and the sources of respondent information. For example, in a household survey, the respondents may be interviewed face-to-face and report health conditions based on memory and recall. The data from the other source may be provided by physicians who may be consulting medical records to check for health conditions.

A second potential source of incomparability may arise due to mode of the interview. For example, one survey may be based on random digit dialing, the second survey may be based on face-to-face interviews, and the third survey may begin with a telephone mode but switch over to face-to-face interviews on a subset. Based on the effect of mode on the measurement of outcomes, pooling may introduce bias, if there is one preferred “gold standard” approach for collecting the information. In the absence of such a gold standard, combining data may be a better reflection of the population quantity since it accounts for differences among sources.

A third source of potential incomparability may arise due to survey contexts. For example, nationally representative data collected by a federal agency that is well known and well publicized may have different response error properties than a survey conducted through a reputable institution that is not as well known. The advance letter that is usually sent may also affect the measurement error properties.

A fourth source of potential incomparability may arise from differences in the survey design. For example, NHIS collects information in an interview setting whereas NHANES collects information in an interview setting but with advance knowledge provided to the respondent that he/she may be selected for medical examination and specimen collection. The recall abilities of the respondents may differ in these two survey settings.

A final source of potential incomparability may arise due to different wordings of the questions asking the same information. Other issues relate to placement of the questions in the survey instrument, protocol differences for the interviewer prompts, and additional questionnaire features.

These and other sources of incomparability affect combining information from multiple survey data sources. If nonsurvey data sources are also brought into
COMBINING SURVEY DATA 299

the mix, a lack of a probability survey framework to assess representativeness can be an additional source of incomparability.

The above discussion may appear to discourage combining information from multiple sources. On the contrary, the advantage of combining information is the ability to address analytic problems beyond the scope of any single survey, and imputation can provide a richness of data unavailable from any single source. Direct estimation techniques may not be applicable and some modeling approach may have to be used to properly harness and pool the information. There are no assumption-free approaches in statistics. The modeling framework provides a means for incorporating the study differences and one or more issues of incomparability in an explicit manner. An explicit modeling framework provides transparency. With limited data and complicated modeling, it is important to consider issues such as covariate selection, features to be incorporated, collection of auxiliary variables, and incorporation of model uncertainty.

5. MULTIPLE FRAME METHODS

In a multiple frame survey, a sample is selected from each of F sampling frames and the estimates from the different samples are combined. The different frames often include different subsets of the population. For example, frame A might cover the entire population of interest, such as the frame for the face-to-face NHIS; frame B might be a set of electronic medical records; frame C might consist of tax records. Some frames might not be well defined in advance, as would occur if the sample from frame D consists of volunteers responding to an internet survey. For some frames, such as electronic medical records or tax records, the frame itself may have the information of interest so that the entire data set may be used rather than sampling from it.

Multiple frame survey methods have several potential advantages. If each data source includes only a part of the population of interest, using multiple sources as frames can give better coverage of the population. Telephone surveys often take one sample from a frame of landline telephone numbers and an independent sample from a frame of cellular telephone numbers; using just the landline (or cellular) frame would exclude persons with exclusively cellular (or landline) telephone service from the survey. Multiple frame surveys can increase precision for little additional cost if data collection is inexpensive for some of the frames. This is particularly beneficial if the population being studied is a small component of the general population. Data collection has already been done for electronic medical records and tax records, and using them can increase the precision for the parts of the population they contain. The large sample size from these sources also provides more information on subpopulations such as persons with rare disorders or taxpayers who hold tax-exempt bonds. Lesser, Newton and Yang (2008) investigated use of lists of individuals belonging to disability organizations as sampling frames in their study of improving public transportation access for persons with disabilities. While these lists do not include all persons with disabilities, they reduce the screening costs incurred if respondents to a random digit dialing survey are asked questions to determine whether they are in the population of interest. However, multiple frame surveys are more complicated than single frame surveys, and must be carefully analyzed to take advantage of the potential increased efficiency and to avoid bias.

The easiest way to use multiple frames, if feasible, is to create a single frame from the different sources before sampling by concatenating the frames and removing duplicates. This is not always possible, however: for a dual frame cellular/landline telephone survey, typically the sampling frames would consist of landline and cellular telephone numbers, and one would not know before sampling whether a person associated with a cellular telephone number also has access to landline service. If a single frame cannot be constructed using the frame information, then an alternative is to take independent samples from the different frames and then combine the data or estimates after sampling.

In multiple frame methods, the union of the frames is assumed to be the population of interest. The overlap of the frames creates overlap sets consisting of the population units accessible through different combinations of the frames. [Footnote 1: The multiple frame literature typically calls these domains rather than overlap sets; however, in this paper we use the term domain to denote a subpopulation of interest.] Using the notation in Hartley (1974), overlap set a consists of the population units in frame A but none of the other frames; set ab consists of the units in frames A and B but not C or D; set abcd consists of the units that could be accessed through each of the four frames. The overlap sets are disjoint and together comprise all of the population units that can be reached through at least one of the frames. If each of the F frames consists of the same set of population
units, as would happen when all frames cover the entire population, there will be one overlap set. If all frames overlap but none has complete coverage, there can be up to $2^F - 1$ overlap sets. With F = 2 frames, there can be 3 overlap sets: units in frame A but not in B (set a), units in frame B but not in A (set b), and units in both (set ab).

Lohr and Rao (2006) and Metcalf and Scott (2009) summarized estimators for combining information from the F samples taken from the frames. The complications arise because units in more than one frame have multiple chances to be selected; in a dual frame survey, the units in overlap set ab can be sampled from either or both frames. If we simply concatenated the data sets without adjusting for the multiplicity, then the individuals in set ab would be overrepresented in the combined samples. The population total in overlap set k can be estimated by a weighted average of the estimated population totals from the individual frames that have observations in overlap set k:

$$\hat{Y}_k = \sum_{f \in k} \lambda_{kf} \hat{Y}_{kf}, \tag{5.1}$$

where $\sum_{f \in k} \lambda_{kf} = 1$. Then the overall population total is estimated by summing the pieces $\hat{Y}_k$ from the distinct overlap sets.

The $\lambda_{kf}$'s can be thought of as adjusting the respondents' weights for the multiplicity that occurs because units can appear in multiple frames. The estimate $\hat{Y}_{kf}$ from each source is assumed to be approximately unbiased after calibration has been performed. The relative importance $\lambda_{kf}$ assigned to source f may be fixed in advance (it is common to use λ = 1/2 for the overlap set ab in dual frame surveys), based on the surveys' selection probabilities (Bankier, 1986; Kalton and Anderson, 1986), or determined so as to minimize the variance of the aggregated estimate (Hartley, 1962; Skinner and Rao, 1996). Chauvet and de Marsac (2014) studied estimators for two-stage dual frame surveys where the two surveys share some of the same primary sampling units.

To apply the appropriate weighting factor $\lambda_{kf}$ to each sampled unit, one must be able to identify which overlap set it belongs to, or at the very least one must know how many sampling frames it belongs to (Mecatti, 2007; Rao and Wu, 2010). We know that a respondent to the NHIS is in the frame for that survey, but do we know whether he or she is represented in the set of electronic health records? In other words, how many times could the same person be represented in the combined data sets? We do not need to be able to link records across surveys, but we do need to know how many chances an individual has to be in the data set.

The overlap sets for a multiple frame survey are often determined by asking respondents about their membership in other frames, and that sometimes introduces measurement error into the determination. For example, respondents to a dual frame cellular/landline telephone survey are usually asked about their relative usage of cellular and landline telephones to receive calls, but that determination may be imprecise. Lohr (2011) showed that dual frame methods can have less precision than using estimates from just one data source if individuals are misclassified into the wrong overlap set; both Lohr (2011) and Stokes and Lin (2015) considered estimators that account for misclassification bias in dual frame surveys.

The additional complexity of multiple frame surveys has implications for nonresponse adjustments. Brick et al. (2011) discussed choosing the compositing factor λ to reduce nonresponse bias. It is often assumed that the weights from all samples are individually preadjusted for nonresponse using methods such as those described in Section 2; if desired, the weights can be calibrated again after the estimators are combined (Ranalli et al., 2016).

Much of the literature on multiple frame surveys assumes that the survey conducted from each frame asks the same questions, and that estimates from the overlap sets from different sources measure the same quantity. In a dual frame survey, this means that the expected value of the estimated population total from overlap set ab is the same for the estimator from frame A and the estimator from frame B. But the sources of survey incomparability discussed in Section 4 apply to multiple frame surveys as well. If the sample selected from frame A is collected using different survey questions, modes, or procedures than the sample from frame B, the estimated population totals in ab may differ because of the procedures or nonsampling error rather than because of sampling variability. These differences are of particular concern for the sources in which data collection is inexpensive, because the estimates may have different measurement error properties than the estimates from the expensive sources. It is important that these nonsampling errors be included in the measures of uncertainty about the survey estimates. Typically, it is recommended that the variance of the multiple-frame estimate for an overlap set k be estimated by summing the variances $\lambda_{kf}^2 V(\hat{Y}_{kf})$ for the components of the weighted sum in that set, but this formula accounts only for sampling error and does not
consider differences that may be due to different survey procedures or questions.

One method for evaluating potential bias is to use multiple frame methods on the different subpopulations, called domains, of the surveys. These domains can be distinct from the overlap sets. For example, in a dual frame telephone survey, the overlap sets would be persons with a landline phone only, persons with a cell phone only, and persons with both landline and cell phone. The domains studied could be different geographic regions or demographic subgroups. This allows the analyst to compare estimates from the different surveys in those domains. Merkouris (2004, 2010) used regression methods to adjust two surveys being combined, using the common variables from the surveys. Merkouris (2010) used regression estimators to combine information from multiple surveys and obtain small domain estimators. He considered the case in which there are multiple surveys of the same population, and calibrated the surveys to each other using variables that are common to both surveys.

Lohr and Brick (2012) considered dual frame estimation when one of the sources is considered to be unbiased but with small sample sizes in domains. The other data sources have larger sample sizes but potentially have differential bias across domains. The relative contributions of sources toward each domain estimator depend on the relative variances and the amount of differential bias. These methods allowed the differences among estimators that arose from nonsampling error to be included in the mean squared error estimates for the domains.

Multiple frame survey methods have great potential for combining information from data sources that are measuring the same quantities. As with all the other methods discussed in this paper, however, they have strong assumptions about the comparability of the data sources, and extending the methods to relax those assumptions is a promising area for research.

6. SMALL AREA ESTIMATION

A pressing need for many policy makers is obtaining estimates of important quantities at small geographic levels such as counties or states, or for a subgroup based on certain demographic characteristics (such as gender, age or race). Many national surveys are inadequate for constructing such estimates because the sample size in many domains of interest is too small, or may even be zero. Combining data from multiple sources provides the only meaningful way to develop estimates for domains, or areas, with small sample sizes.

Small area estimation methods combine information from a survey with auxiliary information from administrative data sources to calculate domain-level statistics. Fay and Herriot (1979) estimated the mean $\theta_d$ in domain d using a weighted average of estimates from two sources. The first estimate is $\bar{y}_d$, which is the estimated mean in domain d calculated directly from the survey. For many small domains, $\bar{y}_d$ is based on a small sample size and is imprecise; for some domains such as large states, however, $\bar{y}_d$ may have high precision. The second estimate uses a regression model predicting $\theta_d$ from domain-level covariates $x_d$ that are available from an administrative data source to obtain prediction $\hat{\theta}_d$. The Fay–Herriot estimator of $\theta_d$ is $\lambda_d \bar{y}_d + (1 - \lambda_d)\hat{\theta}_d$, where $\lambda_d \in [0, 1]$ is larger if $V(\bar{y}_d)$ is small or if the regression model does not fit the data well (and, therefore, does not provide accurate predictions). The Fay–Herriot estimator is thus of the same form as (5.1), combining the direct estimate $\bar{y}_d$ from the survey with a regression prediction based on covariates from an administrative data source. A two-stage model underlies the Fay–Herriot estimator. First, the area-level means from the survey are assumed to follow a distribution with mean $\theta_d$ and sampling variance $\psi_d$, where $\psi_d$ is estimated using the survey design and weights. The second stage relates the $\theta_d$'s to the external-source covariate information through a regression model, $\theta_d = x_d \beta + v_d$, where $v_d$ represents the error in prediction from using the regression model and is assumed to have mean 0.

The U.S. Small Area Income and Poverty Estimates (SAIPE) program (United States Census Bureau, 2016) uses a variant of this method to provide annual poverty statistics for states, counties and school districts. The direct estimates are one-year estimates from the American Community Survey (ACS), and the regression predictions use covariates from the Decennial Census, from tax records collected by the Internal Revenue Service, from the Supplemental Nutrition Assistance Program, and from population estimates. The use of the administrative data sources allows the U.S. Census Bureau to publish poverty statistics for every county and school district each year, even though the sample sizes for many of these areas are too small for the ACS estimate to be published.

When estimates are produced for nested areas of different sizes, it is often desirable to adjust estimates at finer levels of geography so that they aggregate to estimates at coarser levels of geography. In general, the
estimates for larger geographic areas are thought to be more reliable because they have a larger sample size from the survey and rely less on the model-based predictions, which depend on model assumptions. The SAIPE program's state-level estimates of the number of children in poverty are ratio-adjusted so that they sum to the national estimate of the number of children in poverty that is calculated from the ACS. The county estimates within a state are also ratio-adjusted to sum to the state estimate, and the school district estimates are similarly benchmarked to the county estimates. In this way, the estimated counts of children in poverty are consistent across school districts, counties, states, and the nation as a whole. Datta et al. (2011) reviewed benchmarking methods for small area estimates, and proposed a class of Bayesian small area estimators that constrain a weighted average of the posterior means to equal prespecified estimates. Pfeffermann and Tiller (2006) and Hyndman, Lee and Wang (2016) described methods that may be used to benchmark time series.

The Fay–Herriot model makes use of statistics computed for each area using the sampling weights from the survey, and uses individual records only through the area-level summaries. A unit-level small area model (Battese, Harter and Fuller, 1988) may be used when covariates are available for each population unit. A hierarchical model is used for the individual responses of survey participant j in area d:

$$y_{dj} = x_{dj} \beta + v_d + e_{dj},$$

where the area-specific random effects $v_d$ are assumed to have mean 0 and variance $\sigma_v^2$, and the individual-level errors $e_{dj}$ are assumed to have mean 0 and variance $\sigma_e^2$. In this hierarchical model, individual respondents from an area are considered to be nested in that area. Rao and Molina (2015) provided a comprehensive description of models commonly used in small area estimation, including empirical Bayes, hierarchical Bayes, time series and spatial models. For most of these models, the x information is assumed to be measured exactly, and different methods are needed if the x information comes from another survey or a source with differential measurement error.

Ybarra and Lohr (2008) used a Fay–Herriot-type model, accounting for measurement error in the covariates, to estimate mean body mass index (BMI) for age/race/sex domains. NHANES calculates BMI from direct measurements of height and weight, and thus is thought to be more accurate than the measure of BMI from the larger NHIS that is calculated from self-reported weight and height. The measurement error models accounted for the sampling error in NHIS both in the calculation of $\lambda_d$ (which was smaller if the NHIS estimate had higher variance) and in the mean squared error of the small area estimates. You, Datta and Maples (2014) used a bivariate Fay–Herriot model to incorporate the error from multiple sources when estimating disability.

Elliott and Davis (2005) and Raghunathan et al. (2007) further developed small area estimation by combining data from two surveys, NHIS and the Behavioral Risk Factor Surveillance System (BRFSS). Elliott and Davis (2005) used a model-assisted framework to match the respondents in the two surveys using propensity score methods and then used the combined data to develop Fay–Herriot-type estimates. On the other hand, Raghunathan et al. (2007) used an explicit Bayesian hierarchical model framework to model NHIS, which was assumed to provide unbiased estimates for telephone and nontelephone households, but for only a few counties. The NHIS was combined with BRFSS data, which provides biased estimates for the telephone households but for all the counties. Using auxiliary county and state level covariates, the estimands (the population-size-weighted county-level population means of telephone and nontelephone households) were simulated from their posterior distribution using Markov Chain Monte Carlo methods. The arcsine square root transformation was applied to the county level direct estimates, in part to simplify the modeling by stabilizing the variances of the outcomes. However, the theory behind the variance stabilizing properties of the arcsine square root transformation is a large-sample theory, and thus the transformation might be less effective for some of the counties in the project that have sparse samples. To avoid making large sample approximations, the logit-normal model was also used, which resulted in similar estimates but with an enormous increase in computational time and complexity. Current work is considering small area estimates by combining three different subpopulations within each area: households with a landline (with or without cell phones), nontelephone households, and cell-only households. Chen, Wakefield and Lumley (2014) reduced the computational complexity for Bayesian hierarchical small-area models by using an integrated nested Laplace approximation. Mercer et al. (2014) incorporated spatial random effects in models estimating smoking prevalence at the zip code level from BRFSS, and compared different model structures in a simulation study.
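A small numerical sketch may help fix ideas about the area-level Fay–Herriot composite estimator described earlier in this section. It assumes that the variance of the model errors $v_d$ (called `sigma_v2` below) is known, whereas in practice it must be estimated. The function name, argument names, and toy numbers are illustrative only and are not taken from any of the papers cited here. The weight $\lambda_d = \sigma_v^2 / (\sigma_v^2 + \psi_d)$ used in the sketch is the usual shrinkage weight under the two-stage model; it is large when the sampling variance $\psi_d$ is small or when the regression model fits poorly.

```python
import numpy as np

def fay_herriot_composite(y_bar, X, psi, sigma_v2):
    """Illustrative area-level composite estimates (not a production method).

    y_bar    : (D,) direct survey estimates for the D domains
    X        : (D, p) domain-level covariates (include a column of ones)
    psi      : (D,) sampling variances of the direct estimates
    sigma_v2 : variance of the model errors v_d, ASSUMED KNOWN here
    """
    y_bar = np.asarray(y_bar, dtype=float)
    X = np.asarray(X, dtype=float)
    psi = np.asarray(psi, dtype=float)

    # Weighted least squares for beta, with weights 1 / (sigma_v2 + psi_d).
    w = 1.0 / (sigma_v2 + psi)
    XtW = X.T * w                      # scales each domain's column by its weight
    beta = np.linalg.solve(XtW @ X, XtW @ y_bar)

    # Regression prediction for each domain.
    theta_reg = X @ beta

    # Shrinkage weight: near 1 when the direct estimate is precise.
    lam = sigma_v2 / (sigma_v2 + psi)

    # Composite of the direct estimate and the regression prediction.
    theta_fh = lam * y_bar + (1.0 - lam) * theta_reg
    return theta_fh, lam, beta

if __name__ == "__main__":
    # Toy data: five domains, one covariate plus an intercept (made-up numbers).
    X = np.column_stack([np.ones(5), [0.2, 0.5, 0.9, 1.4, 2.0]])
    y_bar = np.array([10.3, 11.1, 12.4, 12.9, 14.8])
    psi = np.array([0.05, 0.40, 2.50, 0.10, 1.00])
    theta_fh, lam, beta = fay_herriot_composite(y_bar, X, psi, sigma_v2=0.5)
    for d in range(len(y_bar)):
        print(f"domain {d}: direct={y_bar[d]:.2f} lambda={lam[d]:.2f} composite={theta_fh[d]:.2f}")
```

Domains with a large sampling variance receive a small weight on the direct estimate and are pulled toward the regression prediction, which is the borrowing of strength that makes estimates possible for sparsely sampled areas.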
Small area methods use multiple sources to augment the information available at the domain level. As with imputation and multiple frame methods, this augmentation requires the use of model assumptions and we refer the reader to Rao and Molina (2015) for discussion of model misspecification in small area estimation.

7. HIERARCHICAL MODELS FOR COMBINING DATA SOURCES

In the hierarchical models used in Section 6 to obtain small area estimates, random effect terms are used to model the means of different domains. Hierarchical models can also be used to synthesize data from multiple sources: in this usage, random effect terms represent the means from different data sources, and individual data records from the studies (if available) are nested in the studies. The problem is structurally similar to that of random effects models used for meta-analysis (Sutton and Higgins, 2008), in which summary statistics from different studies are assumed to come from a normal distribution with mean θ, and a weighted average of the summary statistics from the different studies is used to estimate the underlying effect size of the treatment. The weights may be inversely proportional to variances, or experts' judgments may be used to assess the quality of the studies and downweight studies with lower quality (United States General Accounting Office, 1992; Turner et al., 2000; Greenland, 2005).

A number of models have been proposed that combine summary statistics (usually means) from different studies. Methods that rely upon summary statistics do not require access to the individual data records that comprise the studies, and thus can be used when access to the individual data records is restricted. The mean and its estimated variance for the subpopulation of interest is calculated separately for each data source using the design and the nonresponse-adjusted weights for that source.

Manzi et al. (2011) used a Bayesian analysis to estimate $\theta_d$, the smoking prevalence in local area (domain) d, for each of 48 local areas in England. Prevalence estimates were available from seven different studies, but these studies differed in methodology and quality, and there was concern that estimates from some of the studies could be biased. The estimated prevalence for domain d from data source j, $u_{dj}$, is assumed to follow the model

$$u_{dj} = \theta_d + \delta_{dj} + e_{dj}, \tag{7.1}$$

where the bias $\delta_{dj}$ is assumed to follow a normal distribution with a source-specific mean (the mean bias for source j) and variance $\tau_j^2$. The error term $e_{dj}$ is the sampling error for the estimate $u_{dj}$, assumed to have mean 0 and variance $\sigma_{dj}^2$, where the variance is calculated from the sample design. Note that the model in (7.1) is similar to the Fay–Herriot model for small area estimation, with the additional feature that the model-based deviation component is allowed to have a nonzero source-specific mean rather than mean 0. Many of the properties of the estimates of $\theta_d$ depend on the constraints put on the mean bias from source j. Manzi et al. (2011) adopted a vague prior distribution for the mean bias of each source, and constrained the mean prevalence over all domains to equal the prevalence estimate from the UK General Household Survey, which was considered to be highly accurate. Alternatively, one of the sources could be considered to be a gold standard with zero bias. Turner et al. (2009) discussed a framework for eliciting prior information on bias for multiple data sources.

Many hierarchical models that treat different studies as random effects incorporate the between-source variability into the measures of uncertainty. Thus, the estimate of smoking prevalence obtained from combining multiple studies may have larger standard error than an estimate constructed from one probability sample. The standard error of the estimate from a single probability sample includes the within-survey error, while the standard error of the estimate obtained by pooling surveys also includes the between-survey error.

Wang et al. (2012), Nandram, Berg and Barboza (2014), and Cruze (2015) discussed hierarchical Bayesian methods for combining information from multiple repeated surveys to obtain benchmarked estimates of state and regional crop yields in the United States. The quantity of interest is the true annual yield in year t, denoted as $\mu_t$, and it is desired to estimate $\mu_t$ at different time points in the growing season. The first survey takes monthly field measurements, including acres planted, from a sample of sites in states that are the top producers for the crop being studied. The second survey is a monthly national interview survey asking farmers to estimate their expected yields for a range of crops. The measurements from the first and second surveys are taken throughout the growing season. Estimates of $\mu_t$ derived from these two sources tend to be biased; however, the biases are assumed to be consistent across years, and depend only on number of months before harvest. The third national survey occurs after harvest, and asks farmers about yield of
different commodities as well as other quantities. Because the third survey has large sample size and occurs after harvest, it is considered the gold standard for yield estimates, but it is not available for making pre-harvest estimates in the current year. The model uses the historical relationship between the gold standard estimate and the monthly estimates of crop yield from the other two surveys, so that the accruing information from the first two surveys can be used to update the forecast yield $\mu_t$ for the current year. The estimated crop yields for month m, year t, and survey j are assumed to be normally distributed with mean given by $\mu_t$ plus a bias term for the first two surveys that varies by survey and month. The bias term is assumed to be zero for the gold standard survey. The posterior distribution of $\mu_t$ for the current year is calculated by conditioning on the data available at the time of the forecast and including covariates such as weather information. This model uses the estimates of bias from the first and second surveys from previous years to adjust the current-year forecast for those biases. The posterior mean for crop yield is a weighted average of the bias-corrected estimates from the first and second surveys, the information from the third survey when available, and predictions using covariate information, with higher weights assigned to more precise sources. This methodology allows biased surveys to be used to produce more accurate and timely estimates of crop yield, along with measures of uncertainty.

The models discussed above combine summary statistics to improve the precision of estimates. Other studies have combined individual records with aggregated statistics through a hierarchical model. Wakefield and Salway (2001) presented a framework for using aggregated data, with attention to potential bias coming from variability of covariates in the different areas; Wakefield (2004) argued that sometimes informative priors are needed when fitting hierarchical models using aggregated data. Jackson, Best and Richardson (2008) used a hierarchical logistic model to study the risk of hospital admission for heart and circulatory disease. They had individual-level data on risk-behavior and socioeconomic covariates, and the outcome of hospital admission from the Health Survey for England; individual-level data on covariates from the UK census; and aggregate counts of hospitalization, and socioeconomic covariates, at the ward and district level. The individual-level logistic model had terms for area-level covariates; the model for aggregate-level data relied on the summary statistics for different areas as well as the within-area variability in covariates.

Finucane et al. (2014) combined information using a hierarchical Bayesian framework to estimate trends in mean systolic blood pressure for different countries. They had surveys and other data sources, of varying quality, from almost 200 countries. Some data sources contained individual records, while others only had summary statistics; some were rigorous national probability samples with high response rates, while others were less representative community studies. The hierarchical model used the estimated mean and standard deviation from each data source and year as input. Random effects terms captured the study-level heterogeneity. Finucane et al. (2014) used an informative prior distribution to account for the quality of the data sources; they constrained the variances of the different terms so that national probability samples were assumed to have lower model variance than regional studies, which in turn had lower model variance than community studies. This did not model the bias explicitly, but resulted in the community studies that were thought to be less reliable having less influence on the estimates of health characteristics. Finucane et al. (2015) used related methodology to estimate the distribution of child malnutrition for different countries.

The Malaria Atlas Project (Bhatt et al., 2015) employed a Bayesian hierarchical model to study the infection prevalence of the malaria-causing parasite Plasmodium falciparum in sub-Saharan Africa from 2000 to 2015. Data sources included community-level measurements of the parasite rate from published literature (see http://www.map.ox.ac.uk/explorer/), national household surveys, and historical records that provided environmental covariates (such as temperature, surface wetness, and population) at a 5 km × 5 km spatial resolution. Spatial and temporal correlations were included in the model through a Matérn covariance function and first-order autoregressive terms. This model allowed the investigators to include uncertainty that arose from small sample sizes, information on observed clinical incidence rates, and the estimated parameters in the posterior distribution predicting prevalence.

The Global Burden of Disease Study used similar hierarchical Bayesian models to combine data from thousands of epidemiological sources as well as available national surveys in approximately 200 countries in order to study levels and trends of disease incidence, prevalence, and mortality (Vos et al., 2015; Wang et al., 2016). The hierarchical Bayesian population reconstruction method described by Wheldon et al. (2016) reconciled census counts with population
projections based on vital rates. The prior information came from expert opinion about the relative error of the data sources, and the methodology provided a mechanism for assessing biases in the different data sources.

Sweeting et al. (2008) used hierarchical models (see also Ades and Sutton, 2006) to evaluate the consistency of data sources (one capture-recapture study on intravenous drug users, four national household surveys asking about drug use, medical clinic data, blood donation records, and testing data) for estimating the prevalence of hepatitis C. The model included parameters for the bias from each source. They re-estimated the model, leaving each data source out in turn, to investigate whether omitting sources changed the estimates.

Although it does not use an explicit hierarchical model, Brick's (2015) design-based framework for compositing multiple surveys is related to this work. Each source is weight-adjusted, using poststratification or inverse propensity weighting, and the variability among sources is used to estimate the variance of the mean estimated from the sources.

Strauss et al. (2001) studied hierarchical models to estimate the relationship between residential lead exposure and children's blood lead levels. NHANES provided information on blood lead concentrations, but not exposure; the U.S. Department of Housing and Urban Development (HUD) National Survey on Lead-Based Paint in Housing had exposure information but no information on lead levels in children. They used a third source that related lead exposure and blood lead concentration (but only for Rochester, New York). An additional complication occurred because the Rochester study measured lead exposure differently than the HUD study. The authors assumed that the true value of the lead exposure level was a latent variable, and modeled the Rochester and HUD lead exposure us-

(2014) extended propensity-based matching methods to complex surveys.

One challenge when combining individual records from different sources is how to treat the survey weights from individual sources (Rao et al., 2008). When summary statistics are combined, the individual-source survey weights are used to calculate the means for each domain and then a weighted average is taken of these means. When combining individual records across data sources, two sets of weights are used: (1) the weights used to generalize each survey to its population, described in Section 2, which are based on the inverse of the selection probabilities with adjustments for nonresponse and calibration; and (2) the relative contribution of each individual source toward the combined estimate. In multiple frame methods, the weights within each overlap set are multiplied by the value of λ for that frame and overlap set to account for the multiplicity. A similar method could be used with hierarchical models, but more research on this topic is needed. Korn and Graubard (1999), Chapter 8, discussed the calculation of weights when pooling multiple surveys.

Hierarchical models have many advantages for combining data. As with imputation methods, they provide a transparent model framework for combining the information with explicit assumptions. The model assumptions can be strong, however, and the measures of uncertainty, while accounting for variability among sources, often do not account for potential model deficiencies.

8. DESIGNING STUDIES TO LEVERAGE MULTIPLE DATA SOURCES

The increasing availability of multiple data sources opens up new options for survey design. As described in Section 5, data sources may have information on dif-
ing covariates available in both sources. Adopting the ferent but overlapping parts of the population. Elec-
strong assumption that the exposure/blood lead rela- tronic medical records might provide information on
tionship found in Rochester held nationally, the model persons who have used certain medical services, but
allowed the researchers to predict a national distribu- other sources are needed to provide information about
tion of blood lead in children. the health characteristics of persons who have not used
There has been a great deal of work in biostatistics those medical services. A data source may provide ac-
on pooling information from different studies. Pocock curate information for some populations, but may be
(1976), Raghunathan (1991), and Prentice et al. (1992) thought to have bias for other subpopulations.
pooled information from a randomized trial with retro- With multiple sources available, the goal of the de-
spective data from historical controls, using Bayesian sign is to leverage the strengths of each source to pro-
methods to model potential bias in the historical con- vide an accurate picture of the population and of sub-
trols. Stuart (2010) reviewed research on methods that populations of interest. In this section, we consider the
may be used to match treatment and control groups situation in which administrative data sources are avail-
across multiple sources. Dugoff, Schuler and Stuart able for some subpopulations and it is desired to use
those sources when designing a probability sample that will (1) provide a check on the accuracy of the other sources for variables of interest and (2) provide accurate information on subpopulations that are underrepresented in the administrative sources. There is a danger that all administrative sources may undercover the same subpopulations: for example, persons without health insurance may be missing from electronic health records and from insurance records. The survey design needs to capture the subpopulations underrepresented in other sources.

The administrative data sources may be used in several ways during the design process. First, they may be used when constructing the frame. Section 5 discussed combining estimates from multiple frame surveys when the information could not be consolidated prior to sampling. But of course in some situations, the information from the sources can be linked and consolidated to form a better sampling frame with rich auxiliary information. This auxiliary information may be used to improve the efficiency of the stratification of the sample, or may be used in conjunction with balanced sampling (Valliant, Dorfman and Royall, 2000). This also provides higher quality information for surveys of particular subpopulations. If it is desired to take a survey of persons with asthma, the data sources may provide better information for screening eligibility of the sample.

A second use of the information from other sources is to provide contextual variables for the survey. Nachman and Parker (2012) linked respondents from the NHIS to information from the U.S. Environmental Protection Agency AirData system to study the relationship between exposure to pollutants and outcomes such as asthma and bronchitis. They linked the latitude and longitude of the survey respondent to the kriged prediction of fine particulate matter at that latitude and longitude. This linkage provided important contextual variables for interpreting the NHIS data.

Third, the administrative data may provide information for dealing with nonresponse in the survey. If survey records can be linked, the administrative data may be used to impute information for nonrespondents. Tax records, for example, could be used to impute missing income information for nonrespondents to the survey.

The information from the administrative sources may also provide valuable information for nonresponse assessment and follow-up in surveys. Adaptive (also called responsive) survey design often uses information from multiple sources (Groves and Heeringa, 2006; Wagner and Raghunathan, 2007; Wagner et al., 2012; Tourangeau et al., 2017) to modify the protocol for survey data collection while in the field. These methods often use paradata (data about the process of collecting the survey data, such as number of contact attempts or neighborhood observations) to adapt the survey design. Data from external sources such as sensor data could also be used for these design modifications.

Smith (2011) reported the recommendations of an international workshop on using auxiliary data to detect and adjust for nonresponse bias in surveys. Auxiliary data from population registers, linked databases, the sampling frame, or paradata can provide case-level information for assessing potential nonresponse bias, while independent population estimates from censuses or high-quality surveys such as the ACS can be compared with survey estimates. The report noted, however, that adding more auxiliary data “increases the likelihood of deductive disclosure and thus potentially undermines confidentiality” (Smith, 2011, page 395).

Fourth, the entire data collection can be designed to make use of the multiple sources of data. If the records from different sources can be linked and merged before sampling to construct a rich sampling frame, then the sample can be allocated optimally using stratification or balanced sampling. Thus, if frame A is nearly complete but expensive to sample, frame B is incomplete but less expensive, and the frames can be combined before sampling, then the design can specify obtaining the information from overlap sets b and ab from frame B, and only using the expensive frame A to collect information on overlap set a.

If the population source information is unknown before sampling, however, the design needs to ensure that all parts of the population are represented in the sample. Hartley (1962) derived the optimal sample sizes along with the optimal compositing factor λ for dual frame designs where overlap-set membership is unknown in advance. When frame A is nearly complete but expensive to sample, and frame B is incomplete, a larger sample size should be taken from frame A when: (1) the cost per unit is higher in frame B, or (2) a larger proportion of the population is in overlap set a, and thus cannot be sampled from frame B. Lohr and Brick (2014) found that for many cost structures it made economic sense to use a two-phase screening survey for the expensive frame A, where the interview was terminated after determining that the unit was also in the less expensive frame B. This is of course less efficient than if the frame membership is known before sampling, because extra effort must be expended to obtain
screening interviews for persons sampled from frame A whose data are not used in the estimation.

We recommend a modular approach to survey design, in which the design makes use of the different information sources available for different parts of the population. With data collection planned to take advantage of administrative sources, the survey design can concentrate on parts of the population less represented in other sources.

9. OPPORTUNITIES

All of the methods for combining information reviewed in this paper have strengths and shortcomings. Linking records allows the most efficient use of information, but accurate linkage is not always possible and linkage can raise privacy concerns. Imputation can allow use of data sources that contain some but not all of the variables of interest by imputing the missing variables through multivariate relationships determined from sources that have the other variables, but the imputation model assumptions are strong: if the imputation model is developed on a source that has different relationships than the source where the imputation is applied, then the imputed values may be misleading. Multiple frame methods allow information from many sources to be composited, but require accurate information about the frame membership of the sampled units. Hierarchical models are powerful tools for combining information from surveys, but a big challenge with these methods is accounting for bias from different sources. All of the methods other than deterministic record linkage rely on models, and the results need to be investigated for sensitivity to those models.

The survey designer and analyst may wish to use different methods for different subpopulations, reflecting the availability and quality of sources available for those subpopulations. In the United States, most persons aged 65 and over are on Medicare, and so are represented in the records of the Centers for Medicare and Medicaid Services (CMS). It may be possible to link those records with records from electronic health records to obtain more detailed information about that subpopulation. Younger persons, however, are less likely to be in the CMS records and for those subpopulations hierarchical modeling or multiple frame methods may be needed. Thus, we see the problem of combining information from different sources as a mosaic, where different sources contribute to constructing the entire picture.

Much of the literature on meta-analysis and on combining surveys discusses having higher reliance on “high-quality” data sources when they are available, and downweighting the contributions of low-quality data sources. This raises the question of how to determine the quality of a data source. Berlin and Rennie (1999) listed qualities of well-designed, high-quality clinical trials. Citro and Straf (2013) and the American Association of Public Opinion Research (2015) gave characteristics of high-quality surveys but did not provide metrics for quantifying survey quality. The development of metrics for the quality of estimates from different data sources, going beyond sampling variability to consider measurement error, nonresponse bias, and stability over time, is a crucial area for research.

Estimates calculated from different sources are often further apart than can be explained by the sampling error of the respective sources. These extra differences are often due to nonsampling errors such as undercoverage, nonresponse, different question wording or modes, and measurement errors, as discussed in Section 4. Comparing estimates from different sources, then, can provide a measure of uncertainty that includes some of the nonsampling error. Some of the hierarchical models discussed in Section 7 incorporate the estimated bias from sources into the posterior uncertainty about the parameters. These do not always capture all of the potential sources of bias, however, and more research is needed in this area.

Another area for research is the use of multiple sources to improve nonresponse bias assessment and adjustment. The standard practices of calibration and poststratification make use of a single external source, considered to be a gold standard, to adjust weights of respondents so that survey estimates conform to the external population totals. In the absence of a single gold standard, however, it may be possible to use the information from different sources to calibrate survey data. In related work, it may be possible to use multiple sources of data in adaptive design or to assign protocols dynamically.

We discussed linking records among sources that have identifying information. Such linkage raises concerns about the confidentiality of respondents’ data. The information contained in a single data source might be insufficient to identify an individual, but the extra variables contained in the linked sources may increase the chances of disclosure. Fellegi (1999), page 6, described record linkage as “intrinsically intrusive of privacy.” Daas et al. (2015) discussed concerns about privacy that can arise when using large
nonsurvey data sources for official statistics. Sometimes, privacy concerns can be lessened if aggregated statistics are combined instead of linking individual records, although these methods too can compromise the confidentiality of individuals’ or subpopulations’ information. Duncan, Jabine and de Wolf (1993) provided guiding principles for balancing the needs of data access with the need to protect the confidentiality of survey respondents’ information. Many of the statistical techniques for reducing disclosure risk in that report, however, were conceived in an era in which fewer data and less sophisticated identification techniques were available. More research is needed on privacy-preserving methods for releasing data; the differential privacy framework of Dwork (2011) can provide a mathematical foundation for such work (Machanavajjhala and Kifer, 2015). One possible area for research is on use of hierarchical models to obtain aggregate statistics that protect privacy.

In this article, we have concentrated on statistical methods for combining information. An important factor not discussed here is the issue of obtaining consent from participants to have their information combined with information from other sources. Several of the studies we cited (e.g., Nachman and Parker, 2012; Zolas et al., 2015) linked records across multiple sources. When should consent be obtained from survey respondents for their survey-provided information to be linked with other sources? Even if the information released from the analysis is in the form of aggregated statistics, the linkage creates a database that could potentially be obtained by hackers.

Record linkage and other methods for combining information across sources also raise questions about data ownership. Does a college student own the data about her test scores, class attendance, analytics from online classes, library usage, and cafeteria purchases, or do those belong to the educational institution (Jones, Thomson and Arnold, 2014)? How should society balance patients’ ownership of their electronic health records, fitness tracking data, and genetic information with potential benefits that could arise from sharing data (Kish and Topol, 2015; Kostkova et al., 2016)? Hurst (2015) discussed data ownership issues in his testimony to the U.S. House of Representatives Committee on Agriculture, and proposed a “Transparency Evaluator” that would accompany data collection. Farmers providing data would be told who controls their data, who can access them, and how the data will be used, along with other information about the data curation.

Increasingly, rich administrative data sources such as credit card transactions, electronic health records and social media are owned and harnessed by private companies. At the same time, increasing costs, decreasing budgets, and lower cooperation of the public in providing data for federal and state surveys are threatening the federal statistical system. Thus, for the methods reviewed in this paper to be useful, a framework of a private-public partnership will need to be forged to use all available data for the benefit of society.

Many of the probability sampling designs in current use were developed at a time when other sources of information were not available. If these data collections were redesigned from scratch, it is likely that the designs would make use of the wealth of information now available from multiple data sources. The availability of multiple data sources opens multiple opportunities for research on designing the data collection using a systems-based approach; on linking records; on developing imputation, multiple frame, and hierarchical models for combining data; on developing measures of uncertainty that reflect the nonsampling errors from various data sources; and on preserving privacy for individuals who contribute their data. The use of multiple data sources has great potential for capturing more of the population, saving resources by making use of cheaper sources of information, obtaining more information on subpopulations, and improving the temporal and spatial granularity of information used for research and public policy.

ACKNOWLEDGMENTS

The authors thank the reviewers for helpful comments and suggestions for additional references.

REFERENCES

Ades, A. E. and Sutton, A. J. (2006). Multiparameter evidence synthesis in epidemiology and medical decision-making: Current approaches. J. Roy. Statist. Soc. Ser. A 169 5–35. MR2222010
American Association of Public Opinion Research (2015). Code of Professional Ethics and Practices. Available at https://www.aapor.org/Standards-Ethics/AAPOR-Code-of-Ethics.aspx.
Andridge, R. R. and Little, R. J. A. (2010). A review of hot deck imputation for survey non-response. Int. Stat. Rev. 78 40–64.
Baker, R., Brick, J. M., Bates, N. A., Battaglia, M., Couper, M. P., Dever, J. A., Gile, K. J. and Tourangeau, R. (2013). Summary report of the AAPOR task force on non-probability sampling. Journal of Survey Statistics and Methodology 1 90–143.
Bancroft, T. A. (1944). On biases in estimation due to the use of preliminary tests of significance. Ann. Math. Stat. 15 190–204. MR0010373
Bankier, M. D. (1986). Estimators based on several stratified samples with applications to multiple frame surveys. J. Amer. Statist. Assoc. 81 1074–1079.
Battese, G. E., Harter, R. M. and Fuller, W. A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. J. Amer. Statist. Assoc. 83 28–36.
Berlin, J. A. and Rennie, D. (1999). Measuring the quality of trials: The quality of quality scales. J. Amer. Med. Assoc. 282 1083–1085.
Bhatt, S., Weiss, D. J., Cameron, E., Bisanzio, D., Mappin, B., Dalrymple, U., Battle, K. E., Moyes, C. L., Henry, A., Eckhoff, P. A. et al. (2015). The effect of Malaria control on Plasmodium falciparum in Africa between 2000 and 2015. Nature 526 207–211.
Bohensky, M. A., Jolley, D., Sundararajan, V., Evans, S., Pilcher, D. V., Scott, I. and Brand, C. A. (2010). Data linkage: A powerful research tool with potential problems. BMC Health Serv. Res. 10 1–7.
Brick, J. M. (2013). Unit nonresponse and weighting adjustments: A critical review. J. Off. Stat. 29 329–353.
Brick, J. M. (2015). Compositional model inference. In Proceedings of the Survey Research Methods Section 299–307. Amer. Statist. Assoc., Alexandria, VA.
Brick, J. M., Cervantes, I. F., Lee, S. and Norman, G. (2011). Nonsampling errors in dual frame telephone surveys. Surv. Methodol. 37 1–12.
Carpenter, J. and Kenward, M. (2012). Multiple Imputation and Its Application. Wiley, Hoboken, NJ.
Chauvet, G. and de Marsac, G. T. (2014). Estimation methods on multiple sampling frames in two-stage sampling designs. Surv. Methodol. 40 335–346.
Chen, C., Wakefield, J. and Lumley, T. (2014). The use of sampling weights in Bayesian hierarchical models for small area estimation. Spat. Spatiotemporal Epidemiol. 11 33–43.
Christen, P. (2012). Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer Science & Business Media, New York.
Citro, C. F. (2014). From multiple modes for surveys to multiple data sources for estimates. Surv. Methodol. 40 137–161.
Citro, C. F. and Straf, M. L., eds. (2013). Principles and Practices for a Federal Statistical Agency, 5th ed. National Academies Press, Washington, DC.
Cruze, N. (2015). Integrating survey data with auxiliary sources of information to estimate crop yields. In Proceedings of the Survey Research Methods Section 565–578. Amer. Statist. Assoc., Alexandria, VA.
Daas, P. J. H., Puts, M. J., Buelens, B. and van den Hurk, P. A. (2015). Big data as a source for official statistics. J. Off. Stat. 31 249–262.
Datta, G. S., Ghosh, M., Steorts, R. and Maples, J. (2011). Bayesian benchmarking with applications to small area estimation. TEST 20 574–588.
Deming, W. E. (1950). Some Theory of Sampling. Wiley, New York.
Deville, J.-C., Särndal, C.-E. and Sautory, O. (1993). Generalized raking procedures in survey sampling. J. Amer. Statist. Assoc. 88 1013–1020.
Dong, Q., Elliott, M. R. and Raghunathan, T. E. (2014a). A nonparametric method to generate synthetic populations to adjust for complex sampling design features. Surv. Methodol. 40 29–46.
Dong, Q., Elliott, M. R. and Raghunathan, T. E. (2014b). Combining information from multiple complex surveys. Surv. Methodol. 40 347–354.
Dugoff, E. H., Schuler, M. and Stuart, E. A. (2014). Generalizing observational study results: Applying propensity score methods to complex surveys. Health Serv. Res. 49 284–303.
Duncan, G. T., Jabine, T. B. and de Wolf, V. A. (1993). Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics. National Academies Press, Washington, DC.
Duncan, J. W. and Shelton, W. C. (1992). U.S. Government contributions to probability sampling and statistical analysis. Statist. Sci. 7 320–338. MR1181415
Durrant, G. B. (2009). Imputation methods for handling item-nonresponse in practice: Methodological issues and recent debates. International Journal of Social Research Methodology 12 293–304.
Dwork, C. (2011). A firm foundation for private data analysis. Commun. ACM 54 86–95.
Elliott, M. R. and Davis, W. W. (2005). Obtaining cancer risk factor prevalence estimates in small areas: Combining data from two surveys. J. Roy. Statist. Soc. Ser. C 54 595–609. MR2137256
Fay, R. E. III and Herriot, R. A. (1979). Estimates of income for small places: An application of James–Stein procedures to census data. J. Amer. Statist. Assoc. 74 269–277. MR0548019
Fellegi, I. P. (1999). Record linkage and public policy: A dynamic evolution. In Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition 1–12. National Academy Press, Washington, DC.
Fellegi, I. P. and Sunter, A. B. (1969). A theory of record linkage. J. Amer. Statist. Assoc. 64 1183–1210.
Finucane, M. M., Paciorek, C. J., Danaei, G. and Ezzati, M. (2014). Bayesian estimation of population-level trends in measures of health status. Statist. Sci. 29 18–25. MR3201842
Finucane, M. M., Paciorek, C. J., Stevens, G. A. and Ezzati, M. (2015). Semiparametric Bayesian density estimation with disparate data sources: A meta-analysis of global childhood undernutrition. J. Amer. Statist. Assoc. 110 889–901. MR3420668
Gelman, A., King, G. and Liu, C. (1998). Not asked and not answered: Multiple imputation for multiple surveys. J. Amer. Statist. Assoc. 93 846–857.
Goldstein, H., Harron, K. and Wade, A. (2012). The analysis of record-linked data using multiple imputation with data value priors. Stat. Med. 31 3481–3493. MR3041825
Greenland, S. (2005). Multiple-bias modelling for analysis of observational data. J. Roy. Statist. Soc. Ser. A 168 267–306. MR2119402
Groves, R. M. (2006). Nonresponse rates and nonresponse bias in household surveys. Public Opin. Q. 70 646–675.
Groves, R. M. and Heeringa, S. G. (2006). Responsive design for household surveys: Tools for actively controlling survey errors and costs. J. Roy. Statist. Soc. Ser. A 169 439–457. MR2236915
Harron, K., Goldstein, H. and Dibben, C. (2016). Methodological Developments in Data Linkage. Wiley, Hoboken, NJ.
Hartley, H. O. (1962). Multiple frame surveys. In Proceedings of the Social Statistics Section, American Statistical Association 203–206. Amer. Statist. Assoc., Alexandria, VA.
Hartley, H. O. (1974). Multiple frame methodology and selected applications. Sankhyā, Ser. C 36 99–118.
He, Y., Landrum, M. B. and Zaslavsky, A. M. (2014). Combining information from two data sources with misreporting and incompleteness to assess hospice-use among cancer patients: A multiple imputation approach. Stat. Med. 33 3710–3724.
Herzog, T. N., Scheuren, F. J. and Winkler, W. E. (2007). Data Quality and Record Linkage Techniques. Springer Science & Business Media, New York.
Hurst, B. (2015). Big Data and Agriculture: Innovations and Implications. Statement of the American Farm Bureau Federation to the House Committee on Agriculture, available at http://agriculture.house.gov/uploadedfiles/10.28.15_hurst_testimony.pdf.
Hyndman, R. J., Lee, A. J. and Wang, E. (2016). Fast computation of reconciled forecasts for hierarchical and grouped time series. Comput. Statist. Data Anal. 97 16–32.
Jackson, C., Best, N. and Richardson, S. (2008). Hierarchical related regression for combining aggregate and individual data in studies of socio-economic disease risk factors. J. Roy. Statist. Soc. Ser. A 171 159–178. MR2412651
Jones, K. M., Thomson, J. C. and Arnold, K. (2014). Questions of data ownership on campus. EDUCAUSE Review, August 1–10.
Kalton, G. and Anderson, D. W. (1986). Sampling rare populations. J. Roy. Statist. Soc. Ser. A 149 65–82.
Kim, J. K. and Rao, J. N. K. (2012). Combining data from two independent surveys: A model-assisted approach. Biometrika 99 85–100.
Kish, L. J. and Topol, E. J. (2015). Unpatients—Why patients should own their medical data. Nat. Biotechnol. 33 921–924.
Kohut, A., Keeter, S., Doherty, C., Dimock, M. and Christian, L. (2012). Assessing the Representativeness of Public Opinion Surveys. Pew Research Center, Washington DC. Available at http://www.people-press.org/files/legacy-pdf/Assessing%20the%20Representativeness%20of%20Public%20Opinion%20Surveys.pdf.
Korn, E. L. and Graubard, B. I. (1999). Analysis of Health Surveys. Wiley, New York.
Kostkova, P., Brewer, H., de Lusignan, S., Fottrell, E., Goldacre, B., Hart, G., Koczan, P., Knight, P., Marsolier, C., McKendry, R. A. et al. (2016). Who owns the data? Open data for healthcare. Frontiers in Public Health 4 1–6.
Lee, S. and Valliant, R. (2009). Estimation for volunteer panel web surveys using propensity score adjustment and calibration adjustment. Sociol. Methods Res. 37 319–343.
Lesser, V. M., Newton, L. and Yang, D. (2008). Evaluating Frames and Modes of Contact in a Study of Individuals with Disabilities. Paper presented at the Joint Statistical Meetings, Denver, Colorado.
Lohr, S. L. (2011). Alternative survey sample designs: Sampling with multiple overlapping frames. Surv. Methodol. 37 197–213.
Lohr, S. L. and Brick, J. M. (2012). Blending domain estimates from two victimization surveys with possible bias. Canad. J. Statist. 40 679–696. MR2998856
Lohr, S. L. and Brick, J. M. (2014). Allocation for dual frame telephone surveys with nonresponse. Journal of Survey Statistics and Methodology 2 388–409.
Lohr, S. L. and Rao, J. N. K. (2006). Estimation in multiple-frame surveys. J. Amer. Statist. Assoc. 101 1019–1030. MR2324141
Machanavajjhala, A. and Kifer, D. (2015). Designing statistical privacy for your data. Commun. ACM 58 58–67.
Manzi, G., Spiegelhalter, D. J., Turner, R. M., Flowers, J. and Thompson, S. G. (2011). Modelling bias in combining small area prevalence estimates from multiple surveys. J. Roy. Statist. Soc. Ser. A 174 31–50. MR2758280
Mecatti, F. (2007). A single frame multiplicity estimator for multiple frame surveys. Surv. Methodol. 33 151–157.
Mercer, L., Wakefield, J., Chen, C. and Lumley, T. (2014). A comparison of spatial smoothing methods for small area estimation with sampling weights. Spat. Stat. 8 69–85. MR3326822
Merkouris, T. (2004). Combining independent regression estimators from multiple surveys. J. Amer. Statist. Assoc. 99 1131–1139. MR2109501
Merkouris, T. (2010). Combining information from multiple surveys by using regression for efficient small domain estimation. J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 27–48. MR2751242
Metcalf, P. and Scott, A. (2009). Using multiple frames in health surveys. Stat. Med. 28 1512–1523. MR2649709
Moriarity, C. and Scheuren, F. (2001). Statistical matching: A paradigm for assessing the uncertainty in the procedure. J. Off. Stat. 17 407–422.
Mosteller, F. (1948). On pooling data. J. Amer. Statist. Assoc. 43 231–242.
Nachman, K. E. and Parker, J. D. (2012). Exposures to fine particulate air pollution and respiratory outcomes in adults using two national datasets: A cross-sectional study. Environ. Health 11 1–12.
Nandram, B., Berg, E. and Barboza, W. (2014). A hierarchical Bayesian model for forecasting state-level corn yield. Environ. Ecol. Stat. 21 507–530. MR3248537
National Center for Health Statistics (2016). Survey Description, National Health Interview Survey, 2014. Centers for Disease Control and Prevention, Hyattsville, MD. ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NHIS/2015/srvydesc.pdf.
Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society 97 558–625. MR0121942
Pfeffermann, D. and Tiller, R. (2006). Small-area estimation with state-space models subject to benchmark constraints. J. Amer. Statist. Assoc. 101 1387–1397. MR2307572
Pocock, S. J. (1976). The combination of randomized and historical controls in clinical trials. J. Chronic. Dis. 29 175–188.
Prentice, R. L., Smythe, R. T., Krewski, D. and Mason, M. (1992). On the use of historical control data to estimate dose response trends in quantal bioassay. Biometrics 48 459–478. MR1173492
Raghunathan, T. E. (1991). Pooling controls from different studies. Stat. Med. 10 1417–1426.
Raghunathan, T. E. (2006). Combining information from multiple surveys for assessing health disparities. Allg. Stat. Arch. 90 515–526.