How Few Countries Will
How Few Countries Will
Meuleman and Billiet (2009) have carried out a simulation study aimed at the question how
many countries are needed for accurate multilevel SEM estimation in comparative studies.
The authors concluded that a sample of 50 to 100 countries is needed for accurate estimation.
Recently, Bayesian estimation methods have been introduced in structural equation modeling
which should work well with much lower sample sizes. The current study reanalyzes the
simulation of Meuleman and Billiet using Bayesian estimation to find the lowest number of
countries needed when conducting multilevel SEM. The main result of our simulations is that
a sample of about 20 countries is sufficient for accurate Bayesian estimation, which makes
multilevel SEM practicable for the number of countries commonly available in large scale
comparative surveys.
Keywords: Multilevel SEM, sample size, cross-national research, Bayesian estimation
87
88 JOOP HOX, RENS VAN DE SCHOOT AND SUZETTE MATTHIJSSE
the advantages of a model based approach where we can ex- sion, Meuleman and Billiet confirm the suggestion that about
plicitly include country level explanatory variables and coun- 50 countries is the minimum sample size at the second level
try level residual variation in the model, rather than a sample for accurate estimation in multilevel SEM.
design based argumentation. We refer to Groves (1989) for a The sample size requirements suggested by the simula-
discussion of these two perspectives. tion studies reviewed above imply that for most comparative
When multigroup SEM is used, the number of countries surveys the country level sample sizes are problematic. For
is not a principled issue. Multigroup SEM can be used to instance, the European Social Survey round four (2008) in-
compare any number of groups. If the number of groups is cludes 30 countries (http://www.europeansocialsurvey.org),
huge, there may be practical analysis issues, such as the ca- the third wave of SHARE (2008-2009) includes 13 coun-
pacity of the software or the computer (or even the interpre- tries (http://www.share-project.org), the 2007 wave of
tational capacity of the analyst), but there is no formal lower the mathematics survey TIMMS includes 36-48 countries
or upper limit on the number of groups. In multilevel anal- (http://nces.ed.gov/timss), and the 2009 large scale educa-
ysis, the second level sample size (in comparative surveys tional assessment PISA sponsored by the OECD includes 65
generally the number of countries) is an issue. The second countries (http://www.opisa.oecd.org). These country level
level sample size must be large enough to permit accurate sample sizes suggest that only the larger collaborative com-
parameter estimates and associated standard errors. parative surveys involve enough countries to consider em-
Simulations have shown that multilevel regression mod- ploying multilevel SEM, but the majority appears too small
eling can be used with second-level samples as low as 20, to employ multilevel structural equation modeling.
provided that the interpretation focuses on the regression co- Recently, Bayesian estimation methods have been intro-
efficients (Maas and Hox 2005). However, accurate estima- duced in structural equation modeling (Lee 2007). Bayesian
tion and testing of variances requires much larger sample estimation works well with lower sample sizes, and will not
sizes, Maas and Hox (2005) suggest 50 groups as a lower produce inadmissible parameter estimates such as negative
limit when variances are important. Structural equation mod- variances. Bayesian methods generally imply prior informa-
eling with latent variables relies on (co)variances, which sug- tion in the analysis, but when uninformative priors are used
gests that for multilevel SEM even larger samples are needed this has only a small effect on the resulting parameter esti-
for accurate estimation. Indeed, a simulation involving a mates.
two-level confirmative factor model shows that with fewer The goal of the current paper is to examine how well
than 50 groups, the group level model parameters and their Bayesian estimation deals with the problem of estimating pa-
corresponding standard errors are not estimated with accept- rameters in a multilevel SEM model with a small sample size
able accuracy (Hox, Maas and Brinkhuis 2010). These simu- at the country level. The paper starts with an introduction
lations suggest that for accurate estimation at least 50 groups of Bayesian estimation methods and the issues involved in
should be available. a Bayesian multilevel SEM analysis. Next, it describes the
Meuleman and Billiet (2009) have carried out a simula- simulation design which is patterned after Meuleman and
tion study directly aimed at the question how many countries Billiet (2009). Our simulation design explicitly studies the
are needed for accurate multilevel SEM estimation in com- accuracy of the estimation method with very small numbers
parative surveys. They specified within country sample sizes of countries. The results and their implications for the anal-
to follow the sample sizes typically achieved in the European ysis of comparative surveys are discussed in detail.
Social Survey. The number of countries was varied from 20 We provide a basic introduction of Bayesian statistics,
to 100. The simulation model at both the individual and the but interested researchers could further refer to Lynch (2007)
country level is a confirmative one-factor model for four in- for an introduction to Bayesian estimation, and for technical
dicator variables, plus a structural effect predicting the factor details to Gelman, Carlin, Stern, and Rubin (2004). Bayesian
from an exogenous observed variable. Meuleman and Billiet structural equation modeling is discussed by Lee (2007) and
(2009) conclude that a sample of 20 countries is simply not Bayesian multilevel modeling by Hox (2010). In this pa-
enough for accurate estimation. They do not suggest a spe- per we use the software Mplus (Muthén and Muthén 1998-
cific lower limit for the country level sample size; instead, 2010) because it is often used by applied researchers. For the
they discuss how model complexity and goal of the analy- technical implementation of Bayesian statistics in Mplus, see
sis affect the country level sample size requirements. How- Asparouhov and Muthén (2010).
ever, their simulation results indicate that if we require that
the 95% confidence interval for country level factor loadings 2. Estimation methods in
lies in fact between 90 and 99 percent, which corresponds to multilevel SEM
a bias of about 5%, we require at least 60 countries. For 60
countries, the empirical alpha level for a test that the struc- In this section we describe briefly different estimation
tural effect equals zero is 0.083, which is acceptable. With methods for multilevel SEM, including Bayesian estimation.
40 countries, the empirical alpha level is 0.103, which is not For a more elaborate accessible introduction we refer to Hox
acceptable (cf. Boomsma and Hoogland 2001). The power (2010), and for a statistical treatment we refer to Kaplan
for a medium size structural effect at the country level is (2009).
0.523 with 60 countries, well below the value of 0.80 that Multilevel SEM assumes sampling at both individual
Cohen (1988) recommends as a worth pursuing. In conclu- and country levels. The individual data are collected in a
HOW FEW COUNTRIES WILL DO? COMPARATIVE SURVEY ANALYSIS FROM A BAYESIAN PERSPECTIVE 89
p-variate vector Yi j (subscript i for individuals, j for groups). More formally, let M be a statistical model with a vector
The data Yi j are decomposed into a between groups (Group of unknown parameters θ, for example regression parame-
level) component YB = Y j , and a within groups (individ- ters and correlations, and let Y be the observed data set with
ual level) component YW = Yi j − Y j . These two compo- sample size n. In Bayesian estimation, θ is considered to be
nents are orthogonal and additive, thus YT = YB + YW . The random and the behavior of θ under Y in such a Bayesian
population covariance model can be described by
matrices
are also orthogonal and ad-
ditive, thus T = B + W . Multilevel structural equa-
tion modeling assumes that the population covariance ma- p(θ|Y, M) ∝ p(θ|M) × p(Y|θ, M) (2)
trices B and W are described by distinct models for the
between groups and within groups structure. Several ap- where p(Y|θ, M) is the likelihood function, the information
proaches have been proposed to estimate the parameters of about the parameters in the data, p(θ|M) is the prior distribu-
the multilevel SEM. Muthén (1989) suggests to approximate tion, the information about the parameters before observing
the full maximum likelihood solution by assuming equal the data, and p = (θ|Y, M) is the posterior distribution, the in-
group sizes, which leads to a limited information estimation formation about the parameters after observing the data and
method called MUML (for Muthén’s Maximum Likelihood). taking the prior information into account.
For the prior distribution, we have a fundamental choice be-
A more accurate way to estimate a model for B and W
is a Weighted Least Squares (WLS) method implemented in tween using an informative prior or an uninformative prior.
Mplus. Full maximum likelihood estimation for multilevel An informative prior is a peaked distribution with a small
structural equation modeling requires to model the raw data. variance, which expresses a strong belief about the unknown
This minimizes the fit function given by population parameter, and has a substantial effect on the pos-
terior distribution. In contrast, an uninformative or diffuse
prior serves to produce the posterior, but has very little influ-
N
N −1 ence. An example of an uninformative prior is the uniform
F= log| i| + log(xi − μi ) i (xi − μi ), (1) distribution, which simply states that all possible values for
i=1
i=1 the unknown parameter are equally likely. Another exam-
where the subscript i refers to the observed ple of an uninformative prior is a very flat normal distribu-
cases, xi to those
variables observed for case i, and μi and i contain the pop- tion specified with an enormous variance. Sometimes such
ulation means and covariances of the variables observed for a prior is called an ignorance prior, to indicate that we know
case i. Mehta and Neale (2005) show that models for mul- nothing about the unknown parameter. However, this is not
tilevel data, with individuals nested within groups, can be accurate, since total ignorance does not exist. All priors add
expressed as a structural equation model. The fit function some information to the data, but diffuse priors add very lit-
(1) applies, with clusters as units of observation, and indi- tle information, and therefore do not have much influence
viduals within clusters as variables. Unbalanced data, here on the posterior. For our analyses we used the default prior
unequal numbers of individuals within clusters, are included specifications of Mplus which uses uninformative priors.
the same way as incomplete data in standard
SEM. The two- If the posterior distribution has a mathematically simple
stage approaches that model B and W separately (MUML form, the known characteristics of the distribution can be
and WLS) include only random intercepts in the between used to produce point estimates and confidence intervals.
groups model, the full ML representation can incorporate However, in complex models the posterior is generally a
random slopes as well (Mehta and Neale 2005). Maximum complicated multivariate distribution, which is often math-
likelihood estimation assumes large samples, and relies on ematically intractable. Therefore, simulation techniques are
numerical methods to integrate out random effects. In com- used to generate random draws from the multivariate poste-
parison, Bayesian methods are reliable in small samples, and rior distribution. These simulation procedures are known as
are better able to deal with complex models. The Bayesian Markov Chain Monte Carlo (MCMC) simulation. MCMC
approach is fundamentally different from classical statistics simulation is used to produce a large number of random
(Barnett 2008). In classical statistics, the population param- draws from the posterior distribution, which is then used to
eter has one specific value, only we happen to not know it. compute a point estimate and a confidence interval (for an
In Bayesian statistics, we express the uncertainty about the introduction to Bayesian estimation including MCMC meth-
population value of a model parameter by assigning to it a ods see Lynch 2007). Typically, the marginal (univariate)
probability distribution of possible values. This probability distribution of each parameter is used.
distribution is called the prior distribution, because it is spec- Given a set of initial values from a specific multivariate dis-
ified independently from the data. After we have collected tribution, MCMC procedures generate a new random draw
our data, this distribution is combined with the Likelihood of from the same distribution. Suppose that Z (1) is a draw from
the data to produce a posterior distribution, which describes a target distribution f (Z). Using MCMC methods, we gener-
our uncertainty about the population values after observing ate a series of new draws: Z (1) → Z (2) → . . . → Z (t) . MCMC
our data. Typically, the variance of the posterior distribution methods are attractive because, even if Z (1) is not from the
is smaller than the variance of the prior distribution, which target distribution f (Z), if t is sufficiently large, in the end
means that observing the data has reduced our uncertainty Z (t) is a draw from the target distribution f (Z). Having good
about the possible population values. initial values for Z (1) helps, because it speeds up the conver-
90 JOOP HOX, RENS VAN DE SCHOOT AND SUZETTE MATTHIJSSE
(a) within (individual) level (b) between (country) level
Figure 1. Path diagram for within (individual) and between (country) level
gence on the target distribution, so the classical maximum is more difficult to determine than the mean, the mean of the
likelihood estimates are often used as initial values for Z (1) . posterior distribution is also often used. In skewed posterior
The number of iterations t needed before the target distri- distributions, the median is an attractive choice. In Bayesian
bution is reached is referred to as the ‘burn in’ period of the estimation, the standard deviation of the posterior distribu-
MCMC algorithm. It is important that the burn in is com- tion is comparable to the standard error in classical statis-
plete. To check if enough iterations of the algorithm have tics. However, the confidence interval generally is based on
passed to converge on the target distribution, several diagnos- the 1/2 α and 100 − 1/2 α percentiles around the point esti-
tics are used. A useful diagnostic is a graph of the successive mate. In the Bayesian terminology, this is referred to as the
values produced by the algorithm. A different procedure is 100 − α% credibility interval. Mplus by default uses the me-
to start the MCMC procedure several times with widely dif- dian of the posterior distribution for the point estimate, and
ferent initial values. If essentially identical distributions are the percentile-based 95% credibility interval, which we have
obtained after t iterations, we decide that t has been large followed in our simulations. Bayesian methods have some
enough to converge on the target distribution (Gelman and advantages over classical methods. To begin, in contrast to
Rubin 1992). the asymptotic maximum likelihood method, they are valid
An additional issue in MCMC methods is that successive in small samples. Given the correct probability distribution,
draws are dependent. Depending on the distribution and the the estimates are always proper, which solves the problem of
amount of information in the data, they can be strongly cor- negative variance estimates. Finally, since the random draws
related. Logically, we would prefer independent draws to are taken from the correct distribution, there is no assumption
use as simulated draws from the posterior distribution. One of normality when variances are estimated. In this study, we
way to reach independence is to omit a number of succes- examine if Bayesian estimation will help in drawing correct
sive estimates before a new draw is used for estimation. This inferences in multilevel SEM if the number of groups (coun-
process is called thinning. To decide how many iterations tries) is relatively small. The simulation studies cited in the
must be deleted between two successive draws, it is useful introduction typically find that at smaller country level sam-
to inspect the autocorrelations between successive draws. If ple sizes the parameter estimates themselves are unbiased,
the autocorrelations are high, we must delete many estimates. but that the standard errors are underestimated, which leads
Alternatively, since each draw still gives some information, to poor control of the alpha level and undercoverage for the
we may keep all draws, but use an extremely large number confidence intervals. We expect that the credibility intervals
of draws. in our Bayesian estimation will perform better at the country
level sample sizes usually encountered in comparative survey
The mode of the marginal posterior distribution is an at- research.
tractive point estimate of the unknown parameter, because it
is the most likely value, and therefore the Bayesian equiva-
lent of the maximum likelihood estimator. Since the mode
HOW FEW COUNTRIES WILL DO? COMPARATIVE SURVEY ANALYSIS FROM A BAYESIAN PERSPECTIVE 91
3. Simulation design Table 2: Statistical power for detecting the country level structural
effect, for various effect sizes and country level sample sizes
The simulation design in this study closely follows Meule-
Number of countries
man and Billiet (2009). The model at both the individual and
the country level is a one-factor model with four indicators. Bayesian estimation ML estimation1
There is one structural effect from an observed exogenous 10 15 20 20 40 60
variable on the factor. Figure 1 shows the path diagram with Effect size
the population parameter values. None (0.00) 0.03 0.05 0.05 0.16 0.10 0.08
The simulated data were generated from a population that Small (0.10) 0.04 0.06 0.06 0.18 0.15 0.16
has the same characteristics as used in Meuleman and Billiet Medium (0.25) 0.08 0.13 0.15 0.31 0.41 0.53
(2009:48): Large (0.50) 0.26 0.43 0.58 0.75 0.94 0.99
• The observed variables have a multivariate distribu- Very large (0.75) 0.67 0.89 0.97 1.00 1.00 1.00
1
tion. Parameters estimated by Meuleman and Billiet 2009.
Table 1: Mean absolute bias for various country level sample sizes
Number of countries
Bayesian estimation ML estimation1
10 15 20 20 40 60
Parameter bias
Within factor loadings 0.00 0.00 0.00 0.00 0.00 0.00
Within error variances 0.00 0.00 0.00 0.00 0.00 0.00
Within structural effect 0.00 0.00 0.00 0.00 0.00 0.00
Between factor loadings 0.50 0.04 0.03 -.07 -.03 -.02
Between error variances 0.59 0.33 0.24 -.10 -.05 -.04
Between structural effect -.05 -.05 -.04 0.11 0.05 0.04
Coverage
Within factor loadings 0.95 0.96 0.95 0.93 0.94 0.94
Within error variances 0.93 0.94 0.93 0.93 0.94 0.94
Within structural effect 0.95 0.95 0.96 0.93 0.94 0.94
Between factor loadings 0.96 0.96 0.94 0.84 0.89 0.91
Between error variances 0.95 0.95 0.94 0.81 0.88 0.90
Between structural effect 0.96 0.94 0.95 0.85 0.90 0.92
1
Parameters estimated by Meuleman and Billiet 2009.
discussion, when we discuss convergence problems in the as we have countries, we recommend inspection of autocor-
Bayesian context. With respect to statistical power, it is clear relations and setting much stricter criteria for convergence.
that Bayesian estimation does not solve the problem of small In fact, if we deviate from the software defaults and set the
sample, only very large country level effects can be discov- convergence criteria much stricter, the bias in the residual
ered when the number of countries is small. variances at the country level becomes much smaller, at the
The results also show that Bayesian estimation is not magic. cost of a much increased computation time.
With ten countries, problems start to show in the summary ta- Softwarewise, we have simply specified a different estima-
bles, but they are clearer when the simulation output is stud- tion method. From a principled standpoint, we have chosen
ied in more detail. For the condition with ten countries, each a different kind of statistics. As a result, the 95% credibil-
simulation run contains some outliers for the estimates of the ity interval now may correctly be interpreted as the interval
error variances and corresponding standard errors, with esti- that contains the population parameter with 95% probability.
mates up to twenty times the population values. Such outliers In our power table, we presented p-values. In the Bayesian
would be recognized as such in a real analysis. The between case, this is not the normal p-value, but the so-called poste-
model contains a total of 10 parameters, so it is not surprising rior predictive p-value. This is roughly interpreted as a stan-
that problems arise when the number of countries approaches dard p-value, but it is actually a different entity. Bayesian
the number of parameters in the between model. Simplifying modeling in general prefers that decisions about parameters
the model, for instance by using the mean of the observed in- are based on credibility intervals, and that decisions about
dicators instead of a latent variable would make estimation models are based on comparative evidence, such as informa-
easier. tion criteria or Bayes factors. A discussion of these issues is
We briefly mentioned convergence problems and outlying beyond the scope of this paper, but we believe that applied
estimates. In MCMC estimation, convergence means con- researchers should be aware that doing a Bayesian analysis
vergence of the chain to the correct distribution. In our simu- is not just choosing a different estimation method.
lation, we have decided to emulate a relatively nave user and
therefore to follow all defaults implemented in the software In our analysis, we have chosen the default uninformative
(Mplus 6.1). We also used an automatic cut-off criterion to priors provided by Mplus. Other choices are possible. One
decide whether convergence had been reached. In one sim- interesting option is using an informative prior. For example,
ulation run, we needed to change the default criterion to a the default prior for a factor loading in Mplus is a normal
more strict value. Textbooks introducing Bayesian statistics distribution with a mean of zero and a very large variance
caution users to always use diagnostic tools such as plots of (1010 ). We have more prior knowledge than that. If we model
the iteration history (trace plots, c.f. Gelman, Carlin, Stern seven-point answer scales with an underlying factor, using
and Rubin 2004; Lynch 2007), and we completely agree with standard identifying constraints, we know that the (absolute)
such recommendations. Obviously, in a simulation, visually factor loadings will not exceed, say, the value ten. Why not
inspecting trace plots for 15,000 replications times 20 param- use a prior distribution that reflects this knowledge? In doing
eters is not possible. In applied Bayesian analysis, we con- so, we would become real subjectivist statisticians, a posi-
sider such inspection mandatory. In addition, especially in tion that is far away from mainstream statistics. If we im-
modeling situations as extreme as having as many parameters pose priors that describe only realistic parameter values, the
convergence problem discussed above will disappear. But in
HOW FEW COUNTRIES WILL DO? COMPARATIVE SURVEY ANALYSIS FROM A BAYESIAN PERSPECTIVE 93
small samples, such prior information could easily dominate Harkness, J. A., Braun, M., Edwards, B., Johnson, T. P., Lyberg,
the information in the data. In this paper, we have taken the L. E., Mohler, P. P., et al. (2010). Survey methods in multina-
position that this is undesirable, and prefer to work with un- tional, multiregional, and multicultural contexts. Chicester, UK:
informative priors. Wiley.
Hox, J. J. (2010). Multilevel analysis. Techniques and applications.
Acknowledgements NY: Routledge.
Jöreskog, K. G. (1971). Simultaneous factor analysis in several
Rens van de Schoot received a grant from the Netherlands populations. Psychometrika, 36, 409-426.
Organisation for Scientific Research: NWO-VINI-451-11- Kaplan, D. (2009). Structural equation modeling (2nd ed.). Thou-
008. sand Oaks, CA: Sage.
Lee, S. (2007). Structural equation modeling: a Bayesian ap-
References proach. Chicester, UK: Wiley.
Lynch, S. M. (2007). Introduction to applied Bayesian statistics
Asparouhov, T., & Muthén, B. (2010). Bayesian anal-
and estimation for social scientists. Berlin: Springer.
ysis of latent variable models using Mplus. Version 4.
Unpublished manuscript, accessed October 13, 2011 on Maas, C. J. M., & Hox, J. J. (2005). Sufficient sample sizes for mul-
http://www.statmodel.com/download/BayesAdvantages18.pdf. tilevel modeling. Methodology: European Journal of Research
Barnett, V. (2008). Comparative statistical inference. Chicester, Methods for the Behavioral and Social Sciences, 1, 85-91.
UK: Wiley. Mehta, P. D., & Neale, M. C. (2005). People are variables too:
Boomsma, A., & Hoogland, J. J. (2001). The robustness of LIS- multilevel structural equations modeling. Psychological Meth-
REL modeling revisited. In R. Cudeck, S. du Toit, & D. Sörbom ods, 10, 259-284.
(Eds.), Structural equation modeling: Present and future. A Meuleman, B., & Billiet, J. (2009). A Monte Carlo sample size
Festschrift in honor of Karl Jöreskog (p. 139-168). Chicago: study : How many countries are needed for accurate multilevel
Scientific Software International. SEM? Survey Research Methods, 3, 45-58.
Cohen, J. (1988). Statistical power analysis for the behavioral Muthén, B. (1989). Latent variable modeling in heterogeneous
sciences. Mahwah, NJ: Lawrence Erlbaum Associates. populations. Psychometrika, 54, 557-585.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Muthén, L. K., & Muthén, B. O. (1998-2010). Mplus user’s guide
Bayesian data analysis (2nd ed.). Boca Raton, FL: Chapman & (6th ed.). Los Angeles, CA: Muthén & Muthén.
Hall/CRC. Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simu- models (2nd ed.). Thousand Oaks, CA: Sage.
lation using multiple sequences. Statistical Science, 7, 457-511. Van de Vijver, F. R., van Hemert, D. A., & Poortinga, Y. H. (Eds.).
Goldstein, H. (2011). Multilevel statistical models. Chicester, UK: (2008). Multilevel analysis of individuals and cultures. NY:
Wiley. Taylor & Francis.
Groves, R. M. (1989). Survey errors and survey costs. New York:
Wiley.