
Lavaan

RStudio is a free, integrated development environment for the R programming language that provides
a range of powerful, user-friendly tools for data analysis and visualization. In addition to the core R
language, RStudio supports a vast array of packages that extend its functionality and allow users to
perform a variety of tasks, from data manipulation and cleaning to advanced statistical modelling and
machine learning. These packages can be easily installed and loaded within RStudio, and often include
detailed documentation and examples to help users get started. In this tutorial, the lavaan package,
developed by Prof. Yves Rosseel of Ghent University, will be discussed in detail. Lavaan is a powerful
R package that can be used for specifying and estimating complex models, including path analysis,
confirmatory factor analysis, and structural equation modelling. Lavaan also offers a range of advanced
features, such as multiple-group analysis, mediation and moderation analysis, and latent growth curve
modelling. The package supports various estimation methods, including maximum likelihood, robust
maximum likelihood, and weighted least squares, and provides comprehensive output summaries and
model fit indices to assess the adequacy of the model. Lavaan is an indispensable tool for researchers
and practitioners in fields such as psychology, education, biological and social sciences. To use lavaan,
you must have a recent version (4.0.0 or higher) of R installed. You can download the latest R version
from this page: http://cran.r-project.org/. RStudio can be downloaded here: https://posit.co/.
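
If lavaan itself is not yet installed, it can be installed from CRAN and loaded from within RStudio. A
minimal sketch (the same approach works for the other packages used later in this tutorial):

## Install lavaan from CRAN (only needed once per R installation).
install.packages("lavaan")

## Load the package and check which version is installed.
library(lavaan)
packageVersion("lavaan")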

This tutorial will mainly focus on confirmatory factor analysis, structural equation modelling, multiple
group analysis, and data visualization. If you want to learn more about lavaan and its other features,
we refer to the lavaan tutorial of prof. Yves Rosseel via https://lavaan.ugent.be/tutorial/index.html, on
which the examples in this document are based. However, this tutorial will go more into detail with
regard to model specification, model identification, parameter estimation, model evaluation (including
model fit and model validity), model modification, assessing measurement invariance, and calculating
modification indices. An overview of the six chapters in this document can be found below.

Index
1. Introduction (p. 2)
2. Model syntax of the lavaan package (p. 5)
3. Functions and concepts (p. 6)
4. Confirmatory factor analysis (p. 7)
   Five steps in confirmatory factor analysis (p. 7)
   Multiple group analysis (p. 19)
   Categorical data (p. 25)
   Visualizing the model (p. 26)
5. Structural equation modelling (p. 27)
   Five steps in structural equation modeling (p. 27)
   Multiple group analysis (p. 32)
   Visualizing the model (p. 37)
6. Sources (p. 38)

Latest Version: 03/03/2023.

1
1. Introduction
Before introducing the lavaan package of RStudio, we need to understand the difference(s) between
latent and observable variables. Observable variables, such as ‘carbon dioxide emissions [ppm]’, age,
‘concentration [g/L]’, Likert scale responses in consumer questionnaires, etc., can be directly observed
or measured and are stored in a structured way in a dataset. In contrast, latent variables are variables
that can only be inferred or measured indirectly from observable variables through a mathematical
model. Examples of latent variables are psychological traits such as ‘anxiety’ and ‘risk perceptions’, but
also ‘climate change’ is a good example. Latent variables are therefore never present in a dataset.

In this tutorial, latent variables will be measured and causal relationships between latent and observable
variables will be investigated by using three closely related techniques, i.e. confirmatory factor analysis
(CFA), path analysis (PA), and structural equation modelling (SEM). Note that these techniques are
not suitable as exploratory tools. If you want to use them, you must first develop a hypothetical model
that includes underlying (i.e. not measurable) latent variables, and/or that captures important causal
relationships between latent and observable variables. Only after designing this model and collecting
the data needed to validate it can you use CFA, PA or SEM to validate (or, if needed, amend) your
model. If you don’t have a model, you will need to apply techniques that are developed for
the exploration of latent variable problems (e.g. Exploratory Factor Analysis or EFA).

CFA is most often applied to measure latent variables based on correlated variations in a dataset (such
as causal relationships). CFA is also used to reduce data dimensions, standardize the scale of multiple
observable variables, and account for correlations inherent in a dataset. PA, on the other hand, aims
to find causal relationships amongst latent and observable variables by creating a path diagram. Lastly,
SEM combines CFA and PA and consists of two models; a measurement model and a structural model.
The measurement model measures the latent variables, while the structural model is used to test all
hypothetical causal dependencies between latent and observable variables, based on PA.

The measurement model is essentially a linear regression model. For a single indicator y1 (note that
indicator is used as a synonym for observable variable), this model can be written as:

y1 = τ1 + λ1η + ε1

where τ1 is the intercept of the indicator, λ1 is the loading of the latent variable η on the indicator y1
(loadings are also called regression weights or regression coefficients), and ε1 is the residual of the in-
dicator. However, a latent variable should always be inferred or measured by more than one indicator
(otherwise, you don’t need a latent variable and you can just use the observable variable). For three
indicators and one latent variable, the measurement model becomes:

y1 = τ1 + λ1η + ε1
y2 = τ2 + λ2η + ε2
y3 = τ3 + λ3η + ε3

where yi are the indicators or observable variables, τi are the indicator intercepts, λi are the loadings
(which, when standardized, can be interpreted as the correlation of the indicators with the latent
variable η), and εi are the residuals of
the model, i.e. what’s left over after accounting for the latent variable, or thus, what the latent variable
does not explain. The index refers to the indicator. For example, τ1 refers to the intercept of the first
indicator, λ2 is the loading of the latent variable on the second indicator, etc.

2
In the structural model, we are interested in identifying causal relationships amongst the observable
and latent variables. To describe these causal relationships, the model-implied covariance matrix Σ(Θ)
is typically used. Note that this is not the same matrix as the observed population variance-covariance
matrix Σ, which only comes from the data. The formula for the model-implied covariance matrix is:

Σ(Θ) = ΛΨΛᵀ + Θε

where Λ is the loading matrix (consisting of the same loadings λ as the measurement model), Ψ is the
variance-covariance matrix of the latent variables (if there is only one latent variable, Ψ is a scalar Ψii),
and Θε is the variance-covariance matrix of the residuals. The dimensions of Σ(Θ) and Σ are always
the same. For three indicators and one latent variable (writing the single latent variance as ψ and the
residual variances as θε1, θε2, θε3), Σ(Θ) can be written as:

          | λ1²ψ + θε1   λ1λ2ψ        λ1λ3ψ      |
Σ(Θ) =    | λ2λ1ψ        λ2²ψ + θε2   λ2λ3ψ      |
          | λ3λ1ψ        λ3λ2ψ        λ3²ψ + θε3 |

To simplify the interpretation of the measurement model and the structural model, a path diagram
can be used. The list below gives the elements that typically appear in a path diagram, each with its
own graphical symbol (by convention, latent variables are drawn as circles or ellipses, observable
variables as squares or rectangles, intercepts as triangles, paths as single-headed arrows, and
(co)variances as double-headed arrows):

▪ Latent variable (η)
▪ Residual (ε)
▪ Observable variable (y)
▪ Indicator intercept (τ) or latent mean, which is the intercept of a latent variable (see later)
▪ Path between a latent and observable variable, i.e. a loading λ, or path from an intercept to a variable
▪ Variance or covariance (for latent variables: Ψ, for residuals: Θε)

For example, in the figure below, the diagram on the left depicts the regression of a latent variable on
an indicator (essentially a measurement model), while the diagram on the right depicts the variance of
a latent variable (and not the covariance, as the covariance of a variable with itself is the variance).

In the path diagram below, all measurement model parameters are coloured in green and all structural
model parameters are coloured in blue. Note that the loadings λ are shared between these models
but are coloured in green here for simplicity, and that only variances (and no covariances) are shown.

3
We will always assume that the residuals are independent. This implies that the variance-covariance
matrix of the residuals Θε is always a diagonal matrix. The path diagram above (with three indicators
and one latent variable) can thus be translated to the following measurement and structural model:

Measurement model:  yi = τi + λiη + εi        (i = 1, 2, 3)
Structural model:   Σ(Θ) = ΛΨΛᵀ + Θε,   with Θε = diag(θε1, θε2, θε3)

Sometimes, the terminology of exogenous and endogenous variable is used. An exogenous variable is
an independent variable (latent or observable) that explains an endogenous, dependent variable (latent
or observable), and from which a causal path originates. This implies that:

▪ A structural model always specifies causal relationships from exogenous to endogenous variables.
▪ Exogenous latent or observable variables influence, either directly or indirectly, endogenous latent
or observable variables.

For example, in the path diagram above, all observable variables (y) are endogenous, while the latent
variable (η) is exogenous. However, this will not always be the case. The path diagram below gives an
overview of the most common types of variables and relationships in CFA, PA and SEM.

4
2. Model syntax of the lavaan package
The name lavaan refers to latent variable analysis. At the heart of the lavaan package is the ‘model
syntax’, which is a description of the model to be estimated. Four formula types that are especially
important are summarized in the table below.

Formula type Operator Interpretation


latent variable definition =~ is measured by
regression ~ is regressed on
(residual) (co)variance ~~ is correlated with
intercept ~1 intercept (of a latent or observable variable)¹

A regression formula always has the following form: d1 + d2 ~ i1 + i2 + i3. In this formula, the tilde (“~”)
is the regression operator. On the left-hand side of the operator, the dependent variables are defined,
while on the right-hand side, the independent variables are defined, separated by the “+” operator. In
lavaan, a typical model is simply a system of regression formulas, where some variables may be latent.
Latent variables should always be defined by the “=~” operator. In addition, intercepts and (residual)
(co)variances can be added to the model manually, using the “~1” and “~~” operators, respectively. Note
that not all operators will be needed for every model. An example is given below.

model <- '
## Measurement model (latent variable definitions, indicated with =~)
f1 =~ y1 + y2 + y3
f2 =~ y4 + y5 + y6
f3 =~ y7 + y8 + y9 + y10

## Structural model: regressions (indicated with ~)
y1 + y2 ~ f1 + f2 + y3 + y4
f1 ~ f2 + f3
f2 ~ f3 + y8 + y9

## Structural model: variances and covariances (indicated with ~~)
y1 ~~ y1    ## Variance
y1 ~~ y2    ## Covariance
f1 ~~ f2    ## Covariance

## Structural model: intercepts (indicated with ~1)
y1 ~ 1
f1 ~ 1
'

In the example above, f1, f2 and f3 are the latent variables, while y1 to y10 are the observable variables.
Note that only y1, y2, f1 and f2 are dependent (endogenous) variables, while all other variables (latent
and observable) are independent (exogenous).

¹ Intercept of an indicator: “indicator intercept”. Intercept of a latent variable: “latent mean”. See also Page 19.

5
3. Functions and concepts
Functions in lavaan are always indicated with brackets. For example: cfa() is a function, while CFA is a concept.
The page number between brackets indicates where a function or concept is first explained or used.

Function or concept (page)

5 Steps in CFA and SEM (p. 7)
AIC (Akaike Information Criterion) (p. 12)
as.data.frame() converts objects to data frames (p. 32)
BIC (Bayesian Information Criterion) (p. 12)
c() combines arguments (p. 16)
Categorical data analysis in lavaan (p. 25)
cbind() combines columns (p. 32)
CFA (Confirmatory Factor Analysis) (p. 2)
cfa() executes a confirmatory factor analysis (p. 10)
CFI-index (Comparative Fit Index) (p. 12)
Chi-square statistic (χ2) (p. 9-12)
coef() extracts model coefficients (p. 14)
Degrees of freedom (p. 9)
EFA (Exploratory Factor Analysis) (p. 2)
Endogenous (dependent) variable (p. 4)
Exogenous (independent) variable (p. 4)
fitMeasures() extracts model fit measures (p. 16)
fitted() extracts the model-implied covariance matrix when used on a cfa() or sem() object (p. 15)
Fixing parameters in lavaan (p. 21)
head() lists the first few records of an object (p. 8)
HolzingerSwineford1939 dataset (p. 7)
(Over-, Under- or Just-) Identified model (p. 9)
Indicator intercept τ (p. 2 and 19)
install.packages() installs RStudio packages (p. 7)
lapply() performs operations on list objects (p. 25)
Latent mean (intercept of a latent variable) (p. 19)
Compare the values of latent means (p. 19)
Latent variable or factor η (p. 2)
lavaanPlot() visualises path diagrams (p. 26)
lavResiduals() extracts the residuals of a fitted model (p. 15)
lavTestLRT() compares lavaan models, based on a χ2 statistic (similar to ANOVA, but we use it to test for measurement invariance) (p. 20)
library() loads installed RStudio packages (p. 8)
list() creates a list (p. 26)
Loading or regression weight/coefficient λ (p. 2)
Loading matrix Λ (p. 3)
mardia() assesses normality, based on the multivariate skewness, the kurtosis and the Shapiro-Wilk test (p. 9)
Measurement invariance (p. 19)
Measurement model (p. 2)
Metric (or weak) measurement invariance (p. 19)
ML (Maximum Likelihood estimator) (p. 8)
MLR (Maximum Likelihood estimator with robust, Huber-White standard errors and a Yuan-Bentler test statistic) (p. 9)
Model fit (p. 10 and 12)
Model validity (p. 16)
Model-implied covariance matrix Σ(Θ) (p. 3)
Modification Indices (MIs) (p. 17 and 21)
modindices() calculates the modification indices (p. 17)
Negative variances error in lavaan (p. 18 and 33)
Observable variable or indicator y (p. 2)
Operators in lavaan (=~, ~~, ~, ~1) (p. 5)
Outer loading (p. 16)
PA (Path Analysis) (p. 2)
parameterEstimates() extracts the estimated values of all (free) parameters in the model (p. 14)
Partial measurement invariance (scalar or metric) (p. 21)
Path (diagram) (p. 3)
PoliticalDemocracy dataset (p. 27)
read.csv() reads a CSV file (e.g. exported from Excel) into a data frame (p. 8)
rep(x,n) replicates the values in x n times (p. 32)
Residual ε (p. 2)
Residual (or strict) measurement invariance (p. 25)
rm(list = ls()) clears the global RStudio environment (p. 7)
RMSEA (Root Mean Square Error of Approximation) (p. 13)
R-Square (R²) (p. 13)
Scalar (or strong) measurement invariance (p. 19)
Scaling correction factor (p. 10)
SEM (Structural Equation Modelling) (p. 2)
SRMR (Standardized Root Mean Square Residual) (p. 12)
Standard Error (SE) (p. 10)
Structural model (p. 3)
summary() summarises the results (parameter estimates, model fit measures, etc.) of a model fitting (p. 10)
var() calculates the variance (p. 34)
Variance-covariance matrix of the latent variables Ψ (p. 3)
Variance-covariance matrix of the residuals Θε (p. 3)
vcov() extracts the estimated covariance matrix of the parameter estimates (p. 15)
WLSMV (Weighted Least Square Mean and Variance adjusted estimator) (p. 26)

Function arguments (page)

coefs = TRUE (p. 26)
estimator = "ML", "MLR", or "WLSMV" (p. 10)
fit.measures = TRUE (p. 10)
group = "x" (p. 20)
group.equal = c("loadings","intercepts") (p. 20)
likelihood = "wishart" (p. 10)
minimum.value = x (p. 17)
op = "~1", "~", "=~", or "~~" (p. 21)
ordered = c("x") (p. 25)
rsq = TRUE (p. 10)
sig = 0.05 (p. 26)
sort = TRUE (p. 17)
standardized = TRUE (p. 10)
type = "cor.bentler" (p. 15)

6
4. Confirmatory factor analysis
We start with an example of CFA, using the cfa() function, which is a user-friendly function for fitting
CFA models. The lavaan package contains a built-in dataset called HolzingerSwineford1939. This data-
set consists of the mental ability test scores of 301 7th and 8th grade children from two different schools
(Pasteur and Grant-White). In our dataset, only 9 of the original 26 observable variables are included.
The CFA model that is proposed consists of 3 latent variables (or factors), each with 3 indicators:

▪ A visual factor, measured by 3 observable variables: x1, x2 and x3


▪ A textual factor, measured by 3 observable variables: x4, x5 and x6
▪ A speed factor, measured by 3 observable variables: x7, x8 and x9

With x1: visual perception, x2: cubes, x3: lozenges, x4: paragraph comprehension, x5: sentence comple-
tion, x6: word meaning, x7: speeded addition, x8: speeded counting of dots, and x9: speeded discrimina-
tion of straight and curved capitals. Figure 1 shows the path diagram of this model. In this example, all
latent variables are exogenous (independent), while all observable variables are endogenous.

Figure 1 - Path diagram of the CFA model for the HolzingerSwineford1939 dataset (residuals are not depicted). Note:
although path diagrams have their origins in path analysis, we can also use them in CFA (and of course also in SEM).

There are five logical steps in CFA and SEM: (i) model specification (defining the hypothesized relation-
ships among the variables based on prior knowledge), (ii) model identification (checking if the model
is over-identified, just-identified, or under-identified), (iii) parameter estimation (only possible if the
model is just-identified or over-identified), (iv) model evaluation (assessing model fit with quantitative
indices), and (v) model modification (post hoc modification of the model to improve model fit). We
will follow this sequence in subsequent paragraphs. However, first we need to pre-analyse the dataset.

Step 0. Pre-analysis of the dataset


After clearing the global environment of RStudio, we load lavaan and other helpful packages.
## Clear global RStudio environment.
rm(list = ls())

## Load Libraries.
## If a package is not installed yet, use: install.packages(“package name”).
library(lavaan)

7
library(lavaanPlot)
library(dplyr)
library(semTools)
library(knitr)
library(mvnormalTest)

Next, we load the dataset.


## Load data. In this case, the dataset is part of the lavaan package. If you
## have to load data from a CSV file (e.g. exported from Excel), use:
## read.csv("file name", header = TRUE). The function head() is used to list
## the first few records of the data.

data <- HolzingerSwineford1939


head(data)

id sex ageyr agemo school grade x1 x2 x3 x4 x5 x6 x7 x8 x9


1 1 1 13 1 Pasteur 7 3.333333 7.75 0.375 2.333333 5.75 1.2857143 3.391304 5.75 6.361111
2 2 2 13 7 Pasteur 7 5.333333 5.25 2.125 1.666667 3.00 1.2857143 3.782609 6.25 7.916667
3 3 2 13 1 Pasteur 7 4.500000 5.25 1.875 1.000000 1.75 0.4285714 3.260870 3.90 4.416667
4 4 1 13 2 Pasteur 7 5.333333 7.75 3.000 2.666667 4.50 2.4285714 3.000000 5.30 4.861111
5 5 2 12 2 Pasteur 7 4.833333 4.75 0.875 2.666667 4.00 2.5714286 3.695652 6.30 5.916667
6 6 2 14 1 Pasteur 7 5.333333 5.00 2.250 1.000000 3.00 0.8571429 4.347826 6.65 7.500000

Before we analyse the dataset, we should check if the indicators (x1 to x9) are normally distributed, as
the maximum likelihood (ML) estimator, which we will use to estimate our model parameters, assumes
normality. Note that this assumption applies to the residuals and is thus only relevant for observable,
dependent variables (in this model, the observable variables, x1 to x9, are all dependent); in contrast,
independent variables can take any distributional form at all. If the observable, dependent variables are
not normally distributed, a more robust form of the maximum likelihood (ML) estimator needs to be
applied. Luckily, there are multiple robust forms of this estimator available in the lavaan package.

## Extracting the indicators from the dataset (all rows, columns 7 to 15).
indicators <- data[,7:15]

## Assessing normality with function “mardia”.


mardia(indicators)

$mv.test
Test Statistic p-value Result
1 Skewness 344.9053 0 NO
2 Kurtosis 3.2344 0.0012 NO
3 MV Normality <NA> <NA> NO

$uv.shapiro
W p-value UV.Normality
x1 0.9928 0.1582 Yes
x2 0.9697 0 No
x3 0.9523 0 No
x4 0.9827 0.0011 No
x5 0.9769 1e-04 No
x6 0.9538 0 No
x7 0.9908 0.056 Yes
x8 0.9807 4e-04 No
x9 0.9942 0.307 Yes

We can conclude with 95% confidence that the indicators generally don’t follow a normal distribution
(skewness or asymmetry: p < 0.05; kurtosis or thickness of the tails: p < 0.05). The Shapiro-Wilk test also
indicates that 6 out of 9 indicators are not normally distributed (only for x1, x7 and x9 is p > 0.05). This
means that we will have to use a more robust form of the maximum likelihood estimator.

8
Step 1. Model specification
We can now enter the model syntax.
HS.model <- 'visual =~ x1 + x2 + x3
textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9'

Note that covariances between the latent variables, although present in Figure 1, are not written in
the model syntax. The reason why this model syntax is so short is that the cfa() function, which we
will use in step 2, takes care of several things. First, all exogenous (independent) latent variables are
correlated by default. Second, again by default, the loading of a latent variable on its first indicator is
fixed to 1 (this means that λ = 1 for the paths ‘x1 - visual’, ‘x4 - textual’ and ‘x7 - speed’), thereby fixing
the scale of the latent variable. Third, residual (co)variances (for or between the observable variables)
are added automatically. In this way, the model syntax is kept concise. On the other hand, the user
remains in control: this ‘default’ behaviour can be overridden or switched off.
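
As an illustration of how this default behaviour can be changed, the sketch below (not needed for the
remainder of this tutorial) shows two cfa() arguments: std.lv = TRUE frees all loadings and instead
fixes the variance of every latent variable to 1, and orthogonal = TRUE switches off the default
correlations between the exogenous latent variables. The object names fit_std and fit_orth are arbitrary.

## Fix the latent variances to 1 instead of fixing the first loading of each
## latent variable (all loadings are then freely estimated).
fit_std <- cfa(HS.model, data, std.lv = TRUE)

## Do not correlate the exogenous latent variables by default.
fit_orth <- cfa(HS.model, data, orthogonal = TRUE)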

Step 2. Model identification


Remember that parameters can only be estimated if the model is just-identified (i.e. the number of
parameters to be estimated (free parameters) equals the number of unique elements in the model-
implied covariance matrix, or in other words, the residual degrees of freedom (df) after fitting the model
are equal to zero) or over-identified (the residual df after fitting the model are bigger than zero).

Model identification can be performed without RStudio.


▪ The number of unique elements in the model-implied covariance matrix is equal to:
K ∗ (K + 1) / 2 = 9 ∗ (9 + 1) / 2 = 45, with K = the number of observable variables = 9
▪ The number of free model parameters is equal to the sum of the number of regression coefficients
(number of single-headed arrows, also called loadings or regression weights, i.e. 9), the number of
regressions (0), the number of variances (sum of the number of latent & observable variables, i.e.
12), and the number of covariances (number of double-headed arrows, i.e. 3), minus the number
of loadings that correspond to the first indicator of a latent variable (these are fixed to 1 and are
thus not free parameters, i.e. 3). This gives us 21 free parameters, which is smaller than the number
of unique elements in the model-implied covariance matrix (45). Thus, the residual df after fitting
the model = 45 - 21 = 24 > 0. The model is over-identified, and parameters can be estimated.
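
The counting exercise above can also be reproduced with a few lines of R (a small sketch of the
arithmetic; the object names are arbitrary):

## Number of unique elements in the model-implied covariance matrix.
K <- 9                              ## number of observable variables
unique_elements <- K * (K + 1) / 2  ## 45

## Number of free parameters: 9 loadings minus 3 fixed, 12 variances, 3 covariances.
free_parameters <- (9 - 3) + 12 + 3 ## 21

## Residual degrees of freedom: > 0, so the model is over-identified.
unique_elements - free_parameters   ## 24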

Model identification can also be performed in RStudio. In lavaan, we can use the cfa() function to fit
the HS.model to the dataset. This function will also perform the model identification.
Note also that we use "MLR" as a robust alternative to the ML estimator² because not all dependent
variables are normally distributed (see Step 0).
## Fit the model to a confirmatory factor analysis (CFA) model.
## If maximum likelihood estimations are used (ML or a robust variant of ML,
## such as MLR), lavaan will by default base the analysis on the so-called
## biased sample covariance matrix, where the elements are divided by N instead
## of N-1. To use the unbiased sample covariance matrix, and N-1 as the
## multiplier to compute the Chi-square statistic, the argument
## “likelihood = wishart” can be added.

² "MLR" is a maximum likelihood estimation technique with robust (Huber-White) standard errors and a scaled test statistic that is (asymptotically) equal to
the Yuan-Bentler test statistic. MLR can be used for both complete and incomplete datasets. Alternatives for MLR: https://lavaan.ugent.be/tutorial/est.html.

9
fit <- cfa(HS.model, data, estimator = "MLR", likelihood = "wishart")

lavaan 0.6.14 ended normally after 35 iterations

Estimator ML
Optimization method NLMINB
Number of model parameters 21
Number of observations 301

Model Test User Model:


Standard Scaled
Test Statistic 85.022 87.422
Degrees of freedom 24 24
P-value (Chi-square) 0.000 0.000
Scaling correction factor 0.973
Yuan-Bentler correction (Mplus variant)

From this output, we can conclude that:


▪ The model was successfully fitted to the dataset after 35 iterations (lavaan 0.6.14 ended normally
after 35 iterations, where 0.6.14 is the version number of lavaan).
▪ The model is over-identified (i.e. degrees of freedom = 24 > 0).
▪ Note that lavaan displays ML as the applied estimator, while we used MLR. However, lavaan does
tell us that the fitted model was scaled with the Yuan-Bentler correction (scaling correction factor:
85.022/87.422 = 0.973), which is used by the MLR estimator. Note that we want scaling correction
factors > 1.000, as this means that the scaled model produces a lower Chi-square statistic and a
higher p-value compared to the standard model, which indicates a better model fit. However, in
this case, the scaling correction factor is lower than 1.000. This may be explained by the robust
nature of the MLR estimator: it will provide the same parameter estimates as the ML estimator
but the standard errors (SE) that are computed by the MLR estimator are more robust against
normality deviations and are also more robust to certain forms of model misspecification. When
the model is correctly specified however, the ML estimator will still provide reliable results, even
if some data are not normally distributed. More information on the ML and MLR estimators and
how these estimators calculate standard errors (SE) is provided by Rosseel (2017). As the scaling
correction factor is close to 1.000, we will neglect the small model fit deviance that results from
using the MLR estimator, as we got more robust SE by using the MLR instead of the ML estimator.
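
To verify that MLR changes only the standard errors and the (scaled) test statistic, and not the
parameter estimates themselves, the model can be refitted with the plain ML estimator and compared.
This is a small sketch (the object name fit_ml is arbitrary):

## Refit the same model with the plain ML estimator.
fit_ml <- cfa(HS.model, data, estimator = "ML", likelihood = "wishart")

## The parameter estimates are identical...
all.equal(coef(fit), coef(fit_ml))

## ...but the standard errors differ (robust vs. classical).
cbind(MLR = parameterEstimates(fit)$se, ML = parameterEstimates(fit_ml)$se)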

Step 3 & 4. Parameter estimation & Model evaluation


Once the model has been fitted, the summary() function provides the estimated parameters, as well
as different quantitative goodness-of-fit indices.
## Summary of the model.
## The argument fit.measures = TRUE is used to show quantitative goodness-of-
## fit indices. The argument standardized = TRUE is used to supplement the
## output with standardized parameter values. Two extra columns of standardized
## parameter values are printed. In the first column (labelled Std.lv), only
## the latent variables are standardized. In the second column (labelled
## Std.all), both latent and observable variables are standardized. The latter
## is often called the ‘completely standardized solution’. The argument
## rsq = TRUE is used to display the R-Square values of the dependent
## variables. The R-Square values indicate the % of variance of a dependent
## variable that is explained by the corresponding independent variable.

summary(fit, fit.measures = TRUE, standardized = TRUE, rsq = TRUE)

10
lavaan 0.6.14 ended normally after 35 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 21
Number of observations 301

Model Test User Model:


Standard Scaled
Test Statistic 85.022 87.422
Degrees of freedom 24 24
P-value (Chi-square) 0.000 0.000
Scaling correction factor 0.973
Yuan-Bentler correction (Mplus variant)

Model Test Baseline Model:

Test statistic 915.799 883.017


Degrees of freedom 36 36
P-value 0.000 0.000
Scaling correction factor 1.037

User Model versus Baseline Model:

Comparative Fit Index (CFI) 0.931 0.925


Tucker-Lewis Index (TLI) 0.896 0.888

Robust Comparative Fit Index (CFI) 0.930


Robust Tucker-Lewis Index (TLI) 0.895

Loglikelihood and Information Criteria:

Loglikelihood user model (H0) -3742.252 -3742.252


Scaling correction factor 1.126
for the MLR correction
Loglikelihood unrestricted model (H1) -3699.600 -3699.600
Scaling correction factor 1.044
for the MLR correction

Akaike (AIC) 7526.505 7526.505


Bayesian (BIC) 7604.354 7604.354
Sample-size adjusted Bayesian (SABIC) 7537.754 7537.754

Root Mean Square Error of Approximation:

RMSEA 0.092 0.094


90 Percent confidence interval - lower 0.071 0.073
90 Percent confidence interval - upper 0.114 0.116
P-value H_0: RMSEA <= 0.050 0.001 0.000
P-value H_0: RMSEA >= 0.080 0.838 0.867

Robust RMSEA 0.093


90 Percent confidence interval - lower 0.072
90 Percent confidence interval - upper 0.114
P-value H_0: Robust RMSEA <= 0.050 0.000
P-value H_0: Robust RMSEA >= 0.080 0.851

Standardized Root Mean Square Residual:

SRMR 0.065 0.065

Parameter Estimates:

Standard errors Sandwich


Information bread Observed
Observed information based on Hessian

Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
visual =~
x1 1.000 0.901 0.772
x2 0.554 0.132 4.198 0.000 0.499 0.424
x3 0.729 0.141 5.178 0.000 0.657 0.581
textual =~
x4 1.000 0.991 0.852

11
x5 1.113 0.066 16.974 0.000 1.103 0.855
x6 0.926 0.061 15.114 0.000 0.918 0.838
speed =~
x7 1.000 0.621 0.570
x8 1.180 0.130 9.061 0.000 0.732 0.723
x9 1.082 0.266 4.067 0.000 0.671 0.665

Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
visual ~~
textual 0.410 0.099 4.117 0.000 0.459 0.459
speed 0.263 0.060 4.373 0.000 0.471 0.471
textual ~~
speed 0.174 0.056 3.086 0.002 0.283 0.283

Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.x1 0.551 0.157 3.515 0.000 0.551 0.404
.x2 1.138 0.112 10.151 0.000 1.138 0.821
.x3 0.847 0.100 8.433 0.000 0.847 0.662
.x4 0.372 0.050 7.394 0.000 0.372 0.275
.x5 0.448 0.057 7.883 0.000 0.448 0.269
.x6 0.357 0.047 7.670 0.000 0.357 0.298
.x7 0.802 0.097 8.236 0.000 0.802 0.676
.x8 0.489 0.120 4.087 0.000 0.489 0.477
.x9 0.568 0.119 4.776 0.000 0.568 0.558
visual 0.812 0.181 4.494 0.000 1.000 1.000
textual 0.983 0.121 8.089 0.000 1.000 1.000
speed 0.385 0.107 3.602 0.000 1.000 1.000

R-Square:
Estimate
x1 0.596
x2 0.179
x3 0.338
x4 0.725
x5 0.731
x6 0.702
x7 0.324
x8 0.523

Interpretation: model fit


The chi-square (or χ2) statistic is equal to 87.422 (scaled) with 24 degrees of freedom and a p-value
< 0.05. In null hypothesis testing terms, we should reject the null hypothesis of "no difference between
the model-implied covariance matrix and the original covariance matrix" and conclude that there is a
significant difference. While this conclusion is contrary to what we hoped for (accurately reproduced
correlations), we know that this statistic is very sensitive to sample size and that, given a large sample,
even small deviations can be significant. Therefore, it isn’t recommended to make key decisions about
model fit based on the chi-square statistic alone, but to use a set of fit indices, such as:

▪ The Comparative Fit Index (CFI) represents the amount of variance that has been accounted for
in a covariance matrix and ranges from 0.0 to 1.0. A higher CFI indicates a better model fit. In
practice, the CFI should be ≥ 0.95. The CFI is less affected by sample size than the χ2 statistic.
▪ The Root Mean Square Error of Approximation (RMSEA) and the Standardized Root Mean Square
Residual (SRMR): the RMSEA is a “badness of fit” index where 0 indicates a perfect fit and higher
values indicate a lack of fit. It is useful for detecting model misspecification and is less sensitive to
sample size than the χ2 statistic. In practice, the RMSEA should be ≤ 0.06. The SRMR is similar to
the RMSEA and should be ≤ 0.05 for a close model fit; 0.05 - 0.08 for a reasonable model fit; and
≥ 0.10 for a poor model fit.
▪ The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC): these are
both relative measures from the perspective of model selection rather than null hypothesis testing.

12
AIC offers a relative estimation of the information lost when the given model is used to generate
data, while BIC is an estimation of how parsimonious a model is among several candidate models.
The AIC and BIC are not useful in testing the null hypothesis but are useful for selecting the model
with the least overfitting, whereby the model with the lowest AIC- or BIC-value is least overfitted.
▪ While not often used in CFA or SEM, we can also interpret the R-Square (R²) value. Any variable
that has an arrow pointing to it in the path diagram (i.e. a dependent, endogenous variable) has an
R² value attached to it. The R² value indicates the percentage of variance of a dependent variable
that is explained by a corresponding independent variable. There is no strict cut-off value for an
acceptable R2 value, but R² values ≥ 0.500 are desirable and higher values are better.

In this example, the CFI-index is too low (0.930), the RMSEA is too high (0.093) and the SRMR is okay
(0.065). Based on this, we can conclude that the fit is rather bad. We will address this issue in Step 5
(model modification). The R2 values range from 0.179 to 0.731. It appears from these values that each
endogenous variable has a substantial relationship with their corresponding exogenous variable, but
also that there are big differences (highest R² value for x5 on textual, lowest R² value for x2 on visual).

Interpretation: latent variables


The values in the last two columns (std.lv & std.all) are the standardized regression weights or loadings
of the latent variables (visual, textual, speed) on the observable variables (x1 to x9). There is no strict
cut-off value for the loadings, but ideally, these are higher than 0.700, while values > 0.400 are still
acceptable. Every loading appears to be statistically significant in this example, as the P(>|z|) value is
always equal to 0.000 and thus < 0.05. This means that every observable variable has a statistically
significant relationship with its corresponding latent variable.

Interpretation: covariances
The values in the last two columns are the standardized covariances between the latent variables. These
values are always a measure of the relationship between two latent factors. The covariances evaluate
to what extent two latent factors change together; in other words, they measure how strongly two
latent factors vary jointly. A positive covariance indicates that two latent factors tend to
move in the same direction, while a negative covariance indicates that they tend to move in opposite
directions. As the P(>|z|) values are always equal to 0.000 and thus < 0.05, all three covariances are
significantly different from zero. Or in other words, the extent in which two latent variables in this
example change together is always significantly different from zero.

Interpretation: variances
The values in the last two columns are the standardized variances of the observed and latent variables.
As the P(>|z|) values are always equal to 0.000 and thus < 0.05, the variance of every variable (observed
or latent) in this example is significantly different from zero.

Note that in the variances section, there is a dot (.) before the observable variable names. This is because
they are dependent (or endogenous) variables (predicted by the latent variables), and therefore, the
value for the variance that is printed in the output is an estimate of the residual variance: the left-over
variance not explained by the latent variables. In contrast, there is no dot before the latent variable
names, because they are exogenous variables in this model (there are no single-headed arrows pointing
to them). The values for the variances here are the estimated total variances of the latent variables.

13
Extracting information

A few functions can be used to extract specific elements of this summary() output in a more
structured form (e.g. as a data frame).

Function coef()
## Extract the estimated values of all free parameters in the model, together
## with their names (factor loadings or regression weights of the latent
## variables on the indicators: =~, variances and covariances: ~~).

coef(fit)

visual=~x2 visual=~x3 textual=~x5 textual=~x6 speed=~x8 speed=~x9


0.554 0.729 1.113 0.926 1.180 1.082

x1~~x1 x2~~x2 x3~~x3 x4~~x4 x5~~x5 x6~~x6


0.551 1.138 0.847 0.372 0.448 0.357

x7~~x7 x8~~x8 x9~~x9 visual~~visual textual~~textual speed~~speed


0.802 0.489 0.568 0.812 0.983 0.385

visual~~textual visual~~speed textual~~speed


0.410 0.263 0.174

Note that the 21 parameter values displayed here are not standardized, as coef() does not allow the
argument “standardized = TRUE”. Also remember that the loading λ of a latent variable on its first
indicator is fixed to 1 (λ = 1 for the paths ‘x1 - visual’, ‘x4 - textual’ and ‘x7 - speed’), which is why these
parameter values are not displayed here (but we know that they are fixed to 1).

Function parameterEstimates()
## Extract the estimated values of all (free) parameters in the model, together
## with their standard error (se), z-value (z), p-value (pvalue) and 95%
## confidence interval (ci.lower and ci.upper). The lhs (left-hand side), op
## (operator) and rhs (right-hand side) columns define the parameter.

parameterEstimates(fit, standardized = TRUE)

lhs op rhs est se z pvalue ci.lower ci.upper std.lv std.all std.nox


1 visual =~ x1 1.000 0.000 NA NA 1.000 1.000 0.901 0.772 0.772
2 visual =~ x2 0.554 0.132 4.198 0.000 0.295 0.812 0.499 0.424 0.424
3 visual =~ x3 0.729 0.141 5.178 0.000 0.453 1.005 0.657 0.581 0.581
4 textual =~ x4 1.000 0.000 NA NA 1.000 1.000 0.991 0.852 0.852
5 textual =~ x5 1.113 0.066 16.974 0.000 0.985 1.242 1.103 0.855 0.855
6 textual =~ x6 0.926 0.061 15.114 0.000 0.806 1.046 0.918 0.838 0.838
7 speed =~ x7 1.000 0.000 NA NA 1.000 1.000 0.621 0.570 0.570
8 speed =~ x8 1.180 0.130 9.061 0.000 0.925 1.435 0.732 0.723 0.723
9 speed =~ x9 1.082 0.266 4.067 0.000 0.560 1.603 0.671 0.665 0.665
10 x1 ~~ x1 0.551 0.157 3.515 0.000 0.244 0.858 0.551 0.404 0.404
11 x2 ~~ x2 1.138 0.112 10.151 0.000 0.918 1.357 1.138 0.821 0.821
12 x3 ~~ x3 0.847 0.100 8.433 0.000 0.650 1.044 0.847 0.662 0.662
13 x4 ~~ x4 0.372 0.050 7.394 0.000 0.274 0.471 0.372 0.275 0.275
14 x5 ~~ x5 0.448 0.057 7.883 0.000 0.336 0.559 0.448 0.269 0.269
15 x6 ~~ x6 0.357 0.047 7.670 0.000 0.266 0.449 0.357 0.298 0.298
16 x7 ~~ x7 0.802 0.097 8.236 0.000 0.611 0.993 0.802 0.676 0.676
17 x8 ~~ x8 0.489 0.120 4.087 0.000 0.255 0.724 0.489 0.477 0.477
18 x9 ~~ x9 0.568 0.119 4.776 0.000 0.335 0.801 0.568 0.558 0.558
19 visual ~~ visual 0.812 0.181 4.494 0.000 0.458 1.166 1.000 1.000 1.000
20 textual ~~ textual 0.983 0.121 8.089 0.000 0.745 1.221 1.000 1.000 1.000
21 speed ~~ speed 0.385 0.107 3.602 0.000 0.176 0.595 1.000 1.000 1.000
22 visual ~~ textual 0.410 0.099 4.117 0.000 0.215 0.605 0.459 0.459 0.459
23 visual ~~ speed 0.263 0.060 4.373 0.000 0.145 0.381 0.471 0.471 0.471
24 textual ~~ speed 0.174 0.056 3.086 0.002 0.064 0.285 0.283 0.283 0.283

14
This function displays all estimated, standardized parameter values. In addition, a column with std.nox
is added. To calculate this column, the variances of exogenous observable variables are ignored. Since
all observable variables in this example are endogenous, this column is exactly the same as “std.all”.
Note that we have 24 parameters, of which 21 are free parameters and 3 are fixed parameters (p-value: NA).

Functions fitted() and vcov()


## To extract the model-implied (fitted) covariance matrix, the function
## fitted() can be used. To extract the estimated covariance matrix of the
## parameter estimates (free parameters), the function vcov() can be used.
fitted(fit)
$cov
x1 x2 x3 x4 x5 x6 x7 x8 x9
x1 1.363
x2 0.449 1.386
x3 0.592 0.328 1.279
x4 0.410 0.227 0.299 1.355
x5 0.456 0.252 0.333 1.094 1.665
x6 0.379 0.210 0.277 0.910 1.013 1.200
x7 0.263 0.146 0.192 0.174 0.194 0.161 1.187
x8 0.310 0.172 0.226 0.205 0.229 0.190 0.454 1.025
x9 0.285 0.157 0.208 0.188 0.210 0.174 0.416 0.491 1.018

vcov(fit)
vsl=~2 vsl=~3 txt=~5 txt=~6 spd=~8 spd=~9 x1~~x1 x2~~x2 x3~~x3 x4~~x4 x5~~x5 x6~~x6 x7~~x7 x8~~x8 x9~~x9 vsl~~v txtl~~t spd~~s vsl~~t vsl~~s txtl~~s
visual=~x2 0.017
visual=~x3 0.015 0.020
textual=~x5 0.001 0.000 0.004
textual=~x6 0.001 0.001 0.001 0.004
speed=~x8 0.003 0.002 0.001 0.001 0.017
speed=~x9 0.006 0.007 -0.001 -0.001 0.009 0.071
x1~~x1 0.015 0.018 0.000 0.001 0.002 0.008 0.025
x2~~x2 -0.004 -0.004 0.000 0.000 0.000 -0.003 -0.004 0.013
x3~~x3 -0.008 -0.010 0.000 -0.001 -0.002 -0.005 -0.009 0.003 0.010
x4~~x4 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.000 0.000 0.003
x5~~x5 -0.001 0.000 -0.002 0.000 0.000 0.000 0.000 0.000 0.000 -0.001 0.003
x6~~x6 0.000 0.000 0.000 -0.001 0.000 0.002 0.000 0.000 0.000 0.000 0.000 0.002
x7~~x7 0.001 0.000 0.000 0.000 0.002 0.017 0.001 -0.001 -0.001 0.001 0.000 0.001 0.009
x8~~x8 0.001 0.001 -0.001 -0.001 -0.001 0.024 0.002 -0.001 -0.001 0.000 0.000 0.001 0.006 0.014
x9~~x9 -0.001 -0.001 0.000 0.001 0.000 -0.025 -0.002 0.000 0.001 0.000 0.000 -0.001 -0.007 -0.011 0.014
visual~~visual -0.017 -0.021 -0.001 -0.001 -0.001 -0.002 -0.021 0.004 0.008 0.000 0.001 0.000 0.001 0.001 -0.001 0.033
textual~~textual -0.002 -0.002 -0.005 -0.004 -0.002 0.004 -0.001 0.001 0.001 -0.002 0.001 0.001 0.001 0.002 -0.002 0.004 0.015
speed~~speed -0.003 -0.002 0.000 0.000 -0.007 -0.024 -0.003 0.001 0.002 -0.001 0.000 -0.001 -0.006 -0.007 0.007 0.001 0.000 0.011
visual~~textual -0.007 -0.009 -0.002 -0.001 -0.001 0.001 -0.008 0.001 0.004 -0.001 0.001 0.001 0.001 0.001 -0.001 0.013 0.006 0.001 0.010
visual~~speed -0.003 -0.002 -0.001 0.000 -0.002 0.006 -0.002 0.001 0.001 0.000 0.000 0.000 0.002 0.003 -0.004 0.005 0.002 0.000 0.003 0.004
textual~~speed -0.001 -0.001 -0.001 0.000 -0.003 0.001 -0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.001 -0.001 0.002 0.003 0.001 0.003 0.002 0.003

Note that the model-implied (fitted) covariance matrix contains [9*(9+1)]/2 = 45 unique values. The
covariance matrix of the parameter estimates contains [21*(21+1)]/2 = 231 unique values.

Function lavResiduals()
## To extract the residuals of the fitted model (i.e. the differences between
## the observed covariance matrix and the model-implied covariance matrix),
## the function lavResiduals() can be used.
lavResiduals(fit, type = "cor.bentler")
$type
[1] "cor.bentler"

$cov
x1 x2 x3 x4 x5 x6 x7 x8 x9
x1 0.000
x2 -0.030 0.000
x3 -0.008 0.094 0.000
x4 0.071 -0.012 -0.068 0.000
x5 -0.009 -0.027 -0.151 0.005 0.000
x6 0.060 0.030 -0.026 -0.009 0.003 0.000
x7 -0.140 -0.189 -0.084 0.037 -0.036 -0.014 0.000
x8 -0.039 -0.052 -0.012 -0.067 -0.036 -0.022 0.075 0.000
x9 0.149 0.073 0.147 0.048 0.067 0.056 -0.038 -0.032 0.000

$cov.z
x1 x2 x3 x4 x5 x6 x7 x8 x9
x1 0.000
x2 -1.089 0.000

15
x3 -0.269 2.211 0.000
x4 2.022 -0.243 -1.918 0.000
x5 -0.278 -0.529 -3.712 0.345 0.000
x6 1.675 0.604 -0.697 -0.688 0.242 0.000
x7 -3.476 -3.712 -1.917 0.896 -0.842 -0.366 0.000
x8 -1.241 -1.157 -0.301 -1.945 -1.008 -0.646 4.465 0.000
x9 3.370 1.747 3.433 1.086 1.667 1.278 -2.086 -2.871 0.000

$summary
cov
srmr 0.065
srmr.se 0.007
srmr.exactfit.z 4.227
srmr.exactfit.pvalue 0.000
usrmr 0.055
usrmr.se 0.010
usrmr.ci.lower 0.039
usrmr.ci.upper 0.072
usrmr.closefit.h0.value 0.050
usrmr.closefit.z 0.538
usrmr.closefit.pvalue 0.295

Note that summary statistics are provided, and that you can choose the type of residual information:
▪ When type = "raw" is used, the function lavResiduals provides the unscaled difference between the observed
and the model-implied covariance matrix.
▪ When type = "cor.bentler" is used, the function lavResiduals provides the rescaled difference between the
observed and the model-implied covariance matrix (rescaled by dividing all elements by the square roots of
the corresponding variances of the observed covariance matrix).
▪ Other options: type = “standardized”, type = “standardized.mplus”, type = “cor.bollen”.

Function fitMeasures()
## To extract important model fit measures (e.g. χ2 statistic, CFI-index, SRMR,
## RMSEA, etc.) of the fitted model, the function fitMeasures() can be used.

fitMeasures(fit,c("chisq.scaled","df.scaled","pvalue.scaled","aic","bic",
"cfi.robust","rmsea.robust","srmr"))

chisq.scaled df.scaled pvalue.scaled aic bic cfi.robust rmsea.robust srmr


87.422 24.000 0.000 7526.505 7604.354 0.930 0.093 0.065

Step 5. Model Modification


To improve the model fit, several strategies can be applied:

▪ Removing parameters from the model


Check the outer loadings, i.e. paths that are represented by single-headed arrows from a latent
variable to one of its indicators. Remove outer loadings that are too low (< 0.400) from the model
(in Step 1. Model specification). However, keep in mind that (i) every latent variable should have
more than one indicator, and (ii) removing paths from the model can sometimes improve model
fit but at the same time worsen model validity, since the modification is driven by your data rather
than by theory (e.g. if the sample size is rather small and certain relationships are not captured in
the dataset but are present in the population, you might remove a path that the summary()
function assesses as statistically insignificant in your dataset, but that would have been assessed
as significant if you were able to study the entire population). Therefore, you should always carefully
consider whether removing a certain outer loading from the model will improve both model fit and
model validity (unfortunately, this cannot be “measured” in R; it requires critical thinking). If only
model fit is improved but not model validity, you should not remove the outer loading. In addition,
as no outer loading is < 0.400 in this specific example, there is no reason to apply this strategy.
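
A quick way to inspect the standardized outer loadings (and to check whether any of them fall below
0.400) is the standardizedSolution() function, sketched below:

## List the standardized loadings (operator "=~") of the fitted model.
std_sol <- standardizedSolution(fit)
subset(std_sol, op == "=~", select = c("lhs", "op", "rhs", "est.std"))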

16
▪ Add parameters to the model
It is also possible to add parameters to the model (outer loadings, but also variances, covariances
or regressions). However, before adding a parameter to the model, critically consider if this action
will not violate model validity. For instance, don’t add an indicator to a latent variable definition if
it has nothing to do with the latent variable. This seems logical, but we will later discuss the lavaan
function modindices() that determines which parameters should preferentially be added to a CFA/
SEM model, and sometimes, this function will identify relationships that are statistically significant
but that are at the same time not causal. So, be critical when you use this function.

We will now discuss an important strategy that can be used to add parameters to a CFA or SEM
model that is not well-fitted: the determination of modification indices (MI). A modification index
is an estimate of the amount by which the χ2 statistic would be improved (i.e. reduced) if a certain
parameter restriction were to be removed from the model. Most commonly, modification indices
reflect the improvement in model fit that would result if a previously omitted parameter were to
be added and freely estimated. This parameter can be a loading, a regression coefficient, or even
a correlated residual. If a parameter is added based on a large MI, this is called a “post hoc, data-
driven model modification”. MIs > 3.84 indicate that the model fit would be improved (3.84 is the
critical value for the χ2 statistic with 1 df), and that the p-value for the added parameter would
be < 0.05. MIs > 10.83 indicate that the added parameter would have a p-value < 0.001.

Although MIs can be useful in identifying sources of misfit in a model, using them also carries risks.
First, they are completely determined by the data and are thus devoid of theory. The largest MIs
might be associated with parameters that are unsupported by theory and instead represent some
idiosyncratic characteristics of the data. Second, simulation research has suggested that using MIs
to guide model modifications rarely leads to the true population model.

However, like many things in statistics, MIs can be beneficial if used in a thoughtful, judicious way.
For instance, these can be used as another goodness-of-fit indicator. For example, a small number
of large MIs indicates that important parameters were omitted, while a large number of small MIs
indicates that a lot of less-important parameters were not considered. MIs are also used when
assessing measurement invariance (or the lack thereof) across groups in CFA and SEM models.

In lavaan, modification indices can be requested by the function modindices(). By default, MIs are
printed out for each non-free (fixed-to-zero) parameter. The MIs are supplemented by the expec-
ted parameter change (epc) values. The last three columns contain the standardized epc values
(sepc.lv: only standardizing latent variables; sepc.all: standardizing all variables; sepc.nox: standardi-
zing all but exogenous observable variables).
## Function modindices().
modindices(fit, sort = TRUE, minimum.value = 3.84)

lhs op rhs mi epc sepc.lv sepc.all sepc.nox


30 visual =~ x9 36.411 0.577 0.520 0.515 0.515
76 x7 ~~ x8 34.145 0.538 0.538 0.859 0.859
28 visual =~ x7 18.631 -0.422 -0.380 -0.349 -0.349
78 x8 ~~ x9 14.946 -0.425 -0.425 -0.805 -0.805
33 textual =~ x3 9.151 -0.272 -0.269 -0.238 -0.238
55 x2 ~~ x7 8.918 -0.183 -0.183 -0.192 -0.192
31 textual =~ x1 8.903 0.350 0.347 0.297 0.297
51 x2 ~~ x3 8.532 0.219 0.219 0.223 0.223
59 x3 ~~ x5 7.858 -0.131 -0.131 -0.212 -0.212
26 visual =~ x5 7.441 -0.210 -0.189 -0.147 -0.147

17
50 x1 ~~ x9 7.335 0.138 0.138 0.247 0.247
65 x4 ~~ x6 6.220 -0.236 -0.236 -0.646 -0.646
66 x4 ~~ x7 5.920 0.099 0.099 0.180 0.180
48 x1 ~~ x7 5.420 -0.130 -0.130 -0.195 -0.195
77 x7 ~~ x9 5.183 -0.187 -0.187 -0.278 -0.278
36 textual =~ x9 4.796 0.138 0.137 0.136 0.136
29 visual =~ x8 4.295 -0.210 -0.190 -0.187 -0.187
63 x3 ~~ x9 4.126 0.102 0.102 0.147 0.147
As an example, we will now add the parameters for which the MI > 10.83 to the model. In practice,
this means that we consider that the latent variable “visual” is also influenced by indicators x9 (speeded
discrimination of straight and curved capitals) and x7 (speeded addition), that x7 and x8 are correlated
(x8: speeded counting of dots), and that x8 and x9 are correlated. As these relationships seem plausible
and not only statistically relevant, we will add them to the model and repeat steps 2, 3 and 4.

## Step 1.
HS.model2 <- 'visual =~ x1 + x2 + x3 + x7 + x9
textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9
x7 ~~ x8
x8 ~~ x9'

## Step 2. The output is not displayed here, but the residual df was equal to
## 20 (> 0), which means that also HS.model2 is over-identified.
fit2 <- cfa(HS.model2, data, estimator = "MLR", likelihood = "wishart")

This fit (fit2) results in the following warning message:


Warning message:
In lav_object_post_check(object) :
lavaan WARNING: some estimates of variances are negative

Although variances cannot be negative, lavaan can produce variance estimates that are negative. The
solution is then called inadmissible. Negative variances are theoretically impossible, so the solution is
considered improper and the other estimates are not reliable. Therefore, we will omit the parameters
that caused this warning message (x7 ~~ x8 and x8 ~~ x9; you can check this yourself). We will also
omit “visual =~ x7”, as this loading is insignificant (p > 0.05) (execute the summary() function to see this).
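
To see which variance estimates caused the warning, the parameter estimates of the inadmissible
solution can be filtered for negative variances. A small sketch (fit2 here still refers to the inadmissible
fit above):

## List all variance estimates (operator "~~" with the same variable on both
## sides) and keep the negative ones.
est2 <- parameterEstimates(fit2)
subset(est2, op == "~~" & lhs == rhs & est < 0)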
## Step 1.
HS.model2 <- 'visual =~ x1 + x2 + x3 + x9
textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9'

## Step 2 (residual df = 23).


fit2 <- cfa(HS.model2, data, estimator = "MLR", likelihood = "wishart")

## Step 3 & 4. The output of the summary is not displayed here.


summary(fit2, fit.measures = TRUE, standardized = TRUE, rsq = TRUE)
fitMeasures(fit2,c("chisq.scaled","df.scaled","pvalue.scaled","aic","bic",
"cfi.robust","rmsea.robust","srmr"))

chisq.scaled df.scaled pvalue.scaled aic bic cfi.robust rmsea.robust srmr


51.898 23.000 0.001 7495.581 7577.138 0.967 0.065 0.045

## Compare this to the first model (HS.model):


fitMeasures(fit,c("chisq.scaled","df.scaled","pvalue.scaled","aic","bic",
"cfi.robust","rmsea.robust","srmr"))

chisq.scaled df.scaled pvalue.scaled aic bic cfi.robust rmsea.robust srmr


87.422 24.000 0.000 7526.505 7604.354 0.930 0.093 0.065

18
From this output, we can observe that:

▪ The χ2 statistic is reduced, while the p-value of this statistic is increased (which we desired). The
p-value is still < 0.05, but we know that the χ2 statistic is sensitive to sample size. In addition, the
CFI-index is now ≥ 0.95 and also the RMSEA and SRMR are improved. From this, we can conclude
that the model fit is improved by altering the model specification.

▪ Since 1 parameter was added to the model, df.scaled decreased from 24 to 23.

▪ The AIC and BIC of HS.model2 are lower than the AIC and BIC of HS.model, which means that
the HS.model2 is less overfitted than the HS.model.

Overall, the model fit was improved by altering the first latent variable definition. As the validity of
the model does not seem to be negatively affected by this adjustment, it can be justified. However, always
remember that model validity is more important than model fit, and that model fit depends on
sample size, while model validity does not. In addition, remember that you should report model
modifications, whether guided by modification indices or other considerations, so that reviewers and
readers of your research can also judge these decisions.
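
Because HS.model is nested in HS.model2 (the second model only adds the loading of visual on x9),
the two fitted models can also be compared formally with a (scaled) chi-square difference test, using
the lavTestLRT() function that is also used in the next section to test for measurement invariance. A
minimal sketch, assuming the objects fit and fit2 from above are still available:

## Scaled chi-square difference test between the original and the modified model.
## A significant p-value indicates that the extra loading significantly improves fit.
lavTestLRT(fit, fit2)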

Multiple group analysis


In lavaan, we can also compare the values of latent means, i.e. the intercepts of latent variables, across
multiple groups. Pay attention to the terminology used: we are not interested in τ, i.e. the intercepts
of indicators (see Page 2). Moreover, before we can compare the values of latent means across multiple
groups, measurement invariance should be assessed.

The idea behind measurement invariance is that when we compare any measurement across groups,
this comparison should reflect true differences rather than measurement differences. When data are
continuous, testing for measurement invariance involves testing for three invariance types: configural
invariance, weak or metric invariance, and strong or scalar invariance.

▪ Configural invariance: the same latent variable structure is imposed on all groups (i.e. the
number and pattern of latent variables is the same for all groups, which means that the same
indicators are loading on the same latent variables for all groups). However, loadings (λ) and
indicator intercepts (τ) can be freely estimated.

▪ Weak or metric invariance posits the same constraints as configural invariance, but with the
additional constraint that the loadings (λ) are equal across groups.

▪ Strong or scalar invariance posits the same constraints as configural and metric invariance, but
with the additional constraint that the indicator intercepts (τ) are equal across groups.

In the code below, it is shown how these three invariance impositions can be checked in lavaan.
## Remark 1: In this example, measurement invariance will be checked for the
## three latent variables that are present in our model (visual, textual and
## speed) because we want to compare the values of their latent means between
## the two school groups that are present in this dataset (Pasteur and Grant-
## White). In other words: we want to determine which school group performs
## the best for the latent variables visual, textual and speed.

## Remark 2: The function in lavaan that can be used to test for measurement
## invariance is lavTestLRT(). This function is similar to R’s anova() function
## and compares nested lavaan models based on a χ2 difference statistic. A
## p-value > 0.05 is desired, as this

## indicates that measurement invariance is supported by the dataset.
## Note that we don’t have to check for configural invariance in lavaan, as we
## imposed the same latent variable structure on all groups in the model
## specification (the model specification is the same for both groups). This
## is why the code below will only provide 2 p-values: the first p-value will
## test for the presence of weak invariance; the second p-value will test for
## the presence of strong invariance (if present: p > 0.05).
## Remark 3: In the code below, note that we omit the arguments estimator =
## "MLR" and likelihood = "wishart". We do this because lavTestLRT() will
## execute these corrections automatically. If we still used these two
## arguments, lavaan would display the following note (no error, but a note):
## The “Chisq” column contains standard test statistics, not the robust
## test that should be reported per model. A robust difference test is a
## function of two standard statistics.
## In other words, using these two arguments would lead to a non-corrected
## χ2 statistic and a potentially false interpretation of measurement
## invariance, which is of course not desired.

-------------------------------------------------------------------------------
Testing for measurement invariance
-------------------------------------------------------------------------------
rm(list = ls())
data <- HolzingerSwineford1939
HS.model2 <- 'visual =~ x1 + x2 + x3 + x9
textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9'

## Model 1: Configural invariance.


fitA <- cfa(HS.model2, data, group = "school")
## Model 2: Weak or metric invariance.
fitB <- cfa(HS.model2, data, group = "school", group.equal = "loadings")
## Model 3: Strong or scalar invariance.
fitC <- cfa(HS.model2, data, group = "school",
group.equal = c("loadings", "intercepts"))
## Model comparison test
lavTestLRT(fitA, fitB, fitC)

Chi-Squared Difference Test

Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)


fitA 46 7454.1 7683.9 81.547
fitB 53 7447.8 7651.7 89.266 7.720 0.026135 7 0.358
fitC 59 7477.4 7659.0 130.828 41.562 0.198449 6 2.244e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Because we provided three model fits, the function lavTestLRT() will produce two tests: the first test
compares the first model to the second model, while the second test compares the second model to
the third model. Because the first p-value is non-significant, we may conclude that weak invariance (or
metric invariance, i.e. equal loadings across groups) is supported in this dataset. However, because the
second p-value is significant, strong invariance (or scalar invariance, i.e. equal indicator intercepts across
groups) is not³. Therefore, we can’t compare the values of latent means across the two groups directly.
However, if full measurement invariance is not supported, Byrne et al. (1989) have proposed a solution
that is based on partial measurement invariance. This can be both partial metric invariance (some, but

³ This can, among other things, be caused by the restricted size of the dataset, especially because we divided the dataset into two smaller groups.

not all loadings λ are equated across groups), or partial scalar invariance (some, but not all indicator
intercepts τ are equal across groups). However, for both these types of partial measurement invariance,
at least 2 indicators of a latent variable must have equal loadings or indicator intercepts across groups.
Partial metric and/or scalar measurement invariance allows researchers to at least compare the group-
specific (standardized) latent means. If less than two indicators per latent variable have equal loadings
and/or indicator intercepts, multiple group comparisons in CFA and SEM may become problematic.

Since metric invariance was established for our model fit, there is no need to introduce partial metric
invariance. However, as scalar invariance was not established, we will introduce partial scalar invariance
in our model fit. Remember: this means that we will allow that some indicator intercepts are not equal
across groups. In practice, we need to calculate the modification indices of all indicator intercepts τ,
after which we have to constrain invariant τ while setting free non-invariant τ. However, we also have
to make sure that at least two indicators of a latent variable still have equal intercepts across groups4.
If this procedure results in partial scalar invariance, we can reliably compare the values of latent means
across the two groups by using the summary() function.

-------------------------------------------------------------------------------
Introducing partial scalar (measurement) invariance
-------------------------------------------------------------------------------
## Step 1: calculating the modification indices. It is important to realize
## that the function modindices() will only consider fixed-to-zero parameters.
## Therefore, we need to fix all indicator intercepts to zero first.
## This can be done by amending the model specification. Note that
## (i) the operator (op) in lavaan for (indicator) intercepts is ~1
## (ii) we can fix a parameter in a lavaan formula by pre-multiplying the
## corresponding variable in the formula with a numerical value
## (iii) to fix a parameter to zero, pre-multiply with 0 (if you want to
## force a parameter to be freely estimated, pre-multiply with NA)
## (iv) because all indicator intercepts are fixed to the same value
## (i.e. zero), we only need one line of code and we can separate them
## by an addition operator or + sign (if you would want to fix all
## indicator intercepts to a different value, e.g. 5, the code would be:
## x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 ~ 5*1,
## and if you would want to set all indicator intercepts free, the code
## would be: x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 ~ NA*1)

HS.modeltest <- 'visual =~ x1 + x2 + x3 + x9


textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9
x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 ~ 0*1'

fitD <- cfa(HS.modeltest, data, group = "school",


group.equal = c("loadings", "intercepts"))

modindices(fitD, op = "~1", sort = TRUE, minimum.value = 3.84)


lhs op rhs block group level mi epc sepc.lv sepc.all sepc.nox
50 x3 ~1 2 2 1 26.368 -0.606 -0.606 -0.609 -0.609
54 x7 ~1 2 2 1 15.747 -0.430 -0.430 -0.414 -0.414
55 x8 ~1 2 2 1 12.434 0.443 0.443 0.420 0.420
17 x7 ~1 1 1 1 10.423 0.267 0.267 0.061 0.061
49 x2 ~1 2 2 1 9.939 0.492 0.492 0.390 0.390
13 x3 ~1 1 1 1 8.270 0.247 0.247 0.098 0.098
16 x6 ~1 1 1 1 5.661 -0.128 -0.128 -0.058 -0.058

⁴ For partial metric invariance, the solution would be similar, i.e. calculating the modification indices of all loadings λ, constraining invariant λ and setting free non-invariant λ, while making sure that at least two indicators per latent variable still have equal loadings across groups.

From this output, we can notice that especially indicators x3 and x7 have intercept estimates that are
non-invariant across groups (i.e. high modification indices). Therefore, we will set free the intercepts
of these two indicators, while constraining all other indicator intercepts in the model. Note that “x3
~1” and “x7 ~ 1” are shown twice because the indicator intercepts are now estimated for both groups
(and thus, are not equal anymore). As the intercept estimate of x3 is strongly non-invariant for both
groups, it is especially important to set this intercept free. Also note that we can’t set free the
intercept of x8. If we would do this, the latent variable “speed” would only have one indicator left with
equal intercepts across groups (i.e. x9) since we already freed the intercept of x7.

-------------------------------------------------------------------------------
Introducing partial scalar (measurement) invariance
-------------------------------------------------------------------------------
## Step 2: constrain invariant τ and set free non-invariant τ.
## Attention: make sure that you use “HS.model2” and not “HS.modeltest”.
fitE <- cfa(HS.model2, data, group = "school", group.equal =
c("loadings", "intercepts"), group.partial = c("x3~1","x7~1"))

## Step 3: model comparison test.


lavTestLRT(fitA, fitB, fitE)
Chi-Squared Difference Test

Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)


fitA 46 7454.1 7683.9 81.547
fitB 53 7447.8 7651.7 89.266 7.7196 0.026135 7 0.3580
fitE 57 7445.6 7634.7 95.064 5.7981 0.054652 4 0.2147

We can conclude from the output above that both metric invariance and partial scalar invariance are
present. Therefore, we can now reliably compare the values of latent means across the two groups.

summary(fitE, fit.measures = TRUE, standardized = TRUE)

lavaan 0.6.14 ended normally after 59 iterations


Estimator ML
Optimization method NLMINB
Number of model parameters 65
Number of equality constraints 14
Number of observations per group:
Pasteur 156
Grant-White 145

Model Test User Model:


Test statistic 95.064
Degrees of freedom 57
P-value (Chi-square) 0.001
Test statistic for each group:
Pasteur 59.744
Grant-White 35.320

Model Test Baseline Model:


Test statistic 957.769
Degrees of freedom 72
P-value 0.000

User Model versus Baseline Model:


Comparative Fit Index (CFI) 0.957
Tucker-Lewis Index (TLI) 0.946

Loglikelihood and Information Criteria:


Loglikelihood user model (H0) -3671.804
Loglikelihood unrestricted model (H1) -3624.272
Akaike (AIC) 7445.608
Bayesian (BIC) 7634.671
Sample-size adjusted Bayesian (SABIC) 7472.928

Root Mean Square Error of Approximation:
RMSEA 0.067
90 Percent confidence interval - lower 0.042
90 Percent confidence interval - upper 0.090
P-value H_0: RMSEA <= 0.050 0.123
P-value H_0: RMSEA >= 0.080 0.180

Standardized Root Mean Square Residual:


SRMR 0.057

Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured

Group 1 [Pasteur]:

Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
visual =~
x1 1.000 0.868 0.744
x2 (.p2.) 0.617 0.099 6.257 0.000 0.535 0.432
x3 (.p3.) 0.804 0.103 7.836 0.000 0.697 0.602
x9 (.p4.) 0.453 0.081 5.577 0.000 0.393 0.391
textual =~
x4 1.000 0.938 0.814
x5 (.p6.) 1.121 0.066 16.984 0.000 1.052 0.831
x6 (.p7.) 0.931 0.056 16.587 0.000 0.874 0.862
speed =~
x7 1.000 0.600 0.558
x8 (.p9.) 1.195 0.164 7.300 0.000 0.717 0.732
x9 (.10.) 0.682 0.107 6.359 0.000 0.409 0.407

Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
visual ~~
textual 0.370 0.092 4.033 0.000 0.454 0.454
speed 0.084 0.064 1.314 0.189 0.162 0.162
textual ~~
speed 0.149 0.063 2.365 0.018 0.265 0.265

Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.x1 (.26.) 4.921 0.091 54.361 0.000 4.921 4.219
.x2 (.27.) 6.092 0.078 77.977 0.000 6.092 4.921
.x3 2.487 0.093 26.803 0.000 2.487 2.146
.x9 (.29.) 5.390 0.070 76.980 0.000 5.390 5.360
.x4 (.30.) 2.778 0.087 31.938 0.000 2.778 2.411
.x5 (.31.) 4.034 0.096 41.832 0.000 4.034 3.186
.x6 (.32.) 1.926 0.079 24.426 0.000 1.926 1.899
.x7 4.432 0.086 51.521 0.000 4.432 4.125
.x8 (.34.) 5.574 0.077 72.605 0.000 5.574 5.693
visual 0.000 0.000 0.000
textual 0.000 0.000 0.000
speed 0.000 0.000 0.000

Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.x1 0.608 0.127 4.774 0.000 0.608 0.447
.x2 1.247 0.153 8.123 0.000 1.247 0.813
.x3 0.857 0.123 6.964 0.000 0.857 0.638
.x9 0.637 0.088 7.272 0.000 0.637 0.630
.x4 0.447 0.069 6.431 0.000 0.447 0.337
.x5 0.497 0.082 6.084 0.000 0.497 0.310
.x6 0.265 0.050 5.276 0.000 0.265 0.258
.x7 0.795 0.114 6.981 0.000 0.795 0.689
.x8 0.445 0.107 4.174 0.000 0.445 0.464
visual 0.753 0.156 4.819 0.000 1.000 1.000
textual 0.880 0.131 6.692 0.000 1.000 1.000
speed 0.360 0.089 4.023 0.000 1.000 1.000

Group 2 [Grant-White]:

Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
visual =~
x1 1.000 0.855 0.734
x2 (.p2.) 0.617 0.099 6.257 0.000 0.528 0.477
x3 (.p3.) 0.804 0.103 7.836 0.000 0.687 0.663
x9 (.p4.) 0.453 0.081 5.577 0.000 0.387 0.382
textual =~
x4 1.000 0.932 0.847
x5 (.p6.) 1.121 0.066 16.984 0.000 1.046 0.862
x6 (.p7.) 0.931 0.056 16.587 0.000 0.868 0.795
speed =~
x7 1.000 0.727 0.702
x8 (.p9.) 1.195 0.164 7.300 0.000 0.869 0.833
x9 (.10.) 0.682 0.107 6.359 0.000 0.496 0.489

Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
visual ~~
textual 0.435 0.096 4.544 0.000 0.545 0.545
speed 0.244 0.078 3.139 0.002 0.393 0.393
textual ~~
speed 0.182 0.072 2.526 0.012 0.268 0.268

Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.x1 (.26.) 4.921 0.091 54.361 0.000 4.921 4.225
.x2 (.27.) 6.092 0.078 77.977 0.000 6.092 5.508
.x3 1.970 0.105 18.726 0.000 1.970 1.900
.x9 (.29.) 5.390 0.070 76.980 0.000 5.390 5.319
.x4 (.30.) 2.778 0.087 31.938 0.000 2.778 2.524
.x5 (.31.) 4.034 0.096 41.832 0.000 4.034 3.327
.x6 (.32.) 1.926 0.079 24.426 0.000 1.926 1.764
.x7 4.000 0.096 41.737 0.000 4.000 3.862
.x8 (.34.) 5.574 0.077 72.605 0.000 5.574 5.345
visual 0.032 0.125 0.253 0.800 0.037 0.037
textual 0.576 0.117 4.919 0.000 0.618 0.618
speed -0.079 0.095 -0.834 0.404 -0.109 -0.109

Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.x1 0.625 0.117 5.360 0.000 0.625 0.461
.x2 0.945 0.122 7.767 0.000 0.945 0.773
.x3 0.603 0.094 6.413 0.000 0.603 0.561
.x9 0.480 0.073 6.616 0.000 0.480 0.467
.x4 0.342 0.062 5.534 0.000 0.342 0.282
.x5 0.377 0.073 5.135 0.000 0.377 0.256
.x6 0.438 0.067 6.572 0.000 0.438 0.367
.x7 0.545 0.092 5.928 0.000 0.545 0.508
.x8 0.332 0.101 3.299 0.001 0.332 0.306
visual 0.732 0.154 4.757 0.000 1.000 1.000
textual 0.869 0.131 6.660 0.000 1.000 1.000
speed 0.528 0.119 4.431 0.000 1.000 1.000

fitMeasures(fitE, c("chisq", "df", "pvalue", "aic", "bic", "cfi", "rmsea", "srmr"))


chisq df pvalue aic bic cfi rmsea srmr
95.064 57.000 0.001 7445.608 7634.671 0.957 0.067 0.057

From this output, we can conclude that:

▪ In the section “Intercepts”, the indicator intercepts and latent means are now shown by the summary()
function. Note that, by default, the cfa() and sem() functions fix the latent means to zero (this was
also the case in Step 3 & 4. Parameter estimation & Model evaluation); otherwise, the model wouldn’t
be identified. Only because we now have two groups, with the first group serving as reference, can the latent means of the second group differ from zero.
▪ The loadings and indicator intercepts are equated across the two groups, which was desired. Note
that this is indicated in lavaan by the .p2. to .10. labels (loadings) and the .26. to .34. labels (indicator
intercepts). For x3 and x7, the intercepts are not labelled, as these indicator intercepts were set free.

▪ The model fit, based on the χ2 statistic, CFI-index, RMSEA and SRMR, is close to reasonable.
▪ The latent means of group 1 (Pasteur) are all fixed to 0.000. We can consider group 1 as
the reference group, which is necessary to compare Pasteur with Grant-White. The latent means
of group 2 (Grant-White) are 0.032 (visual), 0.576 (textual) and -0.079 (speed). Notice that we
cannot interpret these values directly; we can only compare them between the two groups. For
instance, we can conclude that, after allowing partial scalar measurement invariance, the students
of the Grant-White school globally perform better on the latent variable textual (p = 0.000), while
significant differences between the two schools for the latent variables visual (p = 0.800) and speed
(p = 0.404) are not present (significance level: α = 0.05).

Extra: You can also test for strict invariance or residual invariance, i.e. the residual variances of the
indicators are also equated across groups. This is not a prerequisite for testing latent mean differences,
because the indicator residuals are not part of the latent variable(s), so their invariance is inconsequential
to the interpretation of latent mean differences. Nevertheless, residual invariance is still reported in
many tests of measurement invariance. If you want to account for strict invariance in lavaan, you can
specify a CFA/SEM model with group.equal = c("intercepts", "loadings", "residuals"). However, including
this strict invariance model in the lavTestLRT() function to decide whether latent mean difference testing
can be executed may lead to an over-rejection of your group comparison model.
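
As a minimal sketch of how this could look in practice (fitF is simply a name chosen here for illustration; HS.model2, data, fitA, fitB and fitC are the objects created in the measurement invariance code above):

## Strict (residual) invariance: loadings, indicator intercepts and residual
## variances are all constrained to be equal across the two school groups.
fitF <- cfa(HS.model2, data, group = "school",
            group.equal = c("intercepts", "loadings", "residuals"))
## Adding the strict invariance model to the likelihood ratio test; keep the
## over-rejection caveat above in mind when interpreting the result.
lavTestLRT(fitA, fitB, fitC, fitF)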

Categorical data
Binary, ordinal and nominal observable variables result in categorical data. It makes a big difference if
these categorical variables are exogenous (independent) or endogenous (dependent) in the model.

1. Exogenous observable variables


If you have a binary exogenous observable variable (e.g. gender), all you need to do is to recode it as
a dummy (0/1) variable. If you have an ordinal exogenous observable variable (e.g. a Likert scale), you
can use a coding scheme that reflects the order (e.g. 1, 2, 3, etc.) and treat it as any other (numerical)
observable variable. If you have a nominal exogenous observable variable with K > 2 levels (e.g. place
of living), you need to replace it by a set of K-1 dummy variables.
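
As a hypothetical sketch (the columns gender, likert and region below are made-up examples that are not part of the datasets used in this tutorial, and data stands for your own data.frame), such recoding could look as follows:

## Binary exogenous variable: recode as a 0/1 dummy variable.
data$gender_d <- ifelse(data$gender == "female", 1, 0)
## Ordinal exogenous variable (e.g. a Likert scale): use numeric codes that reflect
## the order (assuming the factor levels are already in the intended order).
data$likert_n <- as.numeric(data$likert)
## Nominal exogenous variable with K > 2 levels: replace it by K-1 dummy variables.
region_dummies <- model.matrix(~ region, data = data)[, -1]
data <- cbind(data, region_dummies)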

2. Endogenous observable variables


The lavaan package can deal with both binary and ordinal but not with nominal endogenous observable
variables. Two ways exist to communicate to lavaan that some of the endogenous observable variables
should be treated as categorical:

2A. Declare them as ‘ordered’ in your data.frame before you run the analysis. For instance, if you need
to declare four observable variables a1, a2, a3 and a4 as ordinal in your data.frame “data”, use:
data[,c("a1","a2","a3","a4")] <- lapply(data[,c("a1","a2","a3","a4")], ordered)

2B. Use the ordered argument when using one of the fitting functions, i.e. cfa() or sem(). For example,
if you have four binary or ordinal observable variables a1, a2, a3 and a4, you can use:
fit <- cfa(model, data, ordered = c("a1","a2","a3","a4"))

If all the endogenous observable variables are to be treated as categorical, you can use the argument
“ordered = TRUE” as a shortcut. When the “ordered = …” argument is used, lavaan will automatically
switch to the WLSMV estimator: it will use diagonally weighted least squares (DWLS) to estimate the

model parameters and it will use the full weight covariance matrix to compute robust standard errors
as well as a mean- and variance-adjusted test statistic. Other options are ULSMV and PML.
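
As a small sketch of both options (model and data are placeholders for your own model specification and data.frame):

## Treat all endogenous observable variables as ordinal; lavaan then switches to
## the WLSMV estimator (DWLS estimation, robust standard errors and a mean- and
## variance-adjusted test statistic) automatically.
fit <- cfa(model, data, ordered = TRUE)
## The estimator can also be requested explicitly:
fit <- cfa(model, data, ordered = TRUE, estimator = "WLSMV")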

Visualizing the model


You can also visualize the path diagram (the CFA model), by using the function lavaanPlot(), which is
part of the equally named package lavaanPlot. The argument “coefs = TRUE” can be used to show the
loadings. The argument “sig = 0.05” can be used in order to show only significant loadings (p < 0.05).
library(lavaanPlot)
labels = list(visual = "Visual", textual = "Textual", speed = "Speed",
x1 = "Visual perception", x2 = "Cubes", x3 = "Lozenges",
x4 = "Paragraph comprehension", x5 = "Sentence completion",
x6 = "Word meaning", x7 = "Speeded addition",
x8 = "Speeded dot counting", x9 = "Speeded capitals")

lavaanPlot(model = fit2,
labels = labels,
node_options = list(shape = "box", fontname = "Helvetica"),
edge_options = list(color = "grey"),
coefs = TRUE,
sig = 0.05)

Other possible arguments for lavaanPlot include:

stand = TRUE        Only show standardized paths
covs = TRUE         Include model covariances (double-headed arrows)
stars = "covs"      Include significance stars (***) for covariances
stars = "regress"   Include significance stars (***) for regressions
stars = "latent"    Include significance stars (***) for loadings
digits = 2          Change the number of decimal places (here 2) in the coefficient value labels (loadings and covariances)
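
A sketch combining several of these arguments (reusing the fit2 object and the labels list defined above) could look like this:

lavaanPlot(model = fit2,
           labels = labels,
           node_options = list(shape = "box", fontname = "Helvetica"),
           edge_options = list(color = "grey"),
           coefs = TRUE, stand = TRUE, covs = TRUE,
           stars = c("latent", "covs"), digits = 2)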

More information and practical examples can be found on:

https://cran.r-project.org/web/packages/lavaanPlot/vignettes/Intro_to_lavaanPlot.html
https://cran.r-project.org/web/packages/lavaanPlot/lavaanPlot.pdf

5. Structural equation modelling
In our second example, we will study structural equation modelling (SEM), using the sem() function.
The lavaan package contains a built-in dataset called PoliticalDemocracy. This dataset consists of 75
observations on 11 observable variables that measure political democracy (dem) or industrialization (ind)
in developing countries in 1960 or 1965. The proposed SEM model is shown
in Figure 2 and consists of 3 latent variables (dem60, dem65, ind60) and the original 11 indicators:

▪ y1: Expert ratings of the freedom of the press in 1960


▪ y2: The freedom of political opposition in 1960
▪ y3: The fairness of elections in 1960
▪ y4: The effectiveness of the elected legislature in 1960
▪ y5: Expert ratings of the freedom of the press in 1965
▪ y6: The freedom of political opposition in 1965
▪ y7: The fairness of elections in 1965
▪ y8: The effectiveness of the elected legislature in 1965
▪ x1: The gross national product (GNP) per capita in 1960
▪ x2: The inanimate energy consumption per capita in 1960
▪ x3: The percentage of the labour force in the industry in 1960

Figure 2 - Path diagram of the SEM model for the PoliticalDemocracy dataset.

In the path diagram above, all observable variables are endogenous (dependent). The latent variables
dem60 and dem65 are also endogenous, while the latent variable ind60 is exogenous (independent).

Step 0. Pre-analysis of the dataset


## Clear global RStudio environment and load libraries.
rm(list = ls())
library(lavaan)
library(lavaanPlot)
library(dplyr)
library(semTools)
library(knitr)
library(mvnormalTest)

## Load data.
data <- PoliticalDemocracy
head(data)

y1 y2 y3 y4 y5 y6 y7 y8 x1 x2 x3
1 2.50 0.000000 3.333333 0.000000 1.250000 0.000000 3.726360 3.333333 4.442651 3.637586 2.557615
2 1.25 0.000000 3.333333 0.000000 6.250000 1.100000 6.666666 0.736999 5.384495 5.062595 3.568079
3 7.50 8.800000 9.999998 9.199991 8.750000 8.094061 9.999998 8.211809 5.961005 6.255750 5.224433
4 8.90 8.800000 9.999998 9.199991 8.907948 8.127979 9.999998 4.615086 6.285998 7.567863 6.267495
5 10.00 3.333333 9.999998 6.666666 7.500000 3.333333 9.999998 6.666666 5.863631 6.818924 4.573679
6 7.50 3.333333 6.666666 6.666666 6.250000 1.100000 6.666666 0.368500 5.533389 5.135798 3.892270

## Assessing normality.
mardia(data)
$mv.test
Test Statistic p-value Result
1 Skewness 344.4944 0.0101 NO
2 Kurtosis -0.2581 0.7963 YES
3 MV Normality <NA> <NA> NO

$uv.shapiro
W p-value UV.Normality
y1 0.9316 6e-04 No
y2 0.8311 0 No
y3 0.8547 0 No
y4 0.8994 0 No
y5 0.9598 0.0179 No
y6 0.8107 0 No
y7 0.8715 0 No
y8 0.9006 0 No
x1 0.9751 0.1454 Yes
x2 0.9763 0.1715 Yes
x3 0.9734 0.1145 Yes

Based on the test above, we can conclude with 95% confidence that 8 out of 11 indicators don’t follow
a normal distribution. Thus, we will use a more robust form of the maximum likelihood estimator.

Step 1. Model specification


model <- ' ## Measurement model (latent variable definitions)
ind60 =~ x1 + x2 + x3
dem60 =~ y1 + y2 + y3 + y4
dem65 =~ y5 + y6 + y7 + y8

## Structural model: regressions


dem60 ~ ind60
dem65 ~ ind60 + dem60

## Structural model: residual correlations between indicators


y1 ~~ y5 ## Covariance
y2 ~~ y4 + y6 ## Covariances
y3 ~~ y7 ## Covariance
y4 ~~ y8 ## Covariance
y6 ~~ y8' ## Covariance

We already know from Section 4 that all exogenous (independent) latent variables are correlated by
default by the sem() function. However, this is not the case for endogenous latent variables, which is
why we need to add three regressions (dem60 ~ ind60, dem65 ~ ind60, dem65 ~ dem60) in the model
specification. We also know from Section 4 that residual (co)variances are added automatically. Still,
we have added six residual covariances in the model specification (y1 ~~ y5 to y6 ~~ y8). This is only
done when it is believed that observable variables have something in common that is not captured by
the latent variables. In the case of y1 ~~ y5 for instance, both indicators refer to identical scores (expert
ratings of the freedom of press), but measured in two different years (1960 and 1965, resp.). As there
is no latent variable that measures this ‘relationship’, we have added this residual covariance.

Step 2. Model identification
Without RStudio:
▪ The number of unique elements in the model-implied covariance matrix is equal to:
K × (K + 1) / 2 = 11 × (11 + 1) / 2 = 66
with K = the number of observable variables = 11
▪ The number of free parameters is equal to the sum of the number of loadings (11), the number
of regressions (3), the number of variances (14), and the number of covariances (6), minus the
number of loadings that correspond to the first indicator of a latent variable (3). This gives us 31
free parameters, which is smaller than the number of unique elements in the model-implied cova-
riance matrix (66). The residual df = 66 - 31 = 35 > 0. The model is thus over-identified.
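
These counts can also be checked in R; the small sketch below mirrors the hand calculation, and the commented fitMeasures() call assumes the fit object that is created in the next code block.

## Number of unique elements in the model-implied covariance matrix.
K <- 11
K * (K + 1) / 2   # 66
## After fitting the model (see "With RStudio" below), lavaan reports the
## corresponding quantities directly:
## fitMeasures(fit, c("npar", "df"))   # npar = 31, df = 35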

With RStudio:
## Fit the model to a SEM model.
fit <- sem(model, data, estimator = "MLR", likelihood = "wishart")

lavaan 0.6.14 ended normally after 69 iterations


Estimator ML
Optimization method NLMINB
Number of model parameters 31
Number of observations 75

Model Test User Model:


Standard Scaled
Test Statistic 37.617 41.961
Degrees of freedom 35 35
P-value (Chi-square) 0.350 0.195
Scaling correction factor 0.896
Yuan-Bentler correction (Mplus variant)

From this output, we can conclude that:


▪ The model was successfully fitted to the dataset after 69 iterations.
▪ The model is over-identified (df = 35 > 0), the number of model parameters is equal to 31, and
the number of unique elements in the model-implied covariance matrix is equal to 31 + 35 = 66.
▪ In this case, the scaling correction factor is clearly lower than 1.000 (i.e. 0.896), although 8 out of 11
indicators are not normally distributed. Despite the fact that we desire robust standard errors,
we also prefer a high model fit. Therefore, we will continue to work with the ML estimator in this
example. Note that there is no “correct” or “wrong” choice in this example: the use of either estimator
can be justified. However, researchers should always motivate their choice for a certain estimator.
fit <- sem(model, data, likelihood = "wishart")

Step 3 & 4. Parameter estimation & Model evaluation


## Summary of the model.
summary(fit, fit.measures = TRUE, standardized = TRUE, rsq = TRUE)

lavaan 0.6.14 ended normally after 69 iterations


Estimator ML
Optimization method NLMINB
Number of model parameters 31
Number of observations 75

Model Test User Model:


Test statistic 37.617
Degrees of freedom 35
P-value (Chi-square) 0.350

Model Test Baseline Model:
Test statistic 720.912
Degrees of freedom 55
P-value 0.000

User Model versus Baseline Model:

Comparative Fit Index (CFI) 0.996


Tucker-Lewis Index (TLI) 0.994

Loglikelihood and Information Criteria:

Loglikelihood user model (H0) -1553.328


Loglikelihood unrestricted model (H1) -1534.265
Akaike (AIC) 3168.656
Bayesian (BIC) 3240.498
Sample-size adjusted Bayesian (SABIC) 3142.794

Root Mean Square Error of Approximation:

RMSEA 0.032
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.091
P-value H_0: RMSEA <= 0.050 0.629
P-value H_0: RMSEA >= 0.080 0.107

Standardized Root Mean Square Residual:

SRMR 0.044

Parameter Estimates:

Standard errors Standard


Information Expected
Information saturated (h1) model Structured

Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
ind60 =~
x1 1.000 0.674 0.920
x2 2.180 0.139 15.636 0.000 1.470 0.973
x3 1.819 0.153 11.887 0.000 1.226 0.872
dem60 =~
y1 1.000 2.238 0.850
y2 1.257 0.184 6.842 0.000 2.813 0.717
y3 1.058 0.152 6.940 0.000 2.367 0.722
y4 1.265 0.146 8.664 0.000 2.831 0.846
dem65 =~
y5 1.000 2.117 0.808
y6 1.186 0.170 6.977 0.000 2.510 0.746
y7 1.280 0.161 7.948 0.000 2.709 0.824
y8 1.266 0.159 7.953 0.000 2.680 0.828

Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
dem60 ~
ind60 1.483 0.402 3.691 0.000 0.447 0.447
dem65 ~
ind60 0.572 0.223 2.569 0.010 0.182 0.182
dem60 0.837 0.099 8.457 0.000 0.885 0.885

Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.y1 ~~
.y5 0.632 0.366 1.729 0.084 0.632 0.296
.y2 ~~
.y4 1.331 0.716 1.858 0.063 1.331 0.273
.y6 2.182 0.749 2.914 0.004 2.182 0.356
.y3 ~~
.y7 0.806 0.620 1.299 0.194 0.806 0.191
.y4 ~~
.y8 0.353 0.451 0.782 0.434 0.353 0.109
.y6 ~~
.y8 1.374 0.580 2.370 0.018 1.374 0.338

Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.x1 0.083 0.020 4.156 0.000 0.083 0.154
.x2 0.121 0.071 1.707 0.088 0.121 0.053
.x3 0.473 0.092 5.142 0.000 0.473 0.239
.y1 1.917 0.453 4.227 0.000 1.917 0.277
.y2 7.473 1.402 5.331 0.000 7.473 0.486
.y3 5.136 0.971 5.289 0.000 5.136 0.478
.y4 3.190 0.754 4.232 0.000 3.190 0.285
.y5 2.383 0.490 4.863 0.000 2.383 0.347
.y6 5.021 0.933 5.382 0.000 5.021 0.443
.y7 3.478 0.727 4.781 0.000 3.478 0.322
.y8 3.298 0.709 4.653 0.000 3.298 0.315
ind60 0.454 0.088 5.138 0.000 1.000 1.000
.dem60 4.009 0.940 4.266 0.000 0.800 0.800
.dem65 0.175 0.219 0.798 0.425 0.039 0.039

R-Square:
Estimate
x1 0.846
x2 0.947
x3 0.761
y1 0.723
y2 0.514
y3 0.522
y4 0.715
y5 0.653
y6 0.557
y7 0.678
y8 0.685
dem60 0.200
dem65 0.961

Interpretation: model fit


The χ2 statistic is equal to 37.617 with 35 degrees of freedom and a p-value = 0.350 > 0.05. This
indicates that there are no significant differences between the model-implied and the original covariance
matrix and that the model fit is close. Also the CFI-index (0.996 ≥ 0.95), the RMSEA (0.032 ≤ 0.06)
and the SRMR (0.044 ≤ 0.05) indicate a close model fit. The R² values range from 0.200 (dem60,
regressed on ind60) to 0.961 (dem65, regressed on ind60 and dem60). Consequently, each endogenous
variable appears to have a substantial relationship with its corresponding exogenous variable(s), although large differences can be observed.
Interpretation: latent variables
Every loading appears to be statistically significant in this example, as the P(>|z|) value is always equal
to 0.000 < 0.05. This means that every observable variable in this example has a statistically significant
relationship with its corresponding latent variable.
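
If you prefer to extract these loadings and their p-values programmatically rather than reading them from the full summary, a small sketch using lavaan’s parameterEstimates() function is shown below (est is simply a helper name used here).

est <- parameterEstimates(fit, standardized = TRUE)
## Keep only the loading rows (operator "=~") and the most relevant columns.
est[est$op == "=~", c("lhs", "op", "rhs", "est", "se", "pvalue", "std.all")]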
Interpretation: regressions
Every regression coefficient appears to be statistically significant in this example, as the P(>|z|) value
is always < 0.05 (0.000 for dem60 ~ ind60 and dem65 ~ dem60, and 0.010 for dem65 ~ ind60). This means
that all structural relationships between the endogenous and exogenous latent variables in this example are statistically significant.

Interpretation: covariances and variances


Only the covariances “y2 ~~ y6” and “y6 ~~ y8” are statistically significant, i.e. P(>|z|) < 0.05. Only the
variances of x2 and dem65 are not significantly different from zero, i.e. P(>|z|) > 0.05.

Step 5. Model Modification


If we want to simplify the model, we can choose to omit all non-significant covariances. However, this
can lead to a reduction in model fit and even to an increase in AIC and BIC, despite the reduction in the
number of free parameters. Because of this, and also because the model fit was already
very close, we will not make model modifications in this example.
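
If you nevertheless want to see what such a simplification would do, the sketch below drops the four non-significant residual covariances (model.simple and fit.simple are names chosen here for illustration); the resulting fit measures can then be compared with those of the original model.

model.simple <- ' ## Measurement model (latent variable definitions)
                  ind60 =~ x1 + x2 + x3
                  dem60 =~ y1 + y2 + y3 + y4
                  dem65 =~ y5 + y6 + y7 + y8

                  ## Structural model: regressions
                  dem60 ~ ind60
                  dem65 ~ ind60 + dem60

                  ## Only the significant residual covariances are retained
                  y2 ~~ y6
                  y6 ~~ y8'

fit.simple <- sem(model.simple, data, likelihood = "wishart")
fitMeasures(fit.simple, c("chisq", "df", "pvalue", "aic", "bic", "cfi", "rmsea", "srmr"))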

Multiple group analysis


In this example, there is only one group of observations. However, to illustrate the use of the sem()
function in multiple group analysis, we will arbitrarily assume that the first 37 observations are made
in Costa Rica, and the last 38 in Moldova. The code below can then be used to compare the latent
means of dem60, dem65 and ind60 between these two countries. In other words: the code below
allows us to determine which of the two countries (Costa Rica or Moldova) has the highest level of
political democracy (dem) and/or industrialization (ind) in 1960 and 1965.
## Load data and add a column that contains the name of the country for which the obser-
## vation is made (Costa Rica: first 37 observations, Moldova: last 38 observations).
rm(list = ls())
data <- PoliticalDemocracy
Country <- rep(c("Costa Rica","Moldova"), c(37,38)) ## rep(x,n): replicate x n times
groupdata <- cbind(as.data.frame(Country),data) ## cbind(): combine columns
> groupdata
Country y1 y2 y3 y4 y5 y6 y7 y8 x1 x2 x3
1 Costa Rica 2.50 0.000000 3.333333 0.000000 1.250000 0.000000 3.726360 3.333333 4.442651 3.637586 2.557615
2 Costa Rica 1.25 0.000000 3.333333 0.000000 6.250000 1.100000 6.666666 0.736999 5.384495 5.062595 3.568079

37 Costa Rica 5.20 4.999998 6.600000 3.333333 3.633403 1.100000 3.314128 3.333333 5.564520 5.236442 2.677633
38 Moldova 5.00 3.333333 6.400000 6.666666 2.844997 0.000000 4.429657 1.485166 4.727388 3.610918 1.418971

75 Moldova 3.75 3.333333 0.000000 0.000000 1.250000 3.333333 0.000000 0.000000 4.488636 4.897840 2.867566

## Model specification (this model is identical to the one in Step 1).


model <- ' ## Measurement model (latent variable definitions)
ind60 =~ x1 + x2 + x3
dem60 =~ y1 + y2 + y3 + y4
dem65 =~ y5 + y6 + y7 + y8

## Structural model: regressions


dem60 ~ ind60
dem65 ~ ind60 + dem60

## Structural model: residual correlations between indicators


y1 ~~ y5 ## Covariance
y2 ~~ y4 + y6 ## Covariances
y3 ~~ y7 ## Covariance
y4 ~~ y8 ## Covariance
y6 ~~ y8' ## Covariance

## Testing for measurement invariance.


## Model 1: Configural invariance.
fitA <- sem(model, groupdata, group = "Country")
## Model 2: Weak or metric invariance.
fitB <- sem(model, groupdata, group = "Country", group.equal = "loadings")
## Model 3: Strong or scalar invariance.
fitC <- sem(model, groupdata, group = "Country", group.equal = c("loadings","intercepts"))
## Model comparison test
lavTestLRT(fitA, fitB, fitC)
Chi-Squared Difference Test
Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
fitA 70 3189.3 3383.9 82.203
fitB 78 3178.8 3355.0 87.781 5.5773 0.00000 8 0.6944564
fitC 86 3189.9 3347.5 114.874 27.0931 0.25228 8 0.0006813 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Weak invariance is supported in this dataset, but strong invariance is not. Therefore, we will fix
all indicator intercepts to zero and calculate the MIs.

modeltest <- ' ## Measurement model (latent variable definitions)
ind60 =~ x1 + x2 + x3
dem60 =~ y1 + y2 + y3 + y4
dem65 =~ y5 + y6 + y7 + y8

## Structural model: regressions


dem60 ~ ind60
dem65 ~ ind60 + dem60

## Structural model: residual correlations between indicators


y1 ~~ y5 ## Covariance
y2 ~~ y4 + y6 ## Covariances
y3 ~~ y7 ## Covariance
y4 ~~ y8 ## Covariance
y6 ~~ y8 ## Covariance

## Fix indicator intercepts to zero.


x1 + x2 + x3 + y1 + y2 + y3 + y4 + y5 + y6 + y7 + y8 ~ 0*1'

fitD <- sem(modeltest, groupdata, group = "Country", group.equal = c("loadings","intercepts"))

Warning message:
In lav_object_post_check(object) :
lavaan WARNING: some estimates of variances are negative

Negative variances can occur due to (i) model misspecifications, (ii) a small sample size, (iii) sampling
fluctuations, (iv) observable variables that have different scales, (v) observable dependent variables
that are skewed or non-normally distributed, (vi) missing data, (vii) outliers, etc.

Since negative variances are impossible in the population, researchers always need to determine the
reason for their occurrence. In this case, the negative variances are most probably caused by randomly
dividing our dataset into two smaller datasets without any factual evidence that this division holds true.

Because we only need fitD to calculate our modification indices and not to compare the values of
latent means, we can easily avoid the negative variances by fixing the variances of the corresponding
variables to their values in the complete dataset (i.e. no group division, as we know that no negative
variances occurred when we used the full dataset in Step 3 & 4). However, if negative variances
were to appear in fitE, the model we will use to compare the values of latent means, this solution
would not be appropriate because it can lead to deviating, incorrect estimates and misinterpretations
in the comparison of groups. To solve these kinds of problems, we refer to two reference works:
(i) Kolenikov, S., & Bollen, K.A. (2012). Testing negative error variances: Is a Heywood case a symptom of
misspecification? Sociological Methods & Research, 41(1). 124-167. doi: 10.1177/0049124112442138.
(ii) Savalei, V., & Kolenikov, S. (2008). Constrained versus unconstrained estimation in structural equation
modelling. Psychological Methods, 13(2). 150-170. doi: 10.1037/1082-989X.13.2.150.

## Solution for our specific problem.


## Step 1: Calculate the R² values of the endogenous variables.
summary(fitD, rsq = TRUE) ## Only the R² values are displayed here

R-Square:
Estimate
x1 NA
x2 0.284
x3 0.168
y1 0.803
y2 0.260
y3 0.530
y4 0.387
y5 0.723
y6 0.230
y7 0.588
y8 0.518
dem60 0.060
dem65 0.959

## Step 2: As the R² value cannot be calculated for x1, we will calculate the variance
## of x1 in the full dataset and use this value to fix the variance of x1 in our model.
> var(data$x1)
[1] 0.5371487

modeltest <- ' ## Measurement model (latent variable definitions)


ind60 =~ x1 + x2 + x3
dem60 =~ y1 + y2 + y3 + y4
dem65 =~ y5 + y6 + y7 + y8

## Structural model: regressions


dem60 ~ ind60
dem65 ~ ind60 + dem60

## Structural model: residual correlations between indicators


y1 ~~ y5 ## Covariance
y2 ~~ y4 + y6 ## Covariances
y3 ~~ y7 ## Covariance
y4 ~~ y8 ## Covariance
y6 ~~ y8 ## Covariance
x1 ~~ 0.5371487*x1 ## Fixed variance of x1

## Fix indicator intercepts to zero.


x1 + x2 + x3 + y1 + y2 + y3 + y4 + y5 + y6 + y7 + y8 ~ 0*1'

## Step 3: We can now calculate the modification indices.


fitD <- sem(modeltest, groupdata, group = "Country", group.equal = c("loadings","intercepts"))
modindices(fitD, op = "~1", sort = TRUE, minimum.value = 3.84)

lhs op rhs block group level mi epc sepc.lv sepc.all sepc.nox


46 ind60 ~1 1 1 1 34.424 5.237 0.965 0.965 0.965
70 x1 ~1 2 2 1 8.405 0.559 0.559 0.463 0.463
76 y4 ~1 2 2 1 6.049 -1.017 -1.017 -0.318 -0.318
72 x3 ~1 2 2 1 4.254 -0.381 -0.381 -0.345 -0.345

Especially ind60, x1 and y4 have high MIs. However, note that ind60 is an exogenous (independent)
latent variable. As the intercept of ind60 is thus a latent mean and not an indicator intercept, there
is no need to include “ind60 ~1” in our fitE model since we didn’t fix ind60 to zero in fitD. In other
words, the latent mean of ind60 was already free: including it in fitE will not increase the χ2 statistic.
We will only set free the intercepts of x1 and y4. Note that we can’t set free the intercept of x3. If
we did this, the latent variable ind60 would only have one indicator left with equal intercepts
across groups (i.e. x2), since we already freed the intercept of x1.
fitE <- sem(model, groupdata, group = "Country",
group.equal = c("loadings", "intercepts"),
group.partial = c("y4~1", "x1~1")) ## Note: no negative variances
## are present!
lavTestLRT(fitA, fitB, fitE)

Chi-Squared Difference Test


Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
fitA 70 3189.3 3383.9 82.203
fitB 78 3178.8 3355.0 87.781 5.5773 0.00000 8 0.69446
fitE 84 3178.0 3340.2 98.937 11.1563 0.15138 6 0.08366 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Since metric invariance and partial scalar invariance are now present (although the second p-value,
i.e. the p-value that evaluates partial scalar invariance, is only slightly larger than 0.05), we can now
compare the values of latent means across the two groups (Costa Rica and Moldova).
summary(fitE, fit.measures = TRUE, standardized = TRUE)
lavaan 0.6.14 ended normally after 114 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 87

Number of equality constraints 17
Number of observations per group:
Costa Rica 37
Moldova 38

Model Test User Model:


Test statistic 98.937
Degrees of freedom 84
P-value (Chi-square) 0.127
Test statistic for each group:
Costa Rica 57.401
Moldova 41.536

Model Test Baseline Model:


Test statistic 808.372
Degrees of freedom 110
P-value 0.000

User Model versus Baseline Model:


Comparative Fit Index (CFI) 0.979
Tucker-Lewis Index (TLI) 0.972

Loglikelihood and Information Criteria:


Loglikelihood user model (H0) -1519.000
Loglikelihood unrestricted model (H1) -1469.532
Akaike (AIC) 3178.000
Bayesian (BIC) 3340.224
Sample-size adjusted Bayesian (SABIC) 3119.603

Root Mean Square Error of Approximation:


RMSEA 0.069
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.118
P-value H_0: RMSEA <= 0.050 0.295
P-value H_0: RMSEA >= 0.080 0.387

Standardized Root Mean Square Residual:


SRMR 0.086

Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured

Group 1 [Costa Rica]:

Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
ind60 =~
x1 1.000 0.651 0.938
x2 (.p2.) 2.276 0.133 17.049 0.000 1.481 0.997
x3 (.p3.) 1.902 0.159 11.960 0.000 1.237 0.863
dem60 =~
y1 1.000 2.230 0.829
y2 (.p5.) 1.267 0.179 7.085 0.000 2.826 0.736
y3 (.p6.) 1.021 0.150 6.789 0.000 2.277 0.721
y4 (.p7.) 1.201 0.136 8.866 0.000 2.679 0.895
dem65 =~
y5 1.000 2.099 0.795
y6 (.p9.) 1.249 0.171 7.300 0.000 2.622 0.799
y7 (.10.) 1.299 0.163 7.988 0.000 2.726 0.849
y8 (.11.) 1.343 0.159 8.428 0.000 2.819 0.808

Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
dem60 ~
ind60 1.858 0.545 3.409 0.001 0.542 0.542
dem65 ~
ind60 0.959 0.334 2.875 0.004 0.297 0.297
dem60 0.734 0.121 6.070 0.000 0.780 0.780

Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.y1 ~~
.y5 0.704 0.520 1.353 0.176 0.704 0.292
.y2 ~~
.y4 1.692 0.932 1.816 0.069 1.692 0.486
.y6 1.259 0.792 1.590 0.112 1.259 0.246
.y3 ~~
.y7 1.409 0.787 1.791 0.073 1.409 0.380
.y4 ~~
.y8 -0.403 0.504 -0.801 0.423 -0.403 -0.146
.y6 ~~
.y8 1.201 0.847 1.419 0.156 1.201 0.296

Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.x1 5.340 0.114 46.849 0.000 5.340 7.702
.x2 (.36.) 5.180 0.244 21.213 0.000 5.180 3.487
.x3 (.37.) 3.883 0.222 17.499 0.000 3.883 2.709
.y1 (.38.) 5.702 0.423 13.482 0.000 5.702 2.120
.y2 (.39.) 4.514 0.576 7.838 0.000 4.514 1.175
.y3 (.40.) 6.833 0.472 14.474 0.000 6.833 2.163
.y4 5.416 0.487 11.113 0.000 5.416 1.808
.y5 (.42.) 5.358 0.402 13.333 0.000 5.358 2.029
.y6 (.43.) 3.163 0.507 6.242 0.000 3.163 0.964
.y7 (.44.) 6.480 0.505 12.831 0.000 6.480 2.019
.y8 (.45.) 4.315 0.526 8.201 0.000 4.315 1.237
ind60 0.000 0.000 0.000
.dem60 0.000 0.000 0.000
.dem65 0.000 0.000 0.000

Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.x1 0.057 0.018 3.126 0.002 0.057 0.120
.x2 0.014 0.065 0.209 0.835 0.014 0.006
.x3 0.524 0.130 4.015 0.000 0.524 0.255
.y1 2.264 0.681 3.324 0.001 2.264 0.313
.y2 6.767 1.844 3.670 0.000 6.767 0.459
.y3 4.790 1.251 3.830 0.000 4.790 0.480
.y4 1.793 0.746 2.403 0.016 1.793 0.200
.y5 2.572 0.692 3.717 0.000 2.572 0.369
.y6 3.885 1.061 3.663 0.000 3.885 0.361
.y7 2.874 0.840 3.424 0.001 2.874 0.279
.y8 4.230 1.211 3.494 0.000 4.230 0.347
ind60 0.423 0.108 3.905 0.000 1.000 1.000
.dem60 3.513 1.101 3.190 0.001 0.706 0.706
.dem65 0.230 0.273 0.842 0.400 0.052 0.052

Group 2 [Moldova]:

Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
ind60 =~
x1 1.000 0.584 0.895
x2 (.p2.) 2.276 0.133 17.049 0.000 1.329 0.943
x3 (.p3.) 1.902 0.159 11.960 0.000 1.110 0.865
dem60 =~
y1 1.000 2.185 0.872
y2 (.p5.) 1.267 0.179 7.085 0.000 2.769 0.712
y3 (.p6.) 1.021 0.150 6.789 0.000 2.232 0.673
y4 (.p7.) 1.201 0.136 8.866 0.000 2.626 0.787
dem65 =~
y5 1.000 2.020 0.793
y6 (.p9.) 1.249 0.171 7.300 0.000 2.522 0.741
y7 (.10.) 1.299 0.163 7.988 0.000 2.623 0.795
y8 (.11.) 1.343 0.159 8.428 0.000 2.713 0.893

Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
dem60 ~
ind60 1.049 0.657 1.597 0.110 0.280 0.280
dem65 ~
ind60 0.539 0.288 1.874 0.061 0.156 0.156
dem60 0.861 0.108 7.955 0.000 0.932 0.932

Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.y1 ~~
.y5 0.807 0.482 1.674 0.094 0.807 0.423
.y2 ~~
.y4 1.557 0.968 1.607 0.108 1.557 0.276
.y6 2.325 1.088 2.138 0.033 2.325 0.372
.y3 ~~
.y7 0.664 0.943 0.704 0.482 0.664 0.135
.y4 ~~
.y8 1.003 0.619 1.621 0.105 1.003 0.356
.y6 ~~
.y8 0.947 0.652 1.453 0.146 0.947 0.303

Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.x1 5.120 0.123 41.727 0.000 5.120 7.852
.x2 (.36.) 5.180 0.244 21.213 0.000 5.180 3.678
.x3 (.37.) 3.883 0.222 17.499 0.000 3.883 3.025
.y1 (.38.) 5.702 0.423 13.482 0.000 5.702 2.274
.y2 (.39.) 4.514 0.576 7.838 0.000 4.514 1.160
.y3 (.40.) 6.833 0.472 14.474 0.000 6.833 2.061
.y4 4.191 0.594 7.059 0.000 4.191 1.256
.y5 (.42.) 5.358 0.402 13.333 0.000 5.358 2.104

.y6 (.43.) 3.163 0.507 6.242 0.000 3.163 0.929
.y7 (.44.) 6.480 0.505 12.831 0.000 6.480 1.965
.y8 (.45.) 4.315 0.526 8.201 0.000 4.315 1.420
ind60 -0.344 0.148 -2.330 0.020 -0.589 -0.589
.dem60 -0.152 0.569 -0.267 0.790 -0.069 -0.069
.dem65 0.179 0.274 0.651 0.515 0.088 0.088

Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.x1 0.084 0.028 3.061 0.002 0.084 0.199
.x2 0.218 0.111 1.967 0.049 0.218 0.110
.x3 0.415 0.121 3.445 0.001 0.415 0.252
.y1 1.511 0.538 2.811 0.005 1.511 0.240
.y2 7.478 1.904 3.927 0.000 7.478 0.494
.y3 6.011 1.522 3.951 0.000 6.011 0.547
.y4 4.241 1.185 3.579 0.000 4.241 0.381
.y5 2.406 0.675 3.567 0.000 2.406 0.371
.y6 5.217 1.364 3.825 0.000 5.217 0.451
.y7 3.998 1.099 3.639 0.000 3.998 0.368
.y8 1.871 0.688 2.720 0.007 1.871 0.203
ind60 0.341 0.088 3.865 0.000 1.000 1.000
.dem60 4.401 1.281 3.435 0.001 0.922 0.922
.dem65 0.107 0.251 0.427 0.669 0.026 0.026

From this output, we can conclude that the model fit, based on the χ2 statistic, CFI-index, RMSEA
and SRMR, is close to reasonable (only the RMSEA and SRMR are slightly too high). The latent means
of group 1 (Costa Rica, the reference group) are all fixed to 0.000. The latent means of group
2 (Moldova) are -0.344 (ind60, p = 0.020), -0.152 (dem60, p > 0.05) and 0.179 (dem65, p > 0.05). After
allowing partial scalar measurement invariance, we can thus conclude that the industrialization degree
of Costa Rica was significantly higher in 1960 compared to Moldova. Regarding political democracy in
1960 and 1965, the SEM model did not indicate significant differences between these countries.
Visualizing the model
library(lavaanPlot)
labels = list(ind60 = "Industrialization 1960",
dem60 = "Democracy 1960",
dem65 = "Democracy 1965",
y1 = "Press freedom 1960",
y2 = "Political opposition 1960",
y3 = "Fairness of elections 1960",
y4 = "Effective legislature 1960",
y5 = "Press freedom 1965",
y6 = "Political opposition 1965",
y7 = "Fairness of elections 1965",
y8 = "Effective legislature 1965",
x1 = "GNP PC 1960",
x2 = "Energy consumption PC 1960",
x3 = "Labour force % in industry 1960")

lavaanPlot(model = fit,
labels = labels,
node_options = list(shape = "box", fontname = "Helvetica"),
edge_options = list(color = "grey"),
coefs = TRUE, sig = 0.05)

6. Sources
Bean, Jerry. (2021). Using R for Social Work Research. Chapter 4: Confirmatory Factor Analysis, and
Chapter 5: Structural Equation Modelling. College of Social Work, Ohio State University.
Retrieved from https://bookdown.org/bean_jerry/using_r_for_social_work_research/.
Buchanan, Erin. (n.d.). Heywood Cases on the Latent Variable. Structural equation modelling with
lavaan in R. Retrieved from
https://projector-video-pdf-converter.datacamp.com/6419/chapter3.pdf.
Byrne, B.M., Shavelson, R.J., & Muthén, B. (1989). Testing for the equivalence of factor covariance
and mean structures: The issue of partial measurement invariance. Psychological Bulletin,
105(3). 456-466. doi: 10.1037/0033-2909.105.3.456.
Center for Statistical Training. (2019a). Can I estimate a structural equation model if the sample data
are not normally distributed? Retrieved from https://centerstat.org/can-i-estimate-an-sem-if-
the-sample-data-are-not-normally-distributed/.
Center for Statistical Training. (2019b). What are modification indices and should I use them when
fitting SEMs to my own data? Retrieved from https://centerstat.org/what-are-modification-
indices-and-should-i-use-them-when-fitting-sems-to-my-own-data/.
Collier, Joel E. (2020). Applied structural equation modelling using AMOS: Basic to advanced
techniques. Routledge. ISBN: 9780367435264.
Fan, Y., Chen, J., Shirkey, G., Ranjeet, J., Wu, S.R., Park, H., & Shao, C. (2016). Applications of
structural equation modelling (SEM) in ecological research: An updated review.
Ecological Processes, 5. doi: 10.1186/s13717-016-0063-3.
Henseler, J., Ringle, C.M., & Sarstedt, M. (2016). Testing measurement invariance of composites
using partial least squares. International Marketing Review, 33(3). 405-431.
doi: 10.1108/IMR-09-2014-0304.
Kolenikov, S., & Bollen, K.A. (2012). Testing negative error variances: Is a Heywood case a symptom
of misspecification? Sociological Methods & Research, 41(1). 124-167. doi:
10.1177/0049124112442138.
Lishinski, Alex. (2021). Intro to lavaanPlot. Retrieved from
https://cran.r-project.org/web/packages/lavaanPlot/vignettes/Intro_to_lavaanPlot.html.
Lishinski, Alex. (2022). Package ‘lavaanPlot’. Version 0.6.2. Retrieved from
https://cran.r-project.org/web/packages/lavaanPlot/lavaanPlot.pdf.
Meuleman, B., Żółtak, T., Pokropek, A., Davidov, E., Muthén, B., Oberski, D. L., Billiet, J., & Schmidt,
P. (2022). Why Measurement Invariance is Important in Comparative Research. A Response
to Welzel et al. (2021). Sociological Methods & Research, 0(0).
doi: 10.1177/00491241221091755.
Putnick, D.L., Bornstein, M.H. (2016). Measurement Invariance Conventions and Reporting: The
State of the Art and Future Directions for Psychological Research.
Developmental Review, 41. 71-90. doi: 10.1016/j.dr.2016.06.004.
Ramlall, Indranarain. (2016). Applied Structural Equation Modelling for Researchers and
Practitioners. Emerald Group Publishing Limited, Bingley. doi: 10.1108/9781786358820.
Rhudy, J.L., Arnau, R.C., Huber, F.A., Lannon, E.W., Kuhn, B.L., Palit, S., Payne, M.F., Sturycz, C.A.,
Hellman, N., Guereca, Y.M., Toledo, T.A., Shadlow, J.O. (2020). Examining Configural,
Metric, and Scalar Invariance of the Pain Catastrophizing Scale in Native American and Non-
Hispanic White Adults in the Oklahoma Study of Native American Pain Risk (OK-SNAP).
Journal of Pain Research, 13. 961-969. doi: 10.2147/JPR.S242126.
Rosseel, Yves, Vanbrabant, Leonard, van de Schoot, Rens. (2016). Being negative can be good for
your test. 4th Constrained Statistical Inference Meeting March 4, 2016. Tilburg University.
Retrieved from https://users.ugent.be/~yrosseel/lavaan/csi.pdf.

Rosseel, Yves. (2017). Mplus estimators: MLM and MLR. Department of Data Analysis. Ghent
University (Belgium). Retrieved from https://users.ugent.be/~yrosseel/lavaan/utrecht2010.pdf.
Rosseel, Yves. (2023a). Package ‘lavaan’. Version 0.6-14. Retrieved from
https://cran.r-project.org/web/packages/lavaan/lavaan.pdf.
Rosseel, Yves. (2023b). The lavaan tutorial. Department of Data Analysis. Ghent University
(Belgium). Retrieved from https://lavaan.ugent.be/tutorial/index.html.
Savalei, V., & Kolenikov, S. (2008). Constrained versus unconstrained estimation in structural
equation modelling. Psychological Methods, 13(2). 150-170.
doi: 10.1037/1082-989X.13.2.150.
UCLA. (2021a). Confirmatory Factor Analysis in R with lavaan. Advanced Research Computing
Statistical Methods and Data Analysis. Retrieved from
https://stats.oarc.ucla.edu/r/seminars/rcfa/.
UCLA. (2021b). Introduction to Structural Equation Modelling in R with lavaan. Advanced Research
Computing Statistical Methods and Data Analysis. Retrieved from
https://stats.oarc.ucla.edu/r/seminars/rsem/#s1b.
Xu, Kate. (2012). Multiple group measurement invariance analysis in lavaan. Department of
Psychiatry. University of Cambridge. Retrieved from
https://users.ugent.be/~yrosseel/lavaan/multiplegroup6Dec2012.pdf.
