Automatic Discovery of Latent Variable Models
Ricardo  Silva
August  2005
CMU-CALD-05-109
School  of  Computer  Science
Carnegie  Mellon  University
Pittsburgh,  PA  15213
Submitted in partial fulfillment of the requirements
for  the  degree  of  Doctor  of  Philosophy.
Thesis  Committee:
Richard  Scheines,  CMU  (Chair)
Clark  Glymour,  CMU
Tom  Mitchell,  CMU
Greg  Cooper,  University  of  Pittsburgh
Copyright © 2005 Ricardo Silva
This   work  was   partially  supported  by  NASA  under   Grants   No.   NCC2-1377,   NCC2-1295  and  NCC2-1227  to  the
Institute  for   Human  and  Machine  Cognition,   University  of   West   Florida.   This   research  was   also  supported  by  a
Siebel  Scholarship  and  a  Microsoft  Fellowship.
The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity.
Keywords:   graphical  models,  causality,  latent  variables
Abstract
Much  of our understanding of Nature comes from theories  about unobservable entities.   Identifying
which  hidden variables  exist  given  measurements  in the observable  world  is therefore an  important
step in the process of discovery.   Such an enterprise is only possible if the existence  of latent  factors
constrains  how  the  observable  world  can  behave.   We  do  not  speak  of  atoms,  genes  and  antibodies
because  we  see  them,   but  because  they  indirectly  explain  observable  phenomena  in  a  unique  way
under  generally  accepted  assumptions.
How  to  formalize  the  process  of  discovering  latent  variables  and  models  associated  with  them
is the goal of this thesis. More than finding a good probabilistic model that fits the data well, we
describe how, in some situations, we can identify causal features common to all models that equally
explain  the  data.   Such  common  features   describe  causal   relations   among  observed  and  hidden
variables.   Although  this  goal   might  seem  ambitious,   it  is  a  natural   extension  of   several   years  of
work  in  discovering  causal   models  from  observational   data  through  the  use  of   graphical   models.
Learning  causal   relations  without  experiments  basically  amounts  to  discovering  an  unobservable
fact  (does  A  cause  B?)   from  observable  measurements  (the  joint  distribution  of  a  set  of  variables
that  include  A  and  B).   We  take  this  idea  one  step  further  by  discovering  which  hidden  variables
exist  to  begin  with.
More specifically, we describe algorithms for learning causal latent variable models when observed variables are noisy linear measurements of unobservable entities, without postulating a priori which latents might exist. Most of the thesis concerns how to identify latents by describing which observed variables are their respective measurements. In some situations, we will also assume that latents are linearly dependent, and in this case causal relations among latents can be partially identified. While continuous variables are the main focus of the thesis, we also describe how to adapt this idea to the case where observed variables are ordinal or binary.
Finally, we examine density estimation, where knowing causal relations or the true model behind
a  data  generating  process  is  not  necessary.   However,   we  illustrate  how  ideas  developed  in  causal
discovery  can  help  the  design  of  algorithms  for  multivariate  density  estimation.
Acknowledgements
Everything  passed  so  fast  during  my  years  at  CMU,   and  yet  there  are  so  many  people  to  thank.
Richard  Scheines   and  Clark  Glymour   are   outstanding  tutors.   I   think  I   will   never   again  have
meetings  as  challenging  and  as  fun  as  those  that  we  had.   I  am  also  very  much  in  debt  to  Peter
Spirtes, Jiji Zhang and Teddy Seidenfeld for providing a helping hand whenever necessary, as well
as to my thesis committee members, Tom Mitchell and Greg Cooper.   Diane Stidle was also essential
to guarantee that everything was on the right track, and CALD would not be the same without her.
It was a great pleasure to be part of CALD in its first years. Deepayan Chakrabarti and Anna
Goldenberg  have  been  with  me  since  Day  1,   and  they  know  what  it  means,   and  how  important
they were to me in all these years. Many other CALDlings were with us on many occasions: the
escapades  for  food  in  South  Side  with  Rande  Shern and  Deepay;  the  annual  Super Bowl  parties  at
Bubba Beasley's and foosball at Daniel Wilson's; the always ready-for-everything CALD KREM: Krishna Kumaraswamy, Elena Eneva, Matteo Matteucci and myself (too bad I broke the pattern of repeated initials; think of me as the noise term), who could party even during a black-out; Pippin Whitaker, perpetrator of the remarkable feat of convincing me to go to the gym at 5 a.m. (I still don't know how I was able to wake up and find the way to the gym by myself). On top of
that,  Edoardo  Airoldi  and  Xue  Bai  were  masters  of  organizing  a  good  CALD  weekend,  preferably
with the company of Leonid Teverovskiy, Jason Ernst and Pradeep Ravikumar; Xue gets additional
points for  being able  to  drag me to  salsa  classes  (with  the help  of Lea  Kissner and  Chris Colohan);
Francisco  Pereira is not quite from CALD, but he is not in these acknowledgements  just because of
his healthy  habit  of  bringing me  some  fantastic  Porto  wine  straight  from the  source  (yes,  I  got  one
for  my defense too);  and one  cannot  forget  the honorary CALDlings  Martin  Zinkevich  and  Shobha
Venkataraman.
Josue,   Simone  and  Clara  Ramos  were  fantastic  hosts,   who  made  me  feel   at  home  when  I  was
just a newcomer. Whenever you show up in my home city, make sure to knock at my door. It will
feel  like  the  days  in  Pittsburgh,  snow  not  included.
I owe a lot to Einat Minkov,  including some of my sweetest memories of Pittsburgh.   Will I ever
repay her for everything? I won't stop trying.
To  conclude,  it  goes  without  saying  that  my  parents  and  brother  were  an  essential  support  on
every  step  of  my  life.   But  let  me  say  it  anyway:   thank you  for  everything.   This thesis  is  dedicated
to  you.
Contents

1  Introduction   1
   1.1  On the necessity of latent variable models   2
   1.2  Thesis scope   5
   1.3  Causal models, observational studies and graphical models   6
   1.4  Learning causal structure   8
   1.5  Using parametric constraints   11
   1.6  Thesis outline   14

2  Related work   15
   2.1  Factor analysis and its variants   15
        2.1.1  Identifiability and rotation   16
        2.1.2  An example   17
        2.1.3  Remarks   18
        2.1.4  Other variants   19
        2.1.5  Discrete models and item-response theory   20
   2.2  Graphical models   20
        2.2.1  Independence models   21
        2.2.2  General models   21
   2.3  Summary   25

3  Learning the structure of linear latent variable models   27
   3.1  Outline   27
   3.2  The setup   27
        3.2.1  Assumptions   28
        3.2.2  The Discovery Problem   29
   3.3  Learning pure measurement models   30
        3.3.1  Measurement patterns   33
        3.3.2  An algorithm for finding measurement patterns   34
        3.3.3  Identifiability and purification   36
        3.3.4  Example   42
   3.4  Learning the structure of the unobserved   42
        3.4.1  Identifying conditional independences among latent variables   44
        3.4.2  Constraint-satisfaction algorithms   44
        3.4.3  Score-based algorithms   45
   3.5  Evaluation   45
        3.5.1  Simulation studies   45
        3.5.2  Real-world applications   51
   3.6  Summary   59

4  Learning measurement models of non-linear structural models   65
   4.1  Approach   65
   4.2  Main results   66
   4.3  Learning a semiparametric model   69
   4.4  Experiments   71
        4.4.1  Evaluating nonlinear latent structure   72
        4.4.2  Experiments in density estimation   74
   4.5  Completeness considerations   75
   4.6  Summary   76

5  Learning local discrete measurement models   79
   5.1  Discrete associations and causality   79
   5.2  Local measurement models as association rules   80
   5.3  Latent trait models   82
   5.4  Learning latent trait measurement models as causal rules   84
        5.4.1  Learning measurement models   85
        5.4.2  Statistical tests for discrete models   88
   5.5  Empirical evaluation   90
        5.5.1  Synthetic experiments   90
        5.5.2  Evaluations on real-world data   92
   5.6  Summary   96

6  Bayesian learning and generalized rank constraints   101
   6.1  Causal learning and non-Gaussian distributions   101
   6.2  Probabilistic model   103
        6.2.1  Parametric formulation   104
        6.2.2  Priors   104
   6.3  A Bayesian algorithm for learning latent causal models   106
        6.3.1  Algorithm   107
        6.3.2  A variational score function   110
        6.3.3  Choosing the number of mixture components   111
   6.4  Experiments on causal discovery   112
   6.5  Generalized rank constraints and the problem of density estimation   113
        6.5.1  Remarks   119
   6.6  An algorithm for density estimation   119
   6.7  Experiments on density estimation   120
   6.8  Summary   123

7  Conclusion   125

A  Results from Chapter 3   129
   A.1  BuildPureClusters: refinement steps   129
   A.2  Proofs   130
   A.3  Implementation   141
        A.3.1  Robust purification   142
        A.3.2  Finding a robust initial clustering   142
        A.3.3  Clustering refinement   144
   A.4  The spiritual coping questionnaire   145

B  Results from Chapter 4   149

C  Results from Chapter 6   175
   C.1  Update equations for variational approximation   175
   C.2  Problems with Washdown   178
   C.3  Implementation details   179
Chapter  1
Introduction
Latent  variables,  also called  hidden  variables,  are variables that are not observed.   Concepts such as
gravitational fields, subatomic particles, antibodies or economic stability are essential building
blocks  of   models  of   great   practical   impact,   and  yet   such  entities   are  unobservable  (Klee,   1996).
Sometimes  there  is  overwhelming  evidence  that  hidden  variables  are  actual   physical   entities,   e.g.,
quarks,  and  sometimes  they  are  useful  abstractions,  e.g.,  psychological  stress.
Often  the goal  of statistical  analysis  with  latent  variables  is to  reduce the dimensionality  of  the
data.   Although  in  many  instances  this  is  a  practical   necessity,   it  is  a  goal   that  is  sometimes  in
tension  with  discovering  the  truth,  especially  when  the  truth  concerns  the  causal  relations  among
latent variables. For instance, there are several methods that accomplish effective dimensionality reduction by assuming that the latents under study are independent. Because full independence among random variables is a very strong assumption, models resulting from such methods might not have any correspondence to real causal mechanisms, even if such models fit the data reasonably well.
When there is uncertainty about the number of latent variables,  which variables measure them,
or which measured variables influence other measured variables, the investigator who aims at a causal explanation is faced with a difficult discovery problem for which currently available methods are at best heuristic. Loehlin (2004) argues that while there are several approaches to automatically learn causal structure (Glymour and Cooper, 1999), none can be seen as competitors of exploratory factor analysis: the usual focus of automated search procedures for causal Bayes nets is on relations among observed variables. Loehlin's comment overlooks Bayes net search procedures robust to the presence of latent variables (Spirtes et al., 2000), but the general sense of his comment is correct.
The main goal of this thesis is to fill this gap by formulating algorithms for discovering latent variables that are hidden common causes of a given set of observed variables. Furthermore, we provide strategies for discovering causal relations among the hidden variables themselves. In applications as different as gene expression analysis and marketing, knowing how latents causally interact with the given observed measures and among themselves is essential. This is a question that has hardly been addressed. The common view is that solving this problem is actually impossible, as
illustrated  by  the  closing  words  of  a  popular  textbook  on  latent  variable  modeling  (Bartholomew
and  Knott,  1999):
When we come to models for relationships between  latent variables we have reached
a point where so much has to be assumed that one might justly conclude that the limits
of scientific usefulness have been reached if not exceeded.
[Figure 1.1 appears here: (a) a directed graph over X1, ..., X6 with a hidden variable H; (b) the corresponding graph over the observed variables after H is removed.]
Figure 1.1: An illustration of how the existence of an unrecorded variable can affect a probabilistic model. Figure (b) represents the remaining set of conditional independencies that still exist after removing node H from Figure (a). This figure is adapted from (Binder et al., 1997).
This  view  is  a  consequence  of  formulating  the  problem  of  discovering  latent  variables  by  using
arbitrary methods such as factor analysis, which can generate an infinite number of solutions. Identifiability in this case is treated as a mere case of interpretation, where all solutions are acceptable, and the "preferred" ones are just those that are easier to interpret. This thesis should be seen as a case against this type of badly formulated approach, and a counter-example to Bartholomew and Knott's statement.
This   introduction  will   explain  the  general   approach  for   latent   variable  modeling  and  causal
modeling adopted in this thesis. We first discuss how latent variables are important (Section 1.1), especially in causal models. We then define the scope of the thesis (Section 1.2). Details about causal models are introduced in Section 1.3. At the end of this chapter we provide a thesis outline.
1.1   On  the  necessity  of  latent  variable  models
Consider first the problem of density estimation using the graphical modeling framework (Jordan,
1998).   In  this  framework,   one  represents  joint  probability  distributions  by  imposing  several   con-
ditional  independence  constraints  on  the  joint,  where  such  constraints  are  represented  by  a  graph.
Assume that we have a distribution that respects the independence constraints represented by Fig-
ure 1.1(a).   If for some reason variable H  is unrecorded in our database  and we want to  reconstruct
the  marginal   probability  distribution  of   the  remaining  variables,   the  simplest   graph  we  can  use
has  at  least  as  many  edges  as  the  one  depicted  in  Figure  1.1(b).   This  graph  is  relatively  dense,
which can lead to computationally expensive inferences and statistically inefficient estimation of probabilities. If instead we use the latent variable model of Figure 1.1(a), we can obtain more efficient estimators using standard techniques such as maximum likelihood estimation by gradient
descent (Binder et al., 1997).   That is, even  if we do not have data for particular variables,  it is still
the case  that a latent  variable model might provide more reliable information about the observable
marginal  than  a  model  without  latents.
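To make the statistical point concrete, here is a back-of-the-envelope parameter count. It is only an illustration under assumptions of our own (binary variables and a single hidden common cause with k children, a simplification rather than the exact structure of Figure 1.1), but it shows why marginalizing a latent can be costly.

```python
# A rough count of free parameters, assuming binary variables and a single
# hidden common cause H with k observed children (an illustrative
# simplification, not the exact structure of Figure 1.1).
def latent_model_params(k):
    # P(H = 1), plus P(X_i = 1 | H = h) for each child i and each value h of H.
    return 1 + 2 * k

def saturated_marginal_params(k):
    # An unrestricted joint distribution over k binary variables.
    return 2 ** k - 1

for k in (3, 6, 10):
    print(k, latent_model_params(k), saturated_marginal_params(k))
# With k = 6 children, the latent model has 13 free parameters, while an
# unrestricted model of the marginal has 63.
```

The marginal in Figure 1.1(b) is not fully connected, so the actual gap is smaller than this worst case, but the direction of the comparison is the point: fewer parameters generally mean statistically more efficient estimation.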
In  the  given  example,  the  hidden variable  was  postulated  as  being part  of  a  true model.   Some-
times a probabilistic model contains hidden variables not because such variables represent some physical entity, but because including them adds bias to the model in order to reduce the variance of the estimator.
[Figure 1.2 appears here: (a) Lead and Cognitive Skills; (b) the same model with the indicators Blood level and IQ Test and their measurement-error terms.]
Figure 1.2: In (a), the underlying hypothesized phenomenon. In (b), how the model assumptions relate the measurements.
Even if such a model does not correspond to a physical reality, it can aid predictions when data is
limited.
However, suppose we are interested not only in good estimates of a joint distribution as the ultimate goal, but also in the actual causal structure underlying the joint distribution. Consider first the scenario where we are given a set of latent variables. The problem is how to find the correct
graphical  structure  representing  the  causal  connections  between  latent  variables,  and  between  any
pair  of  latent  and  observed  variables.
For  example,   suppose  there  is  an  observed  association  between  exposure  to  lead  and  low  IQ.
Suppose this association is because exposure to lead causes changes in a child's IQ. Policy makers are interested in this type of problem because they need to control the environment in order to achieve a desired effect: should we intervene in how lead is spread in the environment? But what if it does not actually affect cognitive skills of children, but there is some hidden common cause that explains this dependency? These are typical questions in econometrics and social science. But also researchers in artificial intelligence and robotics are attentive to such general problems: how can a robot intervene on its environment in order to achieve its goals? If one does not know how to quantify such effects, one cannot build any sound decision theoretic machinery for action, since the prediction of the effects of a manipulation will be wrong to begin with. In order to perform sound
prediction  of  manipulations,   causal   knowledge  is  necessary,   and  algorithms  are  necessary  to  learn
it  from  data  (Spirtes  et  al.,  2000;  Pearl,  2000;  Glymour  and  Cooper,  1999).
A simple causal model for the lead (L) and cognitive skills (C) problem is a linear regression model C = βL + ε, where ε is the usual zero-mean, normally distributed random variable, and the model is interpreted as causal. Figure 1.2(a) illustrates this equation as a graphical model.

There is one important problem: how to quantify lead exposure and cognitive skills. The common practice is to rely on indirect measures (indicators), such as Blood level concentration (of lead) (BL), which is an indicator of lead exposure. In our hypothetical example, BL cannot directly substitute for L in this causal analysis because of measurement error (Bollen, 1989), i.e., a significant concentration of lead in someone's blood might not be real, but an artifact of the physical properties of the instruments used in this measurement. Concerning variable C, intelligence itself is probably one of the most ill-defined concepts in existence (Bartholomew, 2004). Measures such as IQ Tests (IQ) have to be used as indicators of C. Expressing our regression model directly in terms of observable variables, we obtain IQ = βBL + ε_IQ.
[Figure 1.3 appears here: latents Lead, Cognitive Skills and Parents' Attentiveness; indicators Blood level, Teeth level, IQ Test and P1, P2, P3; and a node scale computed from P1, P2, P3.]
Figure  1.3:   A  graphical  model  with  three  latents.   Variable  scale  is  a  deterministic  function  of  its
parents,  represented  by  dashed  edges.
However, if the variance of the measurement error of L through BL is not zero, i.e., E[ε_b²] ≠ 0, we cannot get a consistent estimator of β by just regressing IQ on BL. This is not because regression is fundamentally flawed, but because this problem fails to meet its assumptions. By Figure 1.2(b), we see that there is a common cause between BL and IQ (Lead), which violates an assumption of regression: if one wants consistent estimators of causal effects, there cannot be any hidden common cause between the regressor and the variable being predicted.
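The bias can be illustrated with a short simulation. All numbers below are hypothetical (this is a sketch of the measurement-error argument, not an analysis from the thesis): regressing one noisy indicator on another recovers a coefficient attenuated toward zero relative to the true β.

```python
# Illustrative simulation of attenuation bias: L and C are latent, BL and IQ
# are their noisy indicators, and beta is a hypothetical causal coefficient.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta = -0.5
L = rng.normal(size=n)                       # true lead exposure (latent)
C = beta * L + rng.normal(size=n)            # true cognitive skills (latent)
BL = L + rng.normal(scale=0.8, size=n)       # blood-level indicator, with error
IQ = C + rng.normal(scale=0.8, size=n)       # IQ-test indicator, with error

slope = lambda y, x: np.cov(y, x)[0, 1] / np.var(x, ddof=1)
print("true beta:              ", beta)
print("regression of C on L:   ", round(slope(C, L), 3))    # close to beta
print("regression of IQ on BL: ", round(slope(IQ, BL), 3))  # attenuated toward zero
```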
One solution is fully modeling the latent structure. Additional difficulties arise in latent variable models, however. For instance, the model in Figure 1.2(b) is not identified, i.e., the actual parameters that can be used to quantify the causal effect of interest cannot be calculated. This can be solved by using multiple indicators per latent (Bollen, 1989).
Consider the problem of identifying conditional independencies among latents. This is an essential prerequisite in data-driven approaches for causal discovery (Spirtes et al., 2000; Pearl, 2000; Glymour and Cooper, 1999). In our example, suppose we take into account a common cause between lead and cognitive abilities: the parents' attentiveness to home environment (P), with multiple indicators P1, P2, P3 (Figure 1.3). We want to test if L is independent from C given P and, if so, conclude that lead is not a direct cause of alterations in children's cognitive functions. If these variables were observed, well-known methods of testing conditional independencies could be used.
However,  this  is  not  the  case.   A  common  practice  is  to  create  proxies  for  such  latent  variables,
and to perform tests with the proxies.   For  instance,  a typical  proxy is the average  of the respective
indicators of the hidden variable of interest. An average of P1, P2 and P3 is a scale for P, and
scales  for  L  and  C  can  be  similarly  constructed.   In  general,   however,   a  scale  does  not  capture  all
of the variability of the respective hidden variable,  and no conditional independence will hold given
this scale. Measurement error is responsible for such a difference.¹ Assuming the model is linear, this problem can be solved by fitting the latent variable model and evaluating whether the coefficient parameterizing the edge of interest is zero.

¹ Using a graphical criterion for independence known as d-separation (Pearl, 1988), one can easily verify that the indicators of L and C cannot be independent given a function of the children of P, unless this function is deterministic on P and invertible.
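As a sanity check on the claim that a scale does not screen off the latents' influence, the following simulation (with made-up linear coefficients, not values from the thesis) conditions on the scale built from P1, P2 and P3 and shows that the indicators of L and C remain correlated, whereas conditioning on the latent P itself removes the association.

```python
# Conditioning on a scale (the average of P1, P2, P3) versus conditioning on
# the latent P itself. Coefficients are arbitrary illustrative choices, and in
# this toy setup L has no direct effect on C.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
P = rng.normal(size=n)                       # parents' attentiveness (latent)
L = 0.8 * P + rng.normal(size=n)             # lead exposure (here caused only by P)
C = 0.8 * P + rng.normal(size=n)             # cognitive skills (here caused only by P)
BL = L + 0.5 * rng.normal(size=n)            # blood-level indicator
IQ = C + 0.5 * rng.normal(size=n)            # IQ-test indicator
P1, P2, P3 = (P + rng.normal(size=n) for _ in range(3))
scale = (P1 + P2 + P3) / 3.0

def partial_corr(x, y, z):
    """Correlation of x and y after regressing z out of both."""
    bx, by = np.polyfit(z, x, 1)[0], np.polyfit(z, y, 1)[0]
    return np.corrcoef(x - bx * z, y - by * z)[0, 1]

print("corr(BL, IQ | P)     =", round(partial_corr(BL, IQ, P), 3))      # approximately zero
print("corr(BL, IQ | scale) =", round(partial_corr(BL, IQ, scale), 3))  # clearly nonzero
```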
So  far,   we  described  a  problem  where  latent  variables  were  given  in  advance.   An  even  more
fundamental   problem  is  discovering  which  latents   exist.   A  solution  to  this  problem  can  also  be
indirectly applied to the task of multivariate density estimation. This is one of the most difficult problems in machine learning and statistics, since in general a joint distribution can be generated by an infinite number of different latent variable models. However, under an appropriate set of assumptions, the existence of latents can sometimes be indirectly identified from testable features of the marginal of the observed variables.

The scientific problem is therefore a problem of learning how our observations are causally connected. Since it is often the case that such connections happen through hidden common causes, the scientist has to first infer which relevant latent variables exist. Only then can he or she proceed to identify how such hidden variables are causally connected by examining conditional independencies among latents that can be detected in the observed data. An automatic procedure to aid this accomplishment is the main contribution of this thesis.
1.2   Thesis  scope
Given  the  large  number of reasons for  the  importance of  latent  variable  models, we  describe here  a
simplified categorization of tasks and which ones are relevant to this thesis:
   causal  inference.   This is the main  motivation  of this thesis,  and  it  is described in  more detail
in  the  next  sections;
density estimation. This is a secondary goal, achieved as a by-product of the thesis's main results. We evaluate empirically how variations of our methods perform in this task;
   latent   prediction.   Sometimes  predicting  the  values  of   the  latents   themselves  is  the  goal   of
the  analysis.   For  instance,   in  independent  component  analysis  (ICA)  (Hyvarinen,  1999)  the
latents   are  signals  that  have  to  be  recovered.   In  educational   testing,   latents   represent  the
abilities   of   an  individual.   Mathematical   and  verbal   abilities  in  an  exam  such  as  GRE,   for
instance,   can  be  treated  as  latent   variables,   and  individuals  are  ranked  according  to  their
predicted  values  (Junker  and  Sijtsma,  2001).   Similarly,   in  model-based  clustering  the  latent
space  can  be used  to  group individuals:   the  modes of  the latent  posterior  distribution  can  be
used to represent different market groups, for instance. We do not evaluate our methods in
the  latent  prediction  task,  but  our  results  might  be  useful  in  some  domains;
   dimensionality  reduction.   Sometimes  a  latent  space  can  be  used  to  perform  lossy  compres-
sion of the observable data. For instance, Bishop (1998) describes an application in image compression using latent variable models. This is an example of an application where the
main  theoretical  results  of  this  thesis  are  unlikely  to  be  useful;
Within these tasks, there are different classes of problems. In some, for example, the observed
variables  are  basically  measurements  of   some  physical   or  social   process.   If,   for  example,   we  take
dozens  of  measures  of  the  incident  light  hitting  the  surface of  the  earth,  some  at  ultra-violet  wave-
lengths,   some  at  infra-red,   etc.,   then  it  is  reasonable  to  assume  that  such  observed  variables  are
measurements of a set of unrecorded physical variables, such as atmospheric and solar processes.
The  pixels  that   compose  fMRI   images   are  indirect  measurements   of   the  chemical   and  electrical
processes  in  human  brains.   Educational   tests  intend  to  measure  abstract  latent   abilities   of   stu-
dents,   such  as  verbal   and  mathematical   skills.   Questionnaires  used  in  social   studies  are  intended
to  analyse  latent  features  of   the  population  of   interest,   such  as  the  attitude  of   single  mothers
with  respect  to  their  children.   In  all  these  problems,  it  is  also  reasonable  to  assume  that  observed
variables are indicators of latents of interest, and therefore they are effects, not causes, of latents.
This  type  of  data  generating  process  is  the  focus  of  this  thesis.
Moreover,   because  measures  are  massively  connected  by  hidden  common  causes,   it  is  unlikely
that  conditional  independencies  hold  among  such  measures  unless  such  independencies  are  loosely
approximated,   e.g.,   in  cases   where  measures  are  nearly  perfectly  correlated  with  the  latents.   It
would  be  extremely  useful  to  have  a  machine  learning  procedure  that  might  discover  which  latent
common  causes  of  such  measures  were  operative,   and  do  so  in  a  way  that  allowed  for  discovering
something  about  how  they  were  related,   especially  causally.   But  for  that  one  cannot  rely  only  on
observed conditional independencies. New techniques for causal discovery that do not directly rely on observed independence constraints are the focus of this thesis.
1.3   Causal   models,   observational   studies  and  graphical   models
In this section we make more precise what we mean by causal modeling and how it is different
from  non-causal   modeling.   There  are  two  basic  types   of   prediction  problems:   prediction  under
observation and prediction under manipulation. In the first type, given an observation of the current
state  of  the  world,  an  agent  infers  the  probability  distribution  of  a  set  of  variables  conditioned  on
this  observation.   For  instance,  predicting  the  probability  of  rain  given  the  measure  of  a  barometer
is  such  a  prediction  problem.
The second type consists in predicting the effect of a manipulation on a set of variables. A manipulation consists of a modification of the probability distribution of a set of variables in a given system by an agent outside the system. For instance, it is known that some specific range of atmospheric pressure is a good indication of rain. A barometer measures atmospheric pressure. If one wants to make rain, why not intervene on a barometer by modifying its sensors? If the probability of rain is high for a given measure, then providing such a measure might appear to be a good idea.
The important difference between the two types of prediction is intuitive. If the intervention on our barometer consists of attaching a random number generator in place of the actual physical sensors, we do not expect the barometer to affect the probability of rain, even if the resulting measure is a strong indication of rain under proper physical conditions. We know this because we know that rain causes changes in the barometer reading, not the opposite. A causal model is therefore essential to predict the effects of an intervention.
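A toy simulation makes the contrast explicit. The probabilities below are invented for illustration only: the conditional probability of rain given an observed barometer reading differs from the probability of rain after forcing the reading by an outside intervention.

```python
# Observing versus manipulating the barometer in the rain -> barometer system
# (illustrative probabilities, not from the thesis).
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
rain = rng.binomial(1, 0.3, size=n)                        # it rains 30% of the time
barometer = np.where(rng.random(n) < 0.9, rain, 1 - rain)  # reading tracks rain, noisily

# Prediction under observation: a "rain" reading raises the probability of rain.
print("P(rain | barometer reads rain)     =", round(rain[barometer == 1].mean(), 3))

# Prediction under manipulation: forcing the reading (say, wiring in a random
# number generator and setting it to "rain") leaves rain untouched, because
# rain causes the reading and not the other way around.
forced = np.ones(n, dtype=int)
print("P(rain | do(barometer reads rain)) =", round(rain[forced == 1].mean(), 3))
```

Under these made-up numbers the first quantity is about 0.79 while the second stays at the base rate of 0.30, which is exactly the asymmetry the causal reading of the graph predicts.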
The standard method of estimating a causal model is by performing experiments. Different manipulations are assigned to different samples following a distribution that is independent of the possible causal mechanisms of interest (uniformly random assignments are a common practice). The different effects are then estimated using standard statistical techniques. Double-blinded treatments in the medical literature are a classical example of experimental design (Rosenbaum, 2002).

However, experiments might not be possible for several reasons: they can be unethical (as in estimating the effects of smoking on lung cancer), too expensive (as in manipulating a large
number   of   sets   of   genes,   one  set   at   a  time),   or   simply  technologically  impossible  (as   in  several
subatomic  physics  problems).   Instead,   one  must  rely  on  observational   studies,   which  attempt  to
obtain estimates of causal effects from observational data, i.e., data representative of the population
of  interest,  but obtained  with  no  manipulations.   This can  only  be accomplished  by  adopting  extra
assumptions  that  link  the  population  joint  distribution  to  causal  mechanisms.
An account of classical techniques of observational studies is given by Rosenbaum (2002). In most cases, the direction of causality is given a priori. The goal is estimating the causal effect of a variable X on a variable Y, i.e., how Y varies given different manipulated values of X. One tries to measure as many common causes of X and Y as possible in order to estimate the desired effect, since the presence of hidden common causes will result in biased estimates.

Much background knowledge is required in these methods and, if incorrect, can severely affect one's conclusions. For instance, if Z is actually a common effect of X and Y, conditioning on Z adds bias to the estimate of the desired effect, instead of removing it.
Instead,   this  thesis  advocates  the  framework  of   data-driven  causal   graphical   models,   or  causal
Bayesian  networks,   as  described  by  Spirtes  et  al.   (2000)  and  Pearl   (2000).   Such  models  not  only
encompass  a  wide  variety  of  models  used  ubiquitously  in  social  sciences,  statistics,  and  economics,
but they are uniquely well suited for computing the effects of interventions.
We  still   need  to  adopt  assumptions  relating  causality  and  joint  probabilities.   However,   such
assumptions rely on a fairly general axiomatic calculus of causality instead of being strongly domain
dependent.   The  fundamental   property  of   this  calculus  is  assuming  that  qualitative  features  of   a
true  causal  model  can  be  represented  by  a  graph.   We  will  focus  mostly  on  directed  acyclic  graphs
(DAGs), so any reference to a graph in this thesis is an implicit reference to a DAG, unless otherwise
specified. There are, however, extensions of this calculus to cyclic graphs and other types of graphs
(Spirtes  et  al.,  2000).
Each random variable is a node in the corresponding graph, and there is an edge X → Y in the graph if and only if X is a direct cause of Y, i.e., the effect of X on Y when X is manipulated is non-zero when conditioning on all other causes of Y. Notice that causality itself is not defined. Instead we rely on the concepts of manipulation and effect, which are causal concepts themselves, to provide a calculus to solve the practical problems of causal prediction.
Two essential definitions are at the core of the graphical causal framework:
Definition 1.1 (Causal Markov Condition) Any given variable is independent of its non-effects given its direct causes.

Definition 1.2 (Faithfulness Condition) A conditional independence holds in the joint distri-
bution  if  and  only  if  it  is  entailed  by  the  Causal   Markov  condition  in  the  corresponding  graph.
The only difference between the "causal" and the "non-causal" Markov conditions is that in the former a parent is assumed to be a direct cause. The non-causal Markov condition is widely used in graphical probabilistic modeling (Jordan, 1998). For DAGs, d-separation is a sound and complete system to deduce the conditional independencies entailed by the Markov condition (Pearl, 1988), which in principle can be used to verify if a probability distribution is faithful to a given DAG. We will use the concept of d-separation in several points of this thesis as a synonym for conditional independence. The faithfulness condition is also called "stability" by Pearl (2000). Spirtes et al. (2000) and Pearl (2000) discuss the implications and suitability of such assumptions.
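Since d-separation will be used throughout as a synonym for entailed conditional independence, a small self-contained check may help. The sketch below is our own illustration (the example DAG is chosen to match the independencies discussed later for Figure 1.5); it tests d-separation with the standard moralized ancestral graph criterion.

```python
# A minimal d-separation test using the moralized ancestral graph criterion.
# The DAG is given as a dict mapping each node to the set of its parents.
from itertools import combinations

def ancestors(dag, nodes):
    """Return the given nodes together with all of their ancestors."""
    result, frontier = set(nodes), list(nodes)
    while frontier:
        for parent in dag[frontier.pop()]:
            if parent not in result:
                result.add(parent)
                frontier.append(parent)
    return result

def d_separated(dag, xs, ys, zs):
    """True iff xs and ys are d-separated given zs in the DAG."""
    relevant = ancestors(dag, set(xs) | set(ys) | set(zs))
    # Moralize the ancestral subgraph: keep parent-child edges as undirected
    # edges and "marry" every pair of parents that share a child.
    edges = set()
    for child in relevant:
        parents = [p for p in dag[child] if p in relevant]
        edges.update(frozenset((p, child)) for p in parents)
        edges.update(frozenset(pair) for pair in combinations(parents, 2))
    # Delete the conditioning set, then check undirected reachability.
    blocked = set(zs)
    adjacency = {node: set() for node in relevant - blocked}
    for a, b in (tuple(e) for e in edges):
        if a not in blocked and b not in blocked:
            adjacency[a].add(b)
            adjacency[b].add(a)
    reachable, frontier = set(xs) - blocked, list(set(xs) - blocked)
    while frontier:
        for neighbor in adjacency[frontier.pop()]:
            if neighbor not in reachable:
                reachable.add(neighbor)
                frontier.append(neighbor)
    return not (reachable & set(ys))

# A DAG consistent with the independencies discussed for Figure 1.5(a):
# X1 -> X3 <- X2 and X3 -> X4.
dag = {"X1": set(), "X2": set(), "X3": {"X1", "X2"}, "X4": {"X3"}}
print(d_separated(dag, {"X1"}, {"X2"}, set()))     # True: marginally independent
print(d_separated(dag, {"X1"}, {"X2"}, {"X3"}))    # False: conditioning on a collider
print(d_separated(dag, {"X1"}, {"X4"}, {"X3"}))    # True: X3 screens off X4
```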
Why does the faithfulness condition help us to learn causal models from observational data?  The
Markov condition applied to different DAGs entails different sets of conditional independence con-
straints.   Such  constraints  can  in  principle be detected  from  data.   If constraints  in  the  distribution
are  allowed  to  be  arbitrarily  disconnected  from  the  underlying  causal  graph,  then  any  probability
distribution can be generated from a fully connected graph, and nothing can be learned.   This, how-
ever,   requires  that  independencies  are  generated  by  cancellation  of   causal   paths.   For  instance,   if
variables X  and Y  are probabilistically  independent, but X  and Y  are causally  connected,  then all
causal paths between X  and Y  cancel each  other, amounting to zero association.   Our axioms deem
such  an  event  impossible,   and  in  fact  this  assumption  seems  to  be  very  common  in  observational
studies  (Spirtes  et  al.,  2000),  even  though  in  many  cases  it  is  not  explicit.   We  make  it  explicit.
Therefore,   a  set  of  independencies  observed  in  a  joint  probability  distribution  can  highly  con-
strain  the  possible  set  of  graphs  that  generated  such  independencies.   It  might  be  the  case  that  all
compatible graphs agree on specific edges, allowing one to create algorithms to identify such edges.
Section  1.4  gives  more  details  about  discovery  algorithms  in  the  context  of  this  thesis.
It  is  important  to  stress  that  a  causal  graph  is  not  a  full  causal  model.   A  graph  only  indicates
which  conditional  independencies  exist,  i.e.,  it  is  an  independence  model,  not  a  probabilistic  model
as required to compute causal effects. A full causal model should also describe the joint probability distribution of its variables. Most graphical models used in practice are parametric, and defined by local functions: the conditional density of a variable given its parents. In this thesis we will adopt parametric formulations, mostly multivariate Gaussians or finite mixtures of multivariate
Gaussians.
Once  parametric  formulations  are  introduced,   other  types  of   constraints   are  entailed  by  para-
meterized  causal   graphs.   That  is,   given  a  causal   graph  with  a  respective  parameterization,   some
constraints  on  the  joint  distribution  will   hold  for  any  choice  of  parameter  values.   One  can  adopt
a different form of faithfulness in which such non-independence constraints observed in the joint
distribution  are  a  result  of   the  underlying  causal   graph,   reducing  the  set  of  possible  graphs  com-
patible  with  the  data.   This  will  be  essential   in  the  automatic  discovery  of  latent  variable  models,
as  explained  in  Section  1.5.
1.4   Learning  causal   structure
Suppose   one   is   given  a   joint   distribution  of   two   variables,   X  and  Y ,   which  are   known  to   be
dependent. Both graphs X → Y and X ← Y are compatible with this observation (plus an infinite number of graphs where an arbitrary number of hidden common causes of X and Y exist). In this case, the causal relationship of X and Y is not identifiable from conditional independencies.
However, with three or more variables, several sets of conditional independence constraints uniquely
identify  the  directionality  of  some  edges.
Consider Figure 1.4(a), where variables H_i are possible hidden variables. If hidden variables are assumed to not exist, then the directed edges X → Z and Y → Z can be identified from data
generated  by  this  model.   If  hidden  variables  are  not  discarded  a  priori,  one  can  still   learn  that  Z
is  not  a  cause  of  either  X  or  Y .   If  the  true  model  is  the  one  shown  in  Figure   1.4(b),   in  the  large
sample  limit  it  is  possible  to  determine  that  Z  is  a  cause  of  W  under  the  faithfulness  assumption,
even  if  one  allows  the  possibility  of  hidden  common  causes.
In  general,  we  do  not  have  enough  information  to  identify  all  features  of  the  true  causal  graph
without  experiments.   The  problem  of   causal   discovery  without  experimental   data  should  be  for-
[Figure 1.4 appears here: (a) a graph over X, Y, Z with possible hidden variables H1, H2, H3; (b) the same structure with an additional variable W caused by Z.]
Figure  1.4:   Two  examples  of  graphs  where  the  directionality  of  the  edges  can  be  inferred.
mulated as a problem of finding equivalence classes of graphs. That is, instead of learning a causal graph, we learn a set of graphs that cannot be distinguished given the observations. This set forms the equivalence class of the given set of observed constraints. The most common equivalence class of causal models is defined by conditional independencies:
Denition  1.3  (Markov  equivalence  class)   The set of graphs  that  entail exactly  the same  con-
ditional   independencies  by  the  Markov  condition.
Enumerating all members of an equivalence class might be unfeasible, because in the worst case
this  number  is  exponential   in  the  number  of   nodes  in  the  graph.   Fortunately,   there  are  compact
representations  for  Markov  equivalence  classes.   For  instance,  a  pattern  (Pearl,  2000;  Spirtes  et  al.,
2000) is a representation for Markov equivalence classes of DAGs when no pair of nodes has a hidden
common cause.   A pattern has either directed or undirected edges with the following interpretation:
   two  nodes  are  adjacent  in  a  pattern  if   and  only  if   they  are  adjacent  in  all   members  of   the
corresponding  Markov  equivalence  class;
there is an unshielded collider A → B ← C (i.e., a substructure where A and C are parents of B, and A, C are not adjacent) if and only if the same unshielded collider appears in all members of the Markov equivalence class;
there is a directed edge A → B in the pattern only if the same edge appears in all members of the Markov equivalence class;
As hinted by the "only if" condition in the last item, patterns can differ with respect to the completeness of their orientations. All members of a Markov equivalence class might agree on the same directed edge that is not part of an unshielded collider (for example, edge Z → W in Figure 1.4(b)), and yet it might not be represented in a valid pattern. The original PC algorithm described by Spirtes et al. (2000) is not guaranteed to provide a fully informative pattern, but there are known extensions that provide such a guarantee (Meek, 1997). Some issues of completeness of causal learning algorithms are discussed in this thesis.

Therefore, a key aspect of causal discovery is providing not only a model that fits the data, but all models that fit the data equally well according to a family of constraints, i.e., equivalence classes.
There  are  basically  two  families   of   algorithms   for   learning  causal   structure  from  data  (Cooper,
1999).
Constraint-satisfaction algorithms check if specific constraints are judged to hold in the population by some decision procedure such as hypothesis testing. Each constraint is tested individually. The choice of which constraints to test is usually based on the outcomes of the previous tests, which increases the computational efficiency of this strategy. Moreover, each test tends to be computationally inexpensive, since only a handful of variables are included in the hypothesis to be tested.
For  example,  the PC algorithm  of Spirtes et al. (2000)  learns Markov  equivalence  classes under
the  assumption  of  no  hidden  common  causes  by  starting  with  a  fully  connected  undirected  graph.
An  undirected  edge   is   removed  if   the  variables   at   the  endpoints   are   judged  to  be  independent
conditioned  on  some  set  of   variables.   The  order  by  which  these  tests  are  performed  is  in  such  a
way  that the algorithm  is exponential only in  the maximum number of parents among  all variables
in  the  true  model.   If  this  number  is  small,  then  the  algorithm  is  tractable  even  for  problems  with
a  large  number of  variables.   After  removing  all  possible undirected  edges,  directionality  of  edges  is
determined according to which constraints were used in the first stage. Details are given by Spirtes
et  al.  (2000).
A second family of algorithms is the score-based family. Instead of testing specific constraints, a score-based algorithm uses a score function to rank how well different graphs explain the data. Since scoring all possible models is unfeasible in all but very small problems, most score-based algorithms are greedy hill-climbing search algorithms. Starting from some candidate model, a greedy algorithm applies operators that create a new set of candidates based on modifications of the current graph. New candidates represent different sets of independence (or other) constraints. The best scoring model among this set of candidates will become the new current graph, unless the current graph itself has a higher score. In this case we reached a local maximum and the search is halted. For instance, the K2 algorithm of Cooper and Herskovits (1992) was one of the first algorithms of this kind. The usual machinery of combinatorial optimization, such as simulated
annealing  and  tabu  search,  can  be  adapted  to  this  problem  in  order  to  reach  better  local  maxima.
In Figure 1.5, we show an example of the PC algorithm in action. Figure 1.5(a) shows the true model, which is unknown to the algorithm. However, this model entails several conditional independence constraints. For instance, X1 and X2 are marginally independent. X1 and X4 are independent given X3, and so on. Starting from a fully connected undirected graph, as shown in Figure 1.5(b), the PC algorithm will remove edges between any pair that is independent conditioned on some other set of variables. This will result in the graph shown in Figure 1.5(c). Conditional independencies allow us to identify which unshielded colliders exist, and the graph in Figure 1.5(d) illustrates a valid pattern for the true graph. However, in this particular case it is also possible to direct the edge X3 → X4. In this case, the most complete pattern represents a unique graph. An example of a score-based algorithm is given in Chapter 2.
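To ground the description above, here is a rough sketch of the edge-removal stage on data simulated from the structure just discussed (X1 → X3 ← X2, X3 → X4, with made-up linear coefficients). It brute-forces conditioning sets over all remaining variables, whereas the actual PC algorithm restricts them to current neighbors, so this is an illustration of the idea rather than a faithful implementation.

```python
# Sketch of the skeleton (edge-removal) phase of a PC-style search on linear
# Gaussian data simulated from X1 -> X3 <- X2, X3 -> X4 (illustrative only).
import math
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
X3 = 0.8 * X1 + 0.8 * X2 + rng.normal(size=n)
X4 = 0.9 * X3 + rng.normal(size=n)
data = np.column_stack([X1, X2, X3, X4])
names = ["X1", "X2", "X3", "X4"]

def independent(i, j, cond, alpha=0.01):
    """Fisher z test of zero partial correlation between columns i and j given cond."""
    prec = np.linalg.inv(np.corrcoef(data[:, [i, j] + list(cond)], rowvar=False))
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_value > alpha

# Start with the complete undirected graph; drop an edge whenever its endpoints
# are judged independent given some subset of the remaining variables.
skeleton = {frozenset(pair) for pair in combinations(range(4), 2)}
for size in range(3):
    for i, j in combinations(range(4), 2):
        if frozenset((i, j)) not in skeleton:
            continue
        others = [k for k in range(4) if k not in (i, j)]
        for cond in combinations(others, size):
            if independent(i, j, cond):
                skeleton.discard(frozenset((i, j)))
                print(f"removed {names[i]}-{names[j]} given {[names[k] for k in cond]}")
                break

print("remaining edges:",
      sorted("-".join(sorted(names[k] for k in edge)) for edge in skeleton))
# Expected: X1-X3, X2-X3 and X3-X4 survive; X1-X2, X1-X4 and X2-X4 are removed.
```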
Constraint-satisfaction  and  score-based  algorithms   are  closely  related.   For   instance,   several
score-based  algorithms  generate  new  candidate  models  that  are  either  nested  within  the  current
candidate or vice-versa.   Common score functions that compare the current and new candidates are
asymptotically equivalent to likelihood-ratio tests, and therefore choosing a new graph amounts to accepting² or rejecting the null hypothesis corresponding to the more constrained model.

² We use a non-orthodox application of hypothesis testing in which failing to reject the null hypothesis is interpreted as accepting the null hypothesis.
X
1   2
3
4
X   X
X
X
1   2
3
4
X   X
X
X
1   2
3
4
X   X
X
X
1   2
3
4
X   X
X
(a)   (b)   (c)   (d)
Figure 1.5: A step-by-step demonstration of the PC algorithm. The true model is given in (a). We start with a fully connected undirected graph over the variables (b) and remove edges according to the independence constraints that hold among the given variables. For example, X_1 and X_2 are marginally independent; therefore, the edge X_1 - X_2 is removed. However, X_1 and X_3 are not independent conditioned on any subset of {X_2, X_4}, so the edge X_1 - X_3 remains. At the end of this stage, we obtain graph (c). By orienting unshielded colliders, we get graph (d). Extra orientation steps detailed by Spirtes et al. (2000) will recreate the true graph.
However, with finite samples, algorithms in different families can give different results. Usually score-based search will give better results, but the computational cost might be much higher. Constraint-satisfaction algorithms tend to be "greedier," in the sense that they might remove more candidates at each step.
Score-based algorithms are especially problematic when latent variables are introduced. While scoring DAGs without hidden variables can be done as efficiently as performing hypothesis tests in a typical constraint-satisfaction algorithm, this is not true when hidden variables are present. In practice, strategies such as Structural EM (Friedman, 1998), as explained in Chapter 2, have to be used. However, Structural EM might increase the chances of an algorithm getting trapped in a local maximum. Another problem is the consistency of the score function, which we discuss in Chapter 6.
1.5   Using  parametric  constraints
We emphasized Markov equivalence classes in the previous section, but there are other important constraints, besides independence constraints, which can be used for learning causal graphs. They are crucial when several important variables are hidden.

When latent variables are included in a graph, different graphs might represent the same marginal over the observed variables, even if these graphs represent different independencies in the original graph. A classical example is factor analysis (Johnson and Wichern, 2002). Consider the graphs in Figure 1.6, where circles represent latent variables. Assume this is a linear model with additive noise where variables are distributed as multivariate Gaussian. A simple linear transformation of the parameters of a model corresponding to Figure 1.6(a) will generate a model as in Figure 1.6(b) such that the two models represent different sets of conditional independencies, but the observed marginal distribution is identical.

Moreover, observed conditional independencies are of no use here: there are no observed conditional independencies at all, so one has to appeal to other types of constraints.
Figure 1.6: These two graphs with two latent variables are indistinguishable for an infinite number of normal distributions.
Figure  1.7:   A  latent  variable  model   which  entails  several   constraints  on  the  observed  covariance
matrix.   Latent  variables  are  inside  ovals.
Consider Figure 1.7, where the X variables are recorded and the L variables (in ovals) are unrecorded and unknown to the investigator. Assume this model is linear.
The latent structure, the dependencies of measured variables on individual latent variables, and
the linear  dependency of the  measured variables  on  their  parents and (unrepresented) independent
noises in Figure 1.7 imply a pattern of constraints on the covariance matrix among the X  variables.
For example, X_1, X_2, X_3 have zero covariances with X_7, X_8, X_9. Less obviously, for X_1, X_2, X_3 and any one of X_4, X_5, X_6, three quadratic constraints (tetrad constraints) on the covariance matrix are implied; e.g., for X_4,

    ρ_{12} ρ_{34} = ρ_{14} ρ_{23} = ρ_{13} ρ_{24}                                   (1.1)

where ρ_{12} is the Pearson product moment correlation between X_1 and X_2, etc. (Note that any two of the three vanishing tetrad differences above entail the third.) The same is true for X_7, X_8, X_9 and any one of X_4, X_5, X_6; and for X_4, X_5, X_6 and any one of X_1, X_2, X_3 or any one of X_7, X_8, X_9. Further, for any two of X_1, X_2, X_3 or of X_7, X_8, X_9 and any two of X_4, X_5, X_6, exactly one such quadratic constraint is implied; e.g., for X_1, X_2 and X_4, X_5, the single constraint

    ρ_{14} ρ_{25} = ρ_{15} ρ_{24}                                                   (1.2)
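The vanishing tetrad differences of equations (1.1) and (1.2) are simple polynomial functions of the correlation matrix, as the following sketch (using numpy, with illustrative function names) makes explicit; a proper statistical test would additionally account for sampling variability.

    import numpy as np

    def tetrad_differences(rho, i, j, k, l):
        """The three tetrad differences for variables (i, j, k, l); any two of them vanishing imply the third."""
        return (rho[i, j] * rho[k, l] - rho[i, k] * rho[j, l],
                rho[i, j] * rho[k, l] - rho[i, l] * rho[j, k],
                rho[i, k] * rho[j, l] - rho[i, l] * rho[j, k])

    # A single-factor model over four standardized indicators with loadings lam implies
    # rho_ij = lam_i * lam_j, so every tetrad difference vanishes, as in equation (1.1).
    lam = np.array([0.9, 0.8, 0.7, 0.6])
    rho = np.outer(lam, lam)
    np.fill_diagonal(rho, 1.0)
    print(tetrad_differences(rho, 0, 1, 2, 3))   # (0.0, 0.0, 0.0)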
Statistical tests for vanishing tetrad differences are available for a wide family of distributions. Linear and non-linear models can imply other constraints on the correlation matrix, but general, feasible computational procedures to determine arbitrary constraints are not available (Geiger and Meek, 1999), nor are there any available statistical tests of good power for higher order constraints.
Figure 1.8: A more complicated latent variable model which still entails several observable constraints.
Given a pure set of sets of measured indicators of latent variables, as in Figure 1.7 (informally, a measurement model specifying, for each latent variable, a set of measured variables influenced only by that latent variable and individual, independent noises), the causal structure among the latent variables can be estimated by any of a variety of methods. Standard tests of latent variable models (e.g., chi-square tests) can be used to compare models with and without a specified edge, providing indirect tests of conditional independence among latent variables. The conditional independence facts can then be input to standard Bayes net search algorithms.
In Figure 1.7, the measured variables neatly cluster into disjoint sets, where the variables in any one set are influenced only by a single common cause, and there are no influences of the measured variables on one another. In many real cases the influences on the measured variables do not separate so simply. Some of the measured variables may influence others (as in signal leakage between channels in spectral measurements), and some or many measured variables may be influenced by two or more latent variables.

For example, the structure among the latents of a linear, Gaussian system shown in Figure 1.8 can be recovered by the procedures we propose. Our aim in what follows is to prove and use new results about implied constraints on the covariance matrix of measured variables to form measurement models that enable estimation of features of the Markov equivalence class of the latent structure in a wide range of cases. We will develop the theory first for linear models with a joint Gaussian distribution on all variables, including latent variables, and then consider possibilities for generalization.
These examples illustrate that, where appropriate parameterizations can be used, new types of constraints on the observed marginal will correspond to different independencies in the latent variable graph, even though these conditional independencies themselves cannot be directly tested. This thesis is entirely built upon this observation. Extra parametric assumptions will be necessary, but with the benefit of broader identifiability guarantees. Considering the large number of applications that adopt such parametric assumptions, our final results should benefit researchers across many fields, such as econometrics, the social sciences and psychology (Bollen, 1989; Bartholomew et al., 2002). From Chapters 3 to 6 we discuss our approach along with possible applications.
1.6   Thesis  outline
This thesis concerns algorithms for learning causal and probabilistic graphs with latent variables. The ultimate goal is learning causal relations among latent variables, but most of the thesis will focus on discovering which latents exist and how they are related to the observed variables. We provide theoretical results showing that our algorithms asymptotically generate outputs with a sound interpretation. Sound algorithms for learning causal structures indirectly provide a suitable approach for density estimation, which we show through experiments. The outline of the thesis is as follows:
• our first goal is to learn the structure of linear latent variable models under the assumption that latents are not children of observed variables. This is the common assumption of factor analysis and its variants, which are applied to several domains where observed variables are measures, and not causes, of a large set of hidden common causes. We provide an algorithm that can learn a specific parametric type of equivalence class according to tetrad constraints in order to identify which latents exist and which observed variables are their respective measures. Given this measurement model, we then proceed to find the Markov equivalence class among the hidden variables. We prove the pointwise consistency of this procedure. This is the subject of Chapter 3;

• in Chapter 4, we relax the assumption of linearity among latents. That is, hidden variables can be non-linear functions of their parents, while observed variables are still linear functions of their respective parents. We show that several theoretical results from Chapter 3 still hold in this case. We also show that some of the results do not hold for non-linear models;

• discrete models are considered in Chapter 5. There is a straightforward adaptation of our approach for linear models to the case where measurements are discrete ordinal variables. Because of the extra computational cost of estimating discrete models, we will develop this case under a different framework for learning a set of models for single latent variables. This has a correspondence with the goal of mining databases for association and causal rules;

• finally, in Chapter 6 we develop a heuristic Bayesian learning algorithm for learning latent variable models in more flexible families of probabilistic models and graphs. We emphasize results in density estimation, since the causal theory for such more general graphical models is not as developed as that for the models studied in the previous chapters.
Chapter  2
Related  work
Latent  variable  modeling  is  a  century-old  enterprise.   In  this  chapter,   we  provide  a  brief  overview
of  existing  approaches.
2.1   Factor  analysis  and  its  variants
The classical  technique  of  factor  analysis  (FA)  is  the foundation  for  many  latent  variable  modeling
techniques.   In  factor  analysis,   each  observed  variable  is  a  linear  combination  of   hidden  variables
(factors),   plus  an  additive  error  term.   Error  variables  are  mutually  independent  and  independent
of  latent  factors.   Principal  component  analysis  (PCA)  can  be  seen  as  a  special  case  of  FA,  where
the  variances  of  the  error  terms  are  constrained  to  be  equal  (Bishop,  1998).
Let X represent a vector of observed variables, L represent a vector of latent variables, and ε a vector of error terms. A factor analysis model can then be described as

    X = ΛL + ε

where Λ is a matrix of parameters, with entry λ_{ij} corresponding to the linear coefficient of L_j in the linear equation defining X_i. In this parameterization, we are setting the mean of each variable to zero to simplify the presentation.
When estimating parameters, one usually assumes that latents and error variables are multivariate normal, which implies a multivariate normal distribution among the observed variables. The covariance of X is given by

    Σ_X = E[XX^T] = Λ E[LL^T] Λ^T + E[εε^T] = Λ Σ_L Λ^T + Ψ

where M^T is the transpose of matrix M, E[X] is the expected value of random variable X, Σ_L is the model covariance matrix of the latents and Ψ the covariance matrix of the error terms, usually a diagonal matrix. A common choice of latent covariance matrix is the identity matrix, based on the assumption that latents are independent. This can be represented as a graphical model where variables in X are children of variables in L, as illustrated by Figure 1.6(a), repeated in Figure 2.1 for convenience. If latent variables are arbitrarily dependent (e.g., as a distribution faithful to a DAG), this can be represented by a graphical model connecting latents, as shown in Figure 1.6(b). By definition, the absence of an edge L_j -> X_i is equivalent to assuming λ_{ij} = 0.
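A minimal numerical sketch of this parameterization, assuming numpy and a hypothetical two-factor loading matrix in the spirit of Figure 2.1(a):

    import numpy as np

    # Hypothetical loadings for six indicators on two latents (rows: X_1..X_6, columns: L_1, L_2).
    Lambda = np.array([[0.9, 0.0],
                       [0.8, 0.0],
                       [0.7, 0.0],
                       [0.0, 0.9],
                       [0.0, 0.8],
                       [0.0, 0.7]])
    Sigma_L = np.eye(2)                     # independent latents, as in Figure 2.1(a)
    Psi = np.diag(np.full(6, 0.3))          # diagonal error covariance

    Sigma_X = Lambda @ Sigma_L @ Lambda.T + Psi   # implied covariance of the observed variables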
Figure  2.1:   Two  examples  of  graphical  representations  of  factor  analysis  models.
Figure 2.2: A "simple structure," in factor analysis terminology, is a latent graphical model where each observed variable has a single parent.
2.1.1   Identifiability and rotation

When learning latent structure from data in the absence of reliable prior knowledge, one does not want to restrict a priori how latents are connected to their respective measures. That is, in principle the matrix Λ of coefficients (sometimes called the loading matrix) does not contain any a priori specified zeroes. This creates a problem, since any linear transformation of Λ will generate an indistinguishable covariance matrix. This can be seen as follows. Let Λ_R = ΛR, where the rotation matrix R is non-singular. One can then verify that

    Σ_X = Λ Σ_L Λ^T + Ψ = Λ_R Σ_L^R Λ_R^T + Ψ

where Σ_L^R = R^{-1} Σ_L R^{-T}. This holds regardless of what the true latent covariance matrix Σ_L is.

Since Λ_R can be substantially different from Λ, one cannot learn the proper causal connections between L and X by using the empirical covariance matrix alone. This problem can in principle be solved by using higher-order moments of the distribution function (see Section 2.1.4 for a brief discussion of independent component analysis). However, this is not the case for Gaussian distributions, the typical case in applications of factor analysis. Moreover, estimating higher-order moments is more difficult than estimating covariances, which can compromise any causal discovery analysis. If one wants or needs to use only covariance information, a rotation criterion is necessary.
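The non-identifiability argument above can be checked numerically; in the sketch below the matrices Λ, Σ_L, Ψ and R are arbitrary illustrative choices, not estimates from any data set:

    import numpy as np

    rng = np.random.default_rng(0)
    Lambda = rng.normal(size=(6, 2))               # arbitrary loading matrix
    Sigma_L = np.array([[1.0, 0.4], [0.4, 1.0]])   # arbitrary latent covariance
    Psi = np.diag(rng.uniform(0.2, 0.5, size=6))   # diagonal error covariance

    R = np.array([[1.0, 0.7], [-0.3, 1.2]])        # any non-singular "rotation"
    Lambda_R = Lambda @ R
    Sigma_L_R = np.linalg.inv(R) @ Sigma_L @ np.linalg.inv(R).T

    Sigma_X_1 = Lambda @ Sigma_L @ Lambda.T + Psi
    Sigma_X_2 = Lambda_R @ Sigma_L_R @ Lambda_R.T + Psi
    print(np.allclose(Sigma_X_1, Sigma_X_2))       # True: the two loading matrices are observationally equivalent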
The most common rotation criteria attempt to rotate the loading matrix to obtain something close to a "simple structure" (Harman, 1967; Johnson and Wichern, 2002; Bartholomew and Knott, 1999; Bartholomew et al., 2002). A FA model with simple structure is a model where each observed variable has a single latent parent. Structures close to a simple structure are those where one or a few of the edges into a specific node X_i have a high absolute value, while all the other edges into X_i have coefficients close to zero. In real-world applications, it is common practice to ignore loadings with absolute values smaller than some threshold, which may be set according to a significance test. Figure 2.2 illustrates a simple structure.
Variable          L_1      L_2      L_3      L_4
100-m run         .167     .857     .246    -.138
Long jump         .240     .477     .580     .011
Shot put          .966     .154     .200    -.058
High jump         .242     .173     .632     .113
400-m run         .055     .709     .236     .330
110-m hurdles     .205     .261     .589    -.071
Discus            .697     .133     .180    -.009
Pole vault        .137     .078     .513     .116
Javelin           .416     .019     .175     .002
1500-m run       -.055     .056     .113     .990
Table  2.1:   Decathlon  data  modeled  with  factor  analysis.
In practice, the following steps are performed in a factor analysis application (a code sketch of this pipeline is given after the list):

• choose the number k of latents. This can be done by testing models with 1, 2, ..., n latents and choosing the one that maximizes some score function; for instance, choosing the smallest k such that a factor analysis model with k independent latents and a fully unconstrained loading matrix Λ has a p-value of at least 0.05 according to some test such as the chi-square test (Bartholomew and Knott, 1999);

• fit the model with k latents (e.g., by maximum likelihood estimation, Bartholomew and Knott, 1999) and apply a rotation method to achieve something close to a simple structure (e.g., the OBLIMIN method, Bartholomew and Knott, 1999);

• remove edges from latents to observed variables according to their statistical significance.
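A hedged sketch of this pipeline, using scikit-learn's FactorAnalysis (which offers varimax rather than OBLIMIN rotation) and replacing the chi-square selection and significance tests with simpler likelihood and thresholding surrogates; X is a hypothetical n-by-p data array:

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    def fit_factor_model(X, max_k=5, loading_threshold=0.3):
        # Step 1 (simplified): choose k by held-out average log-likelihood instead of a chi-square test.
        n = X.shape[0]
        train, test = X[: n // 2], X[n // 2 :]
        scores = [FactorAnalysis(n_components=k).fit(train).score(test) for k in range(1, max_k + 1)]
        k = int(np.argmax(scores)) + 1

        # Step 2: fit with k latents and rotate (varimax here; OBLIMIN is not available in scikit-learn).
        fa = FactorAnalysis(n_components=k, rotation="varimax").fit(X)
        loadings = fa.components_.T           # p x k loading matrix

        # Step 3 (simplified): drop small loadings by an absolute threshold rather than a significance test.
        loadings[np.abs(loadings) < loading_threshold] = 0.0
        return k, loadings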
The literature on how to find connections between the latents themselves is much less developed. Bartholomew and Knott (1999) present a brief discussion, but it relies heavily on the use of domain knowledge, which leads to the quote given at the beginning of Chapter 1.
2.1.2   An  example
The following example is described in Johnson and Wichern (2002), a factor analytic study of Olympic decathlon scores since World War II. The scores for all 10 decathlon events were standardized. Four latent variables were chosen using a method based on the analysis of the eigenvalues of the empirical correlation matrix. The sample size is 160. Results after rotation are shown in Table 2.1. Latent variables were treated as independent in this analysis. Statistically significant loadings (which would correspond to edges in a graphical model) are shown in bold. There is an intuitive separation of factors, with a clear component for jumping, another for running, another for throwing and a component for the longer running competition. In this case, components were well-separated. In many cases, the separation is not clear, as in the examples given in Chapter 3.

Several multivariate analysis books, such as Johnson and Wichern (2002), describe applications of factor analysis. More specialized books provide more detailed perspectives. For instance, Bartholomew et al. (2002) describe a series of case studies of factor analysis and related methods in the social sciences. Malinowski (2002) describes applications in chemistry.
2.1.3   Remarks
Given the machinery described in the previous sections, factor analysis has been widely used to discover latent variables, despite the model identification shortcomings that require rather ad hoc matrix rotation methods.

One of the fundamental ideas used to motivate factor analysis with rotation as a method for finding meaningful hidden variables is that a group of random variables can be clustered according to the strength of their correlations. As put by a traditional textbook in multivariate analysis (Johnson and Wichern, 2002, p. 514):

    Basically, the factor model is motivated by the following argument: suppose variables can be grouped by their correlations. That is, suppose all variables within a particular group are highly correlated among themselves, but have relatively small correlations with variables in a different group. Then it is conceivable that each group of variables represents a single underlying construct, or factor, that is responsible for the observed correlations.
Also, Harman (1967) suggests this criterion as a heuristic for clustering variables, achieving a model closer to a simple structure. We argue that the assumption that a simple structure can be obtained by such a criterion is unnecessary. Actually, there is no reason why it should hold even in a linear model. For example, consider the following simple structure with three latents (L_1, L_2 and L_3) and four indicators per latent. Let L_2 = 2L_1 + ε_{L_2} and L_3 = 2L_2 + ε_{L_3}, where L_1, ε_{L_2} and ε_{L_3} are all standard normal variables. Let the first and fourth indicator of each latent have a loading of 9, and the second and third have a loading of 1. This means, for example, that the first indicator of L_1 is more strongly correlated with the first indicator of L_2 than with the second indicator of L_1. Factor analysis with rotation methods will be misled, typically clustering indicators of L_2 and L_3 together.
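This counter-example is easy to verify by simulation. The sketch below assumes unit-variance measurement noise for each indicator (a detail not specified above) and confirms the stated correlation pattern:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Latent structure: L2 = 2*L1 + noise, L3 = 2*L2 + noise, all noises standard normal.
    L1 = rng.normal(size=n)
    L2 = 2 * L1 + rng.normal(size=n)
    L3 = 2 * L2 + rng.normal(size=n)

    def indicators(latent):
        # Loadings 9, 1, 1, 9 for the four indicators; unit-variance measurement noise is assumed here.
        return np.column_stack([w * latent + rng.normal(size=n) for w in (9, 1, 1, 9)])

    X = np.hstack([indicators(L) for L in (L1, L2, L3)])
    corr = np.corrcoef(X, rowvar=False)

    # First indicator of L1 vs. first indicator of L2, and vs. second indicator of L1:
    print(round(corr[0, 4], 2), round(corr[0, 1], 2))   # roughly 0.89 vs. 0.70 under these choices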
Because of identifiability problems, many techniques to learn hidden variables from data are confirmatory, i.e., they start with a conjecture about a possible latent. Domain knowledge is used for selecting the initial set of indicators to be tested as a factor analysis model. Statistical and theoretical tools here aim at achieving validity and reliability assessment (Bollen, 1989) of hypothesized latent concepts. A model for a single latent is valid if it actually measures the desired concept, and it is reliable if, for any given value of the latent variable, the conditional variance of the elements in the construct is reasonably small. Since these criteria rely on unobservable quantities, they are not easy to evaluate.
Latents confirmed with FA in principle do not rule out other possible models that might fit the data as well. Moreover, when the model does not fit the data, finding the reason for the discrepancy between theory and evidence can be difficult. Consider the case of testing a theoretical factor analysis model for a single latent. Carmines and Zeller (1979) argue that in general it is difficult for factor analysis to distinguish a model with a few factors from a one-factor model. The argument is that factor analysis may identify a systematic error variance component as an extra factor. In an example about indicators of self-esteem, they write (p. 67):
    In summary, the factor analysis summary of scale data does not provide unambiguous, and even less unimpeachable, evidence of the theoretical dimensionality underlying these self-esteem items. On the contrary, since the bifactorial structure can be a function of a single theoretical dimension which is contaminated by a method artifact as well as being indicative of two separate, substantive dimensions, the factor analysis leaves the theoretical structure of self-esteem indeterminate.
The criticism is of determining the number of factors based merely on a criterion of statistical fitness. In their self-esteem problem, the proposed solution was to rely on an extra set of theoretically relevant "external variables," other observed variables that are, by domain-knowledge assumptions, related to the concept of self-esteem. First, a scale was formed for each of the two latents in the factor analysis solution. Then, for each external variable, the correlation with both scales was computed. Since the pattern of correlations for the two scales was very similar, and there was no statistically significant difference between the correlations for any external variable comparison, the final conclusion was that the indicators were actually measuring a single abstract factor.

In contrast to Carmines and Zeller, the methods described in this thesis are data-driven. Some problems will be ultimately irreducible to a single model or a few models. While background knowledge will always be essential in practice, we will show that our approach at the very least attempts to produce submodels that can be justified on the grounds of a few domain-independent assumptions and the data.
Unlike factor analysis, our methods have theoretical justifications. If the true model is a simple structure, the method described in Section 2.1.1 is a reliable way of reconstructing the actual structure from data, despite the counter-example described earlier in this section. However, if the true model is not a simple structure, even if it is an approximate one, this method is likely to generate unpredictable results. In Chapter 3 we perform some empirical tests using exploratory factor analysis. The conclusion is that FA is largely unreliable as a method for finding simple structures. Also, unlike the pessimistic conclusions of Bartholomew and Knott (1999), we show that it is possible to find causal structures among latents, depending on how decomposable the real model is, without requiring background knowledge.
2.1.4   Other  variants
A variety of methodologies were created in order to generalize standard FA to other distributions. For instance, independent component analysis (ICA) is a family of tools motivated by blind source separation problems, where estimation requires assuming that latents are not Gaussian. Instead, some measure of independence is maximized without adopting strong assumptions concerning the marginal distribution of each latent. For instance, Attias (1999) assumes that each latent is distributed according to a semiparametric family of mixtures of Gaussians.

Still, at its heart ICA relies heavily on the original idea of factor analysis, interpreting observed variables as joint measurements of a set of independent latents. Some extensions, such as tree-based component analysis (Bach and Jordan, 2003), attempt to relax this assumption by allowing a tree-structured model among latents. This approach, however, is difficult to generalize to more flexible graphical models due to its computational cost and identifiability problems. For a few problems such as blind source separation such an assumption may be reasonable, but it is more often the case that it is not. Most variations of factor analysis, while useful for density estimation and data visualization (Minka, 2000; Bishop, 1998; Ghahramani and Beal, 1999; Buntine and Jakulin, 2004), insist on the assumption of independent latents.
2.1.5   Discrete  models  and  item-response  theory
While several variations of factor analysis concentrate on continuous models, there is also a large literature on discrete factor analysis. Some concern models with discrete latents and measures, such as latent class analysis (Bartholomew et al., 2002), discrete PCA (Buntine and Jakulin, 2004) and latent Dirichlet allocation (Blei et al., 2003). This thesis concerns models with continuous latents only. A discussion of the suitability of continuous latents can be found in Bartholomew and Knott (1999) and Bartholomew et al. (2002).

Factor analysis models with continuous latents and discrete indicators are generally known as latent trait models. A discussion of latent trait models for ordinal and binary indicators is given in Chapter 5. In the rest of this section, we discuss latent trait models in the context of item-response theory (IRT). The field of IRT consists of the analysis of multivariate (usually binary) data as measurements of underlying "abilities" of an individual. This is the case in research on educational testing, whose goal is to design tests that measure the skills of a student according to determined factors such as "mathematical skills" or "language skills." Once one models each desired ability as a latent variable, such random variables can be used to rank individuals and provide information about the distribution of such individuals in the latent space.

Much of the research on IRT consists of designing tests of unidimensionality, that is, statistical procedures to determine if a set of questions are indicators of a single latent factor. Conditioned on such a factor, indicators should be independent. Besides testing for the dimensionality of a set of observed variables, estimating the response functions (i.e., the conditional distribution of each indicator given its latent parents) is part of the core research in IRT.
Parametric models of IRT are basically latent trait models. For the purposes of learning latent structure, they are not essentially different from generic latent trait models, as explained in Chapter 5. A more distinctive aspect of IRT research is on nonparametric models (Junker and Sijtsma, 2001), where no finite dimensional set of parameters is assumed in the description of the response functions. Instead, the assumption of monotonicity of all response functions is used: this means that for a particular indicator X_i and a vector of latent variables Θ, P(X_i = 1 | Θ) is non-decreasing as a function of (the coordinates of) Θ.
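As an illustration of a monotone parametric response function, the two-parameter logistic (2PL) model commonly used in IRT can be sketched as follows; the discrimination a_i and difficulty b_i values below are arbitrary illustrative choices, and a_i > 0 is what guarantees monotonicity in the latent value:

    import numpy as np

    def two_pl_response(theta, a_i=1.5, b_i=0.0):
        """P(X_i = 1 | theta) under the 2PL model; non-decreasing in theta when a_i > 0."""
        return 1.0 / (1.0 + np.exp(-a_i * (theta - b_i)))

    theta = np.linspace(-3, 3, 7)
    print(np.all(np.diff(two_pl_response(theta)) >= 0))   # True: the response function is monotone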
Some approaches allow mild violations of independence conditioned on the latents, as long as estimation of the latent values can be consistently done when the number of questions (i.e., indicators) goes to infinity (see, e.g., Stout, 1990). Many non-parametric IRT approaches use a statistic as a proxy for the latent factors (such as the number of correctly answered questions) in order to estimate non-parametric associations due to common hidden factors. Junker and Sijtsma (2001) and Habing (2001) briefly review some of these approaches. Although this thesis does not follow the non-parametric IRT approach in any direction, it might provide future extensions to our framework.
2.2   Graphical   models
Beyond variations of factor analysis, there is a large literature on learning the structure of graphical models. Graphical models became a representation of choice for computer science and artificial intelligence applications for systems operating under conditions of uncertainty, such as in probabilistic expert systems (Pearl, 1988). Bayesian networks and belief networks are the common denominations in such contexts. They have also been used for decades in econometrics and the social sciences (Bollen, 1989), usually to represent linear relations with additive errors. Such models are called structural equation models (SEMs).

The very idea of using graphical models is to be able to express qualitative information that is difficult or impossible to express with probability distributions only. For instance, the consequences of conditional independence conditions can be worked out with much less effort in the language of graphs than in the probability calculus. It becomes easier to add prior knowledge, as well as to use the machinery of graph theory to develop exact and approximate inference algorithms. However, perhaps the greatest gain in expressive power is allowing the expression of causal relations, which seems impossible to achieve (at least in a more general sense) by means of probability calculus only (Spirtes et al., 2000; Pearl, 2000).
2.2.1   Independence  models
We described the PC algorithm in Chapter 1, stressing that such an algorithm assumes that no pair of observed variables has a hidden common cause. The Fast Causal Inference (FCI) algorithm (Spirtes et al., 2000) is an alternative algorithm for learning Markov equivalence classes of a special class of graphs, called mixed ancestral graphs (MAGs) by Richardson and Spirtes (2002). MAGs allow the expression of which pairs of observed variables have hidden common causes. The FCI algorithm returns a representation of the Markov equivalence class of MAGs given the conditional independence statements that are known to hold among the observed variables. This representation shares many similarities with the pattern graphs used to represent Markov equivalence classes of DAGs.

Consider Figure 2.3(a), representing a true model with three hidden variables H_1, H_2 and H_3. The marginal distribution of {W, X, Y, Z} is faithful to several Markov equivalent MAGs. All equivalent graphs can be represented by the graph shown in Figure 2.3(b). Although describing such a representation in detail is out of the scope of this thesis, it suffices to say that, e.g., the edge X o-> Y means that it is possible that X and Y have a hidden common cause, and that we know for sure that Y is not a cause of X. The edge Z -> W means that Z causes W, and there is no hidden common cause between them.
Since only observed conditional independencies are used by FCI, any model where most observed
variables  are  connected  by  hidden  common  causes  will  be  problematic.   For  instance,   consider  the
true model given in Figure 2.3(c), where H  is a hidden common cause of all observed variables.   Since
no observed  independencies exist,  the output of FCI will be the sound, but minimally informative,
graph of Figure 2.3(d).   Such graphs do not attempt to represent latents explicitly.   In contrast, this
thesis  provides  an  algorithm  able  to  reconstruct  Figure  2.3(c)  when  observed  variables  are  linear
functions  of  H.
An  algorithm  such  as  FCI  is  still   necessary  in  models  where  observed  independencies  do  not
exist.   Ultimately,  even  an  algorithm  able  to  explicitly  represent  latents  still  needs  to  describe  how
latent  nodes are  connected.   Since explicit  latent  nodes might  have  hidden common  causes  that  are
not  represented  in  the  graphical   model,   a  representation  such  as  a  MAG  can  be  used  to  account
for  these  cases.
2.2.2   General   models
Many  standard  models  can  be  recast  in  graphical  representations  (e.g.,   factor  analysis  as  a  graph
where  edges  are  oriented  from  latents   to  observed  variables).   Under  the  graphical   modeling  lit-
erature,   there  are  several   approaches  for  dealing  with  latent  variables  beyond  models  of   Markov
Figure 2.3: Figure (b) represents the Markov equivalence class of MAGs compatible with the marginal distribution of {W, X, Y, Z} represented in Figure (a). Figure (d) represents the Markov equivalence class of MAGs compatible with the marginal distribution of {W, X, Y, Z} represented in Figure (c).
Many of them are techniques for fitting parameters given the structure (Binder et al., 1997; Bollen, 1989) or for choosing the number of latents for a factor analysis model (Minka, 2000).

Elidan et al. (2000) empirically evaluate heuristics for introducing latent variables. These heuristics were independently suggested on several occasions (e.g., Heckerman, 1998) and rest on the observation that if two variables are conditionally independent given a set of other observed variables, then, given the faithfulness condition, they should not have hidden common causes. Given a DAG representing probabilistic dependencies among observed variables, a clique of nodes might be the result of hidden common causes that explain such associations. The specific implementation of Elidan et al. (2000) introduces latent variables as hidden common causes of sets of densely connected nodes (not necessarily cliques, in order to account for statistical mistakes in the original DAG). Since we are going to use FindHidden in some of our experiments, we describe the variation we used in Table 2.2. It also serves as an illustration of a score-based search algorithm, as suggested in Chapter 1.
This algorithm uses as a sub-routine a StandardHillClimbing(G_start) procedure. Starting from a graph G_start, this is simply a greedy search algorithm among DAGs: given a current DAG G, all possible variations of G generated by either

• adding one edge to G
• deleting one edge from G
• reversing an edge in G

are evaluated. Given a dataset D, the candidate that achieves the highest score according to a given score function T(G, D) is chosen as the new DAG, unless the current graph G has a higher score. In that case, the algorithm halts and returns G. Although simple, such a heuristic is quite effective for learning DAGs without latent variables (Chickering, 2002; Cooper, 1999), especially if enriched with standard techniques from combinatorial optimization for escaping poor local optima, such as tabu lists, beam search and annealing.
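A minimal sketch of the move set and greedy loop just described, using networkx DAGs and an abstract score function T passed in as a callable; real implementations cache scores and exploit score decomposability rather than rescoring whole graphs:

    import networkx as nx
    from itertools import permutations

    def neighbors(g):
        """Yield all DAGs obtained from g by adding, deleting, or reversing a single edge."""
        for u, v in permutations(g.nodes, 2):
            h = g.copy()
            if g.has_edge(u, v):
                h.remove_edge(u, v)                    # edge deletion
                yield h
                h2 = h.copy()
                h2.add_edge(v, u)                      # edge reversal
                if nx.is_directed_acyclic_graph(h2):
                    yield h2
            elif not g.has_edge(v, u):
                h.add_edge(u, v)                       # edge addition
                if nx.is_directed_acyclic_graph(h):
                    yield h

    def standard_hill_climbing(g_start, score):
        """Greedy search: move to the best-scoring neighbor until no neighbor improves the score."""
        g = g_start
        while True:
            best = max(neighbors(g), key=score, default=None)
            if best is None or score(best) <= score(g):
                return g
            g = best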
Algorithm FindHidden
Input: a dataset D

1.  Let G_null be a graph over the variables in D with no edges.
2.  G ← StandardHillClimbing(G_null)
3.  Do
4.      Let C be the set of semicliques in G
5.      Let C_i ∈ C be the semiclique that maximizes T(IntroduceHidden(G, C_i))
6.      G_new ← StandardHillClimbing(IntroduceHidden(G, C_i), D)
7.      If T(G_new, D) > T(G, D)
8.          G ← G_new
9.  While G changes
10. Return G
Table  2.2:   One  of  the  possible  variations  of  FindHidden  (Elidan  et  al.,   2000;   Heckerman,   1995),
which  iteratively  introduces  one  latent  at  a  time  and  attempts  to  learn  a  directed  graph  structure
given  such  hidden  nodes.
FindHidden extends this idea by introducing latent variables into dense regions of G.
Such dense regions are denominated "semicliques," which are basically groups of nodes where each node is adjacent to at least half of the other members of the group. Heuristics for enumerating the semicliques of a graph are given by Elidan et al. (2000).
Given a semiclique C_i, the operation IntroduceHidden(G, C_i) returns a modification of a graph G obtained by introducing a new latent L_i, removing all edges into elements of C_i, and making L_i a common parent of all elements in C_i. Moreover, for each parent P_j of a node in C_i, we set P_j to be a parent of L_i unless that creates a cycle. According to Step 5 of Table 2.2, in our implementation we choose among the possible semicliques to start the next cycle of FindHidden by picking the one that has the best initial score.
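A possible sketch of the semiclique test and of the IntroduceHidden operation as described above, using networkx; the treatment of parents that are themselves members of C_i is one of several reasonable readings of the description:

    import networkx as nx

    def is_semiclique(g, nodes):
        """Each member is adjacent (in either direction) to at least half of the other members."""
        und = g.to_undirected()
        return all(sum(und.has_edge(v, w) for w in nodes if w != v) >= (len(nodes) - 1) / 2.0
                   for v in nodes)

    def introduce_hidden(g, clique_nodes, latent_name):
        """Return a copy of g with a new latent as the common parent of clique_nodes."""
        h = g.copy()
        old_parents = {p for v in clique_nodes for p in g.predecessors(v)} - set(clique_nodes)
        for v in clique_nodes:                       # remove all edges into members of C_i
            h.remove_edges_from([(p, v) for p in list(h.predecessors(v))])
        h.add_node(latent_name)
        for v in clique_nodes:                       # L_i becomes a common parent of C_i
            h.add_edge(latent_name, v)
        for p in old_parents:                        # re-attach former parents to L_i when acyclic
            h.add_edge(p, latent_name)
            if not nx.is_directed_acyclic_graph(h):
                h.remove_edge(p, latent_name)
        return h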
Heuristic methods such as FindHidden, however, have as their main goal reducing the number of parameters in a Bayesian network. The idea is to reduce the variance of the resulting density estimator, achieving better probabilistic predictions. They do not provide any formal interpretation of what the resulting structure actually is, no explicit assumptions on how such latents should interact with the observed variables, no analysis of possible equivalence classes, and consequently no search algorithm that can account for equivalence classes. For probabilistic modeling, the results described by Elidan et al. (2000) are a convincing demonstration of the suitability of this approach, which is intuitively sound. For causal discovery under the assumption that all observed variables have hidden common causes (such as in the problems we discussed in Chapter 1), they are an unsatisfying solution.
The introduction of proper assumptions on how latents and measures interact makes learning the proper structure a more realistic possibility. By assuming a discrete distribution for latent variables and observed measurements in a hidden Markov model (HMM), Beal et al. (2001) present algorithms for learning the transition and emission probabilities with good empirical results. The only assumption about the structure of the true graph is that it is a hidden Markov model; no a priori information on the number of latents or on which observed variables are indicators of which latents is necessary. No tests of significance for the parameters are discussed, since model selection was not the goal. However, if one wants qualitative information about independence (as necessary in our axiomatic causality calculus), such an analysis has to be carried out. This is also necessary in order to scale this approach to models with a large number of latent variables.
As another example, Zhang (2004) provides a sound representation for latent variable models of discrete variables (both observed and latent) with a multinomial probabilistic model. The model is constrained to be a tree, however, and every observed variable has one and only one (latent) parent and no child. Similar to factor analysis, no observed variable can be a child of another observed variable or a parent of a latent. Instead of searching for variables that satisfy this assumption, Zhang (2004) assumes the measured variables satisfy it. To some extent, an equivalence class of graphs is described, which limits the number of latents and the possible number of states each categorical latent variable can have without being empirically indistinguishable from another graph with fewer latents or fewer states per latent. Under these assumptions, the set of possible latent variable models is therefore finite.
Approaches such as those of Zhang (2004) and Elidan et al. (2000) are score-based search algorithms for learning DAGs with latent variables. Therefore, they require scoring thousands of candidate models, which can be a very computationally expensive operation, since calculating the most common score functions requires solving non-convex optimization problems. More importantly, in principle they also require re-evaluation of the whole model for each score evaluation. The cost of such re-evaluation is prohibitive in all but very small problems.
However, the Structural EM framework (Friedman, 1998) can greatly simplify the problem. Structural EM algorithms introduce a graphical search module into an expectation-maximization algorithm (Mitchell, 1997), besides parameter learning. If the score function to be optimized (usually the posterior distribution of the graph or a penalized log-likelihood) is linear in the expected moments of the hidden variables, such moments are initially calculated (the expectation step), fixed as if they were observed data, and structural search proceeds as if there were no hidden variables (the maximization step). If the score function does not have this linearity property, some approximations might be used instead (Friedman, 1998).

There are different variations of Structural EM. For instance, we make use of the following variation (sketched schematically after the list):
1.  choose an initial graph and an initial set of parameter values. It will be clear in our context which initial graphs are chosen;

2.  maximize the score function with respect to the parameters;

3.  use the parameter values to obtain all required expected sufficient statistics (in our case, first and second order moments of the joint distribution of the completed data, i.e., including observed and hidden data points);

4.  apply the respective structure search algorithm to maximize the score function as if the expected sufficient statistics were observed data;

5.  if the graphical structure changed, return to Step 2.
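Schematically, this variation can be written as the following loop, where fit_params, expected_moments and structure_search are placeholders for the model-specific routines described in Steps 2-4:

    def structural_em(initial_graph, initial_params, data, fit_params, expected_moments,
                      structure_search, max_iter=50):
        """Alternate parameter fitting, expected sufficient statistics, and structure search (Steps 1-5)."""
        graph, params = initial_graph, initial_params
        for _ in range(max_iter):
            params = fit_params(graph, params, data)          # Step 2: maximize score over parameters
            stats = expected_moments(graph, params, data)     # Step 3: moments of the completed data
            new_graph = structure_search(graph, stats)        # Step 4: search as if stats were observed data
            if new_graph == graph:                            # Step 5: stop when the structure is unchanged
                return graph, params
            graph = new_graph
        return graph, params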
Structural EM-based algorithms are usually much faster than straightforward implementations, which might not be feasible at all otherwise. However, this framework is not without its shortcomings. A bad choice of initial graph, for instance, might easily result in a bad local maximum. Some guidelines for the proper application of Structural EM are given by Friedman (1998).
Glymour et al. (1987) and Spirtes et al. (2000) describe algorithms for modifying a latent variable model using constraints on the covariance matrix of the observed variables. These approaches are also either heuristic or require strong background knowledge, and they do not generate new latents from data. Pearl (1988) discusses a complementary approach that generates new latents, but it requires the true model to be a tree, similarly to Zhang (2004). This thesis can be seen as a generalization of these approaches with formal results of consistency. Spirtes et al. (2000) present a sound test of conditional independence among latents, but it requires knowing in advance which observed variables measure which latents. We discuss this in detail in Chapter 3.
A recurring debate in the structural equation modeling literature is whether one should learn models from data by the following two-step approach: 1. find which latents exist and which observed variables are their corresponding indicators; 2. given the latents, find the causal connections among them. The alternative is trying to achieve both at the same time (Fornell and Yi, 1992; Hayduk and Glaser, 2000; Bollen, 2000). As we will see, this thesis strongly supports a two-step procedure. A good deal of the criticism of two-step approaches concerns the use of methods that suffer from non-identifiability shortcomings, such as factor analysis. In fact, we do not claim we can find the true model. Our solution, explained in Chapter 3, is to try to discover only features that can be identified, and to report ignorance about what we cannot identify. The structural equation modeling literature offers no alternative. Instead, current two-step approaches are "naive" in the sense that they do not account for equivalence classes (Bollen, 2000), and any one-step approach is hopeless: the arguments for this approach show an unhealthy obsession with using extensive background knowledge (Hayduk and Glaser, 2000), i.e., they mostly avoid solving the problem they are supposed to solve, which is learning from data. Although we again stress that assumptions concerning the true structure of the problem at hand are always necessary, we favor a more data-driven solution.
2.3   Summary
Probabilistic modeling through latent variables is a mature field. Causal modeling with latent variable models is still a fertile field, mostly because researchers in this area are usually not concerned about equivalence classes.

Carreira-Perpinan (2001) gives an extended review of probabilistic modeling with latent variables. Glymour (2002) offers a more detailed discussion of the shortcomings of factor analysis. The journals Psychometrika and Structural Equation Modeling are primary sources of research in latent variable modeling via factor analysis and SEMs.
Chapter  3
Learning  the  structure  of  linear  latent
variable  models
The  associations   among  a  set   of   measured  variables   can  often  be  explained  by  hidden  common
causes.   Discovering   such  variables,   and  the   relations   among  them,   is   a  pressing  challenge   for
machine  learning.   This  chapter  describes  an  algorithm  for  discovering  hidden  variables  in  linear
models  and  the  relations  between  them.   Under  the  Markov  and  faithfulness  conditions,   we  prove
that our algorithm achieves Fisher consistency: in the limit of infinite data, all causal claims made
by our algorithm are correct in a sense we make precise.   In order to evaluate our results, we perform
simulations  and  three  case  studies  with  real-world  data.
3.1   Outline

This chapter concerns linear models, a very important class of latent variable models. It is organized as follows:

• Section 3.2, "The setup," formally defines the problem and makes explicit the assumptions we adopt;

• Section 3.3, "Learning measurement models," describes an approach to deal with half of the given problem, i.e., discovering latent common causes and which observed variables measure them;

• Section 3.4, "Learning the structure of the unobserved," describes an algorithm to learn a Markov equivalence class of causal graphs over latent variables given a measurement model;

• Section 3.5, "Empirical results," discusses a series of experiments with simulated data and three real-world data sets, along with criteria of success;

• Section 3.6, "Conclusion," wraps up the contributions of this chapter.
3.2   The  setup
We adopt the framework of causal graphical models.   More background material in graphical causal
models  can  be  found  in  Spirtes  et  al.  (2000)  or  Pearl  (2000)  and  Chapter  1.
3.2.1   Assumptions
The goal  of our work is to reconstruct features of the structure of a latent  variable graphical model
from i.i.d.   observational data sampled from a subset of the variables in the unknown model.   These
features should  be sound and  informative.   We  assume that  the true causal  graph  G generating  the
data  has  the  following  properties:
A1. there are two types of nodes: observed and latent;

A2. no observed node is an ancestor of any latent node. We call this property the measurement assumption;

A3. G is acyclic.

We call such objects latent variable graphs. Further, we assume that G is quantitatively instantiated as a semi-parametric probabilistic model with the following properties:

A4. G satisfies the causal Markov condition;

A5. each observed node O is a linear function of its parents plus an additive error term of positive finite variance;

A6. let V be the set of random variables represented as nodes in G, and let f(V) be their joint distribution. We assume that f(V) is faithful to G: that is, a conditional independence relation holds in f(V) if and only if it is entailed in G by d-separation.
Without loss of generality, we will assume all random variables have zero mean. We call such an object a linear latent variable model, or simply latent variable model. A single symbol, such as G, will be used to denote both a latent variable model and the corresponding latent variable graph. Notice that Zhang (2004) does not require latent variable models to be linear, but he requires the entire graph to be a tree, besides relying on the measurement assumption. We do not need to assume any special constraints on the graphical structure of our models besides it being a directed acyclic graph (DAG).
Linear latent variable models are ubiquitous in econometric, psychometric, and social scientific studies (Bollen, 1989), where they are usually known as structural equation models. The methods we describe here rely on statistical constraints for continuous variables that are well known for such models. In theory, it is straightforward to extend them to model binary or ordinal discrete variables, as discussed in Chapter 5. The method of Zhang (2004) is applicable to discrete sample spaces only.

Two important definitions will be used throughout this chapter (Bollen, 1989):

Definition 3.1 (Measurement model) Given a latent variable model G, the submodel containing the complete set of nodes, and all and only those edges that point into observed nodes, is called the measurement model of G.

Definition 3.2 (Structural model) Given a latent variable model G, its submodel containing all and only its latent nodes and respective edges is the structural model of G.
3.2.2   The  Discovery  Problem
The discovery problem can loosely be formulated as follows: given a data set with variables O that are observed variables in a latent variable model G satisfying the above conditions, learn a partial description of the measurement and structural models of G that is as informative as possible.

Since we put very few restrictions on the graphical structure of the unknown model G, we will not be able to uniquely determine G's full structure. For instance, suppose there are many more latent common causes than observed variables, and every latent is a parent of every observed variable: no learning procedure can realistically be expected to identify such a structure. However, instead of making extra assumptions about the unknown graphical structure (e.g., assuming the number of latents is bounded by a known constant, that the causal model is tree-structured, etc.), we adopt a data-driven approach: if there are features that cannot be identified, then we simply report ignorance.
We can further break up the discovery problem into three sub-problems:

DP1. Discover the number of latents in G.

DP2. Discover which observed variables measure each latent in G.

DP3. Discover the Markov equivalence class among the latents in G.

The first two sub-problems involve discovering the measurement model, and the third discovering the structural model. Accordingly, our algorithm takes a two-step approach: in stage 1 it learns as much as possible about features of the measurement model of G, and in stage 2 it learns as much about the features of the structural model as possible using the measurement features discovered in stage 1. Exploratory factor analysis (EFA) can be viewed as an alternative algorithm for stage 1: finding the measurement model. In our simulation studies, we compare our procedure against EFA on several desiderata relevant to this task.
More specifically, we will focus on learning a special type of measurement model, called a pure measurement model.
Definition 3.3 (Pure measurement model) A pure measurement model is a measurement model in which each observed variable has only one latent parent, and no observed parent. That is, it is a tree beneath the latents.
A  pure  measurement  model   implies  a  clustering  of  observed  variables:   each  cluster  is  a  set  of
observed variables that share a common (latent) parent, and the set of latents defines a partition
over  the  observed  variables.
There are several reasons to justify the focus on pure instead of general measurement models. First, as explained in Section 3.4, this provides enough information concerning the Markov equivalence class of the structural model.
The second reason is more practical: the equivalence class of general measurement models that are indistinguishable can be very hard to represent. While, for instance, a Markov equivalence class for models with no latent variables can be neatly represented by a single graphical object known as a pattern (Pearl, 2000; Spirtes et al., 2000), the same is not true for
latent variable models. For instance, the models in Figure 3.1 differ not only in the direction of the edges, but also in the adjacencies themselves (X_1, X_2 adjacent in one case, but not X_3, X_4; X_3, X_4 adjacent in another case, but not X_1, X_2) and in the role of the latent variables (ambiguity
Figure 3.1: All of these four models can be indistinguishable given the information contained in the covariance matrix.
about which latent d-separates which observed variables, how they are connected, etc.   Notice that,
in  Figure  3.1(d),   there  is   no  latent   that   d-separates   three  observed  variables,   unlike  in  Figures
(a),   (b)  and  (c)).   Just  representing  the  class  of   this  very  small   example  can  be  cumbersome  and
uninformative.
In the next section, we describe a solution to the problem of learning pure measurement models by dividing it into two main steps:
1. find an intermediate representation, called a measurement pattern, which implicitly encodes all the information necessary to find a pure measurement model. This is done in Section 3.3.2.
2. purify the measurement pattern by choosing a subset of the observed variables given in the pattern, such that this subset can be partitioned according to the latents in the true graph. This is done in Section 3.3.3.
Concerning the example given in Figure 3.1, if the input is data generated by any of the models in this figure, our algorithm will be conservative and return an empty model. The equivalence class is too broad to provide information about latents and their causal connections.
3.3   Learning  pure  measurement  models
Given the covariance matrix of four random variables {A, B, C, D}, we have that zero, one or three of the following tetrad constraints may hold (Glymour et al., 1987):

σ_AB σ_CD = σ_AC σ_BD
σ_AC σ_BD = σ_AD σ_BC
σ_AB σ_CD = σ_AD σ_BC

where σ_XY represents the covariance of X and Y. Like conditional independence constraints, different latent variable models can entail different tetrad constraints, and this was explored heuristically by Glymour et al. (1987). Therefore, a given set of observed tetrad constraints will restrict the set of possible latent variable graphs.
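For illustration, a minimal sketch of how the three tetrad differences can be evaluated from a covariance matrix is given below; the function names and the tolerance are illustrative choices, not part of the procedures described in this chapter, and S is assumed to be a numpy covariance matrix.

import numpy as np

def tetrad_differences(S, a, b, c, d):
    """The three tetrad differences among variables with indices a, b, c, d in the
    covariance matrix S; they vanish exactly when the corresponding constraints hold."""
    t1 = S[a, b] * S[c, d] - S[a, c] * S[b, d]   # sigma_AB sigma_CD - sigma_AC sigma_BD
    t2 = S[a, c] * S[b, d] - S[a, d] * S[b, c]   # sigma_AC sigma_BD - sigma_AD sigma_BC
    t3 = S[a, b] * S[c, d] - S[a, d] * S[b, c]   # sigma_AB sigma_CD - sigma_AD sigma_BC
    return t1, t2, t3

def tetrads_hold(S, quad, tol=1e-8):
    """True if all three tetrad constraints hold (up to tol) for the given foursome."""
    return all(abs(t) < tol for t in tetrad_differences(S, *quad))

With finite samples, the exact comparison against a tolerance would be replaced by a statistical test of vanishing tetrads; throughout this section the population covariance matrix is assumed known, so exact checks suffice.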
The key to solving the problem of structure learning is a graphical characterization of tetrad constraints. Consider Figure 3.2(a). A single latent d-separates four observed variables. When this graphical model is linearly parameterized as

X_1 = λ_1 L + ε_1
X_2 = λ_2 L + ε_2
X_3 = λ_3 L + ε_3
X_4 = λ_4 L + ε_4
Figure 3.2: A linear latent variable model with any of the graphical structures above entails all possible tetrad constraints in the marginal covariance matrix of X_1, . . . , X_4.
it entails all three tetrad constraints among the observed variables. That is, any choice of values for the coefficients λ_1, λ_2, λ_3, λ_4 and error variances implies

σ_{X1X2} σ_{X3X4} = (λ_1 λ_2 σ²_L)(λ_3 λ_4 σ²_L) = (λ_1 λ_3 σ²_L)(λ_2 λ_4 σ²_L) = σ_{X1X3} σ_{X2X4}
σ_{X1X2} σ_{X3X4} = (λ_1 λ_2 σ²_L)(λ_3 λ_4 σ²_L) = (λ_1 λ_4 σ²_L)(λ_2 λ_3 σ²_L) = σ_{X1X4} σ_{X2X3}

where σ²_L is the variance of the latent variable L.
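The entailment can also be checked numerically: building the covariance matrix implied by the parameterization above, Σ = σ²_L λλᵀ + diag(error variances), for arbitrary parameter values and evaluating the three tetrad differences gives zero up to floating-point error. A minimal sketch, with arbitrary (hypothetical) parameter values:

import numpy as np

# Hypothetical parameter values for the one-factor model X_i = lambda_i * L + epsilon_i.
lam = np.array([0.7, -1.2, 0.9, 1.5])      # coefficients lambda_1, ..., lambda_4
var_L = 2.0                                 # variance of the latent L
var_e = np.array([1.0, 0.5, 1.3, 0.8])      # error variances

# Implied covariance matrix: Sigma = var_L * lam lam^T + diag(error variances).
Sigma = var_L * np.outer(lam, lam) + np.diag(var_e)

tetrads = (Sigma[0, 1] * Sigma[2, 3] - Sigma[0, 2] * Sigma[1, 3],
           Sigma[0, 2] * Sigma[1, 3] - Sigma[0, 3] * Sigma[1, 2],
           Sigma[0, 1] * Sigma[2, 3] - Sigma[0, 3] * Sigma[1, 2])
print(tetrads)   # all three differences are zero up to floating-point error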
While this result is straightforward, the relevant result for a structure learning algorithm is the converse, i.e., establishing equivalence classes from observable tetrad constraints. For instance, Figures 3.2(b) and (c) are different structures with the same entailed tetrad constraints that should be accounted for. One of the main contributions of this thesis is to provide several such identification results, and sound algorithms for learning causal structure based on them. Such results require elaborate proofs that are left to the Appendix. What follows are descriptions of the most significant lemmas and theorems, and illustrative examples.
We start with one of the most basic lemmas, used as a building block for the more elaborate results. It is basically the converse of the observation above. Let ρ_AB be the Pearson correlation coefficient of random variables A and B, and let G be a linear latent variable model with observed variables O:
Lemma 3.4 Let X_1, X_2, X_3, X_4 ∈ O be such that σ_{X1X2} σ_{X3X4} = σ_{X1X3} σ_{X2X4} = σ_{X1X4} σ_{X2X3}. If ρ_AB ≠ 0 for all {A, B} ⊆ {X_1, X_2, X_3, X_4}, then there is a node P that d-separates all elements of {X_1, X_2, X_3, X_4} in G.
It follows that, if no observed node d-separates {X_1, X_2, X_3, X_4}, then node P has to be a latent node.
In  order  to  learn  a  pure  measurement  model,   we  basically  need  two  pieces  of   information:   i.
which  sets  of  nodes  are  d-separated  by  a  latent;   ii.   which  sets  of  nodes  do  not  share  any  common
hidden parent. The first piece of information can provide possible indicators (children/descendants) of a specific latent. However, this is not enough information, since a set S of observed variables can be d-separated by a latent L, and yet S might contain non-descendants of L (one of the nodes might have a common ancestor with L and not be a descendant of L, for instance). This is the reason why we need to cluster observed variables into different sets when it is possible to show they cannot share a common hidden parent. We will show that most non-descendant nodes can be removed if we are able to separate nodes in such a way.
Figure 3.3: If sets {X_1, X_2, X_3, Y_1} and {X_1, Y_1, Y_2, Y_3} are each d-separated by some node (e.g., as in Figures (a) and (b) above), the existence of a common parent L for X_1 and Y_1 implies a common node d-separating {X_1, Y_1} from {X_2, Y_2}, for instance (as exemplified in Figure (c)).
There are several possible combinations of observable tetrad constraints that allow one to identify such a clustering. Consider, for instance, the following case. Suppose we have a set of six observable variables, X_1, X_2, X_3, Y_1, Y_2 and Y_3, such that:
1. there is some latent node that d-separates all pairs in {X_1, X_2, X_3, Y_1} (Figure 3.3(a));
2. there is some latent node that d-separates all pairs in {X_1, Y_1, Y_2, Y_3} (Figure 3.3(b));
3. there is no tetrad constraint σ_{X1X2} σ_{Y1Y2} − σ_{X1Y2} σ_{X2Y1} = 0;
4. no pairs in {X_1, . . . , Y_3} × {X_1, . . . , Y_3} have zero correlation.
Notice that it is possible to empirically verify the first two conditions by using Lemma 3.4. Now suppose, for the sake of contradiction, that X_1 and Y_1 have a common hidden parent L. One can show that L should d-separate all elements in {X_1, X_2, X_3, Y_1}, and also in {X_1, Y_1, Y_2, Y_3}. With some extra work (one has to consider the possibility of nodes in {X_1, X_2, Y_1, Y_2} having common parents with L, for instance), one can show that this implies that L d-separates {X_1, Y_1} from {X_2, Y_2}. For instance, Figure 3.3(c) illustrates a case where L d-separates all of the given observed variables.
However, this contradicts the third item in the hypothesis (such a d-separation will imply the forbidden tetrad constraint, as we show in the formal proof) and, as a consequence, no such L should exist. Therefore, the items above correspond to an identification rule for discovering some d-separations concerning observed and hidden variables (in this case, we show that X_1 is independent of all latent parents of Y_1 given some latent ancestor of X_1). This rule only uses constraints that can be tested from the data.
We  restrict  our  algorithm  to  search  for  measurement  models  that  entail   the  observed  tetrad
constraints and vanishing partial correlations judged to hold in the population.   However, since these
constraints  ignore  any  information  concerning  the  joint  distribution  besides  its  second  moments,
this  might  seem  too  restrictive.
Figure 3.4 helps us understand the limitations of tetrad constraints. Similarly to the example given in Figure 3.1, here we have several models that can represent the same tetrad constraint, σ_WY σ_XZ = σ_WZ σ_XY, and no other. However, this is much less of a problem when learning
Figure 3.4: Three different latent variable models that can explain a tetrad constraint σ_WY σ_XZ = σ_WZ σ_XY. Bi-directed edges represent independent hidden common causes.
pure  models.   Moreover,   trying  to  distinguish  among  such  models  using  higher  order  moments  of
the  distribution  will   increase  the  chance  of   committing  statistical   mistakes,   a  major  concern  for
automated  structure  discovery  algorithms.
We  claim  that  what  can  be  learned  from  pure  models  alone  can  still   be  substantial.   This  is
supported by  the  empirical  results  discussed  in  Section  6,  and  by  various  results  on  factor  analysis
that  empirically  demonstrate  that,   under  an  appropriate  rotation,   it  is  often  the  case  that  many
observed variables have a single or few significant parents (Bartholomew et al., 2002), with a
reasonably  large  pure  measurement   submodel.   Substantive  causal   information  can  therefore  be
learned  in  practice  using  only  pure  models  and  the  observed  covariance  matrix.
3.3.1   Measurement  patterns
We say that a linear latent variable graph G entails a constraint if and only if the constraint holds in every distribution with covariance matrix parameterized by Θ, the set of linear coefficients and error variances that defines the conditional expectation and variance of a node given its parents. A tetrad equivalence class T(C) is a set of latent variable graphs T, each member of which entails the same set of tetrad constraints C among the measured variables. An equivalence class of measurement models M(C) for C is the union of the measurement models in T(C). We now introduce a graphical representation of common features of all elements of M(C).
Definition 3.5 (Measurement pattern) A measurement pattern, denoted MP(C), is a graph representing features of the equivalence class M(C) satisfying the following:
• there are latent and observed nodes;
• the only edges allowed in an MP are directed edges from latents to observed nodes, and undirected edges between observed nodes. Every observed node in an MP has at least one latent parent;
• if two observed nodes X and Y in MP(C) do not share a common latent parent, then X and Y do not share a common latent parent in any member of M(C);
• if X and Y are not linked by an undirected edge in MP(C), then X is not an ancestor of Y in any member of M(C).
Figure  3.5:   An  example  of  a  measurement  pattern.
A measurement pattern does not make any claims about the connections between latents. We show an example in Figure 3.5. By the definition of measurement pattern, this graph claims that nodes X_1 and X_4 do not have any hidden parent in common in any member of its equivalence class, which implies they do not have common hidden parents in the true unknown graph that generated the observable tetrad constraints. The same holds for any pair in {X_1, X_2} × {X_4, X_5, X_6, X_7}.
It is also the case that, by the measurement pattern shown in Figure 3.5, X_1 cannot be an ancestor of X_2 in the true graph; X_1 cannot be an ancestor of X_4, and so on for all pairs that are not linked by an undirected edge.
Still in this measurement pattern, X_1 and X_2 might have a common hidden parent in the true graph. X_3 and X_4 might have a common hidden parent, and so on. Also, X_4 might be an ancestor of X_7, and X_1 might be an ancestor of X_8. This does not mean, however, that this is actually the case. Later in this chapter we show an example of a graph that generates this pattern by the algorithm given in the next section.
3.3.2 An algorithm for finding measurement patterns
Assume for now that the population covariance matrix Σ is known¹. FindPattern, given in Table 3.1, is an algorithm to learn a measurement pattern. The first stage of FindPattern searches for subsets of C that will guarantee that two observed variables do not have any latent parents in common.
Let G be the latent variable graph for a linear latent variable model with a set of observed variables O. Let O' = {X_1, X_2, X_3, Y_1, Y_2, Y_3} ⊆ O be such that for all triplets {A, B, C}, {A, B} ⊆ O' and C ∈ O, we have ρ_AB ≠ 0 and ρ_AB.C ≠ 0. Let τ_{IJKL} represent the tetrad constraint σ_IJ σ_KL − σ_IK σ_JL = 0 and ¬τ_{IJKL} represent the complementary constraint σ_IJ σ_KL − σ_IK σ_JL ≠ 0. The following lemma is a formal description of the example given in Figure 3.3:
Lemma 3.6 (CS1 Test) If constraints {τ_{X1Y1X2X3}, τ_{X1Y1X3X2}, τ_{Y1X1Y2Y3}, τ_{Y1X1Y3Y2}, ¬τ_{X1X2Y2Y1}} all hold, then X_1 and Y_1 do not have a common parent in G.
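A population-level sketch of the CS1 test follows directly from the τ notation above. The function names and exact-equality tolerance are illustrative (with sample data each τ/¬τ judgment would come from a statistical test), S is a numpy covariance matrix, and the nonzero-correlation conditions on O' are assumed to have been checked separately.

def tau(S, i, j, k, l, tol=1e-8):
    """Tetrad constraint tau_{IJKL}: sigma_IJ * sigma_KL - sigma_IK * sigma_JL = 0."""
    return abs(S[i, j] * S[k, l] - S[i, k] * S[j, l]) < tol

def cs1_separates(S, x1, x2, x3, y1, y2, y3, tol=1e-8):
    """CS1: if these constraints hold, X1 and Y1 are judged not to share a common parent."""
    return (tau(S, x1, y1, x2, x3, tol) and tau(S, x1, y1, x3, x2, tol) and
            tau(S, y1, x1, y2, y3, tol) and tau(S, y1, x1, y3, y2, tol) and
            not tau(S, x1, x2, y2, y1, tol))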
These rules are illustrated in Figure 3.6. Notice that those rules are not redundant: only one can be applied in each situation. For CS2 (Figure 3.6(b)), nodes X and Y are depicted as auxiliary nodes that can be used to verify predicates F_1. For instance, F_1(X_1, X_2, G) is true because all three tetrads in the covariance matrix of {X_1, X_2, X_3, X} hold.
Sometimes it is possible to guarantee that a node is not an ancestor of another, as required, e.g., to apply CS2:
Lemma 3.9 If for some set O' = {X_1, X_2, X_3, X_4} ⊆ O, σ_{X1X2} σ_{X3X4} = σ_{X1X3} σ_{X2X4} = σ_{X1X4} σ_{X2X3} and for all triplets {A, B, C}, {A, B} ⊆ O', C ∈ O, we have ρ_AB.C ≠ 0 and ρ_AB ≠ 0, then no element A ∈ O' is a descendant of an element of O' \ {A} in G.
This lemma is a straightforward consequence of Lemma 3.4 and the assumption that no observed node is an ancestor of a latent node. For instance, in Figure 3.6(b) the existence of the observed node X (linked by a dashed edge to the parent of X_1) will allow us to infer that X_1 is not an ancestor of X_3, since all three tetrad constraints hold in the covariance matrix of {X, X_1, X_2, X_3}. Node Y plays a similar role with respect to Y_1 and Y_3.
Algorithm FindPattern has the following property:
Theorem 3.10 The output of FindPattern is a measurement pattern MP(C) with respect to the tetrad and zero/first order vanishing partial correlation constraints C of Σ.
Figure 3.7 illustrates an application of FindPattern². A full example of the algorithm is given in Figure 3.8.

² Notice we only make use of vanishing partial correlations where the size of the conditioning set is never greater than 1. We are motivated by problems where there is a strong belief that every pair of observed variables has at least one common hidden cause. Using higher order constraints would just lead to a higher possibility of committing statistical mistakes.
Figure 3.7: In (a), a model that generates a covariance matrix Σ. In (b), the output of FindPattern given Σ. Pairs in {X_1, X_2} × {X_4, . . . , X_7} are separated by CS2. Notice that the presence of an undirected edge does not mean that adjacent nodes in the pattern are actually adjacent in the true graph (e.g., X_3 and X_8 share a common parent in the true graph, but are not adjacent). Observed nodes adjacent in the output pattern always share at least one parent in the pattern, but only sometimes are they actually children of the same parent in the true graph (e.g., X_4 and X_7). Nodes sharing a common parent in the pattern might not share a parent in the true graph (e.g., X_1 and X_8).
3.3.3 Identifiability and purification
The  FindPattern  algorithm  is   sound,   but   not   necessarily  complete.   That   is,   there  might   be
graphical  features  shared  by  all  members  of  the  measurement  model  equivalence  class  that  are  not
discovered by FindPattern.   In general, a measurement pattern might not be informative enough,
and this is the motivation for discovering pure measurement models:   we would like to know in more
detail  how  the  latents  in  the  output  are  related  to  the  ones  in  the  true  graph.   This  is  essential   in
order to find a corresponding structural model.
The  output   of   FindPattern  cannot,   however,   reliably  be  turned  into  a  pure  measurement
model   in  the  obvious  way,   by  removing  from  it  all   nodes  that  have  more  than  one  latent  parent
and one of every pair of adjacent nodes, as attempted by the following algorithm:
   Algorithm  TrivialPurification:   remove  all  nodes  that  have  more  than  one  latent  parent,
and  for  every  pair  of  adjacent  observed  nodes,  remove  an  arbitrary  node  of  the  pair.
TrivialPurification is not correct. To see this, consider Figure 3.9(a), where, with the exception of pairs in {X_3, . . . , X_7}, every pair of nodes has more than one hidden common cause. Giving the covariance matrix of such a model to FindPattern will result in a pattern with one
latent  only  (because  no  pair  of   nodes  can  be  separated  by  CS1,   CS2  or  CS3),   and  all   pairs  that
are  connected  by  a  double  directed  edge  in  Figure  3.9(a)  will  be  connected  by  an  undirected  edge
in  the  output  pattern.   One  can  verify  that  if  we  remove  one  node  from  each  pair  connected  by  an
undirected  edge  in  this  pattern,   the  output  with  the  maximum  number  of  nodes  will   be  given  by
the  graph  in  Figure  3.9(b).
There is no clear relation between the latent in the pattern and the latents in the true graph. While it is true that all nodes in {X_3, . . . , X_7} have a latent common cause (the parent of X_4, X_5, X_6)
Figure 3.8: A step-by-step example of how a measurement pattern for the model given in (a) can be learned by FindPattern. Suppose we are given only the observed covariance matrix of the model in (a). We start with a fully connected graph among the observables (b), and remove some of the edges according to CS1-CS3. For instance, the edge X_1 − X_4 is removed by CS2 applied to the tuple {X_1, X_2, X_3, X_4, X_5, X_6}. This results in graph (c). In (d), we highlight the two different (and overlapping) maximal cliques found in this graph (edge X_3 − X_8 belongs to both cliques). The two cliques are transformed into two latents in (e). Finally, in (f) we add the required undirected edges (since, e.g., X_1 and X_8 are not part of any foursome where all three tetrad constraints hold).
Algorithm FindPattern
Input: a covariance matrix Σ
1. Start with a complete graph G over the observed variables.
2. Remove edges for pairs that are marginally uncorrelated or uncorrelated conditioned on a third variable.
3. For every pair of nodes linked by an edge in G, test if some rule CS1, CS2 or CS3 applies. Remove an edge between every pair corresponding to a rule that applies.
4. Let H be a graph with no edges and with nodes corresponding to the observed variables.
5. For each maximal clique in G, add a new latent to H and make it a parent to all corresponding nodes in the clique.
6. For each pair (A, B), if there is no other pair (C, D) such that σ_AC σ_BD = σ_AD σ_BC = σ_AB σ_CD, add an undirected edge A − B to H.
7. Return H.
Table 3.1: Returns a measurement pattern corresponding to the tetrad and first order vanishing partial correlations of Σ.
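The table above translates into code fairly directly. The sketch below assumes the population covariance matrix is given as a numpy array, implements only rule CS1 in Step 3 (CS2 and CS3 would be further edge-removal predicates of the same shape), and represents the output pattern H as a plain dictionary; these are illustrative simplifications, not the implementation evaluated later in this chapter.

import itertools
import numpy as np
import networkx as nx

TOL = 1e-8

def corr(S, i, j):
    return S[i, j] / np.sqrt(S[i, i] * S[j, j])

def partial_corr(S, i, j, k):
    """First-order partial correlation rho_ij.k."""
    rij, rik, rjk = corr(S, i, j), corr(S, i, k), corr(S, j, k)
    return (rij - rik * rjk) / np.sqrt((1.0 - rik ** 2) * (1.0 - rjk ** 2))

def tau(S, i, j, k, l):
    """Tetrad constraint tau_{IJKL}: sigma_IJ sigma_KL - sigma_IK sigma_JL = 0."""
    return abs(S[i, j] * S[k, l] - S[i, k] * S[j, l]) < TOL

def cs1(S, x1, x2, x3, y1, y2, y3):
    """Rule CS1 (Lemma 3.6); nonzero correlations are assumed to be checked beforehand."""
    return (tau(S, x1, y1, x2, x3) and tau(S, x1, y1, x3, x2) and
            tau(S, y1, x1, y2, y3) and tau(S, y1, x1, y3, y2) and
            not tau(S, x1, x2, y2, y1))

def find_pattern(S):
    n = S.shape[0]
    # Step 1: complete graph over the observed variables.
    G = nx.complete_graph(n)
    # Step 2: remove pairs uncorrelated marginally or conditioned on one other variable.
    for i, j in itertools.combinations(range(n), 2):
        others = [k for k in range(n) if k not in (i, j)]
        if abs(corr(S, i, j)) < TOL or any(abs(partial_corr(S, i, j, k)) < TOL for k in others):
            G.remove_edge(i, j)
    # Step 3: remove further edges using rule CS1 (CS2 and CS3 omitted in this sketch).
    for i, j in list(G.edges()):
        rest = [k for k in range(n) if k not in (i, j)]
        if any(cs1(S, i, a, b, j, c, d) or cs1(S, j, a, b, i, c, d)
               for a, b, c, d in itertools.permutations(rest, 4)):
            G.remove_edge(i, j)
    # Steps 4-5: one latent (cluster) per maximal clique of what is left.
    clusters = [sorted(c) for c in nx.find_cliques(G)]
    # Step 6: undirected edge A - B when no pair (C, D) makes all three tetrads hold
    # (checking two of the constraints is enough: the third follows from them).
    undirected = []
    for a, b in itertools.combinations(range(n), 2):
        rest = [k for k in range(n) if k not in (a, b)]
        if not any(tau(S, a, b, c, d) and tau(S, a, c, d, b)
                   for c, d in itertools.combinations(rest, 2)):
            undirected.append((a, b))
    return {"latents": clusters, "undirected": undirected}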
in the true graph, such observed nodes cannot be causally connected by a linear model as suggested by Figure 3.9(b). In that graph, all three tetrad constraints among {X_3, X_4, X_5, X_7} are entailed. This is not the case in the true graph.
Consider instead the algorithm BuildPureClusters of Table 3.2, which initially builds a measurement pattern using FindPattern. Variables are removed whenever some tetrad constraints are not satisfied, which corrects situations exemplified by Figure 3.9. Some extra adjustments concern clusters with proper subsets that are not consistently correlated to another variable (Steps 6 and 7) and a final merging of clusters (Step 8). We explain the necessity of these steps in Appendix A.1.
Notice that we leave out some details in the description of BuildPureClusters, i.e., there are several ways of performing the choices of nodes in Steps 2, 4, 5 and 9. We suggest an explicit way of performing these choices in Appendix A.3. There are two reasons why we present a partial description of the algorithm. The first is that, independently of how such choices are made, one can make several claims about the relationship between an output graph and the true measurement model. The graphical properties of the output of BuildPureClusters are summarized by the following theorem.
Theorem 3.11 Given a covariance matrix Σ assumed to be generated from a linear latent variable model G with observed variables O and latent variables L, let G_out be the output of BuildPureClusters(Σ) with observed variables O_out ⊆ O and latent variables L_out. Then G_out is a measurement pattern, and there is a unique injective mapping M : L_out → L with the following properties:
1. Let L_out ∈ L_out. Let X be a child of L_out in G_out. Then M(L_out) d-separates X from O_out \ {X} in G;
Algorithm BuildPureClusters
Input: a covariance matrix Σ
1. G ← FindPattern(Σ).
2. Choose a set of latents in G. Remove all other latents and all observed nodes that are not children of the remaining latents, and all clusters of size 1.
3. Remove all nodes that have more than one latent parent in G.
4. For all pairs of nodes linked by an undirected edge, choose one element of each pair to be removed.
5. If for some set of nodes {A, B, C}, all children of the same latent, there is a fourth node D in G such that σ_AB σ_CD = σ_AC σ_BD = σ_AD σ_BC is not true, remove one of these four nodes.
6. For every latent L with at least two children, {A, B}, if there is some node C in G such that σ_AC = 0 and σ_BC ≠ 0, split L into two latents L_1 and L_2, where L_1 becomes the only parent of all children of L that are correlated with C, and L_2 becomes the only parent of all children of L that are not correlated with C;
7. Remove any cluster with exactly 3 variables {X_1, X_2, X_3} such that there is no X_4 where all three tetrads in the covariance matrix X = {X_1, X_2, X_3, X_4} hold, all variables of X are correlated and no partial correlation of a pair of elements of X is zero conditioned on some observed variable;
8. While there is a pair of clusters with latents L_i and L_j, such that for all subsets {A, B, C, D} of the union of the children of L_i, L_j we have σ_AB σ_CD = σ_AC σ_BD = σ_AD σ_BC, and no marginal independence or conditional independence in sets of size 1 are observed in this cluster, set L_i = L_j (i.e., merge the clusters);
9. Again, verify all implied tetrad constraints and remove elements accordingly. Iterate with the previous step till no changes happen;
10. Remove all latents with less than three children, and their respective measures;
11. If G has at least four observed variables, return G. Otherwise, return an empty model.
Table 3.2: A general strategy to find a pure MP that is also a linear measurement model of a subset of the latents in the true graph. As explained in the body of the text, Steps 2, 4, 5 and 9 are not described algorithmically in this section.
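For concreteness, Step 5 above, the check that repairs the failure mode of TrivialPurification illustrated in Figure 3.9, can be sketched at the population level as follows; the helper names are ours, S is a numpy covariance matrix indexed by variable positions, and the choice of which node of an offending foursome to remove is left open, as in the table.

import itertools

def tetrads_hold(S, a, b, c, d, tol=1e-8):
    """All three tetrad constraints among {a, b, c, d} hold in the covariance matrix S."""
    t1 = S[a, b] * S[c, d] - S[a, c] * S[b, d]
    t2 = S[a, c] * S[b, d] - S[a, d] * S[b, c]
    return abs(t1) < tol and abs(t2) < tol   # the third constraint follows from these two

def offending_foursomes(S, cluster, all_vars):
    """Step 5 of BuildPureClusters: trios {A, B, C} of children of one latent, together
    with a fourth node D, for which the three tetrad constraints do not all hold.
    One node of each returned foursome has to be removed."""
    bad = []
    for a, b, c in itertools.combinations(cluster, 3):
        for d in all_vars:
            if d not in (a, b, c) and not tetrads_hold(S, a, b, c, d):
                bad.append((a, b, c, d))
    return bad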
Figure 3.9: In (a), a model that generates a covariance matrix Σ. The output of FindPattern given Σ contains a single latent variable that is a parent of all observed nodes. In (b), the pattern with the maximum number of nodes that can be obtained by removing one node from each adjacent pair of observed nodes. This model is incorrect, since there is no latent that d-separates all of the nodes in (b) in a linear model.
2. M(L_out) d-separates X from every latent L in G for which M⁻¹(L) is defined;
3. Let O' ⊆ O_out be such that each pair in O' is correlated. At most one element of O' is not a descendant of its respective mapped latent parent in G, or has a hidden common cause with it.
Informally, there is a labeling of latents in G_out according to the latents in G, and in this relabeled output graph any d-separation between a measured node and some other node will hold in the true graph, G. This is illustrated by Figure 3.10. Given the covariance matrix generated by the true model in Figure 3.10(a), BuildPureClusters generates the model shown in Figure 3.10(b). Since the labeling of the latents is arbitrary, Theorem 3.11 formalizes the claim that latents in the output correspond to latents in the true model up to a relabeling.
For each group of correlated observed variables, we can guarantee that at most one edge from a latent into an observed variable is incorrectly directed. By incorrectly directed, we mean the condition defined in the third item of Theorem 3.11: although all observed variables are children of latents in the output graph, one of these edges might be misleading, since in the true graph one
of  the  observed  variables  might  not  be  a  descendant  of  the  respective  latent.   This  is  illustrated  by
Figure  3.11.
Notice also that we cannot guarantee that an observed node X with latent parent L_out in G_out will be d-separated from the latents in G not in G_out, given M(L_out): if X has a common cause with M(L_out), then X will be d-connected to any ancestor of M(L_out) in G given M(L_out). This is also illustrated by Figure 3.11.
Let a DAG G be an I-map of a distribution D if and only if all independencies entailed in G by the Markov condition also hold in D (the faithfulness condition explained in Chapter 1 includes the converse) (Pearl, 1988). Using the notation from the previous theorem, the parametric properties of the output of BuildPureClusters are described as follows:
Figure 3.10: Given as input the covariance matrix of the observable variables X_1, . . . , X_12, connected according to the true model shown in Figure (a), the BuildPureClusters algorithm will generate the graph shown in Figure (b). It is clear there is an injective mapping M(.) from latents {T_1, T_2} to latents {L_1, L_2} such that M(T_1) = L_1 and M(T_2) = L_2, and the properties described by Theorem 3.11 hold.
Theorem 3.12 Let M(L_out) ⊆ L be the set of latents in G obtained by the mapping function M(). Let Σ_{O_out} be the population covariance matrix of O_out. Let the DAG G_out^aug be G_out augmented by connecting the elements of L_out such that the structural model of G_out^aug is an I-map of the distribution of M(L_out). Then there exists a linear latent variable model using G_out^aug as the graphical structure such that the implied covariance matrix of O_out equals Σ_{O_out}.
This result is essential to provide an algorithm that is guaranteed to find a Markov equivalence class for the latents in M(L_out) using the output of BuildPureClusters as a starting point.
The second reason why we do not provide details of some steps of BuildPureClusters at this point is that there is no unique way of implementing them. Different purifications might be of interest. For instance, one might be interested in the pure model that has the largest possible number of latents. Another one might be interested in the model with the largest number of observed variables. However, some of these criteria might be computationally intractable to achieve.
Consider for instance the following criterion, which we denote as MP3: given a measurement pattern, decide if there is some choice of nodes to be removed such that the resulting graph is a pure measurement model and each latent has at least three children. This problem is intractable:
Theorem 3.13 Problem MP3 is NP-complete.
By presenting the high-level description of BuildPureClusters as in Table 3.2, we show that there is no need to solve an NP-hard problem in order to have the same theoretical guarantees of interpretability of the output. For example, there is a stage in FindPattern where it appears necessary to find all maximal cliques, but, in fact, it is not. Identifying more cliques increases the chance of having a larger output (which is good) by the end of the algorithm, but it is not required for the algorithm's correctness. Stopping at Step 5 of FindPattern after a given amount of time will not affect Theorems 3.11 or 3.12.
Another computational concern is the O(N^5) loop in Step 3 of FindPattern, N being the number of observed variables. However, it is not necessary to compute this loop entirely. One can stop Step 3 at any time at the price of losing information, but not the theoretical guarantees of BuildPureClusters. This anytime property is summarized by the following corollary:
Figure 3.11: Given as input the covariance matrix of the observable variables X_1, . . . , X_7, connected according to the true model shown in Figure (a), one of the possible outputs of the BuildPureClusters algorithm is the graph shown in Figure (b). It is clear there is an injective mapping M(.) from latents {T_1, T_2} to latents {L_1, L_2, L_3, L_4} such that M(T_1) = L_2 and M(T_2) = L_3. However, in (b) the edge T_1 → X_1 does not express the correct causal direction of the true model. Notice also that X_1 is not d-separated from L_4 given M(T_1) = L_2 in the true graph.
Corollary  3.14  The  output  of  BuildPureClusters retains  its  guarantees  even  when  rules  CS1,
CS2  and  CS3  are  applied  an  arbitrary  number  of  times  in  FindPattern  for  any  arbitrary  subset
of  nodes  and  an  arbitrary  number  of  maximal   cliques  is  found.
3.3.4   Example
In this section, we illustrate how BuildPureClusters works given the population covariance matrix of a known latent variable model. Suppose the true graph is the one given in Figure 3.12(a), with two unlabeled latents and 12 observed variables. This graph is unknown to BuildPureClusters, which is given only the covariance matrix of the variables {X_1, X_2, ..., X_12}. The task is to learn a measurement pattern, and then a purified measurement model.
In the first stage of BuildPureClusters, the FindPattern algorithm, we start with a fully connected graph among the observed variables (Figure 3.12(b)), and then proceed to remove edges according to rules CS1, CS2 and CS3, giving the graph shown in Figure 3.12(c). There are two maximal cliques in this graph: {X_1, X_2, X_3, X_7, X_8, X_11, X_12} and {X_4, X_5, X_6, X_8, X_9, X_10, X_12}. They are distinguished in the figure by different edge representations (dashed and solid, with the edge X_8 − X_12 present in both cliques). The next stage takes these maximal cliques and creates an intermediate graphical representation, as depicted in Figure 3.12(d). In Figure 3.12(e), we add the undirected edges X_7 − X_8, X_8 − X_12, X_9 − X_10 and X_11 − X_12, finalizing the measurement pattern returned by FindPattern. Finally, Figure 3.12(f) represents a possible purified output of BuildPureClusters given this pattern. Another purification with as many nodes as in the graph in Figure 3.12(f) substitutes node X_9 for node X_10.
3.4   Learning  the  structure  of  the  unobserved
Even  given  a  correct  measurement  model,   it  might  not  be  possible  to  identify  the  corresponding
structural  model.   Consider  the  case  of  factor  analysis  again,  applied  to  multivariate  normal  mod-
els. In Figure 1.6 we depicted two graphs that are both able to represent the same set of normal distributions.
One might argue that this is an artifact of the Gaussian distribution, and that identifiability could be improved by assuming distributions other than the normal for the given variables. However,
Figure 3.12: A step-by-step demonstration of how a covariance matrix generated by the graph in Figure (a) will induce the pure measurement model in Figure (f).
for linear models, Gaussian distributions are an important case that cannot be ignored. Moreover, it is difficult to design identification criteria that are both computationally feasible (e.g., avoiding the minimization of a KL-divergence) and statistically realistic (how much fine-grained information about the distribution, such as high-order moments, could be reliably used in model selection?).
We take an approach that we believe to be much more useful in practice: to guarantee identifiability of the structural model by constraining the acceptable measurement models used as input, and to do so without requiring high-order moments. We will from now on assume the following condition
for our algorithm:
• the given measurement model has a pure measurement submodel with at least two measures per latent.
Notice that this does not mean that the given measurement model has to be pure, but only that a subset of it³ has to be pure. The intuition behind the suitability of this assumption is as follows: in pure measurement models, d-separation among latents entails d-separation among pure observed measures, and that has immediate consequences on the rank of the covariance matrix of the d-separated observed variables.
3.4.1   Identifying  conditional   independences  among  latent  variables
The  following  theorem  is  due  to  Spirtes  et  al.  (2000):
Theorem 3.15 Let G be a pure linear latent variable model. Let L_1, L_2 be two latents in G, and Q a set of latents in G. Let X_1 be a measure of L_1, X_2 be a measure of L_2, and X_Q be a set of measures of Q containing at least two measures per latent. Then L_1 is d-separated from L_2 given Q in G if and only if the rank of the correlation matrix of {X_1, X_2} ∪ X_Q is less than or equal to |Q| with probability 1 with respect to the Lebesgue measure over the linear coefficients and error variances of G.
We can then use this constraint to identify⁴ conditional independencies among latents provided we have the correct pure measures.
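A population-level sketch of this rank condition, using a plain numerical rank with a tolerance, is given below; with sample data one would instead fit a factor analysis model, as suggested in the footnote. The function and argument names are illustrative.

import numpy as np

def latents_independent(R, x1, x2, XQ, size_Q, tol=1e-8):
    """Population version of the rank test in Theorem 3.15: L1 and L2 are d-separated
    given Q iff the rank of the correlation matrix of {X1, X2} together with the
    measures X_Q (at least two per latent in Q) is at most |Q|.
    R is the correlation matrix of the observed variables; indices are positions in R."""
    idx = [x1, x2] + list(XQ)
    sub = R[np.ix_(idx, idx)]
    return np.linalg.matrix_rank(sub, tol=tol) <= size_Q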
3.4.2   Constraint-satisfaction  algorithms
Given  Theorem  3.15,  conditional  independence  tests  can  then  be  used  as  an  oracle  for  constraint-
satisfaction   techniques   for   causality   discovery   in  graphical   models,   such  as   the   PC  algorithm
(Spirtes   et   al.,   2000)   which  assumes  the  variables   being  tested  to  have  no  unmeasured  hidden
common  causes  (i.e.,   in  our   case,   no  pair   of   latents   in  our   system  can  have  another   latent   as
a  common  cause   that   is   not   measured  by  some  observed  variable).   An  alternative   is   the   FCI
algorithm (Spirtes et al., 2000), which makes no such assumption.
We define the algorithm PC-MIMBuild⁵ as the algorithm that takes as input a measurement model satisfying the assumption of purity mentioned above and a covariance matrix, and returns the Markov equivalence class of the structural model among the latents in the measurement model according to the PC algorithm. An FCI-MIMBuild algorithm is defined analogously. In the limit of infinite data, the following result follows from Theorems 3.11 and 3.15 and the consistency of the PC and FCI algorithms (Spirtes et al., 2000):
³ The definition of measurement submodel has to preserve all ancestral relationships. So, if measure X is not a parent of Y, but a chain X → K → Y exists, any submodel that includes X and Y but not K has to include the edge X → Y.
⁴ One way to test if the rank of a covariance matrix in Gaussian models is at most q is to fit a factor analysis model with q latents and assess its significance (Bartholomew and Knott, 1999).
⁵ MIM stands for "multiple indicator model", a term in the structural equation model literature describing latent variable models with multiple measures per latent.
Corollary 3.16 Given a covariance matrix Σ assumed to be generated from a linear latent variable model G, and G_out the output of BuildPureClusters given Σ, the output of PC-MIMBuild or FCI-MIMBuild given (Σ, G_out) returns the correct Markov equivalence class of the latents in G corresponding to latents in G_out according to the mapping implicit in BuildPureClusters.
An example of the PC algorithm in action is given in Chapter 1. Exactly the same procedure could be applied to a graph consisting of latent variables, as illustrated in Figure 3.13. This example corresponds to the one given in Figure 1.5.
3.4.3   Score-based  algorithms
Given Theorem 3.15, conditional independence constraints can then be used as search operators for score-based techniques for causality discovery in graphical models. Score-based approaches for learning the structure of Bayesian networks, such as GES (Meek, 1997), are usually more robust to variability in small samples than PC or FCI. If one is willing to assume that there are no extra hidden common causes connecting variables in the causal system, then GES should be a more robust choice than the PC algorithm.
We  know  of   no  consistent   score  function  for   linear   latent   variable  models  that  can  be  easily
computed.   As  a  heuristic,   we  suggest  using  the  Bayesian  Information  Criterion  (BIC)  function.
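For concreteness, the BIC score we have in mind for a zero-mean Gaussian model with implied covariance matrix Σ(θ), sample covariance S, sample size n and k free parameters is the usual −2 log-likelihood plus k ln n penalty. A minimal sketch (our notation, not the actual implementation used in the experiments; lower is better under this sign convention):

import numpy as np

def gaussian_bic(S, Sigma_model, n, num_params):
    """BIC of a zero-mean Gaussian model with implied covariance Sigma_model,
    evaluated against the sample covariance S computed from n observations."""
    p = S.shape[0]
    inv = np.linalg.inv(Sigma_model)
    loglik = -0.5 * n * (p * np.log(2.0 * np.pi)
                         + np.log(np.linalg.det(Sigma_model))
                         + np.trace(inv @ S))
    return -2.0 * loglik + num_params * np.log(n)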
Using BIC along with Structural EM (Friedman, 1998) and GES results in a very computationally efficient way of learning structural models, where the measurement model is fixed and GES is restricted to modifying edges among latents only. Assuming a Gaussian distribution, the first step of Structural EM uses a fully connected structural model in order to estimate the first expected latent covariance matrix. We call this algorithm GES-MIMBuild and use it as the structural model search component in all the algorithms we now compare.
3.5   Evaluation
We evaluate our algorithm on simulated and real data. In the simulation studies, we draw samples of three different sizes from 9 different latent variable models involving three different structural models and three different measurement models. We then consider the algorithms' performance on three empirical datasets: one involving stress, depression, and spirituality; one concerning the attitude of single mothers with respect to their children; and one involving test anxiety, previously analyzed with factor analysis in (Bartholomew et al., 2002).
3.5.1   Simulation  studies
We compare our algorithm against two versions of exploratory factor analysis, and measure the success of each on the following discovery problems, as previously defined:
DP1. Discover the number of latents in G.
DP2. Discover which observed variables measure each latent in G.
DP3. Discover causal structure among the latents in G.
Figure  3.13:   A  step-by-step  demonstration  of   the  PC-MIMBuild  algorithm.   The  true  model   is
given  in  (a).   We  start  with  a  full  undirected  graph  among  latents  (b)  and  remove  edges  according
to  the  independence  tests  described  in  Section  3.4.1,  obtaining  graph  (c).   By  orienting  unshielded
colliders,   we  get  graph  (d).   Extra  steps  of   orientation  will   recreate  the  true  graph.   An  identical
example  of  the  PC  algorithm  for  the  case  where  the  variables  of  interest  are  observed  is  given  in
Figure  1.5.
Since  factor   analysis   addresses   only  tasks   DP1  and  DP2,   we  compare  it   directly  to  Build-
PureClusters on  DP1  and  DP2.   For  DP3,  we  use our  procedure and  factor  analysis  to  compute
measurement  models,  then  discover  as  much  about  the  features  of  the  structural  model among  the
latents  as  possible  by  applying  GES-MIMBuild  to  the  measurement  models  output  by  BPC  and
factor  analysis.
We hypothesized that three features of the problem would affect the performance of the algorithms compared. First, the sample size should be important. Second, the complexity of the
Figure 3.14: The structural models (SM1, SM2, SM3) and measurement models (MM1, MM2, MM3) used in our simulation studies. When combining the 4-latent structural model SM3 with any measurement model, we add edges out of the fourth latent respecting the pattern used in the measurement model.
structural  model  might  matter,   and  third,  the  complexity  and  level   of  impurity  in  the  generating
measurement model might matter. We used three different sample sizes for each study: 200, 1,000,
and 10,000.   We constructed nine generating latent variable graphs by using all combinations of the
three  structural  models  and  three  measurement  models  we  show  in  Figure  3.14.
MM1 is a pure measurement model with three indicators per latent. MM2 has five indicators per latent, one of which is impure because its error is correlated with another indicator, and another because it measures two latents directly. MM3 involves six indicators per latent, half of which are impure. Thus the level of impurity increases from MM1 to MM3.
SM1 entails one unconditional independence among the latents: L1 is independent of L3. SM2 entails one first order conditional independence: L1 ⊥ L3 | L2, and SM3 entails one first order conditional independence: L2 ⊥ L3 | L1, and one second order conditional independence relation: L1 ⊥ L4 | {L2, L3}. Thus the statistical complexity of the structural models increases from SM1 to SM3.
Clearly  any  discovery  procedure  ought  to  be  able  to  do  very  well   on  samples  of  10,000  drawn
from  a  generating  model  involving  SM1  and  MM1.   Not  as  clear  is  how  well  a  procedure can  do  on
samples  of  size  200  drawn  from  a  generating  model  involving  SM3  and  MM3.
Generating  Samples
For each generating latent variable graph, we used the Tetrad IV program⁶ with the following procedure (sketched in code after the list) to draw 10 multivariate normal samples of size 200, 10 at size 1,000, and 10 at size 10,000.
1. Pick coefficients for each edge in the model randomly from the interval [−1.5, −0.5] ∪ [0.5, 1.5].
⁶ Available at http://www.phil.cmu.edu/projects/tetrad.
2.   Pick  variances  for  the  exogenous  nodes  (i.e.,   latents  without  parents  and  error  nodes)  from
the  interval  [1, 3].
3.   Draw  one  pseudo-random  sample  of  size  N.
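A minimal sketch of this sampling scheme follows; it is not the Tetrad IV implementation, and the graph used in the example is only one structure consistent with the descriptions of SM1 and MM1 given above.

import numpy as np

rng = np.random.default_rng(0)

def draw_sample(nodes, edges, n):
    """Draw n rows from a linear model in which each node is a linear function of its
    parents plus independent Gaussian noise.  Edge coefficients come from
    [-1.5, -0.5] U [0.5, 1.5] and exogenous/error variances from [1, 3]."""
    coef = {e: rng.uniform(0.5, 1.5) * rng.choice([-1.0, 1.0]) for e in edges}
    var = {v: rng.uniform(1.0, 3.0) for v in nodes}
    data = {}
    for v in nodes:                     # nodes must be listed in topological order
        value = rng.normal(0.0, np.sqrt(var[v]), size=n)
        for (parent, child), w in coef.items():
            if child == v:
                value = value + w * data[parent]
        data[v] = value
    return data

# One structure consistent with SM1 (L1 marginally independent of L3) and MM1
# (three pure indicators per latent); this particular layout is an assumption.
latents = ["L1", "L3", "L2"]            # topological order for L1 -> L2 <- L3
measures = {"L1": ["X1", "X2", "X3"], "L2": ["X4", "X5", "X6"], "L3": ["X7", "X8", "X9"]}
edges = [("L1", "L2"), ("L3", "L2")] + [(l, x) for l, xs in measures.items() for x in xs]
observed = [x for xs in measures.values() for x in xs]
sample = draw_sample(latents + observed, edges, n=1000)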
Algorithms  Studied
We  used  three  algorithms  in  our  studies:
1.   BPC:  BuildPureClusters  +  GES-MIMBuild
2.   FA:  factor  analysis  +  GES-MIMBuild
3.   P-FA:  factor  analysis  +  Purify  +  GES-MIMBuild
BPC is the implementation of BuildPureClusters and GES-MIMBuild described in Appendix A.3. FA involves combining standard factor analysis to find the measurement model with GES-MIMBuild to find the structural model. For standard factor analysis, we used factanal from R 1.9 with the oblique rotation promax. FA and variations are still widely used and are perhaps the most popular approach to latent variable modeling (Bartholomew et al., 2002). We choose the number of latents by iteratively increasing it until we get a significant fit above 0.05, or until we have to stop due to numerical instabilities⁷.
Factor analysis is not directly comparable to BuildPureClusters since it does not generate pure models only. We extend our comparison of BPC and FA by including a version of factor analysis with a post-processing step to purify the output of factor analysis. Purified Factor Analysis, or P-FA, takes the measurement model output by factor analysis and proceeds as follows: 1. for each latent with two children only, remove the child that has the highest number of parents; 2. remove all latents with one child only, unless this latent is the only parent of its child; 3. remove all indicators that load significantly on more than one latent. The measurement model output by P-FA typically contains far fewer latent variables than the measurement model output by FA.
Success on finding latents and a good measurement model
In order to compare the output of BPC, FA, and P-FA on discovery tasks DP1 (finding the correct number of underlying latents) and DP2 (measuring these latents appropriately), we must map the latents discovered by each algorithm to the latents in the generating model. That is, we must define a mapping of the latents in G_out to those in the true graph G. Although one could do this in many ways, for simplicity we used a majority voting rule in BPC and P-FA. If a majority of the indicators of a latent L_i^out in G_out are measures of a latent node L_j in G, then we map L_i^out to L_j. Ties were in fact rare, and broken randomly. In this case, the latent that did not get the new label keeps a random label unrelated to latents in G. At most one latent in G_out is mapped to a fixed latent L in G, and if a latent in G had no majority in G_out, it was not represented in G_out.
The mapping for FA was done slightly differently. Because the output of FA is typically an extremely impure measurement model with many indicators loading on more than one latent, the simple-minded majority method generates too many ties. For FA we do the mapping not by
⁷ That is, where Heywood cases (Bartholomew and Knott, 1999) happened during fitting for 20 random re-starts. In this case, we just used the previous number of latents where Heywood cases did not happen.
majority voting of indicators according to their true clusters, but by verifying which true latent corresponds to the highest sum of absolute values of factor loadings for a given output latent. For example, let L_out be a latent node in G_out. Suppose S_1 is the sum of the absolute values of the loadings of L_out on measures of the true latent L_1 only, and S_2 is the sum of the absolute values of the loadings of L_out on measures of the true latent L_2 only. If S_2 > S_1, we rename L_out as L_2. If two output latents are mapped to the same true latent, we label only one of them as the true latent by choosing the one that corresponds to the highest sum of absolute loadings. The remaining latent receives a random label.
We compute the following scores for the output model G_out from each algorithm, where the true graph is labelled G_I, and where G is a purification of G_I (a sketch of these scores as simple set ratios is given in the code after the list):
• latent omission, the number of latents in G that do not appear in G_out divided by the total number of true latents in G;
• latent commission, the number of latents in G_out that could not be mapped to a latent in G divided by the total number of true latents in G;
• misclustered indicators, the number of observed variables in G_out that end up in the wrong cluster divided by the number of observed variables in G;
• indicator omission, the number of observed variables in G that do not appear in G_out divided by the total number of observed variables in G;
• indicator commission, the number of observed nodes in G_out that are not in G divided by the number of nodes in G_I that are not in G. These are nodes that introduce impurities in the output model.
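These scores reduce to a handful of set ratios once the latent mapping is fixed. A minimal sketch, under hypothetical data-structure conventions (each model is a dictionary from latent name to the list of its measured children), is:

def measurement_scores(G_pure, impure_nodes, G_out, mapping):
    """G_pure: the purified gold standard G, as {latent: [observed children]}.
    impure_nodes: observed nodes of the true graph G_I that are not in G.
    G_out: the output model, in the same format.
    mapping: latent in G_out -> latent in G (a missing key means 'unmapped')."""
    true_latents = set(G_pure)
    true_obs = {x for xs in G_pure.values() for x in xs}
    out_obs = {x for xs in G_out.values() for x in xs}

    mapped = {mapping[l] for l in G_out if l in mapping}
    latent_omission = len(true_latents - mapped) / len(true_latents)
    latent_commission = sum(l not in mapping for l in G_out) / len(true_latents)

    wrong = sum(x in true_obs and x not in G_pure.get(mapping.get(l), [])
                for l, xs in G_out.items() for x in xs)
    misclustered_indicators = wrong / len(true_obs)

    indicator_omission = len(true_obs - out_obs) / len(true_obs)
    indicator_commission = len(out_obs - true_obs) / max(len(impure_nodes), 1)
    return (latent_omission, latent_commission, misclustered_indicators,
            indicator_omission, indicator_commission)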
To be generous to factor analysis, we considered in FA outputs only latents with at least three indicators⁸. Again, to be conservative, we calculate the misclustered indicators error in the
same way as in BuildPureClusters or P-FA, but here an indicator is not counted as mistakenly
clustered  if  it  is  a  child  of  the  correct  latent,  even  if  it  is  also  a  child  of  a  wrong  latent.
Simulation results are given in Tables 3.3 and 3.4, where each number is the average error across 10 trials with standard deviations in parentheses. Notice there are at most two maximal pure measurement models for each setup (there are two possible choices of which measures to remove from the last latent in MM2 and MM3), and for each G_out we choose our gold standard G as a maximal pure measurement submodel that contains the largest number of nodes found in G_out. Each result is an average over 10 experiments with different parameter values randomly selected for each instance and three different sample sizes (200, 1000 and 10000 cases).
Table 3.3 evaluates all three procedures on the first two discovery tasks: DP1 and DP2. As predicted, all three procedures had very low error rates in rows involving MM1 and sample sizes of 10,000. In general, FA has very low rates of latent omission, but very high rates of latent commission, and P-FA, not surprisingly, does the opposite: very high rates of latent omission but very low rates of commission. In particular, FA is very sensitive to the purity of the generating measurement model. With MM2, the rate of latent commission for FA was moderate; with MM3 it was disastrous. BPC does reasonably well on all measures in Table 3.3 at all sample sizes and for all generating models.
^8 Even with this help, we still found several cases in which latent commission errors were more than 100%, indicating that there were more spurious latents in the output graphs than latents in the true graph.
Table 3.4 gives results regarding indicator omission and commission, which, because FA keeps the original set of indicators it is given, only make sense for BPC and P-FA. P-FA omits far too many indicators, a behavior that we hypothesize will make it difficult for GES-MIMBuild to do well on the measurement model output by P-FA.
Success on finding the structural model

As we have said from the outset, the real goal of our work is not only to discover the latent variables that underlie a set of measures, but also the causal relations among them. In the final piece of the simulation study, we applied the best causal model search algorithm we know of, GES, modified for this purpose as GES-MIMbuild, to the measurement models output by BPC, FA, and P-FA.
If  the  output  measurement  model  has  no  errors  of  latent  omission  or  commission,  then  scoring
the  result  of   the  structural   model   search  is  fairly  easy.   The  GES-MIMbuild  search  outputs  an
equivalence  class,   with  certain  adjacencies   unoriented  and  certain  adjacencies   oriented.   If   there
is  an  adjacency  of  any  sort  between  two  latents  in  the  output,   but  no  such  adjacency  in  the  true
graph,   then  we  have  an  error  of   edge  commission.   If   there  is  no  adjacency  of   any  sort  between
two  latents  in  the  output,   but  there  is  an  edge  in  the  true  graph,   then  we  have  an  error  of   edge
omission.   For   orientation,   if   there  is  an  oriented  edge  in  the  output  that   is  not  oriented  in  the
equivalence  class  for  the  true  structural   model,   then  we  have  an  error  of   orientation  commission.
If  there  is  an  unoriented  edge  in  the  output  which  is  oriented  in  the  equivalence  class  for  the  true
model,  we  have  an  error  of  orientation  omission.
If the output measurement model has any errors of latent commission, then we simply leave out the committed latents in the measurement model given to GES-MIMbuild. This helps FA primarily, as it was the only procedure of the three that had high errors of latent commission.
If the output measurement model has errors of latent omission, then we compare the output structural model equivalence class to the marginal of the true structural model over the latents that appear in the output model. For each of the structural models we selected, SM1, SM2, and SM3, all marginals can be represented faithfully as DAGs. Our measure of successful causal discovery for a measurement model involving a small subset of the latents in the true graph is therefore very lenient. For example, if the generating model was SM3, which involves four latents, but the output measurement model involved only two of these latents, then a perfect search result in this case would amount to finding that the two latents are associated. Thus, this method of scoring favors P-FA, which tends to omit latents.
In summary then, our measures for assessing the ability of these algorithms to correctly discover at least features of the causal relationships among the latents are as follows:

- edge omission (EO), the number of edges in the structural model of G that do not appear in G_out divided by the possible number of edge omissions (2 in SM_1 and SM_2, and 4 in SM_3, i.e., the number of edges in the respective structural models);
- edge commission (EC), the number of edges in the structural model of G_out that do not exist in G divided by the possible number of edge commissions (only 1 in SM_1 and SM_2, and 2 in SM_3);
- orientation omission (OO), the number of arrows in the structural model of G that do not appear in G_out divided by the possible number of orientation omissions in G (2 in SM_1 and SM_3, 0 in SM_2);
- orientation commission (OC), the number of arrows in the structural model of G_out that do not exist in G divided by the number of edges in the structural model of G_out.
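These four rates can likewise be computed with set operations. In the sketch below, adjacencies are unordered pairs and orientations are ordered pairs, and the normalizing counts are passed in explicitly so that they can match the per-model values listed above; the function and argument names are only illustrative.

    def structural_error_rates(true_adj, out_adj, true_arrows, out_arrows,
                               possible_edge_commissions, possible_orient_omissions):
        """true_adj/out_adj: sets of frozenset({A, B}); true_arrows/out_arrows: sets of (tail, head)."""
        eo = len(true_adj - out_adj) / len(true_adj)
        ec = len(out_adj - true_adj) / possible_edge_commissions
        oo = (len(true_arrows - out_arrows) / possible_orient_omissions
              if possible_orient_omissions else 0.0)
        oc = len(out_arrows - true_arrows) / max(len(out_adj), 1)
        return {"EO": eo, "EC": ec, "OO": oo, "OC": oc}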
We have bent over, not quite backwards, to favor variations of factor analysis. Tables 3.5 and 3.6 summarize the results. Along with each average we provide the number of trials where no errors of a specific type were made. Although it is clear from Tables 3.5 and 3.6 that factor analysis works well when the true models are pure, in general factor analysis commits far more errors of edge commission, since the presence of several spurious latents creates spurious dependence paths. As a consequence, several orientation omissions follow. Under the same statistics, P-FA seems to work better than FA, but this is an artifact of P-FA having fewer latents on average than the other methods.
Figures 3.15 and 3.16 illustrate. Each picture contains a plot of the average edge error of each algorithm (i.e., the average of all four error statistics from Tables 3.5 and 3.6) with several points per algorithm representing different sample sizes or different measurement models, and is evaluated for a specific combination involving structural model SM_2. The pattern for the other two simulated structural models is similar.
The optimal performance is at the bottom left. It is clear that P-FA achieves relatively high accuracy solely because of its high percentage of latent omissions. This pattern is similar across all structural models. Notice that FA is quite competitive when the true model is pure. BuildPureClusters tends to get lower latent omission error with the more complex measurement models (Figure 3.15) because the higher number of pure indicators in those situations helps the algorithm to identify each latent.

In summary, factor analysis provides little useful information from the given datasets. In contrast, the combination of BuildPureClusters and GES-MIMBuild largely succeeds in this difficult task, even at small sample sizes.
3.5.2   Real-world  applications
We now discuss results obtained in three different domains in the social sciences and psychology. Even though data collected from such domains (usually through questionnaires) may pose significant problems for exploratory data analysis, since samples are usually small and noisy, they have a very useful property for our empirical evaluation: questionnaires are designed to target specific latent factors (such as "stress", "job satisfaction", and so on), and a theoretical measurement model is developed by experts in the area to measure the desired latent variables, thus providing a basis for comparison with the output of our algorithm. The chance that various observed variables are not pure measures of their theoretical latents is high. Measures are usually discrete, but often ordinal on a Likert scale that can be treated as normally distributed measures with little loss (Bollen, 1989).
The  theoretical   models  contain  very  few  latents,   and  therefore  are  not   as  useful   to  evaluate
MIMBuild  as  they  are  to  BuildPureClusters.
Student  anxiety  factors:   A  survey  of  test  anxiety  indicators  was  administered  to  335  grade  12
male  students  in  British  Columbia.   The  survey  consisted  of  20  measures  on  symptoms  of  anxiety
[Figure 3.15 plots: three panels, SM_2 + MM_1, SM_2 + MM_2 and SM_2 + MM_3; x-axis: latent omission; y-axis: edge error; methods BPC, FA and P-FA, with points labelled 1, 2 and 3 for sample sizes 200, 1000 and 10000.]

Figure 3.15: Comparisons of methods on measurement models of increasing complexity (from MM_1 to MM_3). While BPC tends to have low error on both dimensions (latent omission and edge error), the other two methods fail on either one.
[Figure 3.16 plots: three panels, SM_2 + sample size 200, SM_2 + sample size 1000 and SM_2 + sample size 10000; x-axis: latent omission; y-axis: edge error; methods BPC, FA and P-FA.]
Figure 3.16:   Comparisons of methods on  increasing sample sizes.   BPC  has low  error even  at small
sample  sizes,  while  the  other  two  methods show  an  apparent  bias  that  does  not  go  away  with  very
large  sample  size.
[Figure 3.17 diagram: two latents, Emotionality with indicators x_2, x_8, x_9, x_10, x_15, x_16, x_18, and Worry with indicators x_3, x_4, x_5, x_6, x_7, x_14, x_17, x_20.]
Figure  3.17:   A  theoretical  model  for  psychological  factors  of  test  anxiety.
under test conditions. The covariance matrix as well as a description of the variables is given by Bartholomew et al. (2002).
Using exploratory factor analysis, Bartholomew concluded that two latent common causes underlie the variables in this data set, agreeing with previous studies. The original study identified items x_2, x_8, x_9, x_10, x_15, x_16, x_18 as indicators of an emotionality latent factor (this includes physiological symptoms such as feeling jittery and a faster heart beat), and items x_3, x_4, x_5, x_6, x_7, x_14, x_17, x_20
Lemma 4.2 Let O′ = {A, B, C, D} ⊆ O. If σ_{AB}σ_{CD} = σ_{AC}σ_{BD} = σ_{AD}σ_{BC} such that, for all triplets {X, Y, Z}, X, Y ∈ O′, Z ∈ O, we have ρ_{XY.Z} ≠ 0 and ρ_{XY} ≠ 0, then no element X ∈ O′ is an ancestor of any element in O′\{X} in G.
Notice  that this result allows  us to identify the non-existence  of several  ancestral  relations  even
when  no  conditional   independences  are  observed  and  latents   are  non-linearly  related.   A  second
way of learning such a relation is as follows: let G(O) be a latent variable graph and {A, B} be two elements of O. Let the predicate Factor_1(A, B, G) be true if and only if there exists a set {C, D} ⊆ O such that the conditions of Lemma 4.2 are satisfied for O′ = {A, B, C, D}, i.e., σ_{AB}σ_{CD} = σ_{AC}σ_{BD} = σ_{AD}σ_{BC} with the corresponding partial correlation constraints. The second approach for detecting the lack of ancestral relations between two observed variables is given by the following lemma:
Lemma 4.3 For any set O′ = {X_1, X_2, Y_1, Y_2} ⊆ O, if Factor_1(X_1, X_2, G) = true, Factor_1(Y_1, Y_2, G) = true, σ_{X_1 Y_1}σ_{X_2 Y_2} = σ_{X_1 Y_2}σ_{X_2 Y_1}, and all elements of {X_1, X_2, Y_1, Y_2} are correlated, then no element in {X_1, X_2} is an ancestor of any element in {Y_1, Y_2} in G, and vice-versa.
We define the predicate Factor_2(A, B, G) to be true if and only if it is possible to learn that A is not an ancestor of B in the unknown graph G that contains these nodes by using Lemma 4.3.
We now describe two ways of detecting if two observed variables have no (hidden) common parent in G(O). Let first {X_1, X_2, X_3, Y_1, Y_2, Y_3} ⊆ O. The following identification conditions are sound:
CS1. If σ_{X_1 Y_1}σ_{X_2 X_3} = σ_{X_1 X_2}σ_{X_3 Y_1} = σ_{X_1 X_3}σ_{X_2 Y_1}, σ_{X_1 Y_1}σ_{Y_2 Y_3} = σ_{X_1 Y_2}σ_{Y_1 Y_3} = σ_{X_1 Y_3}σ_{Y_1 Y_2}, σ_{X_1 X_2}σ_{Y_1 Y_2} ≠ σ_{X_1 Y_2}σ_{X_2 Y_1}, and for all triplets {X, Y, Z}, X, Y ∈ {X_1, X_2, X_3, Y_1, Y_2, Y_3}, Z ∈ O, we have ρ_{XY} ≠ 0 and ρ_{XY.Z} ≠ 0, then X_1 and Y_1 do not have a common parent in G.
CS2. If Factor_1(X_1, X_2, G), Factor_1(Y_1, Y_2, G), X_1 is not an ancestor of X_3, Y_1 is not an ancestor of Y_3, σ_{X_1 Y_1}σ_{X_2 Y_2} = σ_{X_1 Y_2}σ_{X_2 Y_1}, σ_{X_2 Y_1}σ_{Y_2 Y_3} = σ_{X_2 Y_3}σ_{Y_2 Y_1}, σ_{X_1 X_2}σ_{X_3 Y_2} = σ_{X_1 Y_2}σ_{X_3 X_2}, σ_{X_1 X_2}σ_{Y_1 Y_2} ≠ σ_{X_1 Y_2}σ_{X_2 Y_1}, and for all triplets {X, Y, Z}, X, Y ∈ {X_1, X_2, X_3, Y_1, Y_2, Y_3}, Z ∈ O, we have ρ_{XY} ≠ 0 and ρ_{XY.Z} ≠ 0, then X_1 and Y_1 do not have a common parent in G.
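For concreteness, constraint sets such as CS1 and CS2 are assembled from individual vanishing-tetrad and non-vanishing-correlation checks. The sketch below tests whether all three tetrads among four variables vanish in a sample covariance matrix; the tolerance comparison is only a stand-in for the statistical tests (e.g., the Wishart or Bollen tests discussed later) that would be used in practice.

    import numpy as np

    def all_tetrads_vanish(S, a, b, c, d, tol=1e-2):
        """Check sigma_ab*sigma_cd = sigma_ac*sigma_bd = sigma_ad*sigma_bc numerically,
        given a covariance matrix S and variable indices a, b, c, d (sketch only)."""
        t1 = S[a, b] * S[c, d] - S[a, c] * S[b, d]
        t2 = S[a, b] * S[c, d] - S[a, d] * S[b, c]
        t3 = S[a, c] * S[b, d] - S[a, d] * S[b, c]
        return all(abs(t) < tol for t in (t1, t2, t3))

    # CS1, for instance, requires all tetrads to vanish among {X1, X2, X3, Y1} and among
    # {X1, Y1, Y2, Y3}, one specific tetrad constraint to fail among {X1, X2, Y1, Y2},
    # and the listed correlations and partial correlations to be nonzero.
    data = np.random.default_rng(0).normal(size=(1000, 6))
    S = np.cov(data, rowvar=False)
    print(all_tetrads_vanish(S, 0, 1, 2, 3))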
As in the previous chapter, "CS" here stands for "constraint set", a set of constraints in the observable joint that are empirically verifiable. In the same way, call CS0 the separation rule of Lemma 4.1. The following lemmas state the correctness of CS1 and CS2:
Lemma  4.4  CS1  is  sound.
[Figure 4.1 diagram: latents L, P and Q, and observed variables X_1, X_2 and X_3.]

Figure 4.1: In this figure, L and Q are immediate latent ancestors of X_3, since there are directed paths from L and Q into X_3 that do not contain any latent node. Latent P, however, is not an immediate latent ancestor of X_3, since every path from P to X_3 contains at least one other latent.
Lemma  4.5  CS2  is  sound.
We have shown before that such identification results also hold in fully linear latent variable models. One might conjecture that, as far as identifying ancestral relations among observed variables and hidden common parents goes, linear and non-linear latent variable models are identical. However, this is not true.

Theorem 4.6 There are sound identification rules that allow one to learn if two observed variables share a common parent in a linear latent variable model that are not sound for non-linear latent variable models.

In other words, one gains more identification power if one is willing to assume full linearity of the latent variable model. We will see more of the implications of assuming linearity.

Another important building block in our approach is the identification of which latents exist. Define an immediate latent ancestor of an observed node O in a latent variable graph G as a latent node L that is a parent of O or the source of a directed path L → V → · · · → O, where V is an observed variable. Notice that this implies that every element in this path, with the exception of L, is an observed node, since we are assuming that observed nodes cannot have latent descendants. Figure 4.1 illustrates the concept.
Lemma 4.7 Let S ⊆ O be any set such that, for all {A, B, C} ⊆ S, there is a fourth variable D ∈ O where i. σ_{AB}σ_{CD} = σ_{AC}σ_{BD} = σ_{AD}σ_{BC} and ii. for every set {X, Y} ⊆ {A, B, C, D}, Z ∈ O, we have ρ_{XY.Z} ≠ 0 and ρ_{XY} ≠ 0. Then S can be partitioned into two sets S_1, S_2 where

1. all elements in S_1 share a common immediate latent ancestor, and no two elements in S_1 have any other common immediate latent ancestor;
2. no element S ∈ S_2 has any common immediate latent ancestor with any other element in S\{S};
3. all elements in S are d-separated given the latents in G.
Unlike the linear model case, a set of tetrad constraints σ_{AB}σ_{CD} = σ_{AC}σ_{BD} = σ_{AD}σ_{BC} is not a sufficient condition (along with non-vanishing correlations) for the existence of a node d-separating the nodes {A, B, C, D}. For instance, consider the graph in Figure 4.2(a), which depicts a latent variable graph with three latents L_1, L_2 and L_3, and four measured variables, {W, X, Y, Z}.
[Figure 4.2 diagrams: (a) a latent variable graph with latents L_1, L_2, L_3 and measured variables W, X, Y, Z; (b) a graph with latents L_1, L_2 and the same measured variables.]

Figure 4.2: It is possible that ρ_{L_1 L_3 . L_2} = 0 even though L_2 does not d-separate L_1 and L_3. That happens, for instance, if L_2 = λ_1 L_1 + ε_2 and L_3 = λ_2 L_1² + λ_3 L_2 + ε_3, where L_1, ε_2 and ε_3 are normally distributed with zero mean.
L_2 does not d-separate L_1 and L_3, but there is no constraint in the assumptions that precludes the partial correlation of L_1 and L_3 given L_2 from being zero. For example, in the additive model L_2 = λ_1 L_1 + ε_2, L_3 = λ_2 L_1² + λ_3 L_2 + ε_3, where L_1, ε_2 and ε_3 are standard normals, we have that ρ_{L_1 L_3 . L_2} = 0, which will imply all three tetrad constraints among {W, X, Y, Z}.
In this case, Lemma 4.7 says that, for S = {W, X, Y, Z}, we have some special partition of S. In Figure 4.2(a) it is given by S_1 = {W, X, Y, Z} and S_2 = ∅. In Figure 4.2(b), S_1 = {X, Y} and S_2 = {W, Z}. However, no tetrad constraint, and in fact no covariance constraint at all, can distinguish these two graphs from a model where a single latent d-separates all four indicators.
We  will   see  an  application  of   our  results  in  the  next  section,   where  they  are  used  to  identify
interesting  clusters  of  indicators,  i.e.,   disjoint  sets  of  observed  variables  that  measure  disjoint  sets
of  latents.
4.3   Learning  a  semiparametric  model
The assumptions and identification rules provided in the previous section can be used to learn a partial representation of the unknown graphical structure that generated the data, as suggested in Chapter 3. Given a set of observed variables O, let O′ ⊆ O be partitioned into sets C_1, . . . , C_k such that
SC1. for any {X_1, X_2, X_3} ⊆ C_i, there is some X_4 ∈ O′ such that σ_{X_1 X_2}σ_{X_3 X_4} = σ_{X_1 X_3}σ_{X_2 X_4} = σ_{X_1 X_4}σ_{X_2 X_3}, 1 ≤ i ≤ k, and X_4 is correlated with all elements in {X_1, X_2, X_3};

SC2. for any X_1 ∈ C_i, X_2 ∈ C_j, i ≠ j, we have that X_1 and X_2 are separated by CS0, CS1 or CS2;

SC3. for any {X_1, X_2} ⊆ C_i, Factor_1(X_1, X_2, G) = true or Factor_2(X_1, X_2, G) = true;

SC4. for any {X_1, X_2} ⊆ C_i, X_3 ∈ C_j, σ_{X_1 X_3} ≠ 0 if and only if σ_{X_2 X_3} ≠ 0.
Any  partition  with  structural  conditions  SC1-SC4  has  the  following  properties:
Theorem 4.8 If a partition C = {C_1, . . . , C_k} of O′
of the given observed variables such that the first two moments of the distribution of O′
Σ_i λ^V_i Pa^V_i + ε_V, where Pa^V_i is a parent of V in G, and ε_V is a random variable with zero mean and variance σ²_V (λ_V and σ_V are the two extra parameters per node). Notice that this parameterization might not be enough to represent all moments of a given family of probability distributions.
A  linear  latent  variable  model  is  a  latent  variable  graph  with  a  particular  instance  of  a  linear
parameterization.   The  following  result  mirrors  the  one  obtained  for  linear  models:
Theorem 4.9 Given a partition C of a subset O′ ⊆ O as above, consider the graph G_linear constructed by the following algorithm:

1. initialize G_linear with a node for each element in O′;
2. for each C_i ∈ C, add a latent L_i to G_linear and, for each V ∈ C_i, add an edge L_i → V;
3. fully connect the latents in G_linear to form an arbitrary directed acyclic graph.

Then the first two moments of O′ can be represented by a latent variable model whose graph is G_linear.
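A minimal sketch of this construction with networkx is given below; the input format (a list of indicator clusters) and the particular latent ordering chosen in step 3 are assumptions made for illustration only.

    import networkx as nx

    def build_g_linear(clusters):
        """clusters: list of lists of observed variable names, one list per C_i."""
        g = nx.DiGraph()
        latents = []
        for i, cluster in enumerate(clusters, start=1):
            latent = f"L{i}"
            latents.append(latent)
            g.add_nodes_from(cluster)                 # step 1: one node per element of O'
            for v in cluster:
                g.add_edge(latent, v)                 # step 2: L_i -> V for every V in C_i
        for i, li in enumerate(latents):
            for lj in latents[i + 1:]:
                g.add_edge(li, lj)                    # step 3: latents fully connected as a DAG
        return g

    g_linear = build_g_linear([["W", "X"], ["Y", "Z"]])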
For instance, the G_linear graph associated with Figures 4.2(a) and 4.2(b) would be a one-factor model where a single latent L is the common parent of {W, X, Y, Z}, and L d-separates its children. The constructive proof of Theorem 4.9 (see Appendix B) shows that G_linear can be used to parameterize a model of the first two moments of O′
L_1 + ε_{L_3}
L_4 = sin(L_2 / L_3) + ε_{L_4}
where L_1 is distributed as a mixture of two beta distributions, Beta(2, 4) and Beta(4, 2), where each one has prior probability of 0.5. Each error term ε_L is distributed as a mixture of a Beta(4, 2) and the symmetric of a Beta(2, 4), where each component in the mixture has a prior probability that is uniformly distributed in [0, 1], and the mixture priors are drawn individually for each latent in {L_2, L_3, L_4}. The error terms for the indicators also follow a mixture of betas (2, 4) and (4, 2), each one with a mixing proportion individually chosen according to a uniform distribution in [0, 1]. The linear coefficients relating latents to indicators and indicators to indicators were chosen uniformly in the interval [-1.5, -0.5] ∪ [0.5, 1.5].
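As an illustration of the error-term distribution just described, a small sampling sketch follows; it interprets "the symmetric of a Beta(2, 4)" as the negative of a Beta(2, 4) variate, which is an assumption about the intended construction.

    import numpy as np

    def sample_error_terms(n, rng):
        """Mixture of a Beta(4, 2) and the negative of a Beta(2, 4); the mixing
        proportion is itself drawn uniformly in [0, 1], once per variable (sketch)."""
        p = rng.uniform()
        pick_first = rng.uniform(size=n) < p
        return np.where(pick_first, rng.beta(4, 2, size=n), -rng.beta(2, 4, size=n))

    rng = np.random.default_rng(0)
    eps = sample_error_terms(5000, rng)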
To give an idea of how nonnormal the observed distribution can be, we submitted a sample of size 5000 to a Shapiro-Wilk normality test in R 1.6.2 for each variable, and the hypothesis of normality was strongly rejected for all 16 variables, the highest p-value being of the order of 10^{-11}. Figure 4.4 depicts histograms for each variable in a specific sample. We show a randomly selected correlation matrix from a sample of size 5000 in Table 4.1.
In principle, the asymptotic distribution-free test of tetrad constraints from Bollen (1990) should be the method of choice if the data does not pass a normality test. However, such a test uses the fourth moments of the empirical distribution, which can take a long time to compute if the number of variables is large (it takes O(mn^4) steps, where m is the number of data points and n is the number of variables). Caching a large matrix of fourth moments may require secondary memory storage, unless one is willing to pay for multiple passes through the data set every time a test is demanded, or unless a large amount of RAM is available. Therefore, we also evaluate the behavior of the algorithm using the Wishart test (Spirtes et al., 2000; Wishart, 1928), which assumes multivariate normality.^1 Samples of size 1000, 5000 and 50000 were used. Results are given in Table 4.2.
1
We  did  not  implement  distribution-free  tests  of  vanishing  partial  correlations.   In  these  experiments  we  use  tests
for  jointly  normal  variables,  which  did  not  seem  to  aect  the  results.
1.0   -0.683   -0.693   -0.559   -0.414   -0.78   -0.369   -0.396   -0.306   0.328   -0.309   -0.3   -0.231   0.227   0.276   -0.278
-0.683   1.0   0.735   0.603   0.442   0.64   0.389   0.425   0.347   -0.363   0.338   0.339   0.243   -0.238   -0.282   0.282
-0.693   0.735   1.0   0.603   0.426   0.637   0.378   0.408   0.348   -0.365   0.341   0.337   0.236   -0.239   -0.279   0.284
-0.559   0.603   0.603   1.0   0.357   0.524   0.316   0.334   0.282   -0.298   0.279   0.287   0.18   -0.196   -0.222   0.227
-0.414   0.442   0.426   0.357   1.0   0.789   0.761   0.811   0.19   -0.203   0.197   0.194   0.356   -0.371   -0.429   0.439
-0.78   0.64   0.637   0.524   0.789   1.0   0.713   0.757   0.284   -0.304   0.289   0.284   0.354   -0.364   -0.429   0.438
-0.369   0.389   0.378   0.316   0.761   0.713   1.0   0.734   0.171   -0.183   0.174   0.174   0.321   -0.333   -0.387   0.401
-0.396   0.425   0.408   0.334   0.811   0.757   0.734   1.0   0.175   -0.188   0.184   0.183   0.326   -0.34   -0.402   0.41
-0.306   0.347   0.348   0.282   0.19   0.284   0.171   0.175   1.0   -0.858   0.821   0.818   0.199   -0.191   -0.239   0.239
0.328   -0.363   -0.365   -0.298   -0.203   -0.304   -0.183   -0.188   -0.858   1.0   -0.848   -0.843   -0.212   0.204   0.256   -0.25
-0.309   0.338   0.341   0.279   0.197   0.289   0.174   0.184   0.821   -0.848   1.0   0.805   0.201   -0.19   -0.238   0.237
-0.3   0.339   0.337   0.287   0.194   0.284   0.174   0.183   0.818   -0.843   0.805   1.0   0.211   -0.2   -0.246   0.244
-0.231   0.243   0.236   0.18   0.356   0.354   0.321   0.326   0.199   -0.212   0.201   0.211   1.0   -0.654   -0.898   0.777
0.227   -0.238   -0.239   -0.196   -0.371   -0.364   -0.333   -0.34   -0.191   0.204   -0.19   -0.2   -0.654   1.0   0.78   -0.787
0.276   -0.282   -0.279   -0.222   -0.429   -0.429   -0.387   -0.402   -0.239   0.256   -0.238   -0.246   -0.898   0.78   1.0   -0.92
-0.278   0.282   0.284   0.227   0.439   0.438   0.401   0.41   0.239   -0.25   0.237   0.244   0.777   -0.787   -0.92   1.0
Table  4.1:   An  example  of  a  sample  correlation  matrix  of  a  sample  of  size  5000.
Evaluation of estimated purified models

                         1000          5000          50000
Wishart test
  missing latents        0.20 ± 0.11   0.20 ± 0.11   0.18 ± 0.12
  missing indicators     0.21 ± 0.11   0.22 ± 0.08   0.10 ± 0.13
  misplaced indicators   0.01 ± 0.02   0.0 ± 0.0     0.0 ± 0.0
  impurities             0.0 ± 0.0     0.0 ± 0.0     0.1 ± 0.21
Bollen test
  missing latents        0.18 ± 0.12   0.13 ± 0.13   0.10 ± 0.13
  missing indicators     0.15 ± 0.09   0.16 ± 0.14   0.14 ± 0.11
  misplaced indicators   0.02 ± 0.05   0.0 ± 0.0     0.1 ± 0.03
  impurities             0.15 ± 0.24   0.10 ± 0.21   0.0 ± 0.0

Table 4.2: Results obtained for estimated purified graphs with the nonlinear graph. Each number is an average over 10 trials, with an indication of the standard deviation over these trials.
Such a test might be useful as an approximation, even though it is not the theoretically correct way of approaching this kind of data.
The results are quite close to each other, although the Bollen test at least seems to get better with more data. Results for the proportion of impurities vary more, since we have only two impurities in the true graph. The major difficulty in this example is again the fact that we have two clusters with only three pure indicators each. It was quite common that we could not keep the cluster with variables {5, 7, 8} and some other cluster in the same final solution, because the test (which requires the evaluation of many tetrad constraints) that contrasts two clusters would fail (Step 10 of FindInitialSelection in Table A.3). To give an idea of how having more than three indicators per latent can affect the result, running this same example with 5 indicators per latent (which means at least four pure indicators for each latent) produces, with samples smaller than 1000, better results than anything reported in Table 4.2. That happens because Step 10 of FindInitialSelection only needs one triplet from each cluster, and the chances of having at least one triplet from each group that satisfies its criterion increase with a higher number of pure indicators per latent.
4.4.2   Experiments  in  density  estimation
In this section, we concentrate on evaluating our procedure as a way of finding submodels with a good fit. The goal is to show that causally motivated algorithms can also be suitable for density estimation. We run our algorithm over some datasets from the UCI Machine Learning Repository to obtain a graphical structure analogous to the G_linear described in the previous section. We then fit the data to such a structure by using a mixture of Gaussian latent DAGs with a standard EM algorithm. Each component has a full parameterization: different linear coefficients and error variances for each variable in each mixture component. The number of mixture components is chosen by fitting the model with 1 up to 7 components and choosing the one that maximizes the BIC score.
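The component-selection loop can be sketched as follows. sklearn's GaussianMixture is used here only as a stand-in for the mixture of Gaussian latent DAGs (and for the mixture of factor analyzers below), since the BIC-based selection logic is the same; note that sklearn defines bic() so that lower values are better.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def select_components_by_bic(data, max_components=7, seed=0):
        """Fit 1..max_components mixture components with EM and keep the best BIC (sketch)."""
        best_model, best_bic = None, np.inf
        for k in range(1, max_components + 1):
            model = GaussianMixture(n_components=k, covariance_type="full",
                                    random_state=seed).fit(data)
            bic = model.bic(data)
            if bic < best_bic:
                best_model, best_bic = model, bic
        return best_model

    X = np.random.default_rng(0).normal(size=(500, 4))
    mixture = select_components_by_bic(X)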
We compare this model against the mixture of factor analyzers (MofFA) (Ghahramani and Hinton, 1996). In this case, we want to compare what can be gained by fitting a model where latents are allowed to be dependent, even when we restrict the observed variables to be children of a single latent. Therefore, we fit mixtures of factor analyzers using the same number of latents we find with our algorithm. The number of mixture components is chosen independently, using the same BIC-based procedure. Since BPC can return only a model for a subset of the given observed variables, we run MofFA on the same subsets output by our algorithm.
In practice, our approach can be used in two ways. First, as a way of decomposing the full joint of a set O of observed variables by splitting it into two sets: one set of variables X that can be modeled as a mixture of G_linear models, and another set of variables Y = O\X whose conditional probability f(Y | X) can be modeled by some other representation of choice. Alternatively, if the observed variables are redundant (i.e., many variables are intended to measure the same latent concept), this procedure can be seen as a way of choosing a subset whose marginal is relatively easy to model with simple causal graphical structures.
As a baseline, we use a standard mixture of Gaussians (MofG), where an unconstrained multivariate Gaussian is used in each mixture component. Again, the number of mixture components is chosen independently by maximizing BIC. Since the number of variables used in our experiments is relatively small, we do not expect to perform significantly better than MofG in the task of density estimation, but a similar performance is an indication that our highly constrained models provide a good fit, and therefore that our observed rank constraints can be reasonably expected to hold in the population.
We ran a 10-fold cross-validation experiment for each one of the following four UCI datasets: iono, spectf, water and wdbc, all of which are measured over continuous or ordinal variables. We also tried the small dataset wine (13 variables), but we could not find any structure using our method. The other datasets varied from 30 to 40 variables. The results given in Table 4.3 show the average log-likelihood per data point on the respective test sets, also averaged over the 10 splits. These results are reported as differences from the baseline established by MofG. We also show the average percentage of variables that were selected by our algorithm. The outcome is that we can represent the joint of a significant portion of the observed variables as a simple latent variable model where observed variables have a single parent. Such models do not lose information compared to the full mixture of Gaussians. In one case (iono) we were able to significantly improve over the mixture of factor analyzers when using the same number of latent variables.

In the next chapter we show how these results can be improved by using Bayesian search algorithms, which also allow the insertion of more observed variables, and not only those that have a single parent in a linearized graph.
Dataset   BPC            MofFA          % variables
iono      1.56 ± 1.10    -3.03 ± 2.55   0.37 ± 0.06
spectf    -0.33 ± 0.73   -0.75 ± 0.88   0.34 ± 0.07
water     -0.01 ± 0.74   -0.90 ± 0.79   0.36 ± 0.04
wdbc      -0.88 ± 1.40   -1.96 ± 2.11   0.24 ± 0.13

Table 4.3: The difference in average test log-likelihood of BPC and MofFA with respect to a multivariate mixture of Gaussians. Positive values indicate that a method gives a better fit than the mixture of Gaussians. The statistics are the average of the results over a 10-fold cross-validation. A standard deviation is provided. The average proportion of variables used by our algorithm is also reported.
4.5   Completeness  considerations
So far, we have emphasized the soundness of BuildPureClusters in both its linear and non-linear versions. However, an algorithm that always returns an empty graph is vacuously sound. BuildPureClusters is of interest only if it can return useful information about the true graph. In Chapter 3, we only briefly described issues concerning the completeness of this algorithm, i.e., how many of the common features of all tetrad-equivalent models can be discovered.

It has to be stressed that there is no guarantee of how large the set of indicators in the output of BuildPureClusters will be for any problem. It can be an empty set, for instance, if all observed variables are children of several latents. Variations of BuildPureClusters are still able to asymptotically find the submodel with the largest number of latents that can be identified with CS rules. To accomplish that, one has to apply the following algorithm in place of Step 2 of Table 3.2:
Algorithm MaximumLatentSelection

1. Create an empty graph G_L, where each node corresponds to a latent
2. Add an undirected edge between L_i and L_j if and only if L_i has three pure indicators that L_j does not have, and vice-versa
3. Return a maximum clique of G_L
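A minimal sketch of MaximumLatentSelection with networkx, assuming a dictionary that maps each candidate latent to the set of its pure indicators (an input format chosen here for illustration):

    import networkx as nx

    def maximum_latent_selection(pure_indicators):
        """pure_indicators: dict mapping a latent name to the set of its pure indicators.
        Builds G_L and returns one maximum clique (worst-case exponential, as noted below)."""
        g = nx.Graph()
        g.add_nodes_from(pure_indicators)
        names = list(pure_indicators)
        for i, li in enumerate(names):
            for lj in names[i + 1:]:
                if (len(pure_indicators[li] - pure_indicators[lj]) >= 3 and
                        len(pure_indicators[lj] - pure_indicators[li]) >= 3):
                    g.add_edge(li, lj)
        return max(nx.find_cliques(g), key=len)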
An interesting  implication  is:   if there is a pure submodel of the true measurement model where
each  latent  has  at  least  three  indicators,   then  this  algorithm  will   identify  all   latents  (Silva  et  al.,
2003).   This  assumption  is  not  testable,   however.   Moreover,   because  of  the  maximum  clique  step,
this  algorithm  is  exponential  in  the  number  of  latents,  in  the  worst  case.
In principle, many of the identifiability limitations described here can be overcome if one explores constraints that use information besides the second moments of the observed variables. Still, it is of considerable interest to know what can be done with covariance information only, since using higher order moments greatly increases the chance of committing statistical mistakes. This is especially problematic when learning the structure of latent variable models.

Although we do not provide a complete characterization of the tetrad equivalence class, we can provide a necessary condition for identifying that two nodes have no common latent parent when no vanishing marginal correlations are observed:
Lemma 4.10 Let G(O) be a latent variable graph where no pair in O is marginally uncorrelated, and let {X, Y} ⊆ O. If there is no pair {P, Q} ⊆ O such that σ_{XY}σ_{PQ} = σ_{XP}σ_{YQ} holds, then there is at least one graph in the tetrad equivalence class of G where X and Y have a common latent parent.
Notice this does not mean that one cannot distinguish between models where X and Y have and do not have a common hidden parent; the claim concerns tetrad equivalence classes only. For instance, in some situations this can be done by using only conditional independencies, which is the basis of the Fast Causal Inference algorithm of Spirtes et al. (2000). Figure 4.5 illustrates such a case.
In practice, it is not of great interest to have identification rules that require the use of many variables. The more variables a rule requires, the more computationally expensive any search algorithm gets, and the less statistically reliable it becomes. Our CS rules, for instance, require 6 variables, which is already a considerably high number. However, as far as using tetrad constraints goes, one cannot expect to extend BuildPureClusters with identification rules that are computationally simpler than CS1, CS2 or CS3. The following result shows that in the general case (i.e., where marginal independencies are not observed), one does not have a criterion for clustering indicators that uses fewer than six variables using tetrad constraints:
Theorem 4.11 Let X ⊆ O be a set of observed variables, |X| < 6. Assume σ_{X_1 X_2} ≠ 0 for all {X_1, X_2} ⊆ X. There is no possible set of tetrad constraints within X for deciding if two nodes A, B ∈ X do not have a common parent in a latent variable graph G(O).
Notice again that it might be the case that a combination of tetrad and conditional independence constraints provides an identification rule that uses fewer than 6 variables (in a case where conditional independencies alone are not enough). This result is for tetrad constraints only.
4.6   Summary
We presented empirically testable conditions that allow one to learn structural features of latent variable models where latents are non-linearly related. These results can be used in an algorithm for learning a measurement model for some latents without making any assumptions about the true graphical structure, besides the fairly general assumption that observed variables cannot be parents of latent variables.
Figure  4.4:   Univariate  histograms  for  each  of  the  16  variables  (organized  by  row)  from  a  data  set
of  5000  observations  sampled  from  the  graph  in  Figure  4.3.   30  bins  were  used.
[Figure 4.5 diagrams: two graphs, (a) and (b), over the variables P, Q, V, X and Y; one of them also contains the hidden variable L.]
Figure  4.5:   These  two  models  (L  is  the  only  hidden  variable)  can  be  distinguished  by  using  condi-
tional  independence  constraints,  but  not  through  tetrad  constraints  only.
Chapter  5
Learning  local   discrete  measurement
models
The BuildPureClusters algorithm (BPC) constructs a global measurement model: a single model composed of several latents. In linear models, it provides sufficient information for the application of sound algorithms for learning latent structure. As defined, BPC can be applied only to continuous (or approximately continuous) data.

However, one might be interested in a local model, which we define as a set of several small models covering a few variables each, whose respective sets of variables might overlap. A local model is usually not globally consistent: in probabilistic terms, this means the marginal distribution for a given set of variables differs according to different elements of the local model. In causal terms, this means conflicting causal directions. The two main reasons why one would use a local model instead of a global one are: 1. ease of computation, especially for high dimensional problems; 2. there might be no good global model, but several components of a local model might be of interest.

In this chapter, we develop a framework for learning local measurement models of discrete data using BuildPureClusters and compare it to one of the most widely used local model formulations: association rules.
5.1   Discrete  associations  and  causality
Discovering interesting associations in discrete databases is a key task in data mining. Defining interestingness is, however, an elusive task. One can informally describe interesting (conditional) associations as those that allow one to create policies that maximize a measure of success, such as profit in private companies or an increase of life expectancy in public health. Ultimately, many questions phrased as "find interesting associations" in the data mining literature are nothing but causal questions with observational data (Silverstein et al., 2000).

A canonical example is the following hypothetical scenario: baby diapers and beer are products with a consistent association across several market basket databases. From this previously unknown association and extra prior knowledge, an analyst was able to infer that this association is due to the causal process where fathers, when assigned to the duty of buying diapers, indulge in buying some beer. One possible policy that makes use of this information is displaying beer and diapers in the same aisle to convince parents to buy beer more frequently when buying diapers.

In this case, the link from association to causality came from prior knowledge about a hidden
variable.   The interpretation of the hidden variable, however, came from the nature of the two items
measuring it,  and  without the  knowledge  of  the statistical  support for  this  association,  it  would  be
unlikely  that  the  analyst  would  conjecture  the  existence  of  such  a  latent.
Association rules (Agrawal and Srikant, 1994) are a very common tool for discovering interesting associations. A standard association rule is simply a propositional rule of the type "If A, then B", or simply A → B, with two particular features:

- the support of the rule: the number of cases in the database where events A and B jointly occur;
- the confidence of the rule: the proportion of cases that have B, counting only among those that have A.
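For reference, a minimal sketch computing both quantities over a list of transactions (each represented as a set of items); support is reported as a count of joint occurrences, as in the description above.

    def support_and_confidence(transactions, antecedent, consequent):
        with_a = [t for t in transactions if antecedent <= t]
        with_both = [t for t in with_a if consequent <= t]
        support = len(with_both)
        confidence = len(with_both) / len(with_a) if with_a else 0.0
        return support, confidence

    baskets = [{"diapers", "beer"}, {"diapers"}, {"beer"}, {"diapers", "beer", "milk"}]
    print(support_and_confidence(baskets, {"diapers"}, {"beer"}))  # (2, 0.666...)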
Searching for association rules requires finding a good trade-off between these two features. With extra assumptions, association rule mining inspired by algorithms such as the PC algorithm can be used to reveal causal rules (Silverstein et al., 2000; Cooper, 1997).
However, in many situations the causal explanation for the observed associations is due to latent variables, as in our example above. The number of rules can be extremely large even in relatively small data sets. More recent algorithms may dramatically reduce the number of rules when compared to classical alternatives (Zaki, 2004), but even there the set of rules can be unmanageable. Although rules can describe specific useful knowledge, they do not take into account that hidden common causes might explain several patterns not only in a much more succinct way, but in a way in which leaping from association to causation would require less background knowledge. How to introduce hidden variables into causal association rules is the goal of the algorithm described in this chapter.
5.2   Local   measurement  models  as  association  rules
Association rules are local models by nature. That is, the output of an association rule analysis consists of a set of rules covering only some variables in the original domain. Such rules might be contradictory: the probability P(B | A) might be different according to rules A → {B, C} and A → {B, D}, for instance, depending on which model is used to represent these conditional distributions.^1
One certainly loses statistical power by using local models instead of global ones. This is one of the main reasons why an algorithm such as GES usually performs better than the PC search (Spirtes et al., 2000), for instance (although PC outputs a global model, it does so by merging several local pieces of information derived nearly independently). However, searching for local models can be orders of magnitude faster than searching for global models. For large problems, it might be simply impossible to find a global model. Pieces of a local model can still be useful, as in causal association rules compared to full graphical models (Silverstein et al., 2000; Cooper, 1997).

Moreover, not requiring global consistency can in some sense be advantageous: for instance, it is well known that the PC algorithm might return cyclic graphs even though it is not supposed to. This happens because the PC algorithm builds a model out of local pieces of information, and such pieces might not fit globally as a DAG. Although such an output might be the result of statistical mistakes, it can also indicate a failure of assumptions, such as the assumed non-existence of hidden variables. An algorithm such as GES will always return a DAG, and therefore it is less
1
This  will   not  be  the  case  if  this  probability  is  the  standard  maximum  likelihood  estimator  of   an  unconstrained
multinomial  distribution.
[Figure 5.1 diagram: latents L_1, L_2 and L_3 with indicators X_1 through X_9.]

Figure 5.1: Latent L_2 will never be identified by any implementation of BPC that attempts to include L_1 and L_3, although individually it has a pure measurement model.
robust to failures of the assumptions.   When datasets are very large,  running PC might be a better
alternative  than,  or  at  least  a  complement  to,  GES.
This is especially interesting for the problem of finding pure measurement models. BuildPureClusters will return a pure model if one exists. However, one might lose information that could be easily derived using the same identification conditions of Chapter 3. Consider the model in Figure 5.1. Latent L_2 cannot exist in the same pure model as the other two latents, since it requires deleting too many indicators of L_1 and L_3. However, one can verify that there is a pure measurement model with at least four (direct and indirect) indicators for L_2 (X_1, X_2, X_3, X_4), which could be derived independently.
Learning a full model with impurities might be statistically difficult, as discussed in previous chapters: in simulations, estimated measurement patterns are considerably far off from the real ones. Listing all possible combinations of pure models might be intractable. Instead, an interesting compromise for finding measurement models can be described in three steps:

1. find one-factor models only;
2. filter such models;
3. use the selected one-factor models according to the problem at hand.
The first step can be used to generate local models, i.e., sets of one-factor models generated independently, without the necessity of being globally coherent. This means that in principle one might generate a one-factor model for {X_1, X_2, X_3, X_4}, {X_1, X_2, X_3, X_5} and {X_2, X_3, X_4, X_5}, but fail to generate a one-factor model using {X_1, X_2, X_3, X_4, X_5}, although the first three logically imply the latter. This could not happen if the assumptions hold and data is infinite, but it is possible for finite samples and real-world data.
Since the local model might have many one-factor elements, one might need to filter out elements considered irrelevant by some criteria. By following this framework, we will introduce a variation of BuildPureClusters for discrete data that performs Steps 1 and 2 above. We leave Step 3 to be decided according to the application. For instance, one might select one-factor models, learn which impurities might hold among them, and then use the final result to learn the structure among latents. This, however, can be very computationally expensive, orders of magnitude more costly than the case for continuous variables (Bartholomew and Knott, 1999; Buntine and Jakulin, 2004).

Another alternative is a more theory-driven approach, where latents are just labeled by an expert in the field but no causal model for the latents is automatically derived from data. They can be derived by theory or ignored: in this case, each one-factor model itself is the focus of the
[Figure 5.2 diagrams: (a) a latent trait model with two latents, underlying variables X*_1 through X*_5 and ordinal indicators X_1 through X_5; (b) a model with latents Efficacy and Responsiveness measuring NOSAY, COMPLEX, NOCARE, TOUCH and INTEREST through their underlying variables.]
Figure  5.2:   Graphical  representations  of  two  latent  trait  models  with  5  ordinal  observed  variables.
analysis. This is similar to performing data analysis with association rules, where rules themselves are taken as independent pieces of knowledge. Each one-factor model can then be seen as a causal association rule with a latent variable as an antecedent and a probabilistic model where observed variables are independent given this latent.

The rest of the chapter is organized as follows: in Section 5.3, we discuss the parametric formulation we will adopt for the problem of learning discrete measurement models. In Section 5.4, we formulate the problem more precisely. Section 5.4.1 describes the variation of BuildPureClusters for local models. Finally, in Section 5.5 we evaluate the method with synthetic and real-world data.
5.3   Latent  trait  models
Factor analysis and principal component analysis (PCA) are classical latent variable models for continuous measures. For discrete measures, several variations of discrete principal component analysis exist (Buntine and Jakulin, 2004), but they all rely on the assumption that latents are independent. There is little reason, if any at all, to make such an artificial assumption if the goal is causal analysis among the latents.

Several approaches exist for learning models with correlated latent variables. For instance, Pan et al. (2004) present a scalable approach for discovering dependent hidden variables in a stream of continuous measures. While this type of approach might be very useful in practice, it is still not clear which causal assumptions are being made in order to interpret the latents. In contrast, in the previous chapters we presented a set of well-defined assumptions that are used to infer and justify the choice of latent variables that are generated, based on the axiomatic causality calculus of Pearl (2000) and Spirtes et al. (2000). This chapter is about how to extend them to discrete ordinal (or binary) data based on the framework of latent trait models and local models.
Latent trait models (Bartholomew and Knott, 1999) are models for discrete data that in general do not make the assumption of latent independence. However, they usually rely on distributional assumptions, such as a multivariate Gaussian distribution for the latents. We consider such assumptions to be much less harmful for causal analysis than the assumption of full independence, and in several cases acceptable, such as for variables used in the social sciences and psychology (Bollen, 1989; Bartholomew et al., 2002).

The main idea in latent trait models is to model the joint latent distribution as a multivariate Gaussian. However, in this model the observed variables are not direct measures of the latents. Instead, the latents in the trait model have other hidden, continuous measures. Such extra hidden measures are quantitative indicators of the latent feature of interest. To distinguish these latent indicators from the target latents, we will refer to the former as "underlying variables" (Bartholomew and Knott, 1999; Bartholomew et al., 2002).
[Figure 5.3 plot: a contour of the joint Gaussian distribution of (X*_1, X*_2), with discretization thresholds marked on each axis.]

Figure 5.3: Two ordinal variables X_1 and X_2 can be seen as discretizations of two continuous variables X*_1 and X*_2. The lines in the graph above represent thresholds that define the discretization. The ellipse represents a contour plot of the joint Gaussian distribution of the two underlying continuous variables. Notice that the degree of correlation of the underlying variables has testable implications on the joint distribution of the observed ordinal variables.
This model is more easily understood through a graphical representation. As a graphical model, a latent trait model has three layers of nodes: the first layer corresponds to the latent variables; in the second layer, underlying variables are children of latents and of other underlying variables; in the third layer, each discrete measure has a single parent in the underlying variable layer. Consider Figure 5.2(a), for example. The top layer corresponds to our target latents, η_1 and η_2. These targets have underlying measures X*_1, . . . , X*_5. The underlying measures are observed as discrete ordinal variables X_1, . . . , X_5.
As another example, consider the following simplified political action survey data set discussed in detail by Joreskog (2004). It consists of a questionnaire intended to gauge how citizens evaluate the political efficacy of their governments. The variables used in this study correspond to questions to which the respondent has to give his/her degree of agreement on a discrete ordinal scale of 4 values. The given variables are the following:

- NOSAY: "People like me have no say on what the government does"
- VOTING: "Voting is the only way that people like me can have any say about how the government runs things"
- COMPLEX: "Sometimes politics and government seem so complicated that a person like me cannot really understand what is going on"
- NOCARE: "I don't think that public officials care much about what people like me think"
- TOUCH: "Generally speaking, those we elect to Congress in Washington lose touch with people pretty quickly"
- INTEREST: "Parties are only interested in people's votes but not in their opinion"
In (Joreskog, 2004), a theoretical model consisting of two latents, one with measures NOSAY, COMPLEX and NOCARE, and another with measures NOCARE, TOUCH and INTEREST, is given. This is represented in Figure 5.2(b). The first latent would correspond to a previously established theoretical trait of Efficacy, individuals' self-perceptions that they are capable of understanding politics and competent enough to participate in political acts such as voting (Joreskog, 2004, p. 21). The second latent would be the pre-established trait of Responsiveness, belief that
the public cannot influence political outcomes because government leaders and institutions are unresponsive. VOTING is discarded by Joreskog (2004) for this particular data set under the argument that the question is not clearly phrased.
Under this framework, our goal is to discover pieces of the measurement model of the latent variable model. The mapping from an underlying variable X* to its observed ordinal counterpart X with n values is defined through a set of thresholds: let τ^X_1, ..., τ^X_{n-1} be a set of real numbers such that τ^X_1 < τ^X_2 < ... < τ^X_{n-1}. Then:

$$X = \begin{cases} 1 & \text{if } X^* < \tau^X_1; \\ 2 & \text{if } \tau^X_1 \le X^* < \tau^X_2; \\ \vdots & \\ n & \text{if } \tau^X_{n-1} \le X^*; \end{cases}$$

where the underlying variable X* with parents z^X_1, ..., z^X_k is given by

$$X^* = \sum_{i=1}^{k} \lambda^X_i z^X_i + \epsilon_X, \qquad \epsilon_X \sim N(0, \sigma^2_X),$$

where each λ^X_i corresponds to the linear effect of parent z^X_i on X*, and z^X_i is either a target latent or an underlying variable. Latents and underlying variables are centered at zero without loss of generality.
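To make this generative process concrete, the following minimal sketch (an illustration only; the loadings, error standard deviation and thresholds below are arbitrary choices, not values prescribed by the thesis) samples an underlying variable as a linear function of its latent parents plus Gaussian noise and then discretizes it with the thresholds τ^X:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ordinal(latents, loadings, sigma, thresholds, rng):
    """Sample one ordinal observation X given values of its latent parents.

    latents    : values of the parents z^X_1, ..., z^X_k
    loadings   : linear effects lambda^X_1, ..., lambda^X_k
    sigma      : standard deviation of the error term epsilon_X
    thresholds : increasing thresholds tau^X_1 < ... < tau^X_{n-1}
    Returns the ordinal value in {1, ..., n}.
    """
    # Underlying variable: linear combination of parents plus Gaussian noise.
    x_star = float(np.dot(loadings, latents) + rng.normal(0.0, sigma))
    # Count how many thresholds lie at or below x_star; shift to {1, ..., n}.
    return int(np.searchsorted(thresholds, x_star, side="right") + 1)

# Example: a single latent parent and a 3-valued ordinal indicator.
eta = rng.normal()
x = sample_ordinal(np.array([eta]), np.array([0.8]), 0.6,
                   np.array([-0.5, 0.5]), rng)
```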
Since the underlying variables can be correlated, this imposes constraints on the observed joint distribution of the ordinal variables. Figure 5.3 illustrates this case for two ordinal variables X_1 and X_2 of 3 and 5 values, respectively. The correlation of the two underlying variables corresponding to two ordinal variables in a latent trait model is called the polychoric correlation (or tetrachoric, if the two variables are binary; Basilevsky, 1994; Bartholomew and Knott, 1999).
Therefore, the fitness of a latent trait model depends on how well the matrix of polychoric correlations fits the respective factor analysis model composed of the latents and the underlying variables X*.

5.4 Learning latent trait measurement models as causal rules

Our goal is to learn a collection 𝒮 of sets of observed variables, interpreted as causal rules, by assuming there is a latent trait model G that generated the data. Each set S ∈ 𝒮 has the following properties:

• there is a unique latent variable L in the true unknown latent trait model such that, conditioned on L, all elements of S are independent;

• at most one element in S is not a descendant of L in G.
Furthermore, it is desirable to make each set S maximal, i.e., no element can be added to it while still complying with the two properties above. One can think of each set S as a causal association rule where the antecedent of the rule is a latent variable and the rule is a naive Bayes model in which observations are independent given the latent. Since the number of sets with this property might be very large, we further filter such sets as follows:

• sometimes it is possible to find out that two observed variables cannot share any common hidden parent in G. When this happens, we will not consider sets containing such a pair. This can drastically reduce the number of rules and the computational time;

• we eliminate some sets in 𝒮 that are measuring the same latent as some other set.
In the next section we first describe a variation of BuildPureClusters based on these principles.
5.4.1   Learning  measurement  models
In order to learn measurement models, one has to discover the following pieces of information concerning the unknown graph that represents the model:

• which latent nodes exist;

• which pairs of observed variables are known not to have any hidden common parent;

• which sets of observed variables are independent conditioned on some latent variable.

Algorithm BuildSinglePureClusters
Input: Σ, a sample covariance matrix of a set of variables O

1. (Selection, C, C_0) <- FindInitialSelection(Σ).
2. For every pair of nonadjacent nodes {N_1, N_2} in C where at least one of them is not in Selection and an edge N_1 - N_2 exists in C_0, add a RED edge N_1 - N_2 to C.
3. For every pair of adjacent nodes {N_1, N_2} in C linked by a YELLOW edge, add a RED edge N_1 - N_2 to C.
4. For every pair of nodes linked by a RED edge in C, apply successively rules CS1 and CS2. Remove an edge between every pair corresponding to a rule that applies.
5. Let H be the set of maximal cliques in C.
6. P_C <- PurifyIndividualClusters(H, C_0, Σ).
7. Return FilterRedundant(P_C).

Table 5.1: An algorithm for learning locally pure measurement models. It requires information returned in the graphs C and C_0, which are generated by algorithm FindInitialSelection, described in Table 5.2.
Using  the  same  assumptions  from  Chapter  3,  it  is  still  the  case  that  the  following  holds  for  the
underlying  variables:
Corollary 5.1 Let G be a latent trait model, and let {X_1, X_2, X_3, X_4} be underlying variables such that σ_{X_1 X_2} σ_{X_3 X_4} = σ_{X_1 X_3} σ_{X_2 X_4} = σ_{X_1 X_4} σ_{X_2 X_3}. If ρ_{AB} ≠ 0 for all {A, B} ⊆ {X_1, X_2, X_3, X_4}, then there is a node P that d-separates all elements in {X_1, X_2, X_3, X_4}.
Since  no  two  underlying  variables  are  independent  conditional   on  an  observed  variable,   then
node  P  has  to  be  a  latent  variable  (possibly  an  underlying  variable).
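As a simple illustration of how the premise of Corollary 5.1 can be checked, the sketch below tests whether all three tetrad differences of a quadruple vanish in a sample covariance matrix, using a plain numerical tolerance as a stand-in for the statistical tests described later in this chapter:

```python
import numpy as np

def all_three_tetrads_hold(cov, i, j, k, l, tol=1e-2):
    """Numerically check the three tetrad constraints of variables (i, j, k, l)
    in the covariance matrix `cov`, up to the tolerance `tol`.
    In practice these checks are replaced by the tests of Section 5.4.2;
    the tolerance here is only an illustrative surrogate for them."""
    t1 = cov[i, j] * cov[k, l] - cov[i, k] * cov[j, l]
    t2 = cov[i, j] * cov[k, l] - cov[i, l] * cov[j, k]
    # If these two differences vanish, the third one vanishes as well.
    return abs(t1) < tol and abs(t2) < tol
```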
This is not enough information. In Figure 5.4 (repeated from Figure 3.12(a)), for instance, the latent node on the left d-separates {X_1, X_2, X_3, X_4}, and the latent on the right d-separates {X_1, X_4, X_5, X_6}. Although these one-factor models are sound, we would rather not include X_1 and X_4 in the same rule, since they are not children of the same latent. We accomplish this by detecting as many observed variables that cannot (directly) measure any common latent as possible. In this case, pairs in {X_1, X_2, X_3, X_7, X_11} × {X_4, X_5, X_6, X_9, X_10} can be separated using the CS rules of Chapter 3.
The algorithm BuildSinglePureClusters (BSPC, Table 5.1) makes use of such results in order to learn latents with their respective sets of pure indicators. However, we need an initial step called FindInitialSelection (Table 5.2) for the same reasons explained in Appendix A.3: to reduce the number of false positives when applying the CS rules.

The goal of FindInitialSelection is to find pure submodels using only DisjointGroup (defined in Appendix A.3) instead of CS1 or CS2 (CS3 is not used in our implementation because it tends to commit many more false positive mistakes). These pure submodels are then used as a starting point for learning a more complete model in the remaining stages of BuildSinglePureClusters.
Algorithm FindInitialSelection
Input: Σ, a sample covariance matrix of a set of variables O

1. Start with a complete graph C over O.
2. Remove edges of pairs that are marginally uncorrelated or uncorrelated conditioned on a third variable.
3. C_0 <- C.
4. Color every edge of C as BLUE.
5. For all edges N_1 - N_2 in C, if there is no other pair {N_3, N_4} such that all three tetrad constraints hold in the covariance matrix of {N_1, N_2, N_3, N_4}, change the color of the edge N_1 - N_2 to GRAY.
6. For all pairs of variables {N_1, N_2} linked by a BLUE edge in C:
   If there exists a pair {N_3, N_4} that forms a BLUE clique with N_1 in C, and a pair {N_5, N_6} that forms a BLUE clique with N_2 in C, such that all six nodes form a clique in C_0 and DisjointGroup(N_1, N_3, N_4, N_2, N_5, N_6; Σ) = true, then remove all edges linking elements in {N_1, N_3, N_4} to elements in {N_2, N_5, N_6}.
   Otherwise, if there is no node N_3 that forms a BLUE clique with {N_1, N_2} in C, and no BLUE clique {N_4, N_5, N_6} such that all six nodes form a clique in C_0 and DisjointGroup(N_1, N_2, N_3, N_4, N_5, N_6; Σ) = true, then change the color of the edge N_1 - N_2 to YELLOW.
7. Remove all GRAY and YELLOW edges from C.
8. List_C <- FindMaximalCliques(C).
9. P_C <- PurifyIndividualClusters(List_C, C_0, Σ).
10. F_C <- FilterRedundant(P_C).
11. Let Selection be the set of all elements in P_C.
12. Add all GRAY and YELLOW edges back to C.
13. Return (Selection, C, C_0).

Table 5.2: Selects an initial pure model.
The definition of FindInitialSelection in Table 5.2 is slightly different from the one in Appendix A.3. It is still the case that if a pair {X, Y} cannot be separated into different clusters, but also does not participate in any true instantiation of DisjointGroup in Step 6 of Table 5.2, then this pair will be connected by a GRAY or YELLOW edge: this indicates that these two nodes cannot be in a pure submodel with two latents and three indicators per latent. Otherwise, these nodes are "compatible", meaning that they might be in such a pure model. This is indicated by a BLUE edge.

In FindInitialSelection we then find cliques of compatible nodes (Step 8). Each clique is a candidate for a one-factor model (a latent model with one latent only). We purify every clique found to create pure one-factor models (Step 9). This avoids using clusters that are large not because their elements are all unique children of the same latent, but because there was no way of separating those elements.
Algorithm PurifyIndividualClusters
Inputs: Clusters, a set of subsets of some set O;
        G_0, an undirected graph;
        Σ, a sample covariance matrix of O.

1. Output <- ∅
2. Repeat Steps 3-8 below for all Cluster ∈ Clusters
3. If Cluster has two variables {X, Y} only, verify if there are two other variables W and Z in O such that σ_XY σ_WZ = σ_XW σ_YZ = σ_XZ σ_WY and all variables in {W, X, Y, Z} are adjacent in G_0. If true, add Cluster to Output.
4. If Cluster has three variables {X, Y, Z} only, verify if there is a fourth variable W in O such that σ_XY σ_WZ = σ_XW σ_YZ = σ_XZ σ_WY and all variables in {W, X, Y, Z} are adjacent in G_0. If true, add Cluster to Output.
5. If Cluster has more than three variables:
6.    For each pair of variables {X, Y} in Cluster, if there is no pair of nodes {W, Z} ⊆ Cluster such that σ_XY σ_WZ = σ_XW σ_YZ = σ_XZ σ_WY, add a GRAY edge X - Y to Cluster.
7.    While there are GRAY edges in Cluster, remove the node with the largest number of adjacent nodes.
8.    If Cluster has more than three variables, add it to Output. Otherwise, add it to Output if and only if the criteria in Steps 3 or 4 can be applied.
9. Return Output.

Table 5.3: Identifying the pure measures per cluster.
After we find pure one-factor models, we filter those that are judged to be redundant. For instance, if two sets in P_C have a common intersection of at least three variables, we know that theoretically they are related to the same latent (this follows from Corollary 5.1). We order the elements in P_C by size² and remove sets that either have a large enough intersection with a previously added set, or whose elements (all but possibly one) are contained in the union of the previously added sets. Table 5.4 describes this process in more detail.
5.4.2   Statistical   tests  for  discrete  models
It is clear that the same tetrad constraints used in the continuous case can be applied to the underlying variables in the respective latent trait model. The difference lies in how to test such constraints. For the continuous case, there are fast Gaussian and large-sample distribution-free tests of tetrad constraints, but for latent trait models such tests are relatively expensive.
To test if a tetrad constraint σ_XZ σ_WY = σ_XW σ_YZ holds, we fit a latent trait model with two latents {η_1, η_2}, where η_1 is a parent of η_2³. Each latent has two underlying variables as children: X* and Y* for η_1; W* and Z* for η_2. Each underlying variable has the respective observed indicator.
² Ties are broken randomly in our implementation. Instead, one can implement different criteria, such as the sum of the absolute values of the polychoric correlations within each set in P_C.
³ This model is probabilistically identical to the one where the edge η_1 → η_2 is reversed.
Algorithm FilterRedundant
Inputs: Clusters, a set of subsets of some set O.

1. Output <- ∅.
2. Sort Clusters by size in descending order.
3. For all elements Cluster ∈ Clusters, according to the given order:
4.    If there is some element in Output that intersects Cluster in three or more variables, skip to the next element of Clusters.
5.    Let N be the number of elements of Cluster.
6.    If at least N - 1 elements of Cluster are present in the union of the elements of Output, skip to the next element of Clusters.
7.    Otherwise, add Cluster to Output.
8. Return Output.

Table 5.4: Filtering redundant clusters.
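Table 5.4 translates almost line-for-line into code. The sketch below assumes clusters are represented as Python sets of variable names (a representational choice of ours); ties in the size ordering, which our implementation breaks randomly, are left here to the deterministic sort:

```python
def filter_redundant(clusters):
    """Sketch of Table 5.4: keep a cluster only if it does not heavily overlap
    with, and is not largely covered by, previously kept clusters.

    clusters : iterable of sets of observed variable names.
    Returns the retained clusters, largest first."""
    output = []
    for cluster in sorted(clusters, key=len, reverse=True):
        # Step 4: skip clusters sharing three or more variables with a kept one.
        if any(len(cluster & kept) >= 3 for kept in output):
            continue
        covered = set().union(*output) if output else set()
        # Step 6: skip clusters with at least N - 1 elements already covered.
        if len(cluster & covered) >= len(cluster) - 1:
            continue
        output.append(cluster)
    return output
```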
The tetrad will be judged to hold in the population if the model passes a χ² test at a pre-defined significance level (Bartholomew and Knott, 1999). Testing if all three tetrads hold is analogous, using a single latent η.
Ideally, one would like to use full-information methods, i.e., methods where all parameters are fit simultaneously, such as the maximum likelihood estimator (MLE). However, finding the MLE is relatively expensive computationally even for a small model of four variables. Since our algorithm might require thousands of such estimations, this is not a feasible method.
Instead, we use a three-stage approach. Similar estimators are used, for instance, in commercial systems such as LISREL (Joreskog, 2004). Testing a latent trait model is done by the following steps:

1. Let X be an ordinal variable taking values in the space {1, 2, ..., m(X)}. Estimate the threshold parameters τ^X_1, ..., τ^X_{m(X)} by direct inversion of the normal cumulative distribution function Φ using the empirical counts. That is, given the marginal empirical counts n^X_1, ..., n^X_{m(X)} corresponding to the values of X in a sample of size N, estimate τ^X_1 as Φ^{-1}(n^X_1 / N). Estimate τ^X_{m(X)} as Φ^{-1}(1 - n^X_{m(X)} / N). Estimate τ^X_i, 1 < i < m(X), as Φ^{-1}((n^X_i + n^X_{i-1} + ... + n^X_1) / N).
2. In this step we estimate the polychoric correlation independently for each pair. This is done by maximum likelihood. Let the model log-likelihood function for a pair {X, Y} be given by

$$L = \sum_{i=1}^{m(X)} \sum_{j=1}^{m(Y)} n_{ij} \log \pi_{ij}(\rho) \qquad (5.1)$$

where π_ij(ρ) is the population probability of the event {X = i, Y = j} with polychoric correlation ρ, and n_ij is the corresponding empirical count. Probability π_ij(ρ) is given by

$$\pi_{ij}(\rho) = \int_{\tau^{X}_{i}}^{\tau^{X}_{i+1}} \int_{\tau^{Y}_{j}}^{\tau^{Y}_{j+1}} \phi_2(u, v, \rho)\, du\, dv \qquad (5.2)$$
where φ_2 is the bivariate normal density function with zero mean and correlation coefficient ρ. Thresholds are fixed according to the previous step. We therefore optimize (5.1) with respect to ρ only. Gradient-based optimization methods can be used here. (A code sketch of these first two stages is given after this list.)
3. Given all estimates of polychoric correlations, we have an estimate Σ(ρ̂) of the correlation matrix of the underlying variables. To test the corresponding latent trait model, we fit Σ(ρ̂) to the factor analysis model corresponding to the latents and underlying variables to get an estimate of the coefficient parameters. We then calculate the expected cell probabilities and return the p-value corresponding to the χ² test.
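A sketch of the first two stages using scipy; the clipping of the infinite outer thresholds, the bounded one-dimensional search over ρ, and the small floor on cell probabilities are numerical choices of this illustration, not details prescribed by the thesis or by LISREL:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def estimate_thresholds(x, m):
    """Stage 1: thresholds from cumulative empirical proportions of an
    ordinal sample x taking values in {1, ..., m}."""
    counts = np.array([(x == v).sum() for v in range(1, m + 1)])
    cum = np.cumsum(counts)[:-1] / len(x)       # estimates of P(X <= i), i < m
    return norm.ppf(cum)

def polychoric_mle(x, y, mx, my):
    """Stage 2: maximum likelihood estimate of the polychoric correlation of
    two ordinal samples x and y, with thresholds fixed from Stage 1."""
    tx = np.concatenate(([-8.0], estimate_thresholds(x, mx), [8.0]))  # +-8 ~ +-inf
    ty = np.concatenate(([-8.0], estimate_thresholds(y, my), [8.0]))
    counts = np.array([[np.sum((x == i + 1) & (y == j + 1)) for j in range(my)]
                       for i in range(mx)])

    def neg_loglik(rho):
        cov = [[1.0, rho], [rho, 1.0]]
        def cdf(a, b):
            return multivariate_normal.cdf([a, b], mean=[0.0, 0.0], cov=cov)
        ll = 0.0
        for i in range(mx):
            for j in range(my):
                # Rectangle probability pi_ij(rho) as in equation (5.2).
                p = (cdf(tx[i + 1], ty[j + 1]) - cdf(tx[i], ty[j + 1])
                     - cdf(tx[i + 1], ty[j]) + cdf(tx[i], ty[j]))
                ll += counts[i, j] * np.log(max(p, 1e-12))
        return -ll   # equation (5.1), negated for minimization

    res = minimize_scalar(neg_loglik, bounds=(-0.99, 0.99), method="bounded")
    return res.x
```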
The drawback of this estimator is that it is not as statistically efficient as the MLE. This means that our method is unreliable with small sample sizes. We recommend a sample size of at least 1,000 data points, even for binary variables. An open problem is adjusting for the actual sample size used in the test, since the estimated covariance matrix among underlying variables has more variance than the sample covariance matrix that would be estimated if such variables were observed. Therefore, this indirect test of tetrad constraints among latent variables has less power than the respective test for observed variables used in Chapter 3. However, false positives are still the main concern of any causal discovery algorithm that relies on hypothesis testing.
In our implementation, we use significance tests in two ways to minimize false positives and false negatives in rules CS1 and CS2. These rules have in their premises tetrad constraints that need to be true or need to be false in order for a rule to apply. For those constraints that need to be true, we require the corresponding p-value to be at least 0.10. For those constraints that need to be false, we require the corresponding p-value to be at most 0.01. These values were chosen by doing preliminary simulations.
5.5   Empirical   evaluation
In the following sections we evaluate BSPC in a series of simulated experiments where the ground truth is known. We also report exploratory results on two real data sets. In the simulated cases, we report statistics about the number of association rules that the standard algorithm Apriori (Agrawal and Srikant, 1994) returns on the same data. The goal is to provide evidence that, in the presence of latent variables, association rule mining might produce thousands of rules while still failing to capture the causal processes that are essential in policy making.
The Apriori algorithm is an efficient search procedure that generates association rules in two phases. We briefly describe it for the case where variables are binary. In the first phase, all sets of variables that have high support⁴ are found. This search is made efficient by first constructing sets of small size and only looking for larger sets by expanding small sets that are frequent enough. Notice that this only generates sets of positive association. Within each frequent set, Apriori then finds conditional probabilities that have high confidence⁵.
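For reference, here is a compact sketch of those two phases for binary 0/1 data. The thresholds, the cap on itemset size, and the data representation are arbitrary choices for this illustration; efficient implementations such as the one by Borgelt and Kruse (2002) are organized quite differently:

```python
from itertools import combinations
import numpy as np

def apriori_binary(data, min_support=0.3, min_confidence=0.8, max_size=3):
    """Toy two-phase Apriori for a binary 0/1 data matrix (rows = records).

    Phase 1 enumerates frequent itemsets (columns that are jointly 1 in at
    least `min_support` of the records); Phase 2 extracts high-confidence
    rules A -> B from each frequent itemset."""
    n, d = data.shape
    support = lambda items: np.all(data[:, list(items)] == 1, axis=1).mean()

    frequent = [frozenset([j]) for j in range(d) if support([j]) >= min_support]
    all_frequent = list(frequent)
    while frequent and len(next(iter(frequent))) < max_size:
        # Grow candidates by one item; keep only those that are still frequent.
        candidates = {s | {j} for s in frequent for j in range(d) if j not in s}
        frequent = [s for s in candidates if support(s) >= min_support]
        all_frequent.extend(frequent)

    rules = []
    for items in all_frequent:
        if len(items) < 2:
            continue
        for r in range(1, len(items)):
            for antecedent in map(frozenset, combinations(items, r)):
                consequent = items - antecedent
                conf = support(items) / support(antecedent)
                if conf >= min_confidence:
                    rules.append((set(antecedent), set(consequent), conf))
    return rules
```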
5.5.1   Synthetic  experiments
Let G be our true graph, from which we want to extract features of the measurement model as causal rules. The graph is known to us by simulation, but it is not known to the algorithm.
⁴ That is, they co-occur in a large enough number of database records, according to some given threshold.
⁵ That is, given a frequent set of binary variables X = {X_1, X_2, ..., X_k}, it attempts to find a partition X = X_A ∪ X_B such that P(X_B = 1 | X_A = 1) is above some threshold.
Figure 5.5: The measurement models MM1, MM2 and MM3 used in our simulation studies.
The goal of the experiments with synthetic data is to objectively measure the performance of BSPC in finding correct and informative latent causal rules for ordinal variables generated from G.
Correctness in our setup is measured by a Precision statistic. That is,

• given 𝒮, a set of latent causal rules, and S_i ∈ 𝒮 a particular rule, the individual precision of S_i is the proportion of observed variables in S_i that are d-separated given a unique latent in G. The precision of the set 𝒮 is the average of the respective individual precisions.

For example, if 𝒮 = {S_1, S_2, S_3}, 4 out of 5 observed variables in S_1 are d-separated by a latent L_x in G, 3 out of 3 observed variables in S_2 are d-separated by a latent L_y in G, and 2 out of 3 observed variables in S_3 are d-separated by a latent L_z in G, then the precision of 𝒮 is (4/5 + 1 + 2/3)/3 ≈ 0.82.
Completeness in our setup is measured by a Recall statistic. That is,

• given 𝒮, a set of latent causal rules, the recall of 𝒮 is the proportion of latents L_i in G such that there is at least one rule in 𝒮 containing at least two children of L_i and at most one observed variable that is not a child⁶ of L_i.

For example, if G has four latents, and three of them are represented by some rule in 𝒮 as described above, then the recall of 𝒮 is 0.75.
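Both statistics are easy to compute once the true model is known. The sketch below assumes the ground truth is summarized as a map from each observed variable to the latent that d-separates it, which is a simplification of the full d-separation check but adequate for pure models such as those in Figure 5.5:

```python
from collections import Counter

def precision(rules, true_latent_of):
    """Average, over rules, of the fraction of variables in each rule that
    belong to the rule's best-matching latent in the true graph."""
    per_rule = []
    for rule in rules:
        counts = Counter(true_latent_of[v] for v in rule)
        per_rule.append(counts.most_common(1)[0][1] / len(rule))
    return sum(per_rule) / len(per_rule)

def recall(rules, true_latent_of):
    """Fraction of true latents covered by some rule that contains at least
    two of the latent's children and at most one variable that is not."""
    latents = set(true_latent_of.values())
    covered = set()
    for rule in rules:
        counts = Counter(true_latent_of[v] for v in rule)
        latent, hits = counts.most_common(1)[0]
        if hits >= 2 and len(rule) - hits <= 1:
            covered.add(latent)
    return len(covered) / len(latents)

# Example: the precision of {S_1, S_2, S_3} from the text is (4/5 + 1 + 2/3)/3.
```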
In our study we use the three graphs depicted in Figure 5.5, similar to some of the graphs used in Chapter 3. MM1 already has a pure measurement model. MM2 has two possible pure submodels: one including {X_1, ..., X_10, X_12, X_14}, and another including X_15 instead of X_14. MM3 has the same pure measurement models as MM2, with the addition of indicator X_16.
Notice that in our experiments all latents are potentially identifiable by BSPC. The goal is not to test its assumptions, but to evaluate how well it performs on finite samples.
Given each graph, we generated 20 parametric models. 10 of these models were used to generate samples of 1,000 cases. The remaining 10 were used to generate samples of 5,000 cases. The total number of runs of our algorithm is therefore 60. To facilitate comparison against Apriori, all observed variables are binary. The sampling scheme used to generate synthetic models and data was as follows:

1. Pick coefficients for each edge in the model randomly from the interval [-1.5, -0.5] ∪ [0.5, 1.5] (all latents and underlying variables can have zero mean without loss of generality).
⁶ Since in some cases it is not theoretically possible to rule out this possibility (Chapter 3).
Evaluation of BSPC output

        Sample   Precision    Recall      #Rules
MM1     1000     1.00 (.0)    0.97 (.1)   3.2 (.4)
        5000     0.98 (.05)   0.97 (.1)   2.9 (.3)
MM2     1000     0.94 (.04)   1.00 (.0)   3.2 (1.03)
        5000     0.94 (.05)   1.00 (.0)   3.4 (0.70)
MM3     1000     0.90 (.06)   0.90 (.16)  4.2 (.91)
        5000     0.90 (.08)   0.90 (.22)  3.5 (.52)

Table 5.5: Results obtained with BuildSinglePureClusters for the problem of learning measurement models as causal rules. Each number is an average over 10 trials, with the standard deviation over these trials in parentheses.
2. Pick variances for the exogenous nodes (i.e., latents without parents and error nodes) from the interval [1, 3].

3. Normalize coefficients such that all underlying variables have variance 1.

4. For each of the two values of a given observed binary variable, generate a random integer in {1, ..., 5} as the weight of the value. Normalize the weights to sum to 1. Set each threshold τ_k to Φ^{-1}(S_k), where Φ^{-1} is the inverse of the cumulative distribution function of a standard normal variable, and S_k is the sum of the weights of values 1, ..., k. (A code sketch of this sampling scheme is given below.)
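A minimal sketch of this scheme for a single latent with binary indicators (a toy one-factor structure rather than MM1-MM3; the empirical standardization of the underlying variables stands in for the exact normalization of step 3):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def random_coefficient():
    """Draw a coefficient from [-1.5, -0.5] U [0.5, 1.5]."""
    return rng.choice([-1, 1]) * rng.uniform(0.5, 1.5)

def simulate_binary_indicators(n_samples, n_indicators):
    """One latent with `n_indicators` binary children, following steps 1-4."""
    latent_var = rng.uniform(1.0, 3.0)
    latent = rng.normal(0.0, np.sqrt(latent_var), size=n_samples)
    data = np.empty((n_samples, n_indicators), dtype=int)
    for j in range(n_indicators):
        lam, err_var = random_coefficient(), rng.uniform(1.0, 3.0)
        x_star = lam * latent + rng.normal(0.0, np.sqrt(err_var), size=n_samples)
        x_star /= x_star.std()                 # step 3: (approximately) unit variance
        w = rng.integers(1, 6, size=2)         # step 4: value weights in {1, ..., 5}
        tau = norm.ppf(w[0] / w.sum())         # threshold between the two values
        data[:, j] = (x_star >= tau).astype(int) + 1
    return data

sample = simulate_binary_indicators(1000, 4)
```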
A similar sampling scheme for continuous data was used in Chapter 3. It is not an easy setup, and some of the variables might have a high proportion of their variance due to the error terms. Factor analysis failed to produce meaningful results under this sampling scheme, for instance.
Results are displayed in Table 5.5 using the evaluation criteria introduced at the beginning of this section. We also display the number of rules that are generated. Ideally, in all cases we should generate exactly 3 rules. However, due to statistical mistakes, more or fewer than 3 rules can be generated. It is noticeable that there is a tendency to produce more rules than necessary as the measurement model increases in complexity. It is also worth pointing out that without the filtering described in the previous section, we obtain around 5 to 8 rules in most of the experiments, with a larger difference between the results at a sample size of 1,000 compared to 5,000.
As a comparison, we report the distribution of rules generated by Apriori in Table 5.6. The implementation used is the one of Borgelt and Kruse (2002) with the default parameters. We report the maximum and minimum number of rules for each model and sample size across the 10 trials, as well as the average and standard deviation. The outcome is that not only does Apriori generate a very large number of rules, but the actual number per trial also varies enormously. For MM1 at sample size 5,000, we had a trial with as few as 9 rules, and one with as many as 546, even though the causal process that generated the data is the same across trials.
5.5.2   Evaluations  on  real-world  data
This section describes the use of BSPC on two real-world data sets. Unlike the synthetic data study, we do not have an objective measure of evaluation. Instead, we will use data sets whose results
Apriori statistics

        Sample   MIN   MAX    AVG      STD
MM1     1000     15    159    81       59.4
        5000     9     546    116      163.9
MM2     1000     243   2134   1070.4   681.2
        5000     336   3565   1554.7   1072.2
MM3     1000     363   6036   2916.7   1968.7
        5000     158   4434   2608.3   1214.6

Table 5.6: Results obtained with Apriori. Our goal is to evaluate how many association rules are generated when hidden variables are the explanation of all observed associations. Not only is the number of rules overwhelming, but the algorithm also exhibits high variability in that number. For each combination of model (MM1, MM2 and MM3) and sample size (1000, 5000) we show the smallest number of rules (MIN) across 10 independent trials, the maximum number (MAX), the average (AVG) and the standard deviation (STD).
Figure 5.6: A theoretical model for the voting data set is shown in (a), while the BSPC output is shown in (b).
can  be  reasonably  evaluated  by  using  common-sense  knowledge.
Political   action  survey
We start with a very simple example consisting of a data set of six variables only. The data set is the simplified political action survey data set discussed in Section 5.3. We used a subset of the data where missing values were filled in by a particular method given in (Joreskog, 2004). This data is available as part of the LISREL software for latent variable analysis. The model chosen by Joreskog (2004) is shown again, without the underlying variables and the latent connection, in Figure 5.6(a). Recall that variable VOTING is discarded by Joreskog (2004) for this particular data set under the argument that the question is not clearly phrased, an argument we believe to be insubstantial. In our data-driven approach, we also found two latents: one corresponding to NOSAY and VOTING; another corresponding to TOUCH and INTEREST. This is shown in Figure 5.6(b). Our output partially matches the theoretical model without making any use of prior knowledge.
Freedom  and  tolerance  data  set:   self-evaluation  of  social  attitude
We applied BSPC to the data collected in a 1987 study⁷ on freedom and tolerance in the United States (Gibson, 1991). This is a large study comprising 381 questions targeting political tolerance and perceptions of personal freedom in the United States. 1267 respondents completed the interview. Each question is an ordinal variable with 2 to 5 levels, often with an extra non-ordinal value corresponding to a "Don't know/No answer" reply.

⁷ Available at http://webapp.icpsr.umich.edu/cocoon/ICPSR-STUDY/09454.xml
However, several questions are explicitly dependent on answers given to previous questions⁸. To simplify the task, in this empirical evaluation we first focus on a particular section of this questionnaire, Deck 6. Another subsection of the study is used in a separate experiment described in the next section.

This deck of questions is composed of a self-administered questionnaire of 69 items concerning an individual's attitude with respect to other people. Answers corresponding to "Don't know/No answer" usually amounted to 1% of all respondents for each question. We modified these answers on each question to correspond to the majority answer, to avoid throwing away data.

The measurement model obtained by BSPC was a set of 15 clusters (i.e., causal latent rules) where 40 out of the 69 questions appear in at least one rule. All clusters with at least three observed variables are depicted in Tables 5.7 and 5.8.
There is a clear relation among the items within most rules. For instance, the items in Rule 1 of Table 5.7 correspond to measures of a latent trait of empathy and ease of communication. This causal view of the associations among these questions makes more sense than a set of association rules without a latent variable.

Rule 2 has three items (X28, X30, X61) that clearly correspond to measures of a tendency toward impulsive reaction. The fourth item (X41) is not clearly related to this trait, but the data supports the idea that this latent trait explains the associations between pushing oneself too much and reacting strongly to other people's actions and ideas.

Rule 3 is clearly a set of indicators of the trait of deciding when to change one's mind and plan of action. Rule 4 is apparently due to a more specific latent along the same lines: unwillingness to change according to other people's opinions. It is interesting to note that it is theoretically plausible that different rules might correspond to different latents and yet share the same observed variable (X9 in this case).

Rule 5 overlaps with Rule 1, and again stresses indicators of the ability to communicate with other people and understand other people's ideas. Rule 6 is a set corresponding to a latent trait of attitude toward risk. Rule 7 seems to be explained by a trait of being energetic in implementing one's ideas. Rule 8 is a rule measuring the ability to remain calm under difficult conditions, and seems to have some overlap with Rule 6. Rule 9 is not completely clear because of item X37, and conceptually appears to overlap with Rule 7. Finally, Rule 10 is a rule where the associations are apparently due to a latent variable concerning individualism.

It is also interesting to stress that each estimated rule is composed of questions that are not physically adjacent in the actual questionnaire. Rule 1, for example, is composed of questions scattered over the response form (X3, X7, X27, X31, X67). The respondents are therefore not stimulated to respond in a similar pattern by trying to keep coherence or balance with respect to previous answers.

Although this given set of causal latent rules might not be perfect, it does explain a lot about the mechanisms behind the observed associations using very few rules.
⁸ For instance, opinions about a particular political group that was selected by the respondent in a previous question, or whole sets of answers that only a subset of the individuals are asked to fill out.
Rule  1
X27   I  feel  it  is  more  important  to  be  sympathetic  and  understanding  of  other  people  than  to  be  practical
and  tough-minded
X3   I  like  to  discuss  my  experiences  and  feelings  openly  with  friends  instead  of  keeping  them  to  myself
X31   People find it easy to come to me for help, sympathy, and warm understanding
X67   When  I  have  to  meet  a  group  of  strangers, I  am  more  shy  than  most  people
X7   I  would  like  to  have  warm and  close  friends  with  me  most  of  the  time
Rule  2
X28   I  lose  my  temper  more  quickly  than  most  people
X30   I  often  react  so  strongly  to  unexpected  news  that  I  say  or  do  things  that  I  regret
X41   I  often  push  myself  to  the  point  of  exhaustion  or  try  to  do  more  than  I  really  can
X61   I find it upsetting when other people don't give me the support that I expect from them
Rule  3
X9   I  usually  demand  very  good  practical  reasons  before  I  am  willing  to  change  my  old  ways  of  doing  things
X53   I  see  no  point  in  continuing  to  work  on  something  unless  there  is  a  good  chance  of  success
X46   I  like  to  think  about  things  for  a  long  time  before  I  make  a  decision
Rule  4
X9   I  usually  demand  very  good  practical  reasons  before  I  am  willing  to  change  my  old  ways  of  doing  things
X17   I  usually  do  things  my  own  way   rather  than  giving  in  to  the  wishes  of  other  people
X11   I  hate  to  change  the  way  I  do  things,  even  if  many  people  tell  me  there  is  a  new  and  better  way  to  do  it
Rule  5
X3   I  like  to  discuss  my  experiences  and  feelings  openly  with  friends  instead  of  keeping  them  to  myself
X40   I  am  slower  than  most  people  to  get  excited  about  new  ideas  and  activities
X12   My friends find it hard to know my feelings because I seldom tell them about my private thoughts
Table  5.7:   Clusters  of  variables  obtained  by  BSPC  on  Deck  6  of  the  Freedom  and  Tolerance  data
set.   On the left  column, the question  number according  to the original  questionnaire.   On the right
column,  the  respective  textual  description  of  the  question.
Freedom  and  tolerance  data  set:   tolerance  concerning  freedom  of  speech  and  govern-
ment  perception
We applied BSPC to the data corresponding to Decks 4 and 5 of the same study described in the previous section. We removed two questions from Deck 4 that could be answered only by some respondents (questions 58B and 59B). We did the same in Deck 5, keeping only all subitems of questions 86, 87 and 90-93. As in the data set from the previous section, every item should be answered according to an ordinal measure of agreement. Blank values and "don't know" answers were processed to reflect the opinion of the majority. The total number of items amounted to 70. The reason why we did not use any of the other decks in our experiments was mostly the interdependence between answers (i.e., an answer to one question explicitly affecting other answers, or determining which other questions should be skipped).
Questions in our 70-item data set were mostly about attitudes toward tolerance of freedom of speech, how one interacts with other people to discuss sensitive issues, and how one perceives the role of the government in freedom of speech issues. 52 items out of the 70 appear in some rule given by the output of BSPC. All rules are given in Tables 5.9, 5.10 and 5.11. Unfortunately, BSPC did not cluster these items into well-separated causal rules as in the previous cases.
There is a considerable overlap between some rules. For instance, questions about one's attitude
Rule  6
X51   Most of the time I would prefer to do something risky (like hang gliding or parachute jumping) rather
than  having  to  stay  quiet  and  inactive  for  a  few  hours
X47   Most  of  the  time  I  would  prefer  to  do  something  a  little  risky  (like  riding  in  a  fast  automobile  over
steep  hills  and  sharp  turns)   rather  than  having  to  stay  quiet  and  inactive  for  a  few  hours
X29   I am usually confident that I can easily do things that most people would consider dangerous (such as
driving  an  automobile  fast  on  a  wet  or  icy  road)
Rule  7
X52   I am satisfied with my accomplishments, and have little desire to do better
X54   I  have  less  energy  and  get  tired  more  quickly  than  most  people
X57   I  often  need  naps  or  extra  rest  periods  because  I  get  tired  so  easily
Rule  8
X8   I  nearly  always  stay  relaxed  and  carefree,  even  when  nearly  everyone  else  is  fearful
X1   I usually am confident that everything will go well, in situations that worry most people
X26   I usually stay calm and secure in situations that most people would find physically dangerous
Rule  9
X59   I  am  more  energetic  and  tire  less  quickly  than  most  people
X49   I  try  to  do  as  little  work  as  possible,  even  when  other  people  expect  more  of  me
X37   I often avoid meeting strangers because I lack confidence with people I do not know
Rule  10
X15   It wouldn't bother me to be alone all the time
X58   I don't go out of my way to please other people
X38   I  usually  stay  away  from  social  situations  where  I  would  have to  meet  strangers, even  if  I  am  assured
that  they  will  be  friendly
Table  5.8:   Continuation  of  Table  5.7.
about discussing polemical/sensitive opinions, better reflected by Rule 11 (Table 5.11), are scattered across other rules. Questions concerning the Supreme Court (Rule 1, Table 5.9) are not in a rule of their own, as one would expect a priori. Questions about the expression of racist opinions are also scattered. Questions about indirect demonstrations of support (wearing buttons, putting a sign in front of one's house), as in Rule 6 (Table 5.10), are well clustered, but still mixed with barely related questions. Although every rule (perhaps with the exception of Rule 3, Table 5.9) might be individually interpreted as measuring one broad latent concept concerning freedom of speech, from a global point of view some groups of questions are intuitively measuring a more specific trait (e.g., attitude with respect to the Supreme Court). This partially undermines the results, since the given clustering is not as informative as it could be. An interesting question for future research is whether more statistically robust approaches for learning discrete measurement models could detect more fine-grained differences in this data set, or whether the data itself is too noisy to allow further conclusions.
5.6   Summary
We introduced a novel algorithm for finding associations among discrete variables that are due to hidden common causes. It can be described as a method for clustering variables based on explicit causal assumptions.
Our emphasis on comparing BSPC with association rules is due to the fact that neither approach tries to find a global model that includes all variables, and both are primarily used for policy making. That is, they are used in the deduction of causal processes through a combination of data-driven submodels and prior knowledge. However, generic latent variable models are usually ad hoc, unlike BSPC.
One method is not intended to substitute for the other. Latent trait models rely on substantial parametric assumptions, while association rules do not. Association rules can also be much more scalable when the required rule supports are relatively high and the data is sparse. However, standard association rules, or even causal rules, do not make use of latent variables, which might result in a very complicated and ultimately unusable model for policy making.
The assumption of a Gaussian distribution for the latent variables was essential to the approach described here. Bartholomew and Knott (1999) argue that for domains such as the social sciences and econometrics, such assumptions are not harmful if the goal is parameter estimation. However, two issues remain unclear: how well the tetrad tests work with small deviations from normality; and which kind of output will be generated if the model deviates considerably from the assumptions (i.e., whether a nearly empty model will be generated, which is good, or whether a large spurious model will be the output instead). Work in non-parametric item response theory (Junker and Sijtsma, 2001) might provide more flexible causal models, although it is unclear how robust such methods would be.
Scalability is also a very important issue. Fast clustering procedures for discrete variables, such as the one proposed by Chakrabarti et al. (2004), might be crucial as an initialization procedure, splitting the task of finding one-factor models across disjoint sets of variables.
Rule  1
X7   Should  we  allow  a  speech  extremely  critical  of  the  U.  S.  Constitution?
X12   It  is  better  to  live  in  an  orderly  society  than  to  allow  people  so  much  freedom  that  they  can
become  disruptive.
X14   Free  speech  is  just  not  worth  it  if  it  means  that  we  have  to  put  up  with  the  danger  to  society
of  radical  and  extremist  political  views.
X15   When  the  country  is  in  great danger  we  may  have  to  force  people  to  testify  against themselves  in
court  even  if  it  violates  their  rights.
X17   No matter what a person's political beliefs are, he is entitled to the same legal rights and
protections  as  anyone  else.
X19   Any person who hides behind the laws when he is questioned about his activities doesn't
deserve  much  consideration.
X24   Would  you  say  you  engage  in  political  discussions  with  your  friends?
X31   Would  you  be  willing  to  sign  a  petition  that  would  be  published  in  the  local  newspaper  with
your  name  on  it  supporting  the  unpopular  political  view?
X43   Do  you  think  the  government would  allow  you  to  organize a  nationwide  strike  of  all  workers
to  oppose  the  actions  of  the  government?
X46   If  the  Supreme  Court  continually  makes  decisions  that  the  people  disagree  with,  it  might  be
better  to  do  away  with  the  Court  altogether.
X48   It would not make much difference to me if the U.S. Constitution were rewritten so as to
reduce  the  powers  of  the  Supreme  Court.
X49   The  power  of  the  Supreme  Court  to  declare  acts  of  Congress unconstitutional  should  be  eliminated.
X50   The  right  of  the  Supreme  Court  to  decide  certain  types  of  controversial issues  should  be
limited  by  the  Congress.
Rule  2
X3   If  such  a  person  wanted  to  make  a  speech  in  your  community  claiming  that  Blacks  are  inferior,  should
he  be  allowed  to  speak,  or  not?
X6   Should  such  a  person  be  allowed  to  organize  a  march  in  your  community,  and  claim  that  Blacks
are  inferior?
X20   Because  demonstrations  frequently  become  disorderly  and  disruptive,  radical  and  extremist  political
groups shouldn't be allowed to demonstrate.
X39   Would  you  be  allowed  to  publish  pamphlets  to  oppose  the  actions  of  the  government?
X40   Would  you  be  allowed  to  organize  protest  marches  and  demonstrations  to  oppose  the  actions
of  the  government?
X43   Do  you  think  the  government would  allow  you  to  organize a  nationwide  strike  of  all  workers to  oppose
the  actions  of  the  government?
X59   How  likely  is  it  that  you  would  try  to  get  the  government to  stop  the  demonstration  (of  an  undesired
group)?
X60   How  likely  is  it  that  you  would  try  to  get  people  to  go  to  the  demonstration  (of  an  undesired  group)
and  stop  it  in  any  way  possible,  even  if  it  meant  breaking  the  law?
X62   Or  would  you  do  nothing  to  try  to  stop  the  demonstration  from  taking  place?
Rule  3
X6   Should  such  a  person  be  allowed  to  organize  a  march  in  your  community,  and  claim  that  Blacks
are  inferior?
X9   Do  you  believe  it  should  be  allowed...   A  speech  advocating  the  overthrow of  the  U.S.  Government.
X21   I  believe  in  free  speech  for  all,  no  matter  what  their  views  might  be.
X65   How likely would you be to try to get the legislature's decision reversed by some other
governmental body  or  court?
Table 5.9: Clusters of variables obtained by BSPC on Decks 4 and 5 of the Freedom and Tolerance data set. On the left column, the question number according to the order in which the questions appear in the original questionnaire. On the right column, a simplified textual description of the question. See Gibson (1991) for more details.
Rule  4
X14   Free  speech  is  just  not  worth  it  if  it  means  that  we  have  to  put  up  with  the  danger  to  society  of
radical  and  extremist  political  views
X22   It is refreshing to hear someone stand up for an unpopular political view, even if most people find the view offensive.
X25   Would  you  say  you  engage  in  political  discussions  with  casual  acquaintances?
X43   Do  you  think  the  government would  allow  you  to  organize a  nationwide  strike  of  all  workers to  oppose
the  actions  of  the  government?
X44   Now, on a different subject, some people pay attention to what the United States Supreme Court is doing most of the time. Others aren't that interested. Would you say that you pay attention to the Supreme Court most of the time, some of the time, or hardly at all?
X68   My  local  government council  usually  gives  interested  citizens  an  opportunity  to  express  their  views  before
making  its  decisions.
Rule  5
X5   If  some  people  in  your  community  suggested  that  a  book  he  wrote  which  said  Blacks  are  inferior  should  be
taken  out  of  your  public  library,  would  you  favor removing  this  book,  or  not?
X11   Do  you  believe  a  speech  that  might  incite  listeners  to  violence  should  be  allowed?
X15   When  the  country  is  in  great danger  we  may  have  to  force  people  to  testify  against themselves  in  court  even
if  it  violates  their  rights.
X18   Do  you  agree  strongly,  agree,  disagree,  or  disagree  strongly  with  this:   Free  speech  ought  to  be  allowed  for
all  political  groups  even  if  some  of  the  things  they  say  are  highly  insulting  and  threatening  to  some
segments  of  society.
X19   Any person who hides behind the laws when he is questioned about his activities doesn't deserve much
consideration.
X20   Because  demonstrations  frequently  become  disorderly  and  disruptive,  radical  and  extremist  political
groups shouldn't be allowed to demonstrate.
X38   Do  you  think  the  government would  allow  you  to  organize public  meetings  to  oppose  the  government?
X40   Would  you  be  allowed  to  organize  protest  marches  and  demonstrations  to  oppose  the  actions  of  the
government?
X59   How  likely  is  it  that  you  would  try  to  get  the  government to  stop  the  demonstration?
Rule  6
X31   Would  you  be  willing  to  sign  a  petition  that  would  be  published  in  the  local  newspaper  with  your
name  on  it  supporting  the  unpopular  political  view?
X32   Would  you  be  willing  to  wear  a  button  to  work  or  in  public  in  support  of  the  unpopular  view?
X33   Would  you  be  willing  to  put  a  bumper  sticker  on  your  car  in  support  of  that  position?
X34   Would  you  be  willing  to  put  a  sign  in  front  of  your  home  or  apartment  in  support  of  the  unpopular  view?
X35   Would  you  be  willing  to  participate  in  a  demonstration  in  support  of  that  position?
X45   In  general,  would  you  say  that  the  Supreme  Court  is  too  liberal  or  too  conservative or  about
right  in  its  decisions?
X49   The  power  of  the  Supreme  Court  to  declare  acts  of  Congress unconstitutional  should  be  eliminated.
Rule  7
X1   Should  we  allow  a  speech  extremely  critical  of  the  U.  S.  Constitution?
X2   Do  you  think  that  a  book  he  (a  writer  with  racist views)  wrote  should  be  removed  from  a  public  library?
X8   Should  we  allow  a  speech  extremely  critical  of  various  minority  groups?
X67   The  members  of  my  local  government council  seldom  consider  the  views  of  all  sides  to  an  issue  before
making  a  decision.
Table  5.10:   Continuation  of  Table  5.9.
Rule  8
X4   Should  such  a  person  (a  person  of  racist position)  be  allowed to  teach  in  a  college  or  university,  or  not?
X6   Should  such  a  person  be  allowed  to  organize  a  march  in  your  community,  and  claim  that  Blacks
are  inferior?
X8   Should  we  allow  a  speech  extremely  critical  of  various  minority  groups?
X9   Should  we  allow  a  speech  advocating  the  overthrow of  the  U.S.  Government?
X10   Should  we  allow  a  speech  designed  to  incite  listeners  to  violence?
X59   How  likely  is  it  that  you  would  try  to  get  the  government to  stop  the  demonstration?
X65   How likely would you be to try to get the legislature's decision reversed by some other governmental
body  or  court?
X66   How  likely  is  it  that  you  would  do  nothing  at  the  moment  but  vote  against  the  members  of  the
local  legislature  at  the  next  election?
Rule  9
X23   Would  you  say  you  engage  in  political  discussions  with  your  family?
X29   Best  not  to  say  anything  about  (polemical  issues)  to  casual  acquaintances.
X30   Best  not  to  say  anything  about  (polemical  issues)  to  your  neighbors.
X44   Now, on a different subject, some people pay attention to what the United States Supreme Court is
doing most of the time. Others aren't that interested. Would you say that you pay attention to the
Supreme  Court  most  of  the  time,  some  of  the  time,  or  hardly  at  all?
X46   If  the  Supreme  Court  continually  makes  decisions  that  the  people  disagree  with,  it  might  be  better
to  do  away with  the  Court  altogether.
X48   It would not make much difference to me if the U.S. Constitution were rewritten so as to reduce the
powers  of  the  Supreme  Court.
Rule  10
X28   Have  you  ever  had  a  political  view  that  was  so  unpopular  that  you  thought  it  best  not  to  say
anything  about  it  to  your  friends?
X51   I  am  sometimes  reluctant  to  talk  about  politics  because  it  creates  enemies.
X56   I am sometimes reluctant to talk about politics because I don't like arguments.
Rule  11
X26   Would  you  say  you  engage  in  political  discussions  with  your  neighbors?
X27   Best  not  to  say  anything  about  (polemical  issues)  to  your  family.
X28   Best  not  to  say  anything  about  (polemical  issues)  to  your  friends.
X30   Best  not  to  say  anything  about  (polemical  issues)  to  your  neighbors.
Table  5.11:   Continuation  of  Table  5.10.
Chapter  6
Bayesian learning and generalized rank constraints
BuildPureClusters is an algorithm for learning the causal structure of latent variable models by testing tetrad constraints at a given significance level. In Chapter 3, a large batch of experiments demonstrated that this algorithm is robust for multivariate Gaussian distributions. However, this will not be the case for more complicated distributions such as mixtures of Gaussians. In this chapter, we introduce a score-based algorithm based on the principles of BuildPureClusters that is more effective in handling mixture-of-Gaussians distributions.
Moreover, we evaluate how a modification of this algorithm can be used for the problem of density estimation. This is motivated by several algorithms based on factor analysis and its variants that are used in unsupervised learning (i.e., density estimation). Such algorithms have applications in, e.g., outlier detection and classification with missing data. In factor analysis for density estimation, the goal is to smooth the data by introducing rank constraints on the covariance matrix of the observed variables. Our modified algorithm searches for rank constraints in a relatively efficient way inspired by the clustering idea of BuildPureClusters. Experiments demonstrate the suitability of this approach.
6.1   Causal   learning  and  non-Gaussian  distributions
In Chapter 4, we performed experiments using BuildPureClusters to find a measurement model for a set of latents whose distribution deviated considerably from a multivariate Gaussian. Conditioned on the latents, however, the observed variables were still Gaussian. The performance of the algorithm was not as good as in the experiments of Chapter 3, where all variables were multivariate Gaussian, but it was still reasonable.
Results get considerably worse when the population follows a mixture-of-Gaussians distribution, where observed variables are not Gaussian given the latents: for instance, when each conditional distribution of an indicator given its latent parents also depends on the mixture component. In this case, the number of false positive tests of tetrad constraints is high even for reasonable sample sizes. In simulation studies using the same graphs of Chapter 3 and a mixture-of-Gaussians model, one can show that BPC will return a mostly empty model.
This chapter describes alternative algorithms inspired by BuildPureClusters to learn a graphical structure using a mixture of Gaussians model. The focus on mixtures of Gaussians is due to two main reasons:
• first, in causal models it is of interest to model a mixture of Gaussian-distributed populations that follow the same causal linear structure, but with different parameters (e.g., the distribution of physiological measurements given the latent factors of interest might differ across genders, and yet the graphical structure of the measurement model is the same). Since the variable determining the mixture component can be hidden, we need a mixture of Gaussians approach in this case;
• second, a mixture of Gaussians is a practical and flexible model for the multivariate distribution of a population (Roeder and Wasserman, 1997; Mitchell, 1997), especially when data is limited and more sophisticated models cannot be estimated reliably.
Instead  of   relying  on  an  algorithm  for   constraint-satisfaction   learning  of   causal   graphs,   we
present  an  alternative  score-based  approach  for  the  problem.   In  particular,  Silva  (2002)  described
the score-based  Washdown algorithm  for  learning pure measurement  models with  Gaussian  data.
The outline of the algorithm is as follows:

1. Start with a one-factor model using all observed variables.

That is, create a model with a single latent that is the common parent of all observed variables. This is illustrated at the top of Figure 6.1.

2. Until the model passes a significance test (using the χ² test), remove from the model the indicator whose removal most increases the likelihood of the model.

That is, given the latent variable model with k indicators, consider all submodels with k − 1 indicators that are generated by removing one indicator. Choose the one with the highest likelihood (this is analogous to the purification step in BuildPureClusters, as described in Appendix A.3) and iterate. This is illustrated in Figure 6.1.

3. If some node was removed in the previous step, add a new latent to the model, make it a child of all other latents, and re-insert all removed nodes as children of the next latent in the sequence. Go back to Step 2.
That is, suppose indicator X_i, which is a child of latent L_j, was removed in the previous step. We now introduce X_i back into the model, but as a child of latent L_{j+1}. If latent L_{j+1} does not exist, create it. There is a natural order for the latents in Washdown, since one latent is created at a time, and we move X_i to the next latent according to this order. Latents are fully connected to avoid introducing other constraints besides those that are a result of the given measurement model. Figure 6.3 illustrates a simple case of Washdown, where the algorithm reconstructs a pure submodel of the true model shown in Figure 6.2.
The motivation for this algorithm is as follows: in Step 2, if there is some tetrad constraint that is entailed by the candidate model but that does not hold in the true model, we expect that removing one of the nodes that participate in this invalid constraint will increase the fit of the model. Heuristically, one expects that the node that most violates the implied tetrad constraints according to the data will be the one chosen in Step 2.
Figure 6.1: Washdown iteratively removes one indicator at a time by choosing the submodel with the highest likelihood. In this example, we start with the model on the top, evaluate 6 possible candidates, and choose to remove X_2. Given this new graph, we evaluate 5 possible candidates, and decide to remove X_6.
This is a heuristic, and it is not guaranteed to return a pure model even if one exists; see the results in Appendix C.2 for an explanation.
However,  if some pure model is returned, and it passes a statistical test, then at least asymptot-
ically one can guarantee that the tetrad constraints in the model should hold in the population.   By
the  theoretical  results  from Chapter  3,  if  the  returned  pure model  has  three  indicators  per  cluster,
the  implied  constraints  are  equivalent  to  a  causal  model  with  the  corresponding latents  and  causal
directions.   The  bottom  line  is  that  Washdown  is  not  guaranteed  to  return  a  structure,   but  if  it
returns  one,  then  it  should  be  correct.
In  Section  6.2  we  introduce  our  parametric  formulation  of  a  mixture  of  Gaussians.   In  Section
6.3, we will present a Bayesian version of Washdown for mixtures of Gaussians.   Experiments with
the Bayesian  Washdown are  reported in  Section  6.4,  where we  observe  that  this problem can  still
be quite difficult to solve. Based on Washdown, we provide a generalization of the algorithm for
the  problem  of   density  estimation  in  Section  6.5,   with  the  corresponding  experiments  in  Section
6.7.
6.2   Probabilistic  model
We assume the population distribution is a finite mixture of Gaussians. Our generative model closely follows previous work on mixtures of factor analysers (Ghahramani and Beal, 1999).
Figure  6.2:   Graph  that  generates  the  data  used  in  the  example  of  Figure  6.3.
6.2.1   Parametric  formulation
Let s be a discrete variable with a finite sample space {1, . . . , S}. Variable s is modeled as a multinomial with parameter π:

    s ∼ Multinomial(π)                                            (6.1)
Let L^(k) ∈ L be a latent variable such that L^(k), conditioned on s, is a linear function of its parents with additive noise that follows a Gaussian distribution. That is,

    L^(k) | s  ∼  N( Σ_{j ∈ P_L^(k)} b_kjs L^(j),  1/τ_ks )       (6.2)

where P_L^(k) is the index set corresponding to the parents of L^(k) in G, b_kjs corresponds to the coefficient of L^(j) in the equation for L^(k) on component s, and τ_ks is the inverse of the error variance of L^(k) given s and its parents.
Let X be our observed variables, and define Z = L ∪ X ∪ {1}. Analogously,

    X^(k) | s  ∼  N( Σ_{j ∈ P_X^(k)} λ_kjs Z^(j),  1/υ_k )        (6.3)

Let the constant 1 be a parent of all X ∈ X. The role of 1 in Z is to create an intercept term for the linear regression of X^(k) on its parents. Notice that the precision parameter υ_k is not dependent on s.
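As a concrete reading of equations (6.1)-(6.3), the following Python sketch samples from a small instance of this model. The structure (two latents, four indicators) and all parameter values are made up purely for illustration and are not taken from the thesis.

import numpy as np

rng = np.random.default_rng(0)
S = 2                                    # number of mixture components
pi = np.array([0.4, 0.6])                # multinomial parameter of equation (6.1)

# Latent layer: L2 <- L1. b[k, j, s] is the coefficient of L^(j) in L^(k) on component s,
# and tau[k, s] the corresponding error precision (equation (6.2)).
b = np.zeros((2, 2, S))
b[1, 0, :] = [0.8, -0.5]                 # the L1 -> L2 coefficient differs across components
tau = np.ones((2, S))

# Measurement layer: four indicators, loadings lam[k, j, s], intercepts mu[k, s],
# and error precisions upsilon[k] shared across components (equation (6.3)).
lam = rng.normal(size=(4, 2, S))
mu = rng.normal(size=(4, S))
upsilon = np.full(4, 4.0)

def sample_one():
    s = rng.choice(S, p=pi)                                      # (6.1)
    L = np.zeros(2)
    L[0] = rng.normal(0.0, 1.0 / np.sqrt(tau[0, s]))             # L1 has no latent parents
    L[1] = b[1, 0, s] * L[0] + rng.normal(0.0, 1.0 / np.sqrt(tau[1, s]))
    noise = rng.normal(0.0, 1.0 / np.sqrt(upsilon))              # component-independent precisions
    X = lam[:, :, s] @ L + mu[:, s] + noise                      # (6.3), with mu as the intercept
    return s, L, X

print(sample_one())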
6.2.2   Priors
A useful metric for ranking graphs is their posterior probability given the data. For this purpose, we should first specify priors over the graphs and parameters.
Our prior for π is the following Dirichlet:

    π ∼ Dirichlet(a_π)                                            (6.4)

Each coefficient b_kjs ∈ B among latents is given a zero-mean Gaussian prior with a shared precision hyperparameter:

    b_kjs ∼ N(0, 1/ν_L)                                           (6.5)
Figure 6.3: A run of Washdown for data generated from the model in Figure 6.2. We start with the one-factor model in (a), and by using the process of node elimination, we generate the graph in (b), where nodes X_1, X_2, X_3, X_4, X_9, X_11, X_12 are eliminated. We wash down such discarded nodes to a new cluster, corresponding to latent L_2 (c). Another round of node elimination generates the graph in (d) with the respective discarded nodes. Such nodes are washed down to the next latents (X_10 moves to L_2, the others move to L_3), as depicted in (e). Nodes are eliminated again, generating graph (f). The eliminated nodes are clustered under latent L_4, as in (g). Because this latent has too few indicators, we eliminate it, arriving at the final graph in (h). Notice that the labels of the latents are arbitrary and correspond only to the order of creation.
That is, we have a single hyperparameter ν_L, which can be optimized by a closed formula given all the other parameters.
We will not define a prior over the error precisions for L and X, i.e., the sets T = {τ_ks} and Υ = {υ_k}. The number of error precisions for X does not increase with model complexity, so no penalization for complexity is needed for these parameters. The precisions for L do not introduce extra degrees of freedom, since the scale of the latent variables can be adjusted arbitrarily. Therefore, we will also treat them as hyperparameters to be fitted.
Concerning the elements of Λ = {λ_kjs}, we also adopt a single prior for the parameters λ ∈ Λ. For each observed variable X^(k), and each λ ∈ Λ_k:

    λ ∼ N(0, 1/ν_X^(k)),                                          (6.6)

if λ does not correspond to an intercept term, and

    λ ∼ N(0, 1/ν^t_X^(k)),                                        (6.7)

if λ does correspond to an intercept term. That is, the number of hyperparameters {ν_X^(k), ν^t_X^(k)} increases neither with the number of mixture components nor with the number of parents of variable X^(k).
This model can be interpreted as a mixture of causal models of different subpopulations, where each subpopulation has the same causal structure but different causal effects. The measurement error, represented by Υ, the matrix of precision parameters for the observed variables, is the same across subpopulations.
Another motivation for making Υ independent of s is computational: first, estimation can get much more unstable if Υ is allowed to vary with s. Second, a prior for Υ is not strictly necessary, and therefore we will not need to fit the corresponding hyperparameters. Usual prior distributions for precision parameters, such as gamma distributions, have hyperparameters that cannot be fit by a closed formula (see, e.g., Beal and Ghahramani, 2003). This could slow down the procedure considerably.
The natural question to ask is what happens to the entailment of tetrad constraints in finite mixtures of linear models. Again, a constraint is entailed if and only if it holds for all parameter values of the mixture model. We can appeal to a measure-theoretic argument, not unlike the one used in Chapter 4, to argue that observed tetrad constraints that are not entailed by the graphical structure require coincidental cancellation of parameters, and therefore are ruled out as unlikely. This argument is less convincing as the number of mixture components grows large. Nevertheless, we will implicitly assume that the number of mixture components is not so high that constraints are judged to hold in the population by finite sample scoring and yet are not graphically entailed.
6.3   A  Bayesian  algorithm  for  learning  latent  causal   models
The original Washdown of Silva (2002) was based on a χ² test. We introduce a variation of this algorithm using a Bayesian score function. Based on the success of Bayesian score functions in other structure learning algorithms (Cooper, 1999), we conjecture that in general it should be a better alternative than χ² tests for small sample sizes. Moreover, the χ² stopping criterion of the original Washdown function depended on a pre-specified significance value that can be quite arbitrary, while our suggested score function does not have any special parameters to be set a priori.
Algorithm Washdown
Input: a data set D of observed variables O
Output: a DAG

1. Let G be an empty graph
2. G_0 ← G
3. Do
4.    G ← IntroduceLatentCluster(G, G_0, O)
5.    Do
6.       Let O ← argmax_{O ∈ G} T(G_{\O}, D)
7.       If T(G_{\O}, D) > T(G, D)
8.          Remove O from G
9.    While G is modified
10.   If GraphImproved(G, G_0)
11.      G_0 ← G
12. While G_0 is modified
13. Return G_0

Table 6.1: Build a latent variable model where observed variables either share the same parents or no parents.
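To make the control flow of Table 6.1 explicit, here is a Python sketch of the same loop. The graph object and the helpers empty_graph, score (playing the role of T), introduce_latent_cluster, graph_without (building G_{\O}), and graph_improved are assumed interfaces for this illustration, not code from the thesis.

def washdown(D, O, empty_graph, score, introduce_latent_cluster, graph_without,
             graph_improved):
    # Steps 1-2: start from an empty candidate graph.
    G_prev = empty_graph()
    accepted = True
    while accepted:                                   # outer Do/While of steps 3-12
        # Step 4: add a new latent cluster and redistribute unclustered indicators.
        G = introduce_latent_cluster(G_prev.copy(), G_prev, O)
        changed = True
        while changed:                                # steps 5-9: indicator removal cycle
            changed = False
            observed = list(G.observed_nodes())
            if not observed:
                break
            # Step 6: the indicator whose "freed" version G_{\O} scores best.
            best = max(observed, key=lambda o: score(graph_without(G, o), D))
            if score(graph_without(G, best), D) > score(G, D):   # step 7
                G.remove_node(best)                              # step 8
                changed = True
        # Steps 10-11: keep the new graph only if it improves on the previous one.
        accepted = graph_improved(G, G_prev, D, score)
        if accepted:
            G_prev = G
    return G_prev                                     # step 13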
Figure 6.4: Deciding if X_1 should be excluded from the one-factor model in (a) is done by comparing models (a) and (b). Equivalently, removing X_1 generates a model where the entries corresponding to the covariance of X_1 and X_i (σ_1i) are not constrained, while the remaining covariance matrix Σ_23456 is a rank-1 model, as illustrated by (c).
Let T(G, D) be a function that scores graph G using dataset D. Our goal with Washdown will be to find local maxima of T in the space of pure measurement models. Section 6.3.1 describes the algorithm; several implementation details are left to Appendix C.3. A proposed score function T is described in Section 6.3.2.
6.3.1   Algorithm
The modified Washdown algorithm is shown in Table 6.1. We will explain it step by step.
In Table 6.1, graph G is our candidate graph, the one that will have indicators removed and latents added to it. Graph G_0 represents the candidate graph in the previous iteration of the algorithm. Moving to the next iteration in Washdown only happens when graph G is better than G_0 according to the function GraphImproved, shown in Table 6.3 and explained in detail later in this section.
Algorithm IntroduceLatentCluster
Input: two graphs G, G_0; a set of observed variables O
Output: a DAG

1. Let NodeDump be the set of observed nodes in O that are not in G
2. Let T be the number of latents in G
3. Add a latent L_T to G and form a complete DAG among the latents in G
4. For all V ∈ NodeDump
5.    If V ∈ G_0
6.       Let L_i be the parent of V in G_0
7.       Set L_{i+1} to be the parent of V in G
8.    Else
9.       Set L_T to be the parent of V in G
10. If L_T does not have any children
11.    Remove L_T from G
12. Return G

Table 6.2: Introduce a new latent by moving nodes down the latent layer.
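A corresponding sketch of Table 6.2, again with a hypothetical graph API, makes the "washing down" of indicators explicit: previously clustered nodes are moved to the latent following their old parent, while never-clustered nodes are attached to the newly created latent.

def introduce_latent_cluster(G, G_prev, O):
    node_dump = [v for v in O if v not in G.observed_nodes()]    # step 1
    T = G.num_latents()                                          # step 2
    new_latent = G.add_latent(index=T)                           # step 3
    G.fully_connect_latents()                                    # complete DAG among latents
    for v in node_dump:                                          # steps 4-9
        if v in G_prev.observed_nodes():
            i = G_prev.latent_parent_index(v)
            # wash the node one latent "down" (creating latent i+1 first if needed)
            G.set_latent_parent(v, index=i + 1)
        else:
            G.set_latent_parent(v, index=T)                      # never clustered: newest latent
    if not G.children_of(new_latent):                            # steps 10-11
        G.remove_node(new_latent)
    return G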
G starts without any nodes. Function IntroduceLatentCluster, described in Table 6.2, adds a new latent node to G (connecting it to all other latents) and moves around the observed variables that are not in G. As in the original Washdown, illustrated in Figure 6.3, latents in G are numbered (L_1, L_2, L_3, etc.). Any node removed from G that was originally a child of latent L_i will be assigned to be an indicator of latent L_{i+1}. It is this flow of indicators, downstream through the latent layer, that justifies the name "washdown."
After the addition of a new latent, we proceed to the cycle of indicator removal. This is represented by Steps 5-9 in Table 6.1. The way this removal is implemented is one of the main differences between the original algorithm of Silva (2002) and the new Washdown. Let G_{\O} be a modification of graph G generated by removing all edges into O and adding an edge from every observed node in G into O. By definition, G_{\∅} = G. We will select the observable node O in G that maximizes T(G_{\O}, D).
The intuition for this comparison is as follows. For example, consider a latent variable model with a single latent, where this latent is the common parent of all observed variables and no other edges exist. Figure 6.4(a) illustrates this type of model. To simplify the exposition, we will consider a model with only one Gaussian component. The covariance matrix Σ of X_1, . . . , X_6 can be represented as

    Σ = λλ^T + Ψ                                                  (6.8)

where λ is the vector corresponding to the edge coefficients relating the latent to each of the observed variables and Ψ is the respective matrix of residuals. That is, this one-factor model imposes a rank constraint on the first term of this sum: Σ − Ψ = λλ^T has rank one.
Algorithm GraphImproved
Input: two graphs G_1, G_2; a data set D
Output: true or false

...
5.    Let O_C ← O_All \ O_i and add them to G_i
6.    Add edges V → O to G_i for all (V, O) ∈ O_i × O_C
7.    Form a full DAG among the elements of O_C in G_i
8. If T(G_1, D) > T(G_2, D)
9.    Return true
10. Else
11.   Return false

Table 6.3: Compare two graphs that initially might have different sets of observed variables.
In contrast, removing X_1 as in G_{\X_1} yields a model where the covariance matrix of {X_2, . . . , X_6} keeps the one-factor structure, while the conditional distribution p(X_1 | X_2, . . . , X_6) is left unconstrained. This can be done as shown in Figure 6.4(b).
That is, we modify the implied joint distribution p_0(X_1, . . . , X_6) into a new joint p_1(X_1 | X_2, . . . , X_6) p_1(X_2, . . . , X_6), where p_1(X_1 | X_2, . . . , X_6) is saturated (no further constraints imposed). This operation will remove any rank constraints that include X_1. This idea is largely inspired by the search procedure described by Kano and Harada (2000). The algorithm of Kano and Harada (2000) adds and removes nodes in a factor analysis graph by doing an analogous comparison of nested models. That approach, however, was intended to modify a factor analysis graph given a priori, i.e., it was a purification procedure for a pre-defined clustering. We use it as a step to build clusters from data.
Empirically, this procedure for selecting which indicator to remove worked better in preliminary experiments than simply choosing among models that differ from G by having one less indicator, as used in Silva (2002). This is intuitive, because it measures not only how well the remaining indicators fit the data, but also how much is gained by representing the covariance between the removed indicator and the other variables without imposing constraints.
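As a rough numerical illustration of this nested comparison (not the thesis implementation, and using a single Gaussian component as in the example above), the sketch below fits a one-factor covariance to all six indicators and then to {X_2, ..., X_6} only, leaving the covariances involving X_1 unconstrained; the two implied covariance matrices can then be scored on the same data. The data-generating loadings and the use of scikit-learn's FactorAnalysis are assumptions made only for this illustration.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 2000
lam = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])   # hypothetical one-factor loadings
L = rng.normal(size=(n, 1))
X = L @ lam[None, :] + 0.5 * rng.normal(size=(n, 6))

def avg_gaussian_loglik(X, Sigma):
    # average log-likelihood of centered data under a zero-mean Gaussian with covariance Sigma
    Xc = X - X.mean(axis=0)
    p = X.shape[1]
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum('ij,jk,ik->i', Xc, np.linalg.inv(Sigma), Xc).mean()
    return -0.5 * (p * np.log(2 * np.pi) + logdet + quad)

# Candidate G: one-factor (rank-1 plus diagonal) covariance over X1..X6.
Sigma_G = FactorAnalysis(n_components=1).fit(X).get_covariance()

# Candidate G_{\X1}: rank-1 plus diagonal structure only over X2..X6; the row/column of X1
# is taken from the sample covariance, i.e., it is left unconstrained.
Sigma_sub = FactorAnalysis(n_components=1).fit(X[:, 1:]).get_covariance()
Sigma_free = np.cov(X, rowvar=False)
Sigma_free[1:, 1:] = Sigma_sub

print(avg_gaussian_loglik(X, Sigma_G), avg_gaussian_loglik(X, Sigma_free))
# Here X1 really is an indicator of the same latent, so freeing it buys little; if X1
# violated the implied tetrad constraints, the second score would be noticeably higher.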
At Step 10 of Table 6.1, we have to decide whether to proceed to the next iteration or to halt. In the original Washdown formulation, we would always start the next iteration and would stop only when the new model passed a statistical test at a given significance level (in the original Washdown, clusters with 1 or 2 indicators would simply be removed at the end). This has two major drawbacks: it requires a choice of significance level, which is often arbitrary, and it requires the test to have significant power. For more complex distributions, such as mixtures of Gaussians, having a test of acceptable power might be difficult.
Instead, we use the criterion defined by the function GraphImproved (Table 6.3). Both the current candidate graph, G, and the previous graph, G_0, embody a set of tetrad constraints. The score function is expected to reflect how well such constraints are supported by the data: in this case, the better the score, the better supported are the tetrad constraints. However, due to variable elimination, G and G_0 might differ with respect to their sets of observed variables. Comparing them directly is meaningless: for instance, if G equals G_0 with some indicators removed, then the likelihood of G will be higher than that of G_0.
Figure 6.5: Graphs (a) and (b) are transformed into graphs (c) and (d), respectively, before comparison in method GraphImproved.
Instead, we normalize G and G_0 in GraphImproved before making the comparison. Nodes in G that are not in G_0 are added to G_0, and nodes in G_0 that are not in G are added to G. Such nodes are connected to the pre-existing nodes by adding all possible edges from the original nodes into the new nodes. The goal is to include the new nodes without imposing any constraints on how they are mutually connected or connected to the existing nodes.
For example, consider Figure 6.5. Graph G_0 has a single cluster (Figure 6.5(a)), and graph G has two clusters (Figure 6.5(b)). Graph G_0 has nodes X_8, X_10 and X_12 that are not present in G, while node X_11 is in G but not in G_0. Therefore, we normalize both graphs with respect to each other, obtaining the graphs in Figure 6.5(c) and Figure 6.5(d). If the normalized G scores higher than the normalized G_0, we accept G as our new graph and proceed to the next iteration.
Figure 6.3, used to illustrate the algorithm described by Silva (2002), also illustrates the new Washdown algorithm. Most modifications are in the internal evaluations, but the overall structure of the algorithm remains the same. The only difference, in this example, is that we choose the model in Figure 6.3(h) over the one in Figure 6.3(g) not because we eliminate clusters with too few indicators, but because the score of the former is higher than the score of the latter.
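The normalization in GraphImproved can be sketched as follows (hypothetical graph API again): each graph receives the observed nodes it is missing, with all possible edges from its pre-existing nodes into the newcomers and a full DAG among the newcomers themselves, so that no extra constraints are imposed; only then are the two scores compared.

def graph_improved(G, G_prev, D, score):
    G_n, G_prev_n = G.copy(), G_prev.copy()
    for target, other in [(G_n, G_prev), (G_prev_n, G)]:
        existing = list(target.observed_nodes())
        missing = [v for v in other.observed_nodes() if v not in existing]
        target.add_nodes(missing)
        for v in existing:                    # edges from original nodes into the new nodes
            for m in missing:
                target.add_edge(v, m)
        target.fully_connect(missing)         # full DAG among the newly added nodes
    return score(G_n, D) > score(G_prev_n, D)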
6.3.2   A variational score function
We adopt the posterior distribution of a graph as its score function. Our prior over graph structures will be uniform, which implies that the score function of a graph G amounts to the marginal likelihood p(D | G), D being the data set. Since calculating this posterior is intractable for any practical search algorithm, we adopt a variational approximation for it that is similar to the Bayesian variational mixture of factor analysers (Ghahramani and Beal, 1999).
In fact, Beal and Ghahramani (2003) show by a heuristic argument that asymptotically this variational approximation is equivalent to the BIC score. However, for finite samples we have the flexibility of fitting the hyperparameters and choosing a more suitable penalization function than the one given by BIC. In experiments on model selection described by Beal and Ghahramani (2003), this variational framework was able to give better results than BIC at roughly the same computational cost.
Let the posterior probability of the parameters and hidden variables be approximated as follows:

    p(π, B, Λ, {s_i, L_i}_{i=1}^n | X)  ≈  q(π) q(B) q(Λ) ∏_{i=1}^n q(s_i, L_i)

where p(·) is the density function, the q(·) are the variational approximations, and n is the sample size. The main approximation assumption is the conditional decoupling of parameters and latent variables.
Given the logarithm of the marginal distribution of the data,

    L ≡ ln p(X) = ln ∫ dπ p(π | a_π) ∫ dB p(B | ν_L) ∫ dΛ p(Λ | ν_X)
                      ∏_{i=1}^n [ Σ_{s_i=1}^S p(s_i | π) ∫ dL_i p(L_i | s_i, B, T) p(X_i | Z_i, s_i, Λ, Υ) ]
we introduce our variational approximation by using Jensen's inequality:

    L ≥ ∫ dπ dB dΛ  q(π) q(B) q(Λ)  ( ln [ p(π | a_π) p(B | ν_L) p(Λ | ν_X) / ( q(π) q(B) q(Λ) ) ]
          + Σ_{i=1}^n Σ_{s_i=1}^S ∫ dL_i  q(s_i, L_i)
              ( ln [ p(s_i | π) p(L_i | s_i, B, T) / q(s_i, L_i) ]  +  ln p(X_i | Z_i, s_i, Λ, Υ) ) )
Therefore, our score function is

    T(G, D) = ∫ dπ q(π) ln [ p(π | a_π) / q(π) ]  +  Σ_{s=1}^S [ ∫ dB_s q(B_s) ln [ p(B_s) / q(B_s) ]  +  ∫ dΛ_s q(Λ_s) ln [ p(Λ_s) / q(Λ_s) ] ]
              + Σ_{i=1}^n Σ_{s_i=1}^S q(s_i) [ ∫ dπ q(π) ln [ p(s_i | π) / q(s_i) ]
              + ∫ dB_s dL_i q(B_s) q(L_i | s_i) ln [ p(L_i | s_i, B_s, T_s) / q(L_i | s_i) ]
              + ∫ dΛ_s dL_i q(Λ_s) q(L_i | s_i) ln p(X_i | s_i, Z_i, Λ_s, Υ) ]
In this function, the first three lines correspond to the negative KL-divergence between the priors and the approximate posteriors. The fourth line is the expected log-likelihood of the data under the approximate posteriors. Although this variational score is not guaranteed to consistently rank models, it is a natural extension of the BIC score (which is also inconsistent for latent variable models), where the penalization term increases not with the number of parameters, but with how much they deviate from a given prior.
Optimizing our variational bound is a non-convex optimization problem. To fit the variational parameters, we alternate between optimization steps in which we find the value of one parameter, or hyperparameter, while fixing all the others. The steps are given in Appendix C, Section C.1.
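The following skeleton illustrates the kind of coordinate ascent used to fit the variational approximation; the method names are an assumed structure only, and the actual update formulas are those of Appendix C.1. Each call updates one block of the factorized posterior, or one hyperparameter, while holding the rest fixed, and the loop stops when the bound stops improving.

def fit_variational(model, data, tol=1e-4, max_iter=200):
    # 'model' is a hypothetical object holding the variational factors and hyperparameters.
    bound = float('-inf')
    for _ in range(max_iter):
        model.update_q_pi()              # q(pi)
        model.update_q_B()               # q(B_s), one block per mixture component
        model.update_q_Lambda()          # q(Lambda_s), one block per mixture component
        model.update_q_s_L(data)         # q(s_i, L_i) for every data point
        model.update_hyperparameters()   # nu_L, nu_X, and the precisions treated as hyperparameters
        new_bound = model.lower_bound(data)   # the score T(G, D) for the current graph
        if new_bound - bound < tol:
            break
        bound = new_bound
    return bound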
6.3.3   Choosing  the  number  of  mixture  components
So far we have discussed how to search for a graph given a probabilistic model, but we have not mentioned how to choose the number of Gaussian components to begin with. A principled alternative for choosing the number of mixture components would be to run the Washdown algorithm for varying numbers of components and choose the one with the best score. Since this is computationally expensive, we instead heuristically choose the number of components according to the output of the variational mixture of factor analysers (Ghahramani and Beal, 1999) and fix it as an input for Washdown.
Figure  6.6:   The  network  used  in  the  experiments  throughout  Section  6.4.
Evaluation of output measurement models

Trial     Latent omission   Latent commission   Indicator omission   Misclustered indicator   Impurities
1               1                   0                    0                      0                  0
2               0                   0                    2                      1                  0
3               0                   0                    2                      0                  0
4               3                   0                   12                      0                  0
5               2                   0                    9                      1                  0
6               2                   0                    8                      0                  4
7               3                   0                   12                      0                  0
8               3                   0                   13                      0                  2
9               0                   0                    2                      0                  0
10              0                   2                    0                      4                  0
average        1.4                 0.2                    6                    0.6                0.6

Table 6.4: Results on structure learning for Washdown for samples of size 200.
6.4   Experiments  on  causal   discovery
In this section we perform simulated experiments using the same type of graphical models described in the experimental section of Chapter 3. The goal is to analyse how well Washdown is able to reconstruct the true graph. We use the graphical structure in Figure 6.6 to generate synthetic datasets. In all of our experiments, we generated data from a mixture of 3 Gaussians. Within each Gaussian, we sampled the parameters following the same procedure as in Chapter 3. The probability of each Gaussian component was chosen by selecting an integer uniformly from {1, 2, 3, 4, 5} and normalizing. Distributions where one of the components had probability less than 0.15 were discarded.
The criteria of success are the same as in Chapter 3, using counts instead of percentages. Table 6.4 shows the results for 10 independent trials using a sample size of 200. Table 6.5 shows the results for 10 independent trials using a sample size of 1000.
The results, especially for the sample size of 200 (for which variance is high), are not as good as in the Gaussian case presented in Chapter 3. However, these problems are much harder, and BuildPureClusters, for instance, does not provide reasonable outputs (it returns a mostly empty graph). Washdown, while much more computationally expensive, is still able to return mostly correct outputs for the given problem at reasonable sample sizes.
Evaluation of output measurement models

Trial     Latent omission   Latent commission   Indicator omission   Impurities   Misclustered indicator
1               0                   0                    1                 0                  0
2               0                   0                    3                 1                  2
3               0                   0                    0                 0                  0
4               0                   0                    0                 0                  1
5               0                   0                    3                 2                  1
6               0                   0                    0                 0                  0
7               1                   0                    0                 0                  0
8               0                   0                    1                 0                  0
9               0                   1                    0                 2                  1
10              1                   0                    0                 0                  0
average        0.2                 0.1                  0.8               0.5                0.5

Table 6.5: Results on structure learning for Washdown for samples of size 1000.
6.5   Generalized rank constraints and the problem of density estimation
As discussed in the first chapter of this thesis, latent variable models are also important tools in density estimation. For instance, Bishop (1998) discusses variations of factor analysis and mixtures of factor analysers (probabilistic principal component analysis, to be more specific) for the problem of density estimation. One of the applications discussed by Bishop was digit recognition from images, which can be used for automated ZIP code identification from scanned envelopes. Instead of building a discriminative model that classifies each image according to the set {0, 1, . . . , 9}, his proposed model calculates the posterior probability of each digit using 10 different density models, one for each digit. In this way, it is possible to raise a flag when none of the density models recognizes a digit with high probability, so that human classification is required and a better trade-off between automation and the cost of human intervention can be achieved. Outlier detection, as in the digit recognition example, is a common application of density models. Latent variable models, usually variations of factor analysis, are among the most common tools for this task. Bishop (1998) and Carreira-Perpinan (2001) describe several other applications.
Most   approaches   based  on  factor   analysis  are  also  computationally  appealing.   The  problem
of   structure  learning  is  in  many  cases  reduced  to  the  problem  of   choosing  the  number  of   latents.
Maximum likelihood  estimation  of a few latent  variable models is computationally  feasible even for
very large problems.   If one wants a Bayesian  criterion for model selection,  in practice one could use
the BIC score, which only requires a maximum likelihood estimator. Other approximate Bayesian
scores   are  computationally  feasible,   such  as   the  variational   score  discussed  here.   Minka  (2000)
provides  other  examples  of   approximation  methods  to  compute  the  posterior  of   a  factor  analysis
model.
The common factor analysis model consists of a fully connected measurement model with disconnected latents, i.e., a model where every indicator is a child of every latent and where there are no edges connecting latents. However, this space of factor analysis graphs might not be the best choice for a probabilistic model. For instance, assume the true linear model that generated our data is the one shown in Figure 6.7(a). In the usual space of factor analysis graphs, we would need the
Figure 6.7: If fully connected measurement models with disconnected latents are used to represent the joint density function of model (a), the result is the model shown in (b).
graph represented in Figure 6.7(b). The relatively high number of parameters in this case might lead to inefficient statistical estimation, requiring more data than we would need if we used the right structure.
Alternatively, a standard hill-climbing method with hidden variables, such as FindHidden (Elidan et al., 2000, discussed in Chapter 2), might find a better structure, as measured by how well the model fits the data. However, this type of approach has two disadvantages:
• it is computationally expensive, since at every step we have to decide which edge to remove, add, or reverse. Each search operator scales quadratically with the number of variables;
• its search space is naive. Single-edge modifications might be appropriate for some spaces of latent variable models (e.g., if the true model and all models in the search space have likelihood functions with a single global optimum). In general, however, if the model has many unidentifiable parameters, or if sample sizes are relatively small, models that differ by a single edge might be indistinguishable or nearly indistinguishable and can easily misguide the algorithm. Instead of operating on single edges, a search algorithm should operate on graphical modifications that entail different relevant constraints on the observed marginal.
If the given observed variables have many hidden common causes, as in the problems that motivate this thesis, it might be more appropriate to discard the general FindHidden approach and adopt an algorithm that operates directly in the space of factor analysis graphs. Such an algorithm should have the following desirable features:
• unlike standard model selection in factor analysis, its search space should include a large class of latent variable graphs, not only fully connected measurement models;
• unlike FindHidden, its search space should have operators that scale at least linearly with the number of observed variables;
• unlike FindHidden, any two neighbors in the search space should imply different constraints on the observed marginal, and such different constraints should be relatively easy to distinguish given reasonable sample sizes.
Washdown satisfies these criteria. Its search space of pure measurement models with connected latents can represent several distributions using sparse graphs where fully connected measurement models with disconnected latents would require many edges, as in the example of Figure 6.7. Each search operator scales linearly with the number of observed variables.³ Pure measurement models that differ according to tetrad constraints can be expected to be easier to distinguish with small samples than dense graphs that differ by a single edge.
³ This is not to say that Washdown is more computationally efficient than FindHidden in general. If the graph found in the first stage of FindHidden, which does not require latent variables, is quite sparse, then FindHidden is likely to be faster than Washdown. However, for problems where the initial graph is found to be somewhat dense, FindHidden can be slow compared to an approach such as Washdown.
However, Washdown has one essential limitation that precludes it from being directly applicable to density estimation problems: it might discard an unpredictable number of observed variables. For instance, in Chapter 4 we analysed the behavior of BuildPureClusters on some real-world data, and approximately two thirds of the observed variables were eliminated.
There are two main ways of modifying Washdown. One is to somehow force all variables to be indicators in a pure measurement model, as in the approach described by Zhang (2004) for learning latent trees of discrete variables. The drawback is that such an approach can sometimes (or perhaps even frequently) underfit the data, as Zhang acknowledges.
Another way is to adopt a hybrid approach, as hinted at the end of Chapter 4: for instance, an algorithm where different pure measurement models are learnt for different subsets of variables and are combined in the end by introducing the required impurities. Consider, for example, the true model shown in Figure 6.8(a). An algorithm that learns only pure measurement models cannot generate the cluster represented by latent L_3 if it includes L_1 and L_2, and vice-versa.
Now imagine we run Washdown and obtain a model that includes latents L_1 and L_2, as well as all of their respective indicators, as in Figure 6.8(b). We can run Washdown again with the discarded indicators {X_9, X_10, X_11, X_12} and obtain an independent one-factor model, as in Figure 6.8(c). We could then merge these two marginally pure models into a single latent variable model, as in Figure 6.8(d). Starting from this model, we could apply a standard greedy search approach to introduce bi-directed edges among indicators if necessary.
This is the main idea behind the generalized Washdown approach we introduce in the next section. However, there are a few other issues to be solved if we want a model that includes all observed variables.
First, building pure models where observed variables have a single parent might not be enough. If, for instance, the true model is the one in Figure 6.7(b), this generalized Washdown algorithm will not work: it will still return an empty graph.
The proposed solution is to embed Washdown in an even more general framework: we first try to build several disjoint pure models with one latent per cluster, until an empty graph is returned. When this happens, and we still have unclustered indicators, we attempt to build several disjoint pure models with two latents per cluster, and so on. This approach explores general rank constraints. That is, a cluster with k latents imposes a rank constraint on the covariance matrix Σ of its p indicators, namely, that it can be decomposed into two matrices as follows:

    Σ = ΛΦΛ^T + Ψ

where Φ is the covariance matrix of the latents, Λ is the matrix of edge coefficients connecting indicators to their latent parents (a p × k matrix), and Ψ is a diagonal matrix representing the covariance matrix of the residuals. That is, Σ is constrained to be the sum of a matrix of rank k (ΛΦΛ^T) and a diagonal matrix (Ψ). Figure 6.9 presents a type of output that could be generated using this algorithm.
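As a small numerical illustration of this constraint (the particular numbers below are made up for the example), subtracting the diagonal residual matrix from such a covariance leaves a matrix whose rank equals the number of latents in the cluster:

import numpy as np

rng = np.random.default_rng(1)
p, k = 6, 2
Lambda = rng.normal(size=(p, k))            # indicator loadings on the k cluster latents
Phi = np.array([[1.0, 0.3], [0.3, 1.0]])    # latent covariance matrix
Psi = np.diag(rng.uniform(0.2, 0.5, p))     # diagonal residual covariance
Sigma = Lambda @ Phi @ Lambda.T + Psi

print(np.linalg.matrix_rank(Sigma - Psi))   # k = 2: the rank constraint imposed by the cluster
print(np.linalg.matrix_rank(Sigma))         # p = 6: full rank once the residuals are added back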
The problem is not completely solved yet.
Figure 6.8: Given data sampled from the model in (a), one variation of Washdown could be used to first generate the model in (b) and then, independently, the model in (c). Both models would then be merged, and bi-directed edges could be later added by some greedy search (d).

Figure 6.9: A possible outcome of a generalized Washdown that allows multiple latent parents per indicator and impurities. In this case, four calls to a clustering procedure were made. In the first two calls, models with one latent per cluster were built; in the third call, two latents per cluster; in the fourth call, three latents. We merge all models and fully connect all latents, which is represented by the bold edges between subgraphs in the figure. Bi-directed edges are then added by an additional greedy search for impurities.

Figure 6.10: For density estimation it makes little sense to have a model of fully connected latents with one indicator per latent, as in (a), even if this is the true causal model. The same distribution could be represented without latent variables, as in (b), with fewer parameters.

Figure 6.11: Suppose the true model is the one in (a). A call to Washdown can return the model in (b), where indicators X_5 and X_6 are discarded. These indicators cannot be used to form any one-factor model (nor any other factor model), since a one-factor model does not impose any constraints on their joint distribution. Our solution is just to add X_5 and X_6 to the latent variable model and proceed with a standard greedy search method. A possible outcome is the one shown in (c).
It would not make sense, for instance, to have clusters with a single indicator, as in Figure 6.10: instead of having each latent connected to the other latents as in this example, we could just connect all the indicators directly and eliminate the latents. Although this is an extreme example, it illustrates that small clusters are undesirable. In general, a k-factor model (i.e., a factor model with k latents) is only statistically meaningful if there is a minimal number of indicators in the model. For instance, a one-factor model needs at least four indicators, since any covariance matrix of three or fewer variables can be represented by a one-factor model. We do not want small clusters.
If,   after  attempting  to  create  pure  models  with  1, 2, . . . , k  latents  per  cluster,   we  end  up  with
a  set  of  unclustered  indicators  that  is  not  large  enough  for  a  k-factor  model,   we  will  not  attempt
to  create  a  new  cluster.   Instead,   we  will   just  add  these  remaining  nodes  to  the  model   and  do  a
standard  greedy  search  to  connect  them  to  the  clustered  ones.   Figure  6.11  illustrates  a  case  where
we  have  only  two  remaining  indicators,  and  a  possible  resulting  model  where  these  two  indicators
are  included  in  the  latent  variable  model.
To summarize, we propose the following extension of Washdown for density estimation problems. This proposal is motivated by the necessity of including all observed variables and by the computational convenience of clustering nodes instead of finding arbitrary factor analysis graphs:
• attempt to create pure models with one latent per cluster (as in Washdown). After finding one such model, try to create a new one with the indicators that were discarded. Each new model is generated independently of the previous models. Iterate until no such pure model is returned;
• attempt to create pure models with two latents per cluster. Iterate;
• after no new k-factor model can be constructed with the remaining indicators, merge all pure models created so far into a single global model. Add bi-directed edges among indicators using a greedy search algorithm, if necessary. We explain how to parameterize such edges in Appendix C;
• add all the unclustered indicators to this global model, and connect them to the other nodes by using a standard greedy search.
Inspired  by  the   strategy  used  for   learning  causal   models   with  Washdown,   we   expect   this
algorithm to first find sets of indicators whose marginal is a sparse measurement model of a few
Algorithm K-LatentClustering
Input: a data set D of observed variables O, an integer k
Output: a DAG

1. Let G be an empty graph
2. G_0 ← G
3. Do
4.    G ← IntroduceKLatentCluster(G, G_0, O, k)
5.    Do
6.       Let O ← argmax_{O ∈ G} T(G_{\O}, D)
7.       If T(G_{\O}, D) > T(G, D)
8.          Remove O from G
9.    While G is modified
10.   If GraphImproved(G, G_0)
11.      G_0 ← G
12. While G_0 is modified
13. Return G_0

Table 6.6: Build a latent variable model where observed variables either share the same k latent parents or no parents.
latent variables, and only increase the number of required latent parents for the remaining indicators if the data says so. We conjecture that this provides a good trade-off between learning latent variable models that are relatively sparse and the required computational cost of this search.
6.5.1   Remarks
We do not have theoretical results concerning equivalence classes of causal models for combinations of rank-r (r > 1) models. Wegelin et al. (2005) describe an equivalence class of some types of rank-r models. They do not provide an equivalence class of all graphs that are indistinguishable given arbitrary combinations of different rank-r constraints. However, for density estimation problems it is not necessary to describe such an equivalence class, as long as the given procedure provides a better estimate of the joint than other methods. The generalized variation of Washdown presented in the next section and the results of Wegelin et al. (2005) might be used as a starting point for new approaches to causal discovery.
6.6   An  algorithm  for  density  estimation
We first introduce a slightly modified Washdown algorithm that takes as input not only a dataset, but also an integer parameter indicating how many latent parents each indicator should have (i.e., how many latents per cluster). We call this variation the K-LatentClustering algorithm, shown in Table 6.6. This algorithm is identical to Washdown, with the exception of introducing k latents within each cluster, as made explicit by the algorithm IntroduceKLatentCluster (Table 6.7).
Finally, we only need to formalize how K-LatentClustering will be used to generate several disjoint pure measurement models, and how such models are combined. This is detailed by
Algorithm IntroduceKLatentCluster
Input: two graphs G, G_0; a set of observed variables O; an integer k defining the cluster size
Output: a DAG

1. Let NodeDump be the set of observed nodes in O that are not in G
2. Let T be the number of clusters in G
3. Add a new cluster LC_T of k latents to G and form a complete DAG among the latents in G
4. For all V ∈ NodeDump
5.    If V ∈ G_0
6.       Let LC_i be the parent set of V in G_0
7.       Set LC_{i+1} to be the parent set of V in G
8.    Else
9.       Set LC_T to be the parent set of V in G
10. If LC_T d-separates an insufficient number of nodes
11.    Remove LC_T from G and add its observed children back to NodeDump
12. Return G

Table 6.7: Introduce a new latent set by moving nodes down the latent layer.
algorithm FullLatentClustering, given in Table 6.8. Notice that, in step 16 of FullLatentClustering, we initialize our final greedy search by making all latents the parents of the last nodes added to our graph, the set O_C. In step 17, we never add edges from O_C into a previously clustered node. This simplification of the search space is justified as follows, and illustrated in Figure 6.12: any two previously clustered nodes participate in some rank constraint in the marginal covariance matrix (e.g., nodes X_1 and X_2 in Figure 6.12(a) participate in a rank-1 constraint with nodes X_3 and X_4). If some other node is set to be a parent of two clustered nodes, this constraint is destroyed (e.g., making X_5 a parent of both X_1 and X_2 would destroy the rank-1 constraint in the covariance matrix of {X_1, X_2, X_3, X_4}). Although allowing two clustered nodes to have an observed common parent might correct some previous statistical mistake, to simplify the search space we just forbid edges from O_C into clustered nodes.
More  implementation  details,   concerning  for  instance  the  nature  of  the  bi-directed  edges  used
in  K-LatentClustering,  are  given  in  Appendix C.3.
Finally,  we complement the search  by looking for  structure among the latents,  exactly  as in the
GES-MIMBuild  algorithm  of  Chapter  3.   We  call   the  combination  FullLatentClustering  +
GES-MIMBuild  the  RankBasedAutomatedSearch algorithm  (RBAS).
6.7   Experiments  on  density  estimation
We evaluate our algorithm against the mixture of factor analysers (MofFA), one of the approaches most closely related to RBAS, and against FindHidden, a standard algorithm for learning graphical models with latent variables (Elidan et al., 2000). Both RBAS and MofFA are intended to be applied to the same kind of data (observed variables with many hidden common causes) using the same type of probabilistic model (a finite mixture of Gaussians). FindHidden is best suited when many conditional independencies among observed variables are present in the true model.
The  data  are  normalized  to  a  multivariate  standard  Normal  distribution.   We  evaluate  a  model
Algorithm FullLatentClustering
Input: a data set D
Output: a DAG

1. i ← 0; D_0 ← D; Solutions ← ∅; k ← 1
2. Do
3.    G_i ← K-LatentClustering(D_i, k)
4.    If G_i is not empty
5.       Solutions ← Solutions ∪ {G_i}
6.       D_{i+1} ← Π_{D_i \ G_i}(D_i), the projection of D_i onto the variables not in G_i
7.       i ← i + 1
8. While Solutions changes
9. Increase k by 1 and repeat Steps 2-8 until the covariance matrix of D_i does not have enough entries to justify a k-factor model
10. Let G_full be the graph composed by merging all graphs in Solutions, where latents are fully connected as an arbitrary DAG
11. For every pair G_i, G_j ∈ Solutions
12.    Let G_partial be the respective merge of G_i, G_j
13.    Do a standard greedy search, adding bi-directed edges X_i ↔ X_j to G_partial
14.    Do a standard greedy search, deleting bi-directed edges X_i ↔ X_j from G_partial
15.    Add all bi-directed edges in G_partial to G_full
16. Let O_C be the set of all nodes in O that are not in G_full. Add O_C to G_full and make all latent nodes parents of all nodes in O_C in G_full
17. Do a standard hill-climbing procedure to add or delete edges into O_C in G_full, or to reverse edges connecting two elements of O_C
18. Return G_full

Table 6.8: Merge the solutions of multiple K-LatentClustering calls.
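A high-level Python sketch of the outer loop of Table 6.8, with the clustering call, the merging, and the greedy searches abstracted into hypothetical helpers:

def full_latent_clustering(D, k_latent_clustering, enough_variables_for, merge,
                           greedy_add_bidirected, greedy_connect_leftovers):
    solutions, remaining, k = [], D, 1
    # Steps 1-9: carve out disjoint pure models, increasing the number of latents per
    # cluster once no more clusters of the current size can be found.
    while enough_variables_for(remaining, k):
        G = k_latent_clustering(remaining, k)
        if G.is_empty():
            k += 1
            continue
        solutions.append(G)
        remaining = remaining.drop(G.observed_nodes())    # keep unclustered indicators only
    G_full = merge(solutions)                             # step 10: latents fully connected
    greedy_add_bidirected(G_full, solutions, D)           # steps 11-15: pairwise impurities
    greedy_connect_leftovers(G_full, remaining, D)        # steps 16-17: remaining indicators
    return G_full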
by its average log-likelihood on a test set. We perform model selection using Bayesian criteria. The variational Bayesian mixture of factor analysers (Ghahramani and Beal, 1999) is used to select the number of mixture components. For MofFA, we chose the number of factors by using BIC and a grid search from 1 to 15 latents.⁴ For FindHidden, we use the implementation with Structural EM described in Chapter 2, but we also re-evaluate the full model after each modification in order to avoid bad local optima due to the Structural EM approximation. We used exactly the same probabilistic model and variational approximation as in RBAS. Once a model is selected by RBAS, MofFA, or FindHidden using a training set, we estimate its parameters by maximum likelihood over the training set and test it with an independent test set.
The datasets used in the experiments are as follows. All datasets and their descriptions can be obtained from the UCI Repository (Blake and Merz, 1998). We basically chose datasets with a large number of continuous electronic measurements of some natural phenomena, plus a synthetic dataset (wave). Discrete variables were removed, as were instances with missing values.
• ionosphere (iono): 351 instances / 34 variables
• heart images (spectf): 349 / 44
• water treatment plant (water): 380 / 38
⁴ The available software for the variational mixture of factor analysers does not perform model selection for the number of factors. We used the same number of factors per component, which in this study is not a real issue, since in all datasets the number of chosen components was 2.
Figure 6.12: Suppose X_1-X_4 are clustered as a one-factor model as in (a), and an unclustered node X_5 has to be added to this model. If X_5 is set to be a common parent of X_1-X_4 as in (b), this contradicts the previously established rank-1 constraint in the covariance matrix of X_1-X_4 implied by the clustering. A simple solution is to avoid adding any edges at all from X_5 into nodes in X_1-X_4.
• waveform generator (wave): 5000 / 21
Table 6.9 shows the results. We use 5-fold cross-validation and report the results for each partition. Results for RBAS and for the differences RBAS − MofFA (R − M) and RBAS − FindHidden (R − F) are given.
As a baseline, we also report results for the fully connected DAG over the observed variables, using no latent variables. This provides an indication of how much the fit can increase by searching for the proper rank constraints. The results are given in Table 6.10.
On three datasets, we obtained a clear advantage over MofFA. We outperform FindHidden on iono and spectf according to a sign test, and on iono according to a t-test at a 0.01 significance level. One of the reasons RBAS and MofFA did not perform better on the water dataset is the presence of several ordinal variables. The variational score function was especially unstable in this case, where different starting points would frequently lead to quite different scores. Since FindHidden relies less on latent variables, this might explain why it gave more stable results across all data partitions. On the dataset wave, all three methods gave basically the same result, but in this case even the fully connected model performs as well.
It is interesting to notice that iono is the dataset that generated the DAG with the highest number of edges per node in FindHidden before the introduction of any latent. The DAGs generated with the spectf dataset are much sparser, but RBAS consistently outperforms the standard FindHidden approach. The dataset water illustrates an interesting phenomenon: RBAS does not work well with a dataset that has several discrete ordinal variables, and it is very unstable in that case.
It should also be clear that we do not claim that RBAS can be expected to outperform an algorithm such as FindHidden if a finite mixture of Gaussians is a bad probabilistic model for the given problem, or if few rank constraints that are useful for clustering variables hold in the population. Datasets such as iono, in which observed variables are connected by many hidden common causes, represent the ideal type of problem for this type of approach. Due to the higher computational cost of RBAS, one might want to use a MofFA model to evaluate how well it fits the data compared to some method such as FindHidden before trying our algorithm. If MofFA is of
6.8  Summary   123
Table  6.9:   Evaluation  of  the  average  test  log-likelihood  of  the  outcomes  of  three  algorithms.   Each
line  is  the  result  of   a  single  split  in  a  5-fold  cross-validation.   The  entry  R   M  is  the  dierence
between RBAS and MofFA. The entry R  F is the dierence between RBAS and FindHidden.
The  table  also  provides  the  respective  averages  (avg)  and  standard  deviations  (dev).
iono   spectf   water   wave
Set   RBAS   R  -  M   R  -  F   RBAS   R  -  M   R  -  F   RBAS   R  -  M   R  -  F   RBAS   R  -  M   R  -  F
1   -34.65   4.84   9.54   -47.60   1.48   2.33   -30.91   6.33   5.10   -24.11   -0.06   0.80
2   -25.60   6.06   11.58   -45.76   4.72   4.66   -29.69   2.48   4.36   -23.97   -0.05   -0.61
3   -28.30   7.05   11.53   -47.93   -0.01   0.21   -40.76   7.57   -1.74   -23.87   -0.10   0.96
4   -32.90   4.25   6.73   -43.42   2.31   4.64   -42.57   4.97   -2.77   -24.10   -0.09   0.97
5   -32.87   7.72   9.89   -41.52   3.01   5.13   -44.4   8.08   -9.63   -24.24   -0.05   -0.04
avg   -30.86   5.98   9.86   -45.24   2.30   3.40   -39.21   5.88   -0.94   -24.06   -0.07   0.43
dev   3.77   1.46   1.98   2.74   1.75   2.08   8.82   2.25   6.00   0.14   0.02   0.69
Table  6.10:   Evaluation  of   the  average  test  log-likelihood  of   the  outcomes  of   three  algorithms.   A
fully  connected  DAG  was  used  in  this  case  as  a  baseline.
iono   spectf   water   wave
Set   Full  DAG   Full  DAG   Full  DAG   Full  DAG
1   -63.27   -54.09   -52.61   -24.06
2   -42.78   -58.46   -39.05   -23.95
3   -55.15   -55.43   -60.70   -23.84
4   -52.93   -51.60   -57.57   -24.08
5   -64.75   -53.44   -52.48   -24.24
avg   -55.78   -54.60   -54.33   -24.03
dev   8.86   2.56   9.25   0.15
at  least  competitive  performance  compared  to  FindHidden,   one  might  want  to  apply  RBAS  to
the  given  problem.
6.8   Summary
We introduced a new Bayesian  search algorithm for learning latent  variable models.   This approach
is shown to be especially interesting for density estimation problems.   For causality discovery,  it can
provide  provide  models  where  BuildPureClusters  fail.   The  new  algorithm  also  motivates  new
problems  in  identication  of   linear  latent  variable  models  using  generalized  rank  constraints  and
score-based  search  algorithms   that  try  to  achieve  a  better   trade-o  between  computational   cost
and  quality  of  the  results.
124   Bayesian  learning  and  generalized  rank  constraints
Chapter  7
Conclusion
This thesis  introduced  several  new techniques  for  learning the structure  of latent  variables  models.
The fundamental point of this thesis is that  common  and appealing heuristics (e.g.,  factor  rotation
methods)  fail  when  the  goal  is  structure  learning  with  a  causal  interpretation.   In  many  cases  it  is
preferable  to  model  the  relationships  of  a  subset  of  the  given  variables  than  trying  to  force  a  bad
model  over  all  of  them  (Kano  and  Harada,  2000).
Its  main  contributions  are:
   identiability results for learning dierent types of d-separations in a large class of continuous
latent  variable  models;
   algorithms  for  discovering  causal  latent  variable  structures  in  linear,  non-linear  and  discrete
cases,  using  such  identication  rules;
   empirical  evaluation  of  causality  discovery  algorithms,  including  a  study of  the  shortcomings
of  the  most  common  method,  factor  analysis;
   an  algorithm  for  heuristic  Bayesian  learning  of  probabilistic  models,  one  of  the  few  methods
with  arbitrarily  connected  latents,  motivated  by  results  in  causal  analysis;
The procedures described in this thesis are not meant to discover causal relations when the true
measurement  model  is  far  from  a  pure  model.   This  includes,  for  instance:
   modeling  text  documents  as  a  mixture  of  a  large  number  of  latent  topics  (Blei  et  al.,  2003);
   chemometrics   studies  where  observed  variables   are  a  mixture  of   many  hidden  components
(Malinowski,  2002);
   in  general,   blind  source  separation  problems,   where  measures  are  linear  combinations  of   all
latents  in  the  study  (Hyvarinen,  1999);
A number of open problems invite further research.   They can be divided into three main classes.
126   Conclusion
New  identiability  results  in  covariance  structures
   completeness  of  the  tetrad  equivalence  class  of  measurement  models:   can  we  identify  all  the
common  features  of   measurement  models  in  the  same  tetrad  equivalence  class?   A  simpler,
and  practical,  result  would  be  nding  all  possible  identication  rules  using  no  more  than  six
observed  variables.   Anything  more  than  that  might  be  of   limited  applicability  due  to  the
computational  cost  and  lack  of  statistical  reliability  of  such  criteria;
   the graphical characterization  of tetrad  constraints in linear  DAGs with faithful distributions
was  fully  developed  by  Spirtes  et  al.   (2000)  and  Shafer  et  al.   (1993)  and  provided  the  main
starting  point   for   this  thesis.   Can  we  provide  a  graphical   characterization  for   conditional
tetrad  constraints  that  could  be  used  to  learn  directed  edges  among  indicators?
   more  generally,  a  graphical  characterization  of  rank  constraints  and  other  type  of  covariance
constraints to learn latent variable models, possibly identifying the nature of some impure re-
lationships.   Steps towards such results can be found, e.g., in (Grzebyk et al., 2004; Stanghellini
and  Wermuth,  2005;  Wegelin  et  al.,  2005);
Improving  discrete  models
   new  heuristics  to  increase  the  scalability  of   the  causal   rule  learner  of   Chapter  5,   including
special   treatment   of   sparse  data  such  as  market  basket  data,   one  of   the  main  motivations
behind  association  rule  algorithms  (Agrawal  and  Srikant,  1994);
   computationally  tractacle  approximations for global  models with discrete measurement mod-
els.   Estimating  latent   trait   models   with  a  large  number   of   latents   is   hard.   Even  nding
the  maximum  likelihood  estimator  requires  high-dimensional   integration  (Bartholomew  and
Knott,  1999;  Bartholomew  et  al.,  2002).   Monte  Carlo  approximation  algorithms  (e.g.,  Wedel
and  Kamakura,   2001)   are   out   of   question  for   our   problem  of   model   search  due   to   their
extremely  demanding computational  cost.   Deterministic  approximations, such as the one de-
scribed  by  Chu  and  Ghahramani  (2004)  to  solve  the  problem  of  Bayesian  ordinal  regression,
are the only viable alternatives.   Finding suitable approximations that can be integrated  with
model  search  is  an  open  problem;
Learning  non-linear  latent  structure
   using  constraints  generated  by  higher  order  moments  of  the  observed  distribution.   Although
it  was  stressed  throughout  this  thesis  that  such  constraints  are  more  problematic  in  model
selection  problems  to  the  increased  diculty  on  statistical   estimation,  they  nevertheless  can
be  useful  in  practice  for  small  model  selection  problems.   For  example,  in  problems  partially
solved  by  covariance  constraints.   An  example  of  the  use  of  higher  order  constraints  in  linear
models for non-Gaussian distributions is given by Kano and Shimizu (2003).   Several paramet-
ric  formulations  of  factor  analysis  models with  non-linear relations  exist  (Bollen  and  Paxton,
1998;   Wall   and  Amemiya,   2000;   Yalcin  and  Amemiya,   2001),   but  no  formal   description  of
equivalence  classes  or  systematic  search  procedures  exist  to  the  best  of  our  knowledge;
   in  special,   nding  non-linear  causal  relationships  among  latent  variables  given  a  xed  linear
measurement   model   can  be  seen  as   a  problem  of   regression  with  measurement   error   and
127
instrumental variables (Carroll et al., 2004).   Our techniques for learning measurement models
for   non-linear   structural   models   as   a  way  of   nding  instrumental   variables   could  also  be
adapted  to  this  specic problem.   Moreover,  research  in  non-parametric  item  response theory
(Junker  and  Sijtsma,  2001)  can  also  provide  ideas  for  the  discrete  case;
   moreover,   since  our   algorithms   are  basically  using  information  concerning  dot   products  of
vectors   of   random  variables  (i.e.,   covariance  information),   it   can  be  adapted  to  non-linear
spaces  by  means  of  the  kernel  trick  (Scholkopf  and  Smola,  2002;   Bach  and  Jordan,  2002).
This  basically  consists   on  mapping  the  input  space  to  some  feature  space  by  a  non-linear
transformation.   In  this  feature  space,   algorithms  designed  for  linear  models  (e.g.,   principal
component analysis, Scholkopf and Smola, 2002) can be applied in a relatively straightforward
and  computationally  unexpensive  way.   This  might  be  problematic  if   one  is  interested  in  a
causal   description  of   the   data  generating  process,   but   not   as   much  if   the   goal   is   density
estimation.
This  thesis  was  concluded  roughly  a  hundred  years  after  Charles  Spearman  published  what  is
usually  acknowledged  as  the  rst  application  of  factor  analysis  (Spearman,  1904).   Much  has  been
done  concerning  estimation  of   latent   variable  models   (Bartholomew  et   al.,   2002;   Loehlin,   2004;
Jordan,  1998),   but  little  progress  on  automated  search  of  causal   models  with  latent  variables  was
achieved.   Few  problems  in  automated  learning  and  discovery  are  as  dicult  and  fundamental   as
learning  causal  relations  among  latent  variables  without  background  knowledge  and  experimental
data.   Better  methods  are  available  now,  and  further  improvements  will  surely  come  from  machine
learning  research.
128   Conclusion
Appendix  A
Results  from  Chapter  3
A.1   BuildPureClusters:   renement  steps
Concerning the nal steps of Table 3.2, it might be surprising that we merge clusters of variables that
we  know  cannot  share  a  common  latent  parent  in  the  true  graph.   However,  we  are  not  guaranteed
to  nd  a  large  enough  number  of  pure  indicators  for  each  of  the  original   latent  parents,   and  as  a
consequence  only  a  subset  of  the  true  latents  will   be  represented  in  the  measurement  pattern.   It
might  be  the  case  that,  with  respect  to  the  variables  present  in  the  output,  the  observed  variables
in  two  dierent   clusters   might   be  directly  measuring  some  ancestor   common  to  all   variables   in
these  two  clusters.   As  an  illustration,   consider  the  graph  in  Figure  A.1(a),   where  double-directed
edges  represent  independent  hidden  common  causes.   Assume  any  sensible  purication  procedure
will   choose  to  eliminate  all   elements  in W
2
, W
3
, X
2
, X
3
, Y
2
, Y
3
, Z
2
, Z
3
  because  they  are  directly
correlated  with  a  large  number  of  other  observed  variables  (extra  edges  and  nodes  not  depicted).
Meanwhile,   one  can  verify  that   all   three  tetrad  constraints   hold  in  the  covariance  matrix  of
W
1
, X
1
, Y
1
, Z
1
,   and  therefore  there  will   be  no  undirected  edges  connecting  pairs  of   elements  in
this  set  in  the  corresponding  measurement  pattern.   Rule  CS1  is  able  to  separate  W
1
  and  X
1
  into
two  dierent  clusters  by  using W
2
, W
3
, X
2
, X
3
  as  the  support  nodes,   and  analogously  the  same
happens  to  Y
1
  and  Z
1
,  W
1
  and  Y
1
,  X
1
  and  Z
1
.   However,  no  test  can  separate  W
1
  and  Z
1
,  nor  X
1
and  Y
1
.   If  we  do  not  merge  clusters,  we  will  end  up  with  the  graph  seen  in  Figure  A.1(b)  as  part
of   our  output  pattern.   Although  this  is  a  valid  measurement  pattern,   and  in  some  situations  we
might  want  to  output  such  a  model,   it  is  also  true  that  W
1
  and  Z
1
  measure  a  same  latent  L
0
  (as
well  as  X
1
  and  Y
1
).   It  would  be  problematic  to  learn  a  structural  model  with  such  a  measurement
model.   There is a deterministic relation between the latent measured by W
1
  and Z
1
, and the latent
measured  by  X
1
  and  Y
1
:   they  are  the  same  latent!   Probability  distributions  with  deterministic
relations  are  not  faithful,  and  that  causes  problems  for  learning  algorithms.
Finally,   we  show  examples   where  Steps   6  and  7  of   BuildPureClusters  are  necessary.   In
Figure   A.2(a)   we   have   a  partial   view  of   a  latent   variable   graph,   where   two  of   the   latents   are
marginally independent.   Suppose that nodes X
4
, X
5
  and X
6
  are correlated to many other measured
nodes not in this gure, and therefore are removed  by our purication procedure.   If we ignore Step
6,   the  resulting  pure  submodel   over X
1
, X
2
, X
3
, X
7
, X
8
, X
9
  will   be  the  one  depicted  in  Figure
A.2(b)  (X
1
, X
2
  are  clustered  apart  from X
7
, X
8
, X
9
  because  of  marginal  zero  correlation,  and
X
3
  is  clustered  apart  from X
7
, X
8
, X
9
  because  of  CS1  applied  to X
3
, X
4
, X
5
 X
7
, X
8
, X
9
).
However,   no  linear   latent   variable  model   can  be  parameterized  by  this  graph:   if   we  let  the  two
130   Results  from  Chapter  3
4
L
1
X
2
X
3
X
1   2   3
1   2   3
W
Y   Y   Y
Z   Z   Z
W
1   2
 W
3
L
L
L
L 0
2
3
1
Y
L
0
  L
0
W
1   1   1
  1
Z   X
(a)   (b)
Figure  A.1:   The  true  graph  in  (a)  will  generate  at  some  point  a  puried  measurement  pattern  as
in  (b).   It  is  desirable  to  merge  both  clusters.
X
9
X X
8 7
X
6
X
5
X
4
X
2
X
3
X
1
  X
2
X
3
X
9
X X
8 7
X
1
(a)   (b)
Figure A.2:   Suppose (a) is our true model.   If for some reason we need to remove  nodes X
4
, X
5
  and
X
6
  from  our  nal  pure  graph,  the  result  will  be  as  shown  in  Figure  (b),  unless  we  apply  Step  6  of
BuildPureClusters.   There  are  several  problems  with  (b),  as  explained  in  the  text.
latents  to  be  correlated,   this  will   imply  X
1
  and  X
7
  being  correlated.   If   we  make  the  two  latents
uncorrelated,  X
3
  and  X
7
  will  be  uncorrelated.
Step  7  exists  to  avoid  rare  situations  where  three  observed  variables  are  clustered  together  and
are pairwise part of some foursome entailing all three tetrad constraints with no vanishing marginal
and  partial  correlation,  but  still  should  be  removed  because  they  are  not  simultaneously  in  such  a
foursome.   They  might  not  be  detected  by  Step  4  if,   e.g.,   all   three  of   them  are  uncorrelated  with
all  other  remaining  observed  variables.
A.2   Proofs
Before  we  present  the  proofs  of  our  results,  we  need  a  few  more  denitions:
   a  path  in  a  graph  G  is  a  sequence  of  nodes X
1
, . . . , X
n
  such  that  X
i
  and  X
i+1
  are  adjacent
in  G,   1   i  <  n.   Paths  are  assumed  to  be  simple  by  denition,   i.e.,   no  node  appears  more
than  once.   Notice  there  is  an  unique  set  of  edges  associated  with  each  given  path.   A  path  is
A.2  Proofs   131
E
A
B
C   D
T
A
B
D
E
C
M
C   D A   B
CP
N
(a)   (b)   (c)
Figure  A.3:   In  (a),  C  is a  choke  point  for  sets A, B D, E,  since it  lies  on  all  treks  connecting
nodes   in A, B  to  nodes   in D, E  and  lies   also  on  the D, E  side  of   all   of   such  treks.   For
instance,   C  is  on  the D, E  side  of   A   C   D,   where  A  is  the  source  of   such  a  trek.   Notice
also  that  this  choke  point  d-separates  nodes  in A, B  from  nodes  in D, E.   Analogously,   D  is
also  a  choke  point  for A, B  D, E  (there  is  nothing  on  the  denition  of  a  choke  point  I  J
that  forbids  it  of  belonging  I  J).   In  Figure  (b),  C  is  a  choke  point  for  sets A, B D, E  that
does  not  d-separate  such  elements.   In  Figure  (c),   CP  is  a  node  that  lies  on  all   treks  connecting
A, C  and B, D  but  it   is  not   a  choke  point,   since  it   does  not  lie  on  the A, C  side  of   trek
A  M  CP  B  and  neither  lies  on  the B, D  side  of  D  N  CP  A.   The  same  node,
however,  is  a A, D B, C  choke  point.
into  X
1
  (or  X
n
)  if  the  arrow  of  the  edge X
1
, X
2
  is  into  X
1
  (X
n1
, X
n
  into  X
n
);
   a  collider  on  a  path X
1
, . . . , X
n
  is  a  node  X
i
,   1  <  i   <  n,   such  that  X
i1
  and  X
i+1
  are
parents  of  X
i
;
   a  trek  is  a  path  that  does  not  contain  any  collider;
   the  source  of  a  trek  is  the  unique  node  in  a  trek  to  which  no  arrows  are  directed;
   the  I  side  of  a  trek  between  nodes  I  and  J  with  source  X  is  the  subpath  directed  from  X  to
I.   It  is  possible  that  X  = I,  and  the  I  side  is  just  node  I;
   a choke  point  CP  between  two  sets of nodes I and J is a node that lies on every trek between
any  element  of  I  and  any  element  of  J  such  that  CP  is  either  (i)  on  the  I  side  of  every  such
trek
  1
or  (ii)  on  the  J  side  or  every  such  trek.
With  the   exception  of   choke   points,   all   other   concepts   are   well   known  in  the   literature   of
graphical   models  (Spirtes  et  al.,   2000;   Pearl,   1988,   2000).   What  is  interesting  in  a  choke  point  is
that,   by  denition,   such  a  node  is  in  all   treks  linking  elements  in  two  sets  of  nodes.   Being  in  all
treks connecting  a node X
i
  and a node X
j
  is a necessary condition for a node to d-separate X
i
  and
X
j
,  although  this  is  not  a  sucient  condition.
Consider  Figure  A.3,   which  illustrates  several  dierent  choke  points.   In  some  cases,   the  choke
point  will  d-separate  a  few  nodes.   The  relevant  fact  is  that  even  when  the  choke  point  is  a  latent
variable,   this  has  an  implication  on  the  observed  marginal   distribution,   as  stated  by  the  Tetrad
Representation  Theorem:
1
That  is,  for  every  {I, J}   I J,  CP  is  on  the  I  side  of  every  trek  T  = {I, . . . , X, . . . , J},   X  being  the  source  of
T.
132   Results  from  Chapter  3
Theorem  A.1  (The  Tetrad  Representation  Theorem)   Let G be a linear latent variable model,
and  let   I
1
, I
2
, J
1
, J
2
  be  four  variables  in  G.   Then  
I
1
J
1
I
2
J
2
  =  
I
1
J
2
I
2
J
1
  if   and  only  if   there  is  a
choke  point  between I
1
, I
2
  and J
1
, J
2
.
Proof:   The original  proof was  given  by  Spirtes et  al.  (2000).   Shafer et  al.  (1993)  provide  an  alter-
native  and  simplied  proof.   
Shafer  et  al.  (1993)  also  provide  more  details  on  the  denitions  and  several  examples.
Therefore,   unlike  a  partial   correlation  constraint   obtained  by  conditioning  on  a  given  set   of
variables,  where  such  a  set  should  be  observable,  some  d-separations  due  to  latent  variables  can  be
inferred  using  tetrad  constraints.   We  will use the  Tetrad  Representation  Theorem to  prove most  of
our  results.   The  challenge  lies  on  choosing  the  right  combination  of  tetrad  constraints  that  allows
us to identify latents  and d-separations  due to latents,  since the Tetrad  Representation  Theorem is
far  from  providing  such  results  directly.
In  the  following  proofs,   we  will   frequently  use  the  symbol   G(O)   to  represent  a  linear   latent
variable model with a set of observed nodes O.   A choke point between sets I and J will be denoted
as  I  J.   We  will  rst  introduce  a  lemma  that  is  going  to  be  useful  to  prove  several  other  results.
The  lemma  is  a  slightly  reformulated  version  of  the  one  given  in  Chapter  3  to  include  a  result  on
choke  points:
Lemma 3.4   Let G(O) be a linear  latent variable model,  and let X
1
, X
2
, X
3
, X
4
  O be such that
X
1
X
2
X
3
X
4
  =  
X
1
X
3
X
2
X
4
  =  
X
1
X
4
X
2
X
3
.   If   
AB
 =  0  for  all A, B  X
1
, X
2
, X
3
, X
4
,   then
an  unique  choke  point  P  entails  all   the  given  tetrad  constraints,   and  P  d-separates  all   elements  in
X
1
, X
2
, X
3
, X
4
.
Proof:   Let  P  be  a  choke  point  for  pairs X
1
, X
2
  X
3
, X
4
.   Let  Q  be  a  choke  point  for  pairs
X
1
, X
3
 X
2
, X
4
.   We  will  show  that  P  = Q  by  contradiction.
Assume P = Q.   Because there is a trek that links X
1
  and X
4
  throught P  (since 
X
1
X
4
 = 0), we
have  that  Q  should  also  be  on  that  trek.   Suppose T  is  a  trek  connecting  X
1
  to  X
4
  through  P  and
Q,  and  without  loss  of  generality  assume this  trek  follows  an  order  that  denes three  subtreks:   T
0
,
from X
1
  to P; T
1
, from P  to Q; and T
2
, from Q to X
4
, as illustrated by Figure A.4(a).   In principle,
T
0
  and  T
2
  might  be  empty,  i.e.,  we  are  not  excluding  the  possibility  that  X
1
  = P  or  X
4
  = Q.
There must be at  least  one trek T
Q2
  connecting X
2
  and Q, since Q is on every trek between  X
1
and  X
2
  and  there  is  at  least  one  such  trek  (since  
X
1
X
2
 = 0).   We  have  the  following  cases:
Case  1:   T
Q2
  includes  P.   T
Q2
  has  to  be  into  P,   and  P  =  X
1
,   or  otherwise  there  will   be  a  trek
connecting  X
2
  to  X
1
  through  a  (possibly  empty)  trek  T
0
  that  does  not  include  Q,  contrary  to  our
hypothesis.   For  the  same  reason,   T
0
  has  to  be  into  P.   This  will   imply  that  T
1
  is  a  directed  path
from  P  to  Q,  and  T
2
  is  a  directed  path  from  Q  to  X
4
  (Figure  A.4(b)).
Because  there  is  at  least  one  trek  connecting  X
1
  and  X
2
  (since  
X
1
X
2
 =  0),   and  because  Q  is
on  every  such  trek,  Q  has  to  be  an  ancestor  of  at  least  one  member  of X
1
, X
2
.   Without  loss  of
generality,   assume  Q  is  an  ancestor  of  X
1
.   No  directed  path  from  Q  to  X
1
  can  include  P,  since  P
is  an  ancestor  of  Q  and  the  graph  is  acyclic.   Therefore,  there  is  a  trek  connecting  X
1
  and  X
4
  with
Q  as  the  source  that  does  not  include  P,  contrary  to  our  hypothesis.
A.2  Proofs   133
Q
1
  X
4
T
0
  T
1
  T
2
P X
Q2
1
  X
4
T
0
  T
1
  T
2
X
2
P   Q
T
X
(a)   (b)
2
P
X
  X
X
X
1
3
  4
S
P
X
  X
X
X
1
3
  4
2
(c)   (d)
Figure  A.4:   In  (a),   a  depiction  of   a  trek  T  linking  X
1
  and  X
4
  through  P  and  Q,   creating  three
subtreks labeled as T
0
, T
1
  and T
2
.   Directions in such treks are left unspecied.   In (b), the existence
of  a  trek  T
Q2
  linking  X
2
  and  Q  through  P  will  compel  the  directions  depicted  as  a  consequence  of
the  given  tetrad  and  correlation  constraints  (the  dotted  path  represents  any  possible  continuation
of T
Q2
  that does not coincide with T).   The conguration in (c) cannot happen if P  is a choke point
entailing  all  three tetrads among marginally  dependent nodes X
1
, X
2
, X
3
, X
4
.   The conguration
in (d) cannot happen if P  is a choke point for X
1
, X
3
X
2
, X
4
, since there is a trek X
1
P X
2
such  that  P  is  not  on  the X
1
, X
3
  side  of  it,   and  another  trek  X
2
  S  P  X
3
  such  that  P  is
not  on  the X
2
, X
4
  side  of  it.
Case  2:   T
Q2
  does   not   include  P.   This  is  case  is  similar   to  Case  1.   T
Q2
  has  to  be  into  Q,   and
Q =  X
4
,   or  otherwise  there  will   be  a  trek  connecting  X
2
  to  X
4
  through  a  (possible  empty)  trek
T
2
  that  does  not  include  P,  contrary  to  our  hypothesis.   For  the  same  reason,  T
2
  has  to  be  into  Q.
This  will  imply  that  T
1
  is  a  directed  path  from  Q  to  P,   and  T
0
  is  a  directed  path  from  P  to  X
1
.
An  argument  analogous  to  Case  1  will  follow.
We  will   now  show  that  P  d-separates  all   nodes  in X
1
, X
2
, X
3
, X
4
.   From  the  P  =  Q  result,
we know that P  lies on every trek between  any pair of elements in X
1
, X
2
, X
3
, X
4
.   First consider
the  case  where  at  most  one  element  of X
1
, X
2
, X
3
, X
4
  is  linked  to  P  through  a  trek  that  is  into
P.   By  the  Tetrad  Representation  Theorem,  any  trek  connecting  two  elements  of X
1
, X
2
, X
3
, X
4
goes  through  P.   Since  P  cannot  be  a  collider  on  any  trek,  then  P  d-separates  these  two  elements.
To nish the proof, we only have to show that there are no two elements A, B  X
1
, X
2
, X
3
, X
4
such  that  A  and  B  are  both  connected  to  P  through  treks  that  are  both  into  P.
We  will   prove  that  by  contradiction,   that  is,   assume  without  loss  of  generality  that  there  is  a
trek connecting X
1
  and P  that is into P, and a trek connecting X
2
  and P  that is into P.   If there is
no trek connecting X
1
 and P  that is out of P  neither any trek connecting X
2
  and P  that is out of P,
then there is no trek connecting X
1
  and X
2
, since P  is on every trek connecting these two elements
134   Results  from  Chapter  3
according  to  the  Tetrad  Representation  Theorem.   But  this  implies  
X
1
X
2
  = 0,  a  contradiction,   as
illustrated  by  Figure  A.4(c).
Consider the case where there is also a trek out of P  and into X
2
.   Then there is a trek connect-
ing  X
1
  to  X
2
  through  P  that  is  not  on  the X
1
, X
3
  side  of  pair X
1
, X
3
 X
2
, X
4
  to  which  P
is  a  choke  point.   Therefore,  P  should  be  on  the X
2
, X
4
  of  every  trek  connecting  elements  pairs
in X
1
, X
3
  X
2
, X
4
.   Without  loss  of  generality,   assume  there  is  a  trek  out  of  P  and  into  X
3
(because  if  there  is  no  such  trek  for  either  X
3
  and  X
4
,  we  fall  in  the  previous  case  by  symmetry).
Let  S  be  the  source  of  a  trek  into  P  and  X
2
,  which  should  exist  since  X
2
  is  not  an  ancestor  of  P.
Then  there  is  a  trek  of   source  S  connecting  X
3
  and  X
2
  such  that  P  is  not  on  the X
2
, X
4
  side
of   it  as  shown  in  Figure  A.4(d).   Therefore  P  cannot  be  a  choke  point  for X
1
, X
3
  X
2
, X
4
.
Contradiction.   
Lemma  4.2  Let  G(O)  be  a  linear  latent  variable  model.   If  for  some  set  O
= X
1
, X
2
, X
3
,
X
4
   O,   
X
1
X
2
X
3
X
4
  =  
X
1
X
3
X
2
X
4
  =  
X
1
X
4
X
2
X
3
  and  for   all   triplets A, B, C, A, B 
O
, C   O,   we  have  
AB.C
  =  0  and  
AB
 =  0,   then  no  element   A   O
is   a  descendant   of   an
element  of  O
`A  in  G.
Proof:   Without  loss  of  generality,   assume  for  the  sake  of  contradiction  that  X
1
  is  an  ancestor  of
X
2
.   From  the  given  tetrad  and  correlation  constraints  and  Lemma  3.4,  there  is  a  node  P  that  lies
on  every  trek  between  X
1
  and  X
2
  and  d-separates  these  two  nodes.   Since  P  lies  on  the  directed
path  from  X
1
  to  X
2
,   P  is   a  descendant   of   X
1
,   and  therefore  an  observed  node.   However,   this
implies  
X
1
X
2
.P
  = 0,  contrary  to  our  hypothesis.   
Lemma  4.4  Let   G(O)   be  a  linear   latent   variable  model.   Assume  O
= X
1
, X
2
, X
3
, Y
1
, Y
2
, Y
3
 O.   If  constraints 
X
1
Y
1
X
2
X
3
, 
X
1
Y
1
X
3
X
2
,  
Y
1
X
1
Y
2
Y
3
,  
Y
1
X
1
Y
3
Y
2
, 
X
1
X
2
Y
2
Y
1
  all  hold,  and  that  for
all   triplets A, B, C, A, B   O
,   C   O,   we  have  
AB
 =  0, 
AB.C
 =  0,   then  X
1
  and  Y
1
  do  not
have  a  common  parent  in  G.
Proof:   We will prove this result by contradiction.   Suppose that X
1
  and Y
1
  have a common  parent
L in  G.   Suppose L is not a  choke  point for X
1
, X
2
 Y
1
, X
3
 corresponding to  one  of the tetrad
constraints   given  by  hypothesis.   Because  of   the  trek  X
1
   L   Y
1
,   then  either   X
1
  or   Y
1
  is  a
choke  point.   Without  loss  of   generality,   assume  X
1
  is  a  choke  point  in  this  case.   By  Lemma  4.2
and  the  given  constraints,   X
1
  cannot   be  an  ancestor   of   either   X
2
  or   X
3
,   and  by  Lemma  3.4  it
is   also  the  choke  point   for X
1
, Y
1
  X
2
, X
3
.   That   means   that   all   treks   connecting  X
1
  and
X
2
,   and  X
1
  and  X
3
  should  be  into  X
1
.   Since  there  are  no  treks   between  X
2
  and  X
3
  that   do
not   include  X
1
,   and  all   paths   between  X
2
  and  X
3
  that   include  X
1
  collide   at   X
1
,   that   implies
X
2
X
3
  = 0,  contrary  to  our  hypothesis.   By  symmetry,  Y
1
  cannot  be  a  choke  point.   Therefore,  L  is
a  choke  point  for X
1
, Y
1
 X
2
, X
3
  and  by  Lemma  3.4,  it  also  lies  on  every  trek  for  any  pair  in
S
1
  = X
1
, X
2
, X
3
, Y
1
.
Analogously,   L  is  on  every  trek  connecting  any  pair  from  the  set  S
2
  = X
1
, Y
1
, Y
2
, Y
3
.   It  fol-
lows that L is on every trek connecting any pair from the set S
3
  = X
1
, X
2
, Y
1
, Y
2
, and it is on the
X
1
, Y
1
 side of X
1
, Y
1
X
2
, Y
2
, i.e., L is a choke point that implies 
X
1
X
2
Y
2
Y
1
.   Contradiction.   
Remember  that  predicate  F
1
(X, Y, G)  is  true  if  and  only  if  there  exist  two  nodes  W  and  Z  in
G  such  that  
WXY Z
  and  
WXZY
  are  both  entailed,   all   nodes  in W, X, Y, Z  are  correlated,   and
A.2  Proofs   135
2
2
X
1
1
Y
L
Y
X
2
  Y X
1   T
2
T
1
T
3
  T
4
L S
Y
1
(a)   (b)
Figure  A.5:   Figure  (a)  illustrates  necessary  treks  among  elements  of X
1
, X
2
, Y
1
, Y
2
, L  according
to the assumptions of Lemma 4.5 if we further assume that X
1
  is a choke point for pairs X
1
, X
2
Y
1
, Y
2
 (other  treks might exist).   Figure (b) rearranges (a) by emphasizing that Y
1
  and Y
2
  cannot
be  d-separated  by  a  single  node.
there  is  no  observed  C  in  G  such  that  
AB.C
  = 0  for A, B  W, X, Y, Z.
Lemma  4.5  Let  G(O)  be  a  linear  latent  variable  model.   Assume  O
= X
1
, X
2
, X
3
, Y
1
, Y
2
, Y
3
 O,  such  that  F
1
(X
1
, X
2
, G)  and  F
1
(Y
1
, Y
2
, G)  hold,  Y
1
  is  not  an  ancestor  of  Y
3
  and  X
1
  is  not  an
ancestor  of  X
3
.   If  constraints 
X
1
Y
1
Y
2
X
2
, 
X
2
Y
1
Y
3
Y
2
,   
X
1
X
2
Y
2
X
3
, 
X
1
X
2
Y
2
Y
1
  all   hold,   and  that  for
all   triplets A, B, C, A, B   O
, C   O,   we  have  
AB
 =  0, 
AB.C
 =  0,   then  X
1
  and  Y
1
  do  not
have  a  common  parent  in  G.
Proof:   We  will  prove  this  result  by  contradiction.   Assume  X
1
  and  Y
1
  have  a  common  parent  L.
Because  of  the  tetrad  constraints  given  by  hypothesis  and  the  existence  of  the  trek  X
1
 L Y
1
,
one  node  in X
1
, L, Y
1
  should  be  a  choke  point  for  the  pair X
1
, X
2
  Y
1
, Y
2
.   We  will   rst
show  that  L  has  to  be  such  a  choke  point,  and  therefore  lies  on  every  trek  connecting  X
1
  and  Y
2
,
as  well  as  X
2
  and  Y
1
.   We  then  show  that  L  lies  on  every  trek  connecting  Y
1
  and  Y
2
,  as  well  as  X
1
and X
2
.   Finally, we show that L is a choke point for X
1
, Y
1
X
2
, Y
2
, contrary to our hypothesis.
Step 1:   If there is a common parent L to X
1
  and Y
1
, then L is a X
1
, X
2
Y
1
, Y
2
 choke point.   For
the  sake  of  contradiction,  assume  X
1
  is  a  choke  point  in  this  case.   By  Lemma  4.2  and  assumption
F
1
(X
1
, X
2
, G),  we have that X
1
  is not an ancestor  of X
2
, and therefore all treks connecting X
1
  and
X
2
  should  be  into  X
1
.   Since  
X
2
Y
2
 =  0  by  assumption  and  X
1
  is  on  all   treks  connecting  X
2
  and
Y
2
,   there  must  be  a  directed  path  out  of  X
1
  and  into  Y
2
.   Since  
X
2
Y
2
.X
1
 =  0  by  assumption  and
X
1
  is  on  all  treks  connecting  X
2
  and  Y
2
,  there  must  be  a  trek  into  X
1
  and  Y
2
.   Because  
X
2
Y
1
 = 0,
there  must  be  a  trek  out  of  X
1
  and  into  Y
1
.   Figure  A.5(a)  illustrates  the  conguration.
Since  F
1
(Y
1
, Y
2
, G)  is  true,   by  Lemma  3.4  there  must  be  a  node  d-separating  Y
1
  and  Y
2
  (nei-
ther  Y
1
  nor  Y
2
  can  be  the  choke  point  in  F
1
(Y
1
, Y
2
, G)  because  this  choke  point  has  to  be  latent,
according  to  the  partial   correlation  conditions  of   F
1
).   However,   by  Figure  A.5(b),   treks  T
2
  T
3
and  T
1
  T
4
  cannot  both  be  blocked  by  a  single  node.   Contradiction.   Therefore  X
1
  cannot  be  a
choke  point  for X
1
, X
2
 Y
1
, Y
2
  and,  by  symmetry,  neither  can  Y
1
.
136   Results  from  Chapter  3
X
2
Y
2
  Y
1
P
  T
PY
X
2
Y
2
  Y
1
X
1
+1
Y
P
L
X
2
Y
2
  Y
1
X
1
+1
Y
P
L
1
Y
(a)   (b)   (c)
Figure  A.6:   In  (a),   a  depiction  of  T
Y
  and  T
X
,   where  edges  represent  treks  (T
X
  can  be  seen  more
generally  as the combination  of the  solid  edge between  X
2
  and  P  concatenated  with  a dashed edge
between  P  and  Y
1
  representing  the  possibility  that  T
Y
  and  T
X
  might  intersect  multiple  times  in
T
PY
,  but  in  principle  do  not  need  to  coincide  in  T
PY
  if  P  is  not  a  choke  point.)   In  (b),  a  possible
congurations of edges < X
1
, P  > and < P, Y
+1
  > that do not collide in P, and P  is a choke point
(and  Y
+1
 =  Y ).   In  (c),   the  edge  <  Y
1
, P  >  is  compelled  to  be  directed  away  from  P  because  of
the  collider  with  the  other  two  neighbors  of  P.
Step  2:   L  is  on  every  trek  connecting  Y
1
  and  Y
2
  and  on  every  trek  connecting  X
1
  and  X
2
.   Let L be
the  choke  point  for  pairs X
1
, X
2
  Y
1
, Y
2
.   As  a  consequence,   all   treks  between  Y
2
  and  X
1
  go
through  L.   All   treks  between  X
2
  and  Y
1
  go  through  L.   All   treks  between  X
2
  and  Y
2
  go  through
L.   Such  treks  exist,  since  no  respective  correlation  vanishes.
Consider the given hypothesis 
X
2
Y
1
Y
2
Y
3
  = 
X
2
Y
3
Y
2
Y
1
, corresponding to a choke point X
2
, Y
2
Y
1
, Y
3
.   From  the  previous  paragraph,  we  know  there  is  a  trek  linking  Y
2
  and  L.   L  is  a  parent  of
Y
1
  by  construction.   That  means  Y
2
  and  Y
1
  are  connected  by  a  trek  through  L.
We  will  show  by contradiction  that  L  is on  every  trek  connecting  Y
1
  and  Y
2
.   Assume there  is  a
trek T
Y
  connecting Y
2
  and Y
1
  that does not contain L.   Let P  be the rst point of intersection of T
Y
and  a  trek  T
X
  connecting  X
2
  to  Y
1
,  starting  from  X
2
.   If  T
Y
  exists,  such  point  should  exist,   since
T
Y
  should contain  a choke point X
2
, Y
2
Y
1
, Y
3
,  and all treks connecting X
2
  and Y
1
  (including
T
X
)  contain  the  same  choke  point.
Let  T
PY
  be  the  subtrek  of  T
Y
  starting  on  P  and  ending  one  node  before  Y
1
.   Any  choke  point
X
2
, Y
2
 Y
1
, Y
3
  should  lie  on  T
PY
  (Figure  A.6(a)).   (Y
1
  cannot  be  such  a  choke  point,  since  all
treks  connecting  Y
1
  and  Y
2
  are  into  Y
1
,  and  by  hypothesis  all  treks  connecting  Y
1
  and  Y
3
  are  into
Y
1
.   Since  all   treks  connecting  Y
2
  and  Y
3
  would  need  to  go  through  Y
1
  by  denition,   then  there
would  be  no  such  trek,  implying  
Y
2
Y
3
  = 0,  contrary  to  our  hypothesis.)
Assume  rst  that  X
2
 = P  and  Y
2
 = P.   Let  X
1
  be  the  node  before  P  in  T
X
  starting  from  X
2
.
Let  Y
1
  be  the  node  before  P  in  T
Y
  starting  from  Y
2
.   Let  Y
+1
  be  the  node  after  P  in  T
Y
  starting
from  Y
2
  (notice  that  it  is  possible  that  Y
+1
  = Y
1
).   If  X
1
  and  Y
+1
  do  not  collide  on  P  (i.e.,   there
is  no structure X
1
 P Y
+1
),  then  there  will  be a  trek  connecting  X
2
  to  Y
1
  through  T
PY
  after
P.   Since  L  is  not  in  T
PY
,   L  should  be  before  P  in  T
X
.   But  then  there  will   be  a  trek  connecting
X
2
  and  Y
1
  that  does  not  intersect  T
PY
,   which  is  a  contradiction  (Figure  A.6(b)).   If   the  collider
does exist,  we have  the edge P Y
+1
.   Since no collider  Y
1
 P Y
+1
  can  exist  because T
Y
  is a
trek, the edge between  Y
1
  and P  is out of P.   But that forms a trek connecting X
2
  and Y
2
  (Figure
A.2  Proofs   137
Y
2
  Y
3
Y
1
X
1
L M
Y
2
  Y
1
  Y
3
X
2
M   L
3
X   X
1
(a)   (b)
Figure  A.7:   In  (a),  Y
2
  and  X
1
  cannot  share  a  parent,  and  because  of  the  given  tetrad  constraints,
L  should d-separate  M  and  Y
3
.   Y
3
  is  not  a  child  of  L either,  but there  will  be  a  trek  linking L  and
Y
3
.   In (b),  an  (invalid)  conguration  for X
2
  and  X
3
, where they share an  ancestor  between  M  and
L.
A.6(c)),  and since L is in every trek between X
2
  and Y
2
  and T
Y
  does not contain L, then T
X
  should
contain  L  before  P,  which  again  creates  a  trek  between  X
2
  and  Y
1
  that  does  not  intersect  T
PY
.
If   X
2
  =  P,   then  T
PY
  has   to  contain  L,   because  every  trek  between  X
2
  and  Y
1
  contains   L.
Therefore,   X
2
 =  P.   If   Y
2
  =  P,   then  because  every  trek  between  X
2
  and  Y
2
  should  contain  L,
we  again  have  that  L  lies  in  T
X
  before  P,   which  creates  a  trek  between  X
2
  and  Y
1
  that  does  not
intersect  T
PY
.   Therefore, we  showed  by contradiction  that  L lies  on  every  trek between  Y
2
  and  Y
1
.
Consider  now  the  given  hypothesis  
X
1
X
2
X
3
Y
2
  = 
X
1
Y
2
X
3
X
2
,  corresponding  to  a  choke  point
X
2
, Y
2
X
1
, X
3
.   By symmetry with the previous case, all treks between X
1
  and X
2
  go through
L.
Step  3:   If  L  exists,  so  does  a  choke  point X
1
, Y
1
 X
2
, Y
2
.   By  the  previous steps,  L intermedi-
ates  all  treks  between  elements  of  the  pair X
1
, Y
1
  X
2
, Y
2
.   Because  L  is  a  common  parent  of
X
1
, Y
1
, it lies on the X
1
, Y
1
 side of every trek connecting pairs of elements in X
1
, Y
1
X
2
, Y
2
.
L  is  a  choke  point  for  this  pair.   This  implies  
X
1
X
2
Y
2
Y
1
.   Contradiction.   
Lemma  3.8  Let  G(O)  be  a  linear  latent  variable  model.   Let  O
= X
1
, X
2
, X
3
, Y
1
, Y
2
, Y
3
 O.   If  constraints 
X
1
Y
1
Y
2
Y
3
, 
X
1
Y
1
Y
3
Y
2
,  
X
1
Y
2
X
2
X
3
,  
X
1
Y
2
X
3
X
2
, 
X
1
Y
3
X
2
X
3
, 
X
1
Y
3
X
3
X
2
,
X
1
X
2
Y
2
Y
3
  all   hold,   and  that   for   all   triplets A, B, C, A, B   O
, C   O,   we  have   
AB
  =
0, 
AB.C
 = 0,  then  X
1
  and  Y
1
  do  not  have  a  common  parent  in  G.
Proof:   We  will   prove  this  result  by  contradiction.   Suppose  X
1
  and  Y
1
  have  a  common  parent  L
in  G.   Since  all   three  tetrads  hold  in  the  covariance  matrix  of X
1
, Y
1
, Y
2
, Y
3
,   by  Lemma  3.4  the
choke  point  that  entails  these  constraints  d-separates  the  elements  of X
1
, Y
1
, Y
2
, Y
3
.   The  choke
point  should  be  in  the  trek  X
1
   L   Y
1
,   and  since  it  cannot  be  an  observed  node  because  by
hypothesis  no  d-separation  conditioned  on  a  single  node  holds  among  elements  of X
1
, Y
1
, Y
2
, Y
3
,
L  has  to  be  a  latent  choke  point  for  all  pairs  of  pairs  in X
1
, Y
1
, Y
2
, Y
3
.
It  is  also  given  that 
X
1
Y
2
X
2
X
3
, 
X
1
Y
2
X
3
X
2
, 
X
1
Y
1
Y
2
Y
3
, 
X
1
Y
1
Y
3
Y
2
  holds.   Since  it  is  the  case  that
X
1
X
2
Y
2
Y
3
,  by  Lemma  4.4  X
1
  and  Y
2
  cannot  share  a  parent.   Let  T
ML
  be  a  trek  connecting  some
parent  M  of  Y
2
  and  L.   Such  a  trek  exists  because  
X
1
Y
2
 = 0.
We will show by contradiction  that there is no node in T
ML
`L that is connected to Y
3
  by a trek
that  does  not  go  through  L.   Suppose  there  is  such  a  node,   and  call   it  V .   If   the  trek  connecting
V   and  Y
3
  is  into  V ,  and  since  V   is  not  a  collider  in  T
ML
,  then  V   is  either  an  ancestor  of  M  or  an
138   Results  from  Chapter  3
ancestor  of  L.   If  V   is  an  ancestor  of  M,  then  there  will  be  a  trek  connecting  Y
2
  and  Y
3
  that  is  not
through  L,  which  is  a  contradiction.   If  V   is  an  ancestor  of  L  but  not  M,  then  both  Y
2
  and  Y
3
  are
d-connected  to  a  node  V   is  a  collider  at  the  intersection  of   such  d-connecting  treks.   However,   V
is  an  ancestor  of   L,   which  means  L  cannot  d-separate  Y
2
  and  Y
3
,   a  contradiction.   Finally,   if   the
trek  connecting  V   and  Y
3
  is  out  of   V ,   then  Y
2
  and  Y
3
  will   be  connected  by  a  trek  that  does  not
include L,  which  again  is not  allowed.   We  therefore  showed  there  is no node with  the  properties of
V .   This  conguration  is  illustrated  by  Figure  A.7(a).
Since  all  three  tetrads  hold  among  elements  of X
1
, X
2
, X
3
, Y
2
,  then  by Lemma  3.4,  there  is  a
single  choke  point  P  that  entails  such  tetrads  and  d-separates  elements  of  this  set.   Since  T
ML
  is  a
trek  connecting  Y
2
  to  X
1
  through  L,  then  there  are  three  possible  locations  for  P  in  G:
Case   1:   P   =  M.   We   have   all   treks   between  X
3
  and  X
2
  go  through  M  but   not   through  L,
and  some  trek  from  X
1
  to  Y
3
  goes  through  L  but  not  through  M.   No  choke  point  can  exist  for
pairs X
1
, X
3
  X
2
, Y
3
,   which  by  the  Tetrad  Representation  Theorem  means  that  the  tetrad
X
1
Y
3
X
2
X
3
  = 
X
1
X
2
Y
3
X
3
  cannot  hold,  contrary  to  our  hypothesis.
Case  2:   P  lies  between  M  and  L  in  T
ML
.   This  conguration  is  illustrated  by  Figure  A.7(b).   As
before,  no  choke  point  exists  for  pairs X
1
, X
3
 X
2
, Y
3
,  contrary  to  our  hypothesis.
Case  3:   P  =  L.   Because  all   three  tetrads  hold  in X
1
, X
2
, X
3
, Y
3
  and  L  d-separates  all   pairs  in
X
1
, X
2
, X
3
,   one  can  verify  that  L  d-separates  all   pairs  in X
1
, X
2
, X
3
, Y
3
.   This  will   imply  a
X
1
, Y
3
 X
2
, Y
2
  choke  point,  contrary  to  our  hypothesis.   
Theorem  3.10  The  output  of  FindPattern  is  a  measurement  pattern  with  respect  to  the  tetrad
and  vanishing  partial   correlation  constraints  of  
Proof:   Two  nodes  will  not  share  a  common  latent  parent  in  a  measurement  pattern  if  and  only  if
they  are  not  linked  by  an  edge  in  graph  C  constructed  by  algorithm  FindPattern  and  that  hap-
pens if and only if some partial correlation  vanishes or if any of rules CS1, CS2 or CS3 applies.   But
then  by  Lemmas  4.4,  4.5,  3.8  and  the  equivalence  of  vanishing  partial  correlations  and  conditional
independence in  linearly  faithful  distributions (Spirtes  et  al.,  2000)  the  claim  is  proved.   The  claim
about  undirected  edges  follows  from  Lemma  4.2.   
Theorem  3.11  Given  a  covariance  matrix    assumed  to  be  generated  from  a  linear  latent   vari-
able  model   G(O)  with  latent  variables  L,  let  G
out
  be  the  output  of  BuildPureClusters()  with
observed  variables  O
out
  O  and  latent  variables  L
out
.   Then  G
out
  is  a  measurement  pattern,   and
there  is  an  injective  mapping  M  : L
out
 L  with  the  following  properties:
1.   Let   L
out
   L
out
.   Let   X  be   the   children  of   L
out
  in  G
out
.   Then  M(L
out
)   d-separates   any
element  X  X  from  O
out
`X  in  G;
2.   M(L
out
)  d-separates  X  from  every  latent  in  G  for  which  M
1
(.)  exists;
3.   Let   O
  O
out
  be  such  that   each  pair  in  O
with
latent   parent   L
out
  in  G
out
  is  not   a  descendant   of   M(L
out
)   in  G,   or   has  a  hidden  common
cause  with  it;
A.2  Proofs   139
Proof:   We  will   start  by  showing  that  for  each  cluster  Cl
i
  in  G
out
,   there  exists  an  unique  latent
L
i
  in  G  that  d-separates  all  elements  of  Cl
i
.   This  shows  the  existance  of  an  unique  function  from
latents  in  G
out
  to  latents  in  G.   We  then  proceed  to  prove  the  three  claims  given  in  the  theorem,
and  nish  by  proving  that  the  given  function  is  injective.
Let  Cl
i
  be  a  cluster  in  a  non-empty  G
out
.   Cl
i
  has  three  elements   X, Y   and  Z,   and  there  is
at  least   some  W  in  G
out
  such  that  all   three  tetrad  constraints   hold  in  the  covariance  matrix  of
W, X, Y, Z,   where  no  pair  of  elements  in X, Y, Z  is  marginally  d-separated  or  d-separated  by
an  observable  variable.   By  Lemma  3.4,   it  follows  that  there  is  an  unique  latent  L
i
  d-separating
X,  Y   and  Z.   If  Cl
i
  has  more  than  three  elements,  it  follows  that  since  no  node  other  than  L
i
  can
d-separate  all   three  elements  in X, Y, Z,   and  any  choke  point  for W
, X, Y, Z, W
  Cl
i
,   will
d-separate all elements in W
X
,   where  
2
X
  is  the  variance
of  
X
.   We  instantiate  them  by  the  linear  regression  values,   i.e.,   
X
  =  
XL
X
/
2
L
X
,   and  
2
X
  is  the
respective  residual  variance.   The  set 
X
  
2
X
  of  all   
X
  and  
2
X
,   along  with  the  parameters
used  in  
L
(),  is  our  full  set  of  parameters  .
Our  denition  of  linear  latent  variable  model   requires  
Y
  =  0,   
X
L
X
  =  0  and  
X
L
Y
  =  0,
for all X = Y .   This corresponds to a covariance matrix () of the observed variables with entries
dened  as:
E[X
2
]() = 
2
X
() = 
2
X
2
L
X
  +
2
X
E[XY ]() = 
XY
 () = 
X
L
X
L
Y
To prove the theorem, we have to show that 
O
out
  = () by showing that correlations between
dierent  residuals,  and  residuals  and  latent  variables,  are  actually  zero.
The  relation  
X
L
X
  =  0  follows   directly  from  the  fact   that   
X
  is   dened  by  the  regression
coecient  of  X  on  L
X
.   Notice  that  if  X  and  L
X
  do  not  have  a  common  ancestor,  
X
  is  the  direct
eect  of  L
X
  in X  with  respect to  G
out
.   As we  know,  by Theorem 3.11,  at  most one  variable  in  any
set  of  correlated  variables  will  not  fulll  this  condition.
We  have  to  show  also  that  
XY
  = 
XY
 ()  for  any  pair  X, Y   in  G
out
.   Residuals  
X
  and  
Y
  are
uncorrelated  due  to  the  fact  that  X  and  Y   are  independent  given  their  latent  ancestors  in  G
out
,
and  therefore  
Y
  =  0.   To  verify  that  
X
L
Y
  =  0  is  less  straightforward,   but  one  can  appeal  to
the  graphical   formulation  of  the  problem.   In  a  linear  model,  the  residual  
X
  is  a  function  only  of
the variables that are not independent of X  given  L
X
.   None of this variables can  be nodes in G
out
,
since  L
X
  d-separates  X  from  all   such  variables.   Therefore,   given  L
X
  none  of   the  variables  that
dene  
X
  can  be  dependent  on  L
Y
 ,  implying  
X
L
Y
  = 0.   
Theorem  3.13  Problem {
3
is  NP-complete.
A.3  Implementation   141
Proof:   Direct  reduction  from  the  3-SAT  problem:   let  S  be  a  3-CNF  formula  from  which  we  want
to  decide  if   there  is  an  assignment  for  its  variables  that  makes  the  expression  true.   Dene  G  as
a  latent   variable  graph  with  a  latent   node  L
i
  for   each  clause  C
i
  in  M,   with  an  arbitrary  fully
connected  structural  model.   For  each  latent  in  G,   add  ve  pure  children.   Choose  three  arbitrary
children  of  each  latent  L
i
,   naming  them C
1
i
 , C
2
i
 , C
3
i
.   Add  a  bi-directed  edge  C
p
i
    C
q
j
  for  each
pair  C
p
i
 , C
q
j
, i =  j,   if   and  only  that  they  represent  literals  over  the  same  variable  but  of   opposite
values.   As in the maximum clique problem, one can  verify that there is a pure submodel of G with
at  least  three  indicators  per  latent  if  and  only  if  S  is  satisable.   
The  next  corollay  suggests  that  even  an  invalid  measurement  pattern  could  be  used  in  Build-
PureClusters instead  of the output of  FindPattern.   However,  an  arbitrary (invalid)  measure-
ment  pattern  is  unlikely  to  be  informative  at  all  after  being  puried.   In  constrast,  FindPattern
can  be  highly  informative.
Corollary  3.14  The  output  of  BuildPureClusters  retains  its  guarantees  even  when  rules  CS1,
CS2  and  CS3  are  applied  an  arbitrary  number  of  times  in  FindPattern  for  any  arbitrary  subset
of  nodes  and  an  arbitrary  number  of  maximal   cliques  is  found.
Proof:   Independently  of   the  choice  made  on  Step  2  of   BuildPureClusters  and  which  nodes
are  not  separated  into  dierent  cliques  in FindPattern, the  exhaustive  verication  of tetrad  con-
straints  by  BuildPureClusters  provides  all   the  necessary  conditions  for  the  proof  of  Theorem
3.11.   
Corollary 3.16  Given  a  covariance  matrix    assumed  to  be  generated  from  a  linear  latent  variable
model   G,  and  G
out
  the  output  of  BuildPureClusters  given  ,  the  output  of  PC-MIMBuild  or
FCI-MIMBuild  given  (, G
out
)  returns  the  correct   Markov  equivalence  class  of   the  latents  in  G
corresponding  to  latents  in  G
out
  according  to  the  mapping  implicit  in  BuildPureClusters
Proof:   By  Theorem  3.11,   each  observed  variable  is  d-separated  from  all   other  variables  in  G
out
given  its  latent  parent.   By  Theorem  3.12,   one  can  parameterize  G
out
  as  a  linear  model  such  that
the  observed  covariance  matrix  as  a  function  of   the  parameterized  G
out
  equals  its  corresponding
marginal of .   By Theorem 3.15,  the rank test using the measurement model of G
out
  is therefore a
consistent independence test of latent variables.   The rest follows immediately from the consistency
property  of  PC  and  FCI  given  a  valid  oracle  for  conditional  independencies.   
A.3   Implementation
Statistical   tests   for   tetrad   constraints   are   described   by   Spirtes   et   al.   (2000).   Although   it   is
known  that   in  practice   constraint-based  approaches   for   learning  graphical   model   structure   are
outperformed  on  accuracy  by  score-based  algorithms  such  as  GES  (Chickering,   2002),   we  favor  a
constraint-based  approach  due mostly to  computational  eciency.   Moreover,  a  smart implementa-
tion  of  can  avoid  many  statistical   shortcomings.
142   Results  from  Chapter  3
A.3.1   Robust  purication
We do avoid a constraint-satisfaction approach for purication.   At least for a xed p-value and using
false  discovery  rates  to  control   for  multiplicity  of   tests,   purication  by  testing  tetrad  constraints
often throws away  many more nodes than necessary when the number of variables is relative  small,
and  does  not  eliminate  many  impurities  when  the  number  of  variables  is  too  large.   We  suggest  a
robust  purication  approach  as  follows.
Suppose we  are  given  a  clustering  of  variables  (not  necessarily  disjoint  clusters)  and  a  undirect
graph  indicating  which  variables  might  be ancestors  of  each  other,  analogous  to  the  undirect edges
generated  in  FindPattern.   We  purify  this  clustering  not  by  testing  multiple  tetrad  constraints,
but  through  a  greedy  search  that  eliminates  nodes  from  a  linear  measurement  model  that  entails
tetrad  constraints.   This  is  iterated  till   the  current  model   ts  the  data  according  to  a  chi-square
test  of  signicance  (Bollen,  1989)  and  a  given  acceptance  level.   Details  are  given  in  Table  A.1.
This   implementation  is   used  as   a   subroutine   for   a   more   robust   implementation  of   Build-
PureClusters  described  in  the  next  section.   However,   it  can  be  considerably  slow.   An  alterna-
tive  is using the approximation  derived by Kano and Harada  (2000)  to rapidly calculate  the tness
of  a  factor  analysis  model  when  a  variable  is  removed.   Another  alternative  is  a  greedy  search  over
the  initial  measurement  model,  freeing  correlations  of  pairs  of  measured  variables.   Once  we  found
which  variables  are  directly  connected,   we  eliminate  some  of  them  till   no  pair  is  impure.   Details
of  this  particular  implementation  are  given  by  Silva  and  Scheines  (2004).   In  our  experiments  with
synthetic  data,  it  did  not  work  as  well  as  the  iterative  removal  of  variables  described  in  Table  A.1.
However,   we  do  apply  this  variation  in  the  last  experiment  described  in  Section  6,   because  it  is
computationally  cheaper.   If   the  model   search  in  RobustPurify  does  not  t  the  data  after  we
eliminate  too  many  variables  (i.e.,   when  we  cannot  statistically  test  the  model)  we  just  return  an
empty  model.
A.3.2   Finding  a  robust  initial   clustering
The main problem of applying FindPattern directly by using statistical tests of tetrad constraints
is the  number of  false  positives:   accepting  a  rule (CS1,  CS2,  or  CS3) as  true when  it  does not hold
in  the  population.   One  can  see  that  might  happen  relatively  often  when  there  are  large  groups  of
observed variables that are pure indicators of some latent:   for instance, assume there is a latent  L
0
with  10  pure  indicators.   Consider  applying  CS1  to  a  group  of  six  pure  indicators  of  L
0
.   The  rst
two  constraints  of  CS1  hold  in  the  population,  and  so  assume  they  are  correctly  identied  by  the
statistical   test.   The  last  constraint,  
X
1
X
2
Y
1
Y
2
 = 
X
1
Y
2
X
2
Y
1
,  should  not  hold  in  the  population,
but  will   not  be  rejected  by  the  test  with  some  probability.   Since  there  are  10!/(6!4!)  =  210  ways
of   CS1  being  wrongly  applied  due  to  a  statistical   mistake,   we  will   get  many  false  positives  in  all
certainty.
We can highly minimize this problem by separating groups of variables instead of pairs.   Consider
the  test  DisjointGroup(X
i
, X
j
, X
k
, Y
a
, Y
b
, Y
c
; ):
   DisjointGroup(X
i
, X
j
, X
k
, Y
a
, Y
b
, Y
c
; )  =  true  if  and  only  if  CS1  returns  true  for  all   sets
X
1
, X
2
, X
3
, Y
1
, Y
2
, Y
3
, where X
1
, X
2
, X
3
 is a permutation of X
i
, X
j
, X
k
 and Y
1
, Y
2
, Y
3
is   a  permutation  of Y
a
, Y
b
, Y
c
.   Also,   we   test   an  extra  redundant   constraint:   for   every
pair X
1
, X
2
  X
i
, X
j
, X
k
  and  every  pair Y
1
, Y
2
  Y
a
, Y
b
, Y
c
  we  also  require  that
X
1
Y
1
X
2
Y
2
  = 
X
1
Y
2
X
2
Y
1
.
A.3  Implementation   143
Algorithm RobustPurify
Inputs: Clusters, a set of subsets of some set O;
        C, an undirected graph over O;
        Σ, a sample covariance matrix of O.

1. Remove all nodes that appear in more than one set in Clusters.
2. For all pairs of nodes that belong to two different sets in Clusters and are adjacent in C, remove the one from the largest cluster, or the one from the smallest cluster if that cluster has fewer than three elements.
3. Let G be a graph. For each set S ∈ Clusters, add all nodes in S to G and a new latent as the only common parent of all nodes in S. Create an arbitrary full DAG among the latents.
4. For each variable V in G, fit the graph G(V) obtained by removing V from G, and eliminate the variable whose G(V) has the smallest chi-square score. If some latent ends up with fewer than two children, remove it. Iterate until a given significance level is achieved.
5. Merge clusters if that improves the fit. Iterate Steps 4 and 5 until no improvement can be made.
6. Eliminate all clusters with fewer than three variables and return G.

Table A.1: A score-based purification.
Notice that it is much harder to obtain a false positive with DisjointGroup than, say, with CS1 applied to a single pair. This test can be implemented in stages: for instance, if there is no foursome including X_i and Y_a for which all tetrad constraints hold, then we do not consider X_i and Y_a in DisjointGroup.
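A minimal sketch of DisjointGroup in this spirit is given below. It assumes a user-supplied predicate cs1(S, x1, x2, x3, y1, y2, y3) (a hypothetical name) implementing the CS1 test on a covariance matrix, and it uses a simple tolerance check for the redundant tetrad constraint instead of a statistical test.

    from itertools import permutations, combinations

    def disjoint_group(S, xs, ys, cs1, tol=1e-6):
        # xs, ys: triples of variable indices; S: a covariance matrix.
        # CS1 must hold for every permutation of each triple.
        for px in permutations(xs):
            for py in permutations(ys):
                if not cs1(S, *px, *py):
                    return False
        # Redundant constraint: every X-pair and Y-pair must satisfy
        # sigma_{X1 Y1} sigma_{X2 Y2} = sigma_{X1 Y2} sigma_{X2 Y1}.
        for x1, x2 in combinations(xs, 2):
            for y1, y2 in combinations(ys, 2):
                if abs(S[x1, y1] * S[x2, y2] - S[x1, y2] * S[x2, y1]) > tol:
                    return False
        return True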
Based on DisjointGroup, we propose here a modification to increase the robustness of BuildPureClusters, the RobustBuildPureClusters algorithm, given in Table A.2. It starts with a first step called FindInitialSelection (Table A.3). The goal of FindInitialSelection is to find a pure model using only DisjointGroup instead of CS1, CS2 or CS3. This pure model is then used as a starting point for learning a more complete model in the remaining stages of RobustBuildPureClusters.

In FindInitialSelection, if a pair {X, Y} cannot be separated into different clusters, but also does not participate in any successful application of DisjointGroup, then this pair will be connected by a GRAY or YELLOW edge: this indicates that these two nodes cannot appear together in a pure submodel with three indicators per latent. Otherwise, these nodes are "compatible", meaning that they might be in such a pure model. This is indicated by a BLUE edge.
In FindInitialSelection we then find cliques of compatible nodes (Step 8)². Each clique is a candidate for a one-factor model (a latent model with one latent only). We purify every clique found in order to create pure one-factor models (Step 9). This avoids using clusters that are large not because all their elements are unique children of the same latent, but because there was no way of separating those elements. This adds considerably more computational cost to the whole procedure.
After we find pure one-factor models M_i, we search for a combination of compatible groups. Step 10 first indicates which pairs of one-factor models cannot be part of a pure model with three indicators each: if M_i and M_j do not jointly form a two-factor model with three pure indicators per latent (as tested by DisjointGroup), they cannot both be part of a valid solution.
ChooseClusteringClique is a heuristic designed to find a large set of one-factor models

²Any algorithm can be used to find maximal cliques. Notice that, by the anytime properties of our approach, one does not need to find all maximal cliques.
Algorithm RobustBuildPureClusters
Input: Σ, a sample covariance matrix of a set of variables O

1. (Selection, C, C_0) ← FindInitialSelection(Σ).
2. For every pair of nonadjacent nodes {N_1, N_2} in C where at least one of them is not in Selection and an edge N_1 − N_2 exists in C_0, add a RED edge N_1 − N_2 to C.
3. For every pair of nodes linked by a RED edge in C, apply successively rules CS1, CS2 and CS3. Remove the edge between every pair corresponding to a rule that applies.
4. Let H be a complete graph where each node corresponds to a maximal clique in C.
5. FinalClustering ← ChooseClusteringClique(H).
6. Return RobustPurify(FinalClustering, C, Σ).

Table A.2: A modified BuildPureClusters algorithm.
(nodes of H) that can be grouped into a pure model with three indicators per latent (we need a heuristic since finding a maximum clique in H is NP-hard). First, we define the size of a clustering candidate H_candidate (a set of nodes from H) as the number of variables that remain according to the following elimination criteria: 1. eliminate all variables that appear in more than one one-factor model inside H_candidate; 2. for each pair of variables {X_1, X_2} such that X_1 and X_2 belong to different one-factor models in H_candidate, if there is an edge X_1 − X_2 in C, then we remove one element of {X_1, X_2} from H_candidate (i.e., we guarantee that no pair of variables from different clusters that were not shown to have any common latent parent will exist in H_candidate). We eliminate the one that belongs to the largest cluster, unless the smallest cluster has fewer than three elements, in order to avoid extra fragmentation; 3. eliminate clusters that have fewer than three variables.
The motivation for this heuristic is that we expect a candidate with a large size to retain a large number of variables after purification. Our suggested implementation of ChooseClusteringClique tries to find a good model using a very simple hill-climbing algorithm that starts from an arbitrary node in H and adds to the current candidate the cluster that increases its size the most while still forming a maximal clique in H. We stop when we cannot increase the size of the candidate. This is computed using each node in H as a starting point, and the largest candidate is returned by ChooseClusteringClique.
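As an illustration, the hill-climbing loop might look like the sketch below. It assumes a function clustering_size implementing the three elimination criteria above (not shown), represents H as a dictionary mapping each node (a hashable cluster label, e.g., an integer) to the set of its neighbors, and is a sketch rather than the implementation used in the experiments.

    def choose_clustering_clique(H, clustering_size):
        # H: dict mapping each one-factor model (node of H) to its neighbor set.
        best, best_size = set(), -1
        for start in H:                              # hill-climb from every node of H
            candidate = {start}
            while True:
                # extensions that keep the candidate a clique in H
                ext = [m for m in H if m not in candidate
                       and all(m in H[n] for n in candidate)]
                scored = [(clustering_size(candidate | {m}), m) for m in ext]
                if not scored or max(scored)[0] <= clustering_size(candidate):
                    break
                candidate.add(max(scored)[1])        # greedily add the best extension
            size = clustering_size(candidate)
            if size > best_size:
                best, best_size = candidate, size
        return best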
A.3.3  Clustering refinement
The next steps in RobustBuildPureClusters are basically the FindPattern algorithm of Table 3.1 with a final purification. The main difference is that we no longer check whether pairs of nodes in the initial clustering given by Selection should be separated. The intuition explaining the usefulness of this implementation is as follows: if there is a group of latents forming a pure subgraph of the true graph with a large number of pure indicators for each latent, then the initial step should identify such a group. The subsequent steps will refine this solution without the risk of splitting the large clusters of variables, which are exactly the ones most likely to produce false positive decisions. RobustBuildPureClusters has the power of identifying the latents with large sets of pure indicators and refining this solution with more flexible rules, covering also cases where DisjointGroup fails.

Notice that the order in which tests are applied might influence the outcome of the algorithms,
since if we remove an edge X − Y in C at some point, then we are excluding the possibility of using some tests in which X and Y are required. Imposing such a restriction reduces the overall computational cost and the number of statistical mistakes. To minimize the ordering effect, one option is to run the algorithm multiple times and select the output with the highest number of nodes.
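A minimal way to implement this multiple-run strategy is sketched below; the data matrix is assumed to be an n-by-p NumPy array, and the search routine robust_build_pure_clusters and the helper count_nodes are hypothetical names assumed to be available.

    import random

    def robust_bpc_multiple_runs(data, runs=10, seed=0):
        rng = random.Random(seed)
        best = None
        for _ in range(runs):
            order = list(range(data.shape[1]))
            rng.shuffle(order)                 # vary the order in which tests are applied
            model = robust_build_pure_clusters(data[:, order])
            if best is None or count_nodes(model) > count_nodes(best):
                best = model
        return best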
A.4  The spiritual coping questionnaire

The following questionnaire is provided to facilitate understanding of the religious/spiritual coping example given in Section 3.5.2. It can also serve as an example of how questionnaires are actually designed.
Section I  This section intends to measure the level of stress of the subject. In the actual questionnaire, it starts with the following instructions:

Circle the number next to each item to indicate how stressful each of these events has been for you since you entered your graduate program. If you have never experienced one of the events listed below, then circle number 1. If one of the events listed below has happened to you and has caused you a great deal of stress, rate that event toward the "Extremely Stressful" end of the rating scale. If an event has happened to you while you have been in graduate school, but has not bothered you at all, rate that event toward the lower end of the scale ("Not at all Stressful").

The student then chooses the level of stress by circling a number on a 7-point scale. The questions of this section are:
1. Fulfilling responsibilities both at home and at school
2.   Trying  to  meet  peers  of  your  race/ethnicity  on  campus
3.   Taking  exams
4.   Being  obligated  to  participate  in  family  functions
5.   Arranging  childcare
6.   Finding  support  groups  sensitive  to  your  needs
7.   Fear  of  failing  to  meet  program  expectations
8.   Participating  in  class
9.   Meeting  with  faculty
10.   Living  in  the  local  community
11.   Handling  relationships
12.   Handling  the  academic  workload
13.   Peers  treating  you  unlike  the  way  they  treat  each  other
14. Faculty treating you differently than your peers
15.   Writing  papers
16.   Paying  monthly  expenses
17.   Family  having  money  problems
Algorithm FindInitialSelection
Input: Σ, a sample covariance matrix of a set of variables O

1. Start with a complete graph C over O.
2. Remove edges of pairs that are marginally uncorrelated or uncorrelated conditioned on a third variable.
3. C_0 ← C.
4. Color every edge of C as BLUE.
5. For all edges N_1 − N_2 in C, if there is no other pair {N_3, N_4} such that all three tetrad constraints hold in the covariance matrix of {N_1, N_2, N_3, N_4}, change the color of the edge N_1 − N_2 to GRAY.
6. For all pairs of variables {N_1, N_2} linked by a BLUE edge in C:
   If there exists a pair {N_3, N_4} that forms a BLUE clique with N_1 in C, and a pair {N_5, N_6} that forms a BLUE clique with N_2 in C, all six nodes form a clique in C_0 and DisjointGroup(N_1, N_3, N_4, N_2, N_5, N_6; Σ) = true, then remove all edges linking elements in {N_1, N_3, N_4} to {N_2, N_5, N_6}.
   Otherwise, if there is no node N_3 that forms a BLUE clique with {N_1, N_2} in C, and no BLUE clique {N_4, N_5, N_6} such that all six nodes form a clique in C_0 and DisjointGroup(N_1, N_2, N_3, N_4, N_5, N_6; Σ) = true, then change the color of the edge N_1 − N_2 to YELLOW.
7. Remove all GRAY and YELLOW edges from C.
8. List_C ← FindMaximalCliques(C).
9. Let H be a graph where each node corresponds to an element of List_C and with no edges. Let M_i denote both a node in H and the respective set of nodes in List_C. Let M_i ← RobustPurify(M_i, C, Σ).
10. Add an edge M_1 − M_2 to H only if there exist {N_1, N_2, N_3} ⊆ M_1 and {N_4, N_5, N_6} ⊆ M_2 such that DisjointGroup(N_1, N_2, N_3, N_4, N_5, N_6; Σ) = true.
11. H_choice ← ChooseClusteringClique(H).
12. Let H_clusters be the corresponding set of clusters, i.e., the set of sets of observed variables, where each set in H_clusters corresponds to some M_i in H_choice.
13. Selection ← RobustPurify(H_clusters, C, Σ).
14. Return (Selection, C, C_0).

Table A.3: Selects an initial pure model.
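To illustrate Step 5 of Table A.3, the sketch below marks an edge GRAY when no auxiliary pair makes all three tetrad constraints hold. It assumes a helper all_three_tetrads(S, a, b, c, d) testing the three tetrad constraints for a foursome (a hypothetical name), and uses exhaustive enumeration in place of a statistical test.

    from itertools import combinations

    def gray_edges(S, nodes, blue_edges, all_three_tetrads):
        # blue_edges: a set of frozensets {n1, n2} currently BLUE in C.
        gray = set()
        for e in blue_edges:
            n1, n2 = tuple(e)
            others = [v for v in nodes if v not in e]
            if not any(all_three_tetrads(S, n1, n2, n3, n4)
                       for n3, n4 in combinations(others, 2)):
                gray.add(e)
        return gray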
18.   Adjusting  to  the  campus  environment
19.   Being  obligated  to  repay  loans
20. Anticipation of finding full-time professional work
21.   Meeting  deadlines  for  course  assignments
Section II  This section intends to measure the level of depression of the subject. In the actual questionnaire, it starts with the following instructions:

Below is a list of the ways you might have felt or behaved. Please tell me how often you have felt this way during the past week.

The student then chooses how frequently some events happened to him/her by circling a number on a 4-point scale. The scale is "Rarely or None of the Time (less than 1 day)", "Some or a Little of the Time (1-2 days)", "Occasionally or a Moderate Amount of the Time (3-4 days)" and "Most or All of the Time (5-7 days)". The events are as follows:
1. I was bothered by things that usually don't bother me
2. I did not feel like eating; my appetite was poor
3. I felt that I could not shake off the blues even with help from my family or friends
4. I felt that I was just as good as other people
5. I had trouble keeping my mind on what I was doing
6. I felt depressed
7. I felt that everything I did was an effort
8.   I  felt  hopeful  about  the  future
9.   I  thought  my  life  had  been  a  failure
10.   I  felt  fearful
11.   My  sleep  was  restless
12.   I  was  happy
13.   I  talked  less  than  usual
14.   I  felt  lonely
15.   People  were  unfriendly
16.   I  enjoyed  life
17.   I  had  crying  spells
18.   I  felt  sad
19.   I  felt  that  people  disliked  me
20.   I  could  not  get  going
Section III  This section intends to measure the level of spiritual coping of the subject. In the actual questionnaire, it starts with the following instructions:

Please think about how you try to understand and deal with major problems in your life. These items ask what you did to cope with your negative event. Each item says something about a particular way of coping. To what extent is your religion or higher power involved in the way you cope?

The student then chooses the level of importance of some spiritual guideline by circling a number on a 4-point scale. The scale is "Not at all", "Somewhat", "Quite a bit", "A great deal". The guidelines are:
1. I think about how my life is part of a larger spiritual force
2. I work together with God (high power) as partners to get through hard times
3. I look to God (high power) for strength, support, and guidance in crises
4. I try to find the lesson from God (high power) in crises
5. I confess my sins and ask for God (high power)'s forgiveness
6. I feel that stressful situations are God (high power)'s way of punishing me for my sins or lack of spirituality
7. I wonder whether God has abandoned me
8. I try to make sense of the situation and decide what to do without relying on God (high power)
9. I question whether God (high power) really exists
10. I express anger at God (high power) for letting terrible things happen
11. I do what I can and put the rest in God (high power)'s hands
12. I do not try much of anything; simply expect God (high power) to take my worries away
13. I pray for a miracle
14. I pray to get my mind off of my problems
15. I ignore advice that is inconsistent with my faith
16. I look for spiritual support from clergy
17. I disagree with what my religion wants me to do or believe
18. I ask God (high power) to help me find a new purpose in life
19. I try to find a completely new life through religion
20. I seek help from God (high power) in letting go of my anger
Appendix  B
Results  from  Chapter  4
All of the following proofs hold with probability 1 with respect to the Lebesgue measure taken over the set of linear coefficients and error variances that partially parameterize the density function of an observed variable given its parents. In all of the following proofs, G is a latent variable graph with a set O of observable variables. In some of these proofs, we use the term "edge label" as a synonym for the coefficient associated with an edge that is into an observed node. Without loss of generality, we will also assume that all variables have zero mean, unless specified otherwise. The symbol {X_t} will stand for a finitely indexed set of variables.
Lemma 4.1  If for {A, B, C} ⊆ O we have σ_AB = 0 or σ_{AB.C} = 0, then A and B cannot share a common latent parent in G.

Proof:  We will prove this argument by contradiction. Assume A and B have a common parent L, i.e., let A, B, C be defined according to the following linear functions

A = aL + Σ_p a_p A_p + ε_A
B = bL + Σ_i b_i B_i + ε_B
C = Σ_j c_j C_j + ε_C

where L is a common latent parent of A and B, the A_p represent the other parents of A, the B_i are parents of B, the C_j are parents of C, and {a_p} ∪ {b_i} ∪ {c_j} ∪ {a, b, σ²_{ε_A}, σ²_{ε_B}, σ²_{ε_C}} are parameters of the graphical model, σ²_{ε_A}, σ²_{ε_B}, σ²_{ε_C} being the variances of the error terms ε_A, ε_B, ε_C, respectively.

By the equations above, σ_AB = ab·σ²_L + K, where K is a polynomial containing the remaining terms of the respective expression. We will show first that no term in K has a factor ab. For that to happen, either the symbol b would have to appear in some σ_{L B_i}, or the symbol a in some σ_{L A_p}, or the symbol ab within some σ_{A_p B_i}. The symbol b will appear in some σ_{L B_i} only if there is a path from L to B_i through B, but that cannot happen since B_i is a parent of B and the graph is acyclic beneath the latents. The arguments for a and σ_{L A_p}, and for ab with respect to σ_{A_p B_i}, are analogous.

Consider first that the hypothesis σ_AB = 0 is true. With probability 1 with respect to the Lebesgue measure over the parameters {a_p} ∪ {b_i} ∪ {c_j} ∪ {a, b, σ²_{ε_A}, σ²_{ε_B}, σ²_{ε_C}}, the polynomial identity ab·σ²_L + K = 0 will hold. For this identity to hold, every term in the polynomial should vanish. Since the only term containing the expression ab is the one given above, we therefore need ab·σ²_L = 0. However, by assumption ab ≠ 0 and latent variables have positive variance, which contradicts ab·σ²_L = 0.
Assume now that σ_{AB.C} = 0. This implies σ_AB σ²_C − σ_AC σ_BC = 0, where σ²_C > 0 by assumption. By expressing σ_AB σ²_C as a function of the given coefficients, we obtain ab·σ²_L σ²_C + Q, where Q is a polynomial that does not contain any term including some symbol in {c_j} ∪ {σ²_{ε_C}} (using arguments analogous to the previous case). Since C is not an ancestor of L (because L is latent), no term in ab·σ²_L contains the symbol σ²_{ε_C}, nor any coefficient c_j. Since every term in σ_AC σ_BC that might contain σ²_{ε_C} must also contain some c_j, no term in σ_AC σ_BC can cancel any term in ab·σ²_L σ²_{ε_C} (which is contained in ab·σ²_L σ²_C). This implies ab·σ²_L σ²_{ε_C} = 0, a contradiction. □
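As an informal numerical illustration of Lemma 4.1 (not part of the proof), one can simulate a small linear model in which A and B share a latent parent and check that their covariance stays bounded away from zero; the specific coefficients below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000
    L = rng.normal(size=n)                       # common latent parent of A and B
    A = 0.8 * L + rng.normal(scale=0.5, size=n)
    B = -0.6 * L + rng.normal(scale=0.7, size=n)
    cov_AB = np.cov(A, B)[0, 1]
    print(cov_AB)   # close to 0.8 * (-0.6) * Var(L) = -0.48, clearly nonzero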
Lemma 4.2  For any set O′ = {A, B, C, D} ⊆ O, if σ_AB σ_CD = σ_AC σ_BD = σ_AD σ_BC such that for all triplets {X, Y, Z}, {X, Y} ⊆ O′, Z ∈ O, we have σ_{XY.Z} ≠ 0 and σ_XY ≠ 0, then no element X ∈ O′ is an ancestor of any element of O′\{X} in G.

Proof:  Since G is acyclic among observed variables, at least one element of O′ is not an ancestor in G of any other element in this set. By symmetry, we can assume without loss of generality that D is such a node. Since the measurement model is linear, we can write A, B, C, D as linear functions of their parents:

A = Σ_p a_p A_p     B = Σ_i b_i B_i     C = Σ_j c_j C_j     D = Σ_k d_k D_k

where on the right-hand side of each equation we have the respective parents of A, B, C and D. Such parents can be latents, other indicators or, for now, the respective error term, but each indicator has at least one latent parent besides the error term. Let L be the set of latent variables in G. Since each indicator is always a linear function of its parents, by composition of linear functions each X ∈ O′ can be written as a linear function of its immediate latent ancestors:

A = Σ_p λ_{A_p} L_{A_p}     B = Σ_i λ_{B_i} L_{B_i}     C = Σ_j λ_{C_j} L_{C_j}     D = Σ_k λ_{D_k} L_{D_k}

where on the right-hand side of each equation we have the respective immediate latent ancestors of A, B, C and D, and the λ parameters are functions of the original coefficients of the measurement model. Notice that in general the sets of immediate latent ancestors for each pair of elements in O′ will overlap.
Since the graph is acyclic, at least one element of {A, B, C} is not an ancestor of the other two. By symmetry, assume without loss of generality that C is such a node. Assume also that C is an ancestor of D. We will prove by contradiction that this is not possible. Let L be a latent parent of C, where the edge from L into C is labeled with c, corresponding to its linear coefficient. We can rewrite the equation for C as

C = cL + Σ_j λ_{C_j} L_{C_j}   (B.1)

where by an abuse of notation we are keeping the same symbols λ_{C_j} and L_{C_j} to represent the other dependencies of C. Notice that it is possible that L = L_{C_j} for some L_{C_j} if there is more
[Figure B.1 shows two example graphs, panels (a) and (b), with observed nodes A, B, C, D, a latent L with an edge labeled c into C, and directed paths from C to D.]

Figure B.1: (a) The symbol λ_d is defined as the sum over all directed paths from C to D of the product of the labels of each edge that appears in each path. Here the larger edges represent edges in such directed paths. (b) An example: we have two directed paths from C to D. The symbol λ_d then stands for δ_1 + δ_2δ_3, where each term in this polynomial corresponds to one directed path. Notice that it is not possible to obtain any additive term that forms λ_d out of the product of some λ_{A_p}, λ_{B_i}, λ_{C_j}, since D is not an ancestor of any of them: in our example, δ_1 and δ_2 cannot appear in any λ_{A_p} λ_{B_i} λ_{C_j} product (δ_3 may appear if X is an ancestor of A or B).
than one directed path from L to C, but this will not be relevant for our proof. In this case, the corresponding coefficient λ is modified by subtracting c. It should be stressed that the symbol c does not appear anywhere in the polynomial corresponding to Σ_j λ_{C_j} L_{C_j}, where in this case the variables of the polynomial are the original coefficients parameterizing the measurement model and the immediate latent ancestors of C.

By another abuse of notation, rewrite A, B and D as

A = cλ_a L + Σ_p λ_{A_p} L_{A_p}
B = cλ_b L + Σ_i λ_{B_i} L_{B_i}
D = cλ_d L + Σ_k λ_{D_k} L_{D_k}

Each λ_x, x ∈ {a, b, d}, is defined for A, B, D as illustrated in Figure B.1. The corresponding λ_{X_t} coefficient for L is adjusted in the summation by subtracting cλ_x (again, L may appear in the summation if there are directed paths from L to X_t that do not go through C). If C has more than one parent, then the expressions for the relevant covariances become

σ_AB = c²λ_aλ_b σ²_L + cλ_a Σ_i λ_{B_i} σ_{L_{B_i}L} + cλ_b Σ_p λ_{A_p} σ_{L_{A_p}L} + Σ_{p,i} λ_{A_p}λ_{B_i} σ_{L_{A_p}L_{B_i}}
σ_CD = c²λ_d σ²_L + c Σ_k λ_{D_k} σ_{L_{D_k}L} + cλ_d Σ_j λ_{C_j} σ_{L_{C_j}L} + Σ_{j,k} λ_{C_j}λ_{D_k} σ_{L_{C_j}L_{D_k}}
σ_AC = c²λ_a σ²_L + cλ_a Σ_j λ_{C_j} σ_{L_{C_j}L} + c Σ_p λ_{A_p} σ_{L_{A_p}L} + Σ_{p,j} λ_{A_p}λ_{C_j} σ_{L_{A_p}L_{C_j}}
σ_BD = c²λ_bλ_d σ²_L + cλ_b Σ_k λ_{D_k} σ_{L_{D_k}L} + cλ_d Σ_i λ_{B_i} σ_{L_{B_i}L} + Σ_{i,k} λ_{B_i}λ_{D_k} σ_{L_{B_i}L_{D_k}}
Consider the polynomial identity σ_AB σ_CD − σ_AC σ_BD = 0 as a function of the parameters of the measurement model, i.e., the linear coefficients and error variances for the observed variables. Assume this constraint is entailed by G and its unknown latent covariance matrix. With a Lebesgue measure over the parameters, this will hold with probability 1, which follows from the fact that the solution set of non-trivial polynomial constraints has measure zero. See Meek (1997) and references within for more details. This also means that every term in this polynomial expression should vanish to zero with probability 1: i.e., the coefficients (functions of the latent covariance matrix) of every term in the polynomial should be zero. Therefore, the sum of all terms with a factor λ_dt = l_1 l_2 ... l_z at a given choice of exponents for each l_1, ..., l_z should be zero, where λ_dt is some term inside the polynomial λ_d.
Before using this result, we need to identify precisely which elements of the polynomial σ_AB σ_CD − σ_AC σ_BD can be factored by, say, c²λ_dt, for some λ_dt. This can include elements from any term that will explicitly show c²λ_d when multiplying the covariance equations above among others, but we have to consider the multiplicity of the factors that compose λ_dt. Let λ_dt = l_1 l_2 ... l_z. We want to factorize our tetrad constraint according to terms that contain l_1 l_2 ... l_z with multiplicity 1 for each label (i.e., our terms cannot include l_1², for instance, or only some subset of {l_1, ..., l_z}). Since C does not have some descendant X that is a common ancestor of A and D or of B and D, this means that no algebraic term λ_a, λ_b or λ_{A_p}, λ_{B_i} can contain some symbol in {l_1, ..., l_z}. Notice that some λ_{D_k}'s will be functions of λ_dt: every immediate latent ancestor of C is an immediate latent ancestor of D. Therefore, for each common immediate latent ancestor L_q of C and D, we have that λ_{D_q} = λ_d λ_{C_q} + t(L_q, D) = λ_dt λ_{C_q} + (λ_d − λ_dt)λ_{C_q} + t(L_q, D), where t(L_q, D) is a polynomial representing other directed paths from L_q to D that do not go through C.
For example, consider the expression c²λ_a (Σ_i λ_{B_i} σ_{L_{B_i}L})(Σ_k λ_{D_k} σ_{L_{D_k}L}), which is an additive term inside the product σ_AB σ_CD. If we group only those terms inside this expression that contain λ_dt, we will get c²λ_dt λ_a (Σ_i λ_{B_i} σ_{L_{B_i}L})(Σ_j λ_{C_j} σ_{L_{C_j}L}). Grouping the terms of σ_AB σ_CD − σ_AC σ_BD as functions of the λ's, c, λ_a, λ_b and λ_dt, the terms
c²λ_dt[σ²_L Σ_{p,i} λ_{A_p}λ_{B_i} σ_{L_{A_p}L_{B_i}} + λ_aλ_b σ²_L Σ_{j,j′} λ_{C_j}λ_{C_{j′}} σ_{L_{C_j}L_{C_{j′}}} + λ_a (Σ_i λ_{B_i} σ_{L_{B_i}L})(Σ_j λ_{C_j} σ_{L_{C_j}L}) + λ_b (Σ_p λ_{A_p} σ_{L_{A_p}L})(Σ_j λ_{C_j} σ_{L_{C_j}L})]
− c²λ_dt[λ_b σ²_L Σ_{p,j} λ_{A_p}λ_{C_j} σ_{L_{A_p}L_{C_j}} + λ_a σ²_L Σ_{i,j} λ_{B_i}λ_{C_j} σ_{L_{B_i}L_{C_j}} + λ_aλ_b (Σ_j λ_{C_j} σ_{L_{C_j}L})² + (Σ_p λ_{A_p} σ_{L_{A_p}L})(Σ_i λ_{B_i} σ_{L_{B_i}L})]

will be the only ones that can be factorized by c²λ_dt, where the power of c in such terms is 2 and the multiplicity of each l_1, ..., l_z is 1. Since this has to be identically zero and λ_dt ≠ 0, we have the following relation:

f_1(G) = f_2(G)   (B.2)
where

f_1(G) = c²[σ²_L Σ_{p,i} λ_{A_p}λ_{B_i} σ_{L_{A_p}L_{B_i}} + λ_aλ_b σ²_L Σ_{j,j′} λ_{C_j}λ_{C_{j′}} σ_{L_{C_j}L_{C_{j′}}} + λ_a (Σ_i λ_{B_i} σ_{L_{B_i}L})(Σ_j λ_{C_j} σ_{L_{C_j}L}) + λ_b (Σ_p λ_{A_p} σ_{L_{A_p}L})(Σ_j λ_{C_j} σ_{L_{C_j}L})]

f_2(G) = c²[λ_b σ²_L Σ_{p,j} λ_{A_p}λ_{C_j} σ_{L_{A_p}L_{C_j}} + λ_a σ²_L Σ_{i,j} λ_{B_i}λ_{C_j} σ_{L_{B_i}L_{C_j}} + λ_aλ_b (Σ_j λ_{C_j} σ_{L_{C_j}L})² + (Σ_p λ_{A_p} σ_{L_{A_p}L})(Σ_i λ_{B_i} σ_{L_{B_i}L})]
Similarly, when we factorize the terms that include cλ_dt, where the respective powers of c, l_1, ..., l_z in the term have to be 1, we get the following expression as an additive term of σ_AB σ_CD − σ_AC σ_BD:
cλ_dt[λ_a (Σ_i λ_{B_i} σ_{L_{B_i}L})(Σ_{j,j′} λ_{C_j}λ_{C_{j′}} σ_{L_{C_j}L_{C_{j′}}}) + λ_b (Σ_p λ_{A_p} σ_{L_{A_p}L})(Σ_{j,j′} λ_{C_j}λ_{C_{j′}} σ_{L_{C_j}L_{C_{j′}}}) + 2(Σ_j λ_{C_j} σ_{L_{C_j}L})(Σ_{p,i} λ_{A_p}λ_{B_i} σ_{L_{A_p}L_{B_i}})]
− cλ_dt[λ_a (Σ_j λ_{C_j} σ_{L_{C_j}L})(Σ_{i,j} λ_{B_i}λ_{C_j} σ_{L_{B_i}L_{C_j}}) + (Σ_p λ_{A_p} σ_{L_{A_p}L})(Σ_{i,j} λ_{B_i}λ_{C_j} σ_{L_{B_i}L_{C_j}}) + λ_b (Σ_j λ_{C_j} σ_{L_{C_j}L})(Σ_{p,j} λ_{A_p}λ_{C_j} σ_{L_{A_p}L_{C_j}}) + (Σ_i λ_{B_i} σ_{L_{B_i}L})(Σ_{p,j} λ_{A_p}λ_{C_j} σ_{L_{A_p}L_{C_j}})]

for which we have:

g_1(G) = g_2(G)   (B.3)
where

g_1(G) = c[λ_a (Σ_i λ_{B_i} σ_{L_{B_i}L})(Σ_{j,j′} λ_{C_j}λ_{C_{j′}} σ_{L_{C_j}L_{C_{j′}}}) + λ_b (Σ_p λ_{A_p} σ_{L_{A_p}L})(Σ_{j,j′} λ_{C_j}λ_{C_{j′}} σ_{L_{C_j}L_{C_{j′}}}) + 2(Σ_j λ_{C_j} σ_{L_{C_j}L})(Σ_{p,i} λ_{A_p}λ_{B_i} σ_{L_{A_p}L_{B_i}})]

g_2(G) = c[λ_a (Σ_j λ_{C_j} σ_{L_{C_j}L})(Σ_{i,j} λ_{B_i}λ_{C_j} σ_{L_{B_i}L_{C_j}}) + (Σ_p λ_{A_p} σ_{L_{A_p}L})(Σ_{i,j} λ_{B_i}λ_{C_j} σ_{L_{B_i}L_{C_j}}) + λ_b (Σ_j λ_{C_j} σ_{L_{C_j}L})(Σ_{p,j} λ_{A_p}λ_{C_j} σ_{L_{A_p}L_{C_j}}) + (Σ_i λ_{B_i} σ_{L_{B_i}L})(Σ_{p,j} λ_{A_p}λ_{C_j} σ_{L_{A_p}L_{C_j}})]
Finally, we look at the terms multiplying λ_dt without c, which will result in:

h_1(G) = h_2(G)   (B.4)
where

h_1(G) = (Σ_{p,i} λ_{A_p}λ_{B_i} σ_{L_{A_p}L_{B_i}})(Σ_{j,j′} λ_{C_j}λ_{C_{j′}} σ_{L_{C_j}L_{C_{j′}}})

h_2(G) = (Σ_{p,j} λ_{A_p}λ_{C_j} σ_{L_{A_p}L_{C_j}})(Σ_{i,j} λ_{B_i}λ_{C_j} σ_{L_{B_i}L_{C_j}})
Writing down the full expressions for σ_AC σ_BC and σ²_C σ_AB will result in:

σ_AC σ_BC = P(G) + f_2(G) + g_2(G) + h_2(G)   (B.5)

σ²_C σ_AB = P(G) + f_1(G) + g_1(G) + h_1(G)   (B.6)
where

P(G) = c⁴λ_aλ_b (σ²_L)² + c³λ_aλ_b σ²_L Σ_j λ_{C_j} σ_{L_{C_j}L} + c³λ_a σ²_L Σ_i λ_{B_i} σ_{L_{B_i}L} + c³λ_aλ_b σ²_L Σ_j λ_{C_j} σ_{L_{C_j}L} + c²λ_a (Σ_j λ_{C_j} σ_{L_{C_j}L})(Σ_i λ_{B_i} σ_{L_{B_i}L}) + c³λ_b σ²_L Σ_p λ_{A_p} σ_{L_{A_p}L} + c²λ_b (Σ_j λ_{C_j} σ_{L_{C_j}L})(Σ_p λ_{A_p} σ_{L_{A_p}L})
By (B.2), (B.3), (B.4), (B.5) and (B.6), we have:

σ_AC σ_BC = σ²_C σ_AB ⇒ σ_AB − σ_AC σ_BC (σ²_C)⁻¹ = 0 ⇒ ρ_{AB.C} = 0

Contradiction. Therefore, C cannot be an ancestor of D and, more generally, of any element in O′\{C}.
Assume without loss of generality that B is not an ancestor of A. C is not an ancestor of any element in O′\{C}. If B does not have a descendant that is a common ancestor of C and D, then by analogy with the (C, D) case (where now more than one λ element will be nonzero, as hinted before, since we have to consider the possibility of B being an ancestor of both C and D), B cannot be an ancestor of C nor of D.

Assume then that B has a descendant X that is a common ancestor of C and D, where X ≠ C and X ≠ D, since C is not an ancestor of D and vice-versa. Notice also that X is not an ancestor of A, since B is not an ancestor of A. Relations such as Equation B.2 might not hold, since we might be equating terms that have different exponents for symbols in {l_1, ..., l_z}. However, since now we have an observed intermediate term X, we can make use of its error variance parameter ψ_X corresponding to the error term ε_X.

No term in σ_AB can have ψ_X, since ε_X is independent of both A and B. There is at least one term in σ_CD that contains ψ_X as a factor. There is no term in σ_AC that contains ψ_X as a factor, since ε_X is independent of A. There is no term in σ_BD that contains ψ_X as a factor, since ε_X is independent of B. Therefore, in σ_AB σ_CD we have at least one term that has ψ_X, while no term in σ_AC σ_BD contains such a factor. That requires some λ parameters or the variance of some latent ancestor of B to be zero, which is a contradiction.

Therefore, B is not an ancestor of any element in O′\{A}. □
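As an informal numerical check of Lemma 4.2 (not part of the proof), one can simulate a linear model in which one observed variable is an ancestor of another and verify that the three tetrad constraints generically fail to hold simultaneously; the coefficients below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200_000
    L1 = rng.normal(size=n)
    L2 = 0.6 * L1 + 0.8 * rng.normal(size=n)        # correlated latents
    A = 0.9 * L1 + 0.1 * rng.normal(size=n)
    B = 0.7 * L1 + 0.2 * rng.normal(size=n)
    C = 0.8 * L2 + 0.5 * A + 0.1 * rng.normal(size=n)   # A is an ancestor of C
    D = 0.6 * L2 + 0.1 * rng.normal(size=n)
    S = np.cov([A, B, C, D])
    t1 = S[0, 1] * S[2, 3] - S[0, 2] * S[1, 3]      # sigma_AB sigma_CD - sigma_AC sigma_BD
    t2 = S[0, 1] * S[2, 3] - S[0, 3] * S[1, 2]      # sigma_AB sigma_CD - sigma_AD sigma_BC
    print(t1, t2)   # at least one of these stays clearly away from zero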
The following lemma will be useful to prove Lemma 4.2:
Lemma B.1  For any set {A, B, C, D} = O′ ⊆ O, if σ_AB σ_CD = σ_AC σ_BD = σ_AD σ_BC such that for every set {X, Y} ⊂ O′, Z ∈ O we have σ_{XY.Z} ≠ 0 and σ_XY ≠ 0, then no pair of elements in O′ has an observed common ancestor.
Proof:  Assume for the sake of contradiction that some pair in O′ has an observed common ancestor K. Without loss of generality, assume K is a common ancestor of A and B. Let α be the concatenation of edge labels in some directed path from K to A, and β the concatenation of edge labels in some directed path from K to B. That is,

A = αK + R_A
B = βK + R_B

where R_X is the remainder of the polynomial expression that describes node X as a function of its immediate latent ancestors and K.

By the given constraint σ_AB σ_CD = σ_AC σ_BD, we have αβ(σ²_K σ_CD − σ_CK σ_DK) + f(G) = 0, where

f(G) = (α σ_{K R_B} + β σ_{K R_A} + σ_{R_A R_B}) σ_CD − σ_{C R_A} σ_{D R_B}

However, no term in f(G) can contain the product αβ: by Lemma 4.2 no element X in O′ can be an ancestor of any element in O′\{X}, so the symbols α and β cannot both appear in the same term of f(G). Therefore αβ(σ²_K σ_CD − σ_CK σ_DK) = 0, and since αβ ≠ 0 by assumption, this implies σ²_K σ_CD − σ_CK σ_DK = 0 ⇒ ρ_{CD.K} = 0. Contradiction. □
Lemma 4.3  For any set O′ = {X_1, X_2, Y_1, Y_2} ⊆ O, if Factor_1(X_1, X_2, G) = true, Factor_1(Y_1, Y_2, G) = true, σ_{X_1Y_1}σ_{X_2Y_2} = σ_{X_1Y_2}σ_{X_2Y_1}, and all elements of {X_1, X_2, Y_1, Y_2} are correlated, then no element in {X_1, X_2} is an ancestor of any element in {Y_1, Y_2} in G and vice-versa.
Proof:  Assume for the sake of contradiction that X_1 is an ancestor of Y_1. Let P be an arbitrary directed path from X_1 to Y_1 of K edges such that the edge coefficients on this path are δ_1 . . . δ_K. One can write the covariance of X_1 and Y_1 as σ_{X_1Y_1} = cδ_1 σ²_{X_1} + F(G), where F(G) is a polynomial (in terms of edge coefficients and error variances) that does not contain any term that includes the symbol δ_1, and c = δ_2 . . . δ_K. Also, the polynomial corresponding to σ²_{X_1} cannot contain any term that includes the symbol δ_1.

Analogously, σ_{X_2Y_1} can be written as cδ_1 σ_{X_1X_2} + F′(G), where F′(G) also does not contain any term that includes the symbol δ_1. The given constraint σ_{X_1Y_1}σ_{X_2Y_2} = σ_{X_1Y_2}σ_{X_2Y_1} therefore corresponds to the polynomial identity cδ_1(σ²_{X_1}σ_{X_2Y_2} − σ_{X_1Y_2}σ_{X_1X_2}) + (F(G)σ_{X_2Y_2} − F′(G)σ_{X_1Y_2}) = 0, where the second component contains no term with the symbol δ_1. This will imply with probability 1 that σ²_{X_1}σ_{X_2Y_2} − σ_{X_1Y_2}σ_{X_1X_2} = 0 (which is the same as saying that the partial correlation of X_2 and Y_2 given X_1 is zero).

The expression σ²_{X_1}σ_{X_2Y_2} contains a term that includes ψ_{X_1}, the error variance of X_1, while σ_{X_1Y_2}σ_{X_1X_2} cannot contain such a term, since X_1 is not an ancestor of either X_2 or Y_2. That will then imply that the term ψ_{X_1}σ_{X_2Y_2} should vanish, which is a contradiction, since ψ_{X_1} ≠ 0 by assumption and σ_{X_2Y_2} ≠ 0 by hypothesis. □
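The identity used in this proof, σ²_{X_1}σ_{X_2Y_2} − σ_{X_1Y_2}σ_{X_1X_2} = 0 being the numerator of the partial correlation of X_2 and Y_2 given X_1, can be checked numerically with a small helper; this is a generic sketch, with the index convention as an assumption.

    import numpy as np

    def partial_corr_given_one(S, x, y, z):
        # rho_{xy.z} computed from a covariance matrix S.
        num = S[x, y] * S[z, z] - S[x, z] * S[y, z]
        den = np.sqrt((S[x, x] * S[z, z] - S[x, z] ** 2) *
                      (S[y, y] * S[z, z] - S[y, z] ** 2))
        return num / den

    # With (x, y, z) = (X2, Y2, X1), the numerator above is exactly
    # sigma^2_{X1} * sigma_{X2 Y2} - sigma_{X1 Y2} * sigma_{X1 X2}.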
Let X = λ_{x0} L + Σ_{i=1}^{k} λ_{xi} η_i and Y be random variables with zero mean, as well as L, η_1, ..., η_k. Let λ_{x0}, λ_{x1}, ..., λ_{xk} be real coefficients. We define σ_{XYL}, the covariance of X and Y through L, as σ_{XYL} ≡ λ_{x0} E[LY]. The following lemma will be useful to show Lemma 4.4:
Lemma B.2  Let {A, B, C, D} ⊆ O be such that A is not an ancestor of B, C or D in G, A has a parent L in G, and no element of the covariance matrix of {A, B, C, D} is zero. If σ_AC σ_BD = σ_AD σ_BC, then σ_ACL = σ_ADL = 0 or σ_ACL/σ_ADL = σ_AC/σ_AD = σ_BC/σ_BD.
Proof:  Since G is a linear latent variable graph, we can express A, B, C and D as linear functions of their parents as follows:

A = aL + Σ_p a_p A_p     B = Σ_i b_i B_i     C = Σ_j c_j C_j     D = Σ_k d_k D_k

where on the right-hand side of each equation the uppercase symbols denote the respective parents of each variable on the left side, error terms included.

Given the assumptions, we have:

σ_AC σ_BD = σ_AD σ_BC ⇒
E[a Σ_j c_j L C_j + Σ_p Σ_j a_p c_j A_p C_j] σ_BD = E[a Σ_k d_k L D_k + Σ_p Σ_k a_p d_k A_p D_k] σ_BC ⇒
a(Σ_j c_j σ_{LC_j}) σ_BD + Σ_{p,j} a_p c_j σ_{A_pC_j} σ_BD = a(Σ_k d_k σ_{LD_k}) σ_BC + Σ_{p,k} a_p d_k σ_{A_pD_k} σ_BC ⇒
a[(Σ_j c_j σ_{LC_j}) σ_BD − (Σ_k d_k σ_{LD_k}) σ_BC] + [Σ_{p,j} a_p c_j σ_{A_pC_j} σ_BD − Σ_{p,k} a_p d_k σ_{A_pD_k} σ_BC] = 0

Since A is not an ancestor of B, C or D, there is no trek among elements of {B, C, D} containing both L and A, and therefore the symbol a cannot appear in Σ_{p,j} a_p c_j σ_{A_pC_j} σ_BD − Σ_{p,k} a_p d_k σ_{A_pD_k} σ_BC when we expand each covariance as a function of the parameters of G. Therefore, since this polynomial is identically zero, we have to have the coefficient of a equal to zero, which implies:

a(Σ_j c_j σ_{LC_j}) σ_BD = a(Σ_k d_k σ_{LD_k}) σ_BC ⇒ σ_ACL σ_BD = σ_ADL σ_BC

Since no element in Σ_{ABCD} is zero, σ_ACL = 0 ⇔ σ_ADL = 0. If σ_ACL ≠ 0, then σ_ACL/σ_ADL = σ_AC/σ_AD = σ_BC/σ_BD. □
Lemma  4.4  CS1  is  sound.
Proof:  Suppose X_1 and Y_1 have a common parent L in G. Let X_1 = aL + Σ_p a_p A_p and Y_1 = bL + Σ_i b_i B_i, where the A_p and B_i are parents in G of X_1 and Y_1, respectively.

By Lemma 4.2 and the given constraints, an element of {X_1, Y_1} cannot be an ancestor of the other, and neither can be an ancestor in G of any element in {X_2, X_3, Y_2, Y_3}. By definition, σ_{X_1VL} = (a/b)σ_{Y_1VL} for some variable V, and therefore σ_{X_1VL} = 0 ⇔ σ_{Y_1VL} = 0. Assume σ_{Y_1X_2L} = σ_{X_1X_2L} = 0. Since it is given that σ_{X_1Y_1}σ_{X_2X_3} = σ_{X_1X_2}σ_{Y_1X_3}, by Lemma B.2 we have σ_{X_1Y_1L} = σ_{X_1X_2L} = 0. Since σ_{X_1Y_1L} = ab·σ²_L + K, where no term in K contains the factor ab, then if σ_{X_1Y_1L} = 0, with probability 1 ab·σ²_L = 0 ⇒ σ²_L = 0, which is a contradiction of the assumptions. By repeating the argument, no element of {σ_{X_1X_2L}, σ_{X_1X_3L}, σ_{Y_1X_2L}, σ_{Y_1X_3L}, σ_{X_1Y_2L}, σ_{X_1Y_3L}, σ_{Y_1Y_2L}, σ_{Y_1Y_3L}} is zero. Therefore, since σ_{X_1Y_1}σ_{X_2X_3} = σ_{X_1X_2}σ_{X_3Y_1} = σ_{X_1X_3}σ_{X_2Y_1} by assumption, from Lemma B.2 we have

σ_{X_1X_3}/σ_{X_3Y_1} = σ_{X_1X_3L}/σ_{X_3Y_1L}   (B.7)

and from σ_{X_1Y_1}σ_{Y_2Y_3} = σ_{X_1Y_2}σ_{Y_1Y_3} = σ_{X_1Y_3}σ_{Y_1Y_2}:

σ_{Y_1Y_3}/σ_{X_1Y_3} = σ_{Y_1Y_3L}/σ_{X_1Y_3L}   (B.8)

Since no covariance among the given variables is zero,

σ_{X_1X_2}σ_{Y_1X_3}σ_{X_1Y_2}σ_{Y_1Y_3} = σ_{X_1X_3}σ_{Y_1X_2}σ_{X_1Y_3}σ_{Y_1Y_2} ⇒
σ_{X_1X_2}σ_{Y_1Y_2} = σ_{X_1Y_2}σ_{Y_1X_2} · (σ_{X_1X_3}σ_{Y_1Y_3})/(σ_{Y_1X_3}σ_{X_1Y_3})

From (B.7), (B.8) it follows:

σ_{X_1X_2}σ_{Y_1Y_2} = σ_{X_1Y_2}σ_{Y_1X_2} · (σ_{X_1X_3L}σ_{Y_1Y_3L})/(σ_{Y_1X_3L}σ_{X_1Y_3L})
                    = σ_{X_1Y_2}σ_{Y_1X_2} · ((a/b)σ_{Y_1X_3L}(b/a)σ_{X_1Y_3L})/(σ_{Y_1X_3L}σ_{X_1Y_3L})
                    = σ_{X_1Y_2}σ_{Y_1X_2}

Contradiction. □
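For reference, a direct population-level check of the CS1 conditions on a covariance matrix might look as follows; this is a sketch in which exact equality is replaced by a tolerance, whereas in practice each constraint would be assessed with a statistical test.

    def cs1_holds(S, x1, x2, x3, y1, y2, y3, tol=1e-8):
        def eq(a, b):
            return abs(a - b) <= tol
        t = lambda a, b, c, d: S[a, b] * S[c, d]    # tetrad product sigma_ab * sigma_cd
        return (eq(t(x1, y1, x2, x3), t(x1, x2, x3, y1)) and
                eq(t(x1, x2, x3, y1), t(x1, x3, x2, y1)) and
                eq(t(x1, y1, y2, y3), t(x1, y2, y1, y3)) and
                eq(t(x1, y2, y1, y3), t(x1, y3, y1, y2)) and
                not eq(t(x1, x2, y1, y2), t(x1, y2, x2, y1)))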
Lemma  4.5  CS2  is  sound.
Proof:  Suppose X_1 and Y_1 have a common parent L in G. Let X_1 = aL + Σ_p a_p A_p and Y_1 = bL + Σ_i b_i B_i. To simplify the presentation, we will represent Σ_p a_p A_p by the random variable P_x and Σ_i b_i B_i by P_y, such that X_1 = aL + P_x and Y_1 = bL + P_y. We will assume that E[P_x P] and E[P_y P] are not zero for P ∈ {X_1, X_2, Y_1, Y_2}, to simplify the proof, but the same results can be obtained without this condition in an analogous (and simpler) way.

With probability 1 with respect to a Lebesgue measure over the linear coefficients parameterizing the graph, the constraint σ_{X_1Y_1}σ_{X_2Y_2} − σ_{X_1Y_2}σ_{X_2Y_1} = 0 corresponds to a polynomial identity in which some terms contain the product ab, some contain only a, some contain only b, and some contain none of these symbols. Since this is a polynomial identity, all terms containing ab should sum to zero. The same holds for the terms containing only a, only b, and neither a nor b. This constraint can be rewritten as

ab(E[L²]σ_{X_2Y_2} − E[LY_2]E[LX_2]) +
a(E[LP_y]σ_{X_2Y_2} − E[LY_2]E[X_2P_y]) +
b(E[LP_x]σ_{X_2Y_2} − E[Y_2P_x]E[LX_2]) +
(E[P_xP_y]σ_{X_2Y_2} − E[P_xY_2]E[P_yX_2]) = 0

From Lemmas 4.2 and 4.3 and the given hypothesis, X_1 cannot be an ancestor of any element of {X_2, Y_1, Y_2} and Y_1 cannot be an ancestor of any element of {X_1, X_2, Y_2}. Therefore, the symbols a and b cannot appear inside any of the polynomial expressions obtained when terms such as σ_{X_2Y_2} or E[Y_2P_x] are expressed as functions of the latent covariance matrix and the linear coefficients and error variances of the measurement model. All symbols a and b of σ_{X_1Y_1}σ_{X_2Y_2} − σ_{X_1Y_2}σ_{X_2Y_1} were therefore factorized as above. Therefore, with probability 1 we have:

E[L²]σ_{X_2Y_2} = E[LX_2]E[LY_2]   (B.9)
E[LP_y]σ_{X_2Y_2} = E[LY_2]E[X_2P_y]   (B.10)
E[LP_x]σ_{X_2Y_2} = E[Y_2P_x]E[LX_2]   (B.11)
E[P_xP_y]σ_{X_2Y_2} = E[Y_2P_x]E[X_2P_y]   (B.12)

Analogously, the constraint σ_{X_2Y_1}σ_{Y_2Y_3} − σ_{X_2Y_3}σ_{Y_2Y_1} = 0 will force other identities. Since Y_1 is also not an ancestor of Y_3, we can split the polynomial expression derived from σ_{X_2Y_1}σ_{Y_2Y_3} − σ_{X_2Y_3}σ_{Y_2Y_1} = 0 into two parts:

b(E[LX_2]σ_{Y_2Y_3} − E[LY_2]σ_{X_2Y_3}) + (E[X_2P_y]σ_{Y_2Y_3} − E[Y_2P_y]σ_{X_2Y_3}) = 0

where the second component, E[X_2P_y]σ_{Y_2Y_3} − E[Y_2P_y]σ_{X_2Y_3}, cannot contain any term that includes the symbol b, and neither can the second factor of the first component, E[LX_2]σ_{Y_2Y_3} − E[LY_2]σ_{X_2Y_3}. With probability 1, it follows that:

E[LX_2]σ_{Y_2Y_3} = E[LY_2]σ_{X_2Y_3}
E[X_2P_y]σ_{Y_2Y_3} = E[Y_2P_y]σ_{X_2Y_3}

Since σ_{Y_2Y_3} ≠ 0 and σ_{X_2Y_3} ≠ 0, from the two equations above we get:

E[LX_2]E[Y_2P_y] = E[LY_2]E[X_2P_y]   (B.13)

From the constraint σ_{X_1X_2}σ_{X_3Y_2} = σ_{X_1Y_2}σ_{X_3X_2} and a similar reasoning, we get

E[LX_2]E[Y_2P_x] = E[LY_2]E[X_2P_x]   (B.14)

from which follows

E[X_2P_x]E[Y_2P_y] = E[X_2P_y]E[Y_2P_x]   (B.15)

Combining (B.10) and (B.13), we have

aE[LP_y]σ_{X_2Y_2} = aE[LX_2]E[Y_2P_y]   (B.16)

Combining (B.11) and (B.14), we have

bE[LP_x]σ_{X_2Y_2} = bE[X_2P_x]E[LY_2]   (B.17)

Combining (B.12) and (B.15), we have

E[P_xP_y]σ_{X_2Y_2} = E[X_2P_x]E[Y_2P_y]   (B.18)

From (B.9), (B.16), (B.17), (B.18) and the given constraints:

σ_{X_1X_2}σ_{Y_1Y_2} = abE[LX_2]E[LY_2] + aE[LX_2]E[Y_2P_y] + bE[X_2P_x]E[LY_2] + E[X_2P_x]E[Y_2P_y]
                     = abE[L²]σ_{X_2Y_2} + aE[LP_y]σ_{X_2Y_2} + bE[LP_x]σ_{X_2Y_2} + E[P_xP_y]σ_{X_2Y_2}
                     = σ_{X_1Y_1}σ_{X_2Y_2} = σ_{X_1Y_2}σ_{X_2Y_1}

Contradiction. □
Theorem 4.6  There are sound identification rules for learning whether two observed variables share a common parent in a linear latent variable model that are not sound for non-linear latent variable models.
Proof:  Consider first the following test: let G(O) be a linear latent variable model. Assume {X_1, X_2, X_3, Y_1, Y_2, Y_3} ⊆ O and σ_{X_1Y_1}σ_{Y_2Y_3} = σ_{X_1Y_2}σ_{Y_1Y_3} = σ_{X_1Y_3}σ_{Y_1Y_2}, σ_{X_1Y_2}σ_{X_2X_3} = σ_{X_1X_2}σ_{Y_2X_3} = σ_{X_1X_3}σ_{X_2Y_2}, σ_{X_1Y_3}σ_{X_2X_3} = σ_{X_1X_2}σ_{Y_3X_3} = σ_{X_1X_3}σ_{X_2Y_3}, σ_{X_1X_2}σ_{Y_2Y_3} ≠ σ_{X_1Y_2}σ_{X_2Y_3}, and that for all triplets {A, B, C}, {A, B} ⊂ {X_1, X_2, X_3, Y_1, Y_2, Y_3}, C ∈ O, we have σ_AB ≠ 0 and σ_{AB.C} ≠ 0. Then X_1 and Y_1 do not have a common parent in G.
Call this test CS3. Test CS3 is sound for linear models: if its conditions are true, then X_1 and Y_1 do not have a common parent in G. The proof of this result is given by Silva et al. (2005). However, this is not a sound rule for the non-linear case. To show this, it is enough to come up with a latent variable model where X_1 and Y_1 have a common parent, and a latent covariance matrix such that, for any choice of linear coefficients and error variances, this test applies. Notice that the definition of a sound identification rule in non-linear graphs allows us to choose specific latent covariance matrices, but the constraints should hold for any choice of linear coefficients and error variances (or, more precisely, with probability 1 with respect to the Lebesgue measure).
Consider the graph G with five latent variables L_i, 1 ≤ i ≤ 5, where L_1 has X_1 and Y_1 as its only children, X_2 is the only child of L_2, X_3 is the only child of L_3, Y_2 is the only child of L_4 and Y_3 is the only child of L_5. Also, {X_1, X_2, X_3, Y_1, Y_2, Y_3}, as defined in CS3, are the only observed variables, and each observed variable has only one parent besides its error term. Error variables are independent.
The following simple randomized algorithm will choose a covariance matrix Σ_L for {L_1, L_2, L_3, L_4, L_5} that entails CS3. The symbol σ_ij will denote the covariance of L_i and L_j.

1. Choose positive random values for all σ_ii, 1 ≤ i ≤ 5
2. Choose random values for σ_12 and σ_13
3. σ_23 ← σ_12 σ_13 / σ_11
4. Choose random values for σ_45, σ_25 and σ_24
5. σ_14 ← σ_12 σ_45 / σ_25
6. σ_15 ← σ_12 σ_45 / σ_24
7. σ_35 ← σ_13 σ_45 / σ_14
8. σ_34 ← σ_13 σ_45 / σ_15
9. Repeat from the beginning if Σ_L is not positive definite or if σ_14 σ_23 = σ_12 σ_34
Table B.1 provides an example of such a matrix. Notice that the intuition behind this example is to set the covariance matrix of the latent variables to have some vanishing partial correlations, even though one does not necessarily have any conditional independence. For linear models, both conditions are identical, and therefore this identification rule holds in such a case. □
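A direct transcription of the randomized construction into code might look like the sketch below (illustrative only; it simply retries until the matrix is positive definite and the side condition of Step 9 holds, and uses index 0..4 for L_1..L_5).

    import numpy as np

    def random_counterexample_cov(rng):
        while True:
            s = np.eye(5)
            for i in range(5):
                s[i, i] = rng.uniform(0.5, 2.0)          # sigma_ii > 0
            s[0, 1] = rng.uniform(-1, 1); s[0, 2] = rng.uniform(-1, 1)
            s[1, 2] = s[0, 1] * s[0, 2] / s[0, 0]
            s[3, 4] = rng.uniform(-1, 1); s[1, 4] = rng.uniform(-1, 1); s[1, 3] = rng.uniform(-1, 1)
            s[0, 3] = s[0, 1] * s[3, 4] / s[1, 4]
            s[0, 4] = s[0, 1] * s[3, 4] / s[1, 3]
            s[2, 4] = s[0, 2] * s[3, 4] / s[0, 3]
            s[2, 3] = s[0, 2] * s[3, 4] / s[0, 4]
            s = np.triu(s) + np.triu(s, 1).T             # symmetrize
            pd = np.all(np.linalg.eigvalsh(s) > 0)
            if pd and not np.isclose(s[0, 3] * s[1, 2], s[0, 1] * s[2, 3]):
                return s

    sigma_L = random_counterexample_cov(np.random.default_rng(0))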
Lemma B.3  For any set {A, B, C, D} = O′ ⊆ O, if σ_AB σ_CD = σ_AC σ_BD = σ_AD σ_BC such that for every set {X, Y} ⊂ O′, Z ∈ O we have σ_{XY.Z} ≠ 0 and σ_XY ≠ 0, then A and B do not have more than one common immediate latent ancestor in G.
       L_1                   L_2                   L_3                   L_4                   L_5
L_1    1.0
L_2    0.4636804781967626    1.0
L_3    0.31177237495755117   0.1445627639088577    1.0
L_4    0.8241967922523632    0.6834605230188671    0.45954945371001815   1.0
L_5    0.5167659523766029    0.428525239857415     0.28813447630828753   0.7617079965565864    1.0

Table B.1: A counterexample that can be used to prove Theorem 4.6.
Proof:  Assume for the sake of contradiction that L_1 and L_2 are two common immediate latent ancestors of A and B in G. Let the structural equations for A, B, C and D be:

A = α_1 L_1 + α_2 L_2 + R_A
B = β_1 L_1 + β_2 L_2 + R_B
C = Σ_j c_j C_j
D = Σ_k d_k D_k

where α_1 is a sequence of labels of edges corresponding to some directed path connecting L_1 and A. The symbols α_2, β_1, β_2 are defined analogously. R_X is the remainder of the polynomial expression that describes node X as a function of its parents and the immediate latent ancestors L_1 and L_2.

Since the constraint σ_AB σ_CD = σ_AC σ_BD is observed, we have σ_AB σ_CD − σ_AC σ_BD = 0 ⇒

(α_1β_1 σ²_{L_1} + α_1β_2 σ_{L_1L_2} + α_2β_1 σ_{L_1L_2} + α_2β_2 σ²_{L_2} + α_1 σ_{L_1R_B} + α_2 σ_{L_2R_B} + β_1 σ_{L_1R_A} + β_2 σ_{L_2R_A} + σ_{R_AR_B}) σ_CD − (α_1 Σ_j c_j σ_{C_jL_1} + α_2 Σ_j c_j σ_{C_jL_2} + Σ_j c_j σ_{C_jR_A})(β_1 Σ_k d_k σ_{D_kL_1} + β_2 Σ_k d_k σ_{D_kL_2} + Σ_k d_k σ_{D_kR_B}) = 0 ⇒

α_1β_1(σ²_{L_1} σ_CD − (Σ_j c_j σ_{C_jL_1})(Σ_k d_k σ_{D_kL_1})) + f(G) = 0, where

f(G) = (α_1β_2 σ_{L_1L_2} + α_2β_1 σ_{L_1L_2} + α_2β_2 σ²_{L_2} + α_1 σ_{L_1R_B} + α_2 σ_{L_2R_B} + β_1 σ_{L_1R_A} + β_2 σ_{L_2R_A} + σ_{R_AR_B}) σ_CD
       − α_1 Σ_j c_j σ_{C_jL_1} (β_2 Σ_k d_k σ_{D_kL_2} + Σ_k d_k σ_{D_kR_B})
       − α_2 Σ_j c_j σ_{C_jL_2} (β_1 Σ_k d_k σ_{D_kL_1} + β_2 Σ_k d_k σ_{D_kL_2} + Σ_k d_k σ_{D_kR_B})
       − Σ_j c_j σ_{C_jR_A} (β_1 Σ_k d_k σ_{D_kL_1} + β_2 Σ_k d_k σ_{D_kL_2} + Σ_k d_k σ_{D_kR_B})

No element in O′ is an ancestor of any other element in this set (Lemma 4.2) and no observed node in any directed path from L_i ∈ {L_1, L_2} to X ∈ {A, B} can be an ancestor of any node in O′\{X} (Lemma B.1). That is, when fully expanding f(G) as a function of the linear parameters of G, the product α_1β_1 cannot possibly appear.

Therefore, since with probability 1 the polynomial constraint is identically zero and nothing in f(G) can cancel the term in α_1β_1, we have:

σ²_{L_1} σ_CD = (Σ_j c_j σ_{C_jL_1})(Σ_k d_k σ_{D_kL_1})   (B.19)

Using a similar argument for the coefficients of α_1β_2, α_2β_1 and α_2β_2, we get:

σ_{L_1L_2} σ_CD = (Σ_j c_j σ_{C_jL_1})(Σ_k d_k σ_{D_kL_2})   (B.20)

σ_{L_1L_2} σ_CD = (Σ_j c_j σ_{C_jL_2})(Σ_k d_k σ_{D_kL_1})   (B.21)

σ²_{L_2} σ_CD = (Σ_j c_j σ_{C_jL_2})(Σ_k d_k σ_{D_kL_2})   (B.22)

From (B.19), (B.20), (B.21), (B.22), it follows:

σ_AC σ_AD = [α_1 Σ_j c_j σ_{C_jL_1} + α_2 Σ_j c_j σ_{C_jL_2}][α_1 Σ_k d_k σ_{D_kL_1} + α_2 Σ_k d_k σ_{D_kL_2}]
          = α_1²(Σ_j c_j σ_{C_jL_1})(Σ_k d_k σ_{D_kL_1}) + α_1α_2(Σ_j c_j σ_{C_jL_1})(Σ_k d_k σ_{D_kL_2}) + α_1α_2(Σ_j c_j σ_{C_jL_2})(Σ_k d_k σ_{D_kL_1}) + α_2²(Σ_j c_j σ_{C_jL_2})(Σ_k d_k σ_{D_kL_2})
          = [α_1² σ²_{L_1} + 2α_1α_2 σ_{L_1L_2} + α_2² σ²_{L_2}] σ_CD
          = σ²_A σ_CD

which implies σ_CD − σ_AC σ_AD (σ²_A)⁻¹ = 0 ⇒ ρ_{CD.A} = 0. Contradiction. □
Lemma B.4  For any set {A, B, C, D} = O′ ⊆ O, if σ_AB σ_CD = σ_AC σ_BD = σ_AD σ_BC such that for every set {X, Y} ⊂ O′, Z ∈ O we have σ_{XY.Z} ≠ 0 and σ_XY ≠ 0, then if A and B have a common immediate latent ancestor L_1 in G, and B and C have a common immediate latent ancestor L_2 in G, we have L_1 = L_2.
Proof:  Assume A, B and C are parameterized as follows:

A = aL_1 + Σ_p a_p A_p
B = b_1 L_1 + b_2 L_2 + Σ_i b_i B_i
C = cL_2 + Σ_j c_j C_j

where, as before, {A_p} ∪ {B_i} ∪ {C_j} represents the possible other parents of A, B and C, respectively. Assume L_1 ≠ L_2. We will show that ρ_{L_1L_2} = 1, which is a contradiction. From the given constraint σ_AB σ_CD = σ_AD σ_BC, and the fact that from Lemma 4.2 we have that for no pair {X, Y} ⊂ O′ is X an ancestor of Y, if we factorize the constraint according to which terms include ab_1c as a factor, we obtain with probability 1:

ab_1c[σ²_{L_1} σ_{L_2D} − σ_{L_1D} σ_{L_1L_2}] = 0   (B.23)

If we factorize the constraint according to ab_2c, it follows:

ab_2c[σ_{L_1L_2} σ_{L_2D} − σ_{L_1D} σ²_{L_2}] = 0   (B.24)

From (B.23) and (B.24), it follows that σ²_{L_1} σ²_{L_2} = (σ_{L_1L_2})² ⇒ ρ_{L_1L_2} = 1. Contradiction. □
Lemma B.5  For any set {A, B, C, D} = O′ ⊆ O, if σ_AB σ_CD = σ_AC σ_BD = σ_AD σ_BC such that for every set {X, Y} ⊂ O′, Z ∈ O we have σ_{XY.Z} ≠ 0 and σ_XY ≠ 0, then if A and B have a common immediate latent ancestor L_1 in G, and C and D have a common immediate latent ancestor L_2 in G, we have L_1 = L_2.
Proof:  Assume for the sake of contradiction that L_1 ≠ L_2. Let P_A be a directed path from L_1 to A, and α_1 the sequence of edge labels in this path. Analogously, define α_2 as the sequence of edge labels from L_1 to B along some arbitrary path P_B, β_1 a sequence from L_2 to C according to some path P_C, and β_2 a sequence from L_2 to D according to some path P_D.

P_A and P_B cannot intersect, since that would imply the existence of an observed common cause of A and B, which is ruled out by the given assumptions and Lemma B.1. Similarly, no pair of paths in {P_A, P_B, P_C, P_D} can intersect. By Lemma B.4, L_1 cannot be an ancestor of either C or D, or otherwise L_1 = L_2. Analogously, L_2 cannot be an ancestor of either A or B.

By Lemma 4.2 and the given constraints, no element X in O′ is an ancestor of any element in O′\{X}.

It means that when expanding the given constraint σ_AB σ_CD − σ_AD σ_BC = 0, and keeping all and only the terms that include the symbol α_1α_2β_1β_2, we obtain α_1α_2β_1β_2(σ²_{L_1} σ²_{L_2} − σ²_{L_1L_2}) = 0, which implies ρ_{L_1L_2} = 1 with probability 1. Contradiction. □
Lemma 4.7  Let S ⊆ O be any set such that, for all {A, B, C} ⊆ S, there is a fourth variable D ∈ O where i. σ_AB σ_CD = σ_AC σ_BD = σ_AD σ_BC and ii. for every set {X, Y} ⊂ {A, B, C, D}, Z ∈ O we have σ_{XY.Z} ≠ 0 and σ_XY ≠ 0. Then S can be partitioned into two sets S_1, S_2 where

1. all elements in S_1 share a common immediate latent ancestor, and no two elements in S_1 have any other common immediate latent ancestor;
2. no element S ∈ S_2 has any common immediate latent ancestor with any other element in S\{S};
3. all elements in S are d-separated given the latents in G.

Proof:  Follows immediately from the given constraints and Lemmas 4.2, B.4 and B.5. □
Theorem 4.8  If a partition {C_1, . . . , C_k} of O′

Before showing the proof of Theorem 4.9, the next two lemmas will be useful:
Lemma B.6  Let set {A, B, C, D} = O′ ⊆ O be such that σ_AB σ_CD = σ_AC σ_BD = σ_AD σ_BC and for every set {X, Y} ⊂ O′, Z ∈ O we have σ_{XY.Z} ≠ 0 and σ_XY ≠ 0. If an immediate latent ancestor L_X of X ∈ O′ is uncorrelated with an immediate latent ancestor L_Y of some other Y ∈ O′, then L_X is uncorrelated with all immediate latent ancestors of all elements in O′\{X}, or L_Y is uncorrelated with all immediate latent ancestors of all elements in O′\{Y}.
Proof:  Since the immediate latent ancestors of O′ are linked to O′ by directed paths containing no other element of O′, it is enough to prove the result for parents of the elements of O′: if a parent L_X of X is uncorrelated with all parents of Y, then L_X is uncorrelated with all parents of all elements in O′\{X}, as shown in Step 1 below.

Step 1: let A = aL_A + Σ_p a_p A_p, and let L_A be uncorrelated with all parents of B. Let C = cL_C + Σ_j c_j C_j. This means that, when expanding the polynomial σ_AB σ_CD − σ_AC σ_BD = 0, the only terms containing the symbol ac will be those in ac·σ_{L_AL_C} σ_BD. Since ac ≠ 0 and σ_BD ≠ 0, this will force σ_{L_AL_C} = 0 with probability 1. By symmetry, L_A will be uncorrelated with all parents of C and D.

Step 2: now we show the result stated by the lemma. Without loss of generality, let A = aL_A + Σ_p a_p A_p, B = bL_B + Σ_i b_i B_i, and let L_A be uncorrelated with L_B. Then no term in the polynomial corresponding to σ_AB σ_CD can contain the symbol ab, since σ_{L_AL_B} = 0. If L_B is uncorrelated with all parents of D, then L_B is uncorrelated with all parents of all elements in O′\{B} by Step 1, and we are done. Otherwise, σ_AC σ_BD will contain the symbol ab if there is some parent of C that is correlated with L_A (because σ_BD will contain some term with b). It follows that L_A has to be uncorrelated with every parent of D and, by the result in Step 1, with all parents of all elements in O′\{A}. □
Lemma B.7  Let set {A, B, C, D} = O′ ⊆ O be such that σ_AB σ_CD = σ_AC σ_BD = σ_AD σ_BC and for every set {X, Y} ⊂ O′, Z ∈ O we have σ_{XY.Z} ≠ 0 and σ_XY ≠ 0. Let {A_p} be the set of immediate latent ancestors of A, {B_i} the set of immediate latent ancestors of B, {C_j} the set of immediate latent ancestors of C, and {D_k} the set of immediate latent ancestors of D. Then σ_{A_pB_i} σ_{C_jD_k} = σ_{A_pC_j} σ_{B_iD_k} = σ_{A_pD_k} σ_{B_iC_j} for all A_p, B_i, C_j, D_k ∈ {A_p} ∪ {B_i} ∪ {C_j} ∪ {D_k}.

Proof:  Since the immediate latent ancestors of O′ are linked to O′.
Proof:  We will assume that all elements of all sets in C are correlated. Otherwise, C can be partitioned into subsets with this property (because of the SC4 condition), and the parameterization given below can be applied independently to each member of the partition without loss of generality.

Let An_i be the set of immediate latent ancestors of the elements in C_i ∈ C = {C_1, . . . , C_k}. Split every An_i into two disjoint sets An_i⁰ and An_i¹, such that An_i⁰ contains all and only those elements of An_i that are uncorrelated with all elements in An_1 ∪ · · · ∪ An_k. This implies that all elements in An_1¹ ∪ · · · ∪ An_k¹ are pairwise correlated, by Lemma B.6.
Construct the graph G^L_linear as follows. For each set An_i, add a latent L_{An_i} to G^L_linear, as well as all elements of An_i¹. Add a directed edge from L_{An_i} to each element in An_i¹. Let G^L_linear also be a linear latent variable model. We will define values for each parameter in this model.

Fully connect all elements in {L_{An_i}} as an arbitrary directed acyclic graph (DAG). Instead of defining the parameters for the edges and error variances in the subgraph of G^L_linear induced by {L_{An_i}}, we will directly define a covariance matrix Σ_L among these nodes. Standard results in linear models can be used to translate this covariance matrix into the parameters of an arbitrary fully connected DAG (Spirtes et al., 2000). Set the diagonal of Σ_L to 1.

Define the intercept parameters μ_x of all elements in G^L_linear to be zero. For each V in An_i¹ we have a set of parameters for the local equation V = λ_V L_{An_i} + ε_V, where ε_V is a random variable with zero mean and variance ζ_V.

Choose any three arbitrary elements {X, Y, Z} ⊆ An_i¹. Since the subgraph L_{An_i} → X, L_{An_i} → Y, L_{An_i} → Z has six parameters (λ_X, λ_Y, λ_Z, ζ_X, ζ_Y, ζ_Z) and the population covariance matrix of {X, Y, Z} has six entries, these parameters can be assigned a unique value (Bollen, 1989) such that σ_XY = λ_X λ_Y and ζ_X = σ²_X − λ²_X. Let W be any other element of An_i¹: set λ_W = σ_WX/λ_X and ζ_W = σ²_W − λ²_W. From Lemma B.7, we have the constraint σ_WY σ_XZ − σ_WX σ_YZ = 0, from which one can verify that σ_WY = λ_W λ_Y. By symmetry and induction, for every pair P, Q in An_i¹, we have σ_PQ = λ_P λ_Q.

Let T be some element in An_j¹, i ≠ j: set the entry ς_ij of Σ_L to be σ_TX/(λ_T λ_X). Let R and S be other elements in An_j¹. From Lemma B.7, we have the constraint σ_XT σ_RS − σ_XR σ_ST = 0, from which one can verify that σ_XR = λ_X λ_R ς_ij. Let Y and Z be other elements in An_i¹. From Lemma B.7, we have the constraint σ_XT σ_YZ − σ_XY σ_ZT = 0, from which one can verify that σ_ZT = λ_Z λ_T ς_ij. By symmetry and induction, for every pair P ∈ An_i¹, Q ∈ An_j¹, we have σ_PQ = λ_P λ_Q ς_ij.
Finally, let G_linear be a graph constructed as follows:

1. start G_linear with a node for each element in O;
2. for each C_i ∈ C, add a latent L_i to G_linear, and for each V ∈ C_i, add an edge L_i → V;
3. fully connect the latents in G_linear to form an arbitrary directed acyclic graph.

Parameterize a linear latent model based on G_linear as follows: let V ∈ C_i be such that V has immediate latent ancestors {L^V_i}. In the true model, let V = μ^G_V + Σ_i λ^G_{iV} L^V_i + ε^G_V, where every latent is centered at its mean. Construct the equation V = μ_V + λ_V L_i + ε_V by instantiating μ_V = μ^G_V and λ_V = Σ_i λ^G_{iV} λ_{L^V_i}, where λ_{L^V_i} is the respective parameter for L^V_i in G^L_linear if L^V_i ∈ An^1_i, and 0 otherwise. The variance of ε_V is defined as σ²_V − λ²_V. The L_i variables have covariance matrix Σ_L as defined above. One can then verify that the covariance matrix generated by this model equals the true covariance matrix of O. □
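For readers who want to check the final claim numerically, the following sketch (again our own illustration, with hypothetical loadings) builds the covariance matrix implied by a pure linear measurement model, Λ Σ_L Λᵀ + diag(ψ); comparing such an implied matrix against the target covariance of O is exactly the verification the proof leaves to the reader.

    import numpy as np

    # Hypothetical example: two clusters of three indicators each.
    Sigma_L = np.array([[1.0, 0.3],
                        [0.3, 1.0]])                  # latent covariance (unit diagonal)
    Lam = np.array([[0.9, 0.0], [0.7, 0.0], [0.8, 0.0],
                    [0.0, 0.6], [0.0, 1.1], [0.0, 0.5]])  # one latent parent per indicator
    psi = np.array([0.5, 0.4, 0.3, 0.6, 0.2, 0.7])        # error variances

    Sigma_O = Lam @ Sigma_L @ Lam.T + np.diag(psi)    # covariance implied by the pure model
    print(np.round(Sigma_O, 3))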
Lemma 4.10 Let G(O) be a latent variable graph where no pair in O is marginally uncorrelated, and let X, Y ∈ O. If there is no pair {P, Q} ⊆ O such that σ_{XY}σ_{PQ} = σ_{XP}σ_{YQ} holds, then there is at least one graph in the tetrad equivalence class of G where X and Y have a common latent parent.
Proof: It will suffice to show the result for linear latent variable models, since they are more constrained than non-linear ones. Moreover, we will be able to make use of the Tetrad Representation Theorem and the equivalence of d-separations and vanishing partial correlations, facilitating the proof.
If in all graphs in the tetrad equivalence class of G we have that X and Y share some common hidden parent, then we are done. Assume then that there is at least one graph G_0 in this class such that X and Y have no common hidden parent. Construct graph G′_0 by adding a new latent L and edges X ← L → Y. We will show that G′_0 is in the same tetrad equivalence class, i.e., the addition of the substructure X ← L → Y to G_0 does not destroy any entailed tetrad constraint (it might, however, destroy some independence constraint).

Assume there is a tetrad constraint corresponding to some choke point {X, P} × {T, Q}. If Y is not an ancestor of T or Q, then this tetrad will not be destroyed by the introduction of the subpath X ← L → Y, since no new treks connecting X or P to T or Q can be formed, and therefore no choke point {X, P} × {T, Q} will disappear.

Assume without loss of generality that Y is an ancestor of Q. Since there is a trek connecting X to Q through Y (because no marginal correlations are zero) in G, the choke point {X, P} × {T, Q} should be in this trek. Let X be the starting node of this trek, and Q the ending node. If the choke point is after Y on this trek, then this choke point will be preserved under the addition of X ← L → Y. If the choke point is Y or is before Y on this trek, then there will be a choke point {X, P} × {Y, Q}, a contradiction of the assumptions.

One can show that choke points {Y, P} × {T, Q} are also preserved by an analogous argument. □
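The condition of Lemma 4.10 is easy to test against an estimated covariance matrix. The sketch below (an illustration under our own tolerance choice, not an algorithm from the thesis) scans all pairs {P, Q} and reports whether some tetrad constraint of the form σ_{XY}σ_{PQ} = σ_{XP}σ_{YQ} approximately holds for a given X and Y.

    from itertools import combinations
    import numpy as np

    def has_separating_tetrad(Sigma, x, y, tol=1e-8):
        """True if some pair {p, q} gives sigma_xy*sigma_pq = sigma_xp*sigma_yq (up to tol)."""
        others = [i for i in range(Sigma.shape[0]) if i not in (x, y)]
        for p, q in combinations(others, 2):
            lhs = Sigma[x, y] * Sigma[p, q]
            if (abs(lhs - Sigma[x, p] * Sigma[y, q]) < tol or
                    abs(lhs - Sigma[x, q] * Sigma[y, p]) < tol):
                return True
        return False

    # Toy check: a one-factor model over four indicators satisfies such constraints.
    lam = np.array([0.9, 0.8, 0.7, 0.6])
    Sigma = np.outer(lam, lam) + np.diag([0.3, 0.4, 0.5, 0.6])
    print(has_separating_tetrad(Sigma, 0, 1))   # True for this toy example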
Before proving Theorem 4.11, we will introduce several lemmas that will be used in its proof.
Lemma B.8 Let G(O) be a linear latent variable graph, and let O′ = {A, B, C, D} ⊆ O. If all elements in O′ ...

Since σ_{AB} ≠ 0, CP has to be an ancestor of either A or B. Without loss of generality, let CP be an ancestor of B. Then there is at least one trek connecting A and B such that CP is not on the {A, C} side of it: the one connecting CP and A that is into CP and continues into B.

If CP is an ancestor of C, then there is at least one trek connecting C and B such that CP is not on the {B, D} side of it: the one connecting CP and B that is into CP and continues into C. But this cannot happen by the definition of choke point. If CP is not an ancestor of C, CP has to be an ancestor of A, or otherwise there would be no treks connecting A and C (since CP is in all treks connecting A and C by hypothesis, and at least one exists, because σ_{AC} ≠ 0). This implies at least one trek connecting A and B such that CP is not on the {B, D} side of it: the one connecting CP and B that is into CP and continues into A. Contradiction. □
Lemma B.9 Let G(O) be a linear latent variable graph, and let O′ = {A, B, C, D, E} ⊆ O. If all elements in O′ are correlated and the constraints σ_{AB}σ_{CD} = σ_{AD}σ_{BC}, σ_{AC}σ_{DE} = σ_{AE}σ_{CD} and σ_{BC}σ_{DE} = σ_{BD}σ_{CE} hold, then all three tetrad constraints hold in the covariance matrix of {A, B, C, D}.
Proof: By the Tetrad Representation Theorem, let CP_1 be a choke point {A, C} × {B, D}, which is known to exist in G by assumption. Let CP_2 be a choke point {A, D} × {C, E}, which is also assumed to exist. From the definition of choke point, all treks connecting C and D have to pass through both CP_1 and CP_2. We will assume without loss of generality that none of the choke points we introduce in this proof are elements of {A, B, C, D, E}.

First, we will show by contradiction that all treks connecting A to C should include CP_1. Assume that A is connected to C through a trek T that includes CP_2 but not CP_1. Let T_1 be the subtrek A–CP_2, i.e., the subtrek of T connecting A and CP_2. Let T_2 be the subtrek CP_2–C. Neither T_1 nor T_2 contains CP_1, and they should not collide at CP_2 by definition. Notice that a trek like T should exist, since CP_2 has to be in all treks connecting A and C, and at least one such trek exists because σ_{AC} ≠ 0. Any subtrek connecting CP_2 to D that does not intersect T_2 anywhere but at CP_2 has to contain CP_1. Let T_3 be the subtrek between CP_2 and CP_1. Let T_4 be a subtrek between CP_1 and B. Let T_5 be the subtrek between CP_1 and D. This is illustrated by Figure B.2(a). (B and D might be connected by other treks, symbolized by the dashed edge.)
Now consider the choke point CP_3 = {B, E} × {C, D}. Since CP_3 is in all treks connecting B and C, CP_3 should be either on T_2, T_3 or T_4. If CP_3 is on T_4 (Figure B.2(b)), then there will be a trek connecting D and E that does not include CP_2, which contradicts the definition of choke point {A, D} × {C, E}, unless both B–CP_1 and D–CP_1 are into CP_1. However, if both B–CP_1 and D–CP_1 (i.e., T_4 and T_5) are into CP_1, then CP_1–CP_2 is out of CP_1 and into CP_2, since T_2 ∪ T_3 ∪ T_5 is a trek by construction, and therefore cannot contain a collider. Since D is an ancestor of CP_2 and CP_2 is in a trek connecting E and D, CP_2 is an ancestor of E. All paths CP_2 → ... → E should include CP_3 by definition, which implies that CP_2 is an ancestor of CP_3. B cannot be an ancestor of CP_3, or otherwise CP_3 would have to be an ancestor of CP_1, creating the cycle CP_3 → ... → CP_1 → ... → CP_2 → ... → CP_3. CP_3 would have to be an ancestor of B, since B–CP_3–CP_1 is assumed to be a trek into CP_1 and CP_3 is not an ancestor of CP_1 (Figure B.2(c)). If CP_3 is an ancestor of B, then there is a trek C ← ... ← CP_2 → ... → CP_3 → ... → B, which does not include CP_1. Therefore, CP_3 is not in T_4.
If CP_3 is in T_3, then B and D should both be ancestors of CP_1, or otherwise there will be a trek connecting them that does not include CP_3. Again, this will imply that CP_1 is an ancestor of CP_2. If some trek E–CP_3 is not into CP_3, then this creates a trek D–CP_1–CP_3–E that does not contain CP_2, contrary to our hypothesis. If every trek E–CP_3 is into CP_3, then some other trek CP_3–D that is out of CP_3 but does not include CP_1 has to exist. But then this creates a trek connecting C and D that does not include CP_1, which contradicts the definition of CP_1 = {A, C} × {B, D}. A similar reasoning forbids the placement of CP_3 in T_2.
Therefore, all treks connecting A and C should include CP_1. We will now show that all treks connecting B and D should also include CP_1. We know that all treks connecting elements in {A, C, D} go through CP_1. We also know that all treks between {B, E} and {C, D} go through CP_3. This is illustrated by Figure B.2(d). A possible trek from CP_3 to D that does not include CP_1 (represented by the dashed edge connecting CP_3 and D) would still have to include CP_2, since all treks in {A, D} × {C, E} go through CP_2. If CP_1 = CP_2, then all treks between B and D go through CP_1. If CP_1 ≠ CP_2, then such a CP_3–D trek without CP_1 but with CP_2 would exist, implying that some trek C–D without both CP_1 and CP_2 would exist, contrary to our hypothesis.

Figure B.2: Several illustrations depicting cases used in the proof of Lemma B.9 (panels (a)-(d)).

Therefore, we showed that all treks connecting elements in {A, B, C, D} go through the same point CP_1. By symmetry between B and E, it is also the case that CP_1 is in all treks connecting elements in {A, E, C, D}. From this one can verify that CP_1 = CP_2. We will show that CP_1 is also a choke point for {B, E} × {C, D} (although it might be the case that CP_1 ≠ CP_3). Because CP_1 = CP_2, one can verify that choke point CP_3 has to be in a trek connecting B and CP_1. There is a trek connecting B and CP_1 that is into CP_1 if and only if there is a trek connecting B and CP_3 that is into CP_3. The same holds for E. Therefore, there is a trek connecting B and CP_1 that is into CP_1 if and only if there is a trek connecting E and CP_1 that is into CP_1. However, if there is a trek connecting B and CP_1 into CP_1, then there is no trek connecting C and CP_1 that is into CP_1 (because of choke point {A, C} × {B, D} and Lemma B.8). This also implies there is no trek E–CP_1 into CP_1, and because CP_1 is a {A, D} × {C, E} choke point, Lemma B.8 will imply that there is no trek D–CP_1 into CP_1. Therefore, all treks connecting pairs in {B, E} × {C, D} will be either on the {B, E} side or the {C, D} side of CP_1, and so CP_1 is a {B, E} × {C, D} choke point.

Because CP_1 is a {A, C} × {B, D}, a {A, D} × {C, E} and a {B, E} × {C, D} choke point, no pair in {A, B, C, D} can be connected to CP_1 by a trek into CP_1. This implies that CP_1 d-separates all elements in {A, B, C, D} and therefore CP_1 is a choke point for all tetrads in this set. □
Lemma B.10 Let G(O) be a linear latent variable graph, and let O′ = {A, B, C, D, E} ⊆ O. If all elements in O′ are correlated and the constraints σ_{AB}σ_{CD} = σ_{AD}σ_{BC}, σ_{AC}σ_{DE} = σ_{AE}σ_{CD} and σ_{BE}σ_{DC} = σ_{BD}σ_{CE} hold, then all three tetrad constraints hold in the covariance submatrix formed by any foursome in {A, B, C, D, E}.
Proof: As in Lemma B.9, let CP_1 be a choke point {A, C} × {B, D}, and let CP_2 be a choke point {A, D} × {C, E}. Let CP_3 be a choke point {B, C} × {D, E}.

We first show that all treks between C and A go through CP_1. Assume there is a trek connecting A and C through CP_2 but not CP_1, analogous to Figure B.2(a). Let T_1, ..., T_5 be defined as in Lemma B.9. Since all treks between C and D go through CP_3, choke point CP_3 should be either on T_2, T_3 or T_4.

If CP_3 is on T_2 or T_3, then the treks from B and D should collide at CP_1, or otherwise there will be a trek connecting B and D that does not include CP_3. This implies that CP_1 is an ancestor of CP_3. If there is a trek connecting D and CP_3 that intersects T_2 or T_3 not at CP_1, then there will be a trek connecting C and D that does not include CP_1, which would be a contradiction. If there is no such trek connecting D and CP_3, then CP_3 cannot be a {B, C} × {D, E} choke point. If CP_3 is on T_4, a similar case follows.

Therefore, all treks connecting A and C include CP_1. By symmetry between {A, B, E} and {C, D}, CP_1 is in all treks connecting any pair in {A, B, C, D, E}. Using the same arguments of Lemma B.9, one can show that CP_1 is a choke point for any foursome in this set. □
Lemma B.11 Let G(O) be a linear latent variable graph, and let O′ = {A, B, C, D, E} ⊆ O. If all elements in O′ are correlated and the constraints σ_{AB}σ_{CD} = σ_{AD}σ_{BC}, σ_{AC}σ_{DE} = σ_{AE}σ_{CD} and σ_{AB}σ_{CE} = σ_{AC}σ_{BE} hold, then all three tetrad constraints hold in the covariance matrix of {A, C, D, E}.
Proof: As in Lemmas B.9 and B.10, let CP_1 be a choke point {A, C} × {B, D}, and let CP_2 be a choke point {A, D} × {C, E}. Let CP_3 be a choke point {A, E} × {B, C}. We will first show that either all treks connecting A and C go through CP_1 or all treks connecting A and D go through CP_2.

As in Lemma B.9, all treks connecting C and D contain CP_1 and CP_2. Let T be one of these treks. Assuming that A and C are connected by some trek that does not contain CP_1 (but must contain CP_2) implies a family of graphs represented by Figure B.2(a).

Since there is a choke point CP_3 = {A, E} × {B, C}, the only possible position for CP_3 in Figure B.2(a) is in the trek A–CP_2. If CP_2 ≠ CP_3, then no choke point {A, D} × {C, E} can exist, since CP_3 is not in T. Therefore, either all treks between A and C contain CP_1, or CP_2 = CP_3.

If the first case holds, a similar argument will show that all treks between any element in {A, C, D} and node E will have to go through CP_1. If the second case holds, a similar argument will show that all treks between any element in {A, C, D} and node E will have to go through CP_2.

Therefore, there is a node CP such that all treks connecting elements in {A, C, D, E} go through this choke point. Similarly to the proof of Lemma B.9, using Lemma B.8, the given tetrad constraints will imply that CP is a choke point for all tetrads in {A, C, D, E} in both cases, CP = CP_1 and CP = CP_2. □
Theorem 4.11 Let X ⊆ O be a set of observed variables, |X| < 6. Assume σ_{X_1X_2} ≠ 0 for all pairs {X_1, X_2} ⊆ X. There is no possible set of tetrad constraints within X for deciding if two nodes A, B ∈ X do not have a common parent in a latent variable graph G(O).
Proof: It will suffice to show the result for linear latent variable models, since they are more constrained than non-linear ones. Moreover, we will be able to make use of the Tetrad Representation Theorem and the equivalence of d-separations and vanishing partial correlations, facilitating the proof.
This is trivial for domains of size 2 and 3, where no tetrad constraint can hold. For domains of size 4, let X = {A, B, C, D} be our four variables. We will show that it does not matter which tetrad constraints hold among these four variables (excluding logically inconsistent combinations): there exist two linear latent variable graphs with observable variables {A, B, C, D}, G_1 and G_2, where in the former A and B do not share a parent, while in the latter they do have a parent in common. This will be the main technique used during the entire proof. Another technique is showing that some combinations of tetrad constraints will result in contradictory assumptions about existing constraints, and therefore we do not need to create the corresponding G_1 and G_2.

... σ_{CD} = σ_{AC}σ_{BD} = σ_{AD}σ_{BC}. Let G_1 ... from G_1 and G_2 in all possible consistent combinations of vanishing and non-vanishing tetrad constraints. This case is more complicated, and we will divide it into several major subcases. Each subcase will have a sub-index, and each sub-index inherits the assumptions of higher-level indices. Some results about entailment of tetrad constraints are stated without explicit detail: they can be derived directly by a couple of algebraic manipulations of tetrad constraints or from Lemmas B.9, B.10 and B.11.
Case 1: There are choke points {A, C} × {B, D} and {A, B} × {C, D}. We know from the assumption of existence of a choke point {A, C} × {B, D} and results from Chapter 3 that this is equivalent to having a latent variable d-separating all elements in {A, B, C, D}. Let G_0 be as follows: let L_1 and L_2 be two latent variables, let L_1 be a parent of {A, L_2}, and let L_2 be a parent of {B, C, D, E}. We will construct G_1 and G_2 from G_0, considering all possible combinations of choke points of the form {V_1, V_2} × {V_3, E}.
Case 1.1: there is a choke point {A, C} × {D, E}.

Case 1.1.1: there is a choke point {A, D} × {C, E}. As before, this implies a choke point {A, E} × {C, D}. We only have to consider now choke points of the form {X_1, B} × {X_2, E} and {X_1, X_2} × {B, E}. From the given constraints σ_{BD}σ_{AC} = σ_{BC}σ_{AD} (choke point {A, B} × {C, D}) and σ_{DE}σ_{AC} = σ_{CE}σ_{AD} (choke point {A, E} × {C, D}), we have σ_{BD}σ_{CE} = σ_{BC}σ_{DE}, a {B, E} × {C, D} choke point. Choke points {B, E} × {A, C} and {B, E} × {A, D} will follow from this conclusion. Finally, if we assume also the existence of some choke point {X_1, B} × {X_2, E}, then all choke points of this form will exist, and one can let G_1 = G_0. Otherwise, if there is no choke point {X_1, B} × {X_2, E}, let G_1 be G_0 with the added edge B → E. Construct G_2 by adding edge L_2 → A to G_1.
Case 1.1.2: there is no choke point {A, D} × {C, E}. Choke point {A, E} × {C, D} cannot exist, or this would imply {A, D} × {C, E}. We only have to consider now choke points of the form {X_1, B} × {X_2, E} and {X_1, X_2} × {B, E}. Choke point {A, C} × {B, E} is entailed to exist, since the single choke point that d-separates the foursome {A, B, C, D} has to be the same choke point for {A, C} × {D, E} and therefore a choke point for {A, C} × {B, E}. No choke point {X_1, D} × {X_2, E} can exist, for X_i ∈ {A, B, C}, i = 1, 2: otherwise, from the given choke points and {X_1, D} × {X_2, E}, one can verify that {A, D} × {C, E} would be generated using combinations of tetrad constraints. We only have to consider now choke points of the form {X_1, B} × {X_2, E}. Choke points {B, C} × {A, E}, {B, C} × {D, E}, {A, B} × {C, E} and {A, B} × {D, E} either all exist or none exists. If all exist, let G_1 = G_0 with the extra edge D → E. If none exists, let G_1 = G_0 and add both B → E and D → E to G_1. Let G_2 be obtained from G_1 by adding edge L_2 → A to G_1. For the case where no other choke point exists, create G_2 by adding edge L_2 → A to G_1.
Assume now there is a choke point {A, E} × {C, D}. We only have to consider now choke points of the form {X_1, B} × {X_2, E} and {X_1, X_2} × {B, E}. No {A, B} × {X_1, E} choke point can exist, or by Lemmas B.9, B.10 or B.11 and the given tetrad constraints, some {A, X_1} × {E, X_2} choke point will be entailed.

Choke point {B, C} × {D, E} exists if and only if {B, D} × {C, E} exists. If both exist, create G_1 by adding edge A → E to G_0, and create G_2 by adding edge L_2 → A to G_1. If none exists, create G_2 by adding edge L_2 → A to G_1.
Case 2: There is a choke point {A, C} × {B, D}, but no choke point {A, B} × {C, D}.

Case 2.1: there is a choke point {A, C} × {D, E}.

Case 2.1.1: there is a choke point {A, D} × {C, E}. As before, this implies a choke point {A, E} × {C, D}. We only have to consider now choke points of the form {X_1, B} × {X_2, E} and {X_1, X_2} × {B, E}. The choke point {A, C} × {B, E} is implied. No choke point {B, E} × {X_1, D} can exist, or otherwise {A, B} × {C, D} would be implied. For the same reason, no choke point {B, X_1} × {D, E} can exist. We only have to consider now subsets of the set of constraints {{A, B} × {C, E}, {C, B} × {A, E}}. The existence of {A, B} × {C, E} implies {C, B} × {A, E}, so we only need to consider either both or none.

Suppose none of these two constraints holds. Create G_2 out of G_1 by adding edge L_2 → B. Now suppose both constraints hold. Create G_2 out of G_1 by adding edge L_2 → B.
Case 2.1.2: there is no choke point {A, D} × {C, E}. Since there is a choke point {A, C} × {D, E} by assumption 2.1, there is no choke point {A, E} × {C, D}, or otherwise we get a contradiction. Analogously, because there is a {A, C} × {B, D} choke point but no {A, B} × {C, D} (assumption 2), we cannot have a {A, D} × {B, C} choke point. This covers all choke points within the sets {A, B, C, D} and {A, C, D, E}. We only have to consider now choke points of the form {X_1, B} × {X_2, E} and {X_1, X_2} × {B, E}.

From σ_{AB}σ_{CD} = σ_{AD}σ_{BC} (choke point {A, C} × {B, D}) and σ_{AE}σ_{CD} = σ_{AD}σ_{CE} (choke point {A, C} × {D, E}) one gets σ_{AB}σ_{CE} = σ_{AE}σ_{BC}, i.e., a {B, E} × {A, C} choke point. Choke point {B, E} × {A, D} exists if and only if {B, E} × {C, D} exists: to see how the former implies the latter, use the tetrad constraint from {B, E} × {A, C}. Therefore, we have two subcases.

Case 2.1.2.1: there are choke points {B, E} × {A, D} and {B, E} × {C, D}. We only have to consider now choke points of the form {X_1, B} × {X_2, E}. No choke point {B, A} × {C, E} or {B, C} × {A, E} can exist (one implies the other, since we have {B, E} × {A, C}, and all three together with the given choke points would generate {A, B} × {C, D}, excluded by assumption). Choke points {B, C} × {D, E} and {B, D} × {C, E} either both exist or both do not exist. The same holds for the pair {B, A} × {D, E}, {B, D} × {A, E}. Let G_2 be formed from G_1 with the addition of L_1 → B.
Case 2.1.2.2: there are no choke points {B, E} × {A, D} and {B, E} × {C, D}. We only have to consider now choke points of the form {X_1, B} × {X_2, E}. Using the tetrad constraint implied by choke point {A, C} × {D, E}, one can verify that {A, B} × {D, E} holds if and only if {B, C} × {D, E} holds (call the pair {{A, B} × {D, E}, {B, C} × {D, E}} Pair 1). From the given {B, E} × {A, C}, we have that {A, B} × {C, E} holds if and only if {B, C} × {A, E} holds (call it Pair 2). Using the given tetrad constraint corresponding to {A, C} × {B, D}, one can show that {B, D} × {A, E} holds if and only if {B, D} × {C, E} holds (call it Pair 3). We can therefore partition all six possible {X_1, B} × {X_2, E} into these three pairs. Moreover, if Pair 1 holds, none of the other two can hold, because Pair 1 and Pair 2 together imply {B, E} × {A, D}, and Pair 1 and Pair 3 together imply {B, E} × {C, D}.

If neither Pair holds, construct G_1 as follows. Let G′_0 be the latent variable graph containing three latents L_1, L_2, L_3, where L_1 is a parent of {A, C, L_2}, L_2 is a parent of {B, L_3} and L_3 is a parent of {D, E}. Let G_1 be G′_0 with the added edges B → D and B → E. If Pair 1 alone holds, let G_1 be as G′_0. In both cases, let G_2 be constructed as follows. Let G″_0 be a latent variable graph with two latents L_1 and L_2, where L_1 is a parent of L_2 and A, and L_2 is a parent of {B, C, D, E}. Let G_2 be G″_0 augmented with edges B → D and B → E. If Pairs 2 and 3 hold (but not Pair 1), let G_1 be G″_0 with the extra edge B → D. In both cases, let G_2 be ... Let G_1 be as follows: two latents, L_1 and L_2, where L_1 is a parent of {A, C, E} and L_2, and L_2 is a parent of B and D. Add the bi-directed edge B ↔ E. Construct G_2 by adding edge L_1 → B to G_1.
Case 2.2.2.2: there is no choke point {A, E} × {C, D}. We only have to consider now choke points of the form {X_1, B} × {X_2, E} and {X_1, X_2} × {B, E}. Choke point {A, C} × {B, E} does not exist, because this combined with {A, C} × {B, D} generates {A, C} × {D, E}. Choke points {A, D} × {B, E} and {C, D} × {B, E} cannot both exist, since they jointly imply choke point {A, C} × {B, E}.

Assume for now that choke point {A, D} × {B, E} exists (but not {C, D} × {B, E}). We only have to consider now choke points of the form {X_1, B} × {X_2, E}. Choke point {A, B} × {C, E} cannot exist, since by exchanging A and D, B and C in the set {{A, C} × {B, D}, {A, D} × {B, E}, {A, B} × {C, E}} we get {{A, C} × {B, D}, {A, D} × {C, E}, {B, E} × {C, D}}, which by Lemma B.9 will imply all tetrad constraints within {A, B, C, D}. The same reasoning applies to {B, C} × {A, E} (exchanging A and D, B and C in the given tetrad constraints) by using Lemma B.10, and to {B, C} × {D, E} (exchanging A and D, B and C in the given tetrad constraints) by using Lemma B.11.

Because of the assumed {A, C} × {B, D}, either both choke points {A, E} × {B, D} and {C, E} × {B, D} exist or none exists. Because of the assumed {A, D} × {B, E}, either both choke points {A, E} × {B, D} and {A, D} × {B, E} exist or none exists. That is, either all choke points {A, E} × {B, D}, {A, D} × {B, E}, {C, E} × {B, D} exist or none exists. If all exist, create G_1 as follows: use two latents L_1 and L_2, where L_1 is a parent of {A, C} and L_2, L_2 is a parent of B, D and E, and there is a bi-directed edge C ↔ E. Construct G_2 by adding edge L_2 → A to G_1. If none of the three mentioned choke points exists, do the same but with an extra bi-directed edge B ↔ E.
Assume now that choke point {C, D} × {B, E} exists (but not {A, D} × {B, E}). This is analogous to the previous case by symmetry of A and C.

Assume now that no choke point {C, D} × {B, E} or {A, D} × {B, E} exists. We only have to consider now choke points of the form {X_1, B} × {X_2, E}. Let Pair 1 be the set of choke points {{A, B} × {C, E}, {A, B} × {D, E}}. Let Pair 2 be the set of choke points {{B, C} × {A, E}, {B, C} × {D, E}}. Let Pair 3 be the set of choke points {{B, D} × {A, E}, {B, D} × {C, E}}. At most one element of Pair 1 can exist (or otherwise it would entail {A, B} × {C, D}). For the same reason, at most one element of Pair 2 can exist. Either both elements of Pair 3 exist or none exists.

If both elements of Pair 3 exist, then no element of Pair 1 or Pair 2 can exist. For example, {B, D} × {A, E} from Pair 3 and {B, C} × {A, E} from Pair 2 together entail {C, D} × {A, E}, discarded by hypothesis. In the case where both elements of Pair 3 exist, construct G_1 as follows: let L_1 and L_2 be two latents, where L_1 is a parent of {A, C} and L_2, and L_2 is a parent of B, D and E. Add bi-directed edges A ↔ E and C ↔ E. Construct G_2 by adding L_2 → A to G_1.

Choke point {B, C} × {D, E} (from Pair 2) cannot co-exist with {A, B} × {D, E} (from Pair 1), since this entails {A, C} × {D, E}. Moreover, {B, C} × {D, E} cannot co-exist with {A, B} × {C, E} (also from Pair 1), since {{A, C} × {B, D}, {A, B} × {C, E}, {B, C} × {D, E}}, by exchanging B with D, generates {{A, C} × {B, D}, {A, D} × {C, E}, {B, E} × {C, D}}. From Lemma B.9, this implies all three tetrads in the covariance matrix of {A, B, C, D}, a contradiction.

By symmetry between A and C, it follows that no two elements of the union of Pair 1 and Pair 2 can simultaneously exist. Let {X_1, B} × {X_2, E} be a choke point in the union of Pair 1 and Pair 2 that is assumed to exist. Construct G_1 as follows: let L_1 and L_2 be two latents, where L_1 is a parent of {A, C} and L_2, and L_2 is a parent of {B, D}. If X_1 = A and X_2 = C, or if X_1 = C and X_2 = A, let L_1 be the parent of E. Otherwise, let L_2 be the parent of E. Add bi-directed edges between E and every element in X∖{B, X_1}. Construct G_2 by adding L_2 → A to G_1.

Finally, if no element in Pairs 1, 2 or 3 is assumed to exist, create G_1 and G_2 as above, but connect E to all other elements of X by bi-directed edges. □
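To make the bookkeeping behind this case analysis concrete, the sketch below (our own illustration; the edge coefficients are made up) computes the covariance implied by the base graph G_0 of Case 1 (L_1 a parent of A and L_2; L_2 a parent of B, C, D, E) and lists, for every foursome of observed variables, which of its three tetrad differences vanish; repeating the check after adding an edge such as B → E shows how individual constraints are destroyed.

    import numpy as np
    from itertools import combinations

    def implied_cov(W, err_var):
        # Covariance of the linear SEM x = W x + e with independent errors e.
        n = W.shape[0]
        A = np.linalg.inv(np.eye(n) - W)
        return A @ np.diag(err_var) @ A.T

    def report_tetrads(S, names, tol=1e-9):
        # For every foursome, report which of its three tetrad differences vanish.
        for i, j, k, l in combinations(range(len(names)), 4):
            t1 = S[i, j] * S[k, l] - S[i, k] * S[j, l]
            t2 = S[i, j] * S[k, l] - S[i, l] * S[j, k]
            t3 = S[i, k] * S[j, l] - S[i, l] * S[j, k]
            holds = tuple(abs(t) < tol for t in (t1, t2, t3))
            print("".join(names[m] for m in (i, j, k, l)), holds)

    # Variables: L1, L2 latent; A, B, C, D, E observed.  Coefficients are made up.
    W = np.zeros((7, 7))
    W[1, 0] = 0.8                      # L1 -> L2
    W[2, 0] = 0.9                      # L1 -> A
    for child, coef in zip([3, 4, 5, 6], [0.7, 1.1, 0.6, 0.5]):
        W[child, 1] = coef             # L2 -> B, C, D, E
    err = np.array([1.0, 0.5, 0.4, 0.3, 0.6, 0.5, 0.7])

    S = implied_cov(W, err)[2:, 2:]    # observed covariance over (A, B, C, D, E)
    report_tetrads(S, ["A", "B", "C", "D", "E"])    # all tetrads hold for G_0

    W[6, 3] = 0.4                      # add edge B -> E, as in one subcase
    S2 = implied_cov(W, err)[2:, 2:]
    report_tetrads(S2, ["A", "B", "C", "D", "E"])   # some tetrads in foursomes containing both B and E no longer hold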
Appendix  C
Results  from  Chapter  6
C.1   Update  equations  for  variational   approximation
Following the notation in Chapter 6, the equations below provide the update steps on the optimization of the variational lower bound:
1. Optimizing q(π) and a*:

\[ q(\pi) = \mathrm{Dirichlet}(\pi \mid a_m) \quad (C.1) \]

where for each element a_m^s in a_m,

\[ a_m^s = a^* + \sum_{i=1}^{n} q(s_i = s) \quad (C.2) \]

To optimize a*, the update is based on

\[ \frac{1}{S}\sum_{s=1}^{S}\left[\Psi(a^*) - \Psi(a_m^s)\right] \quad (C.3) \]

where Ψ(x) here is the digamma function, the derivative of the logarithm of the gamma function.
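As a small illustration of the count update in (C.2) (a sketch in our notation; the responsibilities q(s_i = s) would come from step 7 below), the posterior Dirichlet parameter of each mixture component is the shared hyperparameter plus the soft counts assigned to that component:

    import numpy as np

    def dirichlet_update(a_star, resp):
        """resp: n-by-S matrix of responsibilities q(s_i = s); rows sum to one."""
        return a_star + resp.sum(axis=0)    # a_m^s = a* + sum_i q(s_i = s)

    resp = np.array([[0.9, 0.1],
                     [0.2, 0.8],
                     [0.5, 0.5]])
    print(dirichlet_update(1.0, resp))      # -> [2.6, 2.4]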
2. Optimizing q(B) and φ_L:

Let ⟨g(V)⟩_{q(V)} denote the expected value of g(V) according to the distribution q(V). Since the prior probability of the elements in B^s is a product of marginals for each element in this set, its posterior distribution will also factorize over each L^(k)_i ∈ L_i, where 1 ≤ i ≤ n is an index over data points, n being the size of the data set. Let {L^(k_1), ..., L^(k_{m_k})} ⊆ L be the parents of L^(k) in G. Let b_{kjs} be the parameter associated with edge L^(k_j) → L^(k) in mixture component s. Then the variational posterior distribution q(B) is given by

\[ q(B) = \prod_{k=1}^{|L|} q(B^s_{L_k}) \propto \prod_{k=1}^{|L|} \mathcal{N}\!\left(V^{-1}_{L_k} M_{L_k},\; V^{-1}_{L_k}\right), \quad (C.4) \]

\[ B^s_{L_k} = [\,b_{k1s}\ \ldots\ b_{k m_k s}\,], \quad (C.5) \]

\[ M_{L_{kj}} = \sum_{i=1}^{n} q(s_i = s)\, \psi_{ks} \left\langle L^{(k)}_i L^{(k_j)}_i \right\rangle_{q(L_i \mid s_i)} \quad (C.6) \]

\[ V_{L_{kjl}} = \sum_{i=1}^{n} q(s_i = s)\, \psi_{ks} \left\langle L^{(k_j)}_i L^{(k_l)}_i \right\rangle_{q(L_i \mid s_i)} + \mathbb{1}(j = l)\, \phi_L \quad (C.7) \]

where 1 ≤ j ≤ m_k, 1 ≤ l ≤ m_k, and 1(T) = 1 if and only if expression T is true, and 0 otherwise. Moreover,

\[ (\phi_L)^{-1} = \frac{\sum_{s=1}^{S} \sum_{k=1}^{|L|} \left\langle B^s_{L_k} {B^s_{L_k}}^{\top} \right\rangle_{q(B^s)}}{|B|} \quad (C.8) \]

where |B| is the number of elements in B.
3. Optimizing ψ_{ks}, 1 ≤ k ≤ |L|, 1 ≤ s ≤ S:

\[ \psi_{ks} = \frac{\sum_{i=1}^{n} q(s_i = s)}{\sum_{i=1}^{n} q(s_i = s) \left\langle \left( L^{(k)}_i - \sum_{j=1}^{m_k} b_{kjs} L^{(k_j)}_i \right)^{2} \right\rangle_{q(L_i \mid s_i)\, q(B^s)}} \quad (C.9) \]
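Read as a weighted-residual update (our paraphrase of (C.9), with hypothetical variable names), each latent precision is the responsibility mass of component s divided by the responsibility-weighted expected squared residual of the structural equation for L^(k); a minimal sketch:

    import numpy as np

    def precision_update(resp_s, expected_sq_resid):
        """resp_s[i] = q(s_i = s); expected_sq_resid[i] = E[(L_i^(k) - sum_j b_kjs L_i^(kj))^2]."""
        return resp_s.sum() / np.dot(resp_s, expected_sq_resid)

    resp_s = np.array([0.9, 0.2, 0.5])
    expected_sq_resid = np.array([0.8, 1.5, 1.1])
    print(precision_update(resp_s, expected_sq_resid))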
4. Optimizing q(L_i | s_i):

Let Ψ_s be the diagonal matrix such that (Ψ_s)_{kk} is the corresponding inverse variance ψ_{ks}. Let B^{s_i} be a matrix of coefficients such that entry b_{kj} = 0 if there is no edge L^(j) → L^(k) in G. Otherwise, let b_{kj} correspond to the parameter associated with edge L^(j) → L^(k) in mixture component s_i. Let Ch_X(L^(k)) and Ch_L(L^(k)) be the children of L^(k) in X and L, respectively. Let Ch_X(L^(j), L^(k)) = Ch_X(L^(k)) ∩ Ch_X(L^(j)). Let λ_{tks_i} be the parameter associated with edge L^(k) → X^(t) in mixture component s_i. Let Pa_X(X^(t)) be the parents of X^(t) in X, and let λ_{tvs_i} be the parameter associated with edge X^(v) → X^(t) in mixture component s_i. Finally, let I be the identity matrix of size |L|.

We optimize the variational posterior q(L_i | s_i) by:

\[ q(L_i \mid s_i) = \mathcal{N}\!\left((V_1 + V_2)^{-1} M,\; (V_1 + V_2)^{-1}\right), \quad (C.10) \]

\[ M_k = \sum_{X^{(t)} \in Ch_X(L^{(k)})} \left[ X^{(t)}_i \left\langle \lambda_{tks_i} \right\rangle_{q(\Lambda^{s_i})} - \sum_{v \in Pa_X(X^{(t)})} \left\langle \lambda_{tvs_i}\, \lambda_{tks_i} \right\rangle_{q(\Lambda^{s_i})} \right], \quad (C.11) \]

\[ V_1 = \left\langle (I - B^{s_i})^{\top}\, \Psi_{s_i}\, (I - B^{s_i}) \right\rangle_{q(B^{s_i})}, \quad (C.12) \]

\[ (V_2)_{jk} = \sum_{X^{(t)} \in Ch_X(L^{(j)}, L^{(k)})} \nu_t \left\langle \lambda_{tjs_i}\, \lambda_{tks_i} \right\rangle_{q(\Lambda^{s_i})}. \quad (C.13) \]
5. Optimizing q(Λ) and φ_X:

Let {Z^(k_1), ..., Z^(k_{m_k})} ⊆ L ∪ X ∪ {1} be the parents of X^(k) in G. Let λ_{kjs} be the parameter associated with edge Z^(k_j) → X^(k). By convention, let Z^(k_1) be the intercept term among the parents of X^(k) (i.e., Z^(k_1) is constant and set to 1). Then the variational posterior distribution q(Λ) is given by

\[ q(\Lambda) = \prod_{s=1}^{S} \prod_{k=1}^{|X|} q(\Lambda^s_{X_k}) \propto \prod_{s=1}^{S} \prod_{k=1}^{|X|} \mathcal{N}\!\left(V^{-1}_{X_k s} M_{X_k s},\; V^{-1}_{X_k s}\right), \quad (C.14) \]

\[ \Lambda^s_{X_k} = [\,\lambda_{k1s}\ \ldots\ \lambda_{k m_k s}\,], \quad (C.15) \]

\[ M_{X_{kj} s} = \sum_{i=1}^{n} q(s_i = s)\, \nu_k \left\langle X^{(k)}_i Z^{(k_j)}_i \right\rangle_{q(L_i \mid s)} \quad (C.16) \]

\[ V_{X_{kjl} s} = \sum_{i=1}^{n} q(s_i = s)\, \nu_k \left\langle Z^{(k_j)}_i Z^{(k_l)}_i \right\rangle_{q(L_i \mid s)} + \mathbb{1}(j = l \wedge j > 1)\, \phi_{X^{(k)}} + \mathbb{1}(j = l \wedge j = 1)\, \phi^{t}_{X^{(k)}} \]

where 1 ≤ j ≤ m_k, 1 ≤ l ≤ m_k, and 1(T) = 1 if and only if expression T is true, and 0 otherwise. Moreover,

\[ (\phi_{X_k})^{-1} = \frac{\sum_{s=1}^{S} \sum_{j>1} \left\langle \lambda^{2}_{kjs} \right\rangle_{q(\Lambda^s)}}{|\Lambda^s_{X_k}|\, S} \quad (C.17) \]

\[ (\phi^{t}_{X_k})^{-1} = \frac{\sum_{s=1}^{S} \left\langle \lambda^{2}_{k1s} \right\rangle_{q(\Lambda^s)}}{S} \quad (C.18) \]
6. Optimizing ν_k, 1 ≤ k ≤ |X|:

\[ \nu_k = \frac{\sum_{s=1}^{S} \sum_{i=1}^{n} q(s_i = s)}{\sum_{s=1}^{S} \sum_{i=1}^{n} q(s_i = s) \left\langle \left( X^{(k)}_i - \sum_{j=1}^{m_k} \lambda_{kjs} Z^{(k_j)}_i \right)^{2} \right\rangle_{q(L_i \mid s)\, q(\Lambda^s_{X_k})}} \quad (C.19) \]

where for each X^(k), {Z^(k_1), ..., Z^(k_{m_k})} are the parents of X^(k) in G.
7. Optimizing q(s_i):

\[ q(s_i) = \frac{1}{Z} \exp\!\Big[ \Psi(a_m^{s_i}) - \Psi\Big(\sum_{s} a_m^{s}\Big) + \big\langle \ln p(L_i \mid s_i) \big\rangle_{q(L_i \mid s_i)\, q(B^{s_i})} + \tfrac{1}{2} \ln |\Sigma_{s_i}| - \tfrac{1}{2}\, \mathrm{tr}\, \big\langle (X_i - \Lambda_{s_i} Z_i)(X_i - \Lambda_{s_i} Z_i)^{\top} \big\rangle_{q(L_i \mid s_i)\, q(\Lambda^{s_i})} \Big] \]

where Σ_{s_i} is the covariance of L given s = s_i, and Z is a normalizing constant ensuring that Σ_{s_i=1}^{S} q(s_i) = 1.
Figure C.1: Let F be a pure one-factor model consisting of indicators X_1, ..., X_n, and let the true graph among such variables be given as above, where all latent variables are connected and X_3 is connected by bi-directed edges to all other variables not in {X_1, X_2, X_4}. Variable X_3 will be the one present in the highest number of tetrad constraints that are entailed by F but do not hold in the population.
C.2   Problems with Washdown

The intuition behind Washdown is that nodes that participate in the highest number of invalid tetrad constraints will be the first ones to be eliminated. This should be seen as a heuristic, since typical score functions, such as the one suggested in Chapter 6, also take into account quantitative characteristics of the invalid tetrad constraints (i.e., how much they deviate from zero in reality, and not only whether the constraint is entailed or not).

However, even if the given score function perfectly ranks models according to the number of invalid tetrad constraints (i.e., where the models with the smallest number of false constraints achieve the highest score), this is still not enough to guarantee that Washdown will find a pure measurement model if one exists, as formalized by the next theorem.
Theorem C.1 Let O be the set of variables in the dataset given as input to Washdown. Assume the score function is the negative number of invalid tetrad constraints that are entailed by the model (so that the best ranked models will be the ones with the smallest number of invalid entailed tetrad constraints). Then, even if there is some sequence of node deletions (from the one-factor model given at the start of Washdown) that creates a pure model with at least three nodes per latent, Washdown might not follow any such sequence.

Proof: A counter-example can easily be constructed by having a unique set of four indicators that can form a pure one-factor model, and making one of these variables belong to many entailed tetrad constraints that are violated in the population. Figure C.1 illustrates such a case. The only possible one-factor model that can be formed contains variables X_1 to X_4. Variable X_3 is present in the highest number of invalid tetrad constraints (by making the number of latents much higher than 4), and will be removed from F in the next step if the score function satisfies the assumptions of the theorem. No other subset of four variables can form a one-factor model, and in the end the empty graph G_0 will have a higher score (assuming consistency of the score function) than whatever set is selected by Washdown. □
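The counting heuristic that this counter-example exploits is easy to state in code. The sketch below is our own illustration: `entailed` and `holds_in_population` are hypothetical predicates standing in for the model-entailment and population checks, and the function tallies, per indicator, how many entailed-but-violated tetrad constraints it appears in; this is the quantity that makes X_3 the first node removed in Figure C.1.

    from itertools import combinations

    def violation_counts(variables, entailed, holds_in_population):
        """Count, per variable, the entailed tetrad constraints violated in the population.

        entailed(tetrad) and holds_in_population(tetrad) are user-supplied predicates
        over tetrads represented as a pairing ((a, b), (c, d)) of variable names.
        """
        counts = {v: 0 for v in variables}
        for a, b, c, d in combinations(variables, 4):
            # the three tetrad constraints of the foursome {a, b, c, d}
            for tetrad in [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]:
                if entailed(tetrad) and not holds_in_population(tetrad):
                    for v in (a, b, c, d):
                        counts[v] += 1
        return counts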
C.3   Implementation details

In our implementation of Washdown, we used Structural EM (Friedman, 1998) to speed up the choice of the node to be removed. Given a model of n indicators, we score the n possible submodels by fixing the distribution of the latents given the data, and then estimating the other parameters of each submodel before scoring it. Once a node is chosen to be removed, we estimate the full model again and compare it to the current score. This way, the number of full score evaluations is never higher than the number of observable variables for each new cluster that is introduced. For larger sample sizes, one might want to re-estimate the full model only when a local maximum is achieved, in order to obtain a much higher speed-up. We did not perform any empirical study on how this might hurt the accuracy of the algorithm.
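Schematically, the removal step just described looks as follows (a sketch only; `model` is assumed to expose `indicators` and `without`, and the callables passed in stand for the actual estimation and scoring routines):

    def remove_one_node(model, data, posterior_over_latents, fit_with_fixed_latents,
                        fit_full, score):
        """One Washdown removal step using a Structural-EM-style shortcut (sketch)."""
        q_latents = posterior_over_latents(model, data)      # fixed for every candidate
        scored = []
        for node in model.indicators:
            sub = model.without(node)                        # candidate submodel
            sub = fit_with_fixed_latents(sub, data, q_latents)
            scored.append((score(sub, data), node))
        best_score, best_node = max(scored, key=lambda t: t[0])
        full = fit_full(model.without(best_node), data)      # one full re-estimation only
        if score(full, data) > score(model, data):
            return full, best_node
        return model, None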
We actually did not apply the variational score in most of the implementation of Washdown used in the experiments on causal discovery. The reason was the sensitivity of the score function to the initial choice of parameter values: many different local maxima could be generated. Doing multiple re-starts from a large number of initial parameter values slows down the method considerably. Therefore, instead of using the variational score for choosing the node to be removed, we used BIC. The likelihood function is not as sensitive (since there are no hyperparameters to be fit). We still needed to do multiple re-starts (five, in our implementation), but the variance of the score per trial was not as high, and therefore we do not need as many re-starts as we would with the variational function.

However, one can verify in synthetic experiments that the BIC score is considerably less precise than the variational one, underfitting the data much more easily. This is partially due to the difficulty of the problem: in Gaussian models, for instance, a χ² test would frequently accept with high significance (> 0.20) a false one-factor model that in reality would contain several nodes from different clusters.
To minimize this problem, we added an extra step to our implementation: suppose X_i is the best choice of node to be removed, but the model where all other indicators are parents of X_i (as in Figure 6.4) still scores less than the pure model with no extra edges. Instead of stopping the removal of nodes, we do a greedy search that tries to add some edge X_j → X_i to the current pure model if that increases the score. If after this search we have some edge X_j → X_i, we remove X_i and proceed to the next iteration of node removal. This modification is essential for making Washdown work reasonably well with the BIC score function.

A less elegant modification was added on top of that at the end of each cycle of Washdown, before we perform a GraphComparison. We again do a greedy search to add edges between indicators, but now without restricting which nodes can be at the endpoints, unlike the procedure given in the previous paragraph. If some edge A → B is added, we remove node B. A new search for the next edge is done, and we stop when no edge can increase the score of the model.
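The two greedy searches share the same skeleton; the following sketch of the unrestricted end-of-cycle variant is only an illustration, with `score`, `with_edge` and `without_node` as hypothetical stand-ins and `model.indicators` an assumed attribute:

    from itertools import permutations

    def greedy_add_and_remove(model, data, score, with_edge, without_node):
        """Repeatedly add the best score-improving indicator-to-indicator edge a -> b,
        removing node b after each addition; stop when no edge helps (sketch)."""
        current, current_score = model, score(model, data)
        improved = True
        while improved:
            improved = False
            best = None
            for a, b in permutations(current.indicators, 2):
                candidate = with_edge(current, a, b)          # add edge a -> b
                s = score(candidate, data)
                if s > current_score and (best is None or s > best[0]):
                    best = (s, b, candidate)
            if best is not None:
                _, removed, candidate = best
                current = without_node(candidate, removed)    # drop the child node b
                current_score = score(current, data)
                improved = True
        return current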
The variational score function was still used in GraphComparison and MIMBuild. In our experiments with density estimation, we did not use the BIC score at all, and consequently none of the modifications above, since they slow down the procedure (we did not increase the number of score function evaluations per trial). It would be interesting to compare in future work whether these modifications would result in a better probabilistic model for the given datasets.

Another heuristic that we adopted was requiring a minimum number of indicators per latent. In the case of the regular Washdown, we forced the algorithm to keep at least three indicators per latent at all times (or four indicators, if there is only one latent). If the absence of some node would imply a model without three indicators per latent, then this node would not be considered for removal. In simulations, this seems to help increase the accuracy of the model, avoiding unnecessary fragmentation of clusters. The number 3 was chosen since it is the minimum number of indicators needed to make a single latent factor identifiable (if there is more than one latent, a fourth descendant is available as the child of another latent; otherwise, we require 4 indicators for a one-factor model to be testable). Notice that the original Washdown algorithm of Silva (2002) does not impose this restriction.

In the case of K-LatentClustering, which allows multiple latent parents per indicator, we applied a generalized version of this heuristic. Instead of requiring at least 3 indicators per cluster as in Washdown (where each cluster has only one latent parent), we require at least p indicators for a cluster of k latents, where p is the minimum integer such that p(p + 1)/2 ≥ kp + p. That is, p is the minimum number of indicators such that the number of unique entries in the observed covariance matrix (p(p + 1)/2) is at least as large as the number of covariance parameters (per mixture component) in the measurement model of the cluster (kp + p).
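For reference, the inequality can be solved in closed form (a small worked consequence of the stated condition, not an additional assumption): p(p + 1)/2 ≥ (k + 1)p is equivalent, for p > 0, to (p + 1)/2 ≥ k + 1, i.e., p ≥ 2k + 1. So a cluster with k = 1 latent needs at least 3 indicators, k = 2 needs at least 5, and k = 3 needs at least 7.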
Concerning the bi-directed edges that are used in the description of FullLatentClustering, we chose not to parameterize them as covariances among the residuals, as is done, e.g., in the Gaussian mixed ancestral graph representation of Richardson and Spirtes (2002), mostly due to the difficulty of defining priors over such graphs and performing parameter fitting under the Structural EM framework, as explained in the next paragraphs. Instead, each bi-directed edge is just a shorthand representation of a new independent hidden common cause of two children.

That is, each bi-directed edge X_1 ↔ X_2 represents a new independent latent, X_1 ← L → X_2. The goal is to free the covariance σ_{X_1X_2} across every component of the mixture model, increasing the rank of the covariance matrix only on subsets of the observed variables that include X_1 and X_2, while leaving all other covariances untouched.¹
Concerning bi-directed edges and Structural EM, we introduce yet another approximation. Let L_new be the new hidden variable associated with the bi-directed edge X_1 ↔ X_2, and let L be the current set of latents. We introduce the variational approximation q(L ∪ {L_new}) ≈ q(L) q(L_new), fixing q(L) and updating only q(L_new). This still requires fitting a latent variable model for each evaluation, but it is a model with only one latent and only the edges into X_1 and X_2, which is relatively efficient. Notice this is still a lower bound on the true function. After deciding which bi-directed edge increases the score most (if any), we introduce it into the graph and evaluate the full log-posterior score function.
¹ Those are still not completely free to vary, since the full covariance matrix is constrained to be positive definite.
Bibliography
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. Proc. 20th Int. Conf. Very Large Data Bases, VLDB, 1994.
H. Attias. Independent factor analysis. Graphical Models: Foundations of Neural Computation, pages 207–257, 1999.
F. Bach and M. Jordan. Learning graphical models with Mercer kernels. Neural Information Processing Systems, 2002.
F. Bach and M. Jordan. Beyond independent components: trees and clusters. Journal of Machine Learning Research, 4:1205–1233, 2003.
D. Bartholomew. Measuring Intelligence: Facts and Fallacies. Cambridge University Press, 2004.
D. Bartholomew and M. Knott. Latent Variable Models and Factor Analysis. Arnold Publishers, 1999.
D. Bartholomew, F. Steele, I. Moustaki, and J. Galbraith. The Analysis and Interpretation of Multivariate Data for Social Scientists. Arnold Publishers, 2002.
A. Basilevsky. Statistical Factor Analysis and Related Methods. Wiley, 1994.
M. Beal and Z. Ghahramani. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics, 7, 2003.
M. Beal, Z. Ghahramani, and C. Rasmussen. The infinite hidden Markov model. Advances in Neural Information Processing Systems, 14, 2001.
J. Binder, D. Koller, S. Russell, and K. Kanazawa. Adaptive probabilistic networks with hidden variables. Machine Learning, 29:213–244, 1997.
C. Bishop. Latent variable models. Learning in Graphical Models, 1998.
C. L. Blake and C. J. Merz. UCI repository of machine learning databases, http://www.ics.uci.edu/mlearn/mlrepository.html, 1998.
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, pages 993–1022, 2003.
K. Bollen. Structural Equation Models with Latent Variables. John Wiley & Sons, 1989.
K. Bollen. Outlier screening and a distribution-free test for vanishing tetrads. Sociological Methods and Research, 19:80–92, 1990.
K. Bollen. Modeling strategies: in search of the holy grail. Structural Equation Modeling, 7:74–81, 2000.
K. Bollen and P. Paxton. Interactions of latent variables in structural equation models. Structural Equation Modeling, 5:267–293, 1998.
C. Borgelt and R. Kruse. Induction of association rules: Apriori implementation. 15th Conference on Computational Statistics (Compstat 2002, Berlin, Germany), 2002.
W. Buntine and A. Jakulin. Applying discrete PCA in data analysis. Proceedings of 20th Conference on Uncertainty in Artificial Intelligence, 2004.
E. Carmines and R. Zeller. Reliability and Validity Assessment. Quantitative Applications in the Social Sciences 17. Sage Publications, 1979.
M. Carreira-Perpiñán. Continuous Latent Variable Models for Dimensionality Reduction and Sequential Data Reconstruction. PhD Thesis, University of Sheffield, UK, 2001.
R. Carroll, D. Ruppert, C. Crainiceanu, T. Tosteson, and M. Karagas. Nonlinear and nonparametric regression and instrumental variables. Journal of the American Statistical Association, 99:736–750, 2004.
D. Chakrabarti, S. Papadimitriou, D. Modha, and C. Faloutsos. Fully automatic cross-associations. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 79–88, 2004.
D. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.
W. Chu and Z. Ghahramani. Gaussian processes for ordinal regression. Technical Report, University College London, 2004.
G. Cooper. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 2, 1997.
G. Cooper. An overview of the representation and discovery of causal relationships using Bayesian networks. Computation, Causation and Discovery, pages 3–62, 1999.
G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.
G. Elidan, N. Lotner, N. Friedman, and D. Koller. Discovering hidden variables: a structure-based approach. Neural Information Processing Systems, 13:479–485, 2000.
C. Fornell and Y. Yi. Assumptions of the two-step approach to latent variable modeling. Sociological Methods & Research, 20:291–320, 1992.
N. Friedman. The Bayesian structural EM algorithm. Proceedings of 14th Conference on Uncertainty in Artificial Intelligence, 1998.
D. Geiger and C. Meek. Quantifier elimination for statistical problems. Proceedings of 15th Conference on Uncertainty in Artificial Intelligence, 1999.
Z. Ghahramani and M. Beal. Variational inference for Bayesian mixture of factor analysers. Advances in Neural Information Processing Systems, 12, 1999.
Z. Ghahramani and G. Hinton. The EM algorithm for the mixture of factor analyzers. Technical Report CRG-TR-96-1, Department of Computer Science, University of Toronto, 1996.
J. Gibson. Freedom and Tolerance in the United States. Chicago, IL: University of Chicago, National Opinion Research Center [producer], 1987. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 1991.
C. Glymour. The Mind's Arrow: Bayes Nets and Graphical Causal Models in Psychology. MIT Press, 2002.
C. Glymour and G. Cooper. Computation, Causation and Discovery. MIT Press, 1999.
C. Glymour, R. Scheines, P. Spirtes, and K. Kelly. Discovering Causal Structure: Artificial Intelligence, Philosophy of Science, and Statistical Modeling. Academic Press, 1987.
M. Grzebyk, P. Wild, and D. Chouaniere. On identification of multi-factor models with correlated residuals. Biometrika, 91:141–151, 2004.
B. Habing. Nonparametric regression and the parametric bootstrap for local dependence assessment. Applied Psychological Measurement, 25:221–233, 2001.
H. Harman. Modern Factor Analysis. University of Chicago Press, 1967.
L. Hayduk and D. Glaser. Jiving the four-step, waltzing around factor analysis, and other serious fun. Structural Equation Modeling, 7:1–35, 2000.
D. Heckerman. A Bayesian approach to learning causal networks. Proceedings of 11th Conference on Uncertainty in Artificial Intelligence, pages 285–295, 1995.
D. Heckerman. A tutorial on learning with Bayesian networks. Learning in Graphical Models, pages 301–354, 1998.
A. Hyvärinen. Survey on independent component analysis. Neural Computing Surveys, 2:94–128, 1999.
A. Jackson and R. Scheines. Single mothers' self-efficacy, parenting in the home environment and children's development in a two-wave study. Submitted to Social Work Research, 2005.
R. Johnson and D. Wichern. Applied Multivariate Statistical Analysis. Prentice Hall, 2002.
M. Jordan. Learning in Graphical Models. MIT Press, 1998.
K. Jöreskog. Structural Equation Modeling with Ordinal Variables using LISREL. Technical Report, Scientific Software International Inc., 2004.
B. Junker and K. Sijtsma. Nonparametric item response theory in action: An overview of the special issue. Applied Psychological Measurement, 25:211–220, 2001.
Y. Kano and A. Harada. Stepwise variable selection in factor analysis. Psychometrika, 65:7–22, 2000.
Y. Kano and S. Shimizu. Causal inference using nonnormality. Proceedings of the International Symposium of Science of Modeling - The 30th Anniversary of the Information Criterion (AIC), pages 261–270, 2003.
R. Klee. Introduction to the Philosophy of Science: Cutting Nature at its Seams. Oxford University Press, 1996.
J. Loehlin. Latent Variable Models: An Introduction to Factor, Path and Structural Equation Analysis. Lawrence Erlbaum, 2004.
E. Malinowski. Factor Analysis in Chemistry. John Wiley & Sons, 2002.
C. Meek. Graphical Models: Selecting Causal and Statistical Models. PhD Thesis, Carnegie Mellon University, 1997.
T. Minka. Automatic choice of dimensionality for PCA. Advances in Neural Information Processing Systems, 13:598–604, 2000.
T. Mitchell. Machine Learning. McGraw-Hill, 1997.
J. Pan, C. Faloutsos, M. Hamamoto, and H. Kitagawa. AutoSplit: fast and scalable discovery of hidden variables in stream and multimedia databases. PAKDD, 2004.
J. Pearl. Probabilistic Reasoning in Expert Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.
T. Richardson and P. Spirtes. Ancestral graph Markov models. Annals of Statistics, 30:962–1030, 2002.
K. Roeder and L. Wasserman. Practical Bayesian density estimation using mixtures of normals. Journal of the American Statistical Association, pages 894–902, 1997.
P. Rosenbaum. Observational Studies. Springer-Verlag, 2002.
B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
G. Shafer, A. Kogan, and P. Spirtes. Generalization of the tetrad representation theorem. DIMACS Technical Report, 1993.
R. Silva. The structure of the unobserved. MSc. Thesis, Center for Automated Learning and Discovery. Technical Report CMU-CALD-02-102, School of Computer Science, Carnegie Mellon University, 2002.
R. Silva and R. Scheines. Generalized measurement models. Technical Report CMU-CALD-04-101, Carnegie Mellon University, 2004.
R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning measurement models for unobserved variables. Proceedings of 19th Conference on Uncertainty in Artificial Intelligence, pages 543–550, 2003.
C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. Data Mining and Knowledge Discovery, 2000.
C. Spearman. "General intelligence," objectively determined and measured. American Journal of Psychology, 15:210–293, 1904.
P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Cambridge University Press, 2000.
E. Stanghellini and N. Wermuth. On the identification of path analysis models with one hidden variable. Biometrika, 92, to appear, 2005.
W. Stout. A new item response theory modeling approach with applications to unidimensionality assessment and ability estimation. Psychometrika, 55:293–325, 1990.
M. Wall and Y. Amemiya. Estimation of polynomial structural equation models. Journal of the American Statistical Association, 95:929–940, 2000.
M. Wedel and W. Kamakura. Factor analysis with (mixed) observed and latent variables in the exponential family. Psychometrika, 66:515–530, 2001.
J. Wegelin, A. Packer, and T. Richardson. Latent models for cross-covariance. Journal of Multivariate Analysis, in press, 2005.
J. Wishart. Sampling errors in the theory of two factors. British Journal of Psychology, 19:180–187, 1928.
I. Yalcin and Y. Amemiya. Nonlinear factor analysis as a statistical method. Statistical Science, 16:275–294, 2001.
M. Zaki. Mining non-redundant association rules. Data Mining and Knowledge Discovery, 19:223–248, 2004.
N. Zhang. Hierarchical latent class models for cluster analysis. Journal of Machine Learning Research, 5:697–723, 2004.