Deep SEM
Deep SEM
net/publication/353401568
CITATIONS READS
111 4,722
7 authors, including:
Dan Zhao
Tsinghua University
80 PUBLICATIONS 2,332 CITATIONS
SEE PROFILE
All content following this page was uploaded by Qiuyu Lian on 06 May 2022.
Gene regulatory networks (GRNs) encode the complex molecular interactions that govern cell identity. Here we propose
DeepSEM, a deep generative model that can jointly infer GRNs and biologically meaningful representation of single-cell RNA
sequencing (scRNA-seq) data. In particular, we developed a neural network version of the structural equation model (SEM)
to explicitly model the regulatory relationships among genes. Benchmark results show that DeepSEM achieves comparable or
better performance on a variety of single-cell computational tasks, such as GRN inference, scRNA-seq data visualization, clus-
tering and simulation, compared with the state-of-the-art methods. In addition, the gene regulations predicted by DeepSEM on
cell-type marker genes in the mouse cortex can be validated by epigenetic data, which further demonstrates the accuracy and
efficiency of our method. DeepSEM can provide a useful and powerful tool to analyze scRNA-seq data and infer a GRN.
T
he rapid development of single-cell sequencing technolo- nonlinear frameworks and benefit from the computational power
gies1–3 provides unprecedented opportunities for biologists that the deep learning model brought to us.
to investigate cellular states. However, it also poses new To address the above problems, we present DeepSEM, a deep
challenges in the form of experimental noise not found in bulk generative model that can jointly embed the gene expression data
sequencing data4,5, which might significantly decrease the accu- and simultaneously construct a GRN that reflects the inner struc-
racy of downstream bioinformatics analysis by introducing biases ture of gene interactions in single cells without relying on any addi-
in the gene expression. To address these problems, recently, there tional information such as TF binding motifs or single-cell ATAC
has been great interest in applying deep learning models to filter sequencing (scATAC-seq) data. To implement such an idea, we took
the noise in single-cell transcriptome data by modeling the compli- inspiration from the work of Yu et al.21, which generalized a popular
cated interaction patterns among the genes6,7. Inherently, the tran- approach, called the structural equation model (SEM), that infers
scriptome of a cell is governed by its gene regulation process in a the causality using a linear model. By adding proper mathematical
cell-specific manner. Hence, we expect that the deep learning-based constraints, part of the neural network architecture could be used
methods are able to model such gene interactions to reveal a more to predict the GRN of the scRNA-seq data. A previous study by
clear landscape of cell heterogeneity, capturing both transcriptomic Lin et al.22 showed that more accurate cell representations could be
similarities between cells of the same cell type and differences across achieved by guiding the neural network architecture with a GRN
different cell types8–10. However, so far, deep learning-based single- structure derived from the literature and databases. In this Article,
cell analysis framework6,7,11–13 are usually black boxes and it is hard we show that the neural network architecture can reflect GRN
to evaluate to what extent gene regulatory network (GRN) structure structure by properly designing the neural network layer without
or any other internal structure of the data is learned. relying on any prior knowledge. The neural network architecture
A number of computational models14–19 have attempted to incor- can be inferred jointly with the training of the weights of the neu-
porate GRN inference into their single-cell data analysis models. ral network in an end-to-end manner. The overall framework of
One class of these methods relies on side measurements such as sin- DeepSEM is a beta-variational autoencoder (beta-VAE)23 in which
gle-cell chromatin accessibility or transcription factor (TF) binding the weights of both the encoder and decoder functions represent
motifs19. However, these measurements often require more com- the adjacency matrix of the GRN. Our model does not require
plicated experimental designs and could also introduce additional any extra experimental data such as open chromatin information,
noise as these data could come from different experiments. Current ChIP sequencing (ChIP-seq) data or TF binding motifs to infer the
methods solely based on single-cell RNA sequencing (scRNA-seq) GRN structure. The nonlinear neural networks in DeepSEM are
data also have explicit limitations. For example, it is common for employed to address the challenges in single-cell data analysis, such
GRN inference algorithms to use statistics algorithms that focus as experimental noise, high dimensionality and scalability. In addi-
on the co-expression networks instead of decoding the casual rela- tion, by explicitly modeling the GRN, DeepSEM is more ‘transpar-
tionships among TFs and their corresponding target genes15,20. In ent’ than the conventional neural network models and can reduce
addition, most algorithms that incorporate gene interactions are the overfitting problem of deep learning models by greatly restrict-
linear models16,17 or tree-based models14,18 and it is generally hard ing the parameter space. By inspecting the architecture of the model
to directly generalize these approaches to more comprehensive that represents the inner workings of a cell, we can observe how
1
Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China. 2Genomic Analysis Laboratory, The Salk Institute for Biological
Studies, La Jolla, CA, USA. 3Bioinformatics and Systems Biology Program, University of California, San Diego, La Jolla, CA, USA. 4UM-SJTU Joint Institute,
Shanghai Jiao Tong University, Shanghai, China. 5Department of Automation, Shanghai Jiao Tong University, Shanghai, China. 6Institute for Artificial
Intelligence, Peking University, Beijing, China. ✉e-mail: zengjy321@mail.tsinghua.edu.cn; majianzhu@pku.edu.cn
multiple genes interact with each other to determine the expression Reconstructed
levels of individual genes. gene expression Simulating new
Cell type 1
Cell type 2
UMAP 2
single cells Simulated cells
UMAP 2
strate the accuracy and efficiency of our algorithm. Moreover, we
Shared weights
also evaluate the quality of the single-cell representation regularized
by the GRN structure. We find that DeepSEM can achieve compa-
Visualization of
rable or better performance compared with current state-of-the-art UMAP projections UMAP 1
methods on the tasks of visualization and cell-type identification on
Encoder
various benchmark datasets.
Another important functional component of DeepSEM is to GRN layer
simulate scRNA-seq data by perturbing the values of its hidden neu- Reconstruction of
rons. In silico data simulations have already achieved tremendous GRN
c1
N(µ, σ 2)
c2
c3
c4
g1 g2 g3 g4 g5 g6
Zµ Zσ 2
g6 g5 g4 g4 g2 g1
g6 g5 g4 g3 g2 g1
Inverse
GRN layer
GRN layer
(I – W T)–1
(I – W T)
g1 g2 g3 g 4 g 5 g 6 g1 g2 g3 g4 g5 g6
Mean Variance
NN NN
Encoder
(shared weights) Decoder
(shared weights)
NN NN NN NN
c1 c1
c2 c2
NN NN NN NN
c3 c3
c4 c4
g1 g2 g3 g4 g5 g6 g1 g2 g3 g4 g5 g6
Gene expression Reconstructed gene expression
Fig. 2 | The neural network architecture of DeepSEM. The VAE of DeepSEM contains four modules: encoder, GRN layer, inverse GRN layer and decoder.
The encoder and decoder are both MLPs taking one gene as input, and the weights of encoder and decoder are shared between different genes. The GRN
layer and inverse GRN layer are both gene interaction matrices, which explicitly model the GRN network and guide the information flow of the neural
networks. g1–g6, the name of genes in gene expression data; c1–c4, the name of the cells in gene expression data; NN, neural network.
prediction performance on 81.82% (36/44) of the benchmarks and performance drops only 3.39% when using cell-type specific ChIP-
at least 10% improvement compared with the second-best approach seq as the ground truth. In the end, we found that the ensemble strat-
(GENIE3) on 54.54% (24/44) of the benchmarks when evaluated egy provided greater performance improvement when the training
using EPR. When considering the AUPRC ratio metric, DeepSEM cell was extremely limited, as shown in Supplementary Fig. 2c. The
achieves the best prediction performance on 86.36% (38/44) of ensemble strategy provides additional performance improvement
the benchmarks and at least 10% improvement compared with the in the range of 3.80% to 12.80% (on average) when the size of the
second-best approach (PIDC) on at least 54.54% (24/44) of bench- data is extremely small. In addition to performance comparison,
marks. In addition, DeepSEM substantially outperforms SCODE, we also studied the scalability of DeepSEM for large-scale datasets
ppcor and SINCERITIES on most benchmarks. In particular, among (Supplementary Section 3 and Supplementary Figs. 3 and 4).
all the benchmarks, DeepSEM achieves an average of improvement
of 12.46%, 14.38% and 20.70% compared with PIDC, GENIE3 and Validation of GRN using epigenetic data. DNA methylation
GRNBoost2, respectively, and performs equally or better than these and chromatin accessibility can affect the binding of transcrip-
approaches on 90.91% (40/44), 93.18% (41/44) and 95.45% (42/44) tion factors to cis-regulatory elements and therefore influence the
on all the benchmarks in both evaluation metrics. expression of downstream target genes29,30. To further explore the
Next, we investigate how the performance of DeepSEM is influ- biological significance of the GRN identified by DeepSEM, we also
enced by the number of cells and whether DeepSEM could work examined the concordance between the gene regulations predicted
with limited training data (Supplementary Fig. 2a,b). We first by DeepSEM with the ones inferred from cell-type specific epigen-
constructed five datasets by subsampling 400, 300, 200, 100 and etic data. Previous studies have reported that integrating TF bind-
50 single cells from the BEELINE benchmark28 and evaluated the ing motif information with epigenetic data can accurately predict
accuracy of the GRN prediction of DeepSEM on these five datasets. TF binding sites in a cell-type-specific manner31,32. Therefore, we
First, we found that the performance was relatively stable between hypothesize that if one TF is regulating a gene in a given cell type,
200 and 400 cells. The performance drops about 20% when only it should be more likely to associate the TF motifs with hypo CG
50 single cells are provided to our model (Supplementary Fig. methylation and open chromatins at the flanking regions of the tar-
2a). Second, we analyzed the performance with respect to differ- get gene in the corresponding cell type.
ent kinds of ground truth. As shown in Supplementary Fig. 2b, the To test this hypothesis, we applied the DeepSEM framework
performance drops significantly on cell-type non-specific ground to an scRNA-seq dataset from the mouse cortex33, and compared
truth (STRING and cell-type non-specific ChIP-seq) with 31.47% the results with the single-nucleus methyl-cytosine sequencing
(STRING) and 28.35% (cell-type non-specific ChIP-seq), respec- (snmC-seq)34 and scATAC-seq data35. To search for epigenetic
tively, when only 50 cells are used as training data. In contrast, the evidence supporting regulations of the marker genes, we used the
S
IE
IE
t2
t2
*
*
IT
IT
EM
EM
M
os
os
ER
ER
E
E
3
3
E
E
Bo
Bo
IE
IE
pS
pS
pS
pS
D
D
Ground truth Dataset
r
C
C
N
N
EN
co
EN
co
ee
ee
ee
ee
D
N
R
R
SC
SC
pp
pp
PI
SI
PI
SI
G
G
D
D
hESC 4.35 4.13 3.75 1.11 4.65 4.49 4.12 1.09
Non-specific mESC 2.96 2.79 3.25 1.66 3.51 3.25 3.51 1.19
ChIP-seq mHSC-E 6.55 6.59 5.41 1.90 6.39 6.08 5.74 1.22
mHSC-GM 7.49 7.22 6.68 1.25 6.44 6.03 5.85 5.85 1.17
mHSC-L 3.23 3.12 2.85 2.85 1.05 3.41 3.30 3.08 1.38 1.38
LOF/GOF mESC 1.43 1.37 1.36 1.27 1.43 1.42 1.35 1.03
Fig. 3 | Summary of the GRN prediction performance in terms of EPR. The performance of GRN inference on seven different datasets with four distinct
ground-truth sources using 500 (left) and 1,000 (right) most-varying genes and all varying TFs evaluated by the median EPR value over ten repeats. For
each dataset, the color is scaled between 0 and 1 using the minimum–maximum scale on EPR value. The black squares denote performance worse than
random predictors. DeepSEM* denotes the DeepSEM without using ensemble learning. EPR is defined as the odds ratio of the true positives among the
top K predicted edges between the model and the random predictions where K denotes the number of edges in ground-truth GRN.
scATAC-seq peaks and differentially methylated regions (DMRs)34 Notably, all these six upstream TFs were consistently expressed in
as potential regulatory elements for each cell-type-specific gene both the target and non-target cell types (Fig. 4d). Together, these
(±200 kb of transcription start sites), and then for each gene cal- studies provide orthogonal evidence to support our predicted GRN
culated the proportion of its predicted regulators whose motifs are and indicate the potential utility of DeepSEM to study the cell-
located in the regulatory elements near the gene. Consistent with type-specific gene interaction networks, especially on the wide
our hypothesis, we found substantial enrichment of different types range of housekeeping TFs that have large functional impact on
of regulatory region containing the motifs of predicted TFs (Fig. 4a the cell dynamics.
and Supplementary Fig. 5), suggesting high accuracy of the GRN
identified by DeepSEM from the epigenetic aspect. We provide the Cell representation. Previous studies have indicated that more
summary statistics for the predicted GRN in Supplementary Section biologically meaningful representations for scRNA-seq could
4 and Supplementary Table 3. be generated by considering the interactions among different
More concretely, we examined the predictions related to Rorb, genes, such as protein–protein interaction networks36, GRNs37,
which is a gene encoding a nuclear hormone receptor highly co-expression networks from bulk RNA-seq data and annotated
expressed in L4 cells. We applied DeepSEM to predict the regula- pathways38. In particular, linking regulatory relationships to
tors of Rorb in L4, and our model identified 12 putative upstream gene expression has been proved to be able to effectively over-
TFs regulating Rorb, including genes Mef2c, Nr2f1 and Pknox2 come dropout and other technical variations in both single-cell
(Supplementary Table 4). We found that the binding motifs of these and bulk sequencing experiments39,40. Since the cell representa-
TFs are located in the first intron of Rorb where the cytosines within tion of DeepSEM is a nonlinear mapping from the expression to
the motifs were specifically hypomethylated in L4, indicating the GRN activities, we hypothesized that the hidden representation
cell-type-specific regulation relationship between these TFs and can also effectively define cell states and cell types by explicitly
Rorb (Fig. 4b). As another example, Syt6 encodes a transmembrane modeling the GRN structure. In the encoder function (Fig. 2 and
protein that involves synaptic vesicle exocytosis, whose expression Methods), the nonlinear function can be viewed as a denoising
is restricted to L6 CT cells. We predicted that Syt6 could be reg- function to amplify or suppress the values of the expression of
ulated by Nf1a, Stat1 and Sp3 in L6 CT (Supplementary Table 4). certain genes; whereas the GRN layer can be viewed as a scoring
Accordingly, by comparing with scATAC-seq data, we observed that function of calculating the activity of each regulon. To evaluate
the regions associated with the binding motifs of these TFs along the quality of these representations, we applied DeepSEM to iden-
the Syt6 gene body were specifically open in L6 CT cells (Fig. 4c). tify different cell types on nine scRNA-seq datasets, including
L4 L4
L5-1 L5 IT
DL-1 L6 IT
L6 CT
DL-2
Motif Nfia Sp3 Stat1
Motif Mef2c Nr2f1 Pknox2
103592500 103593500 103619500 103620500
19042083 19059025 19095896
d
Rorb Mef2c Nr2f1 Pknox2
L6 IT
8
L5 IT
UMAP2
UMAP2
UMAP2
UMAP2
L6 CT 7
L4 6
L2/3 IT
L5 PT 5
UMAP1 UMAP1 UMAP1 UMAP1 4
2
UMAP2
UMAP2
UMAP2
UMAP2
log(1 + CPM)
UMAP1 UMAP1 UMAP1 UMAP1
Fig. 4 | Validating GRN prediction using epigenetic data. a, For each target gene (n = 1,822 genes), the proportion of the DeepSEM predicted TFs (top
5% among all TFs) whose motifs were in the ATAC peak or DMR region at ±200 kb of its transcription start sites compared with background (random
selections). Genes whose motifs were in the ATAC peak or DMR region at ±200 kb of its transcription start sites but without any regulators were removed.
We ran the experiments for five times and reported the average P values calculated by the one-sided Wilcoxon test. Left: The violin plot of non-zero data.
Right: the proportion of non-zero data. b, The genome browser view of the L4 marker gene Rorb. The green bar shows the methylation level of each CpG
site, and bars on different sides of the x axis represent different strands. Black arrows signify the differentially methylated CpG sites (DMS) within/next to
each motif, and the coordinates of these DMSs are shown on the bottom. The cluster labels from Luo et al.34 were used, where L5-1 represents L5 IT, and
DL-1 and DL-2 are subtypes of L6 IT. c, The genome browser view of L6 CT maker gene Syt6. Orange bars show the ATAC-seq signals. Only exemplified cell
types related to the comparison are shown. d, UMAP plots of the six cell types used in our study (n = 6,456 cells) colored by expression levels of Rorb, Syt6
or their predicted regulators in the corresponding cell types.
a widely used mouse brain dataset (the Zeisel dataset41), a mouse in Supplementary Table 5. To benchmark DeepSEM, we also
embryo dataset42 and a mouse peripheral blood mononuclear cell compared its low-dimensional embeddings to four other meth-
(PBMC) dataset43. A complete list of these datasets can be found ods: scVI6, DCA7, ZIFA44 and factor analysis (FA)45, following the
a
1.0 1.0
0.9 0.9
0.8 0.8
0.7 0.7
0.6 0.6
ARI
NMI
0.5 0.5
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
DeepSEM scVI ZIFA FA DCA DeepSEM scVI ZIFA FA DCA
b
Raw DeepSEM scVI
UMAP2
UMAP2
UMAP2
UMAP1 UMAP1 UMAP1
DCA ZIFA FA
UMAP2
UMAP2
UMAP2
Fig. 5 | Single-cell clustering and embedding. a, The clustering performance of DeepSEM on nine single-cell datasets compared with four baseline
methods. The box plots show the median (line), the interquartile range (box) and the whiskers (extending 1.5 times the interquartile range). b, UMAP
visualization of hidden representations on the Zeisel dataset for different methods.
Louvain algorithm46 to cluster all the single cells into the same Next, we evaluated to what extent the latent space generated by
number of clusters. We provide a brief introduction and running DeepSEM could reflect the biological variability among different
details of these baseline methods in Supplementary Section 5. To cells on the Zeisel dataset41. Visualizing using the uniform manifold
quantify the clustering accuracy based on the reference labels, approximation and projection (UMAP)47, we found that DeepSEM
we used the adjusted Rand index (ARI) and normalized mutual was able to provide a more biologically meaningful representa-
information (NMI), both of which range from 0 (random clusters) tion. As shown in Fig. 5b, DeepSEM organizes oligodendrocytes
to 1 (identical clusters). In general, DeepSEM performs better and pyramidal CA1 cells into two clear clusters, while methods
than all the four baseline methods on five of the nine benchmark including scVI, ZIFA and FA failed to do so (highlight in the black
datasets including the Zeisel dataset41, which are used in scVI6, circles). Quantitative comparison further highlights the advantage
and also achieved comparable performance on the other datasets of DeepSEM. Membership weights (Supplementary Table 6) show
(within a range of 5% on average) (Fig. 5a and Supplementary Fig. that DeepSEM has an average closer distance between cell and cell-
6). We also showed that DeepSEM can maintain the hierarchical type center. We further compare DeepSEM with the baseline meth-
structure among different cell types by evaluating the clustering ods on R, earth-mover’s distance (EMD) and K-nearest neighbor
performance on the Zeisel dataset41 with the ground-truth hier- (KNN) preservation metrics48, which all measure the distribution
archical structure provided by the origin study41 (Supplementary difference between embedding data and origin data48. The results
Fig. 6a). In particular, DeepSEM outperforms another probabilis- show that DeepSEM can maintain better KNN consistency com-
tic scRNA-seq model scVI6, which is also based on the VAE, sug- pared with scVI (Supplementary Table 7). Besides preserving good
gesting the necessity of explicitly modeling the GRN structure. clustering, it had been expected that single cells typically follow a
UMAP2
UMAP2
UMAP2
UMAP1 UMAP1 UMAP1 UMAP1
UMAP2
UMAP2
UMAP2
UMAP1 UMAP1 UMAP1 UMAP1
UMAP2
UMAP2
UMAP2
UMAP1 UMAP1 UMAP1 UMAP1
b
Louvain cluster 1 Louvain cluster 2 CD14+ monocyte CD4+/CD45RO+ memory
1.0 1.0 1.0 1.0
True positive rate
GRN consistency
0.8
GRN consistency
GRN consistency
Fig. 6 | Simulation performance of DeepSEM compared with cscGAN and scGAN. a, The embedding plot layout was determined by UMAP visualization.
b, Receiver operator characteristic (ROC) curves and area under the ROC Curve (AUC) scores (means ± s.d., lower is better) using RF to classify real and
simulation cells. c, GRN consistency (means ± 95% CI, higher is better) between the simulated and real data.
temporal progression process, and are more apt to be represented cell types, which was consistent with the finding in Zeisel et al.41.
with a continuous differentiation trajectory. We then embedded We also found that the embedding produced by DeepSEM can rep-
only the oligodendrocyte cells in a separate low-dimensional space resent the same differentiation trajectory as defined in Zeisel et al.41,
and checked the consistency between the embeddings of cell sub- whereas methods including scVI and DCA failed to capture this
types and the differentiation trajectory defined in Zeisel et al.41 information (Supplementary Fig. 7).
(Supplementary Fig. 7). DeepSEM, ZIFA and FA were able to pro-
duce tight cell clusters for subtypes Oligo 3. This finding indicated scRNA-seq simulation. In this study, we compared the simulation
that Oligo 3 presents a distinct cellular state compared with other performance of DeepSEM with two other GAN-based approaches,
cscGAN and scGAN27, on the PBMC dataset43. A brief introduction of The structure of the GRN is explicitly modeled as special layers of
cscGAN and scGAN including their running details could be found in the neural network, which act as biological constraints to restrict the
Supplementary Section 6. The original study27 classified cells through parameter space. One of the limitations of our study is that the run-
the Louvain algorithm46, but we found that the Louvain clusters were ning time increases with the number of genes involved due to the
inconsistent with cell types identified by Zheng et al.43. For instance, ‘inverse’ operation in the inverse GRN layer. Empirically, DeepSEM
47.0% of the CD56+ natural killer (NK) cells were misclustered to a is relatively slower than other VAE models such as scVI6. To address
CD8+ cytotoxic T cluster (Supplementary Fig. 8). Therefore, we used the potential limitation during the training process, we recommend
two different annotation strategies as introduced in Methods. We users to select highly variable genes instead of using the whole tran-
found that all three methods could generate almost indistinguishable scriptome as input features.
low-dimensional embeddings as in the original data, when clustered Models like DeepSEM may have other potential applications in
using both computational approaches and cell types labeled by the single-cell biology. For instance, since all cells share the identical
experts (Fig. 6a and Supplementary Figs. 9 and 10). We trained a ran- genome, GRN can be shared among different modalities, such as
dom forest (RF) classifier to distinguish the simulated data from the transcriptomic and epigenomic data. Therefore, DeepSEM can be
real data for each Louvain cluster and each cell type that contains more adopted to integrate different single-cell modalities by leveraging
than 2,500 cells. Our hypothesis is that the prediction performance of GRN as a ‘bridge’ to construct a common latent space. A second
the RF model should be close to random if the simulated data genera- potential application is to use the DeepSEM framework to integrate
tion was ‘realistic’. We observed that the classification performance of other molecular interaction networks, such as a protein–protein
RF dropped with an average of 1.21% and 19.77% to distinguish simu- interaction network, open chromatin data, DNA binding motifs and
lated data generated by DeepSEM from the real test data compared a genetic interaction network to further infer a GRN and achieve
with cscGAN and scGAN, respectively (Fig. 6b and Supplementary higher accuracy.
Figs. 11 and 12). Note that the objective functions of cscGAN and
scGAN are to train a generator function that can fool another dis- Methods
criminator function, so it was expected that they perform well on this The DeepSEM framework. Structural equation modeling is a multivariate
task. On the other side, DeepSEM achieved realistic simulation, sug- statistical model to analyze structural relationships among different random
variables. The basic SEM was first developed to model the covariance matrix for
gesting that integrating GRN may serve as a crucial step for modeling random variables50. Later, the SEM was found to be very powerful in modeling the
scRNA-seq data. relationship between observed features and hidden latent variables and was widely
We also proposed another concept, called GRN consistency, used in econometrics and sociology for causal inference51,52. More importantly,
to measure the quality of the simulated single cells. GRN con- the SEM can be adopted to detect the conditional dependency among random
variables and therefore also used to predict the graph structure of Bayesian
sistency measures the difference of the predicted GRN between
networks and Markov random fields53–56. DeepSEM generalizes the SEM, which
the real and simulated scRNA-seq data, which quantifies how models the conditional dependencies among random variables and is formulated
much the single-cell model captures both the marginal and con- as a self-regression problem
ditional independence in the original distribution. That is, if a
X = WT X + Z, (1)
scRNA-seq simulation is realistic, the GRN predictions obtained
using the simulated dataset should match the predictions from
the real dataset. Different from recent work49 that used marginal we can modify equation (1) to the following form
independence to measure the consistency, our GRN consistency X = I − WT
( ) −1
Z
considers the conditional independence between TF and target (2)
genes, which is harder to maintain for simulators. We found that Z = I − WT X,
( )
cells reveals dynamic genetic effects on gene expression. Nat. Commun. 11,
where xij denotes the UMAP embedding of cell j in cell type i. 810 (2020).
9. Olsson, A. et al. Single-cell analysis of mixed-lineage states leading to a
The GRN consistency between the simulated and real data. In this study, we binary cell fate choice. Nature 537, 698–702 (2016).
investigated whether the generated scRNA-seq data have the same GRN with 10. Sharma, A. et al. Onco-fetal reprogramming of endothelial cells drives
scRNA-seq from real cells. For each cluster or cell type that contains more than immunosuppressive macrophages in hepatocellular carcinoma. Cell 183,
2,500 cells, we used DeepSEM, scGAN and cscGAN to generate n gene expression 377–394.e21 (2020).
profiles, where n was equal to the number of cells. We used GRNBoost218 to 11. Arisdakessian, C., Poirion, O., Yunits, B., Zhu, X. & Garmire, L. X.
infer the GRNs for both real and simulated single-cell data. Similar to the GRN DeepImpute: an accurate, fast, and scalable deep neural network method to
inference benchmark BEELINE28, only GRNs outgoing from TFs were considered. impute single-cell RNA-seq data. Genome Biol. 20, 211 (2019).
For each cluster, we selected the top K = {0.1%, 0.2%, 0.5%, 1.0%, 1.5%, 2.0%, 2.5%} 12. Wang, T. et al. BERMUDA: a novel deep transfer learning method for
GRNs and used them to evaluate the consistency between the real and simulated single-cell RNA sequencing batch correction reveals hidden high-resolution
cells. The GRN consistency is calculated as follows: cellular subtypes. Genome Biol. 20, 165 (2019).
GRN consistency = 13. Li, X. et al. Deep learning enables accurate clustering with batch effect
(9) removal in single-cell RNA-seq analysis. Nat. Commun. 11, 2338 (2020).
Number of overlap edges in top N predicted edges between real and simulated cells 14. Huynh-Thu, V. A., Irrthum, A., Wehenkel, L. & Geurts, P. Inferring
N
regulatory networks from expression data using tree-based methods. PLoS
where N = K% × number of predicted GRN in real cells. We reported the average ONE 5, e12776 (2010).
GRN consistency within a ±95% confidence interval (CI). 15. Chan, T. E., Stumpf, M. P. H. & Babtie, A. C. Gene regulatory network
inference from single-cell data using multivariate information measures. Cell
Effects of data downsampling and augmentation on cell-type classification. In Syst. 5, 251–267.e3 (2017).
this study, we evaluated the data augmentation performance by discriminating cells 16. Matsumoto, H. et al. SCODE: an efficient regulatory network inference
annotated by the selected cluster or cell type from the other cells with limited training algorithm from single-cell RNA-seq during differentiation. Bioinformatics 33,
data. More specifically, we selected cells annotated as cluster 2 obtained from the 2314–2321 (2017).
Louvain algorithm46, also as in Marouf et al.27, and CD56+ NK cells annotated by 17. Papili Gao, N., Ud-Dean, S. M. M., Gandrillon, O. & Gunawan, R.
Zheng et al.43. First, we randomly sampled 80% of the data as the training dataset and SINCERITIES: inferring gene regulatory networks from time-stamped single
used the remaining 20% of the data as the test dataset. Then, the cells were randomly cell transcriptional expression profiles. Bioinformatics 34, 258–266 (2018).
downsampled with eight different percentages {50%, 25%, 10%, 5%, 3%, 2%, 1%, 0.5%} 18. Moerman, T. et al. GRNBoost2 and Arboreto: efficient and scalable inference
on the training dataset. For each downsampling rate, we simulated 2,000 cells from of gene regulatory networks. Bioinformatics 35, 2159–2161 (2019).
the selected Louvain cluster or cell type by DeepSEM, cscGAN and scGAN. We also 19. Kamimoto, K., Hoffmann, C. M. & Morris, S. A. CellOracle: dissecting cell
randomly sampled 2,000 cells with replacement from the selected Louvain cluster or identity via network inference and in silico gene perturbation. Preprint at
cell type and annotated the above procedure as upsampling. An RF model with default bioRxiv https://doi.org/10.1101/2020.02.17.947416 (2020).
hyperparameters was trained and the top 50 PCs of each cell were selected as features 20. Kim, S. ppcor: an R package for a fast calculation to semi-partial correlation
to discriminate cells from the others. We trained the RF model five times and reported coefficients. Commun. Stat. Appl. Methods 22, 665–674 (2015).
the mean AUPR ± 95% CI. 21. Yu, Y., Jie, C., Tian, G. & Mo, Y. DAG-GNN: DAG structure learning with
graph neural networks. In Proceedings of the 36th International Conference on
Data availability Machine Learning 7154–7163 (ICML, 2019).
We provide all datasets generated or analyzed during this study. The gene 22. Lin, C., Jain, S., Kim, H. & Bar-Joseph, Z. Using neural networks for reducing
experimental scRNA-seq datasets were downloaded from Gene Expression the dimensions of single-cell RNA-seq data. Nucleic Acids Res. 45, e156 (2017).
Omnibus with the accession numbers GSE81252 (hHEP dataset65), GSE75748 (hESC 23. Higgins, I. et al. beta-VAE: learning basic visual concepts with a constrained
dataset66), GSE98664 (mESC dataset62), GSE48968 (mDC dataset63), GSE81682 variational framework. In Proceedings of the 5th International Conference on
(mHSC dataset64), GSE115746 (mouse cortex dataset33), GSE60361 (Zeisel dataset41), Learning Representations (ICML, 2017).
GSE85241 (Muraro dataset78), GSE81861 (Li dataset79), and GSE45719 (Deng 24. Zhao, A., Balakrishnan, G., Durand, F., Guttag, J. V. & Dalca, A. V. Data
dataset80). The other experimental scRNA-seq dataset were downloaded from augmentation using learned transformations for one-shot medical image
ArrayExpress with the accession number E-MTAB-5061 (Segerstolpe dataset81), NCBI segmentation. In Proceedings of the IEEE Conference on Computer Vision and
Sequence Read Archive (SRA) with accession number SRP041736 (Pollen dataset42), Pattern Recognition 8543–8553 (IEEE, 2019).
GitHub repositories (https://github.com/LuyiTian/sc_mixology) (CellBench dataset82) 25. Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell
and the website for x10genomics (https://support.10xgenomics.com/single-cell-gene- perturbation responses. Nat. Methods 16, 715–721 (2019).
expression/datasets/) (PBMC dataset43). The scATAC-seq and snmC-seq for mouse 26. Wang, X., Ghasedi Dizaji, K. & Huang, H. Conditional generative adversarial
cortex were downloaded from Gene Expression Omnibus with the accession numbers network for gene expression inference. Bioinformatics 34, i603–i611 (2018).
GSE126724 (scATAC-seq35) and GSE97179 (snmC-seq34). More information for these 27. Marouf, M. et al. Realistic in silico generation and augmentation of single-cell
datasets could be found in Methods. We also summarize the accession and download RNA-seq data using generative adversarial networks. Nat. Commun. 11, 166
links in Supplementary Tables 1, 2, 5 and 9. Source Data for Figs. 3–6 are available (2020).
with this manuscript. 28. Pratapa, A., Jalihal, A. P., Law, J. N., Bharadwaj, A. & Murali, T. M.
Benchmarking algorithms for gene regulatory network inference from
single-cell transcriptomic data. Nat. Methods 17, 147–154 (2020).
Code availability 29. Moore, L. D., Le, T. & Fan, G. DNA methylation and its basic function.
The codes generated during this study are available on GitHub (https://github.com/
Neuropsychopharmacology (2013).
HantaoShu/DeepSEM) and in Zenodo83.
30. Thurman, R. E. et al. The accessible chromatin landscape of the human
genome. Nature 489, 75–82 (2012).
Received: 25 March 2021; Accepted: 15 June 2021; 31. Keilwagen, J., Posch, S. & Grau, J. Accurate prediction of cell type-specific
Published online: 22 July 2021 transcription factor binding. Genome Biol. 20, 9 (2019).