
Article in Nature Computational Science · July 2021
DOI: 10.1038/s43588-021-00099-8


Articles
https://doi.org/10.1038/s43588-021-00099-8

Modeling gene regulatory networks using neural network architectures

Hantao Shu1, Jingtian Zhou2,3, Qiuyu Lian4,5, Han Li1, Dan Zhao1, Jianyang Zeng1 ✉ and Jianzhu Ma6 ✉

Gene regulatory networks (GRNs) encode the complex molecular interactions that govern cell identity. Here we propose DeepSEM, a deep generative model that can jointly infer GRNs and biologically meaningful representations of single-cell RNA sequencing (scRNA-seq) data. In particular, we developed a neural network version of the structural equation model (SEM) to explicitly model the regulatory relationships among genes. Benchmark results show that DeepSEM achieves comparable or better performance on a variety of single-cell computational tasks, such as GRN inference, scRNA-seq data visualization, clustering and simulation, compared with the state-of-the-art methods. In addition, the gene regulations predicted by DeepSEM on cell-type marker genes in the mouse cortex can be validated by epigenetic data, which further demonstrates the accuracy and efficiency of our method. DeepSEM can provide a useful and powerful tool to analyze scRNA-seq data and infer GRNs.

The rapid development of single-cell sequencing technologies1–3 provides unprecedented opportunities for biologists to investigate cellular states. However, it also poses new challenges in the form of experimental noise not found in bulk sequencing data4,5, which might significantly decrease the accuracy of downstream bioinformatics analysis by introducing biases in the gene expression. To address these problems, there has recently been great interest in applying deep learning models to filter the noise in single-cell transcriptome data by modeling the complicated interaction patterns among the genes6,7. Inherently, the transcriptome of a cell is governed by its gene regulation process in a cell-specific manner. Hence, we expect that deep learning-based methods are able to model such gene interactions to reveal a clearer landscape of cell heterogeneity, capturing both transcriptomic similarities between cells of the same cell type and differences across different cell types8–10. However, so far, deep learning-based single-cell analysis frameworks6,7,11–13 are usually black boxes, and it is hard to evaluate to what extent gene regulatory network (GRN) structure or any other internal structure of the data is learned.

A number of computational models14–19 have attempted to incorporate GRN inference into their single-cell data analysis models. One class of these methods relies on side measurements such as single-cell chromatin accessibility or transcription factor (TF) binding motifs19. However, these measurements often require more complicated experimental designs and could also introduce additional noise as these data could come from different experiments. Current methods solely based on single-cell RNA sequencing (scRNA-seq) data also have explicit limitations. For example, it is common for GRN inference algorithms to use statistical approaches that focus on co-expression networks instead of decoding the causal relationships among TFs and their corresponding target genes15,20. In addition, most algorithms that incorporate gene interactions are linear models16,17 or tree-based models14,18, and it is generally hard to directly generalize these approaches to more comprehensive nonlinear frameworks and benefit from the computational power that deep learning models have brought to us.

To address the above problems, we present DeepSEM, a deep generative model that can jointly embed the gene expression data and simultaneously construct a GRN that reflects the inner structure of gene interactions in single cells without relying on any additional information such as TF binding motifs or single-cell ATAC sequencing (scATAC-seq) data. To implement such an idea, we took inspiration from the work of Yu et al.21, which generalized a popular approach, called the structural equation model (SEM), that infers causality using a linear model. By adding proper mathematical constraints, part of the neural network architecture could be used to predict the GRN of the scRNA-seq data. A previous study by Lin et al.22 showed that more accurate cell representations could be achieved by guiding the neural network architecture with a GRN structure derived from the literature and databases. In this Article, we show that the neural network architecture can reflect GRN structure by properly designing the neural network layer without relying on any prior knowledge. The neural network architecture can be inferred jointly with the training of the weights of the neural network in an end-to-end manner. The overall framework of DeepSEM is a beta-variational autoencoder (beta-VAE)23 in which the weights of both the encoder and decoder functions represent the adjacency matrix of the GRN. Our model does not require any extra experimental data such as open chromatin information, ChIP sequencing (ChIP-seq) data or TF binding motifs to infer the GRN structure. The nonlinear neural networks in DeepSEM are employed to address the challenges in single-cell data analysis, such as experimental noise, high dimensionality and scalability. In addition, by explicitly modeling the GRN, DeepSEM is more ‘transparent’ than conventional neural network models and can reduce the overfitting problem of deep learning models by greatly restricting the parameter space. By inspecting the architecture of the model that represents the inner workings of a cell, we can observe how

1Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China. 2Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA. 3Bioinformatics and Systems Biology Program, University of California, San Diego, La Jolla, CA, USA. 4UM-SJTU Joint Institute, Shanghai Jiao Tong University, Shanghai, China. 5Department of Automation, Shanghai Jiao Tong University, Shanghai, China. 6Institute for Artificial Intelligence, Peking University, Beijing, China. ✉e-mail: zengjy321@mail.tsinghua.edu.cn; majianzhu@pku.edu.cn

Nature Computational Science | VOL 1 | July 2021 | 491–501 | www.nature.com/natcomputsci 491



multiple genes interact with each other to determine the expression levels of individual genes.

We evaluate the performance of DeepSEM for various single-cell tasks such as GRN inference, scRNA-seq data visualization, cell-type identification and cell simulations on several benchmark datasets. We first show that DeepSEM is able to achieve better performance on the GRN inference task compared with the state-of-the-art algorithms on several popular benchmark datasets. We also apply DeepSEM to another single-cell dataset without the ground-truth GRN measured, and provide extensive evidence extracted from single-cell DNA methylation and open chromatin data to demonstrate the accuracy and efficiency of our algorithm. Moreover, we also evaluate the quality of the single-cell representation regularized by the GRN structure. We find that DeepSEM can achieve comparable or better performance compared with current state-of-the-art methods on the tasks of visualization and cell-type identification on various benchmark datasets.

Another important functional component of DeepSEM is to simulate scRNA-seq data by perturbing the values of its hidden neurons. In silico data simulations have already achieved tremendous success in computer vision for data augmentation, especially when the number of training samples is limited24. In single-cell biology, the same types of simulation algorithm have also been applied to scRNA-seq data to predict the single-cell perturbation response out of sample19,25, identify marker genes26 and augment the sparse cell populations to improve the accuracy of cell-type classification27. The state-of-the-art simulation algorithms27 aim to generate ‘realistic’ scRNA-seq data based on generative adversarial networks (GANs) to make the low-dimensional projection of the simulated data indistinguishable from the data distributions of the real cells. In this Article, we show that DeepSEM is able to achieve more realistic simulations compared with other GAN-based models through guiding the information flow using the GRN layer and mirroring the in vivo generation process of RNA governed by multiple TFs. In addition, we propose another concept, called the GRN consistency, to measure the quality of the simulated single-cell data. In particular, the GRN consistency measures the difference of the predicted GRN between the simulated and real scRNA-seq data, which assesses how much the computational model captures both the marginal and conditional independence from the original distribution, and, more importantly, quantifies how likely it is that one method can generate realistic scRNA-seq data satisfying the biological constraints. Tests on several benchmark datasets show that DeepSEM is able to achieve realistic scRNA-seq profiles and higher GRN consistency compared with the state-of-the-art single-cell simulators.

Fig. 1 | Overview of DeepSEM. Left: DeepSEM is a generative model including two main modules: an encoder (bottom left) and a decoder (top left). Right: DeepSEM performs three major functions by leveraging different modules: (1) GRN prediction (bottom right), (2) scRNA-seq data embedding and visualization (middle right), and (3) scRNA-seq simulation (top right).

Results

Overview of the DeepSEM framework. Given an scRNA-seq dataset as input, DeepSEM jointly models the GRN and the transcriptome by generating the SEM with a beta-VAE framework (Fig. 1). We designed two neural network layers, named the GRN layer and the inverse GRN layer, to explicitly model the GRN structure (Methods). Different from conventional deep learning models that embed the expressions of all the genes together into a latent space6,7, the encoder function of DeepSEM takes the expression of only one gene as the input feature of the neural network. The neural networks for different genes share their weights, or it could be viewed as using one neural network to scan all the genes. At this step, there are no interactions among different genes in the model. Later, another two fully connected neural networks transform the output of these small neural networks to the posterior mean and standard deviation of a multivariate Gaussian distribution. Decoupling the nonlinear operation and the gene interaction is the key for DeepSEM to achieve more robust and interpretable hidden representations at the same time. Next, a decoder function equipped with the inverse GRN layer transforms the hidden representations back to the gene expression values, which makes the entire framework an autoencoder (Fig. 2 and Methods). By jointly modeling the GRN and single-cell transcriptome data, DeepSEM acts as a multipurpose tool that could serve for various tasks in single-cell data analysis by analyzing different modules.

Performance of GRN inference. To evaluate the performance of DeepSEM on GRN inference, we followed the BEELINE framework28, which collected four different kinds of ground-truth network (Supplementary Table 1) and seven scRNA-seq datasets including five cell lines from mouse and two cell lines from human (Supplementary Table 2). For each dataset, as recommended by Pratapa et al.28, we considered only highly variable TFs and the top N (N = 500 and 1,000) most-varying genes (details in Supplementary Section 1). The performance was evaluated by the early precision ratio (EPR) (Fig. 3) and the area under the precision–recall curve ratio (AUPRC ratio) (Supplementary Fig. 1) as used in the BEELINE framework28, which are defined as the odds ratio of the true positives among the top K predicted edges and the AUPRC between the model and the random predictions. Here K denotes the number of edges in the ground-truth GRN. We compared DeepSEM with six baseline algorithms, including GENIE314, PIDC15, GRNBoost218, SCODE16, ppcor20 and SINCERITIES17, which had been proved to achieve state-of-the-art performance on the benchmark datasets based on the evaluation of BEELINE28. To achieve stable predictions from deep learning models, we use the ensemble strategy to generate the final predictions (Methods). We provide a brief introduction to these baseline methods and their functionalities; their running details can be found in Supplementary Section 2.

Overall, DeepSEM outperforms all the other baseline methods on scRNA-seq datasets in terms of both EPR and AUPRC ratio metrics (Fig. 3 and Supplementary Fig. 1). DeepSEM achieves the best



[Figure 2 schematic: gene expression (cells c1–c4 × genes g1–g6) passes through shared-weight encoder NNs and the GRN layer (I − Wᵀ), producing the mean Zµ and variance Zσ² of the latent representation sampled from N(µ, σ²); the inverse GRN layer (I − Wᵀ)⁻¹ and shared-weight decoder NNs then reconstruct the gene expression.]
Fig. 2 | The neural network architecture of DeepSEM. The VAE of DeepSEM contains four modules: encoder, GRN layer, inverse GRN layer and decoder. The encoder and decoder are both MLPs taking one gene as input, and the weights of the encoder and decoder are shared between different genes. The GRN layer and inverse GRN layer are both gene interaction matrices, which explicitly model the GRN and guide the information flow of the neural networks. g1–g6, the names of genes in the gene expression data; c1–c4, the names of the cells in the gene expression data; NN, neural network.
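The linear core of the two GRN modules can be read as a pair of mutually inverse transforms: the GRN layer multiplies the (nonlinearly transformed) expression vector by (I − Wᵀ), and the inverse GRN layer multiplies by (I − Wᵀ)⁻¹, where W is the weighted adjacency matrix of the GRN. A minimal NumPy sketch (our own illustration, not the authors' implementation; names are ours) shows why the two layers undo each other:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 6

# W[i, j]: hypothetical regulatory effect of gene i on gene j.
# Small weights keep (I - W^T) well conditioned and invertible.
W = 0.1 * rng.standard_normal((n_genes, n_genes))
np.fill_diagonal(W, 0.0)  # no self-regulation

I = np.eye(n_genes)
grn_layer = I - W.T                      # encoder-side GRN layer
inv_grn_layer = np.linalg.inv(I - W.T)   # decoder-side inverse GRN layer

x = rng.standard_normal(n_genes)         # one cell's transformed expression
z = grn_layer @ x                        # latent representation
x_rec = inv_grn_layer @ z                # reconstruction

print(np.allclose(x, x_rec))             # the two layers are exact inverses
```

In the full model the sampling step and the shared-weight MLPs sit between these two multiplications, so the reconstruction is only approximate; the sketch isolates the SEM algebra.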

prediction performance on 81.82% (36/44) of the benchmarks and at least 10% improvement compared with the second-best approach (GENIE3) on 54.54% (24/44) of the benchmarks when evaluated using EPR. When considering the AUPRC ratio metric, DeepSEM achieves the best prediction performance on 86.36% (38/44) of the benchmarks and at least 10% improvement compared with the second-best approach (PIDC) on at least 54.54% (24/44) of the benchmarks. In addition, DeepSEM substantially outperforms SCODE, ppcor and SINCERITIES on most benchmarks. In particular, among all the benchmarks, DeepSEM achieves an average improvement of 12.46%, 14.38% and 20.70% compared with PIDC, GENIE3 and GRNBoost2, respectively, and performs equally or better than these approaches on 90.91% (40/44), 93.18% (41/44) and 95.45% (42/44) of all the benchmarks in both evaluation metrics.

Next, we investigated how the performance of DeepSEM is influenced by the number of cells and whether DeepSEM could work with limited training data (Supplementary Fig. 2a,b). We first constructed five datasets by subsampling 400, 300, 200, 100 and 50 single cells from the BEELINE benchmark28 and evaluated the accuracy of the GRN prediction of DeepSEM on these five datasets. First, we found that the performance was relatively stable between 200 and 400 cells. The performance drops about 20% when only 50 single cells are provided to our model (Supplementary Fig. 2a). Second, we analyzed the performance with respect to different kinds of ground truth. As shown in Supplementary Fig. 2b, the performance drops significantly on cell-type non-specific ground truth, by 31.47% (STRING) and 28.35% (cell-type non-specific ChIP-seq), respectively, when only 50 cells are used as training data. In contrast, the performance drops only 3.39% when using cell-type-specific ChIP-seq as the ground truth. In the end, we found that the ensemble strategy provided greater performance improvement when the training data were extremely limited, as shown in Supplementary Fig. 2c. The ensemble strategy provides additional performance improvement in the range of 3.80% to 12.80% (on average) when the size of the data is extremely small. In addition to performance comparison, we also studied the scalability of DeepSEM for large-scale datasets (Supplementary Section 3 and Supplementary Figs. 3 and 4).

Validation of GRN using epigenetic data. DNA methylation and chromatin accessibility can affect the binding of transcription factors to cis-regulatory elements and therefore influence the expression of downstream target genes29,30. To further explore the biological significance of the GRN identified by DeepSEM, we also examined the concordance between the gene regulations predicted by DeepSEM and the ones inferred from cell-type-specific epigenetic data. Previous studies have reported that integrating TF binding motif information with epigenetic data can accurately predict TF binding sites in a cell-type-specific manner31,32. Therefore, we hypothesize that if one TF is regulating a gene in a given cell type, it should be more likely to associate the TF motifs with hypo-CG methylation and open chromatin at the flanking regions of the target gene in the corresponding cell type.

To test this hypothesis, we applied the DeepSEM framework to an scRNA-seq dataset from the mouse cortex33, and compared the results with single-nucleus methyl-cytosine sequencing (snmC-seq)34 and scATAC-seq data35. To search for epigenetic evidence supporting regulations of the marker genes, we used the



[Figure 3: heatmap of EPR values for DeepSEM, DeepSEM*, GENIE3, GRNBoost2, PIDC, ppcor, SCODE and SINCERITIES across seven datasets (hESC, hHep, mDC, mESC, mHSC-E, mHSC-GM and mHSC-L) and four ground-truth sources (STRING, non-specific ChIP-seq, cell-type-specific ChIP-seq and LOF/GOF), for the TFs + 500 genes (left) and TFs + 1,000 genes (right) settings. The full grid of values is not recoverable from this extraction.]

Fig. 3 | Summary of the GRN prediction performance in terms of EPR. The performance of GRN inference on seven different datasets with four distinct ground-truth sources, using the 500 (left) and 1,000 (right) most-varying genes and all varying TFs, evaluated by the median EPR value over ten repeats. For each dataset, the color is scaled between 0 and 1 using min–max scaling on the EPR values. The black squares denote performance worse than random predictors. DeepSEM* denotes DeepSEM without ensemble learning. EPR is defined as the odds ratio of the true positives among the top K predicted edges between the model and the random predictions, where K denotes the number of edges in the ground-truth GRN.
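The EPR defined in the caption can be computed directly from a ranked edge list. The sketch below is our own illustrative code (not from the DeepSEM or BEELINE repositories): it sets K to the number of ground-truth edges and divides the early precision by the precision expected from a random predictor.

```python
def early_precision_ratio(ranked_edges, true_edges, n_possible_edges):
    """ranked_edges: (TF, target) pairs sorted by predicted score, best first.
    true_edges: set of ground-truth (TF, target) pairs.
    n_possible_edges: number of candidate edges the predictor scores."""
    k = len(true_edges)                      # K = number of ground-truth edges
    top_k = ranked_edges[:k]
    early_precision = sum(e in true_edges for e in top_k) / k
    random_precision = k / n_possible_edges  # expected precision of a random ranking
    return early_precision / random_precision

# Toy example: 2 of the top-3 predictions are true, out of 12 candidate edges.
truth = {("g1", "g2"), ("g1", "g3"), ("g2", "g4")}
preds = [("g1", "g2"), ("g2", "g4"), ("g3", "g1"), ("g1", "g3")]
print(early_precision_ratio(preds, truth, 12))  # (2/3) / (3/12) ≈ 2.67
```

An EPR of 1 means the method is no better than random at the top of its ranking; the values in Fig. 3 are mostly well above 1.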

scATAC-seq peaks and differentially methylated regions (DMRs)34 as potential regulatory elements for each cell-type-specific gene (±200 kb of transcription start sites), and then for each gene calculated the proportion of its predicted regulators whose motifs are located in the regulatory elements near the gene. Consistent with our hypothesis, we found substantial enrichment of different types of regulatory region containing the motifs of predicted TFs (Fig. 4a and Supplementary Fig. 5), suggesting high accuracy of the GRN identified by DeepSEM from the epigenetic aspect. We provide the summary statistics for the predicted GRN in Supplementary Section 4 and Supplementary Table 3.

More concretely, we examined the predictions related to Rorb, a gene encoding a nuclear hormone receptor highly expressed in L4 cells. We applied DeepSEM to predict the regulators of Rorb in L4, and our model identified 12 putative upstream TFs regulating Rorb, including Mef2c, Nr2f1 and Pknox2 (Supplementary Table 4). We found that the binding motifs of these TFs are located in the first intron of Rorb, where the cytosines within the motifs were specifically hypomethylated in L4, indicating a cell-type-specific regulatory relationship between these TFs and Rorb (Fig. 4b). As another example, Syt6 encodes a transmembrane protein involved in synaptic vesicle exocytosis, whose expression is restricted to L6 CT cells. We predicted that Syt6 could be regulated by Nfia, Stat1 and Sp3 in L6 CT (Supplementary Table 4). Accordingly, by comparing with scATAC-seq data, we observed that the regions associated with the binding motifs of these TFs along the Syt6 gene body were specifically open in L6 CT cells (Fig. 4c). Notably, all these six upstream TFs were consistently expressed in both the target and non-target cell types (Fig. 4d). Together, these studies provide orthogonal evidence to support our predicted GRN and indicate the potential utility of DeepSEM to study cell-type-specific gene interaction networks, especially for the wide range of housekeeping TFs that have a large functional impact on cell dynamics.

Cell representation. Previous studies have indicated that more biologically meaningful representations for scRNA-seq could be generated by considering the interactions among different genes, such as protein–protein interaction networks36, GRNs37, co-expression networks from bulk RNA-seq data and annotated pathways38. In particular, linking regulatory relationships to gene expression has been proved to effectively overcome dropout and other technical variations in both single-cell and bulk sequencing experiments39,40. Since the cell representation of DeepSEM is a nonlinear mapping from the expression to GRN activities, we hypothesized that the hidden representation can also effectively define cell states and cell types by explicitly modeling the GRN structure. In the encoder function (Fig. 2 and Methods), the nonlinear function can be viewed as a denoising function to amplify or suppress the values of the expression of certain genes, whereas the GRN layer can be viewed as a scoring function calculating the activity of each regulon. To evaluate the quality of these representations, we applied DeepSEM to identify different cell types on nine scRNA-seq datasets, including



[Figure 4 panels: a, comparison of the proportion of predicted TFs validated by methylation or ATAC data for DeepSEM versus background (rank-sum P = 8.81 × 10⁻³¹; non-zero data: DeepSEM 71.2%, background 64.8%). b, genome browser view of Rorb (chr19) with DMRs and the motifs of Mef2c, Nr2f1 and Pknox2. c, genome browser view of Syt6 (chr3) with ATAC peaks and the motifs of Nfia, Sp3 and Stat1. d, UMAP plots colored by log(1 + CPM) expression of Rorb, Syt6 and their predicted regulators.]

Fig. 4 | Validating GRN prediction using epigenetic data. a, For each target gene (n = 1,822 genes), the proportion of the DeepSEM-predicted TFs (top 5% among all TFs) whose motifs were in the ATAC peak or DMR region at ±200 kb of its transcription start site, compared with background (random selections). Genes whose motifs were in the ATAC peak or DMR region at ±200 kb of the transcription start site but without any regulators were removed. We ran the experiments five times and report the average P values calculated by the one-sided Wilcoxon test. Left: violin plot of the non-zero data. Right: the proportion of non-zero data. b, Genome browser view of the L4 marker gene Rorb. The green bar shows the methylation level of each CpG site, and bars on different sides of the x axis represent different strands. Black arrows signify the differentially methylated CpG sites (DMSs) within/next to each motif, and the coordinates of these DMSs are shown at the bottom. The cluster labels from Luo et al.34 were used, where L5-1 represents L5 IT, and DL-1 and DL-2 are subtypes of L6 IT. c, Genome browser view of the L6 CT marker gene Syt6. Orange bars show the ATAC-seq signals. Only exemplified cell types related to the comparison are shown. d, UMAP plots of the six cell types used in our study (n = 6,456 cells) colored by expression levels of Rorb, Syt6 or their predicted regulators in the corresponding cell types.
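The P value in Fig. 4a comes from a one-sided Wilcoxon rank-sum test comparing the per-gene validated proportions for DeepSEM against the background. A sketch of that comparison with SciPy, using synthetic stand-in proportions (the real inputs are the per-gene values described in the caption):

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
# Stand-ins for the per-gene proportion of predicted TFs whose motifs fall
# in a DMR or ATAC peak within ±200 kb of the target's TSS.
deepsem_prop = rng.beta(6, 4, size=500)      # shifted toward higher proportions
background_prop = rng.beta(4, 6, size=500)   # random TF selections

# One-sided test: are DeepSEM's proportions stochastically greater?
stat, p = ranksums(deepsem_prop, background_prop, alternative="greater")
print(p < 0.05)
```

The `alternative="greater"` form of `scipy.stats.ranksums` makes the test one-sided, matching the caption.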

a widely used mouse brain dataset (the Zeisel dataset41), a mouse embryo dataset42 and a mouse peripheral blood mononuclear cell (PBMC) dataset43. A complete list of these datasets can be found in Supplementary Table 5. To benchmark DeepSEM, we also compared its low-dimensional embeddings to four other methods: scVI6, DCA7, ZIFA44 and factor analysis (FA)45, following the



[Figure 5 panels: a, box plots of ARI and NMI (ranging roughly 0.1–1.0) for DeepSEM, scVI, ZIFA, FA and DCA. b, UMAP embeddings of the Zeisel dataset for the raw data and each method, colored by cell type: astrocytes/ependymal, microglia, interneurons, pyramidal SS, oligodendrocytes, endothelial/mural and pyramidal CA1.]

Fig. 5 | Single-cell clustering and embedding. a, The clustering performance of DeepSEM on nine single-cell datasets compared with four baseline
methods. The box plots show the median (line), the interquartile range (box) and the whiskers (extending 1.5 times the interquartile range). b, UMAP
visualization of hidden representations on the Zeisel dataset for different methods.
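The box plots in Fig. 5a score each clustering against reference labels with ARI and NMI; both are available in scikit-learn, as the small sketch below shows (toy labels of ours, for illustration only):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy reference cell-type labels and predicted cluster assignments.
reference = ["astro", "astro", "oligo", "oligo", "pyramidal", "pyramidal"]
clusters = [0, 0, 1, 1, 2, 0]  # one pyramidal cell mis-assigned

ari = adjusted_rand_score(reference, clusters)
nmi = normalized_mutual_info_score(reference, clusters)
print(round(ari, 3), round(nmi, 3))  # both below 1 because of the mis-assignment
```

A perfect clustering scores 1 on both metrics; labels independent of the reference score near 0 (ARI can dip slightly below 0 by chance).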

Louvain algorithm46 to cluster all the single cells into the same number of clusters. We provide a brief introduction and running details of these baseline methods in Supplementary Section 5. To quantify the clustering accuracy based on the reference labels, we used the adjusted Rand index (ARI) and normalized mutual information (NMI), both of which range from 0 (random clusters) to 1 (identical clusters). In general, DeepSEM performs better than all four baseline methods on five of the nine benchmark datasets, including the Zeisel dataset41, which were used in scVI6, and also achieves comparable performance on the other datasets (within a range of 5% on average) (Fig. 5a and Supplementary Fig. 6). We also showed that DeepSEM can maintain the hierarchical structure among different cell types by evaluating the clustering performance on the Zeisel dataset41 with the ground-truth hierarchical structure provided by the original study41 (Supplementary Fig. 6a). In particular, DeepSEM outperforms another probabilistic scRNA-seq model, scVI6, which is also based on the VAE, suggesting the necessity of explicitly modeling the GRN structure.

Next, we evaluated to what extent the latent space generated by DeepSEM could reflect the biological variability among different cells on the Zeisel dataset41. Visualizing with uniform manifold approximation and projection (UMAP)47, we found that DeepSEM was able to provide a more biologically meaningful representation. As shown in Fig. 5b, DeepSEM organizes oligodendrocytes and pyramidal CA1 cells into two clear clusters, while methods including scVI, ZIFA and FA failed to do so (highlighted in the black circles). Quantitative comparison further highlights the advantage of DeepSEM. Membership weights (Supplementary Table 6) show that DeepSEM has, on average, a closer distance between each cell and its cell-type center. We further compared DeepSEM with the baseline methods on R, earth-mover's distance (EMD) and K-nearest neighbor (KNN) preservation metrics48, which all measure the distribution difference between the embedded data and the original data48. The results show that DeepSEM can maintain better KNN consistency compared with scVI (Supplementary Table 7). Besides preserving good clustering, it had been expected that single cells typically follow a



[Figure 6 panels: a, UMAP embeddings of raw and simulated cells for DeepSEM, cscGAN and scGAN on four cell populations (Louvain cluster 1, Louvain cluster 2, CD14+ monocyte and CD4+/CD45RO+ memory). b, ROC curves for a random forest classifier separating real from simulated cells; AUCs: DeepSEM 0.54–0.58, cscGAN 0.56–0.60, scGAN 0.71–0.77, random 0.50–0.54. c, GRN consistency versus the top K% of GRN predictions (K = 0.1–5.0).]

Fig. 6 | Simulation performance of DeepSEM compared with cscGAN and scGAN. a, The embedding plot layout was determined by UMAP visualization. b, Receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) scores (mean ± s.d., lower is better) using an RF classifier to distinguish real from simulated cells. c, GRN consistency (mean ± 95% CI, higher is better) between the simulated and real data.
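The evaluation behind panel b reduces to scoring how well a classifier separates real from simulated cells: an AUC near 0.5 means the simulation is statistically hard to tell apart from real data. A minimal NumPy sketch of the AUC computation via the Mann-Whitney U statistic (function and variable names are illustrative, not taken from the paper's code):

```python
import numpy as np

def roc_auc(scores_real, scores_sim):
    """AUC = probability that a randomly chosen real cell receives a higher
    classifier score than a randomly chosen simulated cell (ties count 0.5)."""
    s = np.concatenate([scores_real, scores_sim])
    order = np.argsort(s, kind="stable")
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    for v in np.unique(s):          # average the ranks of tied scores
        tied = s == v
        ranks[tied] = ranks[tied].mean()
    n_real, n_sim = len(scores_real), len(scores_sim)
    u = ranks[:n_real].sum() - n_real * (n_real + 1) / 2
    return u / (n_real * n_sim)

# A perfectly separable case gives AUC = 1.0; identical score
# distributions give AUC = 0.5, i.e. a "realistic" simulation.
print(roc_auc(np.array([0.9, 0.8]), np.array([0.2, 0.1])))  # 1.0
print(roc_auc(np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # 0.5
```

The rank-based form is equivalent to integrating the ROC curve, which is why an uninformative classifier (indistinguishable real and simulated score distributions) lands at 0.5 regardless of class sizes.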

temporal progression process, and are more apt to be represented with a continuous differentiation trajectory. We therefore embedded only the oligodendrocyte cells in a separate low-dimensional space and checked the consistency between the embeddings of the cell subtypes and the differentiation trajectory defined in Zeisel et al.41 (Supplementary Fig. 7). DeepSEM, ZIFA and FA were able to produce tight cell clusters for subtype Oligo 3. This finding indicated that Oligo 3 presents a distinct cellular state compared with the other cell types, which is consistent with the finding in Zeisel et al.41. We also found that the embedding produced by DeepSEM can represent the same differentiation trajectory as defined in Zeisel et al.41, whereas methods including scVI and DCA failed to capture this information (Supplementary Fig. 7).

scRNA-seq simulation. In this study, we compared the simulation performance of DeepSEM with two other GAN-based approaches,




cscGAN and scGAN27, on the PBMC dataset43. A brief introduction to cscGAN and scGAN, including their running details, can be found in Supplementary Section 6. The original study27 classified cells with the Louvain algorithm46, but we found that the Louvain clusters were inconsistent with the cell types identified by Zheng et al.43. For instance, 47.0% of the CD56+ natural killer (NK) cells were misclustered into a CD8+ cytotoxic T cluster (Supplementary Fig. 8). Therefore, we used two different annotation strategies, as introduced in Methods. We found that all three methods could generate low-dimensional embeddings almost indistinguishable from the original data, whether cells were clustered computationally or labeled by experts (Fig. 6a and Supplementary Figs. 9 and 10). We trained a random forest (RF) classifier to distinguish the simulated data from the real data for each Louvain cluster and each cell type that contains more than 2,500 cells. Our hypothesis is that the prediction performance of the RF model should be close to random if the simulated data generation was 'realistic'. We observed that the classification performance of the RF dropped by an average of 1.21% and 19.77% when distinguishing data simulated by DeepSEM from the real test data, compared with cscGAN and scGAN, respectively (Fig. 6b and Supplementary Figs. 11 and 12). Note that the objective functions of cscGAN and scGAN train a generator to fool a discriminator, so they were expected to perform well on this task. Nevertheless, DeepSEM also achieved realistic simulation, suggesting that integrating the GRN may serve as a crucial step for modeling scRNA-seq data.

We also proposed another concept, called GRN consistency, to measure the quality of the simulated single cells. GRN consistency measures the difference between the GRNs predicted from the real and the simulated scRNA-seq data, which quantifies how well the single-cell model captures both the marginal and the conditional independence in the original distribution. That is, if a scRNA-seq simulation is realistic, the GRN predictions obtained from the simulated dataset should match the predictions from the real dataset. Different from recent work49 that used marginal independence to measure consistency, our GRN consistency considers the conditional independence between TFs and target genes, which is harder for simulators to maintain. We found that DeepSEM was able to achieve much higher GRN consistency than cscGAN and scGAN (Fig. 6c and Supplementary Figs. 13 and 14). For instance, for the top 0.1% of predicted GRN edges, there is only 61.68% and 32.04% (on average) predicted GRN edge overlap between the real data and the data simulated by cscGAN and scGAN, respectively. DeepSEM improves the GRN consistency by 33.07% and 156.14% compared with cscGAN and scGAN, respectively (Fig. 6c and Supplementary Figs. 13 and 14). This result demonstrates that DeepSEM is able to generate more realistic scRNA-seq data satisfying biological constraints.

We further investigated how the simulated single-cell data could be used to improve the quality of the downstream cell-type classification task when only a limited number of single cells are provided. To investigate this, we downsampled single-cell data from two clusters (Louvain cluster 2 and CD56+ NK) and trained an RF model to distinguish these cells from the others. We observed that, for all the compared approaches, the performance could be improved by simply upsampling the downsampled clusters, which is equivalent to putting higher weights on some samples to address the class-imbalance problem for the RF model (Supplementary Fig. 15). The performances of DeepSEM and scGAN are about the same and only slightly lower than that of cscGAN, by 1.02% (Supplementary Fig. 15).

Discussion
In this Article, we introduce a general computational framework that can jointly model the GRN and single-cell transcriptomic data. The structure of the GRN is explicitly modeled as special layers of the neural network, which act as biological constraints to restrict the parameter space. One limitation of our study is that the running time increases with the number of genes involved, owing to the 'inverse' operation in the inverse GRN layer. Empirically, DeepSEM is relatively slower than other VAE models such as scVI6. To address this potential limitation during training, we recommend that users select highly variable genes instead of using the whole transcriptome as input features.

Models like DeepSEM may have other potential applications in single-cell biology. For instance, since all cells share the identical genome, the GRN can be shared among different modalities, such as transcriptomic and epigenomic data. Therefore, DeepSEM can be adopted to integrate different single-cell modalities by leveraging the GRN as a 'bridge' to construct a common latent space. A second potential application is to use the DeepSEM framework to integrate other molecular interaction networks, such as a protein–protein interaction network, open chromatin data, DNA binding motifs and a genetic interaction network, to further infer a GRN and achieve higher accuracy.

Methods
The DeepSEM framework. Structural equation modeling is a multivariate statistical model to analyze structural relationships among different random variables. The basic SEM was first developed to model the covariance matrix of random variables50. Later, the SEM was found to be very powerful in modeling the relationship between observed features and hidden latent variables and was widely used in econometrics and sociology for causal inference51,52. More importantly, the SEM can be adopted to detect the conditional dependency among random variables and is therefore also used to predict the graph structure of Bayesian networks and Markov random fields53–56. DeepSEM generalizes the SEM, which models the conditional dependencies among random variables and is formulated as a self-regression problem

X = W^T X + Z,  (1)

We can rearrange equation (1) into the following form

X = (I − W^T)^{−1} Z,
Z = (I − W^T) X,  (2)

where I ∈ R^{m×m} denotes the identity matrix, X ∈ R^{n×m} denotes the gene expression matrix with n cells and m genes, W ∈ R^{m×m} represents the adjacency matrix of the GRN that captures the conditional dependencies among different genes, and Z ∈ R^{n×m} stands for a noise matrix following a Gaussian distribution. Here we modify equation (2) into a nonlinear version of the SEM, originally proposed by Yu et al.21, as follows

X = f1((I − W^T)^{−1} Z),  (3)

Z = (I − W^T) f2(X),  (4)

where f1 and f2 stand for multilayer neural networks. In particular, equation (3) can be decomposed into the following formulas:

H_Z = (I − W^T)^{−1} Z,

where H_Z can be further rewritten in column-wise vectors:

H_Z = [h_0; h_1; …; h_m], h_i ∈ R^{n×1},
f1(h_i) = tanh(tanh(tanh(h_i W_1^T) W_2^T) W_3^T),  (5)
X = [f1(h_0); f1(h_1); …; f1(h_m)],
W_1 ∈ R^{d×1}, W_2 ∈ R^{d×d}, W_3 ∈ R^{1×d},

where d denotes the number of hidden neurons in each layer and h_i denotes the hidden latent variable for gene i. Equation (5) can be viewed as a nonlinear decoder function from the random variable Z. W_1, W_2 and W_3 are the linear weights of the different layers of the neural network. Different from the conventional neural network



used in scRNA-seq modeling, f1 takes only one feature as input. We then define a corresponding encoder function with a similar form, that is

X = [x_0; x_1; …; x_m], x_i ∈ R^{n×1},
H_X = [f2(x_0); f2(x_1); …; f2(x_m)],
f2(x_i) = tanh(tanh(tanh(x_i W_4^T) W_5^T) W_6^T),  (6)
Z = Reparameter((I − W^T) H_X),
W_4 ∈ R^{d×1}, W_5 ∈ R^{d×d}, W_6 ∈ R^{2×d},

where x_i denotes the expression of gene i and f2 stands for another multilayer neural network that models the noise in X, taking the expression of each gene as an input feature; 'Reparameter' stands for the reparameterization trick57. W_4, W_5 and W_6 are the linear weights of the different layers of the neural network. We name (I − W^T) the GRN layer and (I − W^T)^{−1} the inverse GRN layer. To integrate both equation (5) and equation (6), we use a beta-VAE model23 with an additional L1 norm to regularize the adjacency matrix W. More specifically, the loss function of DeepSEM is defined as follows

L = −E_{q(Z|X)}[log p(X|Z)] + β KL(q(Z|X) || p(Z)) + α ||W||_1,  (7)

where E denotes the expectation, p and q denote the distributions of X and Z, KL denotes the KL divergence, and α and β are hyperparameters. That is, instead of directly transforming X to Z, the beta-VAE framework models Z as a Gaussian distribution whose mean and variance are the outputs of neural networks taking X as input features.

Similar to other autoencoder models, the beta-VAE also recovers the input X in a probabilistic model. For cell-type-specific tasks, such as GRN prediction, single-cell embedding and simulation, we only considered the non-zero elements of X because it is typically hard to denoise (impute) all the zero values using only a limited number of samples for each cell type. For the prior distribution p(Z), we chose a Gaussian distribution N(μ, σ2) in which the mean (μ) and variance (σ2) were also estimated by minimizing equation (7). Gaussian-distribution-based VAE models have been widely used in other domains such as images, videos and natural language. Empirically, we also tried Gaussian mixture models with one to five Gaussian components as the prior distribution but found that they did not significantly outperform the original Gaussian distribution with a single component (Supplementary Fig. 16). Note that f1 and f2 are multilayer perceptrons (MLPs) with unique architectures. For each cell, we used f2 to scan all the m genes and output m hidden variables. Similarly, f1 scans the m hidden variables and outputs m recovered gene expression values. In this way, we made sure that all the genes can interact only at the GRN layer and the inverse GRN layer.

Optimization of DeepSEM. We adopted gradient descent and the reparameterization trick57 to train our DeepSEM model. Different from a conventional beta-VAE, there is an 'inverse' operation in equation (3), which makes the optimization problem harder because the inverse operation is generally sensitive to noise. We found that directly using the Adam algorithm58 sometimes yielded unstable results. Inspired by the coordinate descent in graphical lasso59, in which the matrix inverse is required, we optimized the weight matrix W in the GRN layer and the weights of the normal linear layers in an alternating way. More specifically, after using the RMSprop algorithm60 to optimize the weights of the MLPs for one epoch, we then optimized W using the same optimization algorithm for another two epochs with different learning rates (details in Supplementary Table 8).

GRN inference. The core element of DeepSEM is its ability to infer the GRN structure through probabilistic modeling of scRNA-seq data. In this study, similar to causal inference in the SEM, we take the adjacency matrix W defined in equation (3) to indicate the GRN learned by DeepSEM. The absolute value of each element of W is used to rank the possibility of a regulatory relationship between genes. To obtain stable predictions, we run the training process with ten different random initializations of the model. The final GRN prediction is the average of the absolute adjacency matrices W derived from the ten different models.

Simulation of scRNA-seq data. Another important component of DeepSEM is that it can simulate scRNA-seq data by perturbing the values of the hidden neurons. DeepSEM simulates 'realistic' scRNA-seq data in a unique manner by guiding the information flow through the GRN layer, which mirrors the in vivo generation process of messenger RNA dynamics governed by multiple TFs. In particular, DeepSEM first perturbs the hidden vector Z defined in equation (3). Note that Z follows a Gaussian distribution whose mean and variance are calculated from two separate neural networks. Let n denote the white-noise variable following N(0, I), where I stands for the identity matrix. Then a perturbation Ẑ is defined as μ + nσ, where μ and σ stand for the mean and standard deviation of the posterior probability of Z. The simulated gene expression can be generated by decoding Ẑ through the GRN layer. In this study, we only focus on simulating the expression of the non-zero input expression values.

Implementation of DeepSEM. In DeepSEM, the log-transformed scRNA-seq expression data, after Z-normalization, are fed into the neural network. We initialized the MLPs using 'kaiming_uniform'61 and initialized W by setting the matrix diagonal to zeros and drawing the other entries from a Gaussian distribution N(1/(m − 1), ε2), in which m stands for the number of genes and ε denotes a small value to avoid being trapped in a local optimum. The values on the diagonal are fixed as zeros during the whole training process to guarantee that W learns the regulatory network between genes. We determined the key hyperparameters (α, β and the number of epochs) using a grid-search strategy and used common default values for the others. We summarize all the hyperparameters used in this study in Supplementary Table 8 and discuss the influence of the key hyperparameters in Supplementary Section 7 and Supplementary Fig. 17.

Datasets and data processing. Datasets used to evaluate GRN inference. We evaluated the performance of GRN inference on seven datasets (Supplementary Table 2), as in the BEELINE framework28, where ground-truth GRNs are all available: (1) mouse embryonic stem cells (mESC)62, (2) mouse dendritic cells (mDC)63, (3) three lineages of mouse hematopoietic stem cells64, namely the erythroid lineage (mHSC-E), granulocyte-macrophage lineage (mHSC-GM) and lymphoid lineage (mHSC-L), (4) human mature hepatocytes (hHep)65 and (5) human embryonic stem cells (hESC)66. To preprocess the raw gene expression data, we adopted the same strategy as in the BEELINE framework28 (Supplementary Section 1).

For each dataset, there are three kinds of ground-truth GRN according to their information sources: cell-type-specific ChIP-seq67–70, non-specific ChIP-seq71–73 and functional interaction networks collected from the STRING database74. For the mouse embryonic stem cells (mESC), loss-/gain-of-function data (LOF/GOF)70 were also collected as a ground-truth GRN. Following Pratapa et al.28, we excluded all edges not outgoing from TFs from the prediction during the evaluation. In addition, we also included another mouse cortex dataset33 for which the GRN at the single-cell level was not measured. We focused on six major neuronal cell types from the primary visual cortex, including layer 2/3 (L2/3), L4, L5 and L6 intra-telencephalic (IT) neurons, L5 pyramidal tract (PT) neurons and L6 cortico-thalamic (CT) neurons. Following the same data-processing procedures as in the original data paper, we excluded cells annotated as low quality and genes expressed in less than 10% of cells. We then normalized the count library sizes using counts per million mapped reads (CPM). For each cell type, we selected 500 cell-type-specific genes using the 'rank_genes_groups' function provided by the scanpy package75 and selected TF genes from the JASPAR 2018 CORE vertebrates non-redundant motif database76. The motifs were scanned against the mm10 genome using FIMO77 with a P value threshold of 1 × 10−5; after that, 487 TFs were under consideration in the following analysis. After removing duplicated genes, 2,762 genes including 487 TFs were considered in this study. We included the download addresses of the GRN data and of the snmC-seq and scATAC-seq data for the mouse cortex dataset in Supplementary Table 1.

Datasets used to evaluate cell embeddings. We evaluated the single-cell embedding on the Zeisel dataset41, which was also used to assess the performance of scVI6. We also collected eight other datasets, including the mouse embryonic stem dataset42, the human pancreas dataset78 and the PBMC dataset43, all with known cell-type annotations. For each dataset, we discarded all cells without annotations and all cell types containing less than 10 cells. We also discarded genes expressed in less than 1% of cells. For each dataset, only the top 1,000 highly variable genes were considered in the evaluation. We present the summary statistics for each dataset in Supplementary Table 5.

Datasets used to evaluate the performance of simulation. We evaluated the performance of scRNA-seq simulation following the experimental setting in Marouf et al.27. In particular, we trained and evaluated our model on a published human PBMC (healthy donor A) dataset43. We then used two different strategies to annotate the cell types: (1) annotations from the original authors43, including 11 different cell types (Supplementary Table 9), and (2) annotations derived from the Louvain algorithm with 'resolution' parameter 0.15 (Supplementary Table 10), following the same procedure as in Marouf et al.27. After discarding genes expressed in less than 1% of cells and cells expressing less than 10 genes, we normalized the read counts to 20,000. We selected the top 1,000 variable genes based on the log-transformed normalized data and further normalized the gene expression using the 'normalize_total' function provided in the scanpy package75, matching the input formats of cscGAN and scGAN27.

Latent representation visualization and clustering. For both DeepSEM and all other approaches to be evaluated, we first extracted the top 50 principal components (PCs) using principal component analysis if the size of the hidden embeddings was larger than 50, and calculated the cell neighborhood graph by setting the 'n_neighbors' parameter to 30. We then visualized the results of these datasets in two dimensions using the UMAP algorithm with default parameters. We used the Louvain algorithm46 to cluster cells and selected the 'resolution' parameter by binary search to generate the same number of clusters as the cell-type annotation.
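The linear core of equations (1)–(4), before the neural networks f1 and f2 are wrapped around it, can be verified numerically: decoding noise through (I − W^T)^{−1} and re-encoding through (I − W^T) are exact inverses. A minimal NumPy sketch with a hypothetical 4-gene adjacency matrix (the edge weights are illustrative only, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 5, 4

# Hypothetical GRN adjacency matrix: W[i, j] is the effect of gene i on
# gene j; the diagonal is fixed to zero, as in DeepSEM's initialization.
W = np.zeros((n_genes, n_genes))
W[0, 1], W[1, 2], W[0, 3] = 0.8, -0.5, 0.3

I = np.eye(n_genes)
Z = rng.normal(size=(n_cells, n_genes))   # Gaussian noise, one row per cell

# Decoder direction, eq. (2): each profile x solves x = W^T x + z.
# With cells stored as rows this reads X = Z (I - W)^{-1}.
X = Z @ np.linalg.inv(I - W)

# Encoder direction recovers the noise exactly: Z = X (I - W).
Z_rec = X @ (I - W)
print(np.allclose(Z, Z_rec))  # True
```

DeepSEM replaces this exact inversion pair with the nonlinear maps f1 and f2 and learns W by gradient descent; the sketch only illustrates why the GRN layer and the inverse GRN layer are mutual inverses in the linear case.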



Visualization of simulated and real data. In this study, for each Louvain cluster and cell type containing more than 2,500 cells, we randomly selected 80% of the cells as training data and the remaining 20% as test data. DeepSEM and the other baseline methods were trained on the training data; for each method, we simulated the same amount of data as the number of cells in the test data and used UMAP with default hyperparameters to visualize the simulated and real test data. Following Marouf et al.27, we further trained an RF model with 1,000 trees to classify the real test data versus the simulated data on the first 50 PCs. Fivefold cross-validation was performed to evaluate the prediction performance in terms of the averaged area under the receiver operating characteristic (ROC) curve. We also conducted a positive control by discriminating the training data from the test data. In addition, we considered the membership weight of the clustering to quantitatively compare the visualization performance. The membership weight is calculated from the distribution of the inverse Euclidean distance between each cell and the centers of all the clusters:

Membership weights = softmax(1 / dist(x_ij, center(x_i,:))),  (8)

where x_ij denotes the UMAP embedding of cell j in cell type i.

The GRN consistency between the simulated and real data. In this study, we investigated whether the generated scRNA-seq data have the same GRN as scRNA-seq data from real cells. For each cluster or cell type containing more than 2,500 cells, we used DeepSEM, scGAN and cscGAN to generate n gene expression profiles, where n equals the number of cells. We used GRNBoost218 to infer the GRNs for both the real and the simulated single-cell data. Similar to the GRN inference benchmark BEELINE28, only GRN edges outgoing from TFs were considered. For each cluster, we selected the top K = {0.1%, 0.2%, 0.5%, 1.0%, 1.5%, 2.0%, 2.5%} of GRN edges and used them to evaluate the consistency between the real and simulated cells. The GRN consistency is calculated as follows:

GRN consistency = (number of overlapping edges among the top N predicted edges of the real and simulated cells) / N,  (9)

where N = K% × the number of predicted GRN edges in the real cells. We report the average GRN consistency with a ±95% confidence interval (CI).

Effects of data downsampling and augmentation on cell-type classification. In this study, we evaluated the data-augmentation performance by discriminating cells annotated with a selected cluster or cell type from the other cells given limited training data. More specifically, we selected cells annotated as cluster 2 obtained from the Louvain algorithm46, as in Marouf et al.27, and CD56+ NK cells annotated by Zheng et al.43. First, we randomly sampled 80% of the data as the training dataset and used the remaining 20% as the test dataset. Then, the cells in the training dataset were randomly downsampled at eight different percentages {50%, 25%, 10%, 5%, 3%, 2%, 1%, 0.5%}. For each downsampling rate, we simulated 2,000 cells from the selected Louvain cluster or cell type with DeepSEM, cscGAN and scGAN. We also randomly sampled 2,000 cells with replacement from the selected Louvain cluster or cell type, and denote this procedure as upsampling. An RF model with default hyperparameters was trained on the top 50 PCs of each cell as features to discriminate these cells from the others. We trained the RF model five times and report the mean AUPR ± 95% CI.

Data availability
We provide all datasets generated or analyzed during this study. The experimental scRNA-seq datasets were downloaded from the Gene Expression Omnibus with accession numbers GSE81252 (hHep dataset65), GSE75748 (hESC dataset66), GSE98664 (mESC dataset62), GSE48968 (mDC dataset63), GSE81682 (mHSC dataset64), GSE115746 (mouse cortex dataset33), GSE60361 (Zeisel dataset41), GSE85241 (Muraro dataset78), GSE81861 (Li dataset79) and GSE45719 (Deng dataset80). The other experimental scRNA-seq datasets were downloaded from ArrayExpress with accession number E-MTAB-5061 (Segerstolpe dataset81), from the NCBI Sequence Read Archive (SRA) with accession number SRP041736 (Pollen dataset42), from a GitHub repository (https://github.com/LuyiTian/sc_mixology) (CellBench dataset82) and from the 10x Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets/) (PBMC dataset43). The scATAC-seq and snmC-seq data for the mouse cortex were downloaded from the Gene Expression Omnibus with accession numbers GSE126724 (scATAC-seq35) and GSE97179 (snmC-seq34). More information on these datasets can be found in Methods. We also summarize the accession and download links in Supplementary Tables 1, 2, 5 and 9. Source Data for Figs. 3–6 are available with this manuscript.

Code availability
The code generated during this study is available on GitHub (https://github.com/HantaoShu/DeepSEM) and in Zenodo83.

Received: 25 March 2021; Accepted: 15 June 2021; Published online: 22 July 2021

References
1. Picelli, S. et al. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat. Methods 10, 1096–1098 (2013).
2. Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
3. Hashimshony, T. et al. CEL-Seq2: sensitive highly-multiplexed single-cell RNA-seq. Genome Biol. 17, 77 (2016).
4. Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. 34, 1145–1160 (2016).
5. Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nat. Methods 11, 740–742 (2014).
6. Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
7. Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
8. Cuomo, A. S. E. et al. Single-cell RNA-sequencing of differentiating iPS cells reveals dynamic genetic effects on gene expression. Nat. Commun. 11, 810 (2020).
9. Olsson, A. et al. Single-cell analysis of mixed-lineage states leading to a binary cell fate choice. Nature 537, 698–702 (2016).
10. Sharma, A. et al. Onco-fetal reprogramming of endothelial cells drives immunosuppressive macrophages in hepatocellular carcinoma. Cell 183, 377–394.e21 (2020).
11. Arisdakessian, C., Poirion, O., Yunits, B., Zhu, X. & Garmire, L. X. DeepImpute: an accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data. Genome Biol. 20, 211 (2019).
12. Wang, T. et al. BERMUDA: a novel deep transfer learning method for single-cell RNA sequencing batch correction reveals hidden high-resolution cellular subtypes. Genome Biol. 20, 165 (2019).
13. Li, X. et al. Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis. Nat. Commun. 11, 2338 (2020).
14. Huynh-Thu, V. A., Irrthum, A., Wehenkel, L. & Geurts, P. Inferring regulatory networks from expression data using tree-based methods. PLoS ONE 5, e12776 (2010).
15. Chan, T. E., Stumpf, M. P. H. & Babtie, A. C. Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst. 5, 251–267.e3 (2017).
16. Matsumoto, H. et al. SCODE: an efficient regulatory network inference algorithm from single-cell RNA-seq during differentiation. Bioinformatics 33, 2314–2321 (2017).
17. Papili Gao, N., Ud-Dean, S. M. M., Gandrillon, O. & Gunawan, R. SINCERITIES: inferring gene regulatory networks from time-stamped single cell transcriptional expression profiles. Bioinformatics 34, 258–266 (2018).
18. Moerman, T. et al. GRNBoost2 and Arboreto: efficient and scalable inference of gene regulatory networks. Bioinformatics 35, 2159–2161 (2019).
19. Kamimoto, K., Hoffmann, C. M. & Morris, S. A. CellOracle: dissecting cell identity via network inference and in silico gene perturbation. Preprint at bioRxiv https://doi.org/10.1101/2020.02.17.947416 (2020).
20. Kim, S. ppcor: an R package for a fast calculation to semi-partial correlation coefficients. Commun. Stat. Appl. Methods 22, 665–674 (2015).
21. Yu, Y., Jie, C., Tian, G. & Mo, Y. DAG-GNN: DAG structure learning with graph neural networks. In Proceedings of the 36th International Conference on Machine Learning 7154–7163 (ICML, 2019).
22. Lin, C., Jain, S., Kim, H. & Bar-Joseph, Z. Using neural networks for reducing the dimensions of single-cell RNA-seq data. Nucleic Acids Res. 45, e156 (2017).
23. Higgins, I. et al. beta-VAE: learning basic visual concepts with a constrained variational framework. In Proceedings of the 5th International Conference on Learning Representations (ICLR, 2017).
24. Zhao, A., Balakrishnan, G., Durand, F., Guttag, J. V. & Dalca, A. V. Data augmentation using learned transformations for one-shot medical image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 8543–8553 (IEEE, 2019).
25. Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019).
26. Wang, X., Ghasedi Dizaji, K. & Huang, H. Conditional generative adversarial network for gene expression inference. Bioinformatics 34, i603–i611 (2018).
27. Marouf, M. et al. Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nat. Commun. 11, 166 (2020).
28. Pratapa, A., Jalihal, A. P., Law, J. N., Bharadwaj, A. & Murali, T. M. Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data. Nat. Methods 17, 147–154 (2020).
29. Moore, L. D., Le, T. & Fan, G. DNA methylation and its basic function. Neuropsychopharmacology (2013).
30. Thurman, R. E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012).
31. Keilwagen, J., Posch, S. & Grau, J. Accurate prediction of cell type-specific transcription factor binding. Genome Biol. 20, 9 (2019).



32. Funk, C. C. et al. Atlas of transcription factor binding sites from ENCODE DNase hypersensitivity data across 27 tissue types. Cell Rep. 32, 108029 (2020).
33. Tasic, B. et al. Shared and distinct transcriptomic cell types across neocortical areas. Nature 563, 72–78 (2018).
34. Luo, C. et al. Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science 357, 600–604 (2017).
35. Fang, R. et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat. Commun. 12, 1337 (2021).
36. Dong, J. et al. Enhancing single-cell cellular state inference by incorporating molecular network features. Preprint at bioRxiv (2019).
37. Aibar, S. et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods 14, 1083–1086 (2017).
38. Li, X. et al. Network embedding-based representation learning for single cell RNA-seq data. Nucleic Acids Res. 45, e166 (2017).
39. Cahan, P. et al. CellNet: network biology applied to stem cell engineering. Cell 158, 903–915 (2014).
40. Morris, S. A. et al. Dissecting engineered cell types and enhancing cell fate conversion via CellNet. Cell 158, 889–902 (2014).
41. Zeisel, A. et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).
42. Pollen, A. A. et al. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat. Biotechnol. 32, 1053–1058 (2014).
43. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
44. Pierson, E. & Yau, C. ZIFA: dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 16, 241 (2015).
45. Jolliffe, I. T. in Principal Component Analysis (ed. Jolliffe, I. T.) 115–128 (Springer, 1986).
46. Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).
47. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. https://doi.org/10.1038/nbt.4314 (2019).
48. Heiser, C. N. & Lau, K. S. A quantitative framework for evaluating single-cell data structure preservation by dimensionality reduction techniques. Cell Rep. 31, 107576 (2020).
49. Viñas, R., Andrés-Terré, H., Liò, P. & Bryson, K. Adversarial generation of gene expression data. Bioinformatics https://doi.org/10.1093/bioinformatics/btab035 (2021).
50. Bollen, K. A. Structural Equations with Latent Variables (John Wiley & Sons, 1989).
51. Haavelmo, T. The statistical implications of a system of simultaneous equations. Econometrica 11, 1–12 (1943).
52. King, M., Goldberger, A. S. & Duncan, O. D. Structural equation models in the social sciences. Econ. J. 84, 212–214 (1974).
53. Duarte, C. W., Klimentidis, Y. C., Harris, J. J., Cardel, M. & Fernández, J. R. A hybrid Bayesian network/structural equation (BN/SEM) modeling approach for detecting physiological networks for obesity-related genetic variants. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine 696–702 (IEEE, 2012).
54. Yoo, C. & Oh, S. Combining structure equation model with Bayesian networks for predicting with high accuracy of recommending surgery for better survival in benign prostatic hyperplasia patients. In 20th International Congress on Modelling and Simulation-Adapting to Change 2029–2033 (Modelling and Simulation Society of Australia and New Zealand, 2013).
55. Zheng, X., Aragam, B., Ravikumar, P. & Xing, E. P. DAGs with NO TEARS: continuous optimization for structure learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems 9492–9503 (IEEE, 2018).
56. Luo, Y., Peng, J. & Ma, J. When causal inference meets deep learning. Nat. Mach. Intell. 2, 426–427 (2020).
57. Kingma, D. P. & Welling, M. An introduction to variational autoencoders. Found. Trends Mach. Learn. 12, 307–392 (2019).
58. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR, 2015).
59. Friedman, J., Hastie, T. & Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441 (2008).
60. Tieleman, T. & Hinton, G. Lecture 6.5-rmsprop, Coursera: Neural Networks for Machine Learning Technical Report (Univ. Toronto, 2012).
61. He, K., Zhang, X., Ren, S. & Sun, J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision 1026–1034 (IEEE, 2015).
62. Hayashi, T. et al. Single-cell full-length total RNA sequencing uncovers dynamics of recursive splicing and enhancer RNAs. Nat. Commun. 9, 619 (2018).
63. Shalek, A. K. et al. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature 510, 363–369 (2014).
64. Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20–e31 (2016).
65. Camp, J. G. et al. Multilineage communication regulates human liver bud development from pluripotency. Nature 546, 533–538 (2017).
66. Chu, L.-F. et al. Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biol. 17, 173 (2016).
67. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
68. Davis, C. A. et al. The encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 46, D794–D801 (2018).
69. Oki, S. et al. ChIP-Atlas: a data-mining suite powered by full integration of public ChIP-seq data. EMBO Rep. 19, e46255 (2018).
70. Xu, H. et al. ESCAPE: database for integrating high-content published data collected from human and mouse embryonic stem cells. Database 2013, bat045 (2013).
71. Garcia-Alonso, L., Holland, C. H., Ibrahim, M. M., Turei, D. & Saez-Rodriguez, J. Benchmark and integration of resources for the estimation of human transcription factor activities. Genome Res. 29, 1363–1375 (2019).
72. Liu, Z.-P., Wu, C., Miao, H. & Wu, H. RegNetwork: an integrated database of transcriptional and post-transcriptional regulatory networks in human and mouse. Database 2015, bav095 (2015).
73. Han, H. et al. TRRUST: a reference database of human transcriptional regulatory interactions. Sci. Rep. 5, 11432 (2015).
74. Szklarczyk, D. et al. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47, D607–D613 (2019).
75. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
76. Fornes, O. et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 48, D87–D92 (2020).
77. Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
78. Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, 385–394.e3 (2016).
79. Li, H. et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors. Nat. Genet. 49, 708–718 (2017).
80. Deng, Q., Ramsköld, D., Reinius, B. & Sandberg, R. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science 343, 193–196 (2014).
81. Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).
82. Tian, L. et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat. Methods 16, 479–487 (2019).
83. Shu, H. et al. Code for paper 'Modeling gene regulatory networks using neural network architectures'. Zenodo https://doi.org/10.5281/zenodo.4915754 (2021).

Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (61872216, 81630103) and the Turing AI Institute of Nanjing to J. Zeng. We also acknowledge the National Natural Science Foundation of China (31900862) for funding support to D.Z.

Author contributions
J. Zeng and J.M. designed the study and developed the conceptual ideas. H.S. and Q.L. implemented the main algorithms. H.S. performed the model training and experimental validation tasks. H.S. and J. Zhou collected all the input data sources and interpreted the results. H.S., J. Zhou, H.L., D.Z., J. Zeng and J.M. wrote the manuscript with support from all authors.

Competing interests
J. Zeng is founder and CTO of Silexon AI Technology Co. Ltd. and has an equity interest. The remaining authors declare no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s43588-021-00099-8.
Correspondence and requests for materials should be addressed to J.Z. or J.M.
Peer review information Nature Computational Science thanks Jun Ding, Yafei Lyu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Handling editor: Fernando Chirigati, in collaboration with the Nature Computational Science team.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© The Author(s), under exclusive licence to Springer Nature America, Inc. 2021

