Bayesian Inference and Distance Calculation for Single-Cell RNA-seq Data
SanityR provides an R interface to the Sanity model for single-cell gene expression analysis, described in Breda et al. (2021), Nature Biotechnology. It offers tools for:
- Bayesian estimation of log normalized counts and their uncertainty.
- Computing statistically sound distances between cells while accounting for uncertainty.
- Integration with SingleCellExperiment, for use as part of the Bioconductor single-cell workflow.
SanityR is available on Bioconductor.
To install this package, start R (version “4.5”) and enter:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SanityR")

To install the latest version from GitHub, you can use the remotes package:

remotes::install_github("TeoSakel/SanityR")

library(SanityR)
# Simulate data
sce <- simulate_branched_random_walk(N_path = 10, length_path = 10, N_gene = 200)
# Run Sanity estimation
sce <- Sanity(sce)
# Compute distances
dist <- calculateSanityDistance(sce)
# Perform clustering or visualization
plot(hclust(dist))

sce <- Sanity(sce)
logcounts(sce)

Log-normalizes the UMI counts and estimates error bars for each value using a hierarchical Bayesian model:
where:
- $n_{gc}$ is the observed UMI count for gene $g$ in cell $c$.
- $\lambda_c$ is the cell-specific transcription rate.
- $\alpha_g$ is the mean activity quotient of gene $g$.
- $a$ and $b$ are prior hyperparameters for the Gamma distribution.
- $\delta_{gc}$ is the log fold-change of activity for gene $g$ in cell $c$ versus the mean.
- $v_g$ is the prior variance of the log fold-change for gene $g$.
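Read together, the definitions above suggest a hierarchical model of the following shape. This is a sketch inferred from the listed symbols, assuming a Poisson likelihood, a Gamma prior on $\alpha_g$ with hyperparameters $a$ and $b$, and a zero-mean Gaussian prior on $\delta_{gc}$; Breda et al. (2021) gives the exact formulation.

$$
\begin{aligned}
n_{gc} \mid \lambda_c, \alpha_g, \delta_{gc} &\sim \mathrm{Poisson}\left(\lambda_c \, \alpha_g \, e^{\delta_{gc}}\right) \\
\alpha_g &\sim \mathrm{Gamma}(a, b) \\
\delta_{gc} &\sim \mathcal{N}(0, v_g)
\end{aligned}
$$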
Log-normalized counts in this model are calculated as:
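Assuming the normalized value is the log of the gene's mean activity plus the cell-specific log fold-change, each taken at its posterior estimate, the quantity can be sketched as:

$$
\log \alpha_g + \delta_{gc}
$$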
dist <- calculateSanityDistance(sce)

Computes the expected Euclidean distance between cells, accounting for measurement uncertainty:
where:
- $d$ is the distance between cells $c$ and $c'$.
- $\delta_{gc}$ is the log fold-change of activity for gene $g$ in cell $c$, computed by Sanity.
- $\eta_g = \epsilon_{gc} + \epsilon_{gc'}$ is the sum of the posterior variances of $\delta_{gc}$ and $\delta_{gc'}$.
- $\Delta_g$ is the “true” distance along the dimension of gene $g$.
- $\alpha$ is a hyperparameter that controls the correlation between cells (0 = fully correlated, 2 = fully independent).
The function requires Sanity() to have been run beforehand to estimate $\delta_{gc}$ and its posterior variance $\epsilon_{gc}$. It returns a dist object suitable for clustering or embedding.
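To illustrate how the quantities above could combine, the following base R sketch computes an expected distance between two cells for a single, fixed $\alpha$. It assumes a zero-mean Gaussian prior with variance $\alpha v_g$ on $\Delta_g$ and a Gaussian likelihood with variance $\eta_g$ for the observed difference $\delta_{gc} - \delta_{gc'}$, so the posterior mean is shrunk by $f_g = \alpha v_g / (\alpha v_g + \eta_g)$ and the posterior variance is $f_g \eta_g$. The function name expected_distance_sketch and its arguments are hypothetical and not part of SanityR; how calculateSanityDistance() chooses or averages over $\alpha$ is not covered here.

```r
# Hypothetical helper, not part of SanityR: expected distance for a fixed alpha.
# delta_c, delta_c2: posterior means of the log fold-changes for two cells
# eps_c, eps_c2:     posterior variances of those log fold-changes
# v_g:               prior variance of the log fold-change for each gene
expected_distance_sketch <- function(delta_c, delta_c2, eps_c, eps_c2, v_g,
                                     alpha = 1) {
  x <- delta_c - delta_c2          # observed difference per gene
  eta <- eps_c + eps_c2            # summed posterior variances (eta_g)
  prior_var <- alpha * v_g         # assumed prior variance of Delta_g
  # Shrinkage factor f_g; defined as 0 when both variances are 0
  shrink <- ifelse(prior_var + eta > 0, prior_var / (prior_var + eta), 0)
  # E[Delta_g^2] = (posterior mean)^2 + posterior variance
  expected_sq <- (shrink * x)^2 + shrink * eta
  sqrt(sum(expected_sq))
}

# Two genes, equal prior and measurement variance: differences are halved
d_noisy <- expected_distance_sketch(
  delta_c = c(1, 2), delta_c2 = c(0, 0),
  eps_c = c(0.5, 0.5), eps_c2 = c(0.5, 0.5),
  v_g = c(1, 1), alpha = 1
)

# No measurement uncertainty: reduces to the plain Euclidean distance
d_exact <- expected_distance_sketch(
  delta_c = c(1, 2), delta_c2 = c(0, 0),
  eps_c = c(0, 0), eps_c2 = c(0, 0),
  v_g = c(1, 1), alpha = 2
)
```

Under these assumptions, genes whose estimates are noisy relative to their prior variance contribute less to the distance, while precisely measured genes contribute close to their raw difference.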
SanityR provides two functions to generate synthetic datasets for benchmarking, using the generative process described in the original paper:
- simulate_independent_cells(): Simulates cells with independent gene expression profiles.
- simulate_branched_random_walk(): Simulates cells with pseudo-temporal trajectories forming a tree.
sce_indep <- simulate_independent_cells(N_cell = 100, N_gene = 50)
sce_branch <- simulate_branched_random_walk(N_path = 20, length_path = 5, N_gene = 50)

Both functions return a SingleCellExperiment object.
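For intuition about what an independent-cell simulation involves, here is a minimal base R sketch. It is not the package's implementation. It assumes counts are Poisson with rate $\lambda_c \alpha_g e^{\delta_{gc}}$, that $\alpha_g$ is drawn from a Gamma distribution (shape $a$, rate $b$), and that $\delta_{gc}$ is drawn independently from a zero-mean Gaussian with variance $v_g$. The function simulate_counts_sketch, its arguments, and the default values are all hypothetical.

```r
# Hypothetical sketch, not SanityR code: independent cells under an assumed
# Poisson-Gamma-Gaussian generative model.
simulate_counts_sketch <- function(n_cell, n_gene, a = 1, b = 1,
                                   v_g = 0.5, lambda_c = 10) {
  # Mean activity per gene (assumed Gamma prior with shape a and rate b)
  alpha_g <- rgamma(n_gene, shape = a, rate = b)
  # Independent log fold-changes per gene and cell (genes in rows)
  delta <- matrix(rnorm(n_gene * n_cell, sd = sqrt(v_g)),
                  nrow = n_gene, ncol = n_cell)
  # Cell-specific rate, recycled if a single value is given
  lambda_c <- rep_len(lambda_c, n_cell)
  # Poisson rate: alpha_g scales rows, lambda_c scales columns
  rate <- sweep(alpha_g * exp(delta), 2, lambda_c, `*`)
  matrix(rpois(length(rate), as.vector(rate)),
         nrow = n_gene, ncol = n_cell,
         dimnames = list(paste0("gene", seq_len(n_gene)),
                         paste0("cell", seq_len(n_cell))))
}

set.seed(1)
counts <- simulate_counts_sketch(n_cell = 100, n_gene = 50)
dim(counts)
```

The result is a plain genes-by-cells count matrix rather than a SingleCellExperiment object, which keeps the sketch free of package dependencies.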
Breda, J., Zavolan, M., & van Nimwegen, E. Bayesian inference of gene expression states from single-cell RNA-seq data. Nature Biotechnology, 39, 1008–1016 (2021). doi:10.1038/s41587-021-00875-x
Amezquita, R.A., Lun, A.T.L., Becht, E. et al. Orchestrating single-cell analysis with Bioconductor. Nature Methods 17, 137–145 (2020). doi:10.1038/s41592-019-0654-x