recall

Calibrated Clustering with Artificial Variables for Single-Cell RNA-Sequencing

📚 Documentation: https://zaoqu-liu.github.io/recall/

Overview

recall (Calibrated Clustering with Artificial Variables) is a statistical framework designed to protect against over-clustering in single-cell RNA-sequencing (scRNA-seq) data analysis by controlling for the impact of double-dipping.

In standard scRNA-seq pipelines, unsupervised clustering is used to identify biologically distinct cell types, and differential expression is then tested between the resulting clusters. Because the same data are used both to define the clusters and to test for differences between them (double-dipping), over-partitioned data yield artificially small P-values and inflated false discovery rates. recall addresses this statistical problem with a knockoff-inspired calibration procedure.
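
To see why double-dipping matters, the following minimal simulation (base R only, not part of recall) clusters data that contain no real structure and then tests the same data for differences between the clusters. The variable names and the use of k-means and t-tests are illustrative assumptions; the package itself works with Seurat clustering and differential expression tests.

```r
# Double-dipping demo: cluster pure noise, then test the same data.
set.seed(1)

n_cells <- 200
n_genes <- 50

# One homogeneous population: every gene is i.i.d. noise, so no true clusters exist.
expr <- matrix(rnorm(n_cells * n_genes), nrow = n_cells, ncol = n_genes)

# Force a two-way partition, as an over-clustering algorithm would.
clusters <- kmeans(expr, centers = 2, nstart = 10)$cluster

# Test each gene for a difference between clusters defined on the same data.
p_double_dip <- apply(expr, 2, function(g) t.test(g ~ clusters)$p.value)

# Reference: the same tests against labels that ignore the data (shuffled).
random_labels <- sample(clusters)
p_random <- apply(expr, 2, function(g) t.test(g ~ random_labels)$p.value)

# Fraction of genes called significant at the 0.05 level under each labelling.
frac_double_dip <- mean(p_double_dip < 0.05)
frac_random <- mean(p_random < 0.05)
cat("double-dipped:", frac_double_dip, " shuffled labels:", frac_random, "\n")
```

In a simulation like this, the share of "significant" genes is typically far above the nominal 5% for the data-derived clusters and close to 5% for the shuffled labels, even though no gene is truly differential.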

Key Features

FDR-controlled clustering: Integrates knockoff filter methodology to control false discovery rate
Algorithm agnostic: Compatible with Louvain, Leiden, and other clustering algorithms
Seurat integration: Works with Seurat v4 and v5 workflows
Cross-platform: Full support for Linux, macOS, and Windows
Scalable: Efficiently handles large-scale scRNA-seq datasets

Installation

From R-Universe (Recommended)

install.packages("recall", repos = "https://zaoqu-liu.r-universe.dev")

From GitHub

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")

devtools::install_github("Zaoqu-Liu/recall")

Optional: Install presto for faster differential expression

devtools::install_github("immunogenomics/presto")

Quick Start

library(Seurat)
library(recall)

# Load your single-cell data
# seurat_obj <- CreateSeuratObject(counts = your_counts_matrix)

# Standard Seurat preprocessing
seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- FindNeighbors(seurat_obj)
seurat_obj <- RunUMAP(seurat_obj, dims = 1:10)

# recall clustering (drop-in replacement for FindClusters)
seurat_obj <- FindClustersRecall(seurat_obj, resolution_start = 0.8)

# Visualize results
DimPlot(seurat_obj, group.by = "recall_clusters")

Methodology

The recall algorithm implements a three-stage calibration procedure:

Stage 1: Synthetic Null Variable Generation

Inspired by knockoff variables (Barber & Candès, 2015), we augment the expression matrix with synthetic "knockoff" genes that preserve the marginal distribution of real genes but are known a priori not to contribute to any biological signal. Supported generative models include:

ZIP: Zero-Inflated Poisson (default, fast)
NB: Negative Binomial
ZIP-copula: ZIP with Gaussian copula for gene-gene correlations
NB-copula: NB with Gaussian copula
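
As a rough illustration of the default ZIP model, the sketch below fits a zero-inflated Poisson to a single gene by the method of moments and draws a synthetic gene of the same length. `simulate_zip_knockoff()` is a hypothetical helper written for this example, not a recall function, and the estimator used inside the package may differ.

```r
# Hypothetical helper (not part of recall): fit a zero-inflated Poisson to one
# gene by the method of moments and draw a synthetic null gene of equal length.
simulate_zip_knockoff <- function(counts) {
  m <- mean(counts)
  v <- var(counts)
  if (m == 0) return(rep(0, length(counts)))  # all-zero gene stays all-zero
  excess <- max(v / m - 1, 0)   # equals pi * lambda under a ZIP model
  lambda <- m + excess          # Poisson rate of the non-dropout component
  pi_zero <- excess / lambda    # probability of a structural zero
  is_dropout <- rbinom(length(counts), size = 1, prob = pi_zero) == 1
  ifelse(is_dropout, 0, rpois(length(counts), lambda))
}

# Toy "real" gene: 30% structural zeros, Poisson(4) otherwise.
set.seed(42)
n_cells <- 2000
real_gene <- ifelse(rbinom(n_cells, 1, 0.3) == 1, 0, rpois(n_cells, 4))

# The knockoff is drawn without looking at cell identity, so it matches the
# gene's marginal distribution but cannot carry cluster-specific signal.
knockoff_gene <- simulate_zip_knockoff(real_gene)
cat("mean real:", mean(real_gene), " mean knockoff:", mean(knockoff_gene), "\n")
```

The moment identities used here are mean = (1 - pi) * lambda and variance / mean = 1 + pi * lambda; when the variance does not exceed the mean, the sketch falls back to a plain Poisson.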

Stage 2: Joint Clustering

Both original and knockoff features undergo identical preprocessing (normalization, scaling, PCA) and clustering, ensuring knockoffs experience the same double-dipping as real genes.
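
The idea of joint processing can be sketched in base R as follows. This is a stand-in for the Seurat steps, not the package's code: the toy data, the plain Poisson knockoffs, and the use of `prcomp()` and `kmeans()` are all assumptions made to keep the example self-contained.

```r
set.seed(7)
n_cells <- 300
n_genes <- 40

# Toy counts (genes x cells): two populations that differ in the first 10 genes.
group <- rep(c(1, 2), each = n_cells / 2)
rates <- matrix(2, nrow = n_genes, ncol = n_cells)
rates[1:10, group == 2] <- 8
real <- matrix(rpois(n_genes * n_cells, rates), nrow = n_genes)
rownames(real) <- paste0("gene", seq_len(n_genes))

# Stand-in knockoffs: Poisson draws matching each gene's mean, ignoring cell identity.
knockoff <- t(apply(real, 1, function(g) rpois(length(g), mean(g))))
rownames(knockoff) <- paste0("knockoff_", rownames(real))

# Augment once, then run a single shared pipeline on the combined matrix.
augmented <- rbind(real, knockoff)
size_factors <- colSums(augmented)
log_norm <- log1p(t(t(augmented) / size_factors) * 1e4)  # per-cell normalization
scaled <- t(scale(t(log_norm)))                          # centre and scale each gene
pcs <- prcomp(t(scaled), rank. = 10)$x                   # cells x 10 components
clusters <- kmeans(pcs, centers = 2, nstart = 10)$cluster

table(clusters, group)
```

Because real and knockoff genes pass through the same normalization, scaling, PCA, and clustering, any bias introduced by reusing the data affects both sets of genes alike.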

Stage 3: Knockoff-Calibrated Selection

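For each pair of clusters, every gene is tested for differential expression, and each original gene j is compared with its knockoff through the knockoff filter statistic: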

$$W_j = -\log_{10}(p_j^{\text{original}}) - (-\log_{10}(p_j^{\text{knockoff}}))$$

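A large positive W_j means that the real gene separates the two clusters much better than its knockoff does. Two clusters are merged if no gene passes the knockoff filter at the target FDR (default: 0.05). The algorithm iteratively reduces the clustering resolution until every pair of clusters shows statistically significant differential expression.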
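
The selection step can be sketched with the knockoff+ threshold of Barber & Candès (2015). `knockoff_select()` is a hypothetical helper for illustration only; whether recall uses exactly this threshold rule is an assumption, and the P-values below are simulated rather than taken from real data.

```r
# Hypothetical helper (not recall's API): knockoff+ selection.
# Returns the indices of genes whose W statistic clears the data-driven threshold.
knockoff_select <- function(p_original, p_knockoff, fdr = 0.05) {
  w <- -log10(p_original) - (-log10(p_knockoff))
  candidates <- sort(unique(abs(w[w != 0])))
  for (threshold in candidates) {
    # Knockoffs beating their originals (w <= -threshold) estimate false positives.
    fdp_estimate <- (1 + sum(w <= -threshold)) / max(1, sum(w >= threshold))
    if (fdp_estimate <= fdr) {
      return(which(w >= threshold))
    }
  }
  integer(0)  # no threshold meets the target FDR: nothing is selected
}

set.seed(3)
n_genes <- 500
n_signal <- 60
p_knockoff <- runif(n_genes)

# Case 1: the first 60 genes are strongly differential between two clusters.
p_original <- runif(n_genes)
p_original[1:n_signal] <- runif(n_signal, 0, 1e-8)
selected <- knockoff_select(p_original, p_knockoff)

# Case 2: no gene is differential, as for an over-split cluster pair.
p_null <- runif(n_genes)
selected_null <- knockoff_select(p_null, p_knockoff)

cat("selected with signal:", length(selected),
    " selected without signal:", length(selected_null), "\n")
```

Under the rule described above, a cluster pair like Case 2, where the filter selects no genes, would be a candidate for merging.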

Main Functions

| Function | Description |
|---|---|
| `FindClustersRecall()` | Main clustering function using knockoff calibration |
| `FindClustersCountsplit()` | Alternative method using count splitting |
| `seurat_workflow()` | Complete Seurat preprocessing pipeline |

Advanced Usage

Customizing the Null Distribution

# Use Negative Binomial with copula for better correlation modeling
seurat_obj <- FindClustersRecall(
  seurat_obj,
  null_method = "NB-copula",
  resolution_start = 1.0,
  reduction_percentage = 0.1,
  cores = 4
)
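
To get a feel for how `resolution_start` and `reduction_percentage` might interact, the sketch below assumes the resolution is multiplied by `1 - reduction_percentage` after each unsuccessful round. That update rule is an assumption, not something this README states, and `resolution_schedule()` is a hypothetical helper.

```r
# Assumed update rule (not confirmed by the README): shrink multiplicatively each round.
resolution_schedule <- function(resolution_start, reduction_percentage, n_rounds) {
  resolution_start * (1 - reduction_percentage)^(seq_len(n_rounds) - 1)
}

schedule <- resolution_schedule(
  resolution_start = 1.0,
  reduction_percentage = 0.1,
  n_rounds = 5
)
print(round(schedule, 4))
```

Under this assumption the settings above would try resolutions 1.0, 0.9, 0.81, 0.729, and 0.6561 in turn, so a smaller `reduction_percentage` gives a finer search at the cost of more clustering rounds.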

Using Count Splitting Alternative

# Count splitting approach (Neufeld et al., 2022)
seurat_obj <- FindClustersCountsplit(
  seurat_obj,
  resolution_start = 0.8,
  algorithm = "leiden"
)
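
Count splitting itself can be illustrated in a few lines of base R. The sketch shows the binomial thinning step described by Neufeld et al. on simulated Poisson counts; the variable names are illustrative, and how `FindClustersCountsplit()` applies the split internally is not shown here.

```r
set.seed(11)
n_cells <- 5000
epsilon <- 0.5  # share of each count assigned to the training split

# Simulated counts for one gene.
counts <- rpois(n_cells, lambda = 6)

# Binomial thinning: each observed count is divided at random between two parts.
train <- rbinom(n_cells, size = counts, prob = epsilon)
test <- counts - train

# For Poisson data the two parts are independent, so clusters estimated on
# `train` can be tested on `test` without reusing the same information.
split_correlation <- cor(train, test)
cat("mean train:", mean(train), " mean test:", mean(test),
    " correlation:", round(split_correlation, 3), "\n")
```

The independence of the two parts holds exactly only for Poisson counts; for overdispersed data the split parts remain correlated, which is one motivation for comparing this approach with the knockoff calibration.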

Citation

If you use recall in your research, please cite:

DenAdel, A., Ramseier, M., Navia, A., Shalek, A., Raghavan, S., Winter, P., Amini, A., & Crawford, L. (2025). A knockoff calibration method to avoid over-clustering in single-cell RNA-sequencing. American Journal of Human Genetics. https://doi.org/10.1016/j.ajhg.2025.01.001

@article{denadel2025knockoff,
  title={A knockoff calibration method to avoid over-clustering in single-cell RNA-sequencing},
  author={DenAdel, Alan and Ramseier, Megan and Navia, Andrew and Shalek, Alex and Raghavan, Srivatsan and Winter, Peter and Amini, Arash and Crawford, Lorin},
  journal={American Journal of Human Genetics},
  year={2025},
  publisher={Elsevier}
}

References

Barber, R. F., & Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5), 2055-2085.
Neufeld, A., Gao, L. L., Popp, J., Battle, A., & Witten, D. (2022). Inference after latent variable estimation for single-cell RNA sequencing data. Biostatistics, 25(1), 270-287.

Issues and Support

For questions, bug reports, or feature requests, please open an issue on GitHub or contact Zaoqu Liu.

License

This project is licensed under the MIT License - see the LICENSE file for details.
