recall

Calibrated Clustering with Artificial Variables for Single-Cell RNA-Sequencing


📚 Documentation: https://zaoqu-liu.github.io/recall/

Overview

recall (Calibrated Clustering with Artificial Variables) is a statistical framework designed to protect against over-clustering in single-cell RNA-sequencing (scRNA-seq) data analysis by controlling for the impact of double-dipping.

In standard scRNA-seq pipelines, unsupervised clustering is used to identify putative cell types, and differential expression is then tested between the resulting clusters. Because the same data are used both to define the clusters and to test them (double-dipping), over-partitioned clusters yield artificially small P-values and an inflated false discovery rate. recall addresses this statistical problem with a knockoff-inspired calibration procedure.
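
The effect of double-dipping can be seen in a small simulation. The sketch below uses base R only, with k-means standing in for graph-based clustering and Welch t-tests standing in for differential expression testing; none of it is recall's own code. All cells are drawn from a single homogeneous population, so any gene called significant between the two forced clusters is a false discovery.

```r
# Simulate one homogeneous population: 200 cells x 20 genes, no true clusters.
set.seed(1)
n_cells <- 200
n_genes <- 20
expr <- matrix(rnorm(n_cells * n_genes), nrow = n_cells, ncol = n_genes)

# Force a two-way partition of the null data
# (k-means is a stand-in for Louvain/Leiden clustering).
clusters <- kmeans(expr, centers = 2, nstart = 10)$cluster

# Double-dipping: test each gene between clusters that were
# defined from the very same expression values.
p_double_dip <- apply(expr, 2, function(g) {
  t.test(g[clusters == 1], g[clusters == 2])$p.value
})

# Reference: the same tests with the labels shuffled, so that the
# grouping is independent of the expression values.
random_labels <- sample(clusters)
p_random <- apply(expr, 2, function(g) {
  t.test(g[random_labels == 1], g[random_labels == 2])$p.value
})

# Fraction of genes called significant at the 0.05 level under each scheme.
frac_double_dip <- mean(p_double_dip < 0.05)
frac_random <- mean(p_random < 0.05)
```

Comparing frac_double_dip with frac_random shows how much of the apparent signal comes from reusing the data rather than from any real difference between groups.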

Key Features

  • FDR-controlled clustering: Integrates knockoff filter methodology to control false discovery rate
  • Algorithm agnostic: Compatible with Louvain, Leiden, and other clustering algorithms
  • Seurat integration: Seamless integration with Seurat V4 and V5 workflows
  • Cross-platform: Full support for Linux, macOS, and Windows
  • Scalable: Efficiently handles large-scale scRNA-seq datasets

Installation

From R-Universe (Recommended)

install.packages("recall", repos = "https://zaoqu-liu.r-universe.dev")

From GitHub

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")

devtools::install_github("Zaoqu-Liu/recall")

Optional: Install presto for faster differential expression

devtools::install_github("immunogenomics/presto")

Quick Start

library(Seurat)
library(recall)

# Load your single-cell data
# seurat_obj <- CreateSeuratObject(counts = your_counts_matrix)

# Standard Seurat preprocessing
seurat_obj <- NormalizeData(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- FindNeighbors(seurat_obj)
seurat_obj <- RunUMAP(seurat_obj, dims = 1:10)

# recall clustering (drop-in replacement for FindClusters)
seurat_obj <- FindClustersRecall(seurat_obj, resolution_start = 0.8)

# Visualize results
DimPlot(seurat_obj, group.by = "recall_clusters")

Methodology

The recall algorithm implements a three-stage calibration procedure:

Stage 1: Synthetic Null Variable Generation

Inspired by knockoff variables (Barber & Candès, 2015), we augment the expression matrix with synthetic "knockoff" genes that preserve the marginal distribution of real genes but are known a priori not to contribute to any biological signal. Supported generative models include:

  • ZIP: Zero-Inflated Poisson (default, fast)
  • NB: Negative Binomial
  • ZIP-copula: ZIP with Gaussian copula for gene-gene correlations
  • NB-copula: NB with Gaussian copula
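
As a rough illustration of how a ZIP knockoff gene could be generated, the sketch below fits a Zero-Inflated Poisson to each gene by the method of moments and then samples a synthetic gene of the same length. The helper names (fit_zip_moments, sample_zip, make_knockoffs) and the moment-based estimator are assumptions made for this example; the package's own fitting routine may differ.

```r
# Method-of-moments fit of a Zero-Inflated Poisson to one gene.
# Model: X = 0 with probability pi0, otherwise X ~ Poisson(lambda).
# Moments: mean = (1 - pi0) * lambda, var = mean * (1 + pi0 * lambda).
fit_zip_moments <- function(counts) {
  m <- mean(counts)
  v <- var(counts)
  if (m == 0) return(list(pi0 = 1, lambda = 0))   # gene is all zeros
  if (v <= m) return(list(pi0 = 0, lambda = m))   # no excess zeros: plain Poisson
  lambda <- m + v / m - 1
  list(pi0 = 1 - m / lambda, lambda = lambda)
}

# Draw n values from a ZIP with the given parameters.
sample_zip <- function(n, pi0, lambda) {
  structural_zero <- rbinom(n, size = 1, prob = pi0)
  (1L - structural_zero) * rpois(n, lambda)
}

# Build one knockoff per gene for a genes x cells count matrix.
# Each knockoff is sampled independently of the cells' identities,
# so it carries no cluster signal by construction.
make_knockoffs <- function(counts) {
  ko <- t(apply(counts, 1, function(g) {
    fit <- fit_zip_moments(g)
    sample_zip(length(g), fit$pi0, fit$lambda)
  }))
  rownames(ko) <- paste0("knockoff_", seq_len(nrow(ko)))
  ko
}

# Toy data: three genes measured in 500 cells.
set.seed(42)
n_cells <- 500
real_counts <- rbind(
  gene_a = sample_zip(n_cells, pi0 = 0.6, lambda = 4),
  gene_b = sample_zip(n_cells, pi0 = 0.2, lambda = 1),
  gene_c = rpois(n_cells, lambda = 2)
)
knockoff_counts <- make_knockoffs(real_counts)
```

The copula variants would additionally model dependence between genes; this sketch treats every gene independently.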

Stage 2: Joint Clustering

Both original and knockoff features undergo identical preprocessing (normalization, scaling, PCA) and clustering, ensuring knockoffs experience the same double-dipping as real genes.
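
The joint step can be sketched in base R as follows. This is an illustration under simplifying assumptions, not the package's code: the knockoffs are plain Poisson draws matched to each gene's mean, the preprocessing mimics log-normalization and scaling by hand, and k-means replaces the graph-based clustering that Seurat would run.

```r
set.seed(7)
n_cells <- 300
n_genes <- 40

# Toy counts (genes x cells): two true groups that differ in the first 10 genes.
group <- rep(c(1, 2), each = n_cells / 2)
mu <- matrix(2, nrow = n_genes, ncol = n_cells)
mu[1:10, group == 2] <- 8
real_counts <- matrix(rpois(n_genes * n_cells, mu), nrow = n_genes)
rownames(real_counts) <- paste0("gene_", seq_len(n_genes))

# Stand-in knockoffs: independent Poisson draws matching each gene's mean.
knockoff_counts <- t(apply(real_counts, 1, function(g) rpois(length(g), mean(g))))
rownames(knockoff_counts) <- paste0("knockoff_", seq_len(n_genes))

# Augment the matrix so that real and knockoff features are processed together.
augmented <- rbind(real_counts, knockoff_counts)

# Identical preprocessing for every feature:
# library-size normalization, log transform, per-feature scaling, PCA.
lib_size <- colSums(augmented)
normalized <- log1p(t(t(augmented) / lib_size) * 1e4)
scaled <- t(scale(t(normalized)))
pca <- prcomp(t(scaled), center = FALSE, rank. = 10)

# Cluster the cells on the joint embedding (k-means as a stand-in).
clusters <- kmeans(pca$x, centers = 2, nstart = 10)$cluster
```

Because the knockoff rows pass through the same normalization, embedding, and clustering as the real rows, any P-value deflation caused by testing on the clustered data affects both sets of features alike.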

Stage 3: Knockoff-Calibrated Selection

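For each pair of clusters, we compute a knockoff filter statistic for every gene $j$, comparing the differential expression P-value of the original gene with that of its knockoff: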

$$W_j = -\log_{10}(p_j^{\text{original}}) - (-\log_{10}(p_j^{\text{knockoff}}))$$

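A large positive $W_j$ means that the original gene separates the two clusters more strongly than its knockoff does. Clusters are merged if no genes pass the knockoff filter at the target FDR (default: 0.05). The algorithm iteratively reduces the clustering resolution until every cluster pair exhibits statistically significant differential expression.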
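
The selection step can be sketched in base R as follows. The threshold rule is the knockoff+ threshold of Barber & Candès (2015); whether recall uses an offset of 1 or the less conservative offset of 0 is not stated here, so the offset argument, the helper names, and the toy P-values are all assumptions.

```r
# Knockoff filter statistic for each gene, as defined above.
knockoff_statistic <- function(p_original, p_knockoff) {
  -log10(p_original) - (-log10(p_knockoff))
}

# Smallest threshold whose estimated false discovery proportion is <= fdr.
# offset = 1 gives the knockoff+ rule; returns Inf if no threshold qualifies.
knockoff_threshold <- function(W, fdr = 0.05, offset = 1) {
  candidates <- sort(unique(abs(W[W != 0])))
  for (thr in candidates) {
    fdp_estimate <- (offset + sum(W <= -thr)) / max(1, sum(W >= thr))
    if (fdp_estimate <= fdr) return(thr)
  }
  Inf
}

# A cluster pair is a merge candidate when no gene passes the filter.
should_merge <- function(p_original, p_knockoff, fdr = 0.05) {
  W <- knockoff_statistic(p_original, p_knockoff)
  thr <- knockoff_threshold(W, fdr = fdr)
  sum(W >= thr) == 0
}

# Toy pair 1: 30 of 40 real genes are far more significant than their knockoffs.
p_distinct <- c(rep(1e-8, 30), rep(0.5, 10))
p_distinct_ko <- rep(0.5, 40)
merge_distinct <- should_merge(p_distinct, p_distinct_ko)  # FALSE

# Toy pair 2: real genes look no better than knockoffs.
p_similar <- c(0.2, 0.5, 0.9, 0.04)
p_similar_ko <- c(0.5, 0.2, 0.04, 0.9)
merge_similar <- should_merge(p_similar, p_similar_ko)     # TRUE
```

With offset = 1 and a target FDR of 0.05, the estimated false discovery proportion is at least 1 divided by the number of selected genes, so at least 20 genes must clear the threshold before any are selected.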

Main Functions

| Function | Description |
| --- | --- |
| FindClustersRecall() | Main clustering function using knockoff calibration |
| FindClustersCountsplit() | Alternative method using count splitting |
| seurat_workflow() | Complete Seurat preprocessing pipeline |

Advanced Usage

Customizing the Null Distribution

# Use Negative Binomial with copula for better correlation modeling
seurat_obj <- FindClustersRecall(
  seurat_obj,
  null_method = "NB-copula",
  resolution_start = 1.0,
  reduction_percentage = 0.1,
  cores = 4
)
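
To give a feel for how resolution_start and reduction_percentage might interact, the sketch below assumes that each failed calibration round shrinks the resolution by the factor (1 - reduction_percentage). That update rule, the function calibrate_resolution, and the has_spurious_pair callback are hypothetical and written for illustration only; consult the package documentation for the actual behavior.

```r
# Hypothetical outer loop: lower the resolution until no cluster pair
# fails the knockoff filter, or until max_iter rounds have been tried.
calibrate_resolution <- function(resolution_start, reduction_percentage,
                                 has_spurious_pair, max_iter = 100) {
  resolution <- resolution_start
  for (i in seq_len(max_iter)) {
    if (!has_spurious_pair(resolution)) break
    # Assumed update rule: shrink the resolution by a fixed percentage.
    resolution <- resolution * (1 - reduction_percentage)
  }
  resolution
}

# Toy stand-in for "cluster, test, and filter": pretend that every
# clustering above resolution 0.5 contains at least one spurious pair.
toy_check <- function(resolution) resolution > 0.5

final_resolution <- calibrate_resolution(
  resolution_start = 1.0,
  reduction_percentage = 0.1,
  has_spurious_pair = toy_check
)
```

Under this assumed rule, a smaller reduction_percentage searches the resolution range more finely at the cost of more clustering rounds.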

Using Count Splitting Alternative

# Count splitting approach (Neufeld et al., 2022)
seurat_obj <- FindClustersCountsplit(
  seurat_obj,
  resolution_start = 0.8,
  algorithm = "leiden"
)
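
The idea behind count splitting (Neufeld et al., 2022) can be sketched in a few lines of base R: each count is thinned binomially into a training part and a test part, clustering is run on the training part, and differential expression is tested on the test part. The function count_split and its epsilon argument are names chosen for this example and are not part of the package's API.

```r
# Split a genes x cells matrix of integer counts into two matrices
# that sum back to the original. If the counts are Poisson, the two
# parts are independent, so labels learned on `train` can be tested
# on `test` without double-dipping.
count_split <- function(counts, epsilon = 0.5) {
  train <- matrix(
    rbinom(length(counts), size = counts, prob = epsilon),
    nrow = nrow(counts),
    dimnames = dimnames(counts)
  )
  list(train = train, test = counts - train)
}

# Toy data: 20 genes x 50 cells of Poisson counts.
set.seed(123)
counts <- matrix(rpois(20 * 50, lambda = 3), nrow = 20)
split_counts <- count_split(counts, epsilon = 0.5)
```

The independence of the two parts relies on the Poisson assumption; for overdispersed counts the split is only approximately independent.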

Citation

If you use recall in your research, please cite:

DenAdel, A., Ramseier, M., Navia, A., Shalek, A., Raghavan, S., Winter, P., Amini, A., & Crawford, L. (2025). A knockoff calibration method to avoid over-clustering in single-cell RNA-sequencing. American Journal of Human Genetics. https://doi.org/10.1016/j.ajhg.2025.01.001

@article{denadel2025knockoff,
  title={A knockoff calibration method to avoid over-clustering in single-cell RNA-sequencing},
  author={DenAdel, Alan and Ramseier, Megan and Navia, Andrew and Shalek, Alex and Raghavan, Srivatsan and Winter, Peter and Amini, Arash and Crawford, Lorin},
  journal={American Journal of Human Genetics},
  year={2025},
  publisher={Elsevier}
}

References

  • Barber, R. F., & Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5), 2055-2085.
  • Neufeld, A., Gao, L. L., Popp, J., Battle, A., & Witten, D. (2022). Inference after latent variable estimation for single-cell RNA sequencing data. Biostatistics, 25(1), 270-287.

Issues and Support

For questions, bug reports, or feature requests, please open an issue on GitHub or contact Zaoqu Liu.

License

This project is licensed under the MIT License - see the LICENSE file for details.
