omicverse/py-Augur

py-Augur

A pure-Python re-implementation of Augur (Skinnider et al., Nature Biotechnology 2021) for cell type prioritization in high-dimensional single-cell data.

  • AnnData-native — drop-in for the scanpy ecosystem
  • No rpy2, no R install, no Augur R package dependency
  • Ranking-faithful to R Augur: on the sc_sim benchmark, the cell type ranking is preserved exactly (Spearman rho = 1.0) and AUCs correlate at Pearson r = 0.9999, although absolute AUC values run about 0.10 to 0.13 higher than R's
  • Full pipeline: variance-based feature selection (loess on CV vs mean), random subsampling, stratified k-fold cross-validation with RF/LR classifiers

Install

pip install pyaugur

Quick Start

import pandas as pd
from pyaugur import calculate_auc, plot_lollipop, plot_important_features

# Expression matrix: genes x cells
expr = pd.read_csv("expression.csv", index_col=0).values
meta = pd.read_csv("metadata.csv")  # columns: cell_type, label

result = calculate_auc(expr, meta=meta)
print(result["AUC"])  # Mean AUC per cell type, ranked

# Visualize results
plot_lollipop(result)
plot_important_features(result)  # Top features per cell type

Results are returned as a dictionary:

| Key | Contents |
|---|---|
| result['AUC'] | DataFrame: mean AUC per cell type, ranked by prioritization |
| result['results'] | DataFrame: per-subsample AUC for each cell type |
| result['feature_importance'] | DataFrame: feature importance scores per cell type |
| result['parameters'] | dict: classifier, folds, subsample size, etc. |

Pipeline overview

The pyaugur pipeline mirrors the R Augur workflow step-for-step:

1. Feature selection — select_variance

Select informative genes based on variance. Uses a loess fit of coefficient of variation (CV) vs mean expression, retaining genes above the specified quantile threshold. Matches R's select_variance() with filter_negative_residuals option.
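As a rough illustration of the idea, the sketch below fits a lowess trend of CV against mean expression and keeps genes whose residual lies above a quantile. It is a simplified stand-in, not pyaugur's select_variance: the function name select_by_cv_residual, the synthetic data, and the handling of zero-mean genes are all assumptions made for the example.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess


def select_by_cv_residual(mat, var_quantile=0.5, it=2):
    """Return a boolean mask of genes whose CV sits above a lowess trend.

    mat is a genes x cells array of non-negative expression values.
    Hypothetical helper for illustration only.
    """
    mat = np.asarray(mat, dtype=float)
    means = mat.mean(axis=1)
    sds = mat.std(axis=1, ddof=1)
    expressed = means > 0                      # CV is undefined at mean 0
    cv = sds[expressed] / means[expressed]

    # Fitted trend, returned in the input order (return_sorted=False).
    trend = lowess(cv, means[expressed], it=it, return_sorted=False)
    residuals = cv - trend

    # Keep genes whose residual exceeds the requested quantile.
    keep = np.zeros(mat.shape[0], dtype=bool)
    keep[expressed] = residuals > np.quantile(residuals, var_quantile)
    return keep


# Synthetic genes x cells matrix (200 genes, 60 cells), for demonstration.
rng = np.random.default_rng(0)
scale = rng.uniform(0.5, 5.0, size=(200, 1))
expr = rng.gamma(shape=2.0, scale=scale, size=(200, 60))

mask = select_by_cv_residual(expr, var_quantile=0.5)
print(mask.sum(), "of", mask.size, "genes retained")
```

With var_quantile=0.5 roughly half of the expressed genes survive; raising the quantile makes the filter stricter.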

2. Random subsampling — select_random

Randomly subsample a fraction of selected features for each subsample iteration. Reduces overfitting and improves robustness of prioritization scores.
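A minimal sketch of this step, assuming features are addressed by integer index and sampled without replacement. The helper name sample_features is hypothetical; pyaugur's select_random may differ in its rounding and seeding.

```python
import numpy as np


def sample_features(n_features, feature_perc=0.5, rng=None):
    """Pick a random subset of feature indices, without replacement."""
    rng = np.random.default_rng(rng)
    n_keep = max(1, int(round(n_features * feature_perc)))
    return np.sort(rng.choice(n_features, size=n_keep, replace=False))


# Draw a fresh subset for each subsample iteration.
idx = sample_features(1000, feature_perc=0.5, rng=0)
print(len(idx), idx[:5])
```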

3. Classifier training & cross-validation

For each cell type and subsample:

  1. Subset cells of that type
  2. Split into stratified k-fold train/test sets
  3. Train a Random Forest (or Logistic Regression) classifier to predict condition labels
  4. Evaluate AUC on held-out folds
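The four steps above can be sketched with scikit-learn for a single cell type. This is an illustration under stated assumptions (binary labels, 3 folds, the stock RandomForestClassifier, synthetic data), not the code path pyaugur runs; cell_type_auc is a hypothetical name.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def cell_type_auc(X, y, folds=3, seed=0):
    """Mean held-out AUC for predicting binary labels y from X (cells x genes)."""
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    aucs = []
    for train, test in cv.split(X, y):
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X[train], y[train])
        prob = clf.predict_proba(X[test])[:, 1]   # probability of label 1
        aucs.append(roc_auc_score(y[test], prob))
    return float(np.mean(aucs))


rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)                  # 30 control cells, 30 treated cells

# A "responsive" cell type: five genes shift under treatment.
X_responsive = rng.normal(size=(60, 20))
X_responsive[y == 1, :5] += 2.0

# An "unresponsive" cell type: no gene carries label information.
X_null = rng.normal(size=(60, 20))

auc_responsive = cell_type_auc(X_responsive, y)
auc_null = cell_type_auc(X_null, y)
print(f"responsive: {auc_responsive:.2f}, unresponsive: {auc_null:.2f}")
```

The responsive cell type should score close to 1, while the unresponsive one hovers around chance.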

4. Aggregation

Average AUC across all subsamples and folds per cell type. Cell types with higher AUC are more differentially responsive to the experimental perturbation — i.e., more "prioritized."
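The aggregation amounts to a group-by and a mean. The sketch below uses a small hand-made table; the column names (cell_type, subsample, fold, auc) are assumptions for the example and may not match the columns pyaugur returns.

```python
import pandas as pd

# One row per (cell type, subsample, fold) evaluation; values are made up.
results = pd.DataFrame({
    "cell_type": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "subsample": [1, 1, 2, 2, 1, 1, 2, 2],
    "fold":      [1, 2, 1, 2, 1, 2, 1, 2],
    "auc":       [0.55, 0.57, 0.53, 0.55, 0.85, 0.87, 0.83, 0.85],
})

# Mean AUC per cell type, highest (most prioritized) first.
ranked = (
    results.groupby("cell_type")["auc"]
    .mean()
    .sort_values(ascending=False)
    .rename("AUC")
    .reset_index()
)
print(ranked)
```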


Algorithmic fidelity to R Augur

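Each step is designed to track the R reference implementation as closely as possible. In practice, cell type rankings match exactly, while absolute AUC values differ because the underlying loess and random forest implementations are not identical.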

1. Variance feature selection — statsmodels lowess (it=2)

R's select_variance() uses loess(CV ~ mean) with 4 robustness iterations. Our implementation uses the lowess function from statsmodels.nonparametric.smoothers_lowess (Cython implementation, it=2), which tracks R's loess more closely than it=0 and yields Pearson r = 0.9999 on feature selection residuals.

2. Random Forest — custom _FastRandomForest

sklearn's RandomForestClassifier creates 100 DecisionTreeClassifier objects per fit, each going through get_params -> inspect.signature -> _validate_params. With 450 fits x 100 trees = 45,000 estimator creations, this overhead dominates. Our custom _FastRandomForest builds trees directly with DecisionTreeClassifier.fit(), skipping parameter validation while preserving identical bootstrap + decision tree behavior.
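To make the idea concrete, here is a minimal sketch of a forest assembled by hand: draw a bootstrap sample, fit a DecisionTreeClassifier on it directly, and average the per-tree probabilities. The class TinyForest is hypothetical and only illustrates the bootstrap-plus-trees structure; it is not the _FastRandomForest source, and choices such as max_features="sqrt" are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


class TinyForest:
    """Bootstrap-aggregated decision trees, built without the ensemble wrapper."""

    def __init__(self, n_trees=100, seed=0):
        self.n_trees = n_trees
        self.seed = seed

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        n = X.shape[0]
        self.classes_ = np.unique(y)
        self.trees_ = []
        for _ in range(self.n_trees):
            rows = rng.integers(0, n, size=n)      # bootstrap sample of cells
            tree = DecisionTreeClassifier(
                max_features="sqrt",
                random_state=int(rng.integers(0, 2**31 - 1)),
            )
            tree.fit(X[rows], y[rows])
            self.trees_.append(tree)
        return self

    def predict_proba(self, X):
        proba = np.zeros((X.shape[0], self.classes_.size))
        for tree in self.trees_:
            # A bootstrap sample can miss a class, so align columns by label.
            cols = np.searchsorted(self.classes_, tree.classes_)
            proba[:, cols] += tree.predict_proba(X)
        return proba / len(self.trees_)


rng = np.random.default_rng(1)
y = np.repeat([0, 1], 40)
X = rng.normal(size=(80, 10))
X[y == 1, :3] += 2.5                               # three informative features

forest = TinyForest(n_trees=25).fit(X, y)
proba = forest.predict_proba(X)
print(proba[:3])
```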

3. Cross-validation — stratified k-fold

Uses sklearn's StratifiedKFold to match R's vfold_cv() from rsample, preserving class proportions in each fold.
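A small check of what "preserving class proportions" means in practice, using toy labels with a 2:1 imbalance. The label counts and the number of folds are chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 60 + [1] * 30)      # one third of cells carry label 1
X = np.zeros((y.size, 1))              # features are irrelevant to the split

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
fractions = [float(y[test].mean()) for _, test in cv.split(X, y)]
print(fractions)                       # share of label 1 in each held-out fold
```

Each held-out fold keeps the same one-third share of label 1 as the full data set.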

4. Feature importance — Gini importance

Feature importance extracted from tree-based classifiers via Gini impurity reduction, matching R's randomForest importance() output.
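For reference, this is how impurity-based importance is read from a fitted scikit-learn forest. The data and gene names are synthetic. Note that scikit-learn normalizes feature_importances_ to sum to 1, so the scale of these scores need not match what other packages report.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 8))
X[y == 1, 0] += 3.0                              # only gene_0 is informative
genes = [f"gene_{i}" for i in range(8)]

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in Gini impurity per feature, averaged over trees.
importance = dict(zip(genes, clf.feature_importances_))
top = max(importance, key=importance.get)
print(top, round(importance[top], 3))
```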


Benchmarks

All metrics computed against R Augur v1.0.3 on the sc_sim dataset (15,697 genes x 600 cells, 3 cell types, 50 subsamples).

Numerical accuracy

| Metric | Value | Gate | Status |
|---|---|---|---|
| Pearson r (AUC) | 0.9999 | >= 0.95 | PASS |
| Spearman rho (ranking) | 1.0000 | = 1.0 | PASS |
| Ranking preserved | CellTypeC > CellTypeB > CellTypeA | | PASS |

Per cell type AUC

| Cell Type | R | Python | Diff (Python - R) |
|---|---|---|---|
| CellTypeA | 0.5535 | 0.6804 | +0.1269 |
| CellTypeB | 0.7467 | 0.8551 | +0.1084 |
| CellTypeC | 0.8795 | 0.9826 | +0.1031 |

Absolute AUC values differ due to different loess/RF implementations, but relative ranking and correlation are preserved.
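The parity metrics can be recomputed from the three per-cell-type AUCs in the table above. The snippet uses only NumPy; Spearman rho is computed as the Pearson correlation of the ranks, which is valid here because there are no ties.

```python
import numpy as np

r_auc = np.array([0.5535, 0.7467, 0.8795])    # R Augur: CellTypeA, B, C
py_auc = np.array([0.6804, 0.8551, 0.9826])   # pyaugur: CellTypeA, B, C


def ranks(x):
    """Rank positions (0 = smallest); assumes no tied values."""
    return np.argsort(np.argsort(x))


pearson = np.corrcoef(r_auc, py_auc)[0, 1]
spearman = np.corrcoef(ranks(r_auc), ranks(py_auc))[0, 1]
diff = py_auc - r_auc

print(f"Pearson r    = {pearson:.4f}")
print(f"Spearman rho = {spearman:.4f}")
print("Python - R   =", np.round(diff, 4))
```

A near-constant offset of about +0.10 to +0.13 leaves both correlation measures high, which is why ranking parity and a sizeable absolute difference can coexist.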

Speed comparison

| Function | R | Python | Speed-up |
|---|---|---|---|
| calculate_auc | 227.8 s | 59.8 s | 3.8x |

Key optimizations

| Optimization | Description | Impact |
|---|---|---|
| statsmodels Cython lowess | Replaced custom O(n^2) loess with Cython lowess (it=2) | Feature selection: ~2x faster |
| Custom _FastRandomForest | Bypass sklearn parameter validation (45k object creations) | RF training: ~3x faster |
| Sequential execution | n_jobs=1 avoids joblib overhead on small datasets | Faster than n_jobs=-1 |

Same algorithm. Same inputs. 3.8x faster. Spearman rho = 1.0.


Notebooks

| Notebook | What it covers |
|---|---|
| examples/quickstart.ipynb | Quick-start guide: load data, run Augur, inspect results |
| examples/benchmark_R_vs_Python.ipynb | Live benchmark comparing Python vs R outputs with parity metrics |
| examples/function_mapping.ipynb | R-to-Python function mapping reference |

API reference

Core functions

from pyaugur import (
    calculate_auc,                       # Main entry point
    calculate_differential_prioritization,  # Permutation test
    select_variance,                     # Variance-based feature selection
    select_random,                       # Random feature subsampling
)

calculate_auc(input, meta=None, ...)

Train a classifier to predict condition labels per cell type, evaluate AUC in cross-validation.

Returns: dict with AUC (DataFrame), results, feature_importance, parameters.

calculate_differential_prioritization(augur1, augur2, permuted1, permuted2, ...)

Permutation test for differential prioritization between two conditions.
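For intuition, a generic two-sided permutation p-value on a difference in AUC can be sketched as below. This is not pyaugur's implementation: the test statistic, the function name permutation_pvalue, and the simulated null distribution are all assumptions made for the example.

```python
import numpy as np


def permutation_pvalue(observed_delta, null_deltas):
    """Two-sided p-value with the usual +1 correction for permutation tests."""
    null = np.asarray(null_deltas, dtype=float)
    exceed = int(np.sum(np.abs(null) >= abs(observed_delta)))
    return (exceed + 1) / (null.size + 1)


# Simulated null: AUC differences obtained after shuffling condition labels.
rng = np.random.default_rng(0)
null = rng.normal(loc=0.0, scale=0.02, size=999)

p_large = permutation_pvalue(0.15, null)   # far outside the null spread
p_zero = permutation_pvalue(0.0, null)     # no difference at all
print(p_large, p_zero)
```

With 999 permutations the smallest attainable p-value is 1/1000, so the number of permutations bounds the resolution of the test.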

select_variance(mat, var_quantile=0.5)

Feature selection by variance (loess on CV vs mean expression).

select_random(mat, feature_perc=0.5)

Random feature subsampling.

Visualization functions

from pyaugur import (
    plot_lollipop,                    # Lollipop plot for AUC/CCC values
    plot_umap,                        # UMAP embeddings with omicverse style
    plot_important_features,          # Top important features per cell type
    plot_differential_prioritization, # Differential prioritization scatter
)

plot_lollipop(augur_results, top_n=None, ...)

Create a lollipop plot showing cell type priorities (AUC values) ranked.

plot_umap(input, augur_results, mode="default", ...)

Superimpose cell type prioritizations onto a UMAP plot. Styled after omicverse embeddings: axis arrows (frameon='small'), right-side colorbar, rasterized scatter points.

plot_important_features(augur_results, cell_type=None, top_n=10, ...)

Plot the most important features (genes) for a cell type.

plot_differential_prioritization(results, top_n=0, ...)

Plot differential prioritization results highlighting statistically significant cell types.


Citation

If you use this package, please cite the original Augur paper:

Skinnider, M. A. et al. Cell type prioritization in single-cell data. Nature Biotechnology 39, 30–34 (2021).

and acknowledge this repo for the Python port.

License

GNU GPLv3 — matches the upstream R Augur package.
