A pure-Python re-implementation of Augur (Skinnider et al., Nature Biotechnology 2021) for cell type prioritization in high-dimensional single-cell data.
- AnnData-native — drop-in for the scanpy ecosystem
- No rpy2, no R install, no Augur R package dependency
- Numerically faithful to R Augur: AUC ranking perfectly preserved (Spearman rho = 1.0), Pearson r = 0.9999 on benchmark datasets
- Full pipeline: variance-based feature selection (loess on CV vs mean), random subsampling, stratified k-fold cross-validation with RF/LR classifiers
```bash
pip install pyaugur
```

```python
import numpy as np
import pandas as pd
from pyaugur import calculate_auc, plot_lollipop, plot_umap, plot_important_features

# Expression matrix: genes x cells
expr = pd.read_csv("expression.csv", index_col=0).values
meta = pd.read_csv("metadata.csv")  # columns: cell_type, label

result = calculate_auc(expr, meta=meta)
print(result["AUC"])  # Mean AUC per cell type, ranked

# Visualize results
plot_lollipop(result)
plot_important_features(result)  # Top features per cell type
```

Results are returned as a dictionary:
| Key | Contents |
|---|---|
| result['AUC'] | DataFrame: mean AUC per cell type, ranked by prioritization |
| result['results'] | DataFrame: per-subsample AUC for each cell type |
| result['feature_importance'] | DataFrame: feature importance scores per cell type |
| result['parameters'] | dict: classifier, folds, subsample size, etc. |
The pyaugur pipeline mirrors the R Augur workflow step-for-step:
Select informative genes based on variance: a loess curve is fit to each gene's coefficient of variation (CV) as a function of its mean expression, and genes whose residual lies above the specified quantile threshold are retained. This matches R's select_variance(), including its filter_negative_residuals option.
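The selection rule can be sketched with NumPy alone. This is a simplified illustration, not the package's code: the smoother below is a hand-rolled tricube-weighted local linear fit without robustness iterations (a stand-in for loess), and the names `local_linear_smooth`, `select_variable_genes`, and `var_quantile` are assumptions made for the example. The `filter_negative_residuals` option is not modeled.

```python
import numpy as np


def local_linear_smooth(x, y, frac=0.5):
    """Tricube-weighted local linear fit evaluated at each x (no robustness iterations)."""
    n = len(x)
    k = min(n, max(3, int(np.ceil(frac * n))))  # neighbours used in each local fit
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        nearest = np.argsort(dist)[:k]
        bandwidth = dist[nearest].max()
        if bandwidth == 0:  # all neighbours share the same x value
            fitted[i] = y[nearest].mean()
            continue
        weights = (1 - (dist[nearest] / bandwidth) ** 3) ** 3  # tricube kernel
        sqrt_w = np.sqrt(weights)
        design = np.column_stack([np.ones(k), x[nearest]])
        # Weighted least squares for a local intercept and slope
        coef, *_ = np.linalg.lstsq(design * sqrt_w[:, None], y[nearest] * sqrt_w, rcond=None)
        fitted[i] = coef[0] + coef[1] * x[i]
    return fitted


def select_variable_genes(expr, var_quantile=0.5):
    """Return a boolean mask over the rows (genes) of a genes x cells matrix."""
    mean = expr.mean(axis=1)
    sd = expr.std(axis=1, ddof=1)
    expressed = mean > 0  # CV is undefined for all-zero genes
    cv = sd[expressed] / mean[expressed]
    residual = cv - local_linear_smooth(mean[expressed], cv)
    keep = np.zeros(expr.shape[0], dtype=bool)
    # Keep genes whose CV sits further above the trend than the chosen quantile
    keep[expressed] = residual > np.quantile(residual, var_quantile)
    return keep


rng = np.random.default_rng(0)
toy_expr = rng.gamma(shape=2.0, scale=1.0, size=(200, 60))  # synthetic genes x cells
mask = select_variable_genes(toy_expr, var_quantile=0.5)
```

With `var_quantile=0.5`, roughly half of the expressed genes survive; raising the quantile keeps only genes that are more variable than their mean expression would predict.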
Randomly subsample a fraction of the selected features in each subsample iteration, which reduces overfitting and makes the prioritization scores more robust.
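A minimal sketch of this step, assuming features are addressed by integer index. The function name `subsample_features` and the parameter `feature_perc` are illustrative and may not match the package's `select_random` signature.

```python
import numpy as np


def subsample_features(n_features, feature_perc=0.5, seed=None):
    """Draw a random subset of feature indices without replacement."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(round(n_features * feature_perc)))
    return np.sort(rng.choice(n_features, size=n_keep, replace=False))


# Each subsample iteration would draw its own subset, e.g. by varying the seed
subset_a = subsample_features(1000, feature_perc=0.5, seed=1)
subset_b = subsample_features(1000, feature_perc=0.5, seed=2)
```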
For each cell type and subsample:
- Subset cells of that type
- Split into stratified k-fold train/test sets
- Train a Random Forest (or Logistic Regression) classifier to predict condition labels
- Evaluate AUC on held-out folds
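The loop above can be sketched for a single cell type with stock scikit-learn components. This is an illustration only: it uses the standard `RandomForestClassifier` rather than the package's custom forest, and the function name `cell_type_auc`, the fold count, and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def cell_type_auc(X, labels, n_folds=3, seed=0):
    """X: cells x features for ONE cell type; labels: two condition labels, one per cell."""
    y = (labels == np.unique(labels)[1]).astype(int)  # encode the two conditions as 0/1
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    fold_aucs = []
    for train_idx, test_idx in cv.split(X, y):
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        prob = clf.predict_proba(X[test_idx])[:, 1]  # probability of the second condition
        fold_aucs.append(roc_auc_score(y[test_idx], prob))
    return float(np.mean(fold_aucs))


rng = np.random.default_rng(0)
labels = np.repeat(["ctrl", "stim"], 40)

# Synthetic "responsive" cell type: five features shift under stimulation
responsive = rng.normal(size=(80, 20))
responsive[40:, :5] += 2.0
# Synthetic "unresponsive" cell type: no difference between conditions
unresponsive = rng.normal(size=(80, 20))

auc_responsive = cell_type_auc(responsive, labels)
auc_unresponsive = cell_type_auc(unresponsive, labels)
```

The responsive cell type should score well above the unresponsive one, whose AUC hovers near chance.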
Average AUC across all subsamples and folds per cell type. Cell types with higher AUC are more differentially responsive to the experimental perturbation — i.e., more "prioritized."
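The aggregation amounts to a group-wise mean followed by a sort. A small sketch using only the standard library; the record layout and the AUC values are invented for the example and are not package output.

```python
from collections import defaultdict
from statistics import mean


def rank_cell_types(records):
    """records: iterable of (cell_type, subsample, fold, auc) tuples."""
    by_type = defaultdict(list)
    for cell_type, _subsample, _fold, auc in records:
        by_type[cell_type].append(auc)
    mean_auc = {cell_type: mean(values) for cell_type, values in by_type.items()}
    # Highest mean AUC first, i.e. most prioritized cell type first
    return sorted(mean_auc.items(), key=lambda item: item[1], reverse=True)


toy_records = [
    ("A", 0, 0, 0.50), ("A", 0, 1, 0.60),
    ("B", 0, 0, 0.80), ("B", 0, 1, 0.90),
]
ranking = rank_cell_types(toy_records)
```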
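Every function is designed to track the R reference implementation as closely as possible: cell type rankings match exactly, while absolute AUC values differ because the underlying loess and random forest implementations differ.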
R's select_variance() uses loess(CV ~ mean) with 4 robustness iterations. Our implementation uses the Cython lowess from statsmodels (statsmodels.nonparametric.smoothers_lowess.lowess) with it=2, which tracks R's loess more closely than it=0 and yields Pearson r = 0.9999 on the feature selection residuals.
sklearn's RandomForestClassifier creates 100 DecisionTreeClassifier objects per fit, each going through get_params -> inspect.signature -> _validate_params. With 450 fits x 100 trees = 45,000 estimator creations, this overhead dominates. Our custom _FastRandomForest builds trees directly with DecisionTreeClassifier.fit(), skipping parameter validation while preserving identical bootstrap + decision tree behavior.
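The idea of assembling the ensemble by hand can be sketched as follows. This is not the package's `_FastRandomForest`: the class name `TinyForest` is hypothetical, the bootstrap here resamples rows directly (scikit-learn's forest implements it differently, so results will not be bit-identical), and each tree's own `fit` still runs whatever validation scikit-learn performs internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


class TinyForest:
    """Bagged decision trees built directly, one bootstrap sample per tree."""

    def __init__(self, n_trees=100, seed=0):
        self.n_trees = n_trees
        self.seed = seed

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        n_samples = X.shape[0]
        self.classes_ = np.unique(y)
        self.trees_ = []
        for _ in range(self.n_trees):
            boot = rng.integers(0, n_samples, size=n_samples)  # sample rows with replacement
            tree = DecisionTreeClassifier(
                max_features="sqrt",  # random feature subset at each split
                random_state=int(rng.integers(2**31 - 1)),
            )
            tree.fit(X[boot], y[boot])
            self.trees_.append(tree)
        return self

    def predict_proba(self, X):
        proba = np.zeros((X.shape[0], len(self.classes_)))
        for tree in self.trees_:
            # A bootstrap sample can miss a class, so align columns via tree.classes_
            cols = np.searchsorted(self.classes_, tree.classes_)
            proba[:, cols] += tree.predict_proba(X)
        return proba / self.n_trees


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 10))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
forest = TinyForest(n_trees=25, seed=0).fit(X_toy, y_toy)
proba_toy = forest.predict_proba(X_toy)
```

Averaging per-tree class probabilities mirrors how a bagged forest forms its prediction; any speed gain depends on how much per-estimator overhead is avoided in practice.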
Uses sklearn's StratifiedKFold to match R's vfold_cv() from rsample, preserving class proportions in each fold.
Feature importance extracted from tree-based classifiers via Gini impurity reduction, matching R's randomForest importance() output.
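As a worked example of what "Gini impurity reduction" measures, the snippet below computes the weighted impurity decrease of a single split. A feature's importance in a tree is the sum of these decreases over every node that splits on it, averaged across trees (scikit-learn additionally normalizes the result to sum to 1). The helper names and class counts are illustrative only.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(parent, left, right, n_total):
    """Weighted Gini decrease for one split, scaled by the node's share of all samples."""
    n_parent, n_left, n_right = sum(parent), sum(left), sum(right)
    children = (n_left * gini(left) + n_right * gini(right)) / n_parent
    return (n_parent / n_total) * (gini(parent) - children)


# Root node with 50 cells per condition, split into 40/10 and 10/40
decrease = impurity_decrease(parent=[50, 50], left=[40, 10], right=[10, 40], n_total=100)
# gini(parent) = 0.5, gini(left) = gini(right) = 0.32, so the decrease is 0.18
```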
All metrics computed against R Augur v1.0.3 on the sc_sim dataset (15,697 genes x 600 cells, 3 cell types, 50 subsamples).
| Metric | Value | Gate | Status |
|---|---|---|---|
| Pearson r (AUC) | 0.9999 | >= 0.95 | PASS |
| Spearman rho (ranking) | 1.0000 | = 1.0 | PASS |
| Ranking preserved | CellTypeC > CellTypeB > CellTypeA | — | PASS |
| Cell Type | R AUC | Python AUC | Difference (Python - R) |
|---|---|---|---|
| CellTypeA | 0.5535 | 0.6804 | +0.1269 |
| CellTypeB | 0.7467 | 0.8551 | +0.1084 |
| CellTypeC | 0.8795 | 0.9826 | +0.1031 |
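Absolute AUC values are 0.10 to 0.13 higher in Python than in R because the loess and random forest implementations differ, but the cell type ranking is identical and the two sets of values remain almost perfectly correlated.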
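The two parity metrics can be recomputed from the three mean AUCs in the table above. The README does not state whether the reported figures were computed on these means or on per-subsample values, so treat this as a sanity check rather than a reproduction; the `ranks` helper is written for this example and assumes no ties.

```python
import numpy as np

# Mean AUCs for CellTypeA, CellTypeB, CellTypeC, copied from the table
r_auc = np.array([0.5535, 0.7467, 0.8795])
py_auc = np.array([0.6804, 0.8551, 0.9826])


def ranks(values):
    """Ranks 1..n in ascending order; assumes there are no ties."""
    order = np.argsort(values)
    out = np.empty(len(values))
    out[order] = np.arange(1, len(values) + 1)
    return out


pearson_r = np.corrcoef(r_auc, py_auc)[0, 1]
# Spearman rho is the Pearson correlation of the ranks
spearman_rho = np.corrcoef(ranks(r_auc), ranks(py_auc))[0, 1]

print(f"Pearson r = {pearson_r:.4f}, Spearman rho = {spearman_rho:.4f}")
```

A constant offset between the two implementations leaves both correlations untouched, which is why the ranking gate can pass even though the absolute values differ.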
| Function | R | Python | Speed-up |
|---|---|---|---|
| calculate_auc | 227.8 s | 59.8 s | 3.8x |
| Optimization | Description | Impact |
|---|---|---|
| statsmodels C lowess | Replaced custom O(n^2) loess with Cython lowess (it=2) | Feature selection: ~2x faster |
| Custom _FastRandomForest | Bypass sklearn parameter validation (45k object creations) | RF training: ~3x faster |
| Sequential execution | n_jobs=1 avoids joblib overhead on small datasets | Faster than n_jobs=-1 |
Same algorithm. Same inputs. 3.8x faster. Spearman rho = 1.0.
| Notebook | What it covers |
|---|---|
| examples/quickstart.ipynb | Quick-start guide: load data, run Augur, inspect results |
| examples/benchmark_R_vs_Python.ipynb | Live benchmark comparing Python vs R outputs with parity metrics |
| examples/function_mapping.ipynb | R-to-Python function mapping reference |
```python
from pyaugur import (
    calculate_auc,  # Main entry point
    calculate_differential_prioritization,  # Permutation test
    select_variance,  # Variance-based feature selection
    select_random,  # Random feature subsampling
)
```

- calculate_auc: Train a classifier to predict condition labels per cell type and evaluate AUC in cross-validation. Returns a dict with AUC (DataFrame), results, feature_importance, parameters.
- calculate_differential_prioritization: Permutation test for differential prioritization between two conditions.
- select_variance: Feature selection by variance (loess on CV vs mean expression).
- select_random: Random feature subsampling.
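For readers unfamiliar with permutation tests, the sketch below shows the generic calculation such a test rests on: an observed AUC difference is compared against differences obtained with permuted labels. It is not the procedure used by calculate_differential_prioritization, whose inputs and statistics are not described here; the function name and the synthetic null distribution are assumptions.

```python
import numpy as np


def permutation_pvalue(observed_delta, null_deltas):
    """Two-sided permutation p-value with the usual +1 correction."""
    null_deltas = np.asarray(null_deltas, dtype=float)
    n_extreme = np.sum(np.abs(null_deltas) >= abs(observed_delta))
    return (n_extreme + 1) / (null_deltas.size + 1)


rng = np.random.default_rng(0)
# Synthetic null: AUC differences that might arise when condition labels are shuffled
null = rng.normal(loc=0.0, scale=0.02, size=999)

p_large_shift = permutation_pvalue(0.15, null)  # far outside the null distribution
p_no_shift = permutation_pvalue(0.0, null)  # indistinguishable from the null
```

The +1 correction keeps the p-value strictly positive, so its smallest possible value is 1 / (number of permutations + 1).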
```python
from pyaugur import (
    plot_lollipop,  # Lollipop plot for AUC/CCC values
    plot_umap,  # UMAP embeddings with omicverse style
    plot_important_features,  # Top important features per cell type
    plot_differential_prioritization,  # Differential prioritization scatter
)
```

- plot_lollipop: Create a lollipop plot showing cell type priorities (AUC values), ranked.
- plot_umap: Superimpose cell type prioritizations onto a UMAP plot. Styled after omicverse embeddings: axis arrows (frameon='small'), right-side colorbar, rasterized scatter points.
- plot_important_features: Plot the most important features (genes) for a cell type.
- plot_differential_prioritization: Plot differential prioritization results, highlighting statistically significant cell types.
If you use this package, please cite the original Augur paper:
Skinnider, M. A. et al. Cell type prioritization in single-cell data. Nature Biotechnology 39, 30–34 (2021).
and acknowledge this repo for the Python port.
GNU GPLv3 — matches the upstream R Augur package.