py-Augur

A pure-Python re-implementation of Augur (Skinnider et al., Nature Biotechnology 2021) for cell type prioritization in high-dimensional single-cell data.

AnnData-native — drop-in for the scanpy ecosystem
No rpy2, no R install, no Augur R package dependency
Close numerical agreement with R Augur on the sc_sim benchmark: cell type ranking is preserved exactly (Spearman rho = 1.0) and AUCs correlate at Pearson r = 0.9999, although absolute AUC values differ
Full pipeline: variance-based feature selection (loess on CV vs mean), random subsampling, stratified k-fold cross-validation with RF/LR classifiers

Install

pip install pyaugur

Quick Start

import numpy as np
import pandas as pd
from pyaugur import calculate_auc, plot_lollipop, plot_umap, plot_important_features

# Expression matrix: genes x cells
expr = pd.read_csv("expression.csv", index_col=0).values
meta = pd.read_csv("metadata.csv")  # columns: cell_type, label

result = calculate_auc(expr, meta=meta)
print(result["AUC"])  # Mean AUC per cell type, ranked

# Visualize results
plot_lollipop(result)
plot_important_features(result)  # Top features per cell type

Results are returned as a dictionary:

Key	Contents
`result['AUC']`	DataFrame — mean AUC per cell type, ranked by prioritization
`result['results']`	DataFrame — per-subsample AUC for each cell type
`result['feature_importance']`	DataFrame — feature importance scores per cell type
`result['parameters']`	dict — classifier, folds, subsample size, etc.

Pipeline overview

The pyaugur pipeline mirrors the R Augur workflow step-for-step:

1. Feature selection — `select_variance`

Select informative genes based on variance. A loess curve is fitted to the coefficient of variation (CV) as a function of mean expression, and genes whose residuals from that fit exceed the specified quantile (var_quantile) are retained. This mirrors R's select_variance(), including its filter_negative_residuals option.
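
As a rough illustration of the idea described above, the sketch below fits a lowess trend of CV against mean with statsmodels and keeps the features whose residuals exceed a quantile. It is not pyaugur's select_variance: the function name select_by_cv_residual, the cells x features orientation, the frac value, and applying the quantile to the residuals are all assumptions made for this example.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess


def select_by_cv_residual(mat, var_quantile=0.5, it=2):
    """Return column indices of features whose CV lies above the lowess trend.

    mat: cells x features array of non-negative expression values (assumed layout).
    """
    means = mat.mean(axis=0)
    sds = mat.std(axis=0, ddof=1)
    ok = (means > 0) & (sds > 0)              # drop all-zero and constant features
    idx = np.flatnonzero(ok)
    cv = sds[idx] / means[idx]
    # Smooth CV as a function of mean; return_sorted=False keeps the input order
    trend = lowess(cv, means[idx], frac=2 / 3, it=it, return_sorted=False)
    resid = cv - trend
    cutoff = np.quantile(resid, var_quantile)
    return idx[resid > cutoff]


# Synthetic counts: 300 cells x 200 features with varying mean expression
rng = np.random.default_rng(0)
mat = rng.poisson(lam=rng.uniform(0.5, 20, size=200), size=(300, 200)).astype(float)
selected = select_by_cv_residual(mat, var_quantile=0.5)
print(len(selected), "of", mat.shape[1], "features kept")
```

With var_quantile=0.5 roughly half of the usable features survive; raising the quantile makes the filter stricter.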

2. Random subsampling — `select_random`

Randomly subsample a fraction of selected features for each subsample iteration. Reduces overfitting and improves robustness of prioritization scores.
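
A minimal NumPy sketch of this step, assuming a cells x features matrix and a feature_perc fraction as in the signature listed in the API reference. The helper name subsample_features and the seed argument are hypothetical and not part of pyaugur.

```python
import numpy as np


def subsample_features(mat, feature_perc=0.5, seed=0):
    """Keep a random fraction of the columns of a cells x features matrix."""
    rng = np.random.default_rng(seed)
    n_features = mat.shape[1]
    n_keep = max(1, int(round(n_features * feature_perc)))
    # Sample column indices without replacement, then restore column order
    keep = np.sort(rng.choice(n_features, size=n_keep, replace=False))
    return mat[:, keep], keep


mat = np.arange(40, dtype=float).reshape(4, 10)   # 4 cells x 10 features
sub, kept = subsample_features(mat, feature_perc=0.5, seed=1)
print(sub.shape)   # (4, 5)
```

Drawing a fresh subset with a different seed on each iteration is what gives the per-subsample spread in AUC.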

3. Classifier training & cross-validation

For each cell type and subsample:

Subset cells of that type
Split into stratified k-fold train/test sets
Train a Random Forest (or Logistic Regression) classifier to predict condition labels
Evaluate AUC on held-out folds
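
The loop above can be sketched with scikit-learn for a single cell type and a single subsample. This is an illustration on synthetic data, not the package's internal code: the function name cross_validated_auc, the fold count of 3, and the simulated expression values are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def cross_validated_auc(X, y, n_folds=3, seed=0):
    """Mean held-out AUC for predicting binary condition labels y from X."""
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in cv.split(X, y):
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        prob = clf.predict_proba(X[test_idx])[:, 1]   # P(label == 1)
        aucs.append(roc_auc_score(y[test_idx], prob))
    return float(np.mean(aucs))


# Synthetic "cell type": 120 cells x 30 genes; the condition shifts the first 5 genes
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 60)
X = rng.normal(size=(120, 30))
X[y == 1, :5] += 1.0
auc = cross_validated_auc(X, y)
print(round(auc, 3))
```

An AUC near 0.5 would indicate that the condition is not separable within this cell type; values approaching 1 indicate a strong response.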

4. Aggregation

Average AUC across all subsamples and folds per cell type. Cell types with higher AUC are more differentially responsive to the experimental perturbation — i.e., more "prioritized."
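
The aggregation amounts to a group-wise mean followed by a sort. The pandas sketch below uses invented AUC values and hypothetical column names (cell_type, subsample, auc); the actual layout of result['results'] may differ.

```python
import pandas as pd

# Hypothetical per-subsample AUCs (values invented for illustration)
per_subsample = pd.DataFrame({
    "cell_type": ["A", "A", "B", "B", "C", "C"],
    "subsample": [1, 2, 1, 2, 1, 2],
    "auc": [0.55, 0.57, 0.74, 0.76, 0.88, 0.90],
})

# Mean AUC per cell type, highest (most prioritized) first
ranked = (
    per_subsample.groupby("cell_type", as_index=False)["auc"]
    .mean()
    .sort_values("auc", ascending=False)
    .reset_index(drop=True)
)
print(ranked)
```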

Algorithmic fidelity to R Augur

Every function is designed to track the R reference implementation as closely as possible. On the benchmark reported in this README, cell type rankings match exactly, while absolute AUC values differ.

1. Variance feature selection — statsmodels lowess (it=2)

R's select_variance() uses loess(CV ~ mean) with 4 robustness iterations. Our implementation uses statsmodels.nonparametric.smoothers_lowess.lowess (a Cython implementation) with it=2, which tracks R's loess more closely than it=0 and yields Pearson r = 0.9999 on the feature selection residuals.

2. Random Forest — custom `_FastRandomForest`

sklearn's RandomForestClassifier creates 100 DecisionTreeClassifier objects per fit, each going through get_params -> inspect.signature -> _validate_params. With 450 fits x 100 trees = 45,000 estimator creations, this overhead dominates. Our custom _FastRandomForest builds trees directly with DecisionTreeClassifier.fit(), skipping parameter validation while preserving identical bootstrap + decision tree behavior.
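
To make the general idea concrete, here is a small bagged-tree classifier that constructs DecisionTreeClassifier objects directly on bootstrap samples. It is an illustrative sketch only: the class name SimpleForest and its parameters are hypothetical, it is not the package's _FastRandomForest, and it makes no claim about reproducing the reported speed-up or about which validation steps are skipped.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


class SimpleForest:
    """Illustrative bagged-tree classifier; not pyaugur's _FastRandomForest."""

    def __init__(self, n_trees=100, seed=0):
        self.n_trees = n_trees
        self.seed = seed

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        self.classes_ = np.unique(y)
        self.trees_ = []
        n = X.shape[0]
        for _ in range(self.n_trees):
            rows = rng.integers(0, n, size=n)          # bootstrap sample of cells
            tree = DecisionTreeClassifier(
                max_features="sqrt",                   # random feature subset per split
                random_state=int(rng.integers(2**31 - 1)),
            )
            tree.fit(X[rows], y[rows])
            self.trees_.append(tree)
        return self

    def predict_proba(self, X):
        proba = np.zeros((X.shape[0], len(self.classes_)))
        for tree in self.trees_:
            # A bootstrap sample can miss a class, so align columns by label
            cols = np.searchsorted(self.classes_, tree.classes_)
            proba[:, cols] += tree.predict_proba(X)
        return proba / len(self.trees_)


# Synthetic data: 100 cells x 20 genes, the first 3 genes respond to the condition
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 20))
X[y == 1, :3] += 1.5
forest = SimpleForest(n_trees=25).fit(X, y)
p = forest.predict_proba(X)
print(p.shape)   # (100, 2)
```

Averaging per-tree class probabilities is the same aggregation rule that sklearn's RandomForestClassifier uses for predict_proba.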

3. Cross-validation — stratified k-fold

Uses sklearn's StratifiedKFold to match R's vfold_cv() from rsample, preserving class proportions in each fold.

4. Feature importance — Gini importance

Feature importance extracted from tree-based classifiers via Gini impurity reduction, matching R's randomForest importance() output.
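
For reference, scikit-learn exposes impurity-based (Gini) importance through the feature_importances_ attribute of a fitted forest. The sketch below uses synthetic data and hypothetical gene names, and it uses sklearn's RandomForestClassifier rather than the package's own forest; note that sklearn normalises the importances to sum to 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 10))
X[y == 1, 0] += 2.0                                  # only gene_0 carries signal

genes = [f"gene_{i}" for i in range(X.shape[1])]     # hypothetical gene names
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in impurity per feature, normalised to sum to 1
importance = dict(zip(genes, clf.feature_importances_))
top = max(importance, key=importance.get)
print(top, round(importance[top], 3))
```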

Benchmarks

All metrics computed against R Augur v1.0.3 on the sc_sim dataset (15,697 genes x 600 cells, 3 cell types, 50 subsamples).

Numerical accuracy

Metric	Value	Gate	Status
Pearson r (AUC)	0.9999	>= 0.95	PASS
Spearman rho (ranking)	1.0000	= 1.0	PASS
Ranking preserved	CellTypeC > CellTypeB > CellTypeA	—	PASS

Per cell type AUC

Cell Type	R	Python	Diff
CellTypeA	0.5535	0.6804	+0.1269
CellTypeB	0.7467	0.8551	+0.1084
CellTypeC	0.8795	0.9826	+0.1031

Absolute AUC values differ due to different loess/RF implementations, but relative ranking and correlation are preserved.
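
The distinction between absolute agreement and rank agreement can be checked directly from the three rows of the table above. The snippet recomputes the differences and both correlation coefficients with NumPy. It uses only the tabulated means, so it is a sanity check on the table rather than a reproduction of how the benchmark metrics were originally computed.

```python
import numpy as np

r_auc = np.array([0.5535, 0.7467, 0.8795])     # R column, CellTypeA..C
py_auc = np.array([0.6804, 0.8551, 0.9826])    # Python column, CellTypeA..C


def ranks(x):
    """Rank values from 0 (smallest); assumes there are no ties."""
    return np.argsort(np.argsort(x))


pearson = np.corrcoef(r_auc, py_auc)[0, 1]
spearman = np.corrcoef(ranks(r_auc), ranks(py_auc))[0, 1]
diff = py_auc - r_auc

print(f"Pearson r    = {pearson:.4f}")
print(f"Spearman rho = {spearman:.4f}")
print("Diff         =", np.round(diff, 4))
```

A near-constant offset of about 0.1 leaves both correlations essentially untouched, which is why the ranking metrics pass even though the Python AUCs are uniformly higher.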

Speed comparison

Function	R	Python	Speed-up
`calculate_auc`	227.8 s	59.8 s	3.8x

Key optimizations

Optimization	Description	Impact
statsmodels C lowess	Replaced custom O(n^2) loess with Cython lowess (it=2)	Feature selection: ~2x faster
Custom _FastRandomForest	Bypass sklearn parameter validation (45k object creations)	RF training: ~3x faster
Sequential execution	n_jobs=1 avoids joblib overhead on small datasets	Faster than n_jobs=-1

Same algorithm. Same inputs. 3.8x faster. Spearman rho = 1.0.

Notebooks

Notebook	What it covers
`examples/quickstart.ipynb`	Quick-start guide — load data, run Augur, inspect results
`examples/benchmark_R_vs_Python.ipynb`	Live benchmark comparing Python vs R outputs with parity metrics
`examples/function_mapping.ipynb`	R-to-Python function mapping reference

API reference

Core functions

from pyaugur import (
    calculate_auc,                       # Main entry point
    calculate_differential_prioritization,  # Permutation test
    select_variance,                     # Variance-based feature selection
    select_random,                       # Random feature subsampling
)

`calculate_auc(input, meta=None, ...)`

Train a classifier to predict condition labels per cell type, evaluate AUC in cross-validation.

Returns: dict with AUC (DataFrame), results, feature_importance, parameters.

`calculate_differential_prioritization(augur1, augur2, permuted1, permuted2, ...)`

Permutation test for differential prioritization between two conditions.
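
The README does not spell out how the p-value is derived, so the following is only a generic sketch of a permutation test on a difference in AUC. The function permutation_pvalue, the two-sided comparison, the +1 correction, and all numbers are assumptions for illustration and may not match what calculate_differential_prioritization does internally.

```python
import numpy as np


def permutation_pvalue(observed_delta, null_deltas):
    """Two-sided empirical p-value with the usual +1 correction."""
    null_deltas = np.asarray(null_deltas, dtype=float)
    # Count permuted differences at least as extreme as the observed one
    extreme = np.sum(np.abs(null_deltas) >= abs(observed_delta))
    return (extreme + 1) / (null_deltas.size + 1)


# Invented numbers for one cell type
rng = np.random.default_rng(0)
null = rng.normal(loc=0.0, scale=0.02, size=999)   # delta-AUC under permuted labels
observed = 0.12                                    # AUC(condition 2) - AUC(condition 1)
p = permutation_pvalue(observed, null)
print(p)
```

With 999 permutations the smallest attainable p-value is 1/1000, so the number of permutations bounds the resolution of the test.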

`select_variance(mat, var_quantile=0.5)`

Feature selection by variance (loess on CV vs mean expression).

`select_random(mat, feature_perc=0.5)`

Random feature subsampling.

Visualization functions

from pyaugur import (
    plot_lollipop,                    # Lollipop plot for AUC/CCC values
    plot_umap,                        # UMAP embeddings with omicverse style
    plot_important_features,          # Top important features per cell type
    plot_differential_prioritization, # Differential prioritization scatter
)

`plot_lollipop(augur_results, top_n=None, ...)`

Create a lollipop plot showing cell type priorities (AUC values) ranked.

`plot_umap(input, augur_results, mode="default", ...)`

Superimpose cell type prioritizations onto a UMAP plot. Styled after omicverse embeddings: axis arrows (frameon='small'), right-side colorbar, rasterized scatter points.

`plot_important_features(augur_results, cell_type=None, top_n=10, ...)`

Plot the most important features (genes) for a cell type.

`plot_differential_prioritization(results, top_n=0, ...)`

Plot differential prioritization results highlighting statistically significant cell types.

Visualization gallery

Citation

If you use this package, please cite the original Augur paper:

Skinnider, M. A. et al. Cell type prioritization in single-cell data. Nature Biotechnology 39, 30–34 (2021).

and acknowledge this repo for the Python port.

License

GNU GPLv3 — matches the upstream R Augur package.
