yallHap

Modern, pipeline-friendly Y-chromosome haplogroup inference.

Features

YFull tree: Uses the most comprehensive Y-chromosome phylogeny (185,780+ SNPs)
Probabilistic scoring: Likelihood-based confidence scores, not just SNP counting
Ancient DNA support: Built-in damage filtering, transversions-only mode, quality rescaling
Multiple references: Supports GRCh37, GRCh38, and T2T-CHM13v2.0 with automatic liftover
Multi-threaded: Parallel sample processing with --threads N for population-scale studies
Batch processing: Classify thousands of samples efficiently with classify_batch()
Pipeline-friendly: Proper exit codes, JSON/TSV output, Nextflow/Snakemake examples
Bioconda/Docker: Easy installation and containerized execution

Accuracy

Validated against established datasets:

Dataset	Samples	Same Major Lineage	Reference	Notes
1000 Genomes Phase 3	1,233	99.8% (95% CI: 99.3-100%)	GRCh37	Modern WGS, heuristic mode
AADR Ancient DNA	7,333	90.7% Bayesian / 88.3% Heuristic	GRCh37	Full dataset, stratified by variant density
gnomAD HGDP/1KG	1,231	99.9% (95% CI: 99.5-100%)	GRCh38	High-coverage WGS

1000 Genomes details:

Only 3 misclassified samples (2 rare A0 haplogroups, 1 NO/K confusion)
Mean confidence: 0.994
Mean derived SNPs: 15.4

AADR Ancient DNA details (7,333 samples):

Overall: 90.7% accuracy with Bayesian ancient mode vs 88.3% with heuristic transversions-only
Stratified by variant density: <1% (33.7%), 1-4% (37.9%), 4-10% (71.7%), 10-50% (97.8%), ≥50% (99.0%)
At ≥10% variant density, both modes achieve 97-99% accuracy, comparable to modern WGS
Bayesian mode recommended for 4-10% variant density (+12-24 pp improvement)
Variant density = (called variants / total variants in chrY VCF) × 100%
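
The stratified figures above can be turned into a small lookup for triaging samples. This is a minimal sketch, not part of yallHap: the names `stratum_for` and `bayesian_recommended` are hypothetical, and treating each stratum as including its lower bound is an assumption.

```python
from bisect import bisect_right

# Upper bounds (percent) separating the variant-density strata listed above.
_BOUNDS = [1.0, 4.0, 10.0, 50.0]
# (label, reported accuracy in percent) for each stratum, lowest first.
_STRATA = [
    ("<1%", 33.7),
    ("1-4%", 37.9),
    ("4-10%", 71.7),
    ("10-50%", 97.8),
    (">=50%", 99.0),
]


def stratum_for(density_pct: float) -> tuple[str, float]:
    """Return (stratum label, reported accuracy) for a variant density."""
    if density_pct < 0 or density_pct > 100:
        raise ValueError("variant density must be between 0 and 100")
    # Assumption: each stratum includes its lower bound (4.0 is in "4-10%").
    return _STRATA[bisect_right(_BOUNDS, density_pct)]


def bayesian_recommended(density_pct: float) -> bool:
    """True inside the 4-10% band where Bayesian mode is recommended."""
    return 4.0 <= density_pct < 10.0


print(stratum_for(6.5), bayesian_recommended(6.5))
```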

gnomAD High-Coverage details:

200 samples randomly selected from the 1,231 that overlap with 1000 Genomes
30× high-coverage whole-genome sequencing
Mean derived SNPs: 26.7
95% confidence interval: 98.17-100%
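
The README does not say how this interval was computed. One reading that reproduces the 98.17% lower bound is an exact (Clopper-Pearson) binomial interval in which all 200 sampled calls were concordant; both the method and the all-concordant count are assumptions here. The function name is illustrative.

```python
def exact_lower_bound_all_successes(n: int, confidence: float = 0.95) -> float:
    """Clopper-Pearson lower bound when all n trials succeed.

    With k == n, the exact two-sided lower limit has the closed form
    (alpha / 2) ** (1 / n); the upper limit is 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    alpha = 1.0 - confidence
    return (alpha / 2.0) ** (1.0 / n)


lower = exact_lower_bound_all_successes(200)
print(f"{100 * lower:.2f}%")
```

Under those assumptions the bound comes out at 98.17%, matching the figure above.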

See VALIDATION_TESTING.md for reproducible validation protocols.

Installation

pip (recommended)

pip install yallhap

Conda

conda install -c bioconda yallhap

Docker

docker pull trianglegrrl/yallhap

From source

git clone https://github.com/trianglegrrl/yallhap.git
cd yallhap
pip install -e ".[dev]"

Quick Start

1. Download reference data

yallhap download --output-dir data/

This downloads:

YFull tree JSON (~14 MB)
YBrowse SNP database for GRCh38 (~430 MB)
YBrowse SNP database for GRCh37 (~50 MB)

2. Classify a sample

Use the SNP database matching your VCF's reference genome:

# For GRCh38/hg38 VCFs
yallhap classify sample.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --reference grch38 \
    --output result.json

# For GRCh37/hg19 VCFs
yallhap classify sample.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch37.csv \
    --reference grch37 \
    --output result.json

3. View results

jq '.haplogroup, .confidence' result.json
# "R-L21"
# 0.97
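
The same fields can be read from Python with the standard library. A minimal sketch, assuming the result follows the JSON layout documented under Output Format; `summarize_result` is an illustrative name, not a yallHap function.

```python
import json


def summarize_result(json_text: str) -> str:
    """Format sample, haplogroup and confidence from a yallHap JSON result."""
    result = json.loads(json_text)
    return (
        f"{result['sample']}: {result['haplogroup']} "
        f"(confidence: {result['confidence']:.2f})"
    )


# Inline example; in practice the text would be read from result.json.
example = '{"sample": "SAMPLE1", "haplogroup": "R-L21", "confidence": 0.97}'
print(summarize_result(example))
```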

Usage

Single Sample Classification

yallhap classify sample.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --reference grch38 \
    --output result.json

Multi-Sample VCF

For VCFs containing multiple samples, specify which sample to classify:

yallhap classify multi_sample.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --sample NA12891 \
    --output result.json

Batch Processing

Process multiple VCF files into a single TSV:

yallhap batch sample1.vcf.gz sample2.vcf.gz sample3.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --output results.tsv

Parallel Processing

Use multiple threads for faster batch processing:

yallhap batch samples/*.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --threads 16 \
    --output results.tsv

With 16 threads, processing 1,000+ samples takes approximately 10 minutes.

TSV Output Format

Use --format tsv for tab-separated output (useful for pipelines):

yallhap classify sample.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --format tsv \
    --output result.tsv

Reference Genomes

yallHap supports three reference genomes. Use the SNP database matching your VCF's reference:

VCF Reference	SNP Database	`-r` flag
GRCh37/hg19	`ybrowse_snps_grch37.csv`	`grch37`
GRCh38/hg38	`ybrowse_snps_grch38.csv`	`grch38`
T2T-CHM13v2.0	`ybrowse_snps_grch38.csv`	`t2t`

# GRCh37 (hg19) - 1000 Genomes Phase 3, many ancient DNA datasets
yallhap classify sample.vcf.gz \
    -s data/ybrowse_snps_grch37.csv -r grch37 ...

# GRCh38 (hg38) - current standard, gnomAD, most modern studies
yallhap classify sample.vcf.gz \
    -s data/ybrowse_snps_grch38.csv -r grch38 ...

# T2T-CHM13v2.0 - complete Y chromosome (62 Mb)
yallhap classify sample.vcf.gz \
    -s data/ybrowse_snps_grch38.csv -r t2t ...

T2T Note: T2T coordinates are computed automatically via liftover from GRCh38 positions. Ensure liftover chain files are available (run python scripts/download_liftover_chains.py).
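
When building commands programmatically, the table above can be encoded once so the SNP database and the `-r` value never drift apart. This is a sketch only: `SNP_DB_BY_REFERENCE` and `reference_args` are hypothetical helpers, and the `data` directory is the default from the download step.

```python
# Pairing of reference name and SNP database, copied from the table above.
SNP_DB_BY_REFERENCE = {
    "grch37": "ybrowse_snps_grch37.csv",
    "grch38": "ybrowse_snps_grch38.csv",
    "t2t": "ybrowse_snps_grch38.csv",  # T2T positions are lifted over from GRCh38
}


def reference_args(reference: str, data_dir: str = "data") -> list[str]:
    """Return the --snp-db/--reference arguments for a supported reference."""
    try:
        snp_db = SNP_DB_BY_REFERENCE[reference]
    except KeyError:
        raise ValueError(f"unsupported reference: {reference!r}") from None
    return ["--snp-db", f"{data_dir}/{snp_db}", "--reference", reference]


print(reference_args("t2t"))
```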

Ancient DNA Mode

yallHap includes specialized handling for ancient DNA samples with post-mortem damage.

Recommended: Bayesian Ancient Mode

For ancient DNA samples with moderate variant density (4–10%), Bayesian ancient mode is recommended. In this range it improves accuracy by 12–24 percentage points over heuristic mode:

yallhap classify ancient.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --ancient \
    --bayesian \
    --output result.json

Variant density is calculated as (called variants / total variants in chrY VCF) × 100% and can be computed directly from your VCF. At ≥10% variant density, both modes achieve comparable accuracy (97–99%); below 4%, classification is unreliable regardless of mode.
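
A rough way to compute variant density from VCF text is sketched below. It assumes that "called" means the sample's genotype is not missing and that GT is the first FORMAT key, as the VCF specification requires when GT is present. `variant_density` is a hypothetical helper, not part of yallHap, and yallHap's own calculation may differ.

```python
def variant_density(vcf_lines, sample_index: int = 0) -> float:
    """Percentage of VCF records with a non-missing genotype for one sample."""
    total = 0
    called = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 + sample_index:
            continue  # record has no column for this sample
        total += 1
        genotype = fields[9 + sample_index].split(":")[0]
        # "." and "./." are missing; anything with an allele index is called.
        if genotype.replace("/", "").replace("|", "").strip("."):
            called += 1
    return 100.0 * called / total if total else 0.0


# Tiny made-up chrY VCF: two of four records have a genotype call.
demo = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n",
    "chrY\t100\t.\tA\tG\t.\t.\t.\tGT\t1\n",
    "chrY\t200\t.\tC\tT\t.\t.\t.\tGT\t.\n",
    "chrY\t300\t.\tG\tA\t.\t.\t.\tGT\t0\n",
    "chrY\t400\t.\tT\tC\t.\t.\t.\tGT\t.\n",
]
print(variant_density(demo))
```

For a compressed file, the same function can be fed an open handle, for example `gzip.open("ancient.vcf.gz", "rt")`.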

Basic Ancient Mode

Filters C>T and G>A transitions at read termini:

yallhap classify ancient.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --ancient \
    --min-depth 1 \
    --output result.json

Transversions-Only Mode

Strictest mode for heavily damaged samples (ignores all transitions):

yallhap classify ancient.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --transversions-only \
    --output result.json
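
For reference, a transition swaps two purines (A, G) or two pyrimidines (C, T), and a transversion swaps a purine for a pyrimidine. The check below illustrates that definition; `is_transversion` is an illustrative name and says nothing about how yallHap implements the filter internally.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transversion(ref: str, alt: str) -> bool:
    """True if ref>alt swaps a purine for a pyrimidine or vice versa."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # Crossing between the two chemical classes makes it a transversion.
    return (ref in PURINES) != (alt in PURINES)


for ref, alt in [("C", "T"), ("G", "A"), ("A", "C")]:
    kind = "transversion" if is_transversion(ref, alt) else "transition"
    print(f"{ref}>{alt}: {kind}")
```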

Damage Rescaling

Downweight potentially damaged variants without excluding them:

yallhap classify ancient.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --ancient \
    --damage-rescale moderate \
    --output result.json

Options for --damage-rescale:

none (default): No rescaling
moderate: 50% weight reduction for damage-like transitions
aggressive: 80% weight reduction
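
The effect of these options can be pictured as a per-SNP weight. The sketch below is an assumption-laden illustration: it treats C>T and G>A as the damage-like substitutions and applies the stated reductions as a simple multiplier. How yallHap combines weights internally is not documented here, and `snp_weight` is a hypothetical name.

```python
# Substitutions that look like post-mortem deamination damage.
DAMAGE_LIKE = {("C", "T"), ("G", "A")}

# Remaining weight after the documented reduction (50% and 80%).
REMAINING_WEIGHT = {"none": 1.0, "moderate": 0.5, "aggressive": 0.2}


def snp_weight(ref: str, alt: str, mode: str = "none") -> float:
    """Weight given to an observed substitution under a rescaling mode."""
    if mode not in REMAINING_WEIGHT:
        raise ValueError(f"unknown --damage-rescale option: {mode!r}")
    if (ref.upper(), alt.upper()) in DAMAGE_LIKE:
        return REMAINING_WEIGHT[mode]
    return 1.0  # other substitutions keep full weight


print(snp_weight("C", "T", "moderate"), snp_weight("A", "C", "aggressive"))
```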

Python API

Single Sample

from yallhap.tree import Tree
from yallhap.snps import SNPDatabase
from yallhap.classifier import HaplogroupClassifier

# Load resources
tree = Tree.from_json("data/yfull_tree.json")
snp_db = SNPDatabase.from_csv("data/ybrowse_snps_grch38.csv")

# Create classifier
classifier = HaplogroupClassifier(
    tree=tree,
    snp_db=snp_db,
    reference="grch38",
)

# Classify
result = classifier.classify("sample.vcf.gz")
print(f"{result.sample}: {result.haplogroup} (confidence: {result.confidence:.2f})")

Batch Classification (Multi-Sample VCF)

For multi-sample VCFs, classify_batch() is 10x faster than calling classify() repeatedly:

# Sample names to classify (male samples, so that chrY calls are present)
samples = ["NA12877", "NA12889", "NA12891"]

# Classify all samples in one pass
results = classifier.classify_batch("multi_sample.vcf.gz", samples)

for result in results:
    print(f"{result.sample}: {result.haplogroup}")

Ancient DNA Mode

# Recommended: Bayesian ancient mode for moderate variant density (4-10%)
classifier = HaplogroupClassifier(
    tree=tree,
    snp_db=snp_db,
    reference="grch37",
    ancient_mode=True,
    bayesian=True,  # Recommended for 4-10% variant density
)

# Alternative: Transversions-only mode (strictest filtering)
classifier = HaplogroupClassifier(
    tree=tree,
    snp_db=snp_db,
    reference="grch37",
    ancient_mode=True,
    transversions_only=True,
    damage_rescale="moderate",
)

Output Format

JSON (default)

{
  "sample": "SAMPLE1",
  "haplogroup": "R-L21",
  "confidence": 0.97,
  "reference": "grch38",
  "tree_version": "YFull (185780 SNPs, hash: a1b2c3d4)",
  "snp_stats": {
    "informative_tested": 1247,
    "derived": 145,
    "ancestral": 1089,
    "missing": 13,
    "filtered_damage": 0
  },
  "quality_scores": {
    "qc1_backbone": 0.98,
    "qc2_terminal": 1.0,
    "qc3_path": 0.95,
    "qc4_posterior": 0.97
  },
  "path": ["ROOT", "A0-T", "A1", "...", "R-L21"],
  "defining_snps": ["L21"]
}

Reproducibility

The tree_version field includes a hash of the tree file content, enabling exact reproducibility. When citing yallHap results, include the tree_version value to document the exact phylogeny version used. The format is:

YFull (<snp_count> SNPs, hash: <8-char SHA256>)

Example: "YFull (185780 SNPs, hash: a1b2c3d4)"
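
To check a tree file against a recorded tree_version, the string can be parsed and rebuilt. This sketch assumes the hash is the first 8 hexadecimal characters of the SHA-256 digest of the raw file bytes; the README only says "8-char SHA256", so that detail is a guess. Both function names are hypothetical.

```python
import hashlib
import re

_TREE_VERSION = re.compile(r"^YFull \((\d+) SNPs, hash: ([0-9a-f]{8})\)$")


def parse_tree_version(value: str) -> tuple[int, str]:
    """Split a tree_version string into (snp_count, short_hash)."""
    match = _TREE_VERSION.match(value)
    if match is None:
        raise ValueError(f"unrecognized tree_version: {value!r}")
    return int(match.group(1)), match.group(2)


def format_tree_version(tree_bytes: bytes, snp_count: int) -> str:
    """Build a tree_version string from file content (assumed hashing scheme)."""
    short_hash = hashlib.sha256(tree_bytes).hexdigest()[:8]
    return f"YFull ({snp_count} SNPs, hash: {short_hash})"


print(parse_tree_version("YFull (185780 SNPs, hash: a1b2c3d4)"))
```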

TSV (for batch processing)

sample	haplogroup	confidence	qc1	qc2	qc3	qc4	derived	ancestral	missing
SAMPLE1	R-L21	0.9700	0.9800	1.0000	0.9500	0.9700	145	1089	13
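
The TSV can be consumed with the standard `csv` module. The sketch below lists samples under a confidence threshold; the SAMPLE2 row is made up for illustration, and `low_confidence_samples` is not a yallHap function.

```python
import csv
import io


def low_confidence_samples(tsv_text: str, threshold: float = 0.95) -> list[str]:
    """Return sample names whose confidence is below the threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["sample"] for row in reader if float(row["confidence"]) < threshold]


header = "sample\thaplogroup\tconfidence\tqc1\tqc2\tqc3\tqc4\tderived\tancestral\tmissing\n"
demo = (
    header
    + "SAMPLE1\tR-L21\t0.9700\t0.9800\t1.0000\t0.9500\t0.9700\t145\t1089\t13\n"
    # Hypothetical second row, not taken from any real run.
    + "SAMPLE2\tI-M253\t0.8100\t0.9000\t0.8000\t0.8500\t0.8100\t12\t240\t95\n"
)
print(low_confidence_samples(demo))
```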

Exit Codes

Code	Meaning
0	Success (high confidence, ≥0.95)
1	Classification failed (no haplogroup)
2	Low confidence (<0.95)
10	File not found
11	Invalid input
99	Unexpected error
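
A pipeline wrapper can translate these codes into messages and decide whether to keep the output. This is a sketch with hypothetical helper names; treating code 2 as "a haplogroup was still produced" is an assumption based on the table, not documented behavior.

```python
# Meanings copied from the exit code table above.
EXIT_MEANINGS = {
    0: "Success (high confidence, >=0.95)",
    1: "Classification failed (no haplogroup)",
    2: "Low confidence (<0.95)",
    10: "File not found",
    11: "Invalid input",
    99: "Unexpected error",
}


def describe_exit_code(code: int) -> str:
    """Human-readable meaning of a yallhap exit code."""
    return EXIT_MEANINGS.get(code, f"Undocumented exit code {code}")


def produced_haplogroup(code: int) -> bool:
    """Assumption: codes 0 and 2 both come with a haplogroup call."""
    return code in (0, 2)


for code in (0, 2, 10):
    print(code, describe_exit_code(code), produced_haplogroup(code))
```

In practice `code` would typically be the `returncode` of a `subprocess.run([...])` call that invokes `yallhap classify`.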

Quality Scores

Score	Name	Description
QC1	Backbone	Intermediate markers on path to haplogroup match expected states
QC2	Terminal	Defining markers for called haplogroup are present
QC3	Path	Consistency within the called haplogroup branch
QC4	Posterior	Overall posterior probability from likelihood calculation
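
When a call has lower confidence than expected, the weakest of the four scores indicates where to look first. A minimal sketch over the `quality_scores` object from the JSON output; `weakest_qc` is an illustrative helper, not part of yallHap.

```python
def weakest_qc(result: dict) -> tuple[str, float]:
    """Return the (name, value) of the lowest quality score in a result."""
    scores = result["quality_scores"]
    name = min(scores, key=scores.get)
    return name, scores[name]


# Scores copied from the JSON example in this README.
example = {
    "quality_scores": {
        "qc1_backbone": 0.98,
        "qc2_terminal": 1.0,
        "qc3_path": 0.95,
        "qc4_posterior": 0.97,
    }
}
print(weakest_qc(example))
```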

CLI Reference

`yallhap classify`

Classify a single VCF file.

Usage: yallhap classify [OPTIONS] VCF

Options:
  -t, --tree PATH          Path to YFull tree JSON [required]
  -s, --snp-db PATH        Path to SNP database CSV [required]
  -r, --reference TEXT     Reference genome: grch37, grch38, t2t [default: grch38]
  --sample TEXT            Sample name (for multi-sample VCFs)
  --ancient                Enable ancient DNA mode
  --transversions-only     Only use transversions (strictest aDNA mode)
  --bayesian               Enable Bayesian classification mode (experimental)
  --damage-rescale TEXT    Rescale quality: none, moderate, aggressive
  --min-depth INTEGER      Minimum read depth [default: 1]
  --min-quality INTEGER    Minimum base quality [default: 20]
  -o, --output PATH        Output file (stdout if omitted)
  --format TEXT            Output format: json, tsv [default: json]

`yallhap batch`

Batch process multiple VCF files.

Usage: yallhap batch [OPTIONS] VCF_FILES...

Options:
  -t, --tree PATH          Path to YFull tree JSON [required]
  -s, --snp-db PATH        Path to SNP database CSV [required]
  -r, --reference TEXT     Reference genome: grch37, grch38, t2t [default: grch38]
  --ancient                Enable ancient DNA mode
  --transversions-only     Only use transversions
  --damage-rescale TEXT    Rescale quality: none, moderate, aggressive
  -o, --output PATH        Output TSV file [required]
  --threads INTEGER        Parallel threads [default: 1]

`yallhap download`

Download reference data (YFull tree + SNP databases for all reference genomes).

Usage: yallhap download [OPTIONS]

Options:
  -o, --output-dir PATH    Output directory [default: data/]
  -f, --force              Overwrite existing files

Downloads:

yfull_tree.json - YFull phylogenetic tree (~14 MB)
ybrowse_snps_grch38.csv - SNP positions for GRCh38/hg38 (~430 MB)
ybrowse_snps_grch37.csv - SNP positions for GRCh37/hg19 (~50 MB)

Pipeline Integration

Nextflow

See pipelines/nextflow/ for a complete example.

process YALLHAP {
    input:
    path vcf

    output:
    path "*.json"

    script:
    """
    yallhap classify ${vcf} \
        --tree ${params.tree} \
        --snp-db ${params.snp_db} \
        --reference ${params.reference} \
        --output ${vcf.baseName}.json
    """
}

Snakemake

See pipelines/snakemake/ for a complete example.

rule yallhap:
    input:
        vcf="{sample}.vcf.gz"
    output:
        json="{sample}.haplogroup.json"
    params:
        tree=config["yallhap_tree"],
        snp_db=config["yallhap_snps"]
    shell:
        """
        yallhap classify {input.vcf} \
            --tree {params.tree} \
            --snp-db {params.snp_db} \
            --output {output.json}
        """

Experimental Features

Bayesian Mode

A Bayesian classification mode is available that computes posterior probabilities over tree paths using log-likelihood ratios:

# For modern samples
yallhap classify sample.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --bayesian \
    --output result.json

# For ancient DNA (recommended for 4-10% variant density)
yallhap classify ancient.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --ancient \
    --bayesian \
    --output result.json

Performance: On modern high-coverage samples (1000 Genomes, gnomAD), Bayesian mode produces results identical to heuristic mode, so it offers no accuracy improvement there. For ancient DNA with moderate variant density (4–10%), Bayesian ancient mode improves accuracy by 12–24 percentage points over heuristic mode (71.7% vs 52.4%). On the full AADR ancient DNA dataset (7,333 samples), Bayesian ancient mode achieves 90.7% accuracy vs 88.3% for heuristic transversions-only mode.

This mode incorporates allelic depth (AD) information when available and uses adjusted error rates for ancient DNA damage modeling. For modern samples, heuristic mode is recommended for speed; for ancient DNA at 4–10% variant density, Bayesian mode is recommended for improved accuracy.
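
As a generic illustration of turning per-path log-likelihoods into posterior probabilities, the sketch below normalizes them under a uniform prior using the log-sum-exp trick. It is not yallHap's implementation: the uniform prior, the function name and the example numbers are all assumptions.

```python
import math


def posteriors(log_likelihoods: dict[str, float]) -> dict[str, float]:
    """Normalize log-likelihoods to posteriors, assuming a uniform prior."""
    if not log_likelihoods:
        raise ValueError("at least one candidate path is required")
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_likelihoods.values())
    weights = {k: math.exp(v - peak) for k, v in log_likelihoods.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Made-up log-likelihoods for three candidate paths.
demo = {"R-L21": -10.0, "R-P312": -13.0, "I-M253": -40.0}
for path, prob in posteriors(demo).items():
    print(f"{path}: {prob:.4f}")
```

Differences between log-likelihoods are log-likelihood ratios, so a path that is 3 log units better than its nearest rival ends up with roughly 95% of the posterior mass in this toy example.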

Development

# Clone repository
git clone https://github.com/trianglegrrl/yallhap.git
cd yallhap

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run linters
black src/ tests/
ruff check src/ tests/
mypy src/

Citation

If you use yallHap in your research, please cite:

@software{yallhap,
  title = {yallHap: Modern Y-chromosome haplogroup inference},
  year = {2025},
  url = {https://github.com/trianglegrrl/yallhap}
}

License

PolyForm Noncommercial License 1.0.0 - see LICENSE for details.

This license allows use for noncommercial purposes, including research, education, and personal projects. For commercial use, please contact the maintainers.

Acknowledgments

YFull for maintaining the comprehensive Y-chromosome phylogeny
YBrowse for the SNP database
Yleaf and pathPhynder for algorithmic inspiration
1000 Genomes Project and AADR for validation data
