Modern, pipeline-friendly Y-chromosome haplogroup inference.
- YFull tree: Uses the most comprehensive Y-chromosome phylogeny (185,780+ SNPs)
- Probabilistic scoring: Likelihood-based confidence scores, not just SNP counting
- Ancient DNA support: Built-in damage filtering, transversions-only mode, quality rescaling
- Multiple references: Supports GRCh37, GRCh38, and T2T-CHM13v2.0 with automatic liftover
- Multi-threaded: Parallel sample processing with --threads N for population-scale studies
- Batch processing: Classify thousands of samples efficiently with classify_batch()
- Pipeline-friendly: Proper exit codes, JSON/TSV output, Nextflow/Snakemake examples
- Bioconda/Docker: Easy installation and containerized execution
Validated against established datasets:
| Dataset | Samples | Same Major Lineage | Reference | Notes |
|---|---|---|---|---|
| 1000 Genomes Phase 3 | 1,233 | 99.8% (95% CI: 99.3-100%) | GRCh37 | Modern WGS, heuristic mode |
| AADR Ancient DNA | 7,333 | 90.7% Bayesian / 88.3% Heuristic | GRCh37 | Full dataset, stratified by variant density |
| gnomAD HGDP/1KG | 1,231 | 99.9% (95% CI: 99.5-100%) | GRCh38 | High-coverage WGS |
1000 Genomes details:
- Only 3 misclassified samples (2 rare A0 haplogroups, 1 NO/K confusion)
- Mean confidence: 0.994
- Mean derived SNPs: 15.4
AADR Ancient DNA details (7,333 samples):
- Overall: 90.7% accuracy with Bayesian ancient mode vs 88.3% with heuristic transversions-only
- Stratified by variant density: <1% (33.7%), 1-4% (37.9%), 4-10% (71.7%), 10-50% (97.8%), ≥50% (99.0%)
- At ≥10% variant density, both modes achieve 97-99% accuracy, comparable to modern WGS
- Bayesian mode recommended for 4-10% variant density (+12-24 pp improvement)
- Variant density = (called variants / total variants in chrY VCF) × 100%
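As a rough illustration of the variant-density definition above, the sketch below counts the chrY records in which one sample has a non-missing genotype and divides by the total number of chrY records. Treating "called" as "the GT field contains no `.`", and the helper name `variant_density`, are assumptions made for illustration; this is not yallHap's own implementation.

```python
import io


def variant_density(vcf_text, sample, chroms=("Y", "chrY")):
    """Percent of chrY records with a non-missing genotype for `sample`."""
    total = 0
    called = 0
    sample_col = None
    for raw in io.StringIO(vcf_text):
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            sample_col = fields.index(sample)  # ValueError if sample is absent
            continue
        if sample_col is None:
            raise ValueError("VCF header line (#CHROM ...) not found")
        if fields[0] not in chroms:
            continue  # only Y-chromosome records count
        total += 1
        gt_index = fields[8].split(":").index("GT")
        genotype = fields[sample_col].split(":")[gt_index]
        if "." not in genotype:  # assumption: "." marks a missing call
            called += 1
    return 100.0 * called / total if total else 0.0


# Tiny made-up VCF: five chrY records (two called) and one autosomal record.
HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1"
RECORDS = [
    "chrY\t100\t.\tA\tG\t50\tPASS\t.\tGT\t1",
    "chrY\t200\t.\tC\tT\t50\tPASS\t.\tGT\t.",
    "chrY\t300\t.\tG\tA\t50\tPASS\t.\tGT:DP\t0:3",
    "chrY\t400\t.\tT\tC\t50\tPASS\t.\tGT\t.",
    "chrY\t500\t.\tA\tC\t50\tPASS\t.\tGT\t./.",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1",
]
EXAMPLE_VCF = "\n".join(["##fileformat=VCFv4.2", HEADER, *RECORDS])

print(variant_density(EXAMPLE_VCF, "S1"))  # 2 of 5 chrY records called
```

For real data the same loop could read from `gzip.open(path, "rt")` instead of an in-memory string.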
gnomAD High-Coverage details:
- 200 samples randomly selected from 1,231 overlapping with 1000 Genomes
- 30× high-coverage whole-genome sequencing
- Mean derived SNPs: 26.7
- 95% confidence interval: 98.17-100%
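The 98.17% lower bound matches what an exact (Clopper-Pearson) binomial interval gives when all 200 of 200 samples are concordant. Whether the validation actually used this method is an assumption; the sketch below only shows the arithmetic for the all-correct special case, where the lower bound reduces to (α/2)^(1/n) and the upper bound is 1.

```python
def exact_lower_bound_all_correct(n, alpha=0.05):
    """Clopper-Pearson lower bound when all n trials succeed.

    With x == n the beta-quantile formula simplifies to (alpha / 2) ** (1 / n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return (alpha / 2.0) ** (1.0 / n)


lower = exact_lower_bound_all_correct(200)
print(f"95% CI: {100 * lower:.2f}-100%")  # 95% CI: 98.17-100%
```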
See VALIDATION_TESTING.md for reproducible validation protocols.
# From PyPI
pip install yallhap

# From Bioconda
conda install -c bioconda yallhap

# Docker image
docker pull trianglegrrl/yallhap

# From source
git clone https://github.com/trianglegrrl/yallhap.git
cd yallhap
pip install -e ".[dev]"

yallhap download --output-dir data/
This downloads:
- YFull tree JSON (~14 MB)
- YBrowse SNP database for GRCh38 (~430 MB)
- YBrowse SNP database for GRCh37 (~50 MB)
Use the SNP database matching your VCF's reference genome:
# For GRCh38/hg38 VCFs
yallhap classify sample.vcf.gz \
--tree data/yfull_tree.json \
--snp-db data/ybrowse_snps_grch38.csv \
--reference grch38 \
--output result.json
# For GRCh37/hg19 VCFs
yallhap classify sample.vcf.gz \
--tree data/yfull_tree.json \
--snp-db data/ybrowse_snps_grch37.csv \
--reference grch37 \
--output result.json
cat result.json | jq '.haplogroup, .confidence'
# "R-L21"
# 0.97
yallhap classify sample.vcf.gz \
--tree data/yfull_tree.json \
--snp-db data/ybrowse_snps_grch38.csv \
--reference grch38 \
--output result.json
For VCFs containing multiple samples, specify which sample to classify:
yallhap classify multi_sample.vcf.gz \
--tree data/yfull_tree.json \
--snp-db data/ybrowse_snps_grch38.csv \
--sample NA12878 \
--output result.json
Process multiple VCF files into a single TSV:
yallhap batch sample1.vcf.gz sample2.vcf.gz sample3.vcf.gz \
--tree data/yfull_tree.json \
--snp-db data/ybrowse_snps_grch38.csv \
--output results.tsv
Use multiple threads for faster batch processing:
yallhap batch samples/*.vcf.gz \
--tree data/yfull_tree.json \
--snp-db data/ybrowse_snps_grch38.csv \
--threads 16 \
--output results.tsv
With 16 threads, processing 1,000+ samples takes approximately 10 minutes.
Use --format tsv for tab-separated output (useful for pipelines):
yallhap classify sample.vcf.gz \
--tree data/yfull_tree.json \
--snp-db data/ybrowse_snps_grch38.csv \
--format tsv \
--output result.tsv
yallHap supports three reference genomes. Use the SNP database matching your VCF's reference:
| VCF Reference | SNP Database | -r flag |
|---|---|---|
| GRCh37/hg19 | ybrowse_snps_grch37.csv | grch37 |
| GRCh38/hg38 | ybrowse_snps_grch38.csv | grch38 |
| T2T-CHM13v2.0 | ybrowse_snps_grch38.csv | t2t |
# GRCh37 (hg19) - 1000 Genomes Phase 3, many ancient DNA datasets
yallhap classify sample.vcf.gz \
-s data/ybrowse_snps_grch37.csv -r grch37 ...
# GRCh38 (hg38) - current standard, gnomAD, most modern studies
yallhap classify sample.vcf.gz \
-s data/ybrowse_snps_grch38.csv -r grch38 ...
# T2T-CHM13v2.0 - complete Y chromosome (62 Mb)
yallhap classify sample.vcf.gz \
-s data/ybrowse_snps_grch38.csv -r t2t ...
T2T Note: T2T coordinates are computed automatically via liftover from GRCh38 positions. Ensure liftover chain files are available (run python scripts/download_liftover_chains.py).
yallHap includes specialized handling for ancient DNA samples with post-mortem damage.
For ancient DNA samples with moderate variant density (4–10%), Bayesian ancient mode is recommended, achieving +12–24 percentage point improvement over heuristic mode in this range:
yallhap classify ancient.vcf.gz \
--tree data/yfull_tree.json \
--snp-db data/ybrowse_snps_grch38.csv \
--ancient \
--bayesian \
--output result.json
Variant density is calculated as (called variants / total variants in chrY VCF) × 100%, so you can compute it directly from your VCF. At ≥10% variant density, both modes achieve comparable accuracy (97–99%); below 4%, classification is unreliable regardless of mode.
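The thresholds above can be turned into a small pre-flight check before choosing flags. The sketch below is only an illustration of that guidance: the function name `recommend_flags` is hypothetical, and it returns the CLI flags documented in this README rather than calling yallHap.

```python
def recommend_flags(variant_density_pct):
    """Map a variant density (in percent) to suggested ancient-DNA flags.

    Returns None when density is below 4%, where classification is
    described as unreliable regardless of mode.
    """
    if variant_density_pct < 0 or variant_density_pct > 100:
        raise ValueError("variant density must be between 0 and 100")
    if variant_density_pct < 4.0:
        return None  # too sparse to classify reliably
    if variant_density_pct < 10.0:
        return ["--ancient", "--bayesian"]  # Bayesian ancient mode recommended
    # At >=10% both modes perform comparably; plain ancient mode is one option.
    return ["--ancient"]


for density in (2.5, 6.0, 35.0):
    print(density, recommend_flags(density))
```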
Filters C>T and G>A transitions at read termini:
yallhap classify ancient.vcf.gz \
--tree data/yfull_tree.json \
--snp-db data/ybrowse_snps_grch38.csv \
--ancient \
--min-depth 1 \
--output result.json
Strictest mode for heavily damaged samples (ignores all transitions):
yallhap classify ancient.vcf.gz \
--tree data/yfull_tree.json \
--snp-db data/ybrowse_snps_grch38.csv \
--transversions-only \
--output result.json
Downweight potentially damaged variants without excluding them:
yallhap classify ancient.vcf.gz \
--tree data/yfull_tree.json \
--snp-db data/ybrowse_snps_grch38.csv \
--ancient \
--damage-rescale moderate \
--output result.json
Options for --damage-rescale:
- none (default): No rescaling
- moderate: 50% weight reduction for damage-like transitions
- aggressive: 80% weight reduction
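To make the difference between the ancient DNA options concrete, here is a minimal sketch of how a per-SNP weight could be derived from the substitution type. The 50% and 80% reductions come from the option list above; the function `snp_weight`, the treatment of C>T and G>A as the only "damage-like" changes, and the idea of a single multiplicative weight are assumptions for illustration, not yallHap's internal logic.

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
# Deamination damage typically appears as C>T (and G>A on the other strand).
DAMAGE_LIKE = {("C", "T"), ("G", "A")}
# Remaining weight after the documented reduction (50% and 80%).
RESCALE_FACTOR = {"none": 1.0, "moderate": 0.5, "aggressive": 0.2}


def snp_weight(ref, alt, damage_rescale="none", transversions_only=False):
    """Return a weight in [0, 1] for one observed substitution."""
    change = (ref.upper(), alt.upper())
    if transversions_only and change in TRANSITIONS:
        return 0.0  # strictest mode: every transition is ignored
    if change in DAMAGE_LIKE:
        return RESCALE_FACTOR[damage_rescale]  # downweight, do not exclude
    return 1.0


print(snp_weight("C", "T", damage_rescale="moderate"))   # 0.5
print(snp_weight("A", "C", damage_rescale="aggressive")) # 1.0
print(snp_weight("T", "C", transversions_only=True))     # 0.0
```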
from yallhap.tree import Tree
from yallhap.snps import SNPDatabase
from yallhap.classifier import HaplogroupClassifier
# Load resources
tree = Tree.from_json("data/yfull_tree.json")
snp_db = SNPDatabase.from_csv("data/ybrowse_snps_grch38.csv")
# Create classifier
classifier = HaplogroupClassifier(
    tree=tree,
    snp_db=snp_db,
    reference="grch38",
)
# Classify
result = classifier.classify("sample.vcf.gz")
print(f"{result.sample}: {result.haplogroup} (confidence: {result.confidence:.2f})")
For multi-sample VCFs, classify_batch() is 10x faster than calling classify() repeatedly:
# Get list of sample names to classify
samples = ["NA12878", "NA12891", "NA12892"]
# Classify all samples in one pass
results = classifier.classify_batch("multi_sample.vcf.gz", samples)
for result in results:
    print(f"{result.sample}: {result.haplogroup}")
# Recommended: Bayesian ancient mode for moderate variant density (4-10%)
classifier = HaplogroupClassifier(
    tree=tree,
    snp_db=snp_db,
    reference="grch37",
    ancient_mode=True,
    bayesian=True,  # Recommended for 4-10% variant density
)
# Alternative: Transversions-only mode (strictest filtering)
classifier = HaplogroupClassifier(
    tree=tree,
    snp_db=snp_db,
    reference="grch37",
    ancient_mode=True,
    transversions_only=True,
    damage_rescale="moderate",
)
{
"sample": "SAMPLE1",
"haplogroup": "R-L21",
"confidence": 0.97,
"reference": "grch38",
"tree_version": "YFull (185780 SNPs, hash: a1b2c3d4)",
"snp_stats": {
"informative_tested": 1247,
"derived": 145,
"ancestral": 1089,
"missing": 13,
"filtered_damage": 0
},
"quality_scores": {
"qc1_backbone": 0.98,
"qc2_terminal": 1.0,
"qc3_path": 0.95,
"qc4_posterior": 0.97
},
"path": ["ROOT", "A0-T", "A1", "...", "R-L21"],
"defining_snps": ["L21"]
}
The tree_version field includes a hash of the tree file content, enabling exact reproducibility. When citing yallHap results, include the tree_version value to document the exact phylogeny version used. The format is:
YFull (<snp_count> SNPs, hash: <8-char SHA256>)
Example: "YFull (185780 SNPs, hash: a1b2c3d4)"
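A short sketch of how such a string could be produced and parsed when recording provenance. It assumes the hash is the first 8 hexadecimal characters of the SHA-256 digest of the raw tree file bytes; the README states only "8-char SHA256", so the exact input to the hash is an assumption, and both function names are hypothetical.

```python
import hashlib
import re

TREE_VERSION_RE = re.compile(r"^YFull \((\d+) SNPs, hash: ([0-9a-f]{8})\)$")


def format_tree_version(tree_bytes, snp_count):
    """Build a tree_version string from file content and an SNP count."""
    digest = hashlib.sha256(tree_bytes).hexdigest()[:8]
    return f"YFull ({snp_count} SNPs, hash: {digest})"


def parse_tree_version(value):
    """Split a tree_version string into (snp_count, short_hash)."""
    match = TREE_VERSION_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognized tree_version: {value!r}")
    return int(match.group(1)), match.group(2)


print(parse_tree_version("YFull (185780 SNPs, hash: a1b2c3d4)"))
# (185780, 'a1b2c3d4')
```

Comparing the parsed hash across result files is a quick way to confirm that a set of samples was classified against the same tree.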
sample haplogroup confidence qc1 qc2 qc3 qc4 derived ancestral missing
SAMPLE1 R-L21 0.9700 0.9800 1.0000 0.9500 0.9700 145 1089 13
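The TSV output lends itself to simple downstream filtering. The sketch below reads the columns shown above with Python's `csv` module and lists samples below a confidence threshold. The second row (SAMPLE2) is made-up example data, and the helper name is hypothetical.

```python
import csv
import io

COLUMNS = ["sample", "haplogroup", "confidence", "qc1", "qc2", "qc3", "qc4",
           "derived", "ancestral", "missing"]
ROWS = [
    ["SAMPLE1", "R-L21", "0.9700", "0.9800", "1.0000", "0.9500", "0.9700",
     "145", "1089", "13"],
    # Hypothetical low-confidence sample, for illustration only.
    ["SAMPLE2", "R-M269", "0.8000", "0.9000", "0.7000", "0.8500", "0.8000",
     "40", "300", "907"],
]
EXAMPLE_TSV = "\n".join("\t".join(row) for row in [COLUMNS, *ROWS]) + "\n"


def low_confidence_samples(tsv_text, threshold=0.95):
    """Return sample names whose confidence is below `threshold`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["sample"] for row in reader
            if float(row["confidence"]) < threshold]


print(low_confidence_samples(EXAMPLE_TSV))  # ['SAMPLE2']
```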
| Code | Meaning |
|---|---|
| 0 | Success (high confidence, ≥0.95) |
| 1 | Classification failed (no haplogroup) |
| 2 | Low confidence (<0.95) |
| 10 | File not found |
| 11 | Invalid input |
| 99 | Unexpected error |
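A wrapper script can branch on these codes instead of parsing output. The sketch below encodes the table as a lookup plus a small policy helper; the helper names and the choice to optionally accept low-confidence calls are illustrative assumptions.

```python
EXIT_CODES = {
    0: "Success (high confidence, >=0.95)",
    1: "Classification failed (no haplogroup)",
    2: "Low confidence (<0.95)",
    10: "File not found",
    11: "Invalid input",
    99: "Unexpected error",
}


def describe_exit_code(code):
    """Translate a yallhap exit code into the meaning from the table."""
    return EXIT_CODES.get(code, f"Unknown exit code {code}")


def should_keep_result(code, accept_low_confidence=False):
    """Decide whether a pipeline should keep the output for this run."""
    return code == 0 or (accept_low_confidence and code == 2)


for code in (0, 2, 10):
    print(code, describe_exit_code(code), should_keep_result(code))
```

In practice the code would come from something like `subprocess.run([...]).returncode` after invoking the CLI.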
| Score | Name | Description |
|---|---|---|
| QC1 | Backbone | Intermediate markers on path to haplogroup match expected states |
| QC2 | Terminal | Defining markers for called haplogroup are present |
| QC3 | Path | Consistency within the called haplogroup branch |
| QC4 | Posterior | Overall posterior probability from likelihood calculation |
Classify a single VCF file.
Usage: yallhap classify [OPTIONS] VCF
Options:
-t, --tree PATH Path to YFull tree JSON [required]
-s, --snp-db PATH Path to SNP database CSV [required]
-r, --reference TEXT Reference genome: grch37, grch38, t2t [default: grch38]
--sample TEXT Sample name (for multi-sample VCFs)
--ancient Enable ancient DNA mode
--transversions-only Only use transversions (strictest aDNA mode)
--damage-rescale TEXT Rescale quality: none, moderate, aggressive
--min-depth INTEGER Minimum read depth [default: 1]
--min-quality INTEGER Minimum base quality [default: 20]
-o, --output PATH Output file (stdout if omitted)
--format TEXT Output format: json, tsv [default: json]
Batch process multiple VCF files.
Usage: yallhap batch [OPTIONS] VCF_FILES...
Options:
-t, --tree PATH Path to YFull tree JSON [required]
-s, --snp-db PATH Path to SNP database CSV [required]
-r, --reference TEXT Reference genome: grch37, grch38, t2t [default: grch38]
--ancient Enable ancient DNA mode
--transversions-only Only use transversions
--damage-rescale TEXT Rescale quality: none, moderate, aggressive
-o, --output PATH Output TSV file [required]
--threads INTEGER Parallel threads [default: 1]
Download reference data (YFull tree + SNP databases for all reference genomes).
Usage: yallhap download [OPTIONS]
Options:
-o, --output-dir PATH Output directory [default: data/]
-f, --force Overwrite existing files
Downloads:
- yfull_tree.json - YFull phylogenetic tree (~14 MB)
- ybrowse_snps_grch38.csv - SNP positions for GRCh38/hg38 (~430 MB)
- ybrowse_snps_grch37.csv - SNP positions for GRCh37/hg19 (~50 MB)
See pipelines/nextflow/ for a complete example.
process YALLHAP {
    input:
    path vcf

    output:
    path "*.json"

    script:
    """
    yallhap classify ${vcf} \
        --tree ${params.tree} \
        --snp-db ${params.snp_db} \
        --reference ${params.reference} \
        --output ${vcf.baseName}.json
    """
}
See pipelines/snakemake/ for a complete example.
rule yallhap:
    input:
        vcf="{sample}.vcf.gz"
    output:
        json="{sample}.haplogroup.json"
    params:
        tree=config["yallhap_tree"],
        snp_db=config["yallhap_snps"]
    shell:
        """
        yallhap classify {input.vcf} \
            --tree {params.tree} \
            --snp-db {params.snp_db} \
            --output {output.json}
        """
A Bayesian classification mode is available that computes posterior probabilities over tree paths using log-likelihood ratios:
# For modern samples
yallhap classify sample.vcf.gz \
--tree data/yfull_tree.json \
--snp-db data/ybrowse_snps_grch38.csv \
--bayesian \
--output result.json
# For ancient DNA (recommended for 4-10% variant density)
yallhap classify ancient.vcf.gz \
--tree data/yfull_tree.json \
--snp-db data/ybrowse_snps_grch38.csv \
--ancient \
--bayesian \
--output result.json
Performance: On modern high-coverage samples (1000 Genomes, gnomAD), Bayesian mode produces results identical to heuristic mode, so it offers no accuracy improvement there. For ancient DNA with moderate variant density (4–10%), however, Bayesian ancient mode achieves a +12–24 percentage point improvement over heuristic mode (71.7% vs 52.4% accuracy). On the full AADR ancient DNA dataset (7,333 samples), Bayesian ancient mode achieves 90.7% accuracy vs 88.3% for heuristic transversions-only mode.
This mode incorporates allelic depth (AD) information when available and uses adjusted error rates for ancient DNA damage modeling. For modern samples, heuristic mode is recommended for speed; for ancient DNA at 4–10% variant density, Bayesian mode is recommended for improved accuracy.
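To give a feel for what "posterior probabilities over tree paths" means, here is a toy sketch: each candidate haplogroup is represented by the set of SNPs expected to be derived along its path, each observed call contributes a log-likelihood term under a symmetric error rate, and the totals are normalized with log-sum-exp under a uniform prior. Everything here (the function name, the error model, the three-node tree, the omission of allelic depth) is a simplifying assumption and should not be read as yallHap's actual algorithm.

```python
import math


def path_posteriors(paths, calls, error_rate=0.01):
    """Posterior over candidate haplogroups given derived/ancestral calls.

    paths: {haplogroup: set of SNP names expected to be derived on its path}
    calls: {snp_name: "derived" or "ancestral"}
    """
    log_like = {}
    for haplogroup, expected_derived in paths.items():
        total = 0.0
        for snp, call in calls.items():
            agrees = (call == "derived") == (snp in expected_derived)
            total += math.log(1.0 - error_rate if agrees else error_rate)
        log_like[haplogroup] = total
    # Normalize in log space (log-sum-exp) to avoid underflow.
    top = max(log_like.values())
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_like.values()))
    return {h: math.exp(v - log_norm) for h, v in log_like.items()}


# Toy three-node path R -> R-M269 -> R-L21.
PATHS = {
    "R": {"M207"},
    "R-M269": {"M207", "M269"},
    "R-L21": {"M207", "M269", "L21"},
}
CALLS = {"M207": "derived", "M269": "derived", "L21": "derived"}

posteriors = path_posteriors(PATHS, CALLS)
best = max(posteriors, key=posteriors.get)
print(best, round(posteriors[best], 3))
```

Raising `error_rate` flattens the posterior, which is one simple way to picture why damaged ancient samples yield lower confidence for the same set of calls.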
# Clone repository
git clone https://github.com/trianglegrrl/yallhap.git
cd yallhap
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Run linters
black src/ tests/
ruff check src/ tests/
mypy src/
If you use yallHap in your research, please cite:
@software{yallhap,
title = {yallHap: Modern Y-chromosome haplogroup inference},
year = {2025},
url = {https://github.com/trianglegrrl/yallhap}
}
PolyForm Noncommercial License 1.0.0 - see LICENSE for details.
This license allows use for noncommercial purposes, including research, education, and personal projects. For commercial use, please contact the maintainers.