Skip to content

grgong/pie

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

137 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pie -- piN/piS Estimator

pie logo

Estimate nonsynonymous (piN) and synonymous (piS) nucleotide diversity from pool-seq or individual-sequencing data on large eukaryotic genomes.

Two analysis modes:

  • Pool mode (pie pool): Derives allele frequencies from AD (allele depth) fields in pool-seq VCFs.
  • Individual mode (pie ind): Derives allele frequencies from GT (genotype) fields across multiple diploid samples — a site-frequency based approximation consistent with pairwise diversity under the model assumptions.

Comparison with SNPGenie

pie reimplements the Nei-Gojobori pool-seq method from SNPGenie (Nelson et al., 2015) in a numpy-vectorized Python stack with gene-level multiprocessing, and extends it with an individual-sequencing mode.

Stop codon site counting. SNPGenie excludes stop-producing mutations from site counting: for a codon position where one of three possible changes introduces a stop codon, only the two remaining sense→sense changes are counted (e.g., AGA pos 1: N = 1/2, S = 1/2). pie follows this same convention by default, consistent with the classical NG86 method. Use --include-stop-codons to instead count stop-gained mutations as nonsynonymous (AGA pos 1: N = 2/3, S = 1/3), following the convention used by tools such as PAML and libsequence.

Multiallelic sites and frequency overlay. When a VCF is decomposed with bcftools norm -m-, a multiallelic site (e.g., ALT=C,G) becomes two separate records at the same position. SNPGenie processes each record independently with an incremental subtract-from-ref / add-to-alt strategy, which can cause the reference allele frequency to be double-subtracted. pie skips multiallelic sites by default. With --keep-multiallelic, it groups records by genomic position and recomputes frequencies from raw allele depths (freq(alt_i) = AD_i / (AD_ref + ΣAD_alt)). For all sites, alt frequencies are assigned directly and the reference is set to 1 − Σalt.

Simplified annotation input. SNPGenie requires a specific GTF format with a single isoform per gene, which typically involves manual format conversion and isoform filtering. pie accepts GFF3 or GTF directly (auto-detected) and automatically selects the longest isoform per gene — no preprocessing needed.

Auto-indexing. Plain VCF files are automatically bgzipped and tabix-indexed; FASTA files are indexed via pysam if the .fai is missing. No manual preprocessing required.

Performance. Core computations are numpy-vectorized and genes are processed in parallel via multiprocessing.Pool.

Validation. A regression test suite pins piN/piS values for 400 real Acyrthosiphon pisum genes, covering both strand orientations and multiallelic sites. With unified inputs, synonymous/nonsynonymous site counts match SNPGenie exactly (r = 1.0) and piN/piS values agree within ~1%, fully explained by the Nei (1987) sample-size correction that SNPGenie applies (π = n/(n−1)·(1−Σfᵢ²)) vs. the population parameter pie uses (π = 2p(1−p)). See benchmark/ for details.

Installation

git clone https://github.com/grgong/pie.git
cd pie

# Create conda environment (includes editable install of pie)
mamba env create -f environment.yml
mamba activate pie

# For development (adds pytest)
pip install -e ".[dev]"

Requires Python >= 3.12. Key dependencies: cyvcf2, gffutils, pysam, numpy, pandas, plotnine, click.

Quick start

# Pool-seq mode
pie pool -v variants.vcf.gz -g genes.gff3 -f reference.fa -o results/ -t 8

# Pool-seq with stricter filters
pie pool -v variants.vcf.gz -g genes.gff3 -f reference.fa -o results/ \
  -d 20 -q 30 --pass-only -t 8

# Individual-sequencing mode (all samples)
pie ind -v multi_sample.vcf.gz -g genes.gff3 -f reference.fa -o results/ -t 8

# Individual mode with selected samples
pie ind -v multi_sample.vcf.gz -g genes.gff3 -f reference.fa -o results/ \
  -S S1,S2,S3

# Individual mode with sample list file
pie ind -v multi_sample.vcf.gz -g genes.gff3 -f reference.fa -o results/ \
  --samples-file samples.txt --min-call-rate 0.9

# Per-variant annotation table
pie pool -v variants.vcf.gz -g genes.gff3 -f reference.fa -o results/ --variant-table

# View summary
pie summary results/summary.tsv

# Generate plots
pie plot manhattan -i results/gene_results.tsv -o manhattan.png
pie plot scatter -i results/gene_results.tsv -o scatter.png --color-by-chrom
pie plot histogram -i results/gene_results.tsv -o histogram.png
pie plot boxplot -i results/gene_results.tsv -o boxplot.png
pie plot sliding-window -i results/window_results.tsv -o sw.png

Input requirements

Input Format Notes
VCF .vcf or .vcf.gz Single- or multi-sample. Multiallelic sites are skipped by default; use --keep-multiallelic to merge them. Accepts both decomposed (bcftools norm -m-) and non-decomposed VCFs. Plain VCF is auto-bgzipped; missing .tbi index is auto-created. See Multi-sample VCFs for sample selection.
Annotation GFF3 or GTF Must contain gene, mRNA/transcript, and CDS features. Format is auto-detected.
Reference FASTA Must match the VCF reference. Auto-indexed by pysam if .fai is missing.

In pool mode, allele frequencies are extracted from the AD (allele depth) format field, falling back to INFO AF/DP if AD is unavailable. In individual mode, allele frequencies are derived from GT (genotype) fields: freq = alt_allele_count / (2 × called_samples).

CLI reference

All commands support -h/--help. Use pie -V to print the version.

pie pool

Pool-seq mode (allele-frequency based).

pie pool -v FILE -g FILE -f FILE -o DIR [OPTIONS]

Shared options (available in both pool and ind):

Option Short Default Description
--vcf -v required Input VCF file (bgzipped or plain)
--gff -g required GFF3 or GTF annotation file
--fasta -f required Reference FASTA file (indexed)
--outdir -o required Output directory (created if absent)
--min-freq 0.01 Minimum alt allele frequency
--min-qual -q 20.0 Minimum variant QUAL score
--pass-only off Only use PASS-filtered variants
--keep-multiallelic off Keep and merge multiallelic sites instead of skipping them
--include-stop-codons off Count stop-gained mutations as nonsynonymous (by default they are excluded, matching NG86/SNPGenie)
--window-size -w 1000 Sliding window size in bp
--window-step -W 100 Sliding window step in bp
--threads -t 1 Number of parallel worker processes
--variant-table off Write per-variant annotation table (variant_results.tsv)

Pool-only options:

Option Short Default Description
--min-depth -d 10 Minimum read depth at variant site
--sample -s Sample name to analyse (required for multi-sample VCFs)

pie ind

Individual-sequencing mode (genotype based).

pie ind -v FILE -g FILE -f FILE -o DIR [OPTIONS]

Individual-only options (plus all shared options above):

Option Short Default Description
--samples -S all Comma-separated sample names (defaults to all VCF samples)
--samples-file File with one sample name per line (mutually exclusive with --samples)
--min-call-rate 0.8 Minimum genotype call rate per site (range 0–1)
--min-an 2 Minimum allele number (AN) per site

pie plot

Create publication-ready plots from piN/piS results. Five subcommands are available; all produce 300 DPI output in PNG, PDF, or SVG (auto-detected from the file extension). Default figure sizes follow NPG (Nature) journal specs.

Shared options (all subcommands except sliding-window):

Option Short Default Description
--input -i required Input TSV file
--output -o required Output plot path
--width -W (per subcommand) Figure width in inches
--height -H (per subcommand) Figure height in inches
--dpi 300 Resolution in dots per inch
--min-codons Exclude genes with fewer than N codons
--min-variants Exclude genes with fewer than N variants

pie plot manhattan

Genome-wide Manhattan plot of per-gene piN/piS. Input: gene_results.tsv. Default size: 7.2 × 3.5 in (double column).

Option Default Description
--log-scale off Use log₂ scale for piN/piS y-axis
--label-top Label top N outlier genes
--highlight-genes Comma-separated gene IDs to label
--max-ratio 2.0 Exclude genes with piN/piS above this value
--exclude-zero-ratio off Exclude genes with piN/piS = 0

pie plot scatter

piN vs piS scatter plot with point size reflecting gene length. Input: gene_results.tsv. Default size: 4.7 × 4.7 in (1.5 column).

Option Default Description
--color-by-chrom off Color points by chromosome
--max-piN 2.0 Exclude genes with piN above this value
--max-piS 2.0 Exclude genes with piS above this value

pie plot histogram

Distribution of piN/piS ratios with density overlay. Input: gene_results.tsv. Default size: 3.5 × 3.0 in (single column).

Option Default Description
--max-ratio 2.0 Exclude genes with piN/piS above this value
--exclude-zero-ratio off Exclude genes with piN/piS = 0

pie plot boxplot

Per-chromosome boxplots of piN, piS, and piN/piS (three faceted panels). Each metric is filtered independently. Input: gene_results.tsv. Default size: 3.5 × 7.0 in (single column).

Option Default Description
--max-piN 2.0 Exclude genes with piN above this value
--max-piS 2.0 Exclude genes with piS above this value
--max-ratio 2.0 Exclude genes with piN/piS above this value
--exclude-zero-ratio off Exclude genes with piN/piS = 0

pie plot sliding-window

Sliding window piN/piS line plot, one row per chromosome. Input: window_results.tsv. Default size: 7.2 in wide × 1.5 in per chromosome.

Option Default Description
--max-ratio 2.0 Exclude windows with piN/piS above this value

Note: --min-codons and --min-variants are not available for sliding-window because it operates on window-level data, not per-gene data.

pie summary

Print genome-wide summary statistics to the terminal.

pie summary SUMMARY_FILE

Takes the path to summary.tsv as a positional argument.

Multi-sample VCFs

Pool mode

By default, pie expects a single-sample VCF. If the VCF contains two or more samples, pie will abort and list the available sample names. Use --sample to select one:

pie pool -v multi.vcf.gz -g genes.gff3 -f ref.fa -o results/ -s pool_A

When --sample is given, output files are prefixed with the sample name (e.g., pool_A.gene_results.tsv, pool_A.window_results.tsv, pool_A.summary.tsv). Single-sample VCFs do not require --sample and produce unprefixed filenames as before.

Individual mode

In individual mode (pie ind), pie uses all samples in the VCF by default. Use --samples (comma-separated) or --samples-file (one name per line) to select a subset. See the Quick start for examples.

Allele frequencies are computed per site as AC / AN, where AC is the alternate allele count and AN = 2 × called_samples (diploid). Sites with call rate below --min-call-rate or AN below --min-an are skipped.

Output files

All outputs are written to --outdir. In pool mode, when --sample is specified, each filename is prefixed with {sample}. (e.g., pool_A.gene_results.tsv).

gene_results.tsv

One row per gene. Columns:

Column Description
chrom Chromosome or scaffold
gene_id Gene identifier from annotation
transcript_id Selected longest isoform
start, end Gene genomic coordinates
strand + or -
n_codons Codons analyzed (excluding stop codons)
n_poly_codons Codons containing polymorphic sites
N_sites Total nonsynonymous sites
S_sites Total synonymous sites
N_diffs Frequency-weighted nonsynonymous differences
S_diffs Frequency-weighted synonymous differences
piN N_diffs / N_sites
piS S_diffs / S_sites
piN_piS piN / piS (NA when piS = 0)
mean_variant_depth Pool mode: mean read depth; individual mode: mean AN across variant sites (0 when no variants)
n_variants SNP count within CDS
n_samples Number of selected samples (individual mode only)
mean_call_rate Mean genotype call rate across variant sites (individual mode only)

window_results.tsv

Sliding windows within each gene's CDS. Columns: chrom, win_start, win_end, gene_id, n_codons, N_sites, S_sites, N_diffs, S_diffs, piN, piS, piN_piS.

summary.tsv

Single-row genome-wide summary: total_genes, total_codons, cds_snp_variants, genome_piN, genome_piS, genome_piN_piS, mean_gene_piN, mean_gene_piS, median_gene_piN, median_gene_piS. In individual mode, two additional columns are included: n_samples_selected and mean_call_rate (variant-site-weighted average).

Genome-wide piN and piS are computed by summing N/S diffs and sites across all genes before dividing (not averaging per-gene ratios).

variant_results.tsv

Per-variant annotation table, produced only when --variant-table is given. One row per CDS SNP that passes quality filters.

Column Description Example
chrom Chromosome/scaffold X
pos 1-based genomic position 458032
ref Reference allele A
alt Alternate allele G
gene_id Gene the variant falls in gene_007
codon_pos Position within codon (1, 2, 3) 2
ref_codon Reference codon GAT
alt_codon Alternate codon GGT
ref_aa Reference amino acid D
alt_aa Alternate amino acid G
variant_class synonymous, nonsynonymous, stop_gained, or stop_lost nonsynonymous
ao Alternate allele observation count 22
ro Reference allele observation count 107
dp Total depth at site 129
af Allele frequency (ao / dp) 0.1705
strand Coding strand (+/−) +
cds_position 1-based nucleotide position within CDS 152
n_sites Fractional nonsynonymous site count at this codon position 0.6667
s_sites Fractional synonymous site count at this codon position 0.3333

Notes:

  • af is computed from read counts (ao / dp), not from the VCF INFO/AF field (which FreeBayes reports as 0.5 for all heterozygous calls in pool-seq mode).
  • In individual mode, ao/ro are allele counts from genotypes (not read counts) and dp is the allele number (AN = 2 × called samples).
  • Only variants within CDS regions that pass quality filters are included. If a variant affects multiple genes/transcripts, the longest-isoform selection applies.

Algorithm

pie uses the Nei-Gojobori (1986) method adapted for allele frequency data from pool-seq or individual-sequencing experiments.

Site counting

For each of the 64 codons, a precomputed lookup table stores fractional nonsynonymous (N) and synonymous (S) site counts at each of the three codon positions. Each position is evaluated by considering all three possible single-nucleotide changes and classifying them as synonymous or nonsynonymous under the standard genetic code. Mutations that introduce a premature stop codon (stop-gained) are excluded from site counting by default, matching the classical NG86/SNPGenie convention. For example, if one of 3 changes produces a stop codon and the other is synonymous, that position contributes N = 1/2 and S = 1/2 (counting only the two sense→sense changes). Use --include-stop-codons to instead count stop-gained mutations as nonsynonymous.

Allele frequency handling

In pool mode, per-site allele frequencies are extracted from the AD (allele depth) format field. In individual mode, frequencies are derived from diploid genotypes across selected samples: freq = AC / AN, where AC is the alternate allele count and AN = 2 × called_samples. Missing genotypes (./.) are excluded; only called samples contribute to AC and AN.

Codon haplotype enumeration

Regardless of mode, per-site allele frequencies are used to enumerate all possible codon haplotypes via the product of per-position frequencies (linkage equilibrium assumption). For a codon with variants at positions 1 and 3:

p1: {A: 0.95, C: 0.05}   p2: {G: 1.0}   p3: {T: 0.80, A: 0.20}
  -> AGT (0.76), AGA (0.19), CGT (0.04), CGA (0.01)

Site counts are the frequency-weighted sum across codon haplotypes. Pairwise differences are computed for all haplotype pairs, weighted by 2 * freq_i * freq_j.

Diversity estimator and sample-size correction

pie computes per-site diversity as the population parameter:

π = 2p(1−p)          (biallelic)
π = Σ_{i<j} 2·f_i·f_j  (multiallelic / codon-level)

SNPGenie instead applies the Nei (1987) unbiased estimator, which includes a finite-sample correction:

π = n/(n−1) · (1 − Σf_i²)

where n is the read depth at the site. The ratio between the two estimators is exactly (n−1)/n, producing a systematic difference of ~1% at 93× depth.

Why pie does not apply this correction. In pool-seq, two layers of sampling exist:

  1. Pool composition — a finite number of individuals (2N chromosomes) is drawn from the population.
  2. Sequencing — reads are drawn from the pooled DNA at depth n.

The Nei (1987) correction was designed for n = number of sampled gene copies (chromosomes), not sequencing reads. SNPGenie substitutes read depth for chromosome count, which corrects only the sequencing-sampling layer and conflates it with the biological-sampling layer. A complete unbiased estimator would require correcting for both levels (Futschik & Schlötterer 2010):

π̂ = 2·p̂·(1−p̂) · n/(n−1) · 2N/(2N−1)

Since the pool size is rarely known with precision and the correction is negligible at typical pool-seq depths (>50×), pie reports the plug-in estimator directly. This is transparent — no implicit assumption about sample size — and has no effect on piN/piS ratios or between-group comparisons, where the systematic bias cancels. For a quantitative concordance analysis, see benchmark/.

Multi-step substitutions

For codon pairs differing at 2 or 3 positions, all shortest mutational pathways are enumerated. N and S differences are averaged across valid pathways (those not passing through a stop codon intermediate).

Edge cases

  • Stop-gained mutations: By default, SNPs that introduce a premature stop codon are excluded from site counting and pairwise difference calculations, matching the classical NG86 method and SNPGenie. With --include-stop-codons, stop-gained mutations are instead counted as nonsynonymous changes, consistent with the convention used by PAML and libsequence.
  • Exon-spanning codons: CDS exons are concatenated in genomic order before codon extraction. Codons split across exon boundaries are handled by tracking per-base genomic positions through the concatenated sequence.
  • Isoform selection: The longest isoform (by total CDS length) is selected per gene; all others are ignored.

Aggregation

gene piN  = sum(codon N_diffs) / sum(codon N_sites)
gene piS  = sum(codon S_diffs) / sum(codon S_sites)
gene piN/piS = piN / piS   (NA if piS = 0)

Parallelization

Genes are processed independently via multiprocessing.Pool. Each worker opens its own VCF and FASTA file handles to avoid shared state.

Testing

# Unit and integration tests (~1.4s)
pytest tests/ -m "not slow"

# Regression tests against real Acyrthosiphon pisum dataset (~12s)
# Requires: tar xzf data/Acyrthosiphon_pisum.tar.gz -C data/
pytest tests/ -m slow

# All tests
pytest tests/

The regression tests pin piN/piS values for 400 genes, covering site counting, multiallelic VCF handling, and both strand orientations.

Citation

This tool implements the averaging method for counting synonymous and nonsynonymous substitutions described in:

Nei M, Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 1986;3(5):418-426.

The pool-seq adaptation (frequency-weighted site and difference counting) follows the approach used in SNPGenie:

Nelson CW, Moncla LH, Hughes AL. SNPGenie: estimating evolutionary parameters to detect natural selection using pooled next-generation sequencing data. Bioinformatics. 2015;31(22):3709-3711.

About

pie -- piN/piS Estimator

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors