vfam_trees is a bioinformatics pipeline for building maximum-likelihood phylogenetic trees for viral families. For each family it discovers species via NCBI Taxonomy, downloads sequences from GenBank, applies quality filtering, clusters per species, aligns with MAFFT, infers trees with FastTree and IQ-TREE, annotates internal nodes with LCA-based taxonomy, roots the trees with taxonomy-guided rooting, and writes Newick and PhyloXML output along with PDF/PNG figures and per-run TSV summaries.
Two trees are produced per family:
- tree_500: broad diversity tree (up to 500 sequences, FastTree / GTR+G or LG+G, SH-like support)
- tree_100: collapsed representative tree (up to 100 sequences, IQ-TREE / GTR+G or TEST-selected best-fit AA model; SH-aLRT support for nucleotide trees, UFBoot for protein trees)
The per-family pipeline runs in seven stages. Each stage is configurable per family (see Configuration); RefSeqs are protected at every step where sequences can be dropped.
The family TaxID is resolved from NCBI Taxonomy and every descendant species is enumerated. The species list defines the universe of organisms eligible for download. An optional manual.restrict_to_lineages list in the per-family config restricts this universe to one or more taxonomic lineages — each entry is a taxon at any rank (species, genus, subfamily, ...) given as a scientific name or a taxid; only species whose taxid is a descendant of at least one listed lineage proceed to download; others are skipped entirely.
GenBank is queried per species with RefSeq priority. The query targets the configured molecule (whole_genome or a marker name) and, for segmented viruses, the configured segment keyword (records lacking it in their title are dropped). RefSeqs are uncapped; non-RefSeq records are limited by download.max_per_species.
An optional shared cache (configurable TTL, per-entry locking, negative-result caching) avoids re-downloading the same species across runs and across parallel family jobs.
Each species' sequences are filtered in this order:
- Manual overrides (per-family manual.include / manual.include_seq / manual.include_fasta_files / manual.exclude / manual.restrict_to_lineages): accessions in manual.exclude are dropped immediately; accessions in manual.include bypass every QC step below and are protected from removal during clustering, proportional merge, and length-outlier filtering. Match is exact, version included (e.g. NC_002617.1). manual.include_seq accepts pasted sequences (records not yet in GenBank) as a list of {id, organism, sequence} mappings; manual.include_fasta_files accepts paths to one or more FASTA files whose sequences are injected the same way. Both are injected after fetch, receive the same bypass + downstream protection as manual.include, and their ids must not collide with any fetched accession (not supported in concat mode). manual.restrict_to_lineages restricts the set of species that proceed to download to those under one or more named taxonomic lineages (see Species discovery).
- Organism exclusion: case-insensitive substring match against ORGANISM, SOURCE, and DEFINITION (joined with newlines so terms cannot straddle field boundaries). Defaults: synthetic construct, metagenome, MAG:, uncultured, unverified, vector, recombinant, patent.
- Ambiguity: max_ambiguous cap on the fraction of N / X / IUPAC degenerate characters.
- Minimum length: with min_length: null, the threshold is auto-set to 50% of the per-species median, with relaxation fallback to 40% then 30% if too few sequences pass; in all cases a hard floor of 200 bp / 100 aa applies.
- Length-outlier filter (post-merge, pre-MSA): two-sided keep window that is the union of (a) a MAD-on-log-lengths window and (b) a hard floor [min_lo_mult, max_hi_mult] × median. See Length-outlier behaviour for the exact form. RefSeqs and manual.include records flagged by the filter are kept with a warning.
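
The organism-exclusion and minimum-length steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline's own code: the function names are hypothetical, and min_pass is a stand-in for the undocumented "too few sequences pass" criterion.

```python
from statistics import median

DEFAULT_EXCLUDE = ["synthetic construct", "metagenome", "MAG:", "uncultured",
                   "unverified", "vector", "recombinant", "patent"]


def is_excluded(organism, source, definition, terms=DEFAULT_EXCLUDE):
    """Case-insensitive substring match over the three record fields.

    Joining with newlines keeps a term from matching across a field boundary.
    """
    haystack = "\n".join([organism, source, definition]).lower()
    return any(term.lower() in haystack for term in terms)


def auto_min_length(lengths, floor=200, min_pass=3):
    """Try 50%, 40%, then 30% of the median; never go below the hard floor.

    min_pass is a hypothetical threshold for "enough sequences pass".
    """
    med = median(lengths)
    threshold = floor
    for frac in (0.5, 0.4, 0.3):
        threshold = max(frac * med, floor)
        if sum(1 for n in lengths if n >= threshold) >= min_pass:
            break
    return threshold
```

For protein runs the same sketch would be called with floor=100.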
- RefSeq absorption: non-RefSeq sequences ≥ refseq_absorption.threshold (default 0.99) identical to a RefSeq within the same species are absorbed into the RefSeq, suppressing redundant near-zero-branch cherries. RefSeqs themselves are never removed.
- Adaptive per-species clustering: MMseqs2 with binary search for an identity threshold within [clustering.threshold_min, threshold_max] that yields ≤ max_reps_500 (or max_reps_100) representatives.
- Proportional cross-species merge: when the species count exceeds the target tree size, species with at least one RefSeq are kept first; remaining slots fill by rep count. Species dropped at the cap (and how many of them carried a RefSeq) are recorded in summary.tsv and a WARNING is logged.
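
The adaptive clustering step can be pictured as a bisection over the identity threshold. The sketch below is a hypothetical illustration: count_reps stands in for a call that runs the clustering tool at a given identity and returns the number of representatives, and the code assumes that number never decreases as the identity threshold rises. The tolerance and the fallback to the lower bound are assumptions as well.

```python
def find_threshold(count_reps, max_reps, lo=0.70, hi=0.99, tol=0.005):
    """Return the highest identity in [lo, hi] with count_reps(t) <= max_reps.

    Assumes count_reps is non-decreasing in t. If even lo yields too many
    representatives, lo is returned and the caller has to cap the result.
    """
    if count_reps(hi) <= max_reps:
        return hi  # the strictest threshold already fits
    best = lo
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if count_reps(mid) <= max_reps:
            best, lo = mid, mid  # fits: try a stricter threshold
        else:
            hi = mid             # too many reps: relax the threshold
    return best
```

Each probe costs one clustering run, so a coarse tolerance keeps the number of runs small (about six for the default range and tol=0.005).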
- MAFFT with separate options for nucleotide vs. protein, and for tree_500 vs. tree_100 (the latter typically slower / more accurate).
- trimAl column trimming (-automated1 by default) between MAFFT and tree inference; pre-trim length, tool, and options are recorded per tree.
- FastTree for tree_500 (configurable model_nuc / model_aa; +G / +GAMMA enables discrete-gamma rate variation).
- IQ-TREE for tree_100, with per-sequence-type options: --fast for nucleotides (SH-aLRT support, auto-added by the wrapper) and -B 1000 for protein (UFBoot ultrafast bootstrap, more robust on divergent protein families). Setting model_aa: TEST triggers ModelFinder; the chosen model (e.g. LG+I+G4) is parsed from the IQ-TREE log and surfaces in summary.tsv, the per-family PDF, and the PhyloXML provenance.
- Branch-support measure is picked per tree (SH_like / SH_aLRT / UFBoot) and recorded uniformly: tree{500,100}_support_type plus generic support_{min,q1,median,q3,max,iqr} columns; the PhyloXML <confidence type="…"> attribute mirrors the same label.
- Iterative branch-length outlier removal: between iterations, leaves with terminal branch length exceeding median + factor × MAD are removed and MSA + tree are re-run (up to max_iterations; only when ≥ min_seqs remain). RefSeqs are protected: a flagged RefSeq stays with a warning. Detailed per-outlier log lines include length, ratio to median, and threshold.
- Multi-marker protein concatenation: large DNA virus families with curated marker presets concatenate per-marker MAFFT + trimAl alignments into a single matrix; tree_100 uses partitioned IQ-TREE (-p partitions.nex -m MFP) so each marker gets its own ModelFinder pick. Per-marker fetches are bucketed by source nucleotide accession (Policy A); a length filter (concatenation.source_nuc_min_length_frac, default 0.3) drops source-nuc accessions shorter than that fraction of the longest parent so partial single-gene submissions don't crowd out genome-scale proteins under download.max_per_species (default 3000 for concat families). See CONCAT_DESIGN.md for design details.
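
The branch-length outlier rule (median + factor × MAD, with protected records kept) can be sketched as follows. This is an illustrative reading, not the pipeline's implementation: the function name is hypothetical, and the MAD is taken as the raw, unscaled median absolute deviation, which is an assumption.

```python
from statistics import median


def branch_outliers(lengths, factor=20.0, protected=frozenset()):
    """lengths maps leaf name -> terminal branch length.

    Returns (threshold, leaves_to_remove). Leaves named in `protected`
    (e.g. RefSeqs) are never returned for removal, even when flagged.
    """
    values = list(lengths.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)  # raw MAD, no scaling constant
    threshold = med + factor * mad
    flagged = [leaf for leaf, v in lengths.items() if v > threshold]
    to_remove = [leaf for leaf in flagged if leaf not in protected]
    return threshold, to_remove
```

With leaf names and branch lengths invented for the example, `branch_outliers({"a": 0.10, "b": 0.11, "c": 0.09, "d": 0.10, "e": 5.0, "NC_1": 6.0}, protected={"NC_1"})` flags both long branches but returns only "e" for removal.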
- LCA-based internal-node annotation using NCBI ranked lineages, with a configurable lineage-depth filter (taxonomy.lca_min_rank, default species) so leaves whose lineage is too shallow are excluded from the LCA vote.
- Genus inference for taxa lacking a formal genus rank in NCBI lineage: none, suffix (single-word taxa ending in virus treated as genus, per ICTV convention), or deepest (default; suffix first, then deepest rank above species).
- Taxonomy-guided rooting via LCA specificity scoring, with MAD and midpoint fallbacks. The PhyloXML <phylogeny> is emitted with rooted="true" rerootable="false" so downstream viewers cannot re-root the carefully rooted output.
- Genus / subfamily coloring in HLS color space: one hue band per subfamily, lightness varies across genera within a subfamily; colors are applied to PDF/PNG figures and to PhyloXML (style:font_color).
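
To make the effect of the lineage-depth filter concrete, here is a minimal sketch of an LCA vote over ranked lineages. The rank list is a shortened illustrative subset, the function name is hypothetical, and treating a rank present in only some voters as a disagreement is an assumption rather than documented behaviour.

```python
RANKS = ["family", "subfamily", "genus", "species"]  # illustrative subset, root to tip


def lca(lineages, min_rank="species"):
    """lineages: list of dicts mapping rank -> taxon name, one per leaf.

    Leaves whose lineage lacks min_rank are left out of the vote;
    min_rank=None lets every leaf vote. Returns (rank, name) or None.
    """
    voters = [lin for lin in lineages if min_rank is None or min_rank in lin]
    result = None
    for rank in RANKS:
        names = {lin.get(rank) for lin in voters}
        if len(names) == 1 and None not in names:
            result = (rank, names.pop())  # all voters agree at this rank
        elif len(names) > 1:
            break                          # first disagreement ends the walk
    return result
```

Given two leaves that share a genus and a third leaf annotated only to family level, the default filter yields the genus, while min_rank=None falls back to the family because the shallow leaf has no subfamily or genus to agree on.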
Per-family: Newick, PhyloXML (with <taxonomy>, <sequence>, vipr: metadata, and rank properties), rooted-rectangular and unrooted-radial PDF + PNG tree images, a topology-only icon PNG, sequence and metadata FASTAs/TSVs, a per-family PDF report, and a per-family log.
Cross-family: a row-per-family summary.tsv with full statistics, a lightweight status.tsv (success/skip), and an overview_tree_100.png thumbnail grid shaded by viral realm. See Output for the full file inventory.
Per-family directories are named <Family>_<taxid> (e.g. Asfarviridae_137992). Failures and skips at any stage produce a row in status.tsv with the skip reason; the per-family work directory is preserved for inspection.
- Pre-configured family presets: 28 segmented RNA virus families (segment keywords) and 26 DNA virus families (sequence type, region/marker, and concat marker sets for the 16 families that use concatenation) ship with curated defaults.
- Checkpointing: MSA and tree steps store a content-hashed sidecar of inputs + tool / model / options, so any change auto-invalidates the cache. Each iterative outlier-removal pass gets its own hash; partially-completed runs resume in the right place.
- Family-annotation TSV (optional): joins extra per-family columns (e.g. baltimore_class) into summary.tsv and status.tsv. Missing file or family is silent.
- Stage tracking: vfam_trees status reports the current processing stage for in-progress families (downloading/QC, MSA, tree inference, annotating).
- Dry-run mode: vfam_trees run --dry-run previews per-family parameters without executing.
- Parallelism: -j N runs N families concurrently; -t T sets threads per family job. The download cache and external tools are concurrency-safe.
- Validation and warnings: partial NCBI batches, unrecognized per-family YAML keys, configs that override a recommended DNA-family setting, and MAFFT / tree output sanity checks all surface as log warnings.
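
The checkpointing idea (a content hash of inputs plus tool, model, and options, stored next to the output) could look roughly like the sketch below. The function names and the ".hash" sidecar suffix are hypothetical; the actual file layout used by vfam_trees is not documented here.

```python
import hashlib
import json
from pathlib import Path


def step_fingerprint(input_path, tool, model, options):
    """SHA-256 over the input file bytes plus the tool settings."""
    h = hashlib.sha256()
    h.update(Path(input_path).read_bytes())
    settings = {"tool": tool, "model": model, "options": options}
    h.update(json.dumps(settings, sort_keys=True).encode("utf-8"))
    return h.hexdigest()


def is_cached(output_path, fingerprint):
    """True when the output exists and its sidecar holds the same fingerprint."""
    out = Path(output_path)
    sidecar = out.with_name(out.name + ".hash")
    return (out.exists() and sidecar.exists()
            and sidecar.read_text().strip() == fingerprint)


def mark_done(output_path, fingerprint):
    """Write the sidecar after the step has finished successfully."""
    out = Path(output_path)
    out.with_name(out.name + ".hash").write_text(fingerprint + "\n")
```

Because the options string is part of the hash, changing e.g. the MAFFT options produces a different fingerprint and the step is re-run.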
biopython >= 1.81
click >= 8.1
pyyaml >= 6.0
matplotlib >= 3.9 # PDF report and tree images; requires NumPy 2.x compatible build
| Tool | Purpose |
|---|---|
| mafft | Multiple sequence alignment |
| trimal | Alignment column trimming |
| FastTree | Rapid ML tree inference (tree_500) |
| iqtree2 | ML tree inference (tree_100) |
| mmseqs | Sequence clustering |
All tools must be available on $PATH. Installation via conda is recommended:
conda install -c bioconda mafft fasttree iqtree mmseqs2 trimal

git clone https://github.com/cmzmasek/vfam_trees.git
cd vfam_trees
pip install -e .

# 1. Create and edit global config (sets NCBI email and API key)
vfam_trees init
# edit config/global.yaml
# 2. Check all dependencies
vfam_trees test
# 3. Generate per-family configs (review before running)
vfam_trees init-configs -f families.txt
# 4. Preview what will run without executing anything
vfam_trees run -f families.txt --dry-run
# 5. Run the pipeline
vfam_trees run -f families.txt -j 4 -t 4
# 6. Check progress
vfam_trees status -f families.txt

Generate a template with vfam_trees init. Then edit it to set your NCBI credentials:
ncbi:
  email: your.email@example.com  # REQUIRED
  api_key: your_ncbi_api_key  # optional but recommended (10 req/s vs 3 req/s)

An NCBI API key can be obtained for free at https://www.ncbi.nlm.nih.gov/account/.
The defaults: section of global.yaml overrides the built-in defaults for all families. Per-family configs further override individual parameters.
cache:
  dir: ~/.vfam_cache  # shared across all runs on this machine (or a lab filesystem)
  ttl_days: 90  # re-download after 90 days; null = never expire

Cache entries are keyed by (taxid, db, region, segment, max_per_species) so changing any query parameter triggers a fresh download. Parallel family jobs (-j N) coordinate via per-entry lock files. Negative results (species with zero hits) are cached too, so they are not re-queried on every run.
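
A minimal sketch of how such a cache key and TTL check might be derived is shown below. The hashing scheme, key length, and function names are assumptions for illustration only; the source states just which parameters make up the key and that a null TTL never expires.

```python
import hashlib
import time


def cache_key(taxid, db, region, segment, max_per_species):
    """Stable key: changing any query parameter yields a different key."""
    raw = "|".join(str(x) for x in (taxid, db, region, segment, max_per_species))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def is_expired(created_epoch, ttl_days, now=None):
    """ttl_days=None means the entry never expires."""
    if ttl_days is None:
        return False
    now = time.time() if now is None else now
    return now - created_epoch > ttl_days * 86400
```

Including max_per_species in the key means that raising the per-species cap forces a fresh download instead of silently reusing a smaller cached result.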
Management:
vfam_trees cache clear Asfarviridae
vfam_trees cache clear --all # wipe entire cache
vfam_trees cache stats # show entry count and size

annotation_tsv: virus_families_annotation.tsv

Relative paths are resolved against the global.yaml directory. The file must have a family column (case-insensitive match) plus any extra columns to be picked up; currently baltimore_class (Roman numeral I–VII per Baltimore 1971) is the only one read by the pipeline. If the file, the key, or the family is missing, the column is simply left empty.
family baltimore_class host_range segmented genome_size
Flaviviridae IV Vertebrates No ~11 kb
Poxviridae I Vertebrates No ~130–375 kb

Per-family configs are auto-generated if missing. Generate them in advance to review and tune parameters before running:
vfam_trees init-configs -f families.txt
# Regenerate with current defaults (overwrites any manual edits):
vfam_trees init-configs -f families.txt --force

Key parameters:
download:
  max_per_species: 300  # cap on non-RefSeq sequences per species
sequence:
  type: nucleotide  # nucleotide or protein
  # auto-set to protein for large DNA virus families
  region: whole_genome  # whole_genome, or a marker name (e.g. "DNA polymerase", "hexon")
  # auto-set for known large DNA virus families
  segment: null  # segment keyword for segmented viruses (e.g. "segment L")
  # auto-set for known segmented RNA families
quality:
  min_length: null  # null = auto (50% of per-species median, floor 200 bp/100 aa)
  max_ambiguous: 0.01  # maximum fraction of ambiguous bases/residues
  exclude_organisms:  # case-insensitive substring match against
    - synthetic construct  # ORGANISM + SOURCE + DEFINITION (joined with newline
    - metagenome  # so terms cannot straddle field boundaries)
    - uncultured
    - "MAG:"  # metagenome-assembled genomes (NCBI DEFINITION prefix)
    - recombinant
    - patent
clustering:
  tool: mmseqs2
  threshold_min: 0.70  # minimum clustering identity
  threshold_max: 0.99  # maximum clustering identity
  max_reps_500: 20  # max representatives per species for tree_500
  max_reps_100: 5  # max representatives per species for tree_100
targets:
  max_500: 500  # target sequences for tree_500
  max_100: 100  # target sequences for tree_100
msa_500:
  tool: mafft
  options_nuc: "--6merpair --retree 2"  # used for nucleotide sequences
  options_aa: "--auto"  # used for amino acid sequences (MAFFT auto-selects strategy)
msa_100:
  tool: mafft
  options_nuc: "--retree 3"  # used for nucleotide sequences
  options_aa: "--maxiterate 1000 --localpair"  # used for amino acid sequences (MAFFT L-INS-i; high accuracy, slower)
msa_trim:
  enabled: true  # drop poorly-aligned columns before tree inference
  tool: trimal
  options: "-automated1"  # adaptive; works for both nucleotide and protein
tree_500:
  tool: fasttree
  options_nuc: ""  # used for nucleotide sequences
  options_aa: ""  # used for amino acid sequences (e.g. marker-gene families)
  model_nuc: GTR+G
  model_aa: LG+G  # LG+G used for amino acid sequences
tree_100:
  tool: iqtree
  options_nuc: "--fast"  # nucleotide: SH-aLRT support (auto-added by wrapper)
  options_aa: "-B 1000"  # protein: UFBoot ultrafast bootstrap (stronger support
  # for divergent protein families; --fast is incompatible with -B)
  model_nuc: GTR+G
  model_aa: TEST  # TEST = IQ-TREE ModelFinder; the chosen best-fit model
  # is recorded in summary.tsv / PDF / PhyloXML instead of "TEST"
length_outlier:
  enabled: true  # pre-MSA length-based outlier removal
  k: 5.0  # MAD-on-log keep window: exp(median(log L) ± k · σ_log)
  # with σ_log = 1.4826 × MAD(log L); 0 disables MAD
  min_lo_mult: 0.20  # never drop seqs ≥ this × median (0 disables lower floor)
  max_hi_mult: 5.0  # never drop seqs ≤ this × median (0 disables upper floor)
outlier_removal:
  enabled: true  # iterative post-tree branch-length outlier removal
  factor: 20.0  # threshold = median + factor × MAD (Median Absolute Deviation)
  max_iterations: 3  # maximum MSA+tree iterations
  min_seqs: 40  # only remove outliers when ≥ min_seqs sequences remain after removal
labeling:
  format: "{species}|{id}|{host}"  # format string for leaf labels (PhyloXML <name>,
  # PDF/PNG tree images, FASTA display names).
  # Placeholders: {species}, {id} (accession), {host},
  # {strain}, {location}, {year}, {genus}.
  # Literal text (separators etc.) is kept verbatim.
  replace_whitespace: true  # replace spaces in field values with underscores
  keep_separator_on_empty: false  # false: drop empty fields and their preceding
  # separator so no leading/consecutive separators appear
  # true: keep separators regardless of field content
coloring:
  genus_inference: deepest  # none: only formal NCBI genus-rank entries are colored
  # suffix: single-word taxa ending in "virus" treated as genus
  # (recovers ICTV genus names that NCBI still has at "no rank")
  # deepest (default): suffix first, then deepest lineage entry
  # above species rank
taxonomy:
  lca_min_rank: species  # none: every leaf contributes to internal-node LCA voting
  # subfamily / genus / species (default): exclude leaves whose
  # lineage does not reach this rank, so shallow lineages cannot
  # drag ancestor labels back toward the root
manual:  # curator overrides on per-family record selection
  include: []  # exact accessions (with version, e.g. "NC_002617.1") that
  # bypass all QC (length, ambiguity, organism exclusion) and
  # are protected through clustering, proportional merge, and
  # length-outlier filtering; stronger than RefSeq at QC,
  # equal to RefSeq downstream
  include_seq: []  # pasted sequences not yet in GenBank; each entry is a
  # {id, organism, sequence} mapping; injected after fetch,
  # bypass QC, and receive the same downstream protection as
  # manual.include records (not supported in concat mode)
  include_fasta_files: []  # paths to FASTA files whose sequences are injected identically
  # to include_seq entries; FASTA id field → sequence id,
  # remainder of header → organism (not supported in concat mode)
  exclude: []  # exact accessions to drop immediately after fetch, before QC
  restrict_to_lineages: []  # restrict the pipeline to species under one or more named
  # lineages; entries are taxon names at any rank (species,
  # genus, subfamily, ...) or NCBI taxids (integer or digit
  # string); only species descendant from at least one listed
  # lineage proceed to download; empty or absent means use the
  # full discovered species list
name: "" # override the display name in PDF/PNG titles and the PhyloXML
# <name> element; e.g. "Hantaviridae 2026"; does not affect
# output file names; empty (default) uses the family name

The coloring and taxonomy keys can also be set globally in the defaults: section of global.yaml; per-family configs inherit them automatically.
Example: adding a pasted reference and forcing a curator-picked accession in:
manual:
  include:
    - NC_002617.1  # force-keep this RefSeq even if QC would drop it
  include_seq:
    - id: MyLab_Isolate_2026
      organism: Foo virus
      sequence: |
        ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
        ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
    - id: PreprintSeq_42
      organism: Bar virus
      sequence: ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
  include_fasta_files:
    - /data/unpublished/batch_2026_isolates.fasta
    - /data/unpublished/outgroup.fa
  exclude:
    - KF234567.1  # drop a known-bad isolate before QC

Notes on include_seq and include_fasta_files:
- id (for include_seq) or the first whitespace-delimited token of the FASTA header (for include_fasta_files) is used throughout the pipeline (FASTA headers, tree leaves, PhyloXML) and must not collide with any accession returned by NCBI for the family, with anything in include / exclude, or across the two options; collisions raise a hard error.
- organism (for include_seq) or the remainder of the FASTA header after the id (for include_fasta_files) controls the species bucket the entry joins (matching a fetched species name appends to that bucket; a novel name creates a new one) and shows up in leaf labels. If a FASTA header is a single token (no remainder), the organism is left empty and the leaf label is just the header itself.
- For include_seq, the sequence may be wrapped across lines; whitespace is stripped and the sequence is uppercased.
- Injected leaves render gray (no genus) and do not participate in LCA voting, since only id and organism are required.
- Not supported when sequence.region == "concatenated": injected sequences cannot be split into per-marker proteins.
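
The normalisation and collision rules in these notes can be sketched as below. The helper names are hypothetical and the error type is an assumption; the source only says that collisions raise a hard error.

```python
def normalize_sequence(text):
    """Strip all whitespace (including line wraps) and uppercase."""
    return "".join(text.split()).upper()


def split_fasta_header(header):
    """'>id rest of header' -> (id, organism); organism is '' for a single token."""
    parts = header.lstrip(">").strip().split(None, 1)
    seq_id = parts[0]
    organism = parts[1] if len(parts) > 1 else ""
    return seq_id, organism


def check_collisions(injected_ids, fetched_accessions, include=(), exclude=()):
    """Raise ValueError on a duplicate injected id or a clash with other ids."""
    taken = set(fetched_accessions) | set(include) | set(exclude)
    seen = set()
    for seq_id in injected_ids:
        if seq_id in seen or seq_id in taken:
            raise ValueError(f"id collision: {seq_id}")
        seen.add(seq_id)
```

For example, `split_fasta_header(">MyLab_1 Foo virus")` gives `("MyLab_1", "Foo virus")`, and a header of just `>Solo` gives `("Solo", "")`.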
Example: restricting a family run to specific lineages (mixing ranks, names, and taxids):
manual:
  restrict_to_lineages:
    - Orthohantavirus  # genus; all species under it are kept
    - Mobatvirus  # another genus in the same family
    - Hantaan orthohantavirus  # individual species also accepted
    - 11595  # NCBI taxid (integer); equivalent to the species name above
    - "1980519"  # taxid as a quoted digit string; also accepted

For each entry, NCBI Taxonomy is queried for the species-rank descendants under that taxon; the family's discovered species list is then restricted to those that fall under at least one of the listed lineages. Entries whose subtree does not overlap the discovered species produce a WARNING in the log (typo, or a taxon outside this family). If restrict_to_lineages matches nothing in the discovered species list, the family is skipped with an explanatory status entry.
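
The restriction logic described above amounts to set intersection. In the sketch below, the function name is hypothetical and the per-entry descendant sets are assumed to have been fetched from NCBI Taxonomy beforehand.

```python
def restrict_species(discovered, lineage_descendants):
    """Keep discovered species that fall under at least one listed lineage.

    discovered: set of species taxids found for the family.
    lineage_descendants: dict mapping each config entry to the set of
    species-rank taxids under that taxon.
    Returns (kept, unmatched_entries); unmatched entries deserve a warning,
    and an empty `kept` means the family would be skipped.
    """
    kept = set()
    unmatched = []
    for entry, descendants in lineage_descendants.items():
        overlap = discovered & descendants
        if overlap:
            kept |= overlap
        else:
            unmatched.append(entry)
    return kept, unmatched
```

With made-up taxids, `restrict_species({1, 2, 3, 4}, {"GenusA": {1, 2}, "Typo": {99}})` returns `({1, 2}, ["Typo"])`.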
The pre-MSA length-outlier filter takes the union of two windows around the median sequence length:
- MAD-on-log-lengths: exp(median(log L) ± k · σ_log) with σ_log = 1.4826 · MAD(log L). Adapts to each family's natural spread; k=0 disables.
- Hard floor: [min_lo_mult, max_hi_mult] × median. Guarantees a minimum keep window even when MAD is degenerate or the family is very tight; either knob set to 0 disables that side.
Behaviour across distribution shapes (median = 1000, defaults k=5.0, floor [0.20×, 5.0×]):
| Distribution | Effective keep window | Notes |
|---|---|---|
| Tight family + 0.40× truncation | [0.20×, 5.0×] | Floor kicks in; moderate truncations kept |
| Variable family (~2× spread) | [0.19×, 5.23×] | MAD widens beyond the floor |
| Bimodal (≥50% tied at median) | [0.20×, 5.0×] | MAD = 0; floor alone applies |
| Tight bulk + 30× extreme outlier | [0.20×, 5.0×] | Floor catches the outlier |
| Tight bulk + 0.03× extreme outlier | [0.20×, 5.0×] | Floor catches the outlier |
Filter outcome (kept / dropped, the median, and the resolved keep window) is logged at INFO once per tree and written to summary.tsv (tree{500,100}_n_length_outliers_{long,short}, length_filter_median, length_filter_lo_cutoff, length_filter_hi_cutoff).
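
The union of the MAD-on-log window and the hard floor can be written out as a short function. This is a sketch of the formula as documented, with a hypothetical function name; how the pipeline behaves when both bounds on one side are disabled is not stated, so leaving that side unbounded is an assumption.

```python
import math
from statistics import median


def keep_window(lengths, k=5.0, min_lo_mult=0.20, max_hi_mult=5.0):
    """Return (lo, hi): sequences with lo <= length <= hi are kept."""
    logs = [math.log(n) for n in lengths]
    center = median(logs)
    sigma = 1.4826 * median(abs(x - center) for x in logs)
    med = median(lengths)
    los, his = [], []
    if k > 0:  # MAD-on-log window
        los.append(math.exp(center - k * sigma))
        his.append(math.exp(center + k * sigma))
    if min_lo_mult > 0:  # lower hard floor
        los.append(min_lo_mult * med)
    if max_hi_mult > 0:  # upper hard floor
        his.append(max_hi_mult * med)
    # Union of the enabled windows: the widest bound wins on each side.
    lo = min(los) if los else 0.0
    hi = max(his) if his else math.inf
    return lo, hi
```

For nine sequences of length 1000 plus one of 30000, the MAD is zero and the window collapses to the floor, about [200, 5000], so the 30× outlier is dropped, matching the "Tight bulk + 30× extreme outlier" row.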
Known DNA virus families are automatically configured with curated ICTV-aligned markers. Small ssDNA / dsDNA families use a single diagnostic protein (when one is well established), medium–large dsDNA families use a family-specific structural or replication protein, and nucleocytoplasmic large DNA viruses (NCLDVs) share DNA polymerase (family B) as a universal marker:
| Family group | Marker gene | Sequence type |
|---|---|---|
| Circoviridae, Smacoviridae | Rep | protein |
| Anelloviridae | ORF1 | protein |
| Parvoviridae | NS1 | protein |
| Polyomaviridae | large T antigen | protein |
| Papillomaviridae | L1 | protein |
| Hepadnaviridae | whole genome | nucleotide |
| Adenoviridae | hexon | protein |
| Orthoherpesviridae, Alloherpesviridae, Herpesviridae | DNA polymerase | protein |
| Malacoherpesviridae | major capsid protein | protein |
| Iridoviridae | major capsid protein | protein |
| Asfarviridae | B646L (p72) | protein |
| Baculoviridae, Nudiviridae | lef-8 | protein |
| Ascoviridae | DNA polymerase | protein |
| Poxviridae | rpo147 | protein |
| Nimaviridae, Hytrosaviridae, Phycodnaviridae, Mimiviridae, Marseilleviridae, Pandoraviridae, Pithoviridae, Medusaviridae | DNA polymerase | protein |
Families using multi-marker protein concatenation (region: concatenated) are documented in CONCAT_DESIGN.md; when a family appears there, the concat preset takes precedence over the single-marker fallback in the table above. The presets were re-tuned against actual NCBI annotation coverage in the 2026-05 cache audit (v1.2.30 also added [Title] to the protein fetch query — [Protein Name] is sparsely populated for several viral families, so the actual product name had to be matched against the FASTA defline instead):
- Ascoviridae: split off from the 7-gene baculovirus core (LEF / PIF genes 0 % coverage in Ascoviridae) to its own 3-marker preset (DNA polymerase, major capsid protein, DNA helicase P143).
- Iridoviridae: trimmed from 7 to 6 markers (VLTF-3 dropped; genuinely absent from NCBI Iridoviridae annotations even after the [Title] query fix in v1.2.30).
- Poxviridae: trimmed from 9 to 8 markers (single-stranded DNA-binding protein / I3L at 1 %).
- Alloherpesviridae: split off from the Orthoherpesviridae 7-gene set to a 6-marker subset (ssDNA-binding / ICP8 at 0 % in fish herpesviruses).
- Malacoherpesviridae: removed from concatenation entirely (5 species, only MCP and DNA pol annotated); falls back to single-marker MCP.
- Phycodnaviridae: split off from the 8-marker NCLDV hallmark fallback to a custom 4-marker preset (DNA polymerase, major capsid protein, packaging ATPase, mRNA capping enzyme); the other four hallmarks remain at 0–1 % coverage even with the [Title] query fix.
- Mimiviridae, Marseilleviridae, Pithoviridae: split off from the NCLDV fallback to 6-marker presets (per-family drop list).
The 8-marker NCLDV-hallmark fallback now applies only to Pandoraviridae and Medusaviridae (no cache audit data yet). If a stale auto-generated config file exists with incorrect settings for any of these families, the program logs a warning and suggests deleting the file to regenerate it.
# Generate global config template
vfam_trees init
# Overwrite an existing global config
vfam_trees init --force
# Check all dependencies and NCBI connectivity
vfam_trees test
# Generate per-family configs without running
vfam_trees init-configs -f families.txt
# Regenerate configs with current defaults (overwrites manual edits)
vfam_trees init-configs -f families.txt --force
# Preview per-family parameters without running the pipeline
vfam_trees run -f families.txt --dry-run
# Run pipeline (1 family at a time)
vfam_trees run -f families.txt
# Run with 4 parallel families, 4 threads each
vfam_trees run -f families.txt -j 4 -t 4
# Force re-run families already marked as done
vfam_trees run -f families.txt --force
# Check progress (shows current stage for in-progress families)
vfam_trees status -f families.txt
# (Re-)generate the overview PNG of all tree_100 trees
vfam_trees overview
# Cache management
vfam_trees cache clear Asfarviridae
vfam_trees cache clear --all --yes
vfam_trees cache stats

The families.txt file should contain one ICTV family name per line. Lines beginning with # are treated as comments and ignored:
# Positive-sense RNA viruses
Flaviviridae
Coronaviridae
# Negative-sense RNA viruses
Filoviridae
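
As an illustration of the file format above, a reader for families.txt might look like the sketch below. The function name is hypothetical, and skipping blank lines and tolerating leading whitespace before "#" are assumptions beyond what is stated.

```python
def read_families(text):
    """Return family names, skipping blank lines and '#' comment lines."""
    families = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            families.append(line)
    return families
```

Applied to the example file, this yields Flaviviridae, Coronaviridae, and Filoviridae in that order.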
Trees
| File | Description |
|---|---|
| <Family>_tree_{500,100}.nwk | Newick tree |
| <Family>_tree_{500,100}.xml | PhyloXML tree (rooted, non-rerootable; <taxonomy>, <sequence>, vipr: metadata, rank properties, style:font_color) |
Tree images: each tree is exported in two layouts at PDF and 150 dpi PNG:
| File | Layout |
|---|---|
| <Family>_tree_{500,100}.{pdf,png} | Rooted rectangular: taxonomy-annotated internal labels (genus / subgenus / subfamily / family); support values < 50% suppressed; genus/subfamily color legend; two-line figure caption (method · alignment · stats) |
| <Family>_tree_{500,100}_ur.{pdf,png} | Unrooted radial: equal-angle layout with leaf labels rotated outward; no internal labels, no support values; same coloring as rooted |
| <Family>_tree_icon.png | Square topology-only thumbnail of tree_100 (no labels, uniform branch color; size and colors configurable) |
Sequences and metadata
| File | Description |
|---|---|
| <Family>_sequences_raw_{500,100}.fasta | Sequences entering the MSA (post-QC, post-clustering, post-merge, post-length-outlier-filter; before post-tree branch-length outlier removal) |
| <Family>_alignment_{500,100}.fasta | Final alignment fed to the tree (MAFFT + trimAl when msa_trim.enabled: true; reflects sequences after iterative branch-length outlier removal) |
| <Family>_metadata_{500,100}.tsv | Per-leaf metadata |
| <Family>_id_map.tsv | Short ID → display name mapping (single-protein / whole-genome runs) |
Concat-mode only (large DNA virus families; see CONCAT_DESIGN.md)
| File | Description |
|---|---|
| <Family>_id_map_{500,100}.tsv | Per-tree id-map (concat keeps source-nuc accessions as leaf IDs) |
| <Family>_partitions_{500,100}.nex | NEXUS charset coordinates of each marker block in the concatenated alignment |
| <Family>_markers_{500,100}/ | Per-marker FASTAs: <safe_marker>_raw.fasta (unaligned) and <safe_marker>_alignment.fasta (post-trim when enabled) |
Reports and logs
| File | Description |
|---|---|
| <Family>_report.pdf | Stats table (taxid, lineage, molecule/region, species/QC counts, sequence-length stats, per-tree MSA / tree / clustering / support stats), post-QC and per-tree length histograms, support histograms, tree_100 visualization with color legend |
| <Family>.log | Per-family log |

Cross-family
| File | Description |
|---|---|
| summary.tsv | One row per family analyzed (success or skip): species counts, QC breakdown, sequence-length stats, clustering thresholds, MSA / trim / tree program-model-options, leaf counts, support stats, length- and branch-outlier counts and cutoffs, genus/subfamily counts, plus joined columns from annotation_tsv (e.g. baltimore_class) |
| status.tsv | Lightweight one-row-per-family table: family, ncbi_taxid, molecule_region, status (OK or skip reason), lineage, baltimore_class |
| overview_tree_100.png | Thumbnail grid of all tree_100 trees, shaded by viral realm (ssDNA, dsDNA, –ssRNA, +ssRNA/dsRNA, RT viruses) |
GNU General Public License v3.0 (GPLv3). See LICENSE for details.