#phylogenetic-tree #bioinformatics #tree-distance #robinson-foulds

bin+lib rapidtrees

Fast pairwise tree distance calculations (Robinson-Foulds, Weighted RF, Kuhner-Felsenstein) for phylogenetic trees

9 releases (5 breaking)

Uses new Rust 2024

new 0.7.1 Jun 5, 2026
0.7.0 Jun 4, 2026
0.6.0 May 27, 2026
0.5.1 May 12, 2026
0.2.3 Feb 27, 2026

#80 in Biology

MIT and GPL-3.0 licenses

205KB
3K SLoC

🌲⚑ rapidtrees

Blazingly fast pairwise phylogenetic tree distance calculations β€” Robinson–Foulds, Weighted RF, and Kuhner–Felsenstein β€” powered by Rust

Crates.io PyPI CI Coverage License: MIT

Overview β€’ Installing β€’ Usage β€’ Python API β€’ Snap Format β€’ Benchmarks


πŸ—ΊοΈ Overview

rapidtrees computes pairwise tree distances from BEAST/NEXUS .trees files or from precomputed .snap files and writes a labeled distance matrix. Three metrics are supported:

Metric Flag Output Description
Robinson–Foulds --metric rf integer Symmetric difference of bipartitions
Weighted RF --metric weighted float Branch-length-weighted bipartition difference
Kuhner–Felsenstein --metric kf float Euclidean distance on branch lengths

✨ Why rapidtrees?

  • πŸ¦€ Rust core β€” zero-overhead bitset operations with a cache-friendly memory layout
  • πŸ”€ Parallel by default β€” powered by rayon, automatically scales across all cores
  • 🐍 Python bindings β€” drop into any Python/NumPy workflow via PyO3
  • πŸ“¦ No Rust toolchain required β€” pre-built wheels on PyPI for Linux, macOS, and Windows
  • πŸ—œοΈ Gzip output β€” stream directly to .tsv.gz without a separate compression step

πŸš€ Performance

Benchmarked on a ZIKA dataset (283 taxa Β· 4 000 trees Β· ~8 M comparisons) (non-gzipped output):

Metric Total time Throughput
Robinson-Foulds ~2.7 s ~3.0 M comparisons/sec
Weighted RF ~3.4 s ~2.3 M comparisons/sec
Kuhner-Felsenstein ~3.3 s ~2.3 M comparisons/sec

πŸ”§ Installing

Pre-built wheels for Linux, macOS, and Windows. No Rust toolchain needed.

pip install rapidtrees

πŸ¦€ CLI (crates.io)

Install the standalone command-line binary. Requires the Rust toolchain.

cargo install rapidtrees

πŸ› οΈ From source

Prerequisites

  • Rust toolchain β€” for building the Rust core
  • pixi β€” for managing Python and R dependencies

Setup

git clone https://github.com/Joon-Klaps/rapidtrees.git
cd rapidtrees
# Set up environment β€” installs Python, R (with phangorn), and builds the package
pixi install

Development tasks

# Run Python API tests (includes R/phangorn cross-validation)
pixi run test-python

# Run Rust unit tests
pixi run test-rust

πŸ’» Usage

rapidtrees \
  (--input <path/to/file.trees> | --snap-input <path/to/file.snap>) \
  --output <path/to/output.tsv[.gz]> \
  [--burnin-trees <N>] \
  [--burnin-states <STATE>] \
  [--use-real-taxa] \
  [--metric rf|weighted|kf] \
  [-q|--quiet]
Flag Description
-i, --input <INPUT> Path to BEAST .trees (NEXUS) file
--snap-input <SNAP_INPUT> Path to .snap file (currently supports only --metric rf)
-o, --output <OUTPUT> Output path. Use .gz suffix for gzip compression; - for stdout
-t, --burnin-trees <N> Drop the first N trees (default: 0)
-s, --burnin-states <STATE> Keep only trees with STATE > STATE (default: 0)
--use-real-taxa Map numeric taxon IDs via the TRANSLATE block
--metric <rf|weighted|kf> Distance metric (default: rf)
-q, --quiet Suppress progress messages (errors still go to stderr)

The output is a square TSV matrix where both the header row and first column contain tree names formatted as <file_basename>_tree_STATE<state>. Use -o - to write to stdout for easy piping.

πŸ’‘ Examples

Compute RF matrix β†’ gzipped file
rapidtrees \
  -i tests/data/hiv1.trees \
  -o out/hiv1_rf.tsv.gz \
  --metric rf

# Reading in beast 0.003s
# Read in 162 taxons for 21 trees
# Creating tree bit snapshots 0.002s
# Determining distances using RF for 210 combinations
# Determining distances using RF 0.000s
# Writing to output 0.000s
Apply burn-in by tree count
rapidtrees \
  -i tests/data/hiv1.trees \
  -o out/hiv1_rf.tsv \
  -t 2

# Reading in beast 0.003s
# Read in 162 taxons for 19 trees
# Creating tree bit snapshots 0.001s
# Determining distances using RF for 171 combinations
# Determining distances using RF 0.000s
# Writing to output 0.000s
Compute RF matrix directly from a snapshot file
rapidtrees \
  --snap-input out/hiv1.snap \
  -o out/hiv1_rf_from_snap.tsv.gz \
  --metric rf

# Read snap with 21 trees and 2193 bipartitions in 0.001s
# Determined distances using RF in 0.000s
# Writing to out/hiv1_rf_from_snap.tsv.gz in 0.000s

🐍 Python API

rapidtrees exposes four functions from its Rust core. All accept a Python iterator of newick strings, keeping memory constant regardless of tree count.

Function Returns
pairwise_rf_from_newick_iter (names, bytes) β€” RF matrix as flat uint32 bytes, row-major
pairwise_rf_with_snapshots_from_newick_iter (names, bytes, leaf_names, n_bip, bytes, bytes) β€” RF matrix + presence matrix + clade bitmasks
pairwise_wrf_from_newick_iter (names, list[float]) β€” Weighted RF, flat row-major
pairwise_wrf_with_snapshots_from_newick_iter (names, bytes, leaf_names, n_bip, bytes, bytes) β€” wRF matrix + branch-length matrix + clade bitmasks
pairwise_kf_from_newick_iter (names, list[float]) β€” Kuhner-Felsenstein, flat row-major
pairwise_kf_with_snapshots_from_newick_iter (names, bytes, leaf_names, n_bip, bytes, bytes) β€” KF matrix + branch-length matrix + clade bitmasks
import rapidtrees as rtd
import numpy as np

trees = [
    "(A:0.1,(B:0.1,C:0.1):0.1);",
    "(A:0.1,(C:0.1,B:0.1):0.1);",
    "((A:0.1,B:0.1):0.1,C:0.1);",
]
names = ["t1", "t2", "t3"]

tree_names, rf_bytes = rtd.pairwise_rf_from_newick_iter(
    names, iter(trees), [{}], [0, 0, 0]
)
rf = np.frombuffer(rf_bytes, dtype=np.uint32).reshape(len(tree_names), -1)

For BEAST .trees files, translate maps, the snapshot API, and multi-file usage see docs/python-api.md.


πŸ“¦ Snap Format

rapidtrees can export tree snapshots to a compressed binary .snap file for downstream analyses (ESS computation, ASDSF, convergence diagnostics) on HPC clusters without re-parsing the original .trees files. The CLI can also compute RF distance matrices directly from .snap files via --snap-input.

What is a snapshot?

A tree snapshot is a compact bitset representation of a phylogenetic tree. Each bipartition (split) is encoded as a bitset over leaf indices. The full set of snapshots for a tree collection captures everything needed for RF-family distance computations and convergence diagnostics β€” without storing the original Newick strings.

Note: Snapshots are not human-readable and are not intended for general interchange. They are an internal format optimized for fast distance calculations and cannot be convert to it's original newick-style format.

File layout

A .snap file is a gzip-compressed binary stream with the following sections in order:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  HEADER                                              β”‚
β”‚  4 bytes  magic        "SNAP" (0x534E4150)          β”‚
β”‚  1 byte   version      format version (currently 2) β”‚
β”‚  8 bytes  n_trees      u64 LE β€” number of trees     β”‚
β”‚  8 bytes  n_taxa       u64 LE β€” number of leaf taxa β”‚
β”‚  8 bytes  n_bip        u64 LE β€” number of unique    β”‚
β”‚                         bipartitions across all treesβ”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  TAXA NAMES                                          β”‚
β”‚  For each of n_taxa:                                 β”‚
β”‚    4 bytes  length     u32 LE β€” byte length of name β”‚
β”‚    N bytes  name       UTF-8 string                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  TREE NAMES                                          β”‚
β”‚  For each of n_trees:                                β”‚
β”‚    4 bytes  length     u32 LE β€” byte length of name β”‚
β”‚    N bytes  name       UTF-8 string                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  PRESENCE MATRIX                                     β”‚
β”‚  n_trees Γ— n_bip bytes, row-major uint8             β”‚
β”‚  presence[i][j] = 1 if bipartition j is in tree i  β”‚
β”‚                   0 otherwise                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Bipartition column order is deterministic: columns are sorted in ascending Bitset order (lexicographic over u64 words, i.e. by leaf-index bit pattern), so the same tree set always produces the same column indices regardless of parse order.

What the presence matrix gives you

The presence matrix is sufficient for all major convergence diagnostics:

Diagnostic What you need
Pseudo ESS presence.mean(axis=0) per chain
ASDSF Per-chain split frequencies
FrΓ©chet ESS RF distances via XOR of rows
WRF / KF distances Branch lengths β€” use pairwise_wrf_with_snapshots_from_newick_iter or pairwise_kf_with_snapshots_from_newick_iter

Note: sum(presence[i] XOR presence[j]) == RF(tree_i, tree_j) exactly.

Presence matrix from Python

Snap files are written and read by the CLI only. From Python, use pairwise_rf_with_snapshots_from_newick_iter to obtain the same presence matrix in memory without writing a file β€” see the Python API section above.

import rapidtrees as rtd
import numpy as np

import math

tree_names, rf_bytes, leaf_names, n_bip, pres_bytes, bip_clade_bytes = (
    rtd.pairwise_rf_with_snapshots_from_newick_iter(
        names, iter(newicks), translate_maps, map_indices
    )
)
n = len(tree_names)
presence = np.frombuffer(pres_bytes, dtype=np.uint8).reshape(n, n_bip).copy()

# Global split frequencies (for Pseudo ESS / ASDSF)
global_freq = presence.mean(axis=0)

# RF distance between any two trees β€” no recomputation needed
rf_01 = int((presence[0].astype(int) ^ presence[1].astype(int)).sum())

# Named presence matrix β€” decode clade bitmasks for column labels
bytes_per_bip = math.ceil(len(leaf_names) / 8)
bip_arr  = np.frombuffer(bip_clade_bytes, dtype=np.uint8).reshape(n_bip, bytes_per_bip)
bip_bool = np.unpackbits(bip_arr, axis=1, bitorder="little")[:, :len(leaf_names)]
col_labels = ["|".join(leaf_names[i] for i in np.where(bip_bool[j])[0]) for j in range(n_bip)]
import pandas as pd
df = pd.DataFrame(presence, index=tree_names, columns=col_labels)

⏱️ Benchmarks

Benchmarks were run on a MacBook Pro M1. Trees are parsed once and bitset snapshots are reused across all pairwise comparisons. Parallelism is provided by rayon β€” no manual thread management needed.

Show full benchmark table

Output of cargo bench --bench memory_time_benchmark

Taxa (N) Trees (T) Combinations Est. Memory Actual Memory Wall Time CPU Time
10 100 5.0K 13.96 KB 17.87 KB 172.29 Β΅s 733.00 Β΅s
10 1000 500.0K 139.65 KB 168.95 KB 5.94 ms 12.99 ms
10 10000 50.0M 1.36 MB 1.64 MB 717.43 ms 1.36 s
10 100000 5.0B 13.64 MB 16.40 MB 3.53 min 4.04 min
100 100 5.0K 133.59 KB 136.09 KB 244.42 Β΅s 1.75 ms
100 1000 500.0K 1.30 MB 1.21 MB 17.49 ms 50.99 ms
100 10000 50.0M 13.05 MB 11.95 MB 1.08 s 4.81 s
100 100000 5.0B 130.46 MB 119.41 MB 4.29 min 9.86 min
500 100 5.0K 706.84 KB 713.93 KB 2.02 ms 3.08 ms
500 1000 500.0K 6.90 MB 5.89 MB 27.77 ms 216.87 ms
500 10000 50.0M 69.03 MB 57.84 MB 2.82 s 21.30 s
500 100000 5.0B 690.27 MB 577.28 MB 7.44 min 37.13 min
1000 100 5.0K 1.50 MB 1.51 MB 873.75 Β΅s 5.63 ms
1000 1000 500.0K 15.03 MB 11.86 MB 51.22 ms 420.94 ms
1000 10000 50.0M 150.32 MB 115.30 MB 5.12 s 42.92 s
1000 100000 5.0B 1.47 GB 1.12 GB 11.26 min 74.74 min
2000 100 5.0K 3.50 MB 3.51 MB 1.14 ms 9.92 ms
2000 1000 500.0K 35.01 MB 24.15 MB 93.73 ms 838.01 ms
2000 10000 50.0M 350.06 MB 230.59 MB 9.42 s 1.38 min
2000 100000 5.0B 3.42 GB 2.24 GB 18.57 min 141.11 min
5000 100 5.0K 14.30 MB 12.43 MB 61.36 ms 83.85 ms
5000 1000 500.0K 143.02 MB 63.97 MB 224.40 ms 2.09 s
5000 10000 50.0M 1.40 GB 579.40 MB 22.45 s 3.44 min

Note: Weighted RF and KF produce floating-point matrices; RF produces integer matrices.


πŸ” Troubleshooting

  • No trees parsed? Verify the input is a valid NEXUS .trees file and adjust --burnin-* settings.
  • Piping to other tools? Use -q to suppress timing messages on stdout.
  • Gzipped output not working? Ensure the output filename ends with .gz.

βš–οΈ License

rapidtrees is provided under the MIT License.

Dependencies

~13–25MB
~327K SLoC