#bioinformatics #variant-calling #umi #sequencing

kam-core

Core types and traits for alignment-free variant detection

4 releases (2 breaking)

0.3.1 May 18, 2026
0.3.0 Apr 16, 2026
0.2.0 Apr 14, 2026
0.1.0 Mar 25, 2026


Used in 6 crates (5 directly)

MIT license

50KB
629 lines

kam

Alignment-free variant detection for Twist duplex UMI sequencing.

kam replaces the four-tool chain HUMID + Jellyfish + km + kmtools with a single Rust binary that preserves molecule-level evidence from raw FASTQ reads through to variant calls. It supports two operating modes: somatic discovery and tumour-informed monitoring.


Features

  • Alignment-free: no reference alignment required at call time.
  • Molecule-aware: every k-mer carries a MoleculeEvidence record with distinct molecule count, duplex support, and strand breakdown.
  • Two modes: discovery (all alt paths) and tumour-informed monitoring (--target-variants).
  • High precision in tumour-informed mode: precision 1.0 at every VAF level in the benchmark below. Background cfDNA variants are eliminated because they do not match the expected somatic alleles.
  • TI rescue (--ti-rescue, --alt-walk): probes the k-mer index directly for TI targets that produce no PASS call, recovering sub-threshold evidence.
  • SV and fusion detection: large deletions, tandem duplications, inversions, and gene fusions via junction allowlists (--sv-junctions, --fusion-targets).
  • ML rescoring (--ml-model): optional gradient-boosted re-scoring of variant calls using a bundled ONNX model.
  • Fast: 16–22 seconds per sample on a single core at 2 M read pairs.
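
As a rough illustration of the molecule-aware bullet above, the sketch below shows one way a per-k-mer evidence record could be laid out. The field names, the `StrandSupport` enum, and both methods are hypothetical and are not taken from the kam-core source; only the three kinds of information (distinct molecule count, duplex support, strand breakdown) come from the feature list.

```rust
/// Hypothetical per-k-mer evidence record. Field names are illustrative
/// assumptions, not the actual kam-core definition.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct MoleculeEvidence {
    /// Distinct molecules that contain the k-mer.
    pub n_molecules: u32,
    /// Molecules supported by both strands of the original duplex.
    pub n_duplex: u32,
    /// Single-strand molecules seen on the forward strand only.
    pub n_simplex_fwd: u32,
    /// Single-strand molecules seen on the reverse strand only.
    pub n_simplex_rev: u32,
}

/// Strand support observed for one assembled molecule (assumed model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrandSupport {
    Duplex,
    ForwardOnly,
    ReverseOnly,
}

impl MoleculeEvidence {
    /// Record one more molecule that contains this k-mer.
    pub fn add_molecule(&mut self, support: StrandSupport) {
        self.n_molecules += 1;
        match support {
            StrandSupport::Duplex => self.n_duplex += 1,
            StrandSupport::ForwardOnly => self.n_simplex_fwd += 1,
            StrandSupport::ReverseOnly => self.n_simplex_rev += 1,
        }
    }

    /// Fraction of molecules with duplex support (0.0 when empty).
    pub fn duplex_fraction(&self) -> f64 {
        if self.n_molecules == 0 {
            0.0
        } else {
            f64::from(self.n_duplex) / f64::from(self.n_molecules)
        }
    }
}
```

Counting molecules instead of reads means PCR duplicates of the same original fragment contribute a single unit of evidence.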

Benchmark results

Evaluated on the Twist cfDNA Pan-Cancer Reference Standard v2 (24 samples, 375 truth variants, 3 concentrations, 8 VAF levels). Configuration: --max-vaf 0.35 --min-family-size-override 2 --target-variants.

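| Concentration | VAF | Sensitivity | SNV sens. | Indel sens. | Precision |
|---|---|---|---|---|---|
| 15 ng | 2% | 61.3% | 80.0% | 38.8% | 1.0 |
| 30 ng | 2% | 59.2% | 77.1% | 37.6% | 1.0 |
| 5 ng | 2% | 51.7% | 68.8% | 31.2% | 1.0 |
| 15 ng | 0.5% | 40.0% | 52.7% | 24.7% | 1.0 |
| 30 ng | 0.5% | 46.1% | 61.0% | 28.2% | 1.0 |
| (all) | 0% | | | | 0 FPs |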

Runtime: 16–22 s per sample. Peak RSS: 1.8–2.0 GB.
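
For readers less familiar with the metrics in the table, the sketch below spells out the standard definitions of sensitivity and precision. The function names are illustrative, and the counts in the usage note are made up for the example; they are not benchmark data.

```rust
/// Sensitivity = TP / (TP + FN). Returns `None` when there are no truth
/// variants, because the ratio is undefined in that case.
pub fn sensitivity(true_pos: u32, false_neg: u32) -> Option<f64> {
    let total = true_pos + false_neg;
    if total == 0 {
        None
    } else {
        Some(f64::from(true_pos) / f64::from(total))
    }
}

/// Precision = TP / (TP + FP). Returns `None` when no calls were made,
/// because the ratio is undefined in that case.
pub fn precision(true_pos: u32, false_pos: u32) -> Option<f64> {
    let total = true_pos + false_pos;
    if total == 0 {
        None
    } else {
        Some(f64::from(true_pos) / f64::from(total))
    }
}
```

With hypothetical counts, `sensitivity(3, 1)` gives `Some(0.75)` and `precision(3, 0)` gives `Some(1.0)`. A sample with no true variants and no calls has an undefined precision (`None`), which is presumably why the 0% VAF row reports a false-positive count instead of a ratio.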


Installation

From crates.io

cargo install kam-bio

This installs the kam binary.

From source

Requires Rust 1.75 or later.

cargo build --release

The binary is at target/release/kam.


Usage

For detailed documentation, guides, and examples, see the User Manual.

kam run \
  --r1 sample_R1.fastq.gz \
  --r2 sample_R2.fastq.gz \
  --targets targets_100bp.fa \
  --output-dir results/

Tumour-informed monitoring mode

Supply a VCF of expected somatic variants. Only calls matching an entry in the target set are reported; all other calls are labelled NotTargeted.

kam run \
  --r1 sample_R1.fastq.gz \
  --r2 sample_R2.fastq.gz \
  --targets targets_100bp.fa \
  --output-dir results/ \
  --max-vaf 0.35 \
  --min-family-size-override 2 \
  --target-variants known_variants.vcf
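
The matching step described above can be pictured as a set lookup. The following is a minimal sketch, assuming a call matches a target when chromosome, position, REF, and ALT are all identical; the actual matching rules in kam (for example, indel normalisation) are not documented here. `VariantKey`, `TargetStatus`, `parse_vcf_targets`, and `classify` are hypothetical names; only the `NotTargeted` label comes from the text.

```rust
use std::collections::HashSet;

/// Hypothetical identity of a variant: exact chrom/pos/ref/alt match.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct VariantKey {
    pub chrom: String,
    pub pos: u64,
    pub ref_allele: String,
    pub alt_allele: String,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TargetStatus {
    Targeted,
    NotTargeted,
}

/// Collect target variants from VCF text (CHROM, POS, ID, REF, ALT, ...).
/// Header lines and malformed records are skipped.
pub fn parse_vcf_targets(vcf_text: &str) -> HashSet<VariantKey> {
    let mut targets = HashSet::new();
    for line in vcf_text.lines() {
        if line.starts_with('#') || line.trim().is_empty() {
            continue;
        }
        let cols: Vec<&str> = line.split('\t').collect();
        if cols.len() < 5 {
            continue;
        }
        let Ok(pos) = cols[1].parse::<u64>() else {
            continue;
        };
        // The ALT column may list several alleles separated by commas.
        for alt in cols[4].split(',') {
            targets.insert(VariantKey {
                chrom: cols[0].to_string(),
                pos,
                ref_allele: cols[3].to_string(),
                alt_allele: alt.to_string(),
            });
        }
    }
    targets
}

/// Label a call according to whether it appears in the target set.
pub fn classify(call: &VariantKey, targets: &HashSet<VariantKey>) -> TargetStatus {
    if targets.contains(call) {
        TargetStatus::Targeted
    } else {
        TargetStatus::NotTargeted
    }
}
```

Because a background cfDNA variant would have to coincide exactly with an expected somatic allele to be labelled `Targeted`, this kind of lookup is what drives the high precision in monitoring mode.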

Individual pipeline stages

The pipeline can also be run stage by stage:

# 1. Assemble molecules from raw reads.
kam assemble --r1 R1.fastq.gz --r2 R2.fastq.gz --output molecules.bin

# 2. Build a k-mer index against target sequences.
kam index --input molecules.bin --targets targets.fa --output index.bin

# 3. Walk de Bruijn graph paths.
kam pathfind --index index.bin --targets targets.fa --output paths.bin

# 4. Call variants from scored paths.
kam call --paths paths.bin --output calls.vcf
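
If the stages need to be driven from Rust rather than a shell, a wrapper along the following lines may help. It is only a sketch: the subcommands, flags, and intermediate file names are copied from the commands above, while `stage_args` and `run_stages` are hypothetical helper names and `kam_bin` is assumed to be the path to an installed kam binary.

```rust
use std::io;
use std::process::Command;

fn owned(args: &[&str]) -> Vec<String> {
    args.iter().map(|s| s.to_string()).collect()
}

/// Argument lists for the four stages, in dependency order.
/// Each stage reads the file written by the previous one.
pub fn stage_args(r1: &str, r2: &str, targets: &str) -> Vec<Vec<String>> {
    vec![
        owned(&["assemble", "--r1", r1, "--r2", r2, "--output", "molecules.bin"]),
        owned(&["index", "--input", "molecules.bin", "--targets", targets, "--output", "index.bin"]),
        owned(&["pathfind", "--index", "index.bin", "--targets", targets, "--output", "paths.bin"]),
        owned(&["call", "--paths", "paths.bin", "--output", "calls.vcf"]),
    ]
}

/// Run each stage in order and stop at the first failure.
pub fn run_stages(kam_bin: &str, stages: &[Vec<String>]) -> io::Result<()> {
    for args in stages {
        let status = Command::new(kam_bin).args(args).status()?;
        if !status.success() {
            let stage = args.first().map(String::as_str).unwrap_or("");
            return Err(io::Error::new(
                io::ErrorKind::Other,
                format!("kam {} failed: {}", stage, status),
            ));
        }
    }
    Ok(())
}
```

Calling `run_stages("kam", &stage_args("R1.fastq.gz", "R2.fastq.gz", "targets.fa"))` would then execute the same four commands in sequence.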

Key options

| Flag | Default | Description |
|---|---|---|
| --min-family-size-override N | 1 | Minimum reads per UMI family. Set to 2 to remove singletons. |
| --max-vaf F | | Discard calls above this VAF (removes germline heterozygotes). |
| --min-alt-molecules N | 2 | Minimum alt molecules to emit a call. |
| --min-confidence F | 0.99 | Minimum posterior confidence. |
| --target-variants VCF | | Enable tumour-informed monitoring mode. |
| --ti-rescue | false | Probe k-mer index directly for TI targets with no PASS call. |
| --alt-walk | false | Rescue using pre-built alt sequences from --alt-as-ref. |
| --sv-junctions FASTA | | Add SV junction k-mers to the allowlist for SV detection. |
| --fusion-targets FASTA | | Add fusion junction k-mers for fusion/translocation detection. |
| --ml-model NAME | | Re-score calls with a built-in ML model (e.g. single-strand-v1). |
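
To make the interaction of the three numeric thresholds concrete, here is a minimal sketch of how they might combine into a pass/fail decision. The struct and function names are hypothetical, VAF is assumed to be alt molecules divided by total molecules, and the comparisons are assumed to be inclusive; the defaults (2, 0.99, no VAF ceiling) are taken from the table above.

```rust
/// Hypothetical container for the call-level thresholds.
#[derive(Debug, Clone, Copy)]
pub struct CallFilters {
    pub min_alt_molecules: u32,
    pub min_confidence: f64,
    /// `None` means no upper VAF limit (the flag was not given).
    pub max_vaf: Option<f64>,
}

impl Default for CallFilters {
    fn default() -> Self {
        CallFilters { min_alt_molecules: 2, min_confidence: 0.99, max_vaf: None }
    }
}

/// Hypothetical summary of one candidate call.
#[derive(Debug, Clone, Copy)]
pub struct Call {
    pub alt_molecules: u32,
    pub total_molecules: u32,
    pub confidence: f64,
}

impl Call {
    /// Assumed definition: fraction of molecules supporting the alt allele.
    pub fn vaf(&self) -> f64 {
        if self.total_molecules == 0 {
            0.0
        } else {
            f64::from(self.alt_molecules) / f64::from(self.total_molecules)
        }
    }
}

/// A call passes only if it clears every configured threshold.
pub fn passes(call: &Call, filters: &CallFilters) -> bool {
    call.alt_molecules >= filters.min_alt_molecules
        && call.confidence >= filters.min_confidence
        && filters.max_vaf.map_or(true, |max| call.vaf() <= max)
}
```

Under these assumptions, a heterozygous germline variant at roughly 50% VAF passes the defaults but is discarded once `max_vaf` is set to 0.35, which matches the purpose given for --max-vaf.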

ML Models

kam includes a built-in ML scorer that re-ranks variant calls using a gradient-boosted classifier. Enable it with --ml-model:

kam run --r1 sample_R1.fastq.gz --r2 sample_R2.fastq.gz \
  --targets panel_targets.fa --output-dir results/ \
  --ml-model single-strand-v1

| Model | Chemistry | Training data | AUPRC | AUROC | When to use |
|---|---|---|---|---|---|
| single-strand-v1 | Twist UMI duplex | ML3 single-strand consensus, 9,990 samples | 0.9998 | 0.9949 | Standard Twist UMI panels with single-strand consensus calling |

To list all available built-in models: kam models list

See docs/manual/07_ml_models.md for full model documentation, feature descriptions, performance metrics, and custom model instructions.


Chemistry

kam supports configurable UMI chemistries via config.toml. Presets are available for common protocols:

| Preset | UMI | Skip | Duplex |
|---|---|---|---|
| twist-umi-duplex | 5 bp | 2 bp | Yes |
| simplex-12bp | 12 bp | 0 bp | No |
| simplex-9bp | 9 bp | 0 bp | No |
| simplex-8bp | 8 bp | 0 bp | No |

See examples/ for config files covering each chemistry, and the Configuration Reference for all options.
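
The preset columns can be read as a recipe for trimming each read. The sketch below assumes the UMI sits at the very start of the read, followed by the skip bases and then the template; it also shows a common way to derive a strand-independent key for duplex chemistries by ordering the two UMIs. `Chemistry`, `split_read`, and `duplex_key` are hypothetical names, not kam's API or its config.toml schema.

```rust
/// Hypothetical in-memory form of a chemistry preset.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Chemistry {
    pub umi_len: usize,
    pub skip_len: usize,
    pub duplex: bool,
}

/// Values from the twist-umi-duplex row of the preset table.
pub const TWIST_UMI_DUPLEX: Chemistry = Chemistry { umi_len: 5, skip_len: 2, duplex: true };

/// Split a read into (UMI, template), dropping the skip bases.
/// Returns `None` for non-ASCII input or reads shorter than UMI + skip.
pub fn split_read<'a>(seq: &'a str, chem: &Chemistry) -> Option<(&'a str, &'a str)> {
    let start = chem.umi_len + chem.skip_len;
    if !seq.is_ascii() || seq.len() < start {
        return None;
    }
    Some((&seq[..chem.umi_len], &seq[start..]))
}

/// Strand-independent key for a duplex molecule: the two UMIs in sorted
/// order, plus a flag telling whether R1 carried the smaller one. Reads
/// from opposite strands of one molecule see the UMIs swapped, so they
/// share a key and differ only in the flag.
pub fn duplex_key<'a>(umi_r1: &'a str, umi_r2: &'a str) -> ((&'a str, &'a str), bool) {
    if umi_r1 <= umi_r2 {
        ((umi_r1, umi_r2), true)
    } else {
        ((umi_r2, umi_r1), false)
    }
}
```

For a simplex preset the same `split_read` applies with `skip_len` set to 0, and no pairing of strands is attempted.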


Architecture

kam-core      — shared types: Molecule, ConsensusRead, MoleculeEvidence
kam-assemble  — molecule assembly from raw FASTQ (replaces HUMID)
kam-index     — k-mer indexing with molecule provenance (replaces Jellyfish)
kam-pathfind  — de Bruijn graph construction and path walking (replaces km)
kam-call      — statistical variant calling and tumour-informed filtering
kam-ml        — optional gradient-boosted variant re-scoring (ONNX)
kam           — CLI binary wiring all stages together
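
To give a feel for what "k-mer indexing with molecule provenance" means, the following sketch maps each k-mer to the set of molecule IDs that contain it. It is a simplified illustration using standard-library collections and string keys; the actual kam-index data structures, k-mer encoding, and handling of reverse complements are not described here, and `build_index` is a hypothetical name.

```rust
use std::collections::{HashMap, HashSet};

/// Map every k-mer to the IDs of the molecules that contain it.
/// `molecules` holds (molecule ID, consensus sequence) pairs.
pub fn build_index(molecules: &[(u32, &str)], k: usize) -> HashMap<String, HashSet<u32>> {
    let mut index: HashMap<String, HashSet<u32>> = HashMap::new();
    if k == 0 {
        return index;
    }
    for &(id, seq) in molecules {
        // `windows` yields nothing when the sequence is shorter than k.
        for window in seq.as_bytes().windows(k) {
            let kmer = String::from_utf8_lossy(window).into_owned();
            // A set keeps one entry per molecule even if the k-mer
            // occurs several times within the same molecule.
            index.entry(kmer).or_default().insert(id);
        }
    }
    index
}
```

With this layout, the distinct molecule count for a k-mer is simply the size of its set, which is the quantity a plain k-mer counter such as Jellyfish does not retain.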

Development

Run tests and quality checks before committing:

cargo test
cargo clippy -- -D warnings
cargo fmt -- --check

Large Files

Benchmark data, simulation outputs, and compiled PDFs are stored in Nextcloud and mirrored locally in bigdata/. See NEXTCLOUD.md for a quick reference, or docs/project/devmanual/nextcloud.md for full documentation.

Public read-only share: https://nextcloudlocal.trentz.me/s/pTizAiSAJQsPcDo

Uploads require a separate edit share token. See .env.example and NEXTCLOUD.md.


Licence

MIT
