Command-line tools for bio-rs biological AI model input workflows

Releases: 62 (37 breaking). Latest: 0.37.3 (May 13, 2026); previous: 0.36.0 (May 7, 2026).

bio-rs

bio-rs turns protein FASTA into validated, tokenized, model-ready inputs for bio-AI workflows.

FASTA -> validated protein sequence -> token IDs -> model-ready JSON

DNA and RNA FASTA validation is also supported; tokenization is currently protein-only.

Status: pre-1.0. The CLI and JSON contracts are still being stabilized.

Why bio-rs?

Most bio-AI models are born in Python, but the tooling around them often needs to run somewhere else:

  • local CLIs
  • CI pipelines
  • servers
  • browsers
  • agents

bio-rs focuses on the boring but important layer before inference:

  • parse biological sequence input
  • validate it with structured diagnostics
  • tokenize it into stable IDs
  • emit machine-readable JSON contracts
  • keep preprocessing reproducible outside notebooks

The goal is not to replace Python research workflows.

The goal is to make the input layer around bio-AI models faster, more portable, and easier to trust.

Quickstart

cargo install biors --version 0.37.3
biors tokenize examples/protein.fasta
biors workflow --max-length 8 examples/protein.fasta
biors batch validate --kind auto examples/
biors tokenizer inspect --profile protein-20-special
biors dataset inspect --source uniprot --version 2026_02 --split train examples/

Full commands, demos, and install options: docs/quickstart.md

Proof

bio-rs keeps performance claims tied to reproducible in-repo benchmarks.

Latest recorded FASTA benchmark baseline:

Dataset               Matched workload      bio-rs core mean  Biopython mean  bio-rs speedup
Human proteome        Parse + validation    0.036s            0.584s          16.09x
Human proteome        Parse + tokenization  0.061s            0.587s          9.68x
100MB+ FASTA          Parse + validation    0.294s            3.994s          13.59x
100MB+ FASTA          Parse + tokenization  0.492s            4.040s          8.22x
Many short records    Parse + validation    0.007s            0.204s          28.35x
Many short records    Parse + tokenization  0.010s            0.205s          20.54x
Single long sequence  Parse + validation    0.005s            0.176s          34.48x
Single long sequence  Parse + tokenization  0.007s            0.177s          26.67x

Benchmark details:

  • Datasets:
    • UniProt human reference proteome (UP000005640, 9606)
    • 100MB+ FASTA generated by repeating the same real proteome, to isolate large-input throughput
    • 20,000 short 48-residue records generated from the same proteome residue stream
    • one 960,000-residue sequence generated from the same proteome residue stream
  • Matched workloads:
    • pure parse
    • parse plus validation
    • parse plus tokenization
  • Current best recorded raw throughput:
    • human proteome parse + validation: 315.4M residues/s, 360.6 MB/s
    • 100MB+ FASTA parse + validation: 350.8M residues/s, 401.1 MB/s
    • human proteome parse + tokenization: 189.0M residues/s, 216.1 MB/s
    • 100MB+ FASTA parse + tokenization: 209.7M residues/s, 239.8 MB/s
  • Benchmark doc: benchmarks/fasta_vs_biopython.md
  • Benchmark script: scripts/benchmark_fasta_vs_biopython.py

This benchmark measures biors-core directly and excludes CLI startup and JSON serialization overhead. It is still workload-specific, not a broad claim that bio-rs is faster than Biopython across every FASTA workload or researcher input shape.
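
As a reading aid for the speedup column, here is a minimal sketch of the arithmetic, assuming the column is the Biopython mean divided by the bio-rs core mean. The `speedup` helper is hypothetical and not part of biors-core. Because the table shows means rounded to milliseconds, recomputing from them only approximates the published ratios, and rows with very small means (for example 0.005s) will drift the most.

```rust
/// Speedup as a plain ratio: baseline mean divided by bio-rs mean.
/// ASSUMPTION: this is how the speedup column is derived; the benchmark
/// script is the authoritative source.
fn speedup(baseline_seconds: f64, biors_seconds: f64) -> f64 {
    baseline_seconds / biors_seconds
}

fn main() {
    // (workload, bio-rs core mean, Biopython mean) copied from the table above.
    let rows = [
        ("100MB+ FASTA, parse + validation", 0.294, 3.994),
        ("100MB+ FASTA, parse + tokenization", 0.492, 4.040),
    ];
    for (workload, biors, biopython) in rows {
        println!("{workload}: {:.2}x", speedup(biopython, biors));
    }
}
```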

What works today

biors-core provides the Rust engine and data contracts. biors provides the CLI surface.

Sequence handling

  • FASTA parsing and normalization with buffered reader APIs
  • Protein/DNA/RNA validation with per-record kind detection (--kind auto)
  • Line and record-index diagnostics with residue warning/error reporting
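
To make the diagnostics idea concrete, the sketch below reads FASTA from any buffered reader and reports residues outside the protein-20 alphabet, together with the record index and line number. It is an illustration only: the `Diagnostic` struct, the `validate_protein_fasta` function, and the message strings are hypothetical and not the biors-core API. It also treats every unexpected character as an error, whereas the real validator distinguishes warnings from errors.

```rust
use std::io::{BufRead, BufReader};

const PROTEIN_20: &str = "ACDEFGHIKLMNPQRSTVWY";

/// Hypothetical diagnostic shape: which record, which line, what went wrong.
#[derive(Debug, PartialEq)]
struct Diagnostic {
    record_index: usize, // 0-based index of the FASTA record
    line: usize,         // 1-based line number in the input
    message: String,
}

/// Scan FASTA line by line and collect residues outside protein-20.
fn validate_protein_fasta<R: BufRead>(reader: R) -> std::io::Result<Vec<Diagnostic>> {
    let mut diagnostics = Vec::new();
    let mut records_seen = 0usize;
    for (line_index, line) in reader.lines().enumerate() {
        let line = line?;
        let line_number = line_index + 1;
        let text = line.trim();
        if text.is_empty() {
            continue;
        }
        if text.starts_with('>') {
            // A header line starts a new record.
            records_seen += 1;
            continue;
        }
        if records_seen == 0 {
            diagnostics.push(Diagnostic {
                record_index: 0,
                line: line_number,
                message: "sequence data before first '>' header".to_string(),
            });
            continue;
        }
        for residue in text.chars() {
            if !PROTEIN_20.contains(residue) {
                diagnostics.push(Diagnostic {
                    record_index: records_seen - 1,
                    line: line_number,
                    message: format!("unexpected residue '{residue}'"),
                });
            }
        }
    }
    Ok(diagnostics)
}

fn main() -> std::io::Result<()> {
    let fasta = ">seq1\nACDE\n>seq2\nAC1E\n";
    for diagnostic in validate_protein_fasta(BufReader::new(fasta.as_bytes()))? {
        println!("{diagnostic:?}");
    }
    Ok(())
}
```

Taking a generic `BufRead` keeps the same function usable for files, standard input, and in-memory test fixtures.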

Tokenization

  • protein-20 tokenization with stable IDs
  • protein-20-special tokenization with UNK/PAD/CLS/SEP/MASK special tokens
  • JSON tokenizer config loading and inspection
  • Hugging Face tokenizer config conversion
  • Positional token alignment preserved with explicit unknown-token IDs
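
Positional alignment means that every input residue produces exactly one token ID, so position i in the ID list always refers to residue i in the sequence. The sketch below shows one way this could work. `UNK_ID` is an assumed value chosen for illustration (the real protein-20-special profile may number its special tokens differently), and `tokenize_aligned` is a hypothetical helper, not a biors-core function.

```rust
const PROTEIN_20: &str = "ACDEFGHIKLMNPQRSTVWY";

// ASSUMPTION: illustrative ID for the unknown token; check the output of
// `biors tokenizer inspect` for the values the real profile uses.
const UNK_ID: u32 = 20;

/// Emit exactly one ID per input position.
/// Returns the IDs plus the positions that fell back to UNK_ID.
fn tokenize_aligned(sequence: &str) -> (Vec<u32>, Vec<usize>) {
    let mut ids = Vec::with_capacity(sequence.len());
    let mut unknown_positions = Vec::new();
    for (position, residue) in sequence.chars().enumerate() {
        match PROTEIN_20.find(residue) {
            Some(index) => ids.push(index as u32),
            None => {
                // Keep the slot instead of dropping the residue, so later
                // positions do not shift.
                ids.push(UNK_ID);
                unknown_positions.push(position);
            }
        }
    }
    (ids, unknown_positions)
}

fn main() {
    let (ids, unknown) = tokenize_aligned("ACXD");
    println!("ids = {ids:?}, unknown positions = {unknown:?}");
}
```

Dropping unknown residues instead would shorten the output and break any downstream step that maps per-token predictions back to sequence positions.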

Model input

  • model-input CLI: input_ids, attention_mask, and truncation metadata
  • workflow CLI: end-to-end validation → tokenization → model input with readiness issues and reproducibility provenance
  • pipeline CLI: no-config validate → tokenize → export, or config-driven (TOML/YAML/JSON) workflows with lockfile generation
  • debug CLI: step-by-step per-record inspection with compact residue markers
  • Checked and unchecked model-input builders with safety checks for unresolved residues
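
The relationship between input_ids, attention_mask, and truncation metadata can be sketched as follows. This is a minimal illustration under stated assumptions, not the bio-rs contract: the `ModelInput` struct, its `truncated` and `original_length` fields, the `PAD_ID` value, and the choice to right-pad to `max_length` are all hypothetical. The JSON schemas in the repository define the real field set.

```rust
/// Hypothetical shape of one model-input record.
#[derive(Debug, PartialEq)]
struct ModelInput {
    input_ids: Vec<u32>,
    attention_mask: Vec<u8>,
    truncated: bool,
    original_length: usize,
}

// ASSUMPTION: illustrative padding ID, not taken from the bio-rs profiles.
const PAD_ID: u32 = 21;

/// Truncate or right-pad `token_ids` to exactly `max_length` entries.
fn build_model_input(token_ids: &[u32], max_length: usize) -> ModelInput {
    let kept = token_ids.len().min(max_length);
    let mut input_ids = token_ids[..kept].to_vec();
    // Real tokens get mask 1; padding added below gets mask 0.
    let mut attention_mask = vec![1u8; kept];
    input_ids.resize(max_length, PAD_ID);
    attention_mask.resize(max_length, 0);
    ModelInput {
        input_ids,
        attention_mask,
        truncated: token_ids.len() > max_length,
        original_length: token_ids.len(),
    }
}

fn main() {
    let short = build_model_input(&[0, 1, 2], 5);
    println!("{short:?}");
    let long = build_model_input(&[0, 1, 2, 3, 4, 5], 4);
    println!("{long:?}");
}
```

Recording the original length next to the truncation flag lets a consumer tell how much of the sequence was cut without re-reading the FASTA.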

Batch and dataset operations

  • batch validate: multiple files, recursive directories, quoted globs
  • dataset inspect: dataset descriptors, sample mapping, file SHA-256 provenance
  • cache inspect and guarded cache clean for the local artifact store

Package management

  • Manifest inspection, validation, and migration (v0 → v1)
  • Schema compatibility checks and canonical diffs
  • SHA-256 checksum verification and fixture verification
  • Python project to bio-rs package skeleton conversion
  • Runtime bridge planning reports
  • Typed validation issue codes and manifest enums

Utilities

  • diff: canonical JSON/raw comparison with SHA-256 hashes
  • doctor: platform, toolchain, WASM target, and fixture readiness
  • completions: shell completion generation
  • JSON success/error envelopes for all commands

Not yet

These are roadmap directions, not current capabilities:

  • hosted web workflows
  • Python bindings
  • model inference backends
  • package registry or plugin ecosystem
  • general-purpose chemistry tooling
  • structure tooling
  • no-code or low-code workflows

Development

Run checks:

scripts/check.sh

Run the faster local commit gate:

scripts/check-fast.sh

The check suite runs:

  • cargo fmt
  • shell and Python syntax checks for repo scripts
  • benchmark Markdown regeneration check
  • release workflow publish-order invariant check
  • Rust checks
  • biors-core wasm32-unknown-unknown build check
  • tests
  • cargo clippy with warnings denied

Reproduce the FASTA benchmark:

cargo build --release -p biors-core --example benchmark_fasta
python3 -m venv .venv-bench
. .venv-bench/bin/activate
pip install biopython
python scripts/benchmark_fasta_vs_biopython.py
cat benchmarks/fasta_vs_biopython.json

The benchmark script updates both benchmarks/fasta_vs_biopython.json and benchmarks/fasta_vs_biopython.md. scripts/check-benchmark-docs.sh verifies that the Markdown report still matches the JSON artifact.

Compare two benchmark artifacts:

python scripts/compare-benchmark-artifacts.py before.json after.json

Run the Rust library example:

cargo run -p biors-core --example tokenize

Workspace

packages/
  rust/
    biors/       CLI
    biors-core/  Core engine + contracts

schemas/
  batch-validation-output.v0.json
  cache-output.v0.json
  cli-error.v0.json
  cli-success.v0.json
  dataset-inspect-output.v0.json
  doctor-output.v0.json
  fasta-validation-output.v0.json
  inspect-output.v0.json
  model-input-output.v0.json
  output-diff.v0.json
  pipeline-config.v0.json
  pipeline-lock.v0.json
  pipeline-output.v0.json
  sequence-workflow-output.v0.json
  sequence-debug-output.v0.json
  package-bridge-output.v0.json
  package-compatibility-output.v0.json
  package-conversion-output.v0.json
  package-diff-output.v0.json
  package-inspect-output.v0.json
  package-manifest.v0.json
  package-manifest.v1.json
  package-migration-output.v0.json
  package-skeleton-output.v0.json
  package-validation-report.v0.json
  package-verify-output.v0.json
  tokenizer-conversion-output.v0.json
  tokenizer-inspect-output.v0.json
  tokenize-output.v0.json

examples/
  protein.fasta
  multi.fasta
  model-input-contract/
    protein.fasta
    protein-20-special.config.json
    protein-20-special.expected.json
    reference-python-parity.json
  python/
    esm_from_biors_json.py
    pandas_numpy_friendly.py
    protbert_from_biors_json.py
    reference_preprocess.py
  protein-package/
    models/
    docs/
    manifest.json
    observations.json
    fixtures/
    observed/
    tokenizers/
    vocabs/
    pipelines/
  pipeline/
    protein.toml
    protein.yaml
    protein.json
    pipeline.lock

Protein-20 alphabet

A C D E F G H I K L M N P Q R S T V W Y

Token IDs follow that order, starting at 0.
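
For a quick check of that mapping, here is a small sketch in which a residue's ID is simply its index in the alphabet string. The `token_id` function is a hypothetical helper written for this illustration, not the biors-core API, and it is case-sensitive by design.

```rust
/// The protein-20 alphabet in the order listed above.
const PROTEIN_20: &str = "ACDEFGHIKLMNPQRSTVWY";

/// Map one residue to its token ID (its index in PROTEIN_20).
/// Returns None for anything outside the 20 standard residues.
fn token_id(residue: char) -> Option<u8> {
    // The alphabet is ASCII, so the byte offset equals the position.
    PROTEIN_20.find(residue).map(|index| index as u8)
}

fn main() {
    for residue in ['A', 'C', 'Y', 'X'] {
        match token_id(residue) {
            Some(id) => println!("{residue} -> {id}"),
            None => println!("{residue} -> not in protein-20"),
        }
    }
}
```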

Contributing

See CONTRIBUTING.md for local setup, checks, and PR expectations.

License

Dual licensed under MIT OR Apache-2.0. If you use bio-rs in research software or publications, cite the repository and version via CITATION.cff.
