bio-rs
bio-rs turns protein FASTA into validated, tokenized, model-ready inputs for bio-AI workflows.
FASTA -> validated protein sequence -> token IDs -> model-ready JSON
DNA and RNA FASTA validation is also supported; tokenization is currently protein-only.
Status: pre-1.0 CLI and JSON contract stabilization.
Why bio-rs?
Most bio-AI models are born in Python, but the tooling around them often needs to run somewhere else:
- local CLIs
- CI pipelines
- servers
- browsers
- agents
bio-rs focuses on the boring but important layer before inference:
- parse biological sequence input
- validate it with structured diagnostics
- tokenize it into stable IDs
- emit machine-readable JSON contracts
- keep preprocessing reproducible outside notebooks
The goal is not to replace Python research workflows.
The goal is to make the input layer around bio-AI models faster, more portable, and easier to trust.
Quickstart
```sh
cargo install biors --version 0.37.3
biors tokenize examples/protein.fasta
biors workflow --max-length 8 examples/protein.fasta
biors batch validate --kind auto examples/
biors tokenizer inspect --profile protein-20-special
biors dataset inspect --source uniprot --version 2026_02 --split train examples/
```
Full commands, demos, and install options: docs/quickstart.md
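If you drive the CLI from other tooling, every command's output is meant to be machine-readable (see the CLI contract doc under Documentation for envelopes and exit codes). Below is a minimal sketch of running one of the commands above from Rust and forwarding its JSON; the only bio-rs-specific piece is the `biors tokenize` invocation itself, everything else is plain standard-library code.

```rust
use std::process::Command;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Run one of the Quickstart commands and capture its output.
    // `biors` must already be installed and on PATH (see above).
    let output = Command::new("biors")
        .args(["tokenize", "examples/protein.fasta"])
        .output()?;

    // The CLI contract documents JSON success/error envelopes and exit codes;
    // here we only check the exit status and forward the raw output.
    let stdout = String::from_utf8(output.stdout)?;
    let stderr = String::from_utf8(output.stderr)?;
    if output.status.success() {
        println!("{stdout}");
    } else {
        eprintln!("biors failed (exit {:?}): {stderr}", output.status.code());
    }
    Ok(())
}
```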
Proof
bio-rs keeps performance claims tied to reproducible in-repo benchmarks.
Latest recorded FASTA benchmark baseline:
| Dataset | Matched workload | bio-rs core mean | Biopython mean | bio-rs speedup |
|---|---|---|---|---|
| Human proteome | Parse + validation | 0.036s | 0.584s | 16.09x |
| Human proteome | Parse + tokenization | 0.061s | 0.587s | 9.68x |
| 100MB+ FASTA | Parse + validation | 0.294s | 3.994s | 13.59x |
| 100MB+ FASTA | Parse + tokenization | 0.492s | 4.040s | 8.22x |
| Many short records | Parse + validation | 0.007s | 0.204s | 28.35x |
| Many short records | Parse + tokenization | 0.010s | 0.205s | 20.54x |
| Single long sequence | Parse + validation | 0.005s | 0.176s | 34.48x |
| Single long sequence | Parse + tokenization | 0.007s | 0.177s | 26.67x |
Benchmark details:
- Datasets:
  - UniProt human reference proteome (`UP000005640, 9606`)
  - 100MB+ large FASTA generated by repeating the same real proteome to isolate large-input throughput
  - 20,000 short 48-residue records generated from the same proteome residue stream
  - one 960,000-residue sequence generated from the same proteome residue stream
- Matched workloads:
  - pure parse
  - parse plus validation
  - parse plus tokenization
- Current best recorded raw throughput (see the arithmetic sketch below):
  - human proteome parse + validation: 315.4M residues/s, 360.6 MB/s
  - 100MB+ FASTA parse + validation: 350.8M residues/s, 401.1 MB/s
  - human proteome parse + tokenization: 189.0M residues/s, 216.1 MB/s
  - 100MB+ FASTA parse + tokenization: 209.7M residues/s, 239.8 MB/s
- Benchmark doc: benchmarks/fasta_vs_biopython.md
- Benchmark script: scripts/benchmark_fasta_vs_biopython.py
This benchmark measures biors-core directly and excludes CLI startup and JSON
serialization overhead. It is still workload-specific, not a broad claim that
bio-rs is faster than Biopython across every FASTA workload or researcher input
shape.
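The residues/s and MB/s figures above are plain ratios over the matched workload timings. A minimal sketch of that arithmetic, using placeholder input sizes and timings rather than values taken from the benchmark artifact:

```rust
// How throughput and speedup figures are derived from raw timings.
// All numbers below are placeholders, not benchmark data.
fn main() {
    let residues: f64 = 10_000_000.0;  // total residues in the FASTA (placeholder)
    let bytes: f64 = 11_000_000.0;     // FASTA size in bytes (placeholder)
    let biors_mean_s: f64 = 0.05;      // bio-rs core mean wall-clock time (placeholder)
    let biopython_mean_s: f64 = 0.60;  // Biopython mean for the matched workload (placeholder)

    let residues_per_s = residues / biors_mean_s;       // residues/s
    let mb_per_s = bytes / biors_mean_s / 1_000_000.0;  // MB/s
    let speedup = biopython_mean_s / biors_mean_s;      // bio-rs speedup

    println!("{:.1}M residues/s", residues_per_s / 1e6);
    println!("{:.1} MB/s", mb_per_s);
    println!("{:.2}x speedup", speedup);
}
```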
What works today
`biors-core` provides the Rust engine and data contracts. `biors` provides the CLI surface.
Sequence handling
- FASTA parsing and normalization with buffered reader APIs
- Protein/DNA/RNA validation with per-record kind detection (`--kind auto`)
- Line and record-index diagnostics with residue warning/error reporting
Tokenization
- `protein-20` tokenization with stable IDs
- `protein-20-special` tokenization with UNK/PAD/CLS/SEP/MASK special tokens
- JSON tokenizer config loading and inspection
- Hugging Face tokenizer config conversion
- Positional token alignment preserved with explicit unknown-token IDs
Model input
- `model-input` CLI: `input_ids`, `attention_mask`, and truncation metadata (see the consumer sketch after this list)
- `workflow` CLI: end-to-end validation → tokenization → model input with readiness issues and reproducibility provenance
- `pipeline` CLI: no-config validate → tokenize → export, or config-driven (TOML/YAML/JSON) workflows with lockfile generation
- `debug` CLI: step-by-step per-record inspection with compact residue markers
- Checked and unchecked model-input builders with safety checks for unresolved residues
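For downstream consumers, the fields named above can be read straight out of the emitted JSON. A minimal sketch, assuming only the `input_ids` and `attention_mask` fields listed here; the authoritative shape is the model-input JSON schema in the workspace:

```rust
// Requires the `serde` crate (with the `derive` feature) and `serde_json`.
use serde::Deserialize;

/// Illustrative subset of a model-input record; the full contract lives in
/// schemas/model-input-output.v0.json.
#[derive(Debug, Deserialize)]
struct ModelInput {
    input_ids: Vec<u32>,
    attention_mask: Vec<u8>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // In practice this JSON would come from the `model-input` or `workflow` CLI output.
    let raw = r#"{"input_ids": [0, 1, 2, 19], "attention_mask": [1, 1, 1, 1]}"#;
    let parsed: ModelInput = serde_json::from_str(raw)?;
    assert_eq!(parsed.input_ids.len(), parsed.attention_mask.len());
    println!("{parsed:?}");
    Ok(())
}
```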
Batch and dataset operations
- `batch validate`: multiple files, recursive directories, quoted globs
- `dataset inspect`: dataset descriptors, sample mapping, file SHA-256 provenance
- `cache inspect` and guarded `cache clean` for local artifact store
Package management
- Manifest inspection, validation, and migration (v0 → v1)
- Schema compatibility checks and canonical diffs
- SHA-256 checksum verification and fixture verification
- Python project to bio-rs package skeleton conversion
- Runtime bridge planning reports
- Typed validation issue codes and manifest enums
Utilities
- `diff`: canonical JSON/raw comparison with SHA-256 hashes
- `doctor`: platform, toolchain, WASM target, and fixture readiness
- `completions`: shell completion generation
- JSON success/error envelopes for all commands
Documentation
- Quickstart — install, first commands, demos
- Launch demo — researcher-facing demo workflow
- Installation and distribution — cargo, binaries, completions
- CLI contract — commands, JSON envelopes, exit codes
- Package format — manifest layout and research metadata
- Package conversion — HF/Python project conversion path
- Pipeline config — config-driven static preprocessing workflows
- Dataset inputs and artifact store
- Error code registry
- Reliability and input safety
- Python interop
- WASM readiness
- 1.0 contract candidates
- Versioning policy
- Schema versioning
- Final release checklist
- JSON schemas
- Citation metadata
Not yet
These are roadmap directions, not current capabilities:
- hosted web workflows
- Python bindings
- model inference backends
- package registry or plugin ecosystem
- general-purpose chemistry tooling
- structure tooling
- no-code or low-code workflows
Development
Run checks:
```sh
scripts/check.sh
```
Run the faster local commit gate:
```sh
scripts/check-fast.sh
```
The check suite runs:
- `cargo fmt`
- shell and Python syntax checks for repo scripts
- benchmark Markdown regeneration check
- release workflow publish-order invariant check
- Rust checks:
  - `biors-core` `wasm32-unknown-unknown` build check
  - tests
  - `cargo clippy` with warnings denied
Reproduce the FASTA benchmark:
```sh
cargo build --release -p biors-core --example benchmark_fasta
python3 -m venv .venv-bench
. .venv-bench/bin/activate
pip install biopython
python scripts/benchmark_fasta_vs_biopython.py
cat benchmarks/fasta_vs_biopython.json
```
The benchmark script updates both benchmarks/fasta_vs_biopython.json and
benchmarks/fasta_vs_biopython.md. scripts/check-benchmark-docs.sh verifies
that the Markdown report still matches the JSON artifact.
Compare two benchmark artifacts:
```sh
python scripts/compare-benchmark-artifacts.py before.json after.json
```
Run the Rust library example:
```sh
cargo run -p biors-core --example tokenize
```
Workspace
packages/
rust/
biors/ CLI
biors-core/ Core engine + contracts
schemas/
batch-validation-output.v0.json
cache-output.v0.json
cli-error.v0.json
cli-success.v0.json
dataset-inspect-output.v0.json
doctor-output.v0.json
fasta-validation-output.v0.json
inspect-output.v0.json
model-input-output.v0.json
output-diff.v0.json
pipeline-config.v0.json
pipeline-lock.v0.json
pipeline-output.v0.json
sequence-workflow-output.v0.json
sequence-debug-output.v0.json
package-bridge-output.v0.json
package-compatibility-output.v0.json
package-conversion-output.v0.json
package-diff-output.v0.json
package-inspect-output.v0.json
package-manifest.v0.json
package-manifest.v1.json
package-migration-output.v0.json
package-skeleton-output.v0.json
package-validation-report.v0.json
package-verify-output.v0.json
tokenizer-conversion-output.v0.json
tokenizer-inspect-output.v0.json
tokenize-output.v0.json
examples/
protein.fasta
multi.fasta
model-input-contract/
protein.fasta
protein-20-special.config.json
protein-20-special.expected.json
reference-python-parity.json
python/
esm_from_biors_json.py
pandas_numpy_friendly.py
protbert_from_biors_json.py
reference_preprocess.py
protein-package/
models/
docs/
manifest.json
observations.json
fixtures/
observed/
tokenizers/
vocabs/
pipelines/
pipeline/
protein.toml
protein.yaml
protein.json
pipeline.lock
Protein-20 alphabet
A C D E F G H I K L M N P Q R S T V W Y
Token IDs follow that order, starting at 0.
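A minimal illustration of that ID assignment in plain Rust (this is not the biors-core API, just the alphabet-order rule spelled out):

```rust
/// The protein-20 alphabet in bio-rs token-ID order.
const PROTEIN_20: [char; 20] = [
    'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
    'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y',
];

/// Token ID = 0-based position in the alphabet; residues outside it get None.
fn token_id(residue: char) -> Option<u32> {
    PROTEIN_20
        .iter()
        .position(|&r| r == residue.to_ascii_uppercase())
        .map(|i| i as u32)
}

fn main() {
    // A -> 0, Y -> 19; 'X' is not in the protein-20 alphabet.
    assert_eq!(token_id('A'), Some(0));
    assert_eq!(token_id('Y'), Some(19));
    assert_eq!(token_id('X'), None);
    println!("protein-20 mapping checks passed");
}
```

The `protein-20-special` profile adds the UNK/PAD/CLS/SEP/MASK special tokens on top of this base alphabet.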
Contributing
See CONTRIBUTING.md for local setup, checks, and PR expectations.
License
Dual licensed under MIT OR Apache-2.0. If you use bio-rs in research software or publications, cite the repository and version via CITATION.cff.