A modular PyTorch framework for training language models on biological sequences (RNA/protein). Features a plugin architecture where model implementations are separate packages developed and versioned independently.
- Installation
- Adding Plugins
- Data Format
- Modes Overview
- Usage
- Available Plugins
- Configuration Management
- Output Directory Structure
- MLflow Tracking
- Testing
- Contributing
- Citation
Requirements:
- Python 3.10+
- Poetry (install guide)
Framework Installation (no plugins yet):

```bash
git clone https://github.com/dieterich-lab/biolm_utils.git
cd biolm_utils
git checkout main
./install.sh
```

BioLM development happens on the main branch. `install.sh` installs only the BioLM framework; plugins are installed separately (see below).
Choose one path (most users only need Path A):

Path A: Install a plugin

```bash
# inside the biolm_utils repo
poetry run biolm install-plugin <git-url>
poetry run biolm list-plugins
```

For Saluki specifically, install from the active branch:

```bash
poetry run biolm install-plugin "https://github.com/dieterich-lab/rna_saluki_cnn.git?ref=main"
```

What install-plugin does:
- Clones the plugin repo into `./plugins/<name>`.
- Installs it into the active Poetry environment (editable install) so BioLM can load it.

Use this path when you want to run a plugin and do not plan to modify plugin source code.
Plugin discovery: As long as the plugin is installed in the same Poetry environment (via install-plugin or develop-plugin), BioLM automatically discovers its entry point; no extra registration steps are needed.
Path B: Develop a plugin locally

If you are editing plugin code, first clone the plugin repository locally, then point BioLM to that local path:

```bash
# inside the biolm_utils repo
poetry run biolm develop-plugin /path/to/your/plugin
```

If the framework environment is not set up yet, run ./install.sh first. This keeps pyproject.toml unchanged while wiring editable installs through the CLI; edits in your plugin repo are picked up immediately.

Remove a plugin later via `poetry run biolm remove-plugin <plugin-name>` (recommended). If you previously used install-plugin and no longer want the cloned copies, you can safely remove the ./plugins directory; the CLI will recreate it on demand for future installs.
Input files must specify the delimiter via the data_source.columnsep configuration; the default delimiter is tab (\t). Example (tab-separated columns, raw sequence text):

```
ID	Label	Sequence
seq_001	1.5	AUGCUAGCUAGC
seq_002	2.3	AUGGCUAUGGCU
```

| Mode | Description | Typical Use/Plugin |
|---|---|---|
| tokenize | Build vocabulary/tokenizer from data. | All models |
| pre-train | (Optional) Pre-train language model on unlabeled data. | Required for LMs |
| fine-tune | Train model on labeled data for your task. | All models |
| predict | Run inference/prediction on new data. | All models |
| interpret | Feature importance/interpretation (e.g., saliency, attention, etc.). | All models |
Notes:
- Language models (e.g., XLNet) require pre-training before fine-tuning.
- CNN-based models (e.g., Saluki) do not require pre-training.
Below are the canonical commands, key configuration knobs, and outputs for each mode. Reference paths assume you keep experiment-specific overrides under ./my_experiment and set outputpath inside that config.
Tokenize
```bash
poetry run biolm mode=tokenize plugin=<plugin_name> data_source.filepath=/path/to/data.tsv outputpath=/tmp/biolm_run
```

- Key config values: `data_source.filepath`, `tokenization.encoding`, `tokenization.vocabsize`.
- Output: tokenizer artifacts in `${outputpath}/tokenize` (e.g., merges.txt, vocab.json).
Pre-train
```bash
poetry run biolm mode=pre-train plugin=<plugin_name> data_source.filepath=/path/to/data.tsv outputpath=/tmp/biolm_run
```

- Requires a plugin whose config sets `task: pre-train` (see `mode/pre-train.yaml`).
- Important options: `training.nepochs`, `training.batchsize`, `training.scaling`, `settings.mlflow.enabled`.
- Output: pretrained weights and logs in `${outputpath}/pre-train`.
Fine-tune
```bash
poetry run biolm mode=fine-tune plugin=<plugin_name> task=<classification|regression> data_source.filepath=/path/to/data.tsv outputpath=/tmp/biolm_run
```

- Make sure `plugin` points to the installed model package and `task` matches the plugin expectation (classification/regression).
- Main toggles: `data_source.splitratio`, `training.nepochs`, `training.patience`, `training.gradacc`.
- Output: fine-tuned checkpoints, metrics, and MLflow logs in `${outputpath}/fine-tune`.
Predict
```bash
poetry run biolm mode=predict plugin=<plugin_name> task=<classification|regression> data_source.filepath=/path/to/data.tsv inference.pretrainedmodel=/path/to/model.ckpt outputpath=/tmp/biolm_run
```

- Ensure `inference.pretrainedmodel` is set to the checkpoint produced by fine-tuning or pre-training.
- Optional overrides: `inference.looscores.handletokens` (defaults to `mask` here), `debugging.dev` for quick dry runs.
- Output: `${outputpath}/predict/test_predictions.csv` (IDs plus plugin-specific scores/probabilities) and logs in `${outputpath}/predict/logs/`.
Interpret
```bash
poetry run biolm mode=interpret plugin=<plugin_name> task=<classification|regression> data_source.filepath=/path/to/data.tsv inference.pretrainedmodel=/path/to/model.ckpt outputpath=/tmp/biolm_run
```

- Core options under `inference.looscores`:
  - `handletokens`: `mask` (default) or `remove`, controlling occlusion behaviour.
  - `replacementdict`: dictionary limiting replacements per token; leave `null` for full masking.
  - `replacespecifier`: boolean to include sequence specifier fields in replacements.
- Other useful flags: `debugging.dev` to restrict the number of samples, `training.batchsize` for occlusion batching.
- Output: `${outputpath}/interpret/loo_scores_<handletokens>.csv` and `.pkl`, plus logs in `${outputpath}/interpret/logs/`.
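The mask/remove distinction can be sketched generically: score the full sequence, then re-score with each position occluded and record the delta. A toy illustration with a dummy scorer (not BioLM's actual implementation):

```python
from typing import Callable

MASK = "<mask>"

def loo_scores(seq: list[str], score: Callable[[list[str]], float],
               handletokens: str = "mask") -> list[float]:
    """Per-position importance: score(full) - score(occluded)."""
    base = score(seq)
    deltas = []
    for i in range(len(seq)):
        if handletokens == "mask":
            occluded = seq[:i] + [MASK] + seq[i + 1:]  # replace token in place
        else:  # "remove"
            occluded = seq[:i] + seq[i + 1:]           # drop token entirely
        deltas.append(base - score(occluded))
    return deltas

# Dummy scorer: counts G/C tokens, so only G and C positions matter.
gc_score = lambda toks: float(sum(t in ("G", "C") for t in toks))
print(loo_scores(list("AUGC"), gc_score))  # [0.0, 0.0, 1.0, 1.0]
```

With `mask`, sequence length is preserved (useful for position-sensitive models); with `remove`, the token is dropped outright, which can shift downstream positions.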
Run any mode with:
```bash
poetry run biolm mode=<tokenize|pre-train|fine-tune|predict|interpret> plugin=<plugin_name> data_source.filepath=/path/to/data.tsv outputpath=/tmp/biolm_run
```

Optional equivalent invocation:

```bash
poetry run python -m biolm.runner mode=fine-tune plugin=<plugin_name> task=<classification|regression> data_source.filepath=/path/to/data.tsv outputpath=/tmp/biolm_run
```

Hydra has no --config-file flag in this CLI. For custom config files, use --config-path (directory) and --config-name (filename without .yaml).
- CLI parses args and Hydra composes configs.
- `plugin_config` resolves the plugin entry point; plugin config classes are loaded.
- Data is loaded/prepared (tokenizer built or loaded); datasets are cached under `${outputpath}/{mode}`.
- The mode dispatcher (`runner`) calls the appropriate trainer/evaluator.
- Artifacts and logs are written to `${outputpath}/{mode}`; MLflow (if enabled) logs params/metrics/artifacts to `${outputpath}/mlruns`.
BioLM uses Hydra composition in layers:

- Base config (always loaded): biolm/conf/config.yaml
- Mode config (always loaded): one file from biolm/conf/mode, selected via `mode=...`
- Task config (required for some modes): one file from biolm/conf/task, selected via `task=...`
- Experiment config (optional): your own config.yaml when you want reusable project-specific defaults

You do not need to maintain all of these files yourself. In practice:

- CLI-only runs need only runtime overrides (`mode=... plugin=... ...`).
- A single experiment config.yaml is optional, for convenience/reproducibility.
- For fine-tune/predict/interpret, Hydra now requires `task=classification` or `task=regression` during composition.
A) No experiment file (fastest way):
```bash
poetry run biolm mode=tokenize plugin=<plugin_name> data_source.filepath=/path/to/data.tsv outputpath=/tmp/biolm_run
poetry run biolm mode=fine-tune plugin=<plugin_name> task=<classification|regression> data_source.filepath=/path/to/data.tsv outputpath=/tmp/biolm_run
```

Use the same outputpath for both commands so fine-tune can reuse tokenizer artifacts from tokenize.
B) One experiment file (recommended for repeat runs):
Pick a plugin, an output path, and the options that change per run. Here is a minimal config.yaml you can drop into any experiment directory:

```yaml
plugin: <plugin_name>
mode: fine-tune
outputpath: /tmp/biolm_quickstart
task: classification
data_source:
  filepath: examples/data/quickstart_sequences.tsv
  columnsep: "\t"
  idpos: 1
  seqpos: 2
  labelpos: 3
  splitratio: [70, 15, 15]
training:
  nepochs: 3
  batchsize: 4
```

Then run it with:

```bash
poetry run biolm --config-path /path/to/experiment --config-name config
```

When running training/inference from that config, start with mode=tokenize once per dataset/output path before fine-tune.
The shared base config declares:

```yaml
defaults:
  - mode: ???
  - _self_
```

This means Hydra expects you to resolve a mode file (e.g., the mode/fine-tune.yaml bundle) before the CLI can run. For task-dependent modes (fine-tune / predict / interpret), Hydra also expects a task selection, because those mode files include a task default placeholder.

You can resolve mode/task either in your experiment file or by passing them on the command line. Example in a config file:

```yaml
defaults:
  - mode: fine-tune
  - task: classification
  - _self_
```

Or via CLI overrides: mode=fine-tune task=classification.
Hydra merges the base config, selected mode config, selected task config (when used), optional experiment config, and runtime overrides
(for example, training.nepochs=50 or data_source.filepath=/new/path). That keeps the common
defaults inside biolm/conf untouched while letting you customize only the pieces that change per run.
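Conceptually, this layering behaves like successive nested-dictionary merges in which later layers win. A plain-Python sketch of the precedence order (illustrative only; the values are made up, and this is not Hydra's actual implementation):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; override wins on conflicts."""
    out = dict(base)
    for key, val in override.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], val)
        else:
            out[key] = val
    return out

base = {"training": {"nepochs": 10, "batchsize": 8}}   # framework defaults
mode_cfg = {"training": {"nepochs": 30}}               # mode/fine-tune.yaml
experiment = {"outputpath": "/tmp/biolm_run"}          # your config.yaml
cli = {"training": {"nepochs": 50}}                    # training.nepochs=50

cfg = base
for layer in (mode_cfg, experiment, cli):              # later layers win
    cfg = deep_merge(cfg, layer)
print(cfg["training"])  # nepochs from CLI, batchsize from base
```

Note that untouched keys (here `training.batchsize`) survive every merge, which is why you only ever specify the pieces that change per run.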
You can keep experiment files as simple as:
my_experiment/
└── config.yaml
Add custom mode files (for example my_experiment/mode/fine-tune.yaml) only when you intentionally
want to override built-in mode defaults from biolm/conf/mode/fine-tune.yaml.
--config-path and --config-name mean:

- `--config-path`: directory where Hydra should look for your config files.
- `--config-name`: filename (without .yaml) to load from that directory.

There is no --config-file flag in this interface.
Example:
```bash
poetry run biolm --config-path my_experiment --config-name config
```

If your config file does not pin mode, append mode=... on the CLI.
| Mode | task required? | Allowed / typical value |
|---|---|---|
| tokenize | No | Not used |
| pre-train | No | Not used (framework runs MLM pre-training path) |
| fine-tune | Yes | classification or regression |
| predict | Yes | classification or regression |
| interpret | Yes | classification or regression |

Only two task values are supported in task-dependent modes: classification and regression.
With a config directory ready, run the modes sequentially as follows (adjust for your plugin if it does not require pre-training):
For Saluki (CNN-based, no pre-training needed):
```bash
# Tokenize data (required for atomic encoding)
poetry run biolm mode=tokenize plugin=saluki data_source.filepath=examples/data/quickstart_sequences.tsv data_source.stripheader=true data_source.idpos=1 data_source.seqpos=3 data_source.labelpos=2 outputpath=/tmp/biolm_quickstart

# Fine-tune directly (no pre-training required)
poetry run biolm mode=fine-tune plugin=saluki task=classification data_source.filepath=examples/data/quickstart_sequences.tsv data_source.stripheader=true data_source.idpos=1 data_source.seqpos=3 data_source.labelpos=2 outputpath=/tmp/biolm_quickstart

# Predict with the fine-tuned checkpoint
poetry run biolm mode=predict plugin=saluki task=classification data_source.filepath=examples/data/quickstart_sequences.tsv data_source.stripheader=true data_source.idpos=1 data_source.seqpos=3 data_source.labelpos=2 inference.pretrainedmodel=/tmp/biolm_quickstart/fine-tune/model.safetensors outputpath=/tmp/biolm_quickstart

# Interpret with the fine-tuned checkpoint
poetry run biolm mode=interpret plugin=saluki task=classification data_source.filepath=examples/data/quickstart_sequences.tsv data_source.stripheader=true data_source.idpos=1 data_source.seqpos=3 data_source.labelpos=2 inference.pretrainedmodel=/tmp/biolm_quickstart/fine-tune/model.safetensors outputpath=/tmp/biolm_quickstart
```

For XLNet (transformer-based, requires pre-training):
```bash
poetry run biolm mode=tokenize plugin=xlnet data_source.filepath=examples/data/quickstart_sequences.tsv data_source.stripheader=true data_source.idpos=1 data_source.seqpos=3 data_source.labelpos=2 outputpath=/tmp/biolm_quickstart
poetry run biolm mode=pre-train plugin=xlnet data_source.filepath=examples/data/quickstart_sequences.tsv data_source.stripheader=true data_source.idpos=1 data_source.seqpos=3 data_source.labelpos=2 outputpath=/tmp/biolm_quickstart
poetry run biolm mode=fine-tune plugin=xlnet task=classification data_source.filepath=examples/data/quickstart_sequences.tsv data_source.stripheader=true data_source.idpos=1 data_source.seqpos=3 data_source.labelpos=2 outputpath=/tmp/biolm_quickstart
```

Skip the pre-train command if your plugin (for example, a CNN) only needs fine-tuning. The examples/data/quickstart_sequences.tsv file includes 100 tab-separated rows (ID, label, sequence) so you can experiment without cloning any plugins.
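The idpos/seqpos/labelpos overrides above select columns by 1-based position. A minimal stdlib sketch of that column-selection logic (illustrative; it assumes the 1-based indexing used in the commands above):

```python
import csv
import io

def read_rows(text: str, columnsep: str = "\t", idpos: int = 1,
              seqpos: int = 3, labelpos: int = 2, stripheader: bool = True):
    """Yield (id, sequence, label) tuples using 1-based column positions."""
    reader = csv.reader(io.StringIO(text), delimiter=columnsep)
    if stripheader:
        next(reader, None)  # drop the header row, mirroring stripheader=true
    for row in reader:
        yield row[idpos - 1], row[seqpos - 1], row[labelpos - 1]

sample = "ID\tLabel\tSequence\nseq_001\t1.5\tAUGCUAGCUAGC\n"
print(list(read_rows(sample)))  # [('seq_001', 'AUGCUAGCUAGC', '1.5')]
```

Swapping the position arguments lets the same file layout serve different plugins without rewriting the data.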
Pass overrides like `training.batchsize=8`, `data_source.filepath=/new.tsv`, or `settings.mlflow.enabled=true` after the command to tweak a single value without editing YAML. Hydra merges these last, so they take precedence over the experiment files and the framework defaults.
| Plugin | Model | Sequences | Pre-training | Use Case |
|---|---|---|---|---|
| rna_protein_xlnet | XLNet | RNA/DNA/Protein | Yes | General sequence modeling (pre-train + downstream tasks) |
| rna_saluki_cnn | CNN | RNA/DNA/Protein | No | Sequence classification/regression without pre-train |
The framework organizes outputs under the configured outputpath:
```
output/
├── tokenize/
│   ├── merges.txt                # BPE merge rules (if applicable)
│   ├── vocab.json                # Tokenizer vocabulary
│   ├── tokenizer_config.json     # HuggingFace tokenizer configuration
│   └── tokenizer.json            # Serialized tokenizer weights
├── pre-train/
│   ├── checkpoint-XX/            # Checkpoints saved per epoch
│   ├── model.safetensors         # Final model weights
│   ├── config.json               # Model config
│   ├── pre-train_dataset.pkl     # Cached dataset (for reproducibility)
│   ├── logs/<timestamp>.log      # Training logs
│   └── final_model/              # Copy of best checkpoint
├── fine-tune/
│   ├── checkpoint-XX/            # Checkpoints
│   ├── model.safetensors         # Fine-tuned weights
│   ├── fine-tune_dataset.pkl     # Dataset cache
│   ├── all_results.json          # Aggregated metrics (trainer)
│   ├── test_predictions.csv      # Raw predictions on the test split
│   ├── rank_deltas.csv           # Rank delta report (regression)
│   ├── logs/<timestamp>.log      # Training logs
│   └── final_model/              # Best checkpoint copy
├── predict/
│   ├── predict_dataset.pkl       # Cached inference dataset
│   ├── test_predictions.csv      # Model predictions (IDs + outputs)
│   ├── rank_deltas.csv           # Ranking comparison (regression)
│   ├── logs/<timestamp>.log      # Inference logs
│   └── report.csv                # Report file (legacy modes)
├── interpret/
│   ├── interpret_dataset.pkl     # Cached dataset for LOO scoring
│   ├── loo_scores_mask.csv       # Leave-one-out scores (mask policy)
│   ├── loo_scores_mask.pkl       # Serialized SHAP explanations
│   ├── loo_scores_remove.csv     # Leave-one-out scores (remove policy)
│   ├── loo_scores_remove.pkl     # Serialized explanations
│   └── logs/<timestamp>.log      # Interpret logs
└── mlruns/                       # MLflow tracking data
```
Each mode writes logs/<timestamp>.log plus the dataset cache (<mode>_dataset.pkl) and any ranking/report files so reproducing a run only needs the appropriate slice of the tree.
- Checkpoints: saved under `${outputpath}/pre-train` and `${outputpath}/fine-tune` (plugin-specific filenames, e.g., model.safetensors). Reuse them by pointing `inference.pretrainedmodel` (for predict/interpret) or `model_load_path` (for continued training) at them.
- test_predictions.csv: typically includes sample identifiers plus plugin-specific scores/probabilities; labels may appear if available. Schemas can differ by plugin; consult the plugin README for exact columns.
- loo_scores_<handletokens>.csv/.pkl: per-position leave-one-out scores, including sequence IDs, positions, tokens, and plugin-specific score deltas. The <handletokens> suffix reflects the occlusion strategy (mask/remove).
- MLflow run folders: contain params, metrics, and artifacts (including checkpoints and logs). The MLflow UI can browse and download these directly.
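A small helper for locating these artifacts is handy in analysis scripts. A sketch assuming the directory layout above; exact filenames are plugin-dependent:

```python
from pathlib import Path

def artifact_paths(outputpath: str, mode: str) -> dict[str, Path]:
    """Expected artifact locations for one mode under outputpath."""
    root = Path(outputpath) / mode
    return {
        "weights": root / "model.safetensors",        # pre-train / fine-tune
        "dataset_cache": root / f"{mode}_dataset.pkl",
        "predictions": root / "test_predictions.csv", # fine-tune / predict
        "logs": root / "logs",
    }

paths = artifact_paths("/tmp/biolm_quickstart", "fine-tune")
missing = [name for name, p in paths.items() if not p.exists()]
print("missing:", missing)  # lists artifacts not yet produced
```

Running this after each mode makes it easy to confirm that a step produced the files the next step expects (e.g., the checkpoint that predict's `inference.pretrainedmodel` points at).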
BioLM integrates with MLflow for experiment tracking. To enable MLflow:

- Set `settings.mlflow.enabled: true` in the configuration.
- Access the MLflow UI:

```bash
poetry run mlflow ui --backend-store-uri output/mlruns
```

- Download artifacts (e.g., models, logs) directly from the UI.

Tracking is scoped to each run's outputpath (default ${outputpath}/mlruns) rather than a global store; set mlflow.tracking_uri if you want a shared backend.
See docs/PLUGIN_CONTRACT.md for the required entry point, factory return shape, and dataset/model/tokenizer expectations.
Run tests with:
```bash
poetry run pytest tests/
```

For specific suites:

```bash
poetry run pytest tests/integration/   # Plugin system tests
poetry run pytest tests/test_*.py      # Unit tests
```

With coverage:

```bash
poetry run pytest --cov=biolm --cov-report=html
```

- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Commit changes: `git commit -m 'Add amazing feature'`
- Push: `git push origin feature/amazing-feature`
- Open a Pull Request
Plugin development: See Plugin Development Guide below
```bibtex
@software{biolm2024,
  title = {BioLM 2.0: A Modular Framework for Biological Language Models},
  author = {Philipp Wiesenbach},
  year = {2024},
  url = {https://github.com/dieterich-lab/biolm_utils}
}
```