
BioLM 2.0 Framework

A modular PyTorch framework for training language models on biological sequences (RNA/protein). Features a plugin architecture where model implementations are separate packages developed and versioned independently.



🚀 Installation

Requirements: git, Python, and Poetry (every command below runs through poetry run).

Framework Installation (no plugins yet):

git clone https://github.com/dieterich-lab/biolm_utils.git
cd biolm_utils
git checkout main
./install.sh

BioLM development happens on the main branch.

install.sh installs only the BioLM framework. Plugins are installed separately (see below).

🔌 Adding Plugins

Choose one path (most users only need Path A):

Path A — Run an existing plugin (recommended)

# inside the biolm_utils repo
poetry run biolm install-plugin <git-url>
poetry run biolm list-plugins

For Saluki specifically, install from the active branch:

poetry run biolm install-plugin "https://github.com/dieterich-lab/rna_saluki_cnn.git?ref=main"

What install-plugin does:

  • Clones the plugin repo into ./plugins/<name>.
  • Installs it into the active Poetry environment (editable install) so BioLM can load it.

Use this path when you want to run a plugin and do not plan to modify plugin source code.

Plugin discovery: As long as the plugin is installed in the same Poetry environment (via install-plugin or develop-plugin), BioLM automatically discovers the entry point—no extra registration steps are needed.

Path B — Develop a plugin locally

If you are editing plugin code, first clone the plugin repository locally, then point BioLM to that local path.

# inside the biolm_utils repo
poetry run biolm develop-plugin /path/to/your/plugin

If the framework environment is not set up yet, run ./install.sh first.

This keeps pyproject.toml unchanged while wiring editable installs through the CLI. Edits in your plugin repo are picked up immediately.
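
A minimal sketch of the develop workflow, assuming a hypothetical plugin repository (replace the URL, path, and plugin name with your own):

# clone the plugin somewhere outside the framework repo (path is illustrative)
git clone https://github.com/your-org/my_plugin.git ~/src/my_plugin

# wire it into the active Poetry environment as an editable install
cd biolm_utils
poetry run biolm develop-plugin ~/src/my_plugin

# confirm discovery; source edits take effect on the next invocation
poetry run biolm list-plugins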

Remove a plugin later via poetry run biolm remove-plugin <plugin-name> (preferred over deleting plugin files by hand).

If you previously used install-plugin and no longer want the cloned copies, you can safely remove the ./plugins directory; the CLI will recreate it on demand for future user installs.


📊 Data Format

Set the column delimiter with the data_source.columnsep option; it defaults to tab (\t). Example (tab-separated columns, raw sequence text):

ID	Label	Sequence
seq_001	1.5	AUGCUAGCUAGC
seq_002	2.3	AUGGCUAUGGCU
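
A quick way to produce such a file for experimentation (paths and contents are illustrative):

# write a minimal three-column input (ID, Label, Sequence), tab-separated
printf 'ID\tLabel\tSequence\n'        >  /tmp/example.tsv
printf 'seq_001\t1.5\tAUGCUAGCUAGC\n' >> /tmp/example.tsv
printf 'seq_002\t2.3\tAUGGCUAUGGCU\n' >> /tmp/example.tsv

# for comma-separated input instead, override the delimiter at run time:
#   data_source.columnsep=","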

⚡ Modes Overview

Mode       | Description                                                     | Typical Use/Plugin
-----------|-----------------------------------------------------------------|-------------------
tokenize   | Build vocabulary/tokenizer from data.                           | All models
pre-train  | (Optional) Pre-train language model on unlabeled data.          | Required for LMs
fine-tune  | Train model on labeled data for your task.                      | All models
predict    | Run inference/prediction on new data.                           | All models
interpret  | Feature importance/interpretation (e.g., saliency, attention).  | All models

Notes:

  • Language models (e.g., XLNet) require pre-training before fine-tuning.
  • CNN-based models (e.g., Saluki) do not require pre-training.

Mode Quickstart

Below are the canonical commands, the key configuration knobs, and the outputs for each mode. Reference paths assume you keep experiment-specific overrides under ./my_experiment and set outputpath inside that config.

Tokenize

poetry run biolm mode=tokenize plugin=<plugin_name> data_source.filepath=/path/to/data.tsv outputpath=/tmp/biolm_run
  • Key config values: data_source.filepath, tokenization.encoding, tokenization.vocabsize.
  • Output: tokenizer artifacts in ${outputpath}/tokenize (e.g., merges.txt, vocab.json).
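
For example, a tokenize run pinning the key values explicitly (the vocabulary size is illustrative; valid encodings and sizes depend on your plugin):

poetry run biolm mode=tokenize plugin=<plugin_name> \
  data_source.filepath=/tmp/example.tsv \
  tokenization.vocabsize=1024 \
  outputpath=/tmp/biolm_run
# tokenizer artifacts then land in /tmp/biolm_run/tokenize/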

Pre-train

poetry run biolm mode=pre-train plugin=<plugin_name> data_source.filepath=/path/to/data.tsv outputpath=/tmp/biolm_run
  • Requires a plugin whose config sets task: pre-train (see mode/pre-train.yaml).
  • Important options: training.nepochs, training.batchsize, training.scaling, settings.mlflow.enabled.
  • Output: pretrained weights and logs in ${outputpath}/pre-train.
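
A sketch of a pre-training run with those options pinned (values are illustrative):

poetry run biolm mode=pre-train plugin=<plugin_name> \
  data_source.filepath=/tmp/example.tsv \
  training.nepochs=10 training.batchsize=8 \
  settings.mlflow.enabled=true \
  outputpath=/tmp/biolm_run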

Fine-tune

poetry run biolm mode=fine-tune plugin=<plugin_name> task=<classification|regression> data_source.filepath=/path/to/data.tsv outputpath=/tmp/biolm_run
  • Make sure plugin points to the installed model package and task matches the plugin expectation (classification/regression).
  • Main toggles: data_source.splitratio, training.nepochs, training.patience, training.gradacc.
  • Output: fine-tuned checkpoints, metrics, and MLflow logs in ${outputpath}/fine-tune.
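
For instance, a regression fine-tune with the main toggles set (values are illustrative; quote the list override so the shell passes the brackets through to Hydra):

poetry run biolm mode=fine-tune plugin=<plugin_name> task=regression \
  data_source.filepath=/tmp/example.tsv \
  'data_source.splitratio=[70,15,15]' \
  training.nepochs=20 training.patience=3 training.gradacc=2 \
  outputpath=/tmp/biolm_run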

Predict

poetry run biolm mode=predict plugin=<plugin_name> task=<classification|regression> data_source.filepath=/path/to/data.tsv inference.pretrainedmodel=/path/to/model.ckpt outputpath=/tmp/biolm_run
  • Ensure inference.pretrainedmodel is set to the checkpoint produced by fine-tuning or pre-training.
  • Optional overrides: inference.looscores.handletokens (defaults to mask here), debugging.dev for quick dry-runs.
  • Output: ${outputpath}/predict/test_predictions.csv (IDs plus plugin-specific scores/probabilities) and logs in ${outputpath}/predict/logs/.
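
A hedged predict invocation reusing the fine-tuned checkpoint (paths are illustrative):

poetry run biolm mode=predict plugin=<plugin_name> task=regression \
  data_source.filepath=/tmp/new_sequences.tsv \
  inference.pretrainedmodel=/tmp/biolm_run/fine-tune/model.safetensors \
  outputpath=/tmp/biolm_run
# add debugging.dev for a quick dry-run; predictions appear in
# /tmp/biolm_run/predict/test_predictions.csv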

Interpret

poetry run biolm mode=interpret plugin=<plugin_name> task=<classification|regression> data_source.filepath=/path/to/data.tsv inference.pretrainedmodel=/path/to/model.ckpt outputpath=/tmp/biolm_run
  • Core options under inference.looscores:
    • handletokens: mask (default) or remove to control occlusion behaviour.
    • replacementdict: dictionary limiting replacements per token; leave null for full masking.
    • replacespecifier: boolean to include sequence specifier fields in replacements.
  • Other useful flags: debugging.dev to restrict the number of samples, training.batchsize for occlusion batching.
  • Output: ${outputpath}/interpret/loo_scores_<handletokens>.csv and .pkl plus logs in ${outputpath}/interpret/logs/.
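
For example, switching the occlusion policy to remove and batching the occlusion passes (values are illustrative):

poetry run biolm mode=interpret plugin=<plugin_name> task=classification \
  data_source.filepath=/tmp/example.tsv \
  inference.pretrainedmodel=/tmp/biolm_run/fine-tune/model.safetensors \
  inference.looscores.handletokens=remove \
  training.batchsize=16 \
  outputpath=/tmp/biolm_run
# scores: /tmp/biolm_run/interpret/loo_scores_remove.csv and .pkl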

🛠️ Usage

Run any mode with:

poetry run biolm mode=<tokenize|pre-train|fine-tune|predict|interpret> plugin=<plugin_name> data_source.filepath=/path/to/data.tsv outputpath=/tmp/biolm_run

Optional equivalent invocation:

poetry run python -m biolm.runner mode=fine-tune plugin=<plugin_name> task=<classification|regression> data_source.filepath=/path/to/data.tsv outputpath=/tmp/biolm_run

This CLI has no --config-file flag. For custom config files, use Hydra's --config-path (directory) and --config-name (filename without .yaml).

🧭 Execution Flow (at a glance)

  1. CLI parses args and Hydra composes configs.
  2. plugin_config resolves the plugin entry point; plugin config classes are loaded.
  3. Data is loaded/prepared (tokenizer built or loaded); datasets are cached under ${outputpath}/{mode}.
  4. Mode dispatcher (runner) calls the appropriate trainer/evaluator.
  5. Artifacts and logs are written to ${outputpath}/{mode}; MLflow (if enabled) logs params/metrics/artifacts to ${outputpath}/mlruns.

⚙️ Configuration & Quickstart

BioLM uses Hydra composition in layers:

  1. Base config (always loaded): biolm/conf/config.yaml
  2. Mode config (always loaded): one file from biolm/conf/mode, selected via mode=...
  3. Task config (required for some modes): one file from biolm/conf/task, selected via task=...
  4. Experiment config (optional): your own config.yaml when you want reusable project-specific defaults

You do not need to maintain all of these files yourself. In practice:

  • CLI-only runs need only runtime overrides (mode=... plugin=... ...).
  • A single experiment config.yaml is optional for convenience/reproducibility.
  • For fine-tune / predict / interpret, Hydra now requires task=classification or task=regression during composition.

Minimal ways to run

A) No experiment file (fastest way):

poetry run biolm mode=tokenize plugin=<plugin_name> data_source.filepath=/path/to/data.tsv outputpath=/tmp/biolm_run
poetry run biolm mode=fine-tune plugin=<plugin_name> task=<classification|regression> data_source.filepath=/path/to/data.tsv outputpath=/tmp/biolm_run

Use the same outputpath for both commands so fine-tune can reuse tokenizer artifacts from tokenize.

B) One experiment file (recommended for repeat runs):

Pick a plugin, output path, and the options that change per run. Here is a minimal config.yaml you can drop into any experiment directory:

plugin: <plugin_name>
mode: fine-tune
outputpath: /tmp/biolm_quickstart
task: classification
data_source:
  filepath: examples/data/quickstart_sequences.tsv
  columnsep: "\t"
  stripheader: true
  idpos: 1
  labelpos: 2
  seqpos: 3
  splitratio: [70, 15, 15]
training:
  nepochs: 3
  batchsize: 4

Then run it with:

poetry run biolm --config-path /path/to/experiment --config-name config

Before the first fine-tune from that config, run mode=tokenize once per dataset/output path so the tokenizer artifacts exist, as shown below.
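
Putting it together, a hedged sequence for a fresh experiment directory (paths are illustrative; a mode=... given on the CLI takes precedence over the mode pinned in the file):

mkdir -p my_experiment          # place the config.yaml above inside it

# build the tokenizer once per dataset/output path
poetry run biolm --config-path my_experiment --config-name config mode=tokenize

# then run the mode pinned in the file (fine-tune here)
poetry run biolm --config-path my_experiment --config-name config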

Hydra composition

The shared base config declares:

defaults:
  - mode: ???
  - _self_

This means Hydra expects you to resolve a mode file (e.g., the mode/fine-tune.yaml bundle) before the CLI can run. For task-dependent modes (fine-tune / predict / interpret), Hydra also expects a task selection because those mode files include a task default placeholder.

You can resolve mode/task either in your experiment file or by passing them on the command line.

Example in config file:

defaults:
  - mode: fine-tune
  - task: classification
  - _self_

Or via CLI overrides: mode=fine-tune task=classification.

Hydra merges the base config, selected mode config, selected task config (when used), optional experiment config, and runtime overrides (for example, training.nepochs=50 or data_source.filepath=/new/path). That keeps the common defaults inside biolm/conf untouched while letting you customize only the pieces that change per run.

Custom experiment directories

You can keep experiment files as simple as:

my_experiment/
└── config.yaml

Add custom mode files (for example my_experiment/mode/fine-tune.yaml) only when you intentionally want to override built-in mode defaults from biolm/conf/mode/fine-tune.yaml.

--config-path and --config-name mean:

  • --config-path: directory where Hydra should look for your config files.
  • --config-name: filename (without .yaml) to load from that directory.


Example:

poetry run biolm --config-path my_experiment --config-name config

If your config file does not pin mode, append mode=... on the CLI.

When is task required?

Mode       | task required? | Allowed / typical value
-----------|----------------|-------------------------------------------------
tokenize   | No             | Not used
pre-train  | No             | Not used (framework runs MLM pre-training path)
fine-tune  | Yes            | classification or regression
predict    | Yes            | classification or regression
interpret  | Yes            | classification or regression

Only two task values are supported in task-dependent modes: classification and regression.

Quickstart commands

With a config directory ready, run the modes sequentially as follows (adjust for your plugin if it does not require pre-training):

For Saluki (CNN-based, no pre-training needed):

# Tokenize data (required for atomic encoding)
poetry run biolm mode=tokenize plugin=saluki data_source.filepath=examples/data/quickstart_sequences.tsv data_source.stripheader=true data_source.idpos=1 data_source.seqpos=3 data_source.labelpos=2 outputpath=/tmp/biolm_quickstart
# Fine-tune directly (no pre-training required)
poetry run biolm mode=fine-tune plugin=saluki task=classification data_source.filepath=examples/data/quickstart_sequences.tsv data_source.stripheader=true data_source.idpos=1 data_source.seqpos=3 data_source.labelpos=2 outputpath=/tmp/biolm_quickstart
poetry run biolm mode=predict plugin=saluki task=classification data_source.filepath=examples/data/quickstart_sequences.tsv data_source.stripheader=true data_source.idpos=1 data_source.seqpos=3 data_source.labelpos=2 inference.pretrainedmodel=/tmp/biolm_quickstart/fine-tune/model.safetensors outputpath=/tmp/biolm_quickstart
poetry run biolm mode=interpret plugin=saluki task=classification data_source.filepath=examples/data/quickstart_sequences.tsv data_source.stripheader=true data_source.idpos=1 data_source.seqpos=3 data_source.labelpos=2 inference.pretrainedmodel=/tmp/biolm_quickstart/fine-tune/model.safetensors outputpath=/tmp/biolm_quickstart

For XLNet (transformer-based, requires pre-training):

poetry run biolm mode=tokenize plugin=xlnet data_source.filepath=examples/data/quickstart_sequences.tsv data_source.stripheader=true data_source.idpos=1 data_source.seqpos=3 data_source.labelpos=2 outputpath=/tmp/biolm_quickstart
poetry run biolm mode=pre-train plugin=xlnet data_source.filepath=examples/data/quickstart_sequences.tsv data_source.stripheader=true data_source.idpos=1 data_source.seqpos=3 data_source.labelpos=2 outputpath=/tmp/biolm_quickstart
poetry run biolm mode=fine-tune plugin=xlnet task=classification data_source.filepath=examples/data/quickstart_sequences.tsv data_source.stripheader=true data_source.idpos=1 data_source.seqpos=3 data_source.labelpos=2 outputpath=/tmp/biolm_quickstart

Skip the pre-train command if your plugin (for example, a CNN) only needs fine-tuning. The examples/data/quickstart_sequences.tsv file includes 100 tab-separated rows (ID, label, sequence) so you can experiment without cloning any plugins.

Runtime overrides

Pass overrides like training.batchsize=8, data_source.filepath=/new.tsv, or settings.mlflow.enabled=true after the command to tweak a single value without editing YAML. Hydra merges these last, so they take precedence over the experiment files and the framework defaults.
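
For instance (override names taken from the examples above):

poetry run biolm --config-path my_experiment --config-name config \
  training.batchsize=8 \
  settings.mlflow.enabled=true \
  data_source.filepath=/new.tsv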


🔌 Available Plugins

Plugin            | Model | Sequences       | Pre-training | Use Case
------------------|-------|-----------------|--------------|----------------------------------------------------------
rna_protein_xlnet | XLNet | RNA/DNA/Protein | Yes          | General sequence modeling (pre-train + downstream tasks)
rna_saluki_cnn    | CNN   | RNA/DNA/Protein | No           | Sequence classification/regression without pre-training

📂 Output Directory Structure

The framework organizes outputs under the configured outputpath:

output/
├── tokenize/
│   ├── merges.txt             # BPE merge rules (if applicable)
│   ├── vocab.json             # Tokenizer vocabulary
│   ├── tokenizer_config.json  # HuggingFace tokenizer configuration
│   └── tokenizer.json         # Serialized tokenizer weights
├── pre-train/
│   ├── checkpoint-XX/         # Checkpoints saved per epoch
│   ├── model.safetensors      # Final model weights
│   ├── config.json            # Model config
│   ├── pre-train_dataset.pkl  # Cached dataset (for reproducibility)
│   ├── logs/<timestamp>.log   # Training logs
│   └── final_model/           # Copy of best checkpoint
├── fine-tune/
│   ├── checkpoint-XX/         # Checkpoints
│   ├── model.safetensors      # Fine-tuned weights
│   ├── fine-tune_dataset.pkl  # Dataset cache
│   ├── all_results.json       # Aggregated metrics (trainer)
│   ├── test_predictions.csv   # Raw predictions on the test split
│   ├── rank_deltas.csv        # Rank delta report (regression)
│   ├── logs/<timestamp>.log   # Training logs
│   └── final_model/           # Best checkpoint copy
├── predict/
│   ├── predict_dataset.pkl    # Cached inference dataset
│   ├── test_predictions.csv   # Model predictions (IDs + outputs)
│   ├── rank_deltas.csv        # Ranking comparison (regression)
│   ├── logs/<timestamp>.log   # Inference logs
│   └── report.csv             # Legacy report file
├── interpret/
│   ├── interpret_dataset.pkl  # Cached dataset for LOO scoring
│   ├── loo_scores_mask.csv    # Leave-one-out scores (mask policy)
│   ├── loo_scores_mask.pkl    # Serialized SHAP explanations
│   ├── loo_scores_remove.csv  # Leave-one-out scores (remove policy)
│   ├── loo_scores_remove.pkl  # Serialized explanations
│   └── logs/<timestamp>.log   # Interpret logs
└── mlruns/                     # MLflow tracking data

Each mode writes logs/<timestamp>.log plus the dataset cache (<mode>_dataset.pkl) and any ranking/report files so reproducing a run only needs the appropriate slice of the tree.

Artifact contents (what to expect)

  • Checkpoints: Saved under ${outputpath}/pre-train and ${outputpath}/fine-tune (plugin-specific filenames, e.g., model.safetensors). Reuse them by pointing inference.pretrainedmodel (for predict/interpret) or model_load_path (for continued training).
  • test_predictions.csv: Typically includes sample identifiers plus plugin-specific scores/probabilities; labels may appear if available. Schemas can differ by plugin—consult the plugin README for exact columns.
  • loo_scores_<handletokens>.csv / .pkl: Per-position leave-one-out scores; includes sequence IDs, positions, tokens, and plugin-specific score deltas. The <handletokens> suffix reflects the occlusion strategy (mask/remove).
  • MLflow run folders: Contain params, metrics, and artifacts (including checkpoints and logs); the MLflow UI can browse and download these directly.
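
A quick way to sanity-check a finished run against the tree above (paths follow the quickstart; exact CSV columns are plugin-specific):

ls /tmp/biolm_quickstart/fine-tune/final_model/
head -n 3 /tmp/biolm_quickstart/fine-tune/test_predictions.csv
head -n 3 /tmp/biolm_quickstart/interpret/loo_scores_mask.csv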

📈 MLflow Tracking

BioLM integrates with MLflow for experiment tracking. To enable MLflow:

  1. Set settings.mlflow.enabled: true in the configuration.

  2. Access the MLflow UI:

    poetry run mlflow ui --backend-store-uri output/mlruns
  3. Download artifacts (e.g., models, logs) directly from the UI.

Tracking is scoped to each run’s outputpath (default ${outputpath}/mlruns) rather than a global store; set mlflow.tracking_uri if you want a shared backend.
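
For example, to browse the quickstart run's tracking data on a custom port (path follows the quickstart above):

poetry run mlflow ui \
  --backend-store-uri /tmp/biolm_quickstart/mlruns \
  --port 5000
# then open http://127.0.0.1:5000 in a browser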


📜 Plugin Contract (for plugin authors)

See docs/PLUGIN_CONTRACT.md for the required entry point, factory return shape, and dataset/model/tokenizer expectations.

🧪 Testing

Run tests with:

poetry run pytest tests/

For specific suites:

poetry run pytest tests/integration/      # Plugin system tests
poetry run pytest tests/test_*.py         # Unit tests

With coverage:

poetry run pytest --cov=biolm --cov-report=html

🤝 Contributing

  1. Fork the repository
  2. Create feature branch: git checkout -b feature/amazing-feature
  3. Commit changes: git commit -m 'Add amazing feature'
  4. Push: git push origin feature/amazing-feature
  5. Open Pull Request

Plugin development: see the Plugin Contract section above (docs/PLUGIN_CONTRACT.md).


📝 Citation

@software{biolm2024,
  title = {BioLM 2.0: A Modular Framework for Biological Language Models},
  author = {Philipp Wiesenbach},
  year = {2024},
  url = {https://github.com/dieterich-lab/biolm_utils}
}
