Bolmo is the first fully open byte-level language model to achieve performance on par with state-of-the-art subword-level language models. Unlike traditional language models that rely on subword tokenizers (such as BPE or WordPiece), Bolmo operates directly on raw UTF-8 bytes (illustrated below), making it:
- Free of subword tokenization: No need for language-specific tokenizers or vocabulary management.
- Universally applicable: Works seamlessly across all languages, scripts, and domains.
- Fully open: Complete training code, model weights, data processing pipeline, and paper.
- Competitive performance: Comes close to matching (and in some cases exceeding) state-of-the-art subword-based models across a wide range of tasks.
- Better character understanding: Superior performance on tasks requiring character-level knowledge.
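For instance, operating on raw UTF-8 bytes means every language and script reduces to the same 256-value alphabet. A plain-Python illustration (independent of Bolmo's actual byte tokenizer):

```python
# Any text, in any script, maps to a sequence of UTF-8 byte values (0-255),
# so no language-specific vocabulary is needed.
text = "héllo, 世界"
byte_ids = list(text.encode("utf-8"))
print(byte_ids[:8])              # [104, 195, 169, 108, 108, 111, 44, 32]
print(len(text), len(byte_ids))  # 9 characters -> 14 bytes
```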
See our technical report for details: https://allenai.org/papers/bolmo.
This repository is a fork of OLMo-core that implements the complete Bolmo architecture and training pipeline through byteifying, our approach to converting existing subword models into byte-level models using <1% of the pretraining budget.
We release Bolmo models in two sizes:
| Model | Parameters | Base Model | HuggingFace |
|---|---|---|---|
| Bolmo-7B | 7.6B | Olmo 3 7B | allenai/Bolmo-7B |
| Bolmo-1B | 1.5B | OLMo 2 1B | allenai/Bolmo-1B |
Training data is available via HuggingFace at allenai/bolmo_mix.
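One way to inspect the mix without downloading it in full is to stream it with the datasets library (a sketch; the "train" split name is an assumption, so adjust to the subsets actually published):

```python
from datasets import load_dataset

# Stream the Bolmo training mix from the Hugging Face Hub.
ds = load_dataset("allenai/bolmo_mix", split="train", streaming=True)
print(next(iter(ds)))  # first example as a dict
```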
First, install PyTorch according to the instructions for your operating system and hardware. Then install this repository from source:

```bash
git clone https://github.com/allenai/bolmo-core.git
cd bolmo-core
pip install -e ".[all]"
```

For full functionality, you may also need:
- flash-attn for efficient attention
- TransformerEngine for optimized training
- xlstm for xLSTM components (mLSTM layers used in the Bolmo local models)
- Liger-Kernel for low-memory loss implementations
See the OLMo-core documentation for complete installation details.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
bolmo = AutoModelForCausalLM.from_pretrained("allenai/Bolmo-7B", trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained("allenai/Bolmo-7B", trust_remote_code=True)
message = ["Language modeling is "]
input_ids = tokenizer(message, return_tensors="pt")["input_ids"].to(device)
# `max_new_tokens` is the number of bytes to generate
response = bolmo.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.1)
print(tokenizer.decode(response[0], skip_special_tokens=True))
```
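Because the model is byte-level, sequence lengths track UTF-8 byte counts rather than word counts. The snippet below sketches the expected relationship (exact special-token handling is an assumption about the released tokenizer):

```python
# Expectation for a byte-level tokenizer: roughly one ID per UTF-8 byte,
# plus any special tokens (assumption; check the released tokenizer config).
text = "Language modeling is "
ids = tokenizer(text)["input_ids"]
print(len(text.encode("utf-8")), len(ids))  # byte count vs. ID count
print(tokenizer.decode(ids, skip_special_tokens=True) == text)  # lossless round-trip
```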
This codebase uses the olmo-core checkpoint format. Bolmo models can be converted from this format to the HuggingFace format via:

```bash
# -s sets the maximum sequence length
python3 src/examples/huggingface/convert_checkpoint_to_hf.py \
-i /path/to/bolmo/checkpoint \
-o /path/to/bolmo/checkpoint/in/hf/format \
-s 65536 \
--dtype float32 \
--skip-validation
```

Converting from HF format back to olmo-core is not implemented at the moment. However, we provide the original olmo-core checkpoints for Bolmo 1B and Bolmo 7B in the olmo_core/ subdirectory on HF: 1B, 7B.
Bolmo training uses a two-stage byteifying procedure to convert existing subword models into byte-level models:
1. Stage 1: Quickly learn weights for the local models while freezing the global model (9.8B tokens ≈ 43B bytes); see the sketch after this list. Training scripts for this stage are available at bolmo_scripts/launch_stage1_*.
2. Stage 2: Train the entire model to utilize byte-level information (39.3B tokens ≈ 173B bytes). Training scripts for this stage are available at bolmo_scripts/launch_stage2_*.
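For intuition, stage 1 is a standard parameter-freezing setup. The snippet below is a hypothetical sketch (the `local_encoder`/`local_decoder` parameter-name prefixes are assumptions for illustration, not the actual names used in bolmo_scripts):

```python
# Hypothetical stage-1 setup: train only the byte-level local models while
# keeping the pretrained global transformer frozen.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("local_encoder", "local_decoder"))

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```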
Existing post-trained checkpoints can be byteified without additional training using Task Arithmetic:
```bash
python3 src/examples/bolmo/instructify.py \
--output=/path/to/output/ \
--checkpoint-dir=/path/to/bolmo/checkpoint \
--base-checkpoint-dir=/path/to/base-olmo/checkpoint \
--instruct-checkpoint-dir=/path/to/post-trained-olmo/checkpoint \
--alpha=1.0
```
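For reference, task arithmetic treats post-training as a weight delta that can be added to another model. The sketch below illustrates the update we would expect from this procedure (an illustration with simplified state-dict handling, not the actual instructify.py code):

```python
# Task arithmetic (sketch): add the post-training delta of the subword model
# to the byteified weights, scaled by alpha (the --alpha flag above).
def instructify(bolmo_sd, base_sd, instruct_sd, alpha=1.0):
    out = {}
    for name, weight in bolmo_sd.items():
        if name in base_sd and name in instruct_sd:
            out[name] = weight + alpha * (instruct_sd[name] - base_sd[name])
        else:
            # Byte-level-specific parameters (e.g. the local models) have no
            # subword counterpart, so they are carried over unchanged.
            out[name] = weight
    return out
```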
Bolmo 7B matches or exceeds the performance of state-of-the-art byte-level models and comes close to the source Olmo 3 7B model:

| Category | Bolmo 7B | Olmo 3 7B | BLT 7B |
|---|---|---|---|
| Character Understanding (CUTE) | 78.6 | 56.9 | 52.3 |
| Multilingual Char (EXECUTE) | 71.6 | 55.1 | 46.3 |
| Code | 41.0 | 40.1 | 31.6 |
| Math | 48.9 | 55.3 | 15.7 |
| MC STEM | 65.5 | 66.3 | 49.0 |
| MC Non-STEM | 75.8 | 77.7 | 56.6 |
| GenQA | 70.9 | 72.4 | 68.4 |
Full evaluation results are available in the paper.
A citation for Bolmo is forthcoming!
For the underlying OLMo-core framework:
```bibtex
@misc{olmo20242olmo2furious,
title={{2 OLMo 2 Furious}},
author={{Team OLMo} and Pete Walsh and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Shane Arora and Akshita Bhagia and Yuling Gu and Shengyi Huang and Matt Jordan and Nathan Lambert and Dustin Schwenk and Oyvind Tafjord and Taira Anderson and David Atkinson and Faeze Brahman and Christopher Clark and Pradeep Dasigi and Nouha Dziri and Michal Guerquin and Hamish Ivison and Pang Wei Koh and Jiacheng Liu and Saumya Malik and William Merrill and Lester James V. Miranda and Jacob Morrison and Tyler Murray and Crystal Nam and Valentina Pyatkin and Aman Rangapur and Michael Schmitz and Sam Skjonsberg and David Wadden and Christopher Wilhelm and Michael Wilson and Luke Zettlemoyer and Ali Farhadi and Noah A. Smith and Hannaneh Hajishirzi},
year={2024},
eprint={2501.00656},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.00656},
}
```

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.