Bolmo is the first fully open byte-level language model to achieve performance on par with state-of-the-art subword-level language models. Unlike traditional language models that rely on subword tokenizers (such as BPE or WordPiece), Bolmo operates directly on raw UTF-8 bytes (illustrated below), making it:
- Free of subword tokenization: No need for language-specific tokenizers or vocabulary management.
- Universally applicable: Works seamlessly across all languages, scripts, and domains.
- Fully open: Complete training code, model weights, data processing pipeline, and paper.
- Competitive performance: Comes close to matching (and in some cases exceeding) state-of-the-art subword-based models across a wide range of tasks.
- Better character understanding: Superior performance on tasks requiring character-level knowledge.
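For instance, operating on raw UTF-8 bytes means every language and script reduces to the same 256-value alphabet. A plain-Python illustration (independent of Bolmo's actual byte tokenizer):

```python
# Any text, in any script, maps to a sequence of UTF-8 byte values (0-255),
# so no language-specific vocabulary is needed.
text = "héllo, 世界"
byte_ids = list(text.encode("utf-8"))
print(byte_ids[:8])              # [104, 195, 169, 108, 108, 111, 44, 32]
print(len(text), len(byte_ids))  # 9 characters -> 14 bytes
```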
See our technical report for details: https://allenai.org/papers/bolmo.
This repository is a fork of OLMo-core that implements the complete Bolmo architecture and training pipeline through byteifying, our approach to converting existing subword models into byte-level models using <1% of the pretraining budget.
We release Bolmo models in two sizes:
| Model | Parameters | Base Model | HuggingFace |
|---|---|---|---|
| Bolmo-7B | 7.6B | Olmo 3 7B | allenai/Bolmo-7B |
| Bolmo-1B | 1.5B | OLMo 2 1B | allenai/Bolmo-1B |
Training data is available via HuggingFace at allenai/bolmo_mix.
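One way to inspect the mix without downloading it in full is to stream it with the datasets library (a sketch; the "train" split name is an assumption, so adjust to the subsets actually published):

```python
from datasets import load_dataset

# Stream the Bolmo training mix from the Hugging Face Hub.
ds = load_dataset("allenai/bolmo_mix", split="train", streaming=True)
print(next(iter(ds)))  # first example as a dict
```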
First, install PyTorch according to the instructions for your operating system and hardware. Then install this repository from source:

```bash
git clone https://github.com/allenai/bolmo-core.git
cd bolmo-core
pip install -e ".[all]"
```

For full functionality, you may also need:
- flash-attn for efficient attention
- TransformerEngine for optimized training
- xlstm for xLSTM components (mLSTM layers used in the Bolmo local models)
- Liger-Kernel for low-memory loss implementations
See the OLMo-core documentation for complete installation details.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
bolmo = AutoModelForCausalLM.from_pretrained("allenai/Bolmo-7B", trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained("allenai/Bolmo-7B", trust_remote_code=True)
message = ["Language modeling is "]
input_ids = tokenizer(message, return_tensors="pt")["input_ids"].to(device)
# `max_new_tokens` is the number of bytes to generate
response = bolmo.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.1)
print(tokenizer.decode(response[0], skip_special_tokens=True))
```
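Because the model is byte-level, sequence lengths track UTF-8 byte counts rather than word counts. The snippet below sketches the expected relationship (exact special-token handling is an assumption about the released tokenizer):

```python
# Expectation for a byte-level tokenizer: roughly one ID per UTF-8 byte,
# plus any special tokens (assumption; check the released tokenizer config).
text = "Language modeling is "
ids = tokenizer(text)["input_ids"]
print(len(text.encode("utf-8")), len(ids))  # byte count vs. ID count
print(tokenizer.decode(ids, skip_special_tokens=True) == text)  # lossless round-trip
```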
This codebase uses the olmo-core checkpoint format. Bolmo models can be converted from this format to the HuggingFace format via:

```bash
# -s sets the maximum sequence length
python3 src/examples/huggingface/convert_checkpoint_to_hf.py \
-i /path/to/bolmo/checkpoint \
-o /path/to/bolmo/checkpoint/in/hf/format \
-s 65536 \
--dtype float32 \
--skip-validation
```

Converting from HF format back to olmo-core is not implemented at the moment. However, we provide the original olmo-core checkpoints for Bolmo 1B and Bolmo 7B in the olmo_core/ subdirectory on HF: 1B, 7B.
Bolmo training uses a two-stage byteifying procedure to convert existing subword models into byte-level models:
1. Stage 1: Quickly learn weights for the local models while freezing the global model (9.8B tokens ≈ 43B bytes); see the sketch after this list. Training scripts for this stage are available at bolmo_scripts/launch_stage1_*.
2. Stage 2: Train the entire model to utilize byte-level information (39.3B tokens ≈ 173B bytes). Training scripts for this stage are available at bolmo_scripts/launch_stage2_*.
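For intuition, stage 1 is a standard parameter-freezing setup. The snippet below is a hypothetical sketch (the `local_encoder`/`local_decoder` parameter-name prefixes are assumptions for illustration, not the actual names used in bolmo_scripts):

```python
# Hypothetical stage-1 setup: train only the byte-level local models while
# keeping the pretrained global transformer frozen.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("local_encoder", "local_decoder"))

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```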
Existing post-trained checkpoints can be byteified without additional training using Task Arithmetic:
```bash
python3 src/examples/bolmo/instructify.py \
--output=/path/to/output/ \
--checkpoint-dir=/path/to/bolmo/checkpoint \
--base-checkpoint-dir=/path/to/base-olmo/checkpoint \
--instruct-checkpoint-dir=/path/to/post-trained-olmo/checkpoint \
--alpha=1.0
```
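For reference, task arithmetic treats post-training as a weight delta that can be added to another model. The sketch below illustrates the update we would expect from this procedure (an illustration with simplified state-dict handling, not the actual instructify.py code):

```python
# Task arithmetic (sketch): add the post-training delta of the subword model
# to the byteified weights, scaled by alpha (the --alpha flag above).
def instructify(bolmo_sd, base_sd, instruct_sd, alpha=1.0):
    out = {}
    for name, weight in bolmo_sd.items():
        if name in base_sd and name in instruct_sd:
            out[name] = weight + alpha * (instruct_sd[name] - base_sd[name])
        else:
            # Byte-level-specific parameters (e.g. the local models) have no
            # subword counterpart, so they are carried over unchanged.
            out[name] = weight
    return out
```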
Bolmo 7B matches or exceeds the performance of state-of-the-art byte-level models and comes close to the source Olmo 3 7B model:

| Category | Bolmo 7B | Olmo 3 7B | BLT 7B |
|---|---|---|---|
| Character Understanding (CUTE) | 78.6 | 56.9 | 52.3 |
| Multilingual Char (EXECUTE) | 71.6 | 55.1 | 46.3 |
| Code | 41.0 | 40.1 | 31.6 |
| Math | 48.9 | 55.3 | 15.7 |
| MC STEM | 65.5 | 66.3 | 49.0 |
| MC Non-STEM | 75.8 | 77.7 | 56.6 |
| GenQA | 70.9 | 72.4 | 68.4 |
Full evaluation results are available in the paper.
A citation for Bolmo is forthcoming!
For the underlying OLMo-core framework:
```bibtex
@misc{olmo20242olmo2furious,
title={{2 OLMo 2 Furious}},
author={{Team OLMo} and Pete Walsh and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Shane Arora and Akshita Bhagia and Yuling Gu and Shengyi Huang and Matt Jordan and Nathan Lambert and Dustin Schwenk and Oyvind Tafjord and Taira Anderson and David Atkinson and Faeze Brahman and Christopher Clark and Pradeep Dasigi and Nouha Dziri and Michal Guerquin and Hamish Ivison and Pang Wei Koh and Jiacheng Liu and Saumya Malik and William Merrill and Lester James V. Miranda and Jacob Morrison and Tyler Murray and Crystal Nam and Valentina Pyatkin and Aman Rangapur and Michael Schmitz and Sam Skjonsberg and David Wadden and Christopher Wilhelm and Michael Wilson and Luke Zettlemoyer and Ali Farhadi and Noah A. Smith and Hannaneh Hajishirzi},
year={2024},
eprint={2501.00656},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.00656},
}
```

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.