Bolmo

The first family of competitive fully open byte-level language models.

[Bolmo architecture diagram]

Bolmo is the first fully open byte-level language model to achieve performance on par with state-of-the-art subword-level language models. Unlike traditional language models that rely on subword tokenizers (like BPE or WordPiece), Bolmo operates directly on raw UTF-8 bytes, making it:

  • Free of subword tokenization: No need for language-specific tokenizers or vocabulary management.
  • Universally applicable: Works seamlessly across all languages, scripts, and domains.
  • Fully open: Complete training code, model weights, data processing pipeline, and paper.
  • Competitive performance: Comes close to matching, and in some cases exceeds, state-of-the-art subword-based models across a wide range of tasks.
  • Better character understanding: Superior performance on tasks requiring character-level knowledge.
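
As a quick illustration of what operating on raw UTF-8 bytes means in practice (a generic Python sketch, not Bolmo-specific code), any string in any script maps to a sequence of byte values without a learned vocabulary:

text = "Héllo, 世界! 🌍"

# Byte-level "tokenization": every string decomposes into UTF-8 bytes (values 0-255),
# so no language-specific vocabulary is needed.
byte_ids = list(text.encode("utf-8"))
print(byte_ids)       # e.g. [72, 195, 169, 108, 108, 111, 44, 32, ...]
print(len(byte_ids))  # more IDs than characters: non-ASCII scripts use multiple bytes per character

# The mapping is lossless: bytes round-trip back to the original string.
assert bytes(byte_ids).decode("utf-8") == text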

See our technical report for details: https://allenai.org/papers/bolmo.

This repository is a fork of OLMo-core that implements the complete Bolmo architecture and training pipeline via byteifying, our approach to converting existing subword models into byte-level models using less than 1% of the pretraining budget.

Models

We release Bolmo models in two sizes:

| Model    | Parameters | Base Model | HuggingFace      |
|----------|------------|------------|------------------|
| Bolmo-7B | 7.6B       | Olmo 3 7B  | allenai/Bolmo-7B |
| Bolmo-1B | 1.5B       | OLMo 2 1B  | allenai/Bolmo-1B |

Training data is available via HuggingFace at allenai/bolmo_mix.
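
To peek at the training mix without downloading it in full, something like the following sketch should work, assuming a standard 🤗 Datasets layout (the split name is an assumption; check the dataset card for the actual configs and splits):

from datasets import load_dataset

# Stream the Bolmo training mix instead of downloading it in full.
# The split name here is an assumption; see the dataset card at
# https://huggingface.co/datasets/allenai/bolmo_mix for what is actually available.
ds = load_dataset("allenai/bolmo_mix", split="train", streaming=True)

example = next(iter(ds))
print(example.keys())  # inspect the fields of one record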

Installation

First install PyTorch according to the instructions specific to your operating system and hardware.

From Source (Recommended for Development)

git clone https://github.com/allenai/bolmo-core.git
cd bolmo-core
pip install -e .[all]

Optional Dependencies

Some features require additional optional dependencies; see the OLMo-core documentation for complete installation details.

Quick Start

Inference with HuggingFace

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
bolmo = AutoModelForCausalLM.from_pretrained("allenai/Bolmo-7B", trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained("allenai/Bolmo-7B", trust_remote_code=True)

message = ["Language modeling is "]
input_ids = tokenizer(message, return_tensors="pt")["input_ids"].to(device)

# `max_new_tokens` is the number of bytes to generate
response = bolmo.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.1)
print(tokenizer.decode(response[0], skip_special_tokens=True))

HuggingFace checkpoints vs. olmo-core checkpoints

This codebase uses the olmo-core checkpoint format. Bolmo models can be converted from this format to the HuggingFace format via:

# -s sets the maximum sequence length
python3 src/examples/huggingface/convert_checkpoint_to_hf.py \
    -i /path/to/bolmo/checkpoint \
    -o /path/to/bolmo/checkpoint/in/hf/format \
    -s 65536 \
    --dtype float32 \
    --skip-validation

Converting from HF format back to olmo-core is not implemented at the moment. However, we provide the original olmo-core checkpoints for Bolmo 1B and Bolmo 7B in the olmo_core/ subdirectory on HF: 1B, 7B.
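
To fetch those olmo-core checkpoints locally, a sketch using huggingface_hub (assuming the olmo_core/ subdirectory layout described above):

from huggingface_hub import snapshot_download

# Download only the olmo-core-format checkpoint files from the HF model repo.
# The "olmo_core/*" pattern assumes the subdirectory layout described above.
local_dir = snapshot_download(
    repo_id="allenai/Bolmo-7B",      # or "allenai/Bolmo-1B"
    allow_patterns=["olmo_core/*"],
)
print(local_dir)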

Training

Bolmo training uses a two-stage byteifying procedure to convert existing subword models to byte-level:

Stage 1: Subword-to-Byte Distillation

Quickly learn weights for the local (byte-level) models while keeping the global model frozen (9.8B tokens ≈ 43B bytes). Training scripts for this stage are available at bolmo_scripts/launch_stage1_*.
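
In spirit, this stage trains only the new byte-level (local) modules while the inherited global transformer stays frozen. A purely illustrative PyTorch sketch with hypothetical module-name prefixes (the real setup is defined by the launch configs):

import torch.nn as nn

def freeze_global_train_local(model: nn.Module,
                              trainable_prefixes=("local_encoder.", "local_decoder.")) -> None:
    """Freeze all parameters except those of the byte-level (local) modules.

    The prefixes here are hypothetical placeholders; the real Stage 1 setup is
    defined by the configs in bolmo_scripts/launch_stage1_*.
    """
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)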

Stage 2: End-to-End Training

Train the entire model to utilize byte-level information (39.3B tokens ≈ 173B bytes). Training scripts for this stage are available at bolmo_scripts/launch_stage2_*.

Post-Training via Task Arithmetic

Existing post-trained checkpoints can be byteified without additional training using Task Arithmetic:

python3 src/examples/bolmo/instructify.py \
    --output=/path/to/output/ \
    --checkpoint-dir=/path/to/bolmo/checkpoint \
    --base-checkpoint-dir=/path/to/base-olmo/checkpoint \
    --instruct-checkpoint-dir=/path/to/post-trained-olmo/checkpoint \
    --alpha=1.0
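
The underlying idea is standard task arithmetic: add a scaled copy of the post-training weight delta (post-trained minus base) to the byteified checkpoint. A minimal sketch over plain state dicts, assuming shared parameter names (the actual script additionally handles the olmo-core checkpoint format):

def byteify_instruct(bolmo_sd, base_sd, instruct_sd, alpha=1.0):
    """Task arithmetic on state dicts: bolmo + alpha * (instruct - base).

    Assumes the three checkpoints share parameter names for the subword-derived
    weights; parameters that exist only in the byte-level model are copied as-is.
    """
    merged = {}
    for name, weight in bolmo_sd.items():
        if name in base_sd and name in instruct_sd:
            merged[name] = weight + alpha * (instruct_sd[name] - base_sd[name])
        else:
            merged[name] = weight.clone()
    return merged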

Performance

Bolmo 7B Results

Bolmo 7B matches or exceeds the performance of state-of-the-art byte-level models and comes close to the source Olmo 3 7B model:

| Category                        | Bolmo 7B | Olmo 3 7B | BLT 7B |
|---------------------------------|----------|-----------|--------|
| Character Understanding (CUTE)  | 78.6     | 56.9      | 52.3   |
| Multilingual Char (EXECUTE)     | 71.6     | 55.1      | 46.3   |
| Code                            | 41.0     | 40.1      | 31.6   |
| Math                            | 48.9     | 55.3      | 15.7   |
| MC STEM                         | 65.5     | 66.3      | 49.0   |
| MC Non-STEM                     | 75.8     | 77.7      | 56.6   |
| GenQA                           | 70.9     | 72.4      | 68.4   |

Full evaluation results are available in the paper.

Citation

Forthcoming!

For the underlying OLMo-core framework:

@misc{olmo20242olmo2furious,
  title={{2 OLMo 2 Furious}},
  author={{Team OLMo} and Pete Walsh and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Shane Arora and Akshita Bhagia and Yuling Gu and Shengyi Huang and Matt Jordan and Nathan Lambert and Dustin Schwenk and Oyvind Tafjord and Taira Anderson and David Atkinson and Faeze Brahman and Christopher Clark and Pradeep Dasigi and Nouha Dziri and Michal Guerquin and Hamish Ivison and Pang Wei Koh and Jiacheng Liu and Saumya Malik and William Merrill and Lester James V. Miranda and Jacob Morrison and Tyler Murray and Crystal Nam and Valentina Pyatkin and Aman Rangapur and Michael Schmitz and Sam Skjonsberg and David Wadden and Christopher Wilhelm and Michael Wilson and Luke Zettlemoyer and Ali Farhadi and Noah A. Smith and Hannaneh Hajishirzi},
  year={2024},
  eprint={2501.00656},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.00656},
}

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

