#tokenize #bpe #nlp

bin+lib unitoken

Fast BPE tokenizer/trainer with a Rust core and Python bindings

2 releases

Uses the Rust 2024 edition

0.1.1 Dec 18, 2025
0.1.0 Dec 17, 2025

#2837 in Text processing

MIT license

155KB
3.5K SLoC

unitoken

unitoken is a fast byte-pair encoding (BPE) tokenizer and trainer with a Rust core and optional Python bindings.

Install

Rust:

cargo add unitoken

Python (wheels via PyPI; the package is installed as uni-tokenizer and imported as uni_tokenizer):

pip install uni-tokenizer

Quickstart (Python)

from uni_tokenizer import BpeTrainer, BpeEncoder

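# Create a trainer with its special tokens.
trainer = BpeTrainer(["<|endoftext|>"])  # first token is treated as EOT
# Add training data as a word -> count mapping.
trainer.add_words({"hello": 10, "world": 7})
# Learn merges up to the requested vocabulary size.
trainer.train(vocab_size=256)
# Save the trained model under the name "demo".
trainer.save("demo")

# Load the saved model and encode a single word into token ids.
enc = BpeEncoder.load("demo")
ids = enc.encode_word("hello")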
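
For readers new to BPE, the sketch below shows in plain Python what training on a word-count mapping and then encoding a word can look like. It is a toy illustration and not unitoken's implementation: the functions toy_train_bpe and toy_encode_word are hypothetical names invented for this example, the tie-breaking rule is an assumption, and the sketch works on characters and returns symbol strings rather than integer ids.

```python
from collections import Counter


def toy_train_bpe(word_counts, num_merges):
    """Toy BPE trainer (hypothetical helper): returns the learned merges in order."""
    # Start with every word split into single-character symbols.
    words = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, count in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically
        # (an assumption made here so the result is deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_words = {}
        for symbols, count in words.items():
            merged = tuple(_apply_merge(list(symbols), best))
            new_words[merged] = new_words.get(merged, 0) + count
        words = new_words
    return merges


def _apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` with the concatenated symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def toy_encode_word(word, merges):
    """Toy encoder (hypothetical helper): apply the merges in the order learned."""
    symbols = list(word)
    for pair in merges:
        symbols = _apply_merge(symbols, pair)
    return symbols


merges = toy_train_bpe({"hello": 10, "world": 7}, num_merges=3)
print(merges)                            # [('l', 'o'), ('l', 'lo'), ('h', 'e')]
print(toy_encode_word("hello", merges))  # ['he', 'llo']
```

Because "hello" has the higher count, all three merges come from it, and "world" stays split into single characters. A production tokenizer would additionally map each symbol to an integer id and handle special tokens such as the EOT token.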

Building from source

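This project uses maturin to build the Python extension module. With a virtual environment active, the following command compiles the Rust crate and installs the resulting package into that environment: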

maturin develop

Dependencies

~16–37MB
~491K SLoC