2 releases. Uses the Rust 2024 edition.

| Version | Released |
|---|---|
| 0.1.1 | Dec 18, 2025 |
| 0.1.0 | Dec 17, 2025 |
unitoken
unitoken is a fast BPE tokenizer/trainer with a Rust core and optional Python bindings.
Install
Rust:
```sh
cargo add unitoken
```
Python (wheels via PyPI):
```sh
pip install uni-tokenizer
```
Quickstart (Python)
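```python
from uni_tokenizer import BpeTrainer, BpeEncoder

# Special tokens; the first one is treated as the end-of-text (EOT) token.
trainer = BpeTrainer(["<|endoftext|>"])

# Feed word frequencies, then train with a target vocabulary size of 256.
trainer.add_words({"hello": 10, "world": 7})
trainer.train(vocab_size=256)
trainer.save("demo")

# Load the saved model and encode a single word into token IDs.
enc = BpeEncoder.load("demo")
ids = enc.encode_word("hello")
```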
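For readers new to BPE, the following is a minimal pure-Python sketch of what training on word counts does conceptually: count adjacent symbol pairs weighted by word frequency, merge the most frequent pair, and repeat. It is not unitoken's implementation. The function name `train_bpe`, the character-level starting alphabet, and the lexicographic tie-breaking rule are all assumptions made for illustration; production tokenizers typically start from bytes and may break ties differently.

```python
from collections import Counter


def train_bpe(word_counts, num_merges):
    """Toy BPE trainer (hypothetical helper): returns learned merges in order."""
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken lexicographically (an assumption).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Rewrite each word, replacing occurrences of the best pair with one symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges


merges = train_bpe({"hello": 10, "world": 7}, num_merges=3)
print(merges)
```

With the same word counts as the quickstart, the three merges all come from "hello", because its pairs carry the higher weight of 10.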
Building from source
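The Python extension module is built with maturin. Running the following command from the project root builds the extension and installs it into the active virtual environment:

```sh
maturin develop
```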
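After a local build, a quick way to confirm that the extension landed in the current environment is to check whether the module can be located. The sketch below uses only the standard library; the helper name `extension_available` is hypothetical, and the module name `uni_tokenizer` is taken from the quickstart import.

```python
import importlib.util


def extension_available(module_name="uni_tokenizer"):
    """Return True if the named module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("uni_tokenizer importable:", extension_available())
```

If this prints `False`, the build was most likely installed into a different virtual environment than the one running the check.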