A pure Python port of the high-performance Byte Pair Encoding (BPE) algorithm detailed in this blog post. The original source code can be found here (bpe algorithm) and here (openai tokenizer).
This code is based entirely on the logic from the bpe and bpe-openai crates found here.
While the original Rust implementation is optimized for production use, this Python port is designed for:
- Education: Understanding how GPT-4 tokenization works under the hood (Regex split + BPE merge).
- Portability: Running in environments where compiling Rust extensions or installing binary wheels is difficult.
- Minimal dependencies: Depends only on regex (for PCRE-style patterns) and the standard library.
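The "Regex split + BPE merge" pipeline mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the split pattern is a simplified ASCII stand-in for the real cl100k_base pattern, the vocabulary `TOY_RANKS` is hypothetical, and the merge loop is the naive quadratic version rather than the Trie and dynamic-programming approach this port implements.

```python
import re

# Simplified stand-in for the GPT-4 split pattern (assumption): letter runs,
# digit runs, and symbol runs, each optionally preceded by one space.
SPLIT_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+")

# Hypothetical toy vocabulary: token bytes -> rank (lower rank merges first).
# Every single byte gets a rank (0..255) so any input can be encoded.
TOY_RANKS = {bytes([i]): i for i in range(256)}
TOY_MERGES = [b"ll", b"He", b"Hell", b"Hello", b" w", b"or", b" wor", b"ld", b" world"]
for rank, token in enumerate(TOY_MERGES, start=256):
    TOY_RANKS[token] = rank


def bpe_merge(piece, ranks):
    """Repeatedly merge the adjacent pair whose concatenation has the lowest rank."""
    parts = [bytes([b]) for b in piece]
    while len(parts) > 1:
        best_index, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_index is None:
            break  # no adjacent pair is in the vocabulary
        merged = parts[best_index] + parts[best_index + 1]
        parts[best_index:best_index + 2] = [merged]
    return [ranks[p] for p in parts]


def encode(text, ranks=TOY_RANKS):
    """Split the text with the regex, then BPE-merge each piece independently."""
    ids = []
    for piece in SPLIT_PATTERN.findall(text):
        ids.extend(bpe_merge(piece.encode("utf-8"), ranks))
    return ids


def decode(ids, ranks=TOY_RANKS):
    """Map IDs back to their byte strings and join them."""
    by_rank = {rank: token for token, rank in ranks.items()}
    return b"".join(by_rank[i] for i in ids).decode("utf-8")


print(encode("Hello world"))          # [259, 264]
print(decode(encode("Hello world")))  # Hello world
```

Because merges never cross the boundaries produced by the regex split, each piece can be encoded on its own, which is what keeps the per-piece work small.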
- byte_pair_encoding.py: The core BPE logic. Implements the Trie structure, the valid merge check (Dynamic Programming), and the .tiktoken file loader.
- tokenizer.py: The wrapper that handles OpenAI's specific Regex splitting patterns (cl100k_base, o200k_base) and normalization.
- download_vocab.py: A helper to fetch the public vocabulary files.
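For orientation, a .tiktoken vocabulary file is plain text with one entry per line: the token's bytes encoded as base64, a space, and the integer rank. A minimal loader sketch might look like the following; the function name `parse_tiktoken` is hypothetical and not necessarily how byte_pair_encoding.py structures its loader.

```python
import base64


def parse_tiktoken(data):
    """Parse .tiktoken content (bytes) into a dict of token bytes -> rank."""
    ranks = {}
    for line in data.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


# Build a tiny in-memory sample in the same "base64 rank" layout.
sample_tokens = [b"!", b'"', b"Hello"]
sample = b"\n".join(
    base64.b64encode(token) + b" " + str(rank).encode("ascii")
    for rank, token in enumerate(sample_tokens)
)

ranks = parse_tiktoken(sample)
print(ranks)  # {b'!': 0, b'"': 1, b'Hello': 2}
```

To parse a downloaded file, the same function could be fed the result of reading the file in binary mode, for example `parse_tiktoken(open("cl100k_base.tiktoken", "rb").read())`.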
You only need the regex library, because Python's standard re module does not support the Unicode property classes (such as \p{L} and \p{N}) used in the GPT-4 split patterns.
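A quick way to see the limitation, using only the standard library: the helper below (a hypothetical name, for illustration only) reports whether the built-in re module can compile a given pattern.

```python
import re


def stdlib_re_supports(pattern):
    """Return True if the standard re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(stdlib_re_supports(r"\p{L}+"))     # False: re rejects the \p escape
print(stdlib_re_supports(r"[A-Za-z]+"))  # True: plain character classes are fine
```

The third-party regex package accepts `\p{L}` and similar Unicode property classes, which is why it is the one external requirement.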
pip install regex

First, download the official vocabulary file from OpenAI.
# Run this once or use the provided download script
import urllib.request
urllib.request.urlretrieve(
    "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken",
    "cl100k_base.tiktoken",
)

from tokenizer import get_cl100k_base
# Initialize the tokenizer
# Note: This takes ~1-2 seconds to load the 100k token vocabulary.
enc = get_cl100k_base("cl100k_base.tiktoken")
text = "Hello, world! This is a pure Python BPE."
# Encode to integers
tokens = enc.encode(text)
print(f"IDs: {tokens}")
# Output: [9906, 11, 1917, 0, 1115, 374, 264, 10748, 13325, 426, 1777, 13]
# Decode back to string
decoded = enc.decode(tokens)
print(f"Decoded: {decoded}")

If you download the o200k_base.tiktoken file, you can use the newer tokenizer:
from tokenizer import get_o200k_base
enc = get_o200k_base("o200k_base.tiktoken")