Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
A PHP implementation of OpenAI's BPE tokenizer tiktoken.
Teaching transformer-based architectures
High-Performance Tokenizer implementation in PHP.
Byte-Pair Encoding tokenizer for training large language models on huge datasets
BPE tokenizer for LLMs in Pure Zig
(1) Train large language models to help people with automatic essay scoring. (2) Extract essay features and train a new tokenizer to build tree models for score prediction.
🐍 This is a fast, lightweight, and clean CPython extension for the Byte Pair Encoding (BPE) algorithm, which is commonly used in LLM tokenization and NLP tasks.
An implementation of Byte-Pair Encoding (BPE) for subword tokenization, written entirely in C++. The tokenizer learns merges from raw text and supports UTF-8 encoding/decoding.
R-BPE: Improving BPE-Tokenizers with Token Reuse
A parallel, minimal implementation of Byte Pair Encoding (BPE) from scratch in under 200 lines of Python.
[Rust] Unofficial implementation of "SuperBPE: Space Travel for Language Models" in Rust
Multi-language BPE tokenizer implementation for Qwen3 models. Lightweight byte-pair encoding for C#/.NET
Transformer models for humorous text generation, fine-tuned on a Russian jokes dataset with ALiBi, RoPE, GQA, and SwiGLU, plus a custom byte-level BPE tokenizer.
Build a lightweight Llama from scratch, based on the Stanford CS336 (2025) course.
Byte-Pair Encoding tokenizer built from scratch in Python. The same algorithm used by GPT-2.
LLM learning, step by step.
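
For orientation, the sketch below shows the core idea shared by the BPE implementations listed above: count adjacent symbol pairs in the corpus, merge the most frequent pair into a new symbol, and repeat. It is a minimal character-level illustration with made-up function names (`train_bpe`, `get_pair_counts`, `merge_pair`) and a toy corpus; it is not the implementation used by any listed project, and byte-level variants start from raw bytes rather than characters.

```python
# Minimal character-level BPE training sketch (illustrative only).
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus (symbols tuple -> frequency)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from whitespace-split text."""
    word_freqs = Counter(corpus.split())
    # Start from individual characters; byte-level BPE would start from bytes.
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges

if __name__ == "__main__":
    # Toy corpus; real tokenizers train on far larger data.
    print(train_bpe("low lower lowest low low newer newest", 5))
```

Encoding a new word then amounts to replaying the learned merge rules in order, which is the part the faster Rust, C++, and Zig implementations above optimize.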