Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
A PHP implementation of OpenAI's BPE tokenizer tiktoken.
Teaching transformer-based architectures
High-Performance Tokenizer implementation in PHP.
Byte-Pair Encoding tokenizer for training large language models on huge datasets
BPE tokenizer for LLMs in Pure Zig
(1) Train large language models to help people with automatic essay scoring. (2) Extract essay features and train a new tokenizer to build tree models for score prediction.
🐍 This is a fast, lightweight, and clean CPython extension for the Byte Pair Encoding (BPE) algorithm, which is commonly used in LLM tokenization and NLP tasks.
An implementation of Byte-Pair Encoding (BPE) for subword tokenization, written entirely in C++. The tokenizer learns merges from raw text and supports UTF-8 encoding/decoding.
R-BPE: Improving BPE-Tokenizers with Token Reuse
A parallel, minimal implementation of Byte Pair Encoding (BPE) from scratch in under 200 lines of Python.
[Rust] Unofficial implementation of "SuperBPE: Space Travel for Language Models" in Rust
Multi-language BPE tokenizer implementation for Qwen3 models. Lightweight byte-pair encoding for C#/.NET
Transformer models for humorous text generation, fine-tuned on a Russian jokes dataset with ALiBi, RoPE, GQA, and SwiGLU, plus a custom byte-level BPE tokenizer.
Build a lightweight Llama from scratch, based on the Stanford CS336 (2025) course.
Byte-Pair Encoding tokenizer built from scratch in Python. The same algorithm used by GPT-2.
LLM learning, step by step.
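
For orientation, the sketch below shows the core idea shared by the BPE implementations listed above: count adjacent symbol pairs in the corpus, merge the most frequent pair into a new symbol, and repeat. It is a minimal character-level illustration with made-up function names (`train_bpe`, `get_pair_counts`, `merge_pair`) and a toy corpus; it is not the implementation used by any listed project, and byte-level variants start from raw bytes rather than characters.

```python
# Minimal character-level BPE training sketch (illustrative only).
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus (symbols tuple -> frequency)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from whitespace-split text."""
    word_freqs = Counter(corpus.split())
    # Start from individual characters; byte-level BPE would start from bytes.
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges

if __name__ == "__main__":
    # Toy corpus; real tokenizers train on far larger data.
    print(train_bpe("low lower lowest low low newer newest", 5))
```

Encoding a new word then amounts to replaying the learned merge rules in order, which is the part the faster Rust, C++, and Zig implementations above optimize.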