Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
-
Updated
Aug 7, 2024 - Python
Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
Fast bare-bones BPE for modern tokenizer training
Full-stack open-source AI engine for building language models — tokenizer training, transformer architecture, cognitive reasoning and chat pipeline.
Build LLM from scratch
Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.
Subword Encoding in Lattice LSTM for Chinese Word Segmentation
Simple-to-use scoring function for arbitrarily tokenized texts.
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
Geometric Byte Pair Encoding of Protein Structure (ICLR 2026)
Learning BPE embeddings by first learning a segmentation model and then training word2vec
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization [arXiv 2025]
Byte-Pair Encoding (BPE) (subword-based tokenization) algorithm implementaions from scratch with python
Code for the paper "BPE stays on SCRIPT"
Subword-augmented Embedding for Cloze Reading Comprehension (COLING 2018)
This repository provides a clear, educational implementation of Byte Pair Encoding (BPE) tokenization in plain Python. The focus is on algorithmic understanding, not raw performance.
Byte-Pair Encoding tokenizer for training large language models on huge datasets
Add a description, image, and links to the bpe topic page so that developers can more easily learn about it.
To associate your repository with the bpe topic, visit your repo's landing page and select "manage topics."