Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
Fast bare-bones BPE for modern tokenizer training
Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.
Build an LLM from scratch
Subword Encoding in Lattice LSTM for Chinese Word Segmentation
Simple-to-use scoring function for arbitrarily tokenized texts.
A ridiculously fast Python BPE (Byte Pair Encoder) implementation written in Rust
Learning BPE embeddings by first learning a segmentation model and then training word2vec
Byte-Pair Encoding (BPE, subword-based tokenization) algorithm implementations from scratch in Python (see the sketch after this list)
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization [arXiv 2025]
Code for the paper "BPE stays on SCRIPT"
Subword-augmented Embedding for Cloze Reading Comprehension (COLING 2018)
This repository provides a clear, educational implementation of Byte Pair Encoding (BPE) tokenization in plain Python. The focus is on algorithmic understanding, not raw performance.
Byte-Pair Encoding tokenizer for training large language models on huge datasets
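Several of the entries above (the from-scratch Python and educational implementations in particular) center on the same core BPE training loop: count adjacent symbol pairs across the corpus, merge the most frequent pair into a new symbol, and repeat. The following is a minimal illustrative sketch of that loop, not the code of any repository listed here; the function names (`learn_bpe`, `merge_pair`) and the toy corpus are assumptions for illustration only.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def learn_bpe(corpus, num_merges=10):
    """Learn BPE merge rules from a whitespace-tokenized corpus (toy sketch)."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges

if __name__ == "__main__":
    # Hypothetical toy corpus; real trainers operate on large datasets and byte-level vocabularies.
    print(learn_bpe("low lower lowest newer newest", num_merges=5))
```

Production tokenizers listed above add byte-level fallback, frequency thresholds, and far faster pair-counting data structures, but the merge rule they learn is the same kind of object this sketch produces.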