A multilingual morphological analysis library.
Updated Dec 12, 2025 - Rust
A grammar describes the syntax of a programming language and is often written in Backus-Naur form (BNF). A lexer performs lexical analysis, turning text into tokens. A parser takes those tokens and builds a data structure such as an abstract syntax tree (AST). The parser is concerned with context: does the sequence of tokens fit the grammar? A compiler front end combines a lexer and a parser built for a specific grammar; a full compiler then analyzes the resulting tree and generates code.
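The lexer-to-parser pipeline described above can be sketched in a few dozen lines of Rust. This is a minimal illustration under stated assumptions, not the approach of any project listed here: the grammar is a toy arithmetic language chosen for the example, and the names `Token`, `Ast`, `lex`, `Parser`, and `eval` are hypothetical.

```rust
// Toy grammar in BNF (an assumption made for this example):
//   <expr>   ::= <term>   | <expr> "+" <term> | <expr> "-" <term>
//   <term>   ::= <factor> | <term> "*" <factor>
//   <factor> ::= NUMBER   | "(" <expr> ")"
// The left-recursive rules are implemented as loops, which keeps
// "+", "-" and "*" left-associative.

#[derive(Debug, Clone, PartialEq)]
pub enum Token { Num(i64), Plus, Minus, Star, LParen, RParen }

#[derive(Debug, PartialEq)]
pub enum Ast { Num(i64), BinOp(Box<Ast>, char, Box<Ast>) }

/// Lexer: turns text into tokens and rejects unknown characters.
pub fn lex(src: &str) -> Result<Vec<Token>, String> {
    let chars: Vec<char> = src.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        match c {
            c if c.is_whitespace() => i += 1,
            '0'..='9' => {
                // Consume a maximal run of digits as one number token.
                let start = i;
                while i < chars.len() && chars[i].is_ascii_digit() { i += 1; }
                let text: String = chars[start..i].iter().collect();
                let n = text.parse::<i64>()
                    .map_err(|e| format!("bad number '{}': {}", text, e))?;
                tokens.push(Token::Num(n));
            }
            '+' | '-' | '*' | '(' | ')' => {
                tokens.push(match c {
                    '+' => Token::Plus,
                    '-' => Token::Minus,
                    '*' => Token::Star,
                    '(' => Token::LParen,
                    _ => Token::RParen,
                });
                i += 1;
            }
            other => return Err(format!("unexpected character '{}' at {}", other, i)),
        }
    }
    Ok(tokens)
}

/// Recursive-descent parser: checks the tokens against the grammar
/// and builds an AST.
pub struct Parser { tokens: Vec<Token>, pos: usize }

impl Parser {
    pub fn new(tokens: Vec<Token>) -> Self { Parser { tokens, pos: 0 } }

    fn peek(&self) -> Option<&Token> { self.tokens.get(self.pos) }

    fn advance(&mut self) -> Option<Token> {
        let t = self.tokens.get(self.pos).cloned();
        self.pos += 1;
        t
    }

    /// Parses a whole expression; leftover tokens are an error.
    pub fn parse(mut self) -> Result<Ast, String> {
        let ast = self.expr()?;
        if self.pos < self.tokens.len() {
            return Err(format!("unexpected token {:?}", self.tokens[self.pos]));
        }
        Ok(ast)
    }

    fn expr(&mut self) -> Result<Ast, String> {
        let mut lhs = self.term()?;
        loop {
            let op = match self.peek() {
                Some(Token::Plus) => '+',
                Some(Token::Minus) => '-',
                _ => break,
            };
            self.pos += 1;
            let rhs = self.term()?;
            lhs = Ast::BinOp(Box::new(lhs), op, Box::new(rhs));
        }
        Ok(lhs)
    }

    fn term(&mut self) -> Result<Ast, String> {
        let mut lhs = self.factor()?;
        while let Some(Token::Star) = self.peek() {
            self.pos += 1;
            let rhs = self.factor()?;
            lhs = Ast::BinOp(Box::new(lhs), '*', Box::new(rhs));
        }
        Ok(lhs)
    }

    fn factor(&mut self) -> Result<Ast, String> {
        match self.advance() {
            Some(Token::Num(n)) => Ok(Ast::Num(n)),
            Some(Token::LParen) => {
                let inner = self.expr()?;
                match self.advance() {
                    Some(Token::RParen) => Ok(inner),
                    other => Err(format!("expected ')', found {:?}", other)),
                }
            }
            other => Err(format!("expected number or '(', found {:?}", other)),
        }
    }
}

/// Walks the AST to compute a value (stands in for a compiler's later stages).
pub fn eval(ast: &Ast) -> i64 {
    match ast {
        Ast::Num(n) => *n,
        Ast::BinOp(l, op, r) => match *op {
            '+' => eval(l) + eval(r),
            '-' => eval(l) - eval(r),
            _ => eval(l) * eval(r),
        },
    }
}
```

Calling `lex("1 + 2 * 3")` and passing the result to `Parser::new(...).parse()` yields a tree in which the multiplication sits below the addition, because `term` binds tighter than `expr`. Keeping the lexer and parser separate means the lexer alone decides what counts as a token, while the parser alone decides whether the token sequence fits the grammar.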
🎤 vibrato: Viterbi-based accelerated tokenizer
Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
Rust re-implementation of OpenFST, a library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). A Python binding is also available.
Japanese Morphological Analyzer written in Rust
CLI tool – estimates LLM tokens/costs and runs provider-aware load tests for OpenAI, Anthropic, OpenRouter, or custom endpoints.
Chinese tokenizer for tantivy, based on jieba-rs
Viterbi-based accelerated tokenizer (Python wrapper)
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and WordPiece tokenization in JavaScript, Python and Rust.
Thai natural language processing library in Rust, with Python and Node bindings.
C language lexer, parser, and virtual interpreter written from scratch
🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
A real-time LLM stream interceptor for token-level interaction research
Rust wrapper for the BlingFire tokenization library