A multilingual morphological analysis library.
Updated Jul 17, 2025 - Rust
A grammar describes the syntax of a programming language and is often defined in Backus-Naur form (BNF). A lexer performs lexical analysis, turning text into tokens. A parser takes those tokens and builds a data structure such as an abstract syntax tree (AST). The parser is concerned with context: does the sequence of tokens fit the grammar? The front end of a compiler typically combines a lexer and a parser built for a specific grammar; the compiler then goes on to analyze and translate the resulting tree.
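The lexer and parser stages can be sketched in a few dozen lines of Rust. This is a minimal illustration, not taken from any project listed here: the toy grammar and the names `Token`, `Ast`, `lex` and `Parser` are all assumptions made for the example. It lexes integer arithmetic with `+`, `-` and parentheses, then builds an AST with a hand-written recursive-descent parser.

```rust
// Illustrative sketch only: the grammar and every name below are assumptions.
// Toy grammar in BNF-like notation:
//   expr ::= term (("+" | "-") term)*
//   term ::= NUMBER | "(" expr ")"

#[derive(Debug, Clone, PartialEq)]
pub enum Token { Num(i64), Plus, Minus, LParen, RParen }

#[derive(Debug, PartialEq)]
pub enum Ast {
    Num(i64),
    BinOp { op: char, lhs: Box<Ast>, rhs: Box<Ast> },
}

/// Lexer: turns text into tokens, skipping whitespace.
/// (No overflow handling for very long digit runs; this is a sketch.)
pub fn lex(src: &str) -> Result<Vec<Token>, String> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            c if c.is_whitespace() => { chars.next(); }
            '0'..='9' => {
                // Consume a maximal run of decimal digits.
                let mut n: i64 = 0;
                while let Some(d) = chars.peek().and_then(|ch| ch.to_digit(10)) {
                    n = n * 10 + d as i64;
                    chars.next();
                }
                tokens.push(Token::Num(n));
            }
            '+' => { tokens.push(Token::Plus); chars.next(); }
            '-' => { tokens.push(Token::Minus); chars.next(); }
            '(' => { tokens.push(Token::LParen); chars.next(); }
            ')' => { tokens.push(Token::RParen); chars.next(); }
            other => return Err(format!("unexpected character: {other}")),
        }
    }
    Ok(tokens)
}

/// Parser: checks that the token sequence fits the grammar and builds an AST.
pub struct Parser { tokens: Vec<Token>, pos: usize }

impl Parser {
    pub fn new(tokens: Vec<Token>) -> Self { Parser { tokens, pos: 0 } }

    fn peek(&self) -> Option<&Token> { self.tokens.get(self.pos) }

    fn next(&mut self) -> Option<Token> {
        let t = self.tokens.get(self.pos).cloned();
        self.pos += 1;
        t
    }

    // expr ::= term (("+" | "-") term)*   (left-associative)
    fn expr(&mut self) -> Result<Ast, String> {
        let mut lhs = self.term()?;
        loop {
            let op = match self.peek() {
                Some(Token::Plus) => '+',
                Some(Token::Minus) => '-',
                _ => break,
            };
            self.pos += 1; // consume the operator
            let rhs = self.term()?;
            lhs = Ast::BinOp { op, lhs: Box::new(lhs), rhs: Box::new(rhs) };
        }
        Ok(lhs)
    }

    // term ::= NUMBER | "(" expr ")"
    fn term(&mut self) -> Result<Ast, String> {
        match self.next() {
            Some(Token::Num(n)) => Ok(Ast::Num(n)),
            Some(Token::LParen) => {
                let inner = self.expr()?;
                match self.next() {
                    Some(Token::RParen) => Ok(inner),
                    other => Err(format!("expected ')', found {other:?}")),
                }
            }
            other => Err(format!("expected number or '(', found {other:?}")),
        }
    }

    /// Parses the whole input; leftover tokens are a syntax error.
    pub fn parse(mut self) -> Result<Ast, String> {
        let ast = self.expr()?;
        if self.pos < self.tokens.len() {
            return Err(format!("unexpected trailing token: {:?}", self.tokens[self.pos]));
        }
        Ok(ast)
    }
}
```

Calling `lex("1 + (2 - 3)")` and passing the result to `Parser::new(tokens).parse()` yields a `BinOp` tree, while an input such as `"1 +"` lexes successfully but is rejected by the parser because the token sequence does not fit the grammar.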
🎤 vibrato: Viterbi-based accelerated tokenizer
Rust-tokenizer offers high-performance tokenizers for modern language models, including WordPiece, Byte-Pair Encoding (BPE) and Unigram (SentencePiece) models
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
Rust re-implementation of OpenFST - library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). A Python binding is also available.
Japanese Morphological Analyzer written in Rust
Chinese tokenizer for tantivy, based on jieba-rs
Viterbi-based accelerated tokenizer (Python wrapper)
Thai natural language processing library in Rust, with Python and Node bindings.
Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and WordPiece tokenization in JavaScript, Python and Rust.
C language lexer & parser & virtual interpreter from scratch in Rust
🛥 Vaporetto is a fast and lightweight pointwise prediction based tokenizer. This is a Python wrapper for Vaporetto.
Rust wrapper for the BlingFire tokenization library