-
tokenizers
today's most used tokenizers, with a focus on performance and versatility
-
xmlparser
Pull-based, zero-allocation XML parser
-
markup5ever
Common code for xml5ever and html5ever
-
markdown
CommonMark compliant markdown parser in Rust with ASTs and extensions
-
svgtypes
SVG types parser
-
charabia
detect the language, tokenize the text and normalize the tokens
-
sqlite3-parser
SQL parser (as understood by SQLite)
-
sentencepiece
Binding for the sentencepiece tokenizer
-
htmlparser
Pull-based, zero-allocation HTML parser
-
html5gum
A WHATWG-compliant HTML5 tokenizer and tag soup parser
-
azul-simplecss
A very simple CSS 2.1 tokenizer with CSS nesting support
-
momoa
A JSON parsing library suitable for static analysis
-
cargo-context-mcp
Model Context Protocol server exposing cargo-context as tools/resources/prompts
-
textprep
Text preprocessing primitives: normalization, tokenization, and fast keyword matching
-
cargo-context-cli
CLI front-end for cargo-context — the cargo context subcommand
-
vaporetto
pointwise prediction based tokenizer
-
formualizer-parse
High-performance Excel/OpenFormula tokenizer + parser with a stable AST surface
-
lindera-tantivy
Lindera Tokenizer for Tantivy
-
vibrato-rkyv
Vibrato: viterbi-based accelerated tokenizer with rkyv support for fast dictionary loading
-
harn-lexer
Tokenizer with span tracking for the Harn programming language
-
splintr
Fast Rust tokenizer (BPE + SentencePiece + WordPiece) with Python bindings
-
octofhir-fhirpath-parser
Parser and tokenizer for FHIRPath expressions
-
styx-tokenizer
Tokenizer for the Styx configuration language
-
chaptr
Filename tokenizer for manga, manhwa, manhua, and light novels
-
kiwi-rs
Ergonomic Rust bindings for the Kiwi Korean morphological analyzer C API
-
tokstream-cli
CLI token stream simulator using Hugging Face tokenizers
-
alyze
High-performance text analysis for full-text search
-
vibrato
viterbi-based accelerated tokenizer
-
huggingface/tokenizers-python
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
-
fsqlite-ext-fts5
FTS5 full-text search extension
-
kiri-engine
Core Rust engine for Kiri Japanese morphological analyzer
-
unscanny
Painless string scanning
-
sentencepiece-rs
Rust runtime reimplementation of SentencePiece model loading, normalization, encoding, and decoding
-
gaze-mcp-core
Transport-free MCP-shaped chokepoint runtime for Gaze. Enforces redact→manifest→return ordering at the type level.
-
naturallanguage
Safe Rust bindings for Apple's NaturalLanguage framework — language detection, tokenization, tagging, embeddings, gazetteers, and custom models on macOS
-
toktrie_hf_tokenizers
HuggingFace tokenizers library support for toktrie and llguidance
-
tokie
Blazingly fast tokenizer - 50x faster tokenization, 10x smaller model files, 100% accurate drop-in replacement for HuggingFace
-
segtok
Sentence segmentation and word tokenization tools
-
pinyinchch
A pinyin-to-Chinese-character conversion library
-
rwkv-tokenizer
A fast RWKV Tokenizer
-
pred-recdec
Predicated Recursive Descent Parsing with BNF and impure hooks
-
miktik
A unified, multi-backend tokenizer library for LLMs
-
agent-shell-parser
Shared parsing substrate for agent hook binaries — JSON input, shell tokenization
-
fsqlite-ext-icu
ICU collation extension
-
toak-ocr
OCR engine with Apple Vision framework support for macOS
-
jayce
tokenizer 🌌
-
language-tokenizer
Text tokenizer for linguistic purposes, such as text matching. Supports more than 40 languages, including English, French, Russian, Japanese, Thai etc.
-
tergo-tokenizer
R language tokenizer
-
paltoquet
rule-based general-purpose tokenizers
-
token-count
Count tokens for LLM models using exact tokenization
-
unobtanium-segmenter
A text segmentation toolbox for search applications inspired by charabia and tantivy
-
cang-jie
A Chinese tokenizer for tantivy
-
sqlite-simple-tokenizer
A run-time loadable SQLite FTS5 extension supporting Chinese and pinyin word segmentation and search
-
lexers
Tools for tokenizing and scanning
-
tentoku
Japanese text tokenizer with deinflection support
-
libsql-sqlite3-parser
SQL parser (as understood by SQLite) (libsql fork)
-
europa
A lightweight AI utilities library for Rust
-
rag_engine
Core Rust RAG engine with optional Flutter Rust Bridge integration
-
tokmat
Standalone high-performance Canadian address parsing engine core
-
rust_tokenizers
High performance tokenizers for Rust
-
mecab-ko
Korean morphological analyzer: a pure Rust implementation of MeCab-Ko
-
scanlex
lexical scanner for parsing text into tokens
-
kiri-native
Native Rust accelerator for Kiri Japanese morphological analyzer
-
tantivy-tokenizer-api
Tokenizer API of tantivy
-
kiri-yaiba
kiri-刃: Standalone Rust Japanese morphological tokenizer
-
neurorvq
biosignal tokenizer — inference in Rust with Burn ML
-
text-tokenizer
Custom text tokenizer
-
ah-ah-ah
VUN token! TWO tokens! Count all the beautiful tokens ... offline! Ah-ah-ah!
-
ailed-soulsteal
Fast game data extractor — drains PGN into tokenized training data for ML models. 60K+ games/sec.
-
nlpo3
Thai natural language processing library, with Python and Node bindings
-
latex-math
Command-line interface (CLI) tool and library for parsing and representing mathematical expressions
-
bleuscore
A fast bleu score calculator
-
tokc
Count tokens from stdin or files, with colorized tokenization display
-
rust-forth-tokenizer
A Forth tokenizer written in Rust
-
panko
A small, zero-copy text tokenizer that crumbles strings into Words, Symbols, and Newlines
-
syn_derive
Derive macros for syn::Parse and quote::ToTokens
-
crossandra
A fast and simple lexical tokenization library
-
makepad-live-tokenizer
Makepad platform live DSL tokenizer
-
toktrie_hf_downloader
HuggingFace Hub download library support for toktrie and llguidance
-
limbo_sqlite3_parser
SQL parser (as understood by SQLite)
-
rune-tokenize
Approximate token counting, budget checking, text truncation, and overlapping chunk splitting for LLM context windows
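The approximate counting and overlapping-chunk splitting that rune-tokenize's description names can be sketched with the common "about four characters per token" heuristic. The function names and the heuristic constant below are illustrative assumptions, not the crate's actual API.

```rust
// Rough heuristic: English text averages roughly 4 characters per token.
fn approx_tokens(text: &str) -> usize {
    (text.chars().count() + 3) / 4
}

// Split text into fixed-size chunks where adjacent chunks share `overlap`
// characters of context, as is common when packing LLM context windows.
fn chunk_overlapping(text: &str, chunk: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < chunk);
    let chars: Vec<char> = text.chars().collect();
    let mut out = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + chunk).min(chars.len());
        out.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start = end - overlap; // step back so the next chunk repeats some context
    }
    out
}

fn main() {
    println!("{}", approx_tokens("hello world")); // 11 chars -> 3
    println!("{:?}", chunk_overlapping("abcdefgh", 4, 1)); // ["abcd", "defg", "gh"]
}
```

A real tokenizer-backed count can differ noticeably from this heuristic; crates like tokenx-rs or token-count in this list trade that accuracy differently.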
-
fast_html5ever
High-performance browser-grade HTML5 parser
-
fsqlite-ext-fts3
FTS3/FTS4 full-text search extension
-
token-dict
basic dictionary based tokenization
-
tokstream-core
Core tokenizer streaming engine for tokstream
-
axonml-text
Text processing utilities for the Axonml ML framework
-
edifact-parser
Streaming EDIFACT tokenizer and SAX-style parser — standalone, no BO4E dependency
-
roketok
A simple way to set up and use a tokenizer. Not recommended for simple tokenizers, as this crate adds substantial machinery to support many kinds of tokenizers
-
tantivy-vibrato
A Tantivy tokenizer using Vibrato
-
turso_sqlite3_parser
SQL parser (as understood by SQLite)
-
parse_light
customizable, and lightweight JSON parser with minimal dependencies
-
tokenx-rs
Fast token count estimation for LLMs at 96% accuracy without a full tokenizer
-
ferrum-tokenizer
Tokenization wrapper for Ferrum inference engine
-
divvunspell
Spell checking library for ZHFST/BHFST spellers, with case handling and tokenization support
-
lindera-tokenizer
A morphological analysis library
-
punkt
sentence tokenizer
-
fuzzy-pickles
A low-level parser of Rust source code with high-level visitor implementations
-
llama-tokenizer
Tokenizer crate for llama.rs — deterministic text-to-token conversion
-
trustformers-tokenizers
Tokenizers for TrustformeRS
-
rusqlite-ext
Rusqlite extension for building the FTS5 tokenizer
-
cuttle
A large language model inference engine in Rust
-
burn_dragon_tokenizer
Tokenizer primitives for burn_dragon
-
flat-cli
Flatten codebases into AI-friendly format
-
oxidoc-text
Shared tokenization pipeline for oxidoc — used by both build-time and query-time search
-
mipl
Minimal Imperative Parsing Library
-
kanpyo-dict
Dictionary Library for Kanpyo
-
indent_tokenizer
Generate tokens based on indentation
-
toktrie_tiktoken
tiktoken (OpenAI BPE) library support for toktrie and llguidance
-
lang_pt
A parser tool to generate recursive descent top down parser
-
oaken
Search tree based Lexical Analysis tokenizer
-
tokenizer-lib
Tokenization utilities for building parsers in Rust
-
kohaku
tokenizer
-
vn-nlp
Vietnamese NLP library — tokenization, normalization, segmentation
-
philharmonic-connector-impl-embed
Embedding connector implementation for Philharmonic (embed capability: text in, vector out)
-
rs-jsontxt2token
Converts JSONL to tokens
-
pinyinchch-type
A pinyin-to-Chinese-character conversion library
-
tekken-rs
Mistral Tekken tokenizer with audio support
-
rten-text
Text tokenization and other ML pre/post-processing functions
-
voice-stt
Speech-to-text library backed by MLX, starting with Moonshine
-
syntaqlite-buildtools
Internal codegen and build tools for syntaqlite — not intended for direct use
-
wordchipper-cli-util
Wordchipper CLI
-
ragit-korean
korean tokenizer for ragit
-
neco-syntax-textmate
TextMate-style syntax loading and tokenization on top of syntect
-
mecab-ko-dict-sync
Korean National Institute dictionary API client for MeCab-Ko
-
scivex-nlp
Scivex — Tokenization, embeddings, and text processing
-
unitoken
Fast BPE tokenizer/trainer with a Rust core and Python bindings
-
sqlite-jieba-tokenizer
A run-time loadable SQLite FTS5 extension supporting Chinese and English word segmentation and search
-
vaporetto_rules
Rule-based filters for Vaporetto
-
mt_mtc
Tokenizer and parser for the Minot language
-
chinese_segmenter
Tokenize Chinese sentences using a dictionary-driven largest first matching approach
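The dictionary-driven "largest first" (maximum matching) approach that chinese_segmenter's description names can be sketched with a toy dictionary. The function name, dictionary contents, and window size here are illustrative assumptions, not the crate's API.

```rust
use std::collections::HashSet;

// Greedy longest-match segmentation: at each position, try the longest
// dictionary entry first, fall back to a single character if nothing matches.
fn segment(text: &str, dict: &HashSet<&str>, max_len: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let mut matched = 1; // single-character fallback
        let upper = max_len.min(chars.len() - i);
        for len in (1..=upper).rev() {
            let cand: String = chars[i..i + len].iter().collect();
            if dict.contains(cand.as_str()) {
                matched = len;
                break;
            }
        }
        out.push(chars[i..i + matched].iter().collect());
        i += matched;
    }
    out
}

fn main() {
    // Hypothetical toy dictionary; a real segmenter ships CC-CEDICT-scale data.
    let dict: HashSet<&str> = ["中国", "人民"].into();
    println!("{:?}", segment("中国人民", &dict, 2)); // ["中国", "人民"]
}
```

Greedy maximum matching is fast but can mis-segment ambiguous spans; the statistical tokenizers elsewhere in this list (e.g. Viterbi-based ones) exist to handle those cases.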
-
sentencepiece-sys
Binding for the sentencepiece tokenizer
-
derive-finite-automaton
Procedural macro for generating finite automata
-
go-brrr
Token-efficient code analysis for LLMs - Rust implementation
-
rkllm-sys-rs
Raw FFI bindings for librkllmrt
-
instant-clip-tokenizer
Fast text tokenizer for the CLIP neural network
-
skimmer
streams reader
-
use-token
Composable tokenization primitives for RustUse
-
syntaqlite-syntax
Internal parser and tokenizer for syntaqlite — not intended for direct use
-
tokenizers-enfer
today's most used tokenizers, with a focus on performance and versatility
-
vn-nlp-tokenize
Vietnamese tokenization algorithms for vn-nlp
-
vtext
NLP with Rust
-
bpe-tokenizer
A BPE Tokenizer library
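The core of byte-pair encoding, which several crates in this list implement, is a loop that repeatedly merges the most frequent adjacent token pair. This is a minimal sketch of the technique itself, not bpe-tokenizer's actual API; all names are illustrative.

```rust
use std::collections::HashMap;

// Count adjacent pairs and return the most frequent one (ties broken
// lexicographically so the result is deterministic).
fn most_frequent_pair(tokens: &[String]) -> Option<(String, String)> {
    let mut counts: HashMap<(String, String), usize> = HashMap::new();
    for w in tokens.windows(2) {
        *counts.entry((w[0].clone(), w[1].clone())).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by(|a, b| a.1.cmp(&b.1).then(a.0.cmp(&b.0)))
        .map(|(pair, _)| pair)
}

// Replace every non-overlapping occurrence of `pair` with its concatenation.
fn merge(tokens: Vec<String>, pair: &(String, String)) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && tokens[i] == pair.0 && tokens[i + 1] == pair.1 {
            out.push(format!("{}{}", pair.0, pair.1));
            i += 2;
        } else {
            out.push(tokens[i].clone());
            i += 1;
        }
    }
    out
}

fn main() {
    // Start from single characters and apply two training merges.
    let mut toks: Vec<String> = "aaabdaaabac".chars().map(String::from).collect();
    for _ in 0..2 {
        if let Some(pair) = most_frequent_pair(&toks) {
            toks = merge(toks, &pair);
        }
    }
    println!("{:?}", toks); // ["aaa", "b", "d", "aaa", "b", "a", "c"]
}
```

Production BPE implementations record the learned merge order as a vocabulary and apply it at encode time rather than recounting pairs per input.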
-
marukov
markov chain text generator
-
rtf-grimoire
A Rich Text File (RTF) document tokenizer. Useful for writing RTF parsers.
-
vn-nlp-segment
Vietnamese sentence segmentation for vn-nlp
-
vn-nlp-normalize
Vietnamese text normalization — diacritics, unicode NFC/NFD
-
alith-models
Load and Download LLM Models, Metadata, and Tokenizers
-
vil_tokenizer
VIL Tokenizer Engine — native Rust BPE tokenization for LLM token counting and text splitting
-
vaporetto_tantivy
Vaporetto Tokenizer for Tantivy
-
tokenizer
Thai text tokenizer
-
palate_polyglot_tokenizer
A generic programming language tokenizer
-
ld-lucivy-tokenizer-api
Tokenizer API of lucivy
-
punkt_n
Punkt sentence tokenizer
-
mako
main Sidekick AI data processing library
-
s-expression
parser
-
wordpieces
Split tokens into word pieces
-
sixel-tokenizer
A tokenizer for serialized Sixel bytes
-
thfst-tools
Support tools for DivvunSpell - convert ZHFST files to BHFST
-
sentencepiece-model
SentencePiece model parser generated from the SentencePiece protobuf definition
-
ellie_tokenizer
Tokenizer for ellie language
-
alpino-tokenizer
Wrapper around the Alpino tokenizer for Dutch
-
biors-core
Core biological sequence validation and protein tokenization contracts for bio-rs
-
simple-tokenizer
A tiny no_std tokenizer with line & column tracking
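Line-and-column tracking of the kind simple-tokenizer's description mentions amounts to carrying a position alongside the scan. The sketch below uses std and whitespace-delimited words purely for illustration; it is not the crate's API.

```rust
// A token that remembers where it started in the source.
#[derive(Debug, PartialEq)]
struct Token<'a> {
    text: &'a str,
    line: u32,
    col: u32,
}

// Split on whitespace while tracking 1-based line/column positions.
fn tokenize(src: &str) -> Vec<Token<'_>> {
    let mut out = Vec::new();
    let (mut line, mut col) = (1u32, 1u32);
    let mut start = None; // (byte offset, line, col) of the word in progress
    for (i, ch) in src.char_indices() {
        if ch.is_whitespace() {
            if let Some((s, l, c)) = start.take() {
                out.push(Token { text: &src[s..i], line: l, col: c });
            }
            if ch == '\n' {
                line += 1;
                col = 1;
            } else {
                col += 1;
            }
        } else {
            if start.is_none() {
                start = Some((i, line, col));
            }
            col += 1;
        }
    }
    if let Some((s, l, c)) = start {
        out.push(Token { text: &src[s..], line: l, col: c });
    }
    out
}

fn main() {
    for t in tokenize("fn main\n  ok") {
        println!("{}:{} {:?}", t.line, t.col, t.text);
    }
}
```

Borrowing `&str` slices instead of allocating per token is what makes this pattern workable in no_std contexts.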
-
syntaxdot-tokenizers
Subword tokenizers
-
procedural-masquarade
Incorrect spelling for procedural-masquerade
-
gtars-tokenizers
Genomic region tokenizers for machine learning in Rust
-
tele_tokenizer
A CSS tokenizer
-
divvunspell-bin
Spellchecker for ZHFST/BHFST spellers, with case handling and tokenization support
-
tokengeex
efficient tokenizer for code based on UnigramLM and TokenMonster
-
data_vault
Data Vault is a modular, pragmatic, credit card vault for Rust
-
tiniestsegmenter
Compact Japanese segmenter
-
uscan
A universal source code scanner
-
sentence
tokenizes English language sentences for use in TTS applications
-
toresy
term rewriting system based on tokenization
-
izihawa-tantivy-tokenizer-api
Tokenizer API of tantivy
-
earl-lang-syntax
tokenizer and parser for the language Earl
-
sqlite-charabia-tokenizer
A run-time loadable SQLite FTS5 extension supporting Chinese and English word segmentation and search
-
tinytoken
tokenizing text into words, numbers, symbols, and more, with customizable parsing options
-
caddyfile
Library for working with Caddy's Caddyfile format
-
aleph-alpha-tokenizer
A fast implementation of a wordpiece-inspired tokenizer
-
cssparser-macros
Procedural macros for cssparser
-
strizer
minimal and fast library for text tokenization
-
specmc-base
common code for parsing Minecraft specification
-
colorblast
Syntax highlighting library for various programming languages, markup languages and various other formats
-
castle_tokenizer
Castle Tokenizer: tokenizer
-
json-parser
JSON parser
-
blingfire
Wrapper for the BlingFire tokenization library
-
indentation_flattener
From indented input, generate plain output with indentation PUSH and POP codes
-
nipah_tokenizer
A powerful yet simple text tokenizer for your everyday needs!
-
gtokenizers
tokenizing genomic data with an emphasis on region set data
-
regex-bnf
A deterministic parser for a BNF inspired syntax with regular expressions
-
basic_lexer
Basic lexical analyzer for parsing and compiling
-
neca-cmd
command tokenizer used by my Twitch chat bot
-
xtoken
Iterator based no_std XML Tokenizer using memchr
-
mecab-ko-core
Korean morphological analysis core engine: lattice, Viterbi, and tokenizer
-
boost_tokenizer
Boost C++ library boost_tokenizer packaged using Zanbil
-
tokenmonster
Greedy tiktoken-like tokenizer with embedded vocabulary (cl100k-base approximator)
-
morsels_lang_ascii
Basic ascii tokenizer for morsels
-
tokeneer
tokenizer crate
-
regex-tokenizer
A regex tokenizer
-
tinysegmenter
Compact Japanese tokenizer
-
pretok
A string pre-tokenizer for C-like syntaxes
-
alpino-tokenize
Wrapper around the Alpino tokenizer for Dutch
-
morsels_lang_chinese
Chinese tokenizer for morsels
-
text-scanner
A UTF-8 char-oriented, zero-copy, text and code scanning library
-
quote-data
A tokenization Library for Rust
-
tokenate
Does some of the grunt work of writing a tokenizer
-
ccgen
generate manually maintained C (and C++) headers