A high-performance tokenization and analysis implementation for full-text search. Provides a UAX #29 compliant tokenizer, implemented with a hand-rolled deterministic finite automaton (DFA). Includes a complete analyzer implementation, with support for lowercasing, ASCII case folding, stemming & stopword removal.
Currently in production at turbopuffer powering the word_v4 tokenizer.
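To illustrate how the analyzer stages compose, here is a minimal sketch using only the Rust standard library. It is not this crate's API: the `analyze` function is hypothetical, splitting on non-alphanumeric characters is a crude stand-in for UAX #29 word breaking, and stemming and ASCII folding are left out.

```rust
use std::collections::HashSet;

/// Toy analyzer: tokenize, lowercase, then drop stopwords.
/// Hypothetical helper for illustration only, not part of this crate.
fn analyze(text: &str, stopwords: &HashSet<&str>) -> Vec<String> {
    // Crude tokenizer (assumption): any non-alphanumeric character ends a token.
    // A UAX #29 tokenizer handles far more cases than this.
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|tok| !tok.is_empty())
        // Lowercase before the stopword check so "The" matches "the".
        .map(|tok| tok.to_lowercase())
        .filter(|tok| !stopwords.contains(tok.as_str()))
        .collect()
}
```

The ordering matters in this sketch: lowercasing runs before stopword removal, so a lowercase stopword list also catches capitalized forms.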
Throughput over 64 MiB of English Wikipedia article text (cargo bench), running on an M5 Pro.
Numbers are the median of 16 samples.
Tokenization (benches/wikipedia.rs, wikipedia group):
| Benchmark | Throughput |
|---|---|
| word break | 508 MiB/s |
| word break + word_like | 490 MiB/s |
| sentence break | 465 MiB/s |
Analysis (benches/wikipedia.rs, analysis group). Each row adds a stage to the pipeline in the row above it,
so the drop between adjacent rows approximates that stage's marginal cost:
| Pipeline | Throughput |
|---|---|
| tokenize only (case sensitive) | 415 MiB/s |
| + lowercase | 324 MiB/s |
| + stopword removal (English) | 283 MiB/s |
| + stemming (English) | 132 MiB/s |
| full (max length + stopwords + stemming + ASCII fold) | 126 MiB/s |
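Throughput figures do not subtract cleanly, so one way to read the table is to convert each row to time per MiB and take differences between adjacent rows. The sketch below does that arithmetic; the function names are illustrative and not part of this crate.

```rust
/// Converts a throughput in MiB/s to milliseconds spent per MiB.
fn ms_per_mib(throughput_mib_s: f64) -> f64 {
    1000.0 / throughput_mib_s
}

/// For each row after the first, returns the extra milliseconds per MiB
/// relative to the row above it.
fn marginal_costs(throughputs_mib_s: &[f64]) -> Vec<f64> {
    throughputs_mib_s
        .windows(2)
        .map(|pair| ms_per_mib(pair[1]) - ms_per_mib(pair[0]))
        .collect()
}
```

Calling `marginal_costs(&[415.0, 324.0, 283.0, 132.0, 126.0])` with the table's values gives roughly 0.68, 0.45, 4.04 and 0.36 ms per MiB. By this measure, stemming dominates the cost of the pipeline.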
Reproduce with cargo bench --bench wikipedia (first run downloads the Wikipedia dataset into
.cache/).