alyze

A high-performance tokenization and analysis implementation for full-text search. It provides a UAX #29-compliant tokenizer, implemented as a hand-rolled deterministic finite automaton (DFA), and a complete analyzer with support for lowercasing, ASCII case folding, stemming, and stopword removal.

Currently in production at turbopuffer powering the word_v4 tokenizer.
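To make the tokenizer-plus-filters shape described above concrete, here is a minimal, self-contained sketch. It is not alyze's API: the names `State`, `tokenize`, and `analyze` are hypothetical, and the two-state scanner only separates alphanumeric runs from everything else, which is far simpler than the UAX #29 word-break rules the real DFA implements.

```rust
use std::collections::HashSet;

/// States of a toy two-state scanner. This is NOT the UAX #29 DFA:
/// it only distinguishes alphanumeric runs from everything else.
#[derive(Clone, Copy, PartialEq)]
enum State {
    Outside,
    InWord,
}

/// Split `text` into alphanumeric runs, returned as borrowed slices.
fn tokenize(text: &str) -> Vec<&str> {
    let mut tokens = Vec::new();
    let mut state = State::Outside;
    let mut start = 0;
    for (i, c) in text.char_indices() {
        match (state, c.is_alphanumeric()) {
            // Entering a word: remember where it starts.
            (State::Outside, true) => {
                start = i;
                state = State::InWord;
            }
            // Leaving a word: emit the slice seen so far.
            (State::InWord, false) => {
                tokens.push(&text[start..i]);
                state = State::Outside;
            }
            // Otherwise stay in the current state.
            _ => {}
        }
    }
    // Flush a word that runs to the end of the input.
    if state == State::InWord {
        tokens.push(&text[start..]);
    }
    tokens
}

/// Hypothetical analyzer: tokenize, lowercase, then drop stopwords.
fn analyze(text: &str, stopwords: &HashSet<&str>) -> Vec<String> {
    tokenize(text)
        .into_iter()
        .map(|t| t.to_lowercase())
        .filter(|t| !stopwords.contains(t.as_str()))
        .collect()
}
```

With a stopword set containing `the`, calling `analyze("The quick fox", &stopwords)` yields the tokens `quick` and `fox`. Stemming and ASCII folding would slot in as further `map` stages in the same chain.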

Benchmarks

Throughput over 64 MiB of English Wikipedia article text (cargo bench), running on an M5 Pro. Numbers are the median of 16 samples.

Tokenization (benches/wikipedia.rs, wikipedia group):

Benchmark	Throughput
word break	508 MiB/s
word break + `word_like`	490 MiB/s
sentence break	465 MiB/s

Analysis (benches/wikipedia.rs, analysis group). Each row adds one stage to the pipeline, so the deltas approximate each filter's marginal cost:

Pipeline	Throughput
tokenize only (case sensitive)	415 MiB/s
+ lowercase	324 MiB/s
+ stopword removal (English)	283 MiB/s
+ stemming (English)	132 MiB/s
full (max length + stopwords + stemming + ASCII fold)	126 MiB/s
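One way to read the deltas in the analysis table is to convert each throughput into time per MiB, because times add across stages while throughputs do not. The sketch below does that arithmetic for the four single-stage rows. The helper names are illustrative and are not part of the crate or its benchmark harness, and the "full" row is left out because it adds more than one stage at once.

```rust
/// Time in milliseconds to process one MiB at `mib_per_s`.
fn ms_per_mib(mib_per_s: f64) -> f64 {
    1000.0 / mib_per_s
}

/// Marginal cost (ms per MiB) of each stage relative to the previous row.
fn marginal_costs(rows: &[(&str, f64)]) -> Vec<(String, f64)> {
    rows.windows(2)
        .map(|w| (w[1].0.to_string(), ms_per_mib(w[1].1) - ms_per_mib(w[0].1)))
        .collect()
}

/// Throughputs copied from the analysis table above (MiB/s).
const ANALYSIS_ROWS: [(&str, f64); 4] = [
    ("tokenize only", 415.0),
    ("+ lowercase", 324.0),
    ("+ stopword removal", 283.0),
    ("+ stemming", 132.0),
];
```

Calling `marginal_costs(&ANALYSIS_ROWS)` gives roughly 0.68 ms/MiB for lowercasing, 0.45 ms/MiB for stopword removal, and 4.04 ms/MiB for stemming, so under this reading stemming accounts for most of the pipeline's cost.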

Reproduce with `cargo bench --bench wikipedia` (the first run downloads the Wikipedia dataset into `.cache/`).
