fbernier/alyze

alyze

A high-performance tokenization and analysis implementation for full-text search. Provides a UAX #29 compliant tokenizer, implemented with a hand-rolled deterministic finite automaton (DFA). Includes a complete analyzer implementation, with support for lowercasing, ASCII case folding, stemming & stopword removal.
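
As a rough illustration of the tokenize-then-filter shape described above, here is a minimal sketch using only the standard library. It is not alyze's API: the `State` enum and the `tokenize` and `analyze` functions are hypothetical names, and the two-state automaton only splits on non-alphanumeric characters, which is far simpler than the UAX #29 word-break rules (for example, it splits "don't" at the apostrophe, which UAX #29 does not). Stemming and ASCII folding are omitted.

```rust
use std::collections::HashSet;

/// States of a toy two-state DFA (hypothetical; far simpler than UAX #29).
#[derive(Clone, Copy, PartialEq)]
enum State {
    Outside,
    InWord,
}

/// Splits `text` into alphanumeric runs by stepping the DFA once per char.
fn tokenize(text: &str) -> Vec<&str> {
    let mut tokens = Vec::new();
    let mut state = State::Outside;
    let mut start = 0;
    for (i, c) in text.char_indices() {
        // The transition depends only on the character class of `c`.
        let next = if c.is_alphanumeric() {
            State::InWord
        } else {
            State::Outside
        };
        match (state, next) {
            (State::Outside, State::InWord) => start = i,
            (State::InWord, State::Outside) => tokens.push(&text[start..i]),
            _ => {}
        }
        state = next;
    }
    // Flush a token that runs to the end of the input.
    if state == State::InWord {
        tokens.push(&text[start..]);
    }
    tokens
}

/// Lowercases each token, then drops tokens found in `stopwords`.
fn analyze(text: &str, stopwords: &HashSet<&str>) -> Vec<String> {
    tokenize(text)
        .into_iter()
        .map(|t| t.to_lowercase())
        .filter(|t| !stopwords.contains(t.as_str()))
        .collect()
}

fn main() {
    // An illustrative stopword list, not the one alyze ships.
    let stopwords: HashSet<&str> = ["the", "a", "of"].iter().copied().collect();
    let terms = analyze("The Anatomy of a Search-Engine", &stopwords);
    println!("{:?}", terms);
}
```

Lowercasing runs before stopword removal here so that "The" matches the lowercase stopword entry, which mirrors the stage order in the analysis benchmark.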

Currently in production at turbopuffer powering the word_v4 tokenizer.

Benchmarks

Throughput over 64 MiB of English Wikipedia article text (cargo bench), running on an M5 Pro. Numbers are the median of 16 samples.
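
For readers who want to sanity-check numbers of this kind, the sketch below shows one way to turn raw timings into a median throughput. It is an assumption about the arithmetic, not a description of what the benchmark harness does internally: the function names are hypothetical, the timings in `main` are made up, and the median of an even number of samples is taken as the mean of the two middle values.

```rust
use std::time::Duration;

const MIB: f64 = 1024.0 * 1024.0;

/// Throughput in MiB/s for processing `bytes` in `elapsed`.
fn throughput_mib_s(bytes: u64, elapsed: Duration) -> f64 {
    (bytes as f64 / MIB) / elapsed.as_secs_f64()
}

/// Median of the samples; for an even count, the mean of the two middle values.
fn median(samples: &mut [f64]) -> f64 {
    assert!(!samples.is_empty(), "median of an empty sample set is undefined");
    samples.sort_by(|a, b| a.partial_cmp(b).expect("samples must not be NaN"));
    let mid = samples.len() / 2;
    if samples.len() % 2 == 0 {
        (samples[mid - 1] + samples[mid]) / 2.0
    } else {
        samples[mid]
    }
}

fn main() {
    let corpus_bytes: u64 = 64 * 1024 * 1024;
    // Hypothetical timings in milliseconds, not measurements from this README.
    let timings_ms = [126u64, 128, 125, 131];
    let mut samples: Vec<f64> = timings_ms
        .iter()
        .map(|&ms| throughput_mib_s(corpus_bytes, Duration::from_millis(ms)))
        .collect();
    println!("median: {:.0} MiB/s", median(&mut samples));
}
```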

Tokenization (benches/wikipedia.rs, wikipedia group):

| Benchmark | Throughput |
| --- | --- |
| word break | 508 MiB/s |
| word break + word_like | 490 MiB/s |
| sentence break | 465 MiB/s |

Analysis (benches/wikipedia.rs, analysis group) — each row adds one stage to the pipeline, so the deltas approximate each filter's marginal cost:

| Pipeline | Throughput |
| --- | --- |
| tokenize only (case sensitive) | 415 MiB/s |
| + lowercase | 324 MiB/s |
| + stopword removal (English) | 283 MiB/s |
| + stemming (English) | 132 MiB/s |
| full (max length + stopwords + stemming + ASCII fold) | 126 MiB/s |
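
Throughput differences are not directly additive, so one way to read the table is to convert each row to time per MiB and subtract consecutive rows. The sketch below does that with the figures above. The function names are hypothetical, and the approach assumes the stages' costs simply add up, which is only an approximation.

```rust
/// Milliseconds spent per MiB at a given throughput in MiB/s.
fn ms_per_mib(throughput_mib_s: f64) -> f64 {
    1000.0 / throughput_mib_s
}

/// Marginal cost (ms per MiB) of each row after the first, taken as the
/// difference in per-MiB time between consecutive cumulative pipelines.
fn marginal_costs(throughputs: &[f64]) -> Vec<f64> {
    throughputs
        .windows(2)
        .map(|w| ms_per_mib(w[1]) - ms_per_mib(w[0]))
        .collect()
}

fn main() {
    // Throughputs in MiB/s, copied from the analysis table.
    let stages = [
        ("tokenize only", 415.0),
        ("+ lowercase", 324.0),
        ("+ stopword removal", 283.0),
        ("+ stemming", 132.0),
        ("full", 126.0),
    ];
    let throughputs: Vec<f64> = stages.iter().map(|&(_, t)| t).collect();

    println!("{:<20} {:.2} ms/MiB", stages[0].0, ms_per_mib(throughputs[0]));
    for (stage, cost) in stages[1..].iter().zip(marginal_costs(&throughputs)) {
        println!("{:<20} +{:.2} ms/MiB", stage.0, cost);
    }
}
```

Read this way, stemming adds roughly 4 ms per MiB, about six times the cost of lowercasing. The final row adds two stages at once (max length and ASCII fold), so its delta covers both.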

Reproduce with cargo bench --bench wikipedia (first run downloads the Wikipedia dataset into .cache/).
