A Rust library implementing two efficient lexicon data structures — Trie and DAWG (Directed Acyclic Word Graph) — with an optional Python binding via PyO3 and Maturin.
lexrs is the successor to lexpy (pure Python). It exposes the same API while delivering 10–100× faster insertion and search by moving the core data structures to Rust.
Documentation: aosingh.github.io/lexrs
- Install
- Data structures
- Features
- Rust usage
- Python usage
- Production HTTP server
- Running tests
- Project structure
- Related components
- License
Rust library — add to Cargo.toml:
[dependencies]
lexrs = "1.0"Python package:
pip install pylexrsHTTP server binaries:
cargo install lexrs-serverThis installs both the writer and reader binaries. See the Production HTTP server section for deployment details.
A standard prefix tree using arena allocation (Vec<Node>). Supports insertion in any order.
A minimized Trie that compresses shared suffixes in addition to shared prefixes, resulting in a significantly smaller node count for large lexicons. Words must be inserted in lexicographic (alphabetical) order; call reduce() after all insertions to finalize minimization.
Both Trie and DAWG expose the same API:
| Method | Description |
|---|---|
add(word, count) |
Insert a word with an optional frequency count |
add_all(words) |
Insert from any iterable |
add_from_file(path) |
Insert words from a file (one per line) |
contains(word) |
Exact membership test |
contains_prefix(prefix) |
Check if any word starts with the given prefix |
search(pattern) |
Wildcard search (* = zero or more chars, ? = exactly one) |
search_with_count(pattern) |
Like search, but returns (word, count) pairs |
search_with_prefix(prefix) |
All words beginning with the prefix |
search_with_prefix_count(prefix) |
Like above, with counts |
search_within_distance(word, dist) |
Levenshtein fuzzy search |
search_within_distance_count(word, dist) |
Like above, with counts |
word_count() |
Number of words stored |
node_count() |
Number of nodes in the structure |
Batch APIs (parallel via Rayon, results in input order):
| Method | Description |
|---|---|
batch_contains(words) |
Membership test for a list of words → list[bool] |
batch_search(patterns) |
Wildcard search for a list of patterns → list[list[str]] |
batch_search_with_prefix(prefixes) |
Prefix search for a list of prefixes → list[list[str]] |
batch_search_within_distance(words, dist) |
Fuzzy search for a list of words → list[list[str]] |
# Cargo.toml
[dependencies]
lexrs = "1.0"use lexrs::{Trie, Dawg};
// ── Trie ──────────────────────────────────────────────────────────────────────
let mut trie = Trie::new();
trie.add("hello", 5).unwrap(); // word + optional count
trie.add_all(vec!["world".to_string(), "foo".to_string()]).unwrap();
trie.add_from_file("words.txt").unwrap();
trie.contains("hello"); // true
trie.contains_prefix("wor"); // true
trie.word_count(); // total words stored
trie.node_count(); // total nodes in structure
trie.search("h*").unwrap(); // wildcard → Vec<String>
trie.search_with_count("h*").unwrap(); // → Vec<(String, usize)>
trie.search_with_prefix("wo"); // prefix completion → Vec<String>
trie.search_with_prefix_count("wo"); // → Vec<(String, usize)>
trie.search_within_distance("helo", 1); // fuzzy (Levenshtein ≤ 1) → Vec<String>
trie.search_within_distance_count("helo", 1); // → Vec<(String, usize)>
// ── DAWG (words must be inserted in alphabetical order) ───────────────────────
let mut dawg = Dawg::new();
dawg.add_all(vec!["apple".to_string(), "apply".to_string(), "apt".to_string()]).unwrap();
// add_all sorts automatically; call reduce() if you use add() directly
dawg.reduce();
dawg.contains("apple"); // true
dawg.search("ap*").unwrap(); // wildcard
dawg.search_within_distance("aple", 1); // fuzzyBatch methods run all inputs in parallel via Rayon and return results in input order.
let words = vec!["apple", "apply", "cherry", "apt"];
trie.batch_contains(&words);
// [true, true, false, true]
trie.batch_search(&["ap*", "b*", "xyz*"]);
// [["apple", "apply", "apt"], ["banana"], []]
trie.batch_search_with_prefix(&["app", "ban"]);
// [["apple", "apply"], ["banana"]]
trie.batch_search_within_distance(&["aple", "bannana"], 1);
// [["apple"], ["banana"]]The library is available on PyPI as pylexrs:
pip install pylexrsfrom lexrs import Trie, DAWG
# ── Trie ──────────────────────────────────────────────────────────────────────
t = Trie()
t.add("hello", 5) # word + optional count
t.add_all(["world", "foo"])
t.add_from_file("words.txt")
"hello" in t # True
t.contains_prefix("wor") # True
t.get_word_count() # total words
len(t) # total nodes
t.search("h*") # wildcard, returns list of words
t.search("h*", with_count=True) # returns list of (word, count)
t.search_with_prefix("wo") # prefix completion
t.search_with_prefix_count("wo") # with counts
t.search_within_distance("helo", 1) # fuzzy (Levenshtein ≤ 1)
t.search_within_distance("helo", 1, with_count=True)
# ── DAWG ──────────────────────────────────────────────────────────────────────
d = DAWG()
d.add_all(["apple", "apply", "apt"]) # sorted automatically
"apple" in d # True
d.search("ap*") # wildcard
d.search_within_distance("aple", dist=1)Batch methods process a list of inputs in one call, running all items in parallel via Rayon. They avoid repeated FFI boundary crossings and deliver 2–6× speedup over a Python loop calling the single-item equivalents.
words = ["apple", "apply", "cherry", "apt"]
t.batch_contains(words)
# [True, True, False, True]
t.batch_search(["ap*", "b*", "xyz*"])
# [["apple", "apply", "apt"], ["banana"], []]
t.batch_search_with_prefix(["app", "ban"])
# [["apple", "apply"], ["banana"]]
t.batch_search_within_distance(["aple", "bannana"], dist=1)
# [["apple"], ["banana"]]Results are always returned in the same order as the input list. An unmatched pattern or prefix returns an empty inner list. dist defaults to 0.
| Pattern | Meaning |
|---|---|
* |
Zero or more characters |
? |
Exactly one character |
h* |
All words starting with h |
?at |
Three-letter words ending in at |
a?* |
Words of two or more characters starting with a |
Consecutive wildcards are normalized (** → *, ?* → *).
lexrs ships a production-ready HTTP service in lexrs-server/ with two binaries:
| Binary | Role |
|---|---|
| writer | Accepts word ingestion (POST /words), buffers a delta Trie in memory, and periodically compacts it into a versioned snapshot on disk. |
| reader | Loads the latest snapshot into a DAWG and serves all search queries. Multiple readers can run as replicas and hot-reload new snapshots without downtime. |
Consul is used for snapshot version coordination — the writer publishes new snapshot versions to the Consul KV store and readers watch for changes via blocking queries, atomically swapping the in-memory DAWG when a new snapshot arrives.
The docker/ directory has a Compose file that brings up the entire stack:
nginx (port 80)
│
┌──────────┴───────────┐
▼ ▼ (round-robin)
writer reader-1 reader-2
│ ▲ ▲
├──► snapshots ─────────┴──────────┘
│ (shared volume)
└──► Consul KV ──► readers watch & hot-reload
cd docker
docker compose up -d# ingest words with optional per-word counts
curl -X POST http://localhost/words \
-H 'Content-Type: application/json' \
-d '{"words": [{"word": "apple", "count": 10}, {"word": "apply", "count": 3}, "apt"]}'
# flush to snapshot (readers reload within ~30 s)
curl -X POST http://localhost/compact
# prefix search
curl 'http://localhost/prefix?q=app'
# wildcard search
curl 'http://localhost/search?q=app*'
# fuzzy search (Levenshtein distance ≤ 1)
curl 'http://localhost/search?q=aple&dist=1'
# exact lookup
curl 'http://localhost/contains?q=apple'
# batch membership
curl -X POST http://localhost/batch/contains \
-H 'Content-Type: application/json' \
-d '{"words": ["apple", "cherry", "apricot"]}'
# batch prefix
curl -X POST http://localhost/batch/prefix \
-H 'Content-Type: application/json' \
-d '{"prefixes": ["app", "ban"]}'The writer and reader are built from a single Dockerfile — the command: field in Compose selects which binary to start. See the server docs for the full API reference and configuration options.
# Rust unit tests
cargo test
# Python tests
pip install pylexrs pytest
pytest tests/lexrs/src/
lib.rs — public re-exports and Python module registration
trie.rs — Trie implementation + wildcard / Levenshtein helpers
dawg.rs — DAWG implementation
python.rs — PyO3 bindings (Trie, DAWG, batch APIs)
node.rs — shared Node type (arena-allocated)
utils.rs — file I/O and pattern normalization
error.rs — LexError enum
lexrs/tests/
trie_tests.rs — Rust unit tests for Trie
dawg_tests.rs — Rust unit tests for DAWG
lexrs-server/src/
writer.rs — word ingestion + compaction binary
reader.rs — search server + batch endpoints binary
consul.rs — Consul KV watch / publish helpers
snapshot.rs — snapshot file I/O
tests/
test_python_api.py — pytest suite for the Python bindings
| Directory | Description |
|---|---|
| lexrs-server/ | Production HTTP server — a writer binary for word ingestion and compaction, and a reader binary for search. Readers scale horizontally and hot-reload snapshots via Consul. |
| docker/ | Docker Compose setup running the full stack: Consul, writer, two reader replicas, and an nginx reverse proxy that routes reads and writes to the right service. |
| benchmarks/ | Python scripts comparing pylexrs against lexpy (pure Python) across insertion, prefix, wildcard, Levenshtein, and batch workloads. |
| tests/ | Integration tests — pytest suite for the Python API and Rust-level integration tests. |
MIT