Skip to content

aosingh/lexrs

Repository files navigation

lexrs

A Rust library implementing two efficient lexicon data structures — Trie and DAWG (Directed Acyclic Word Graph) — with an optional Python binding via PyO3 and Maturin.

lexrs is the successor to lexpy (pure Python). It exposes the same API while delivering 10–100× faster insertion and search by moving the core data structures to Rust.

Documentation: aosingh.github.io/lexrs

Table of contents

Install

Rust library — add to Cargo.toml:

[dependencies]
lexrs = "1.0"

Python package:

pip install pylexrs

HTTP server binaries:

cargo install lexrs-server

This installs both the writer and reader binaries. See the Production HTTP server section for deployment details.


Data structures

Trie

A standard prefix tree using arena allocation (Vec<Node>). Supports insertion in any order.

DAWG

A minimized Trie that compresses shared suffixes in addition to shared prefixes, resulting in a significantly smaller node count for large lexicons. Words must be inserted in lexicographic (alphabetical) order; call reduce() after all insertions to finalize minimization.

Features

Both Trie and DAWG expose the same API:

Method Description
add(word, count) Insert a word with an optional frequency count
add_all(words) Insert from any iterable
add_from_file(path) Insert words from a file (one per line)
contains(word) Exact membership test
contains_prefix(prefix) Check if any word starts with the given prefix
search(pattern) Wildcard search (* = zero or more chars, ? = exactly one)
search_with_count(pattern) Like search, but returns (word, count) pairs
search_with_prefix(prefix) All words beginning with the prefix
search_with_prefix_count(prefix) Like above, with counts
search_within_distance(word, dist) Levenshtein fuzzy search
search_within_distance_count(word, dist) Like above, with counts
word_count() Number of words stored
node_count() Number of nodes in the structure

Batch APIs (parallel via Rayon, results in input order):

Method Description
batch_contains(words) Membership test for a list of words → list[bool]
batch_search(patterns) Wildcard search for a list of patterns → list[list[str]]
batch_search_with_prefix(prefixes) Prefix search for a list of prefixes → list[list[str]]
batch_search_within_distance(words, dist) Fuzzy search for a list of words → list[list[str]]

Rust usage

# Cargo.toml
[dependencies]
lexrs = "1.0"

API

use lexrs::{Trie, Dawg};

// ── Trie ──────────────────────────────────────────────────────────────────────
let mut trie = Trie::new();
trie.add("hello", 5).unwrap();                    // word + optional count
trie.add_all(vec!["world".to_string(), "foo".to_string()]).unwrap();
trie.add_from_file("words.txt").unwrap();

trie.contains("hello");                           // true
trie.contains_prefix("wor");                      // true
trie.word_count();                                // total words stored
trie.node_count();                                // total nodes in structure

trie.search("h*").unwrap();                       // wildcard → Vec<String>
trie.search_with_count("h*").unwrap();            // → Vec<(String, usize)>
trie.search_with_prefix("wo");                    // prefix completion → Vec<String>
trie.search_with_prefix_count("wo");              // → Vec<(String, usize)>
trie.search_within_distance("helo", 1);           // fuzzy (Levenshtein ≤ 1) → Vec<String>
trie.search_within_distance_count("helo", 1);     // → Vec<(String, usize)>

// ── DAWG (words must be inserted in alphabetical order) ───────────────────────
let mut dawg = Dawg::new();
dawg.add_all(vec!["apple".to_string(), "apply".to_string(), "apt".to_string()]).unwrap();
// add_all sorts automatically; call reduce() if you use add() directly
dawg.reduce();

dawg.contains("apple");                           // true
dawg.search("ap*").unwrap();                      // wildcard
dawg.search_within_distance("aple", 1);           // fuzzy

Batch APIs

Batch methods run all inputs in parallel via Rayon and return results in input order.

let words = vec!["apple", "apply", "cherry", "apt"];

trie.batch_contains(&words);
// [true, true, false, true]

trie.batch_search(&["ap*", "b*", "xyz*"]);
// [["apple", "apply", "apt"], ["banana"], []]

trie.batch_search_with_prefix(&["app", "ban"]);
// [["apple", "apply"], ["banana"]]

trie.batch_search_within_distance(&["aple", "bannana"], 1);
// [["apple"], ["banana"]]

Python usage

The library is available on PyPI as pylexrs:

pip install pylexrs

API

from lexrs import Trie, DAWG

# ── Trie ──────────────────────────────────────────────────────────────────────
t = Trie()
t.add("hello", 5)          # word + optional count
t.add_all(["world", "foo"])
t.add_from_file("words.txt")

"hello" in t               # True
t.contains_prefix("wor")   # True
t.get_word_count()         # total words
len(t)                     # total nodes

t.search("h*")                          # wildcard, returns list of words
t.search("h*", with_count=True)         # returns list of (word, count)
t.search_with_prefix("wo")              # prefix completion
t.search_with_prefix_count("wo")        # with counts
t.search_within_distance("helo", 1)     # fuzzy (Levenshtein ≤ 1)
t.search_within_distance("helo", 1, with_count=True)

# ── DAWG ──────────────────────────────────────────────────────────────────────
d = DAWG()
d.add_all(["apple", "apply", "apt"])   # sorted automatically

"apple" in d               # True
d.search("ap*")            # wildcard
d.search_within_distance("aple", dist=1)

Batch APIs

Batch methods process a list of inputs in one call, running all items in parallel via Rayon. They avoid repeated FFI boundary crossings and deliver 2–6× speedup over a Python loop calling the single-item equivalents.

words = ["apple", "apply", "cherry", "apt"]

t.batch_contains(words)
# [True, True, False, True]

t.batch_search(["ap*", "b*", "xyz*"])
# [["apple", "apply", "apt"], ["banana"], []]

t.batch_search_with_prefix(["app", "ban"])
# [["apple", "apply"], ["banana"]]

t.batch_search_within_distance(["aple", "bannana"], dist=1)
# [["apple"], ["banana"]]

Results are always returned in the same order as the input list. An unmatched pattern or prefix returns an empty inner list. dist defaults to 0.

Wildcard syntax

Pattern Meaning
* Zero or more characters
? Exactly one character
h* All words starting with h
?at Three-letter words ending in at
a?* Words of two or more characters starting with a

Consecutive wildcards are normalized (***, ?**).

Production HTTP server

lexrs ships a production-ready HTTP service in lexrs-server/ with two binaries:

Binary Role
writer Accepts word ingestion (POST /words), buffers a delta Trie in memory, and periodically compacts it into a versioned snapshot on disk.
reader Loads the latest snapshot into a DAWG and serves all search queries. Multiple readers can run as replicas and hot-reload new snapshots without downtime.

Consul is used for snapshot version coordination — the writer publishes new snapshot versions to the Consul KV store and readers watch for changes via blocking queries, atomically swapping the in-memory DAWG when a new snapshot arrives.

Docker Compose — full stack in one command

The docker/ directory has a Compose file that brings up the entire stack:

                 nginx  (port 80)
                   │
        ┌──────────┴───────────┐
        ▼                      ▼ (round-robin)
      writer              reader-1   reader-2
        │                      ▲          ▲
        ├──► snapshots ─────────┴──────────┘
        │    (shared volume)
        └──► Consul KV ──► readers watch & hot-reload
cd docker
docker compose up -d
# ingest words with optional per-word counts
curl -X POST http://localhost/words \
  -H 'Content-Type: application/json' \
  -d '{"words": [{"word": "apple", "count": 10}, {"word": "apply", "count": 3}, "apt"]}'

# flush to snapshot (readers reload within ~30 s)
curl -X POST http://localhost/compact

# prefix search
curl 'http://localhost/prefix?q=app'

# wildcard search
curl 'http://localhost/search?q=app*'

# fuzzy search (Levenshtein distance ≤ 1)
curl 'http://localhost/search?q=aple&dist=1'

# exact lookup
curl 'http://localhost/contains?q=apple'

# batch membership
curl -X POST http://localhost/batch/contains \
  -H 'Content-Type: application/json' \
  -d '{"words": ["apple", "cherry", "apricot"]}'

# batch prefix
curl -X POST http://localhost/batch/prefix \
  -H 'Content-Type: application/json' \
  -d '{"prefixes": ["app", "ban"]}'

The writer and reader are built from a single Dockerfile — the command: field in Compose selects which binary to start. See the server docs for the full API reference and configuration options.

Running tests

# Rust unit tests
cargo test

# Python tests
pip install pylexrs pytest
pytest tests/

Project structure

lexrs/src/
  lib.rs      — public re-exports and Python module registration
  trie.rs     — Trie implementation + wildcard / Levenshtein helpers
  dawg.rs     — DAWG implementation
  python.rs   — PyO3 bindings (Trie, DAWG, batch APIs)
  node.rs     — shared Node type (arena-allocated)
  utils.rs    — file I/O and pattern normalization
  error.rs    — LexError enum
lexrs/tests/
  trie_tests.rs — Rust unit tests for Trie
  dawg_tests.rs — Rust unit tests for DAWG
lexrs-server/src/
  writer.rs   — word ingestion + compaction binary
  reader.rs   — search server + batch endpoints binary
  consul.rs   — Consul KV watch / publish helpers
  snapshot.rs — snapshot file I/O
tests/
  test_python_api.py — pytest suite for the Python bindings

Related components

Directory Description
lexrs-server/ Production HTTP server — a writer binary for word ingestion and compaction, and a reader binary for search. Readers scale horizontally and hot-reload snapshots via Consul.
docker/ Docker Compose setup running the full stack: Consul, writer, two reader replicas, and an nginx reverse proxy that routes reads and writes to the right service.
benchmarks/ Python scripts comparing pylexrs against lexpy (pure Python) across insertion, prefix, wildcard, Levenshtein, and batch workloads.
tests/ Integration tests — pytest suite for the Python API and Rust-level integration tests.

License

MIT

About

A Rust lexicon library (Trie + DAWG) with Python bindings and a horizontally-scalable HTTP search server.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors