
sketchir

Sketching primitives for information retrieval (IR): MinHash, SimHash, and LSH-style signatures

3 releases

0.1.2 Jan 26, 2026
0.1.1 Jan 26, 2026
0.1.0 Jan 26, 2026

#788 in Text processing

Download history: 7/week @ 2026-01-20, 818/week @ 2026-01-27, 725/week @ 2026-02-03

1,550 downloads per month

MIT/Apache

37KB
828 lines

sketchir: sketching primitives for IR.

This crate is intended for index-only similarity sketches used in:

  • near-duplicate detection (MinHash / shingles)
  • text fingerprinting (SimHash)
  • approximate similarity search (LSH-style candidate generation)
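
To make the first and third use cases concrete, here is a minimal, std-only sketch of MinHash over word shingles with LSH-style band keys. It is not sketchir's API: every name here (`shingles`, `minhash_signature`, `estimate_jaccard`, `band_keys`) is hypothetical, and the choice of FNV-1a plus a splitmix64-style mixer is an assumption made only to keep the example self-contained.

```rust
use std::collections::HashSet;

/// FNV-1a 64-bit hash (fixed constants, so results do not vary between runs).
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// splitmix64-style mixer, used here to derive a family of hash functions.
fn mix64(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9e3779b97f4a7c15);
    x = (x ^ (x >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

/// Hypothetical helper: hash every window of `k` consecutive words.
/// Texts shorter than `k` words yield a single shingle covering the whole text.
fn shingles(text: &str, k: usize) -> HashSet<u64> {
    let words: Vec<&str> = text.split_whitespace().collect();
    if words.is_empty() || k == 0 {
        return HashSet::new();
    }
    let k = k.min(words.len());
    words
        .windows(k)
        .map(|w| fnv1a64(w.join(" ").as_bytes()))
        .collect()
}

/// Hypothetical helper: for each of `num_hashes` seeded hash functions,
/// keep the minimum hash value over the shingle set.
fn minhash_signature(shingles: &HashSet<u64>, num_hashes: usize) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|i| {
            let seed = mix64(i);
            shingles
                .iter()
                .map(|&s| mix64(s ^ seed))
                .min()
                .unwrap_or(u64::MAX) // empty input: sentinel value
        })
        .collect()
}

/// The fraction of matching signature positions estimates Jaccard similarity.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "signatures must have equal length");
    if a.is_empty() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}

/// Hypothetical helper: split a signature into bands of `rows_per_band` values
/// and hash each band to one key. Two documents become candidates when they
/// share a key at the same band position.
fn band_keys(sig: &[u64], rows_per_band: usize) -> Vec<u64> {
    sig.chunks(rows_per_band.max(1))
        .enumerate()
        .map(|(i, rows)| rows.iter().fold(mix64(i as u64), |acc, &v| mix64(acc ^ v)))
        .collect()
}
```

In a sketch like this, more rows per band make candidate generation stricter (fewer false positives), while more bands make it more permissive; the right split depends on the similarity threshold being targeted.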

The scope is limited to primitives: signatures, basic indexing, and deterministic behavior. Higher-level workflows (crawl dedupe pipelines, content extraction, etc.) belong elsewhere.
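
As an illustration of what a deterministic signature primitive can look like, the following is a minimal SimHash sketch. It is an assumption-laden example and not the crate's implementation: the function names (`simhash`, `hamming`), the whitespace tokenizer, the unweighted tokens, and the FNV-1a token hash are all hypothetical choices. A fixed-constant hash is used so that the same input yields the same fingerprint on every run.

```rust
/// FNV-1a 64-bit hash with fixed constants (no per-process random seed).
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Hypothetical 64-bit SimHash: each token votes +1 or -1 on every bit
/// position according to its hash; the fingerprint keeps the bits whose
/// vote total is positive. All tokens carry equal weight in this sketch.
fn simhash(text: &str) -> u64 {
    let mut votes = [0i32; 64];
    for token in text.split_whitespace() {
        let h = fnv1a64(token.as_bytes());
        for (bit, vote) in votes.iter_mut().enumerate() {
            if (h >> bit) & 1 == 1 {
                *vote += 1;
            } else {
                *vote -= 1;
            }
        }
    }
    votes
        .iter()
        .enumerate()
        .fold(0u64, |acc, (bit, &v)| if v > 0 { acc | (1u64 << bit) } else { acc })
}

/// Number of differing bits between two fingerprints; a small distance
/// suggests similar token content.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

Because the votes are summed per bit, this fingerprint ignores token order, and a one-token input reduces to that token's hash. Near-duplicate candidates would typically be pairs whose Hamming distance falls below a threshold chosen by the caller.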

Dependencies

~0.5–1MB
~19K SLoC