Implement MinLZ Stream Search Tables #35
Open
klauspost wants to merge 26 commits into
Implements - and includes minio#31

Allows for text search inside compressed data.

# Block Search Tables

## What it does

MinLZ streams can include optional per-block bloom filter tables that enable searching compressed data without decompressing every block. Blocks that definitely don't contain the search pattern are skipped entirely via `io.Seeker` (single syscall) or buffered read.

## Encoding

```go
cfg := minlz.NewSearchTableConfig()    // matchLen defaults to 6
cfg = cfg.WithBytePrefix('"', ':')     // optional: only index after these bytes
w := minlz.NewWriter(output, minlz.WriterSearchTable(cfg))
```

- `NewSearchTableConfig()` -- no arguments, defaults to matchLen=6
- `WithMatchLen(n)` -- override match length (1-8)
- `WithBytePrefix(bytes...)` -- type 2 (<=8 bytes) or auto-promotes to type 3 (bitmask) for >8
- `WithMaskPrefix(mask)` -- type 3 (256-bit bitmask)
- `WithLongPrefix(prefix)` -- type 4 (multi-byte prefix, 1-256 bytes)
- `WithMaxPopulation(pct)` -- skip table if >N% bits set (default 70)
- `WithMaxConflicts(pct)` -- stop reducing at >N% conflicts (default 25)

Tables are generated in compression goroutines with zero extra channel overhead (single result per block). Only compressible blocks get tables. Empty tables (prefix not found in block) are reduced to a minimum of 32 bytes -- still useful for proving absence. Base table size = 1 bit per uncompressed byte, automatically derived from block size. Per-block reductions adapt based on population/conflict thresholds.

## Searching

```go
searcher := minlz.NewBlockSearcher(input)
err := searcher.Search([]byte("pattern"), func(r minlz.SearchResult) bool {
	// r.Data = decoded block, r.BlockStart = uncompressed offset
	// r.PrevBlock() = previous decoded block (nil if skipped/first)
	return true // continue
})
stats := searcher.Stats()
stats.Fprint(os.Stderr)
```

- Uses `io.Seeker` when available for O(1) block skipping
- For prefix tables, scans the search pattern for prefix bytes **anywhere inside it** -- `stamp":"1679` works with prefix `"` because `"` appears at position 5
- Empty prefix tables (all zeros) correctly skip blocks where the prefix byte never appears
- Falls back to full decode when tables are absent or incompatible
- `BlockSearchBailOnMissing()` to error instead of falling back
- `SearchResult` struct with `PrevBlock()` for boundary matching

## Stats

```
Blocks total: 2505, skipped: 2331, searched: 174
Skip rate: 93.1%
Table bits/byte: 0.0041, log2: 8.1, avg reductions: 13.9
Table total: 5368528 bytes, avg 2143 bytes/table, 0.05% of 10506623721 uncompressed
Table population: avg 21.3%, min 0.0%, max 46.9%
```

## CLI (mz)

```
mz c -search=8 -search.prefixes='":' file.log   # byte prefixes
mz c -search=4 -search.prefix='id:"' file.log   # long prefix (single byte auto-promotes)
mz search -v "pattern" file.log.mz              # search with stats
mz search -l -n "pattern" file.log.mz           # line mode with line numbers
mz search -c "pattern" file.log.mz              # count only
mz search -q "pattern" file.log.mz              # exit code only
```

## Performance (10GB cockroach log, Ryzen 9 9950X)

| Config | Skip rate | Search time | Throughput | Table overhead |
|--------|-----------|-------------|------------|----------------|
| matchLen=8, prefix `":` | 99.9% | 144ms | 55 GB/s | 0.05% |
| matchLen=4, prefix `":` | 93.1% | 191ms | 44 GB/s | 0.05% |
| matchLen=4, prefix absent | 100% | 17ms | 618 GB/s | 0.001% |
| matchLen=8, no prefix | 99.9% | 144ms | 55 GB/s | 2.1% |

**Indexing**: ~1.2 GB/s no-prefix, ~2.1 GB/s with prefix (single core)
**Pattern lookup**: 37ns, zero allocations

## Wire format

Two new skippable chunk types (backward compatible -- old readers silently skip them):

- `0x44` -- Search table info (per-stream): table type, match length, base size, prefixes
- `0x45` -- Block search table (per-block): reductions + bit array

Spec: [SPEC_SEARCH.md](SPEC_SEARCH.md) with Appendix A on searcher lookup strategies.
Force-pushed from `3b116f4` to `540b4e9`
harshavardhana (Collaborator) approved these changes on Mar 27, 2026

klauspost (Author): (still fine tuning)
# Stream Searching

## Introduction
MinLZ streams can include optional per-block hash tables that allow searching
compressed data without decompressing every block. When a block's table indicates the
search pattern is definitely absent, the block is skipped entirely — via
`io.Seeker` if available (single syscall), or by buffering past the compressed data.
Streams with search tables remain fully backward-compatible: readers that don't
understand the table chunks silently skip them, and the compressed data is unchanged.
Search tables are generated during compression with zero impact on the compressed data
itself — they are stored as additional skippable chunks interleaved between blocks.
For the format specification, see [SPEC_SEARCH.md](SPEC_SEARCH.md).
## How It Works

Each block's search table is a bit array: every position in the uncompressed data
is hashed and the corresponding bit is set. When searching, the pattern's byte windows
are hashed and checked against the table. If any window's bit is unset, the pattern
cannot be present in the block.
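To make the mechanism concrete, here is a minimal, self-contained sketch of the idea. It is not the MinLZ implementation; `hashWindow`, `buildTable`, `mightContain`, the FNV-1a hash, and the constants are all illustrative choices.

```go
package main

import "fmt"

const (
	matchLen  = 4       // bytes hashed per window (illustrative)
	tableBits = 1 << 12 // bit array size, a power of two (illustrative)
)

// hashWindow hashes the first matchLen bytes of p to a bit index (FNV-1a).
func hashWindow(p []byte) uint32 {
	var h uint32 = 2166136261
	for _, b := range p[:matchLen] {
		h = (h ^ uint32(b)) * 16777619
	}
	return h & (tableBits - 1)
}

// buildTable sets one bit per byte window of the uncompressed block.
func buildTable(block []byte) []uint64 {
	table := make([]uint64, tableBits/64)
	for i := 0; i+matchLen <= len(block); i++ {
		bit := hashWindow(block[i:])
		table[bit/64] |= 1 << (bit % 64)
	}
	return table
}

// mightContain reports false only when the pattern is definitely absent:
// every window of the pattern must have its bit set in the table.
func mightContain(table []uint64, pattern []byte) bool {
	for i := 0; i+matchLen <= len(pattern); i++ {
		bit := hashWindow(pattern[i:])
		if table[bit/64]&(1<<(bit%64)) == 0 {
			return false // definite miss: the block can be skipped
		}
	}
	return true // possible hit: the block must be decoded
}

func main() {
	block := []byte(`{"level":"warn","msg":"disk full"}`)
	table := buildTable(block)
	fmt.Println(mightContain(table, []byte(`"warn"`))) // true: present patterns never miss
	// An absent pattern is almost certainly reported false with a sparse table,
	// but a false positive is possible by design.
	fmt.Println(mightContain(table, []byte(`kubernetes`)))
}
```

Note the asymmetry: a `false` result is a proof of absence, while `true` only means the block must be decoded and searched.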
Tables are reduced (halved) by OR-folding the upper and lower halves, trading accuracy
for size. Reductions are applied per-block based on population density.
Longer search patterns produce more window checks, giving exponentially better
filtering. For example, a 19-byte pattern with matchLen=8 produces 12 window
checks — all 12 must match for a false positive, which is extremely unlikely
with typical table populations of 10–30%.
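The arithmetic behind that example can be sketched quickly (hypothetical `windows` helper, not library code):

```go
package main

import (
	"fmt"
	"math"
)

// windows returns the number of hash windows a pattern of length n
// produces with a given match length: n - matchLen + 1.
func windows(n, matchLen int) int {
	return n - matchLen + 1
}

func main() {
	w := windows(19, 8) // the 19-byte pattern with matchLen=8
	fmt.Println(w)      // 12 window checks

	// If p is the fraction of bits set in the table, the chance that an
	// absent pattern still hits set bits in all windows is roughly p^w.
	p := 0.30
	fmt.Printf("%.2g\n", math.Pow(p, float64(w))) // about 5.3e-07
}
```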
Table sizes are limited to 1 bit per uncompressed byte. This means the tables will be
at most 1/8th of the uncompressed stream size. These base tables also have a maximum
population count, defaulting to 70%: if the 8:1 table is filled beyond this threshold,
it is not saved to the stream.
This means that blocks with near-random data will not have any tables,
and searching will have to fall back to decompression.
MinLZ will not attempt to generate tables for incompressible blocks.
## Parameters

### Match Length

Controls how many bytes at each position are hashed into the table. Range: 1–8,
default: 6.

A smaller match length allows the table to be used with
shorter search patterns. However, shorter hashes collide more, increasing table
population.
On the other hand, a search pattern of length N only produces
`N - matchLen + 1` hash windows
to check. Fewer windows means fewer independent chances to prove a block doesn't
contain the pattern, which can reduce skip rates. Higher match lengths also produce
lower base population, which means fewer reductions and larger tables on disk.
The match length must be less than or equal to the search pattern length. Patterns
shorter than the match length cannot use the table (the searcher falls back to full
decode).
A good default is 6: it balances table density against the number of check windows.
Use 4 for short patterns (e.g. short IDs), but be aware that short windows from common
character classes (digits, hex, lowercase) will appear in nearly every block, collapsing
skip rates. For example, searching numeric data with matchLen=4 can drop skip rates to
single digits because 4-byte digit sequences are ubiquitous.
### Table Max Population
Maximum percentage of bits that may be set in the base table before it is discarded
entirely. Default: 70%.
When a block's data is highly random or the match length is short, most hash slots get
filled and the table loses its ability to prove absence. Tables exceeding this threshold
are dropped — the block will always be decoded during search.
Lowering this value makes the compressor more aggressive about discarding noisy tables,
reducing overhead at the cost of fewer indexed blocks. Raising it keeps more tables but
with higher false-positive rates.
### Table Reduction Limit

Maximum population percentage of the reduced table. Library default: 25%.
The `mz` CLI defaults to 50% without prefix and 25% with prefix.

After the base table is built, it is iteratively halved by folding (OR-ing) the upper
half into the lower half. Each reduction halves the table size but increases the
population density. Reductions stop before this threshold is exceeded.
Lower values produce smaller tables (fewer bytes per block in the compressed stream) but
with more false positives. Higher values keep larger, sparser tables that skip more blocks.
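The fold itself is simple to sketch. The `reduce` and `population` helpers below are hypothetical names illustrating the idea, not the MinLZ code:

```go
package main

import "fmt"

// reduce OR-folds the upper half of a bit table into the lower half,
// halving its size. A bit originally at index i is afterwards probed
// at i modulo the new bit count, so lookups stay valid but collide more.
func reduce(table []uint64) []uint64 {
	half := len(table) / 2
	out := make([]uint64, half)
	for i := range out {
		out[i] = table[i] | table[half+i]
	}
	return out
}

// population counts set bits; the compressor stops reducing before the
// population threshold would be exceeded.
func population(table []uint64) int {
	n := 0
	for _, w := range table {
		for ; w != 0; w &= w - 1 {
			n++
		}
	}
	return n
}

func main() {
	table := []uint64{0b0001, 0b0000, 0b0100, 0b1000}
	smaller := reduce(table)
	fmt.Println(smaller, population(smaller)) // [5 8] 3
}
```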
### Prefixes

Prefix filtering dramatically reduces table size for structured data by only indexing
positions that follow specific bytes. For example, in JSON data, values always follow
`"` or `:`, so most byte positions can be skipped during indexing.

Since fewer positions are indexed, the base table population is much lower, allowing
more reductions and producing significantly smaller tables on disk. The downside is that
search patterns must contain at least one prefix byte (or match the long prefix) for
the table to be usable. Patterns without any prefix bytes fall back to full block decode.
There are two prefix modes:

#### Single byte

Single-byte prefixes index only positions preceded by one of the given bytes.
`WithBytePrefix` accepts up to 8 bytes directly; for more than 8, use `WithMaskPrefix`
with a 256-bit bitmask. Both produce the same result.
#### Long prefix

A long prefix indexes only positions preceded by an exact multi-byte sequence (1–256 bytes).

The searcher scans the search pattern for prefix bytes anywhere inside it. For example,
searching for `"unique-9876"` with byte prefix `"` works because `"` appears at position 0
in the pattern. The table is consulted for the hash window that follows each prefix
occurrence in the pattern.

When no prefix bytes appear in the search pattern, the table cannot be used and the
searcher falls back to full block decode.
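That scan can be sketched as follows, assuming a hypothetical `prefixWindows` helper (the real searcher's internals may differ):

```go
package main

import "fmt"

// prefixWindows returns, for every prefix byte found anywhere in the
// pattern, the matchLen-byte hash window that follows it. Each returned
// window is then checked against the block's table.
func prefixWindows(pattern []byte, prefixes map[byte]bool, matchLen int) [][]byte {
	var wins [][]byte
	for i, b := range pattern {
		if prefixes[b] && i+1+matchLen <= len(pattern) {
			wins = append(wins, pattern[i+1:i+1+matchLen])
		}
	}
	return wins
}

func main() {
	// The prefix byte `"` occurs at positions 5 and 7 of the pattern,
	// so two windows are checked against the table.
	for _, w := range prefixWindows([]byte(`stamp":"1679`), map[byte]bool{'"': true}, 4) {
		fmt.Printf("%q\n", w)
	}
}
```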
#### Choosing good prefix bytes

Pick bytes that immediately precede the values you'll search for:

- `"` and `:` — values always follow `":` or `:[`
- `,` or `\t` — field separators
- `=` — precedes values in `key=value` data

## Command line
The `mz` tool supports search table generation during compression and pattern search
on compressed files.
Compression with search tables:
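These flags are taken from the PR summary; `-search` sets the match length and `-search.prefixes`/`-search.prefix` set byte or long prefixes:

```shell
mz c -search=8 -search.prefixes='":' file.log   # byte prefixes
mz c -search=4 -search.prefix='id:"' file.log   # long prefix (single byte auto-promotes)
```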
Searching compressed files:
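The search subcommand and its flags, as listed in the PR summary:

```shell
mz search -v "pattern" file.log.mz      # search with stats
mz search -l -n "pattern" file.log.mz   # line mode with line numbers
mz search -c "pattern" file.log.mz      # count only
mz search -q "pattern" file.log.mz      # exit code only
```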
When tables are absent, the searcher decodes every block (equivalent to decompress + grep).
With tables present, only blocks that might contain the pattern are decoded.
## API reference

### Compression

Enable search tables by passing `WriterSearchTable` as a writer option.

Tables are generated concurrently alongside compression with no extra goroutine
synchronization overhead. The `Writer` handles all table generation and chunk
serialization automatically.
Configuration methods:

- `NewSearchTableConfig()`
- `WithMatchLen(n)`
- `WithBytePrefix(b...)`
- `WithMaskPrefix(mask)`
- `WithLongPrefix(p)`
- `WithMaxPopulation(pct)`
- `WithMaxReducedPopulation(pct)`

Decompressing the stream will ignore the search tables.
### Searching

Search compressed streams using `BlockSearcher`. The callback receives a
`SearchResult` for each match:

- `Blocks [2][]byte` — `[0]` = previous block (nil if skipped/lazy), `[1]` = current block
- `Offset int` — offset of the match within `PrevBlock()` + `Blocks[1]`
- `StreamOffset int64`
- `BlockStart int64`
- `PrevBlockLen int` — length of the `PrevBlock()` data

Methods on `SearchResult`:

- `PrevBlock() []byte` — returns the previous block's data. Lazily decompresses if the
  previous block was skipped by the index. Returns nil if no previous block exists.
Return values from the callback:

- `nil` — continue searching
- `ErrSearchForward` — request the next block for forward context; the searcher will
  re-call the callback with the same match but `Blocks[1]` replaced by the next block

Searcher options:

- `BlockSearchBailOnMissing()` — return `ErrSearchTablesUnusable` if tables are absent or incompatible
- `BlockSearchIgnoreCRC()`
- `BlockSearchMaxBlockSize(n)`

After `Search` returns, call `Stats()` for a `SearchStats` struct with block counts,
skip rates, table population metrics, and byte-level statistics. Use
`stats.Fprint(os.Stderr)` for a human-readable summary.

Note that the maximum backreference on matches is limited by the block size.
So a match right after a block boundary will only have the previous block's data available.
Spec: SPEC_SEARCH.md with Appendix A on searcher lookup strategies.