go run main.go
- Reads
Chunk
byChunkSize
frominput/input.bin
- Initializes two processors:
beginning
(readsChunk
from start)inverse
(readsChunk
from end)
Common logic for both processors:
Before processing each Crumb, checks if any counter reached CounterValue
:
- If any counter ≥
CounterValue
:- All counters are halved (integer division)
- Threshold check (divide all by 2 if any ≥
CounterValue
) - Sort counters (descending by
Value
) - Check actualization zone (top
ActualizationValue
% counters):- If match in zone:
Matches
+1, counter +PredictIncrement
- If match in zone:
- Check other counters:
- If match outside zone:
Matches
+1, counter +Increment
- If match outside zone:
- Create new counter (if no match: create counter with
ID = Crumb value
) - Save state (to
*.bin
and*_matches.bin
files) - Filtration check (if
matches/crumbs
≥FiltrationValue
):- Remove bottom
FiltrationValue
% counters
- Remove bottom
- Input:
Chunk = [0x01, 0x02, 0x03, 0x04, 0x05]
(CrumbSize = 2
) - Beginning processor:
- Processes:
[0x01,0x02] → [0x03,0x04]
- Remainder:
[0x05]
(padded:[0x05,0x00]
)
- Processes:
- Inverse processor:
- Processes:
[0x05,0x04] → [0x03,0x02]
- Remainder:
[0x01]
(padded:[0x01,0x00]
)
- Processes:
- For large files: set
"ChunkSize">65536
in config - For faster processing: increase
"CounterValue"
and decrease"FiltrationValue"
- Output files structure:
-
Main files (
*_begin.bin
,*_inverse.bin
):- Binary format:
key (4B) + value (4B)
per entry - Key: numeric Crumb value (2-byte sequence)
- Value: occurrence counter
- Format:
struct.pack(">II", key, value)
(big-endian)
- Binary format:
-
Stats file (
*_matches.bin
):- Contains:
0: total_matches
(total matches)1: total_crumbs
(total processed crumbs)
- Same format:
struct.pack(">II", 0, total_matches)
etc.
- Contains: