Experiment : pre-splitter #4136

Draft · wants to merge 6 commits into base: dev
Conversation

Cyan4973 (Contributor) commented Sep 3, 2024

This is currently just an experiment:

Instead of ingesting full blocks only (128 KB),
make an a-priori analysis of the data,
and infer a position to split, if useful.

This leads to small yet non-trivial compression gains, sometimes equivalent to a full compression level, for a comparatively small speed cost.
The benefit is larger when there is no post-splitter (one already runs at higher strategies, btultra and above),
but even when there is one, a small compression-ratio benefit remains, making the pre-splitter desirable for higher compression modes too.

However, the dynamic analysis is not free, so it's currently reserved for higher compression strategies (currently btlazy2 and above), where the speed cost can be considered "negligible" (< 5%).
For other modes, it is replaced by a static block-size strategy, which is no longer limited to 128 KB. Through tests, a 92 KB block size appears preferable, scoring a higher compression ratio at a very small compression-speed loss (due to the increased number of blocks, hence of block headers).
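
As an illustration of what such an a-priori analysis could look like, here is a minimal sketch assuming a simple histogram-drift heuristic: scan the input in small chunks, and propose a split once local byte statistics drift away from the block accumulated so far. This is not the PR's actual algorithm; every name, chunk size, and threshold below is invented for illustration.

```c
#include <stddef.h>

#define CHUNK_SIZE (8 * 1024)     /* granularity of the scan (invented) */
#define MAX_BLOCK  (128 * 1024)   /* never exceed the regular block size */

/* L1 distance between two byte histograms */
static unsigned histDistance(const unsigned a[256], const unsigned b[256])
{
    unsigned d = 0;
    for (int i = 0; i < 256; i++)
        d += (a[i] > b[i]) ? (a[i] - b[i]) : (b[i] - a[i]);
    return d;
}

/* Returns a proposed block boundary in (0, MAX_BLOCK];
 * falls back to a full block when no statistical shift is detected. */
static size_t proposeSplit(const unsigned char* src, size_t srcSize)
{
    unsigned acc[256] = {0};   /* statistics of the block built so far */
    size_t const limit = (srcSize < MAX_BLOCK) ? srcSize : MAX_BLOCK;
    size_t pos = 0;
    while (pos + CHUNK_SIZE <= limit) {
        unsigned cur[256] = {0};
        for (size_t i = 0; i < CHUNK_SIZE; i++)
            cur[src[pos + i]]++;
        if (pos > 0) {
            /* compare this chunk to the average chunk seen so far */
            unsigned avg[256];
            unsigned const nbChunks = (unsigned)(pos / CHUNK_SIZE);
            for (int i = 0; i < 256; i++) avg[i] = acc[i] / nbChunks;
            if (histDistance(avg, cur) > CHUNK_SIZE / 2)  /* arbitrary threshold */
                return pos;    /* statistics shifted: split here */
        }
        for (int i = 0; i < 256; i++) acc[i] += cur[i];
        pos += CHUNK_SIZE;
    }
    return limit;
}
```

The cost of a scan like this is exactly the "not free" part discussed above, which is why the dynamic analysis is reserved for strategies where it stays negligible.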

silesia.tar :

| level | dev (bytes) | this PR (bytes) | savings | note |
|---|---|---|---|---|
| 1 | 73422432 | 73324476 | -97956 | new static block size |
| 12 | 58213308 | 58149669 | -63639 | last static block size |
| 13 | 57994603 | 57720597 | -274006 | first dynamic size |
| 15 | 57176243 | 56905822 | -270421 | last dynamic size alone |
| 16 | 55338671 | 55256977 | -81694 | combine pre+post splitting |
| 19 | 52887208 | 52842629 | -44579 | idem |
| 22 | 52322760 | 52284324 | -38436 | idem |

calgary.tar :

| level | dev (bytes) | this PR (bytes) | savings | note |
|---|---|---|---|---|
| 1 | 1143607 | 1129229 | -14378 | new static block size |
| 12 | 930831 | 924326 | -6505 | last static block size |
| 13 | 922061 | 914326 | -7735 | first dynamic size |
| 15 | 921606 | 912055 | -9951 | last dynamic size alone |
| 16 | 882353 | 881217 | -1136 | combine pre+post splitting |
| 19 | 861336 | 859737 | -1599 | idem |
| 22 | 861200 | 859578 | -1622 | idem |

It might be preferable to consider this parameter as yet another knob that could be enabled by the trainer.
But this would first require a refactor, to make this parameter independent.

Todo:

  • Employ workspace for allocation
  • Make the parameter selectable (a hypothetical usage sketch follows below)
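
A hypothetical sketch of what "selectable" could look like through zstd's usual advanced-parameter API; `ZSTD_c_preSplitBlocks` is an invented name, not an actual zstd.h parameter:

```c
#include <zstd.h>
#include <stddef.h>

/* Hypothetical: enable the pre-splitter explicitly on a compression context.
 * ZSTD_c_preSplitBlocks does not exist in zstd.h; it stands in for whatever
 * name the parameter would eventually get. */
static ZSTD_CCtx* createPreSplitCCtx(int compressionLevel)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    if (cctx == NULL) return NULL;
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, compressionLevel);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_preSplitBlocks, 1);  /* invented knob */
    return cctx;
}
```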

Commits:

  • instead of ingesting only full blocks, make an analysis of data, and infer where to split.
  • for better portability on Linux kernel (though I really wonder if this is a property worth maintaining).
Cyan4973 (Contributor, Author) commented Sep 4, 2024

mmmh,
this could be a problem:

```
test 45 : in-place decompression : Error : test (margin <= ZSTD_DECOMPRESSION_MARGIN(CNBuffSize, ZSTD_BLOCKSIZE_MAX)) failed
```

I suspect the problem is that the macro ZSTD_DECOMPRESSION_MARGIN() makes an assumption that blocks are always full (128 KB), and derives from it an assumption about maximum expansion, which is no longer respected when block sizes default to 92 KB (at lower compression levels). In contrast, ZSTD_decompressionMargin() scans the compressed data, hence ends up with a different result, and the two now disagree.

I'm not completely sure what the wanted behavior is here...

edit: confirmed that, when I change the default block size to anything other than 128 KB, it breaks this test.
On the other hand, the ZSTD_DECOMPRESSION_MARGIN() macro requires a blockSize parameter, and the test passes ZSTD_BLOCKSIZE_MAX, aka 128 KB, so it's pretty clear what it expects.
So the question is: how could this test pass "reasonably", i.e. without inner knowledge of how the internal block-splitting decision is made?
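
For intuition, here is a back-of-the-envelope sketch of the suspected mismatch. It leans on the overhead model documented for ZSTD_DECOMPRESSION_MARGIN() in zstd.h (roughly 3 bytes of block header per block, plus one block of margin); the buffer size is invented, and this is an illustration of the hypothesis above, not the actual zstd sources:

```c
#include <stdio.h>
#include <stddef.h>

#define BLOCK_HEADER_BYTES 3   /* per-block header cost in the margin model */

static size_t headerOverhead(size_t srcSize, size_t blockSize)
{
    size_t const nbBlocks = (srcSize + blockSize - 1) / blockSize;
    return nbBlocks * BLOCK_HEADER_BYTES;
}

int main(void)
{
    size_t const srcSize = (size_t)10 << 20;  /* e.g. a 10 MB test buffer */
    /* Overhead the macro accounts for, assuming full 128 KB blocks: */
    size_t const assumed = headerOverhead(srcSize, 128 << 10);
    /* Overhead actually present once blocks default to 92 KB: */
    size_t const actual  = headerOverhead(srcSize, 92 << 10);
    printf("assumed (128 KB blocks): %zu bytes of block headers\n", assumed);
    printf("actual   (92 KB blocks): %zu bytes of block headers\n", actual);
    /* The extra block headers are real bytes in the compressed frame, so a
     * margin measured from the data (ZSTD_decompressionMargin()) can end up
     * above the static estimate computed by
     * ZSTD_DECOMPRESSION_MARGIN(CNBuffSize, ZSTD_BLOCKSIZE_MAX). */
    return 0;
}
```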
