Improve compression speed through data preloading and reordering of search range upper bound check #1666

BiplabRaut · 2025-10-23T10:25:45Z

- Data is preloaded for byte matching within the inner loop in LZ4_compress_generic_validated.
- Order of LZ4_DISTANCE_MAX check for match is changed to allow better branch predictor behavior for noDict directive.

Cyan4973 · 2025-10-23T13:05:48Z

On which platforms is this PR presumed to be beneficial ?
Please document the expected gains in these cases
Are there other platforms on which this patch is suspected to be detrimental ?

Cyan4973

The asan / valgrind errors are very consequential.
They must be fixed before moving the patch forward.

BiplabRaut · 2025-11-01T08:09:40Z

> The asan / valgrind errors are very consequential. They must be fixed before moving the patch forward.

Hi Yann,
The issues have been fixed and all the tests are now passing. Thanks

BiplabRaut · 2025-11-01T08:34:13Z

> On which platforms is this PR presumed to be beneficial ?
>
> Please document the expected gains in these cases
>
> Are there other platforms on which this patch is suspected to be detrimental ?

The code change adds two optimizations to speed up compression (also mentioned in the commit log message):

- Preloading of input and match bytes, by accessing them before they are used for comparison.
- Delaying the condition check involving LZ4_DISTANCE_MAX in the noDict case, to help branch prediction.
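
To make the two ideas concrete, here is a minimal sketch. It is not the code from this PR and not LZ4's actual inner loop. The helper names (`read32`, `is_match_baseline`, `is_match_preload`, `fill_pattern`, `count_disagreements`) and the macro `SKETCH_DISTANCE_MAX` are hypothetical; only the value 65535 is taken from LZ4's default `LZ4_DISTANCE_MAX`. The sketch assumes the candidate position always lies inside the buffer, so loading it before the distance check is safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative bound; 65535 is the default value of LZ4_DISTANCE_MAX. */
#define SKETCH_DISTANCE_MAX 65535u

/* Portable unaligned 32-bit read. */
static uint32_t read32(const uint8_t* p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Baseline order: reject far candidates first, then load and compare. */
static int is_match_baseline(const uint8_t* base, uint32_t cur, uint32_t cand)
{
    assert(cand < cur);
    if (cur - cand > SKETCH_DISTANCE_MAX) return 0;
    return read32(base + cand) == read32(base + cur);
}

/* Reordered variant: both words are loaded up front, the 4-byte compare
 * is tested first, and the distance bound is only evaluated afterwards.
 * Assumption: base + cand is always readable, whatever the distance. */
static int is_match_preload(const uint8_t* base, uint32_t cur, uint32_t cand)
{
    assert(cand < cur);
    const uint32_t curWord  = read32(base + cur);
    const uint32_t candWord = read32(base + cand);
    if (curWord != candWord) return 0;
    return (cur - cand) <= SKETCH_DISTANCE_MAX;
}

/* Fill a buffer with a period-7 pattern so that far-away matches exist. */
static void fill_pattern(uint8_t* buf, size_t n)
{
    for (size_t i = 0; i < n; i++) buf[i] = (uint8_t)(i % 7);
}

/* Count candidates for which the two variants return different answers. */
static size_t count_disagreements(const uint8_t* base, uint32_t cur)
{
    size_t n = 0;
    for (uint32_t cand = 0; cand < cur; cand++) {
        n += (is_match_baseline(base, cur, cand) != is_match_preload(base, cur, cand));
    }
    return n;
}
```

Both variants accept exactly the same candidates; only the order of the work differs. Whether the reordered form is faster depends on the CPU and compiler, which is what the benchmarks in this thread are meant to establish.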

These optimizations help compression speed on x86 CPUs in general and may also benefit Arm CPUs (not tested).
As these changes are mostly generic memory and code optimizations, without any assembly or architecture-specific changes, we don't expect any platform to suffer.

In our tests on Zen4 and Zen5 based AMD CPUs (with GCC 14.2), we observed the following performance benefits:
- 8.8% compression speedup for Silesia
- 13% compression speedup for FreeBSD

Thanks.

BiplabRaut · 2025-11-19T03:27:28Z

Hi @Cyan4973, can you please take up this PR for review?

Cyan4973 · 2025-11-26T18:28:08Z

build/make/lz4defs.make

+# LZ4_compress_generic_validated and better branch predictor behavior for noDict directive.
+
+
+USEROPTIONALFLAGS = -DAOCL_PRELOAD_BRANCH_OPT=1


nit: I have a preference for:

USEROPTIONALFLAGS ?= -DAOCL_PRELOAD_BRANCH_OPT=1

so that USEROPTIONALFLAGS can also be set via an environment variable (in addition to a command-line argument, which keeps higher priority).

Also: if the patch is beneficial to some targets and detrimental to others, we might have some conditional platform logic to implement here.

BiplabRaut · 2025-12-16

Hi @Cyan4973, we have modified it based on your comment. Thanks

Cyan4973 · 2025-11-26T18:33:07Z

> Hi @Cyan4973, can you please take up this PR for review?

Hi @BiplabRaut,

Sorry, we had some consequential activity to deal with these past few weeks.

Back to some reasonable availability.

I guess the questions asked at the beginning of this post remain valid:

- On which platforms is this PR presumed to be beneficial? I'm presuming AMD, but maybe only "recent" CPUs, or maybe x64 in general.
- Please document the expected gains for these known cases.
- Are there other platforms (like Arm) to check, to ensure this patch (which is currently applied unconditionally) is not detrimental?

On the plus side, the patch is (mostly) concentrated in one place, which makes gating it easy.
What I don't know at this point is whether it's expected to produce generally positive speed outcomes, and at worst neutral ones, making it a replacement candidate for the existing code,
or whether its impact varies (sometimes positive, sometimes negative) across platforms, which would require keeping the alternative around and cleverly selecting the correct variant for the local environment.
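
As an illustration of the second scenario, compile-time gating could look roughly like the sketch below. The macro name `AOCL_PRELOAD_BRANCH_OPT` comes from this PR's build flag. The per-platform default (on for x86-64, off elsewhere) and the `selected_variant` helper are assumptions made for the example, not a decision taken in this thread.

```c
#include <string.h>

/* Honour an explicit -DAOCL_PRELOAD_BRANCH_OPT=0/1 from the build system.
 * Otherwise fall back to a per-target default (hypothetical policy). */
#ifndef AOCL_PRELOAD_BRANCH_OPT
#  if defined(__x86_64__) || defined(_M_X64)
#    define AOCL_PRELOAD_BRANCH_OPT 1
#  else
#    define AOCL_PRELOAD_BRANCH_OPT 0
#  endif
#endif

/* Report which code path this translation unit was built with. */
static const char* selected_variant(void)
{
#if AOCL_PRELOAD_BRANCH_OPT
    return "preload";
#else
    return "baseline";
#endif
}
```

With this shape, a command-line or environment-provided flag still wins over the default, and the existing code path stays available on targets where the variant has not been shown to help.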

BiplabRaut · 2025-12-04T12:00:40Z

Thank you Yann for your further comments.

We will get back with detailed test numbers/graphs and share here.

On your point regarding whether this code can be a replacement for, or a separate candidate alongside, the existing code, we think it would give more flexibility for use across platforms.
Once we share our performance graphs, maybe you can review them and give your views on this.

…earch range upper bound check
- Data is preloaded for byte matching within the inner loop in LZ4_compress_generic_validated.
- Order of LZ4_DISTANCE_MAX check for match is changed to allow better branch predictor behavior for noDict directive.
Co-authored-by: vedan102_amdeng <Veda.N@amd.com>
Co-authored-by: sraut_amdeng <Biplab.Raut@amd.com>

BiplabRaut · 2025-12-16T08:30:26Z

Hi @Cyan4973,
Here is a summary of the performance gains that we see with this PR
(as you suggested, we have also benchmarked it on Arm).

Performance highlights (no change in compression ratio):
Benchmark results (single-threaded execution) are shared as graphs below for the datasets Silesia, Calgary, Canterbury and FreeBSD.
Other datasets, like geekBench and Enwik, showed similar trends.

[Graph] Performance gains on AMD Zen4 based CPU
Setup: AMD Genoa (Zen4) CPU, SMT OFF, GCC 14.2.1

[Graph] Performance gains on AWS Arm based Graviton CPU
Setup: AWS Graviton4 (Neoverse-V2) CPU, SMT OFF, GCC 13.3

Cyan4973 self-assigned this Oct 23, 2025

Cyan4973 requested changes Oct 23, 2025

BiplabRaut force-pushed the lz4_data_preload branch 2 times, most recently from 9b78d32 to 13410e6, November 1, 2025 07:34

BiplabRaut requested a review from Cyan4973 November 9, 2025 10:12

Cyan4973 reviewed Nov 26, 2025

BiplabRaut force-pushed the lz4_data_preload branch from 13410e6 to a9cc314, December 16, 2025 07:28
