Skip to content

Conversation

@BiplabRaut
Copy link

- Data is preloaded for byte matching within the inner loop in LZ4_compress_generic_validated.
- Order of LZ4_DISTANCE_MAX check for match is changed to allow better branch predictor behavior for noDict directive.

@Cyan4973 Cyan4973 self-assigned this Oct 23, 2025
@Cyan4973
Copy link
Member

  • On which platforms is this PR presumed to be beneficial ?
  • Please document the expected gains in these cases
  • Are there other platforms on which this patch is suspected to be detrimental ?

Copy link
Member

@Cyan4973 Cyan4973 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The asan / valgrind errors are very consequential.
They must be fixed before moving the patch forward.

@BiplabRaut BiplabRaut force-pushed the lz4_data_preload branch 2 times, most recently from 9b78d32 to 13410e6 Compare November 1, 2025 07:34
@BiplabRaut
Copy link
Author

The asan / valgrind errors are very consequential. They must be fixed before moving the patch forward.

Hi Yann,
The issues have been fixed and all the tests are now passing. Thanks

@BiplabRaut
Copy link
Author

BiplabRaut commented Nov 1, 2025

  • On which platforms is this PR presumed to be beneficial ?
  • Please document the expected gains in these cases
  • Are there other platforms on which this patch is suspected to be detrimental ?

The code change adds two types of optimizations to speed up compression (also mentioned in the commit log message) :

  • Prefetching of input and match bytes by accessing them before they get used for comparison
  • Delay the condition checking involving LZ4_DISTANCE_MAX for noDict case for helping branch prediction.

These optimizations help compression speed on x86 CPUs in general and may also extend the performance benefits to Arm CPUs (although not tested).
As these changes are mostly generic memory and code optimizations without any assembly/architecture specific changes, so we don't expect any platforms to suffer in any way.

In our tests on Zen4 and Zen5 based AMD CPUs, we observed the below performance benefits (with GCC 14.2) :
8.8% compression speed up for Silesia
13% compression speed up for freeBSD

Thanks.

@BiplabRaut BiplabRaut requested a review from Cyan4973 November 9, 2025 10:12
@BiplabRaut
Copy link
Author

HI @Cyan4973 , can you please take up this PR for review

# LZ4_compress_generic_validated and better branch predictor behavior for noDict directive.


USEROPTIONALFLAGS = -DAOCL_PRELOAD_BRANCH_OPT=1
Copy link
Member

@Cyan4973 Cyan4973 Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I have a preference for :

USEROPTIONALFLAGS ?= -DAOCL_PRELOAD_BRANCH_OPT=1

so that USEROPTIONALFLAGS can also be set via environment variable (on top of command line argument, which remains higher priority).

Also: if the patch is beneficial to some targets and detrimental to others, we might have some conditional platform logic to implement here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Cyan4973, we have modified it based on your comment. Thanks

@Cyan4973
Copy link
Member

Cyan4973 commented Nov 26, 2025

HI @Cyan4973 , can you please take up this PR for review

Hi @BiplabRaut ,

sorry, we had some consequential activity to deal with these past few weeks.

Back to some reasonable availability.

I guess the questions asked at the beginning of this post remain valid:

  • On which platforms is this PR presumed to be beneficial ?
    I'm presuming AMD, but maybe only "recent" cpus, or maybe x64 in general.
  • Please document the expected gains for these known cases
  • Are there other platforms (like arm) to check to ensure this patch (which is currently applied unconditionally) is not detrimental ?

On the plus side, the patch is (mostly) concentrated in one place, which makes gating it easy.
What I don't know at this point is if it's expected to produce generally positive speed outcomes, and at worse neutral, thus being a replacement candidate for the existing code,
or if its impact varies (sometimes positive, sometimes negative) across platforms, thus requiring to keep the alternative around and cleverly select the correct variant for the local environment.

@BiplabRaut
Copy link
Author

HI @Cyan4973 , can you please take up this PR for review

Hi @BiplabRaut ,

sorry, we had some consequential activity to deal with these past few weeks.

Back to some reasonable availability.

I guess the questions asked at the beginning of this post remain valid:

  • On which platforms is this PR presumed to be beneficial ?
    I'm presuming AMD, but maybe only "recent" cpus, or maybe x64 in general.
  • Please document the expected gains for these known cases
  • Are there other platforms (like arm) to check to ensure this patch (which is currently applied unconditionally) is not detrimental ?

On the plus side, the patch is (mostly) concentrated in one place, which makes gating it easy. What I don't know at this point is if it's expected to produce generally positive speed outcomes, and at worse neutral, thus being a replacement candidate for the existing code, or if its impact varies (sometimes positive, sometimes negative) across platforms, thus requiring to keep the alternative around and cleverly select the correct variant for the local environment.

Thank you Yann for your further comments.

We will get back with detailed test numbers/graphs and share here.

On your point regarding whether this code can be a replacement/separate candidate to the existing code, we think it would give more flexibility for its use across platforms.
Once we share our performance graphs, may be you can review and give your views on this.

…earch range upper bound check

    - Data is preloaded for byte matching within the inner loop in LZ4_compress_generic_validated.
    - Order of LZ4_DISTANCE_MAX check for match is changed to allow better branch predictor behavior for noDict directive.

Co-authored-by: vedan102_amdeng <Veda.N@amd.com>
Co-authored-by: sraut_amdeng <Biplab.Raut@amd.com>
@BiplabRaut
Copy link
Author

Hi @Cyan4973,
Here is the summary of performance gains that we see with this PR:
(As you suggested, we have also benchmarked it on Arm)

Performance Highlights :- (No compression ratio change)
Benchmark result (single-threaded execution) graphs are shared below for the datasets: Silesia, Calgary, Canterbury and FreeBSD.
Other datasets like geekBench and Enwik showed similar trends.

image

Performance gains on AMD Zen4 based CPU Setup: AMD Genoa (Zen4) CPU, SMT OFF, GCC 14.2.1

image

Performance gains on AWS Arm based Graviton CPU Setup: AWS Graviton4 (Neoverse-V2) CPU, SMT OFF, GCC 13.3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants