-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Improve compression speed through data preloading and reordering of search range upper bound check #1666
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
BiplabRaut
commented
Oct 23, 2025
|
Cyan4973
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The asan / valgrind errors are very consequential.
They must be fixed before moving the patch forward.
9b78d32 to
13410e6
Compare
Hi Yann, |
The code change adds two types of optimizations to speed up compression (also mentioned in the commit log message) :
These optimizations help compression speed on x86 CPUs in general and may also extend the performance benefits to Arm CPUs (although not tested). In our tests on Zen4 and Zen5 based AMD CPUs, we observed the below performance benefits (with GCC 14.2) : Thanks. |
|
HI @Cyan4973 , can you please take up this PR for review |
build/make/lz4defs.make
Outdated
| # LZ4_compress_generic_validated and better branch predictor behavior for noDict directive. | ||
|
|
||
|
|
||
| USEROPTIONALFLAGS = -DAOCL_PRELOAD_BRANCH_OPT=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I have a preference for :
USEROPTIONALFLAGS ?= -DAOCL_PRELOAD_BRANCH_OPT=1
so that USEROPTIONALFLAGS can also be set via environment variable (on top of command line argument, which remains higher priority).
Also: if the patch is beneficial to some targets and detrimental to others, we might have some conditional platform logic to implement here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Cyan4973, we have modified it based on your comment. Thanks
Hi @BiplabRaut , sorry, we had some consequential activity to deal with these past few weeks. Back to some reasonable availability. I guess the questions asked at the beginning of this post remain valid:
On the plus side, the patch is (mostly) concentrated in one place, which makes gating it easy. |
Thank you Yann for your further comments. We will get back with detailed test numbers/graphs and share here. On your point regarding whether this code can be a replacement/separate candidate to the existing code, we think it would give more flexibility for its use across platforms. |
…earch range upper bound check
- Data is preloaded for byte matching within the inner loop in LZ4_compress_generic_validated.
- Order of LZ4_DISTANCE_MAX check for match is changed to allow better branch predictor behavior for noDict directive.
Co-authored-by: vedan102_amdeng <Veda.N@amd.com>
Co-authored-by: sraut_amdeng <Biplab.Raut@amd.com>
13410e6 to
a9cc314
Compare
|
Hi @Cyan4973, Performance Highlights :- (No compression ratio change) Performance gains on AMD Zen4 based CPU Setup: AMD Genoa (Zen4) CPU, SMT OFF, GCC 14.2.1 Performance gains on AWS Arm based Graviton CPU Setup: AWS Graviton4 (Neoverse-V2) CPU, SMT OFF, GCC 13.3 |