Speed up crc32_[v]pclmulqdq on small strings. #1650
Conversation
I experimented with using DO8 in there as well, but on my CPU at least it made zero difference, only adding code size and slowing down the len=1 case by ~50%.
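(For context: DO8 refers to the classic zlib-style unrolled byte loop. Below is a minimal sketch of what that pattern typically looks like, assuming a precomputed `crc_table`; it is an illustration, not the crc_braid code itself.)

```c
#include <stddef.h>
#include <stdint.h>

extern const uint32_t crc_table[256];   /* assumed: precomputed CRC-32 table */

/* Classic zlib-style byte-at-a-time CRC with 8x unrolling (illustrative). */
#define DO1 (crc = crc_table[(crc ^ *buf++) & 0xff] ^ (crc >> 8))
#define DO8 do { DO1; DO1; DO1; DO1; DO1; DO1; DO1; DO1; } while (0)

static uint32_t crc32_bytewise(uint32_t crc, const uint8_t *buf, size_t len) {
    crc = ~crc;
    while (len >= 8) {   /* unrolled main loop */
        DO8;
        len -= 8;
    }
    while (len--)        /* byte-at-a-time tail */
        DO1;
    return ~crc;
}
```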
Weird, I was almost certain that the pclmulqdq variant of crc32 was slower for anything smaller than 64 bytes. I can't verify anything for the v variants as I don't have a CPU that supports them, but I'll have a look. Oh I see, you wrote a new bypass method.
I actually came across this while attempting to make crc_small, because I wanted to get rid of the crc_braid dependency (for another PR in the works), and when I enabled additional benchmarks to make sure it was not slower, I got quite a surprise.
I also tested using crc_braid with the smaller lengths, and that is actually also faster (only changed the …). I did my testing on an i9-9900K; I also don't have anything capable of vpclmulqdq.
Hmm, the CI failures there are probably what I was referring to with the size requirement. Let me look at that code again, it's been a while.
Yeah, so it's definitely required to be at least 31 bytes for the load. The reason for this is what we do for aligning the loads. Now, unaligned vs aligned loads have made very little difference for a while now, but I wouldn't be shocked if it were something. The CLMUL-capable CPUs are all post-Nehalem, though.
That's what I was afraid of. It means we get a very nice speedup for sizes 32-63 bytes though, while 12-31 remain slow, unless someone has other ideas to improve on that. Updated the PR to always do …
Yeah, the idea behind the aligning load was that every load forward from that point is also aligned. We could do something where we handle extremely short strings by just doing an unaligned version. We do something somewhat similar in the adler checksum for SSSE3, which accommodates CPUs where alignment really makes or breaks it.
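(A rough sketch of why the aligning-load scheme implies the ~31-byte minimum discussed above; variable names here are illustrative, not zlib-ng's.)

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* Sketch: one unaligned 16-byte load covers the buffer start, then the
 * pointer is rounded up to a 16-byte boundary so every later load is aligned. */
static void aligning_load_sketch(const uint8_t *src, size_t len) {
    __m128i first = _mm_loadu_si128((const __m128i *)src);   /* needs len >= 16 */
    (void)first;

    /* Bytes to the next 16-byte boundary: 0..15. */
    size_t algn_diff = (16 - ((uintptr_t)src & 15)) & 15;

    /* The first aligned load reads 16 bytes starting at src + algn_diff,
     * so it needs len >= algn_diff + 16; worst case 15 + 16 = 31 bytes. */
    if (len >= algn_diff + 16) {
        __m128i next = _mm_load_si128((const __m128i *)(src + algn_diff));
        (void)next;
    }
}
```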
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           develop    #1650      +/-  ##
===========================================
+ Coverage    83.13%   83.41%   +0.27%
===========================================
  Files          135      135
  Lines        10895    11126     +231
  Branches      2816     2988     +172
===========================================
+ Hits          9058     9281     +223
- Misses        1127     1130       +3
- Partials       710      715       +5
```

☔ View full report in Codecov by Sentry.
@Dead2 I will benchmark on my laptop that has AVX512.
(Force-pushed from ee7f79b to ebd78ca.)
Removed the extra header file, as it is not really needed any more (already so many crc files 😅).
Rebased and split into two commits.
Updated benchmarks on my i9-9900K.

[Benchmark tables: PR vs DEVELOP]
Hmm, using Google benchmarks, it does not go through …
Your develop benchmarks don't have the change to benchmark lengths (now a separate commit to make that easier).
They will if they call …
@nmoinvaz Actually you do have the speedup there, just nothing to compare it to.
The 32 test is the fastest because it is the shortest one now using the [v]pclmulqdq code. Previously the smallest was the 64 size, and the 32 would be around 78ns or so on your platform instead of the current 14.7ns, so ~5x faster.
Running through the larger 740-something test batteries, but it'd probably be good to see it pass CI as well. Yeah, so part of this was due to the fact that we had to do some back-bending to make sure that the copy+CRC checksum worked for small lengths anyway. It was only when we ended up combining the code into the template maze we have now that it seemed silly we weren't accommodating the 16 < len < 31 cases. I think I have something here that works, but I'd feel better with more testing.
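(For readers following along, the "template maze" mentioned above is roughly an include-twice pattern; the file name and `COPY` macro are taken from the diff later in this thread, but the exact layout here is an assumption, not the actual zlib-ng source.)

```c
/* Sketch only: the shared body lives in crc32_fold_pclmulqdq_tpl.h and is
 * compiled twice, with #ifdef COPY selecting the copy+CRC-specific pieces. */
#define COPY
#include "crc32_fold_pclmulqdq_tpl.h"   /* instantiates the fold-and-copy variant */
#undef COPY
#include "crc32_fold_pclmulqdq_tpl.h"   /* instantiates the CRC-only variant */
```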
I see, I'll rerun the benchmarks in a bit with that one commit pulled into develop.
[Benchmark tables: DEVELOP vs PR]
@nmoinvaz Did your laptop overheat and throttle or something?
@Dead2 here's what I have so far:
The area I'm least confident about is where the user provides an initial CRC to the function, with arbitrarily aligned buffers and lengths > 16 but < 32. Do we have coverage of that in tests?
I put a manual abort statement in the area I wasn't confident about and it did manage to abort in at least one gtest, so perhaps that's something.
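(A hedged sketch of a test for the case described above: nonzero initial CRC, every alignment offset, and lengths 17-31, checked against a plain bitwise CRC-32 oracle. `crc32_under_test` is a stand-in for whichever zlib-ng entry point is being exercised; this is not an existing gtest.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain bitwise reflected CRC-32, used only as a reference oracle. */
static uint32_t crc32_ref(uint32_t crc, const uint8_t *buf, size_t len) {
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
    }
    return ~crc;
}

/* Stand-in for the pclmulqdq-accelerated path under test. */
extern uint32_t crc32_under_test(uint32_t crc, const uint8_t *buf, size_t len);

static void test_small_lengths_with_init_crc(void) {
    uint8_t storage[64];
    for (size_t i = 0; i < sizeof(storage); i++)
        storage[i] = (uint8_t)(i * 131 + 7);

    for (size_t offset = 0; offset < 16; offset++) {       /* every alignment */
        for (size_t len = 17; len <= 31; len++) {           /* the suspect range */
            const uint8_t *p = storage + offset;
            uint32_t init = 0xDEADBEEFu;                     /* nonzero initial CRC */
            assert(crc32_under_test(init, p, len) == crc32_ref(init, p, len));
        }
    }
}
```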
@Dead2 this seems to be correct but warrants more testing if you can manage:

```diff
diff --git a/arch/x86/crc32_fold_pclmulqdq_tpl.h b/arch/x86/crc32_fold_pclmulqdq_tpl.h
index 3e799283..753417ba 100644
--- a/arch/x86/crc32_fold_pclmulqdq_tpl.h
+++ b/arch/x86/crc32_fold_pclmulqdq_tpl.h
@@ -26,9 +26,8 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
     __m128i xmm_t0, xmm_t1, xmm_t2, xmm_t3;
     __m128i xmm_crc0, xmm_crc1, xmm_crc2, xmm_crc3;
     __m128i xmm_crc_part = _mm_setzero_si128();
-#ifdef COPY
     char ALIGNED_(16) partial_buf[16] = { 0 };
-#else
+#ifndef COPY
     __m128i xmm_initial = _mm_cvtsi32_si128(init_crc);
     int32_t first = init_crc != 0;
@@ -41,12 +40,12 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
     crc32_fold_load((__m128i *)crc->fold, &xmm_crc0, &xmm_crc1, &xmm_crc2, &xmm_crc3);
     if (len < 16) {
-#ifdef COPY
         if (len == 0)
             return;
         memcpy(partial_buf, src, len);
         xmm_crc_part = _mm_load_si128((const __m128i *)partial_buf);
+#ifdef COPY
         memcpy(dst, partial_buf, len);
 #endif
         goto partial;
@@ -63,9 +62,23 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
     if (algn_diff < 4 && init_crc != 0) {
         xmm_t0 = xmm_crc_part;
-        xmm_crc_part = _mm_loadu_si128((__m128i*)src + 1);
-        fold_1(&xmm_crc0, &xmm_crc1, &xmm_crc2, &xmm_crc3);
-        xmm_crc3 = _mm_xor_si128(xmm_crc3, xmm_t0);
+        if (len >= 32) {
+            xmm_crc_part = _mm_loadu_si128((__m128i*)src + 1);
+            fold_1(&xmm_crc0, &xmm_crc1, &xmm_crc2, &xmm_crc3);
+            xmm_crc3 = _mm_xor_si128(xmm_crc3, xmm_t0);
+        } else {
+            memcpy(partial_buf, src + 16, len - 16);
+            xmm_crc_part = _mm_load_si128((__m128i*)partial_buf);
+            fold_1(&xmm_crc0, &xmm_crc1, &xmm_crc2, &xmm_crc3);
+            xmm_crc3 = _mm_xor_si128(xmm_crc3, xmm_t0);
+            src += 16;
+            len -= 16;
+#ifdef COPY
+            dst -= algn_diff;
+#endif
+            goto partial;
+        }
+
         src += 16;
         len -= 16;
     }
diff --git a/arch/x86/crc32_pclmulqdq_tpl.h b/arch/x86/crc32_pclmulqdq_tpl.h
index 80a35b03..3a4f6af5 100644
--- a/arch/x86/crc32_pclmulqdq_tpl.h
+++ b/arch/x86/crc32_pclmulqdq_tpl.h
@@ -365,7 +365,7 @@ static inline uint32_t crc32_small(uint32_t crc, const uint8_t *buf, size_t len)
 Z_INTERNAL uint32_t CRC32(uint32_t crc32, const uint8_t *buf, size_t len) {
     /* For lens smaller than ~12, crc32_small method is faster.
      * But there are also minimum requirements for the pclmul functions due to alignment */
-    if (len < 32)
+    if (len < 16)
         return crc32_small(crc32, buf, len);
     crc32_fold ALIGNED_(16) crc_state;
```

Or, feel free to clean the logic up a bit, it's a little messy, I know.
@KungFuJesus you can specify "```diff" markdown and it will color code your diffs for you.

```diff
diff --git a/arch/x86/crc32_fold_pclmulqdq_tpl.h b/arch/x86/crc32_fold_pclmulqdq_tpl.h
index 3e799283..753417ba 100644
--- a/arch/x86/crc32_fold_pclmulqdq_tpl.h
+++ b/arch/x86/crc32_fold_pclmulqdq_tpl.h
@@ -26,9 +26,8 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
     __m128i xmm_t0, xmm_t1, xmm_t2, xmm_t3;
     __m128i xmm_crc0, xmm_crc1, xmm_crc2, xmm_crc3;
     __m128i xmm_crc_part = _mm_setzero_si128();
...
```
Updated benchmarks on my i9-9900K. (Added the last part with the patch from @KungFuJesus.)

[Benchmark tables: Develop · PR #1650 small-crc32 below len 32 · PR #1650 small-crc32 below len 16 + expanded pclmulqdq handling for len 16-31]

Very nice improvement at 16 bytes, and likely much better at 31 bytes. Good work @KungFuJesus.
CI error: https://github.com/zlib-ng/zlib-ng/actions/runs/7681876330/job/20935515796?pr=1650
Yeah, the assertion is a known failure, as it has not been updated yet, nor has the comment above it.
Coverity scan seems to suggest the current tests are not hitting all of the changed code... I don't know if it's a glitch in Coverity or just that those are corner cases that are not hit in simple tests.
Codecov sees 100% coverage on all changes from the PR, and an overall +0.28% coverage: https://app.codecov.io/gh/zlib-ng/zlib-ng/pull/1650. Do you have a reference to what you are seeing with the Coverity scan?
LGTM
Speeds up crc32_[v]pclmulqdq by letting them run on all strings bigger than 32 bytes instead of 64 bytes, and adding a simpler fallback that avoids spending time trying to handle big strings.
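(A minimal sketch of the shape described above: a simple short-input fallback plus the lowered cutoff before handing off to the folding code. The function names echo this thread, but the bodies are illustrative stand-ins, not the PR's implementation.)

```c
#include <stddef.h>
#include <stdint.h>

/* Simple fallback for short inputs: no chunking or braid tables, just a
 * tight bitwise loop (illustrative stand-in for crc32_small). */
static uint32_t crc32_small_sketch(uint32_t crc, const uint8_t *buf, size_t len) {
    crc = ~crc;
    while (len--) {
        crc ^= *buf++;
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
    }
    return ~crc;
}

/* Folding [v]pclmulqdq path, assumed to exist elsewhere. */
extern uint32_t crc32_pclmulqdq_fold(uint32_t crc, const uint8_t *buf, size_t len);

static uint32_t crc32_dispatch(uint32_t crc, const uint8_t *buf, size_t len) {
    /* Cutoff lowered so 32..63-byte inputs also take the folding path. */
    if (len < 32)
        return crc32_small_sketch(crc, buf, len);
    return crc32_pclmulqdq_fold(crc, buf, len);
}
```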