Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up crc32_[v]pclmulqdq on small strings. #1650

Merged
merged 3 commits into from
Feb 2, 2024
Merged

Speed up crc32_[v]pclmulqdq on small strings. #1650

merged 3 commits into from
Feb 2, 2024

Conversation

Dead2
Copy link
Member

@Dead2 Dead2 commented Jan 25, 2024

Speeds up crc32_[v]pclmulqdq by letting them run on all strings bigger than 32bytes instead of 64bytes, and adding a simpler fallback that avoids spending time trying to handle big strings.

@Dead2
Copy link
Member Author

Dead2 commented Jan 25, 2024

## Develop
crc32/braid/1                 7.17 ns         7.16 ns     97429653
crc32/braid/8                 18.7 ns         18.6 ns     37576722
crc32/braid/12                27.8 ns         27.8 ns     25212999
crc32/braid/15                34.5 ns         34.5 ns     20294607
crc32/braid/16                36.8 ns         36.7 ns     19056080
crc32/braid/31                70.7 ns         70.6 ns      9916083
crc32/braid/32                73.0 ns         72.9 ns      9607301
crc32/braid/63                 133 ns          133 ns      5257338
crc32/braid/64                 136 ns          135 ns      5170867
crc32/braid/512                462 ns          462 ns      1516092

crc32/pclmulqdq/1             8.32 ns         8.31 ns     84413050
crc32/pclmulqdq/8             18.6 ns         18.6 ns     37607806
crc32/pclmulqdq/12            27.8 ns         27.7 ns     25227018
crc32/pclmulqdq/15            34.5 ns         34.4 ns     20327031
crc32/pclmulqdq/16            36.8 ns         36.7 ns     19073963
crc32/pclmulqdq/31            70.7 ns         70.6 ns      9921032
crc32/pclmulqdq/32            72.9 ns         72.8 ns      9614389
crc32/pclmulqdq/63             133 ns          133 ns      5257425
crc32/pclmulqdq/64            28.1 ns         28.1 ns     24951603
crc32/pclmulqdq/512           53.4 ns         53.3 ns     14169403

   text    data     bss     dec     hex filename
 131796    1352       8  133156   20824 libz-ng.so.2
## Small-crc32 bypass32
crc32/pclmulqdq/1             4.77 ns         4.76 ns    146939917
crc32/pclmulqdq/8             18.7 ns         18.6 ns     37547975
crc32/pclmulqdq/12            27.8 ns         27.7 ns     25244669
crc32/pclmulqdq/15            34.5 ns         34.4 ns     20329798
crc32/pclmulqdq/16            36.8 ns         36.7 ns     19055431
crc32/pclmulqdq/31            70.7 ns         70.6 ns      9910188
crc32/pclmulqdq/32            24.8 ns         24.8 ns     28266098
crc32/pclmulqdq/63            34.5 ns         34.4 ns     20318395
crc32/pclmulqdq/64            28.2 ns         28.1 ns     24882989
crc32/pclmulqdq/512           49.5 ns         49.5 ns     14152738

   text    data     bss     dec     hex filename
 132900    1352       8  134260   20c74 libz-ng.so.2
## Small-crc32 bypass16
crc32/pclmulqdq/1             4.77 ns         4.76 ns    146932063
crc32/pclmulqdq/8             18.7 ns         18.6 ns     37548198
crc32/pclmulqdq/12            27.8 ns         27.7 ns     25244843
crc32/pclmulqdq/15            34.5 ns         34.4 ns     20330711
crc32/pclmulqdq/16            23.5 ns         23.5 ns     29785620
crc32/pclmulqdq/31            33.3 ns         33.3 ns     21042142
crc32/pclmulqdq/32            24.8 ns         24.8 ns     28268347
crc32/pclmulqdq/63            34.5 ns         34.4 ns     20334868
crc32/pclmulqdq/64            28.2 ns         28.1 ns     24883692
crc32/pclmulqdq/512           50.6 ns         50.5 ns     10000000

   text    data     bss     dec     hex filename
 132900    1352       8  134260   20c74 libz-ng.so.2

@Dead2
Copy link
Member Author

Dead2 commented Jan 25, 2024

I experimented with using DO8 in there as well, but on my cpu at least it made zero difference, only adding code size and slowing down the len=1 case ~50%.

@KungFuJesus
Copy link
Contributor

KungFuJesus commented Jan 25, 2024

Weird, I was almost certain that the pclmulqdq variant of crc32 was slower for anything smaller than 64 bytes. I can't verify anything for the v variants as I don't have a CPU that support them but I'll have a look.

Oh I see, you wrote a new bypass method.

test/benchmarks/benchmark_crc32.cc Outdated Show resolved Hide resolved
arch/x86/crc32_pclmulqdq_tpl.h Outdated Show resolved Hide resolved
arch/x86/crc32_pclmulqdq_tpl.h Outdated Show resolved Hide resolved
@Dead2
Copy link
Member Author

Dead2 commented Jan 25, 2024

Weird, I was almost certain that the pclmulqdq variant of crc32 was slower for anything smaller than 64 bytes. I can't verify anything for the v variants as I don't have a CPU that support them but I'll have a look.

I actually came across this attempting to make crc_small because I wanted to get rid of the crc_braid dependency (for another PR in the works), and when I enabled additional benchmarks to make sure it was not slower, I got quite a surprise.

Oh I see, you wrote a new bypass method.

I also tested using crc_braid with the smaller lengths, and that is actaully also faster (only changed the len < 64 to 16).

I did my testing on a i9-9900K, I also don't have anything capable of vpclmulqdq.

@KungFuJesus
Copy link
Contributor

Hmm, the CI failures there are probably what I was referring to with the size requirement. Let me look at that code again, it's been a while.

@KungFuJesus
Copy link
Contributor

Hmm, the CI failures there are probably what I was referring to with the size requirement. Let me look at that code again, it's been a while.

Yeah so it's definitely required to be at least 31 bytes for the load. The reason for this is what we do for aligning the loads. Now, unaligned vs aligned loads has made very little difference for a while now, but I wouldn't be shocked if it were something. The CLMUL capable CPUs are all post Nehalem, though.

@Dead2
Copy link
Member Author

Dead2 commented Jan 25, 2024

Yeah so it's definitely required to be at least 31 bytes for the load. The reason for this is what we do for aligning the loads. Now, unaligned vs aligned loads has made very little difference for a while now, but I wouldn't be shocked if it were something. The CLMUL capable CPUs are all post Nehalem, though.

Thats what I was afraid of. It means we get a very nice speedup for sizes 32-63 bytes though, while 12-31 remain slow. Unless someone has other ideas to improve on that.

Updated the PR to always do len < 32 to be on the safe side. Hopefully it now passes CI.

@KungFuJesus
Copy link
Contributor

Yeah so it's definitely required to be at least 31 bytes for the load. The reason for this is what we do for aligning the loads. Now, unaligned vs aligned loads has made very little difference for a while now, but I wouldn't be shocked if it were something. The CLMUL capable CPUs are all post Nehalem, though.

Thats what I was afraid of. It means we get a very nice speedup for sizes 32-63 bytes though, while 12-31 remain slow. Unless someone has other ideas to improve on that.

Updated the PR to always do len < 32 to be on the safe side. Hopefully it now passes CI.

Yeah, the idea behind the aligning load was that every load forward from that point is also aligned. We could do something where we handle extremely short strings by just doing an unaligned version. We do something somewhat similar in the adler checksum for ssse3, where it incorporates CPUs where alignment really makes or breaks it.

Copy link

codecov bot commented Jan 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (1aa53f4) 83.13% compared to head (88a9f8b) 83.41%.
Report is 4 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1650      +/-   ##
===========================================
+ Coverage    83.13%   83.41%   +0.27%     
===========================================
  Files          135      135              
  Lines        10895    11126     +231     
  Branches      2816     2988     +172     
===========================================
+ Hits          9058     9281     +223     
- Misses        1127     1130       +3     
- Partials       710      715       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nmoinvaz
Copy link
Member

@Dead2 I will benchmark on my laptop that has AVX512.

@Dead2 Dead2 force-pushed the speedup-crc32 branch 2 times, most recently from ee7f79b to ebd78ca Compare January 25, 2024 20:33
@Dead2
Copy link
Member Author

Dead2 commented Jan 25, 2024

Removed the extra header file, as it is not really needed any more (already so many crc files 😅).
This also means that all the N and W processing happens together in crc32_braid_p.h, making it a bit easier to understand.

@Dead2
Copy link
Member Author

Dead2 commented Jan 25, 2024

Rebased and split into two commits.

@Dead2
Copy link
Member Author

Dead2 commented Jan 25, 2024

Updated benchmarks on my i9-9000K

## Develop
crc32/braid/1                             7.15 ns         7.14 ns     97662151
crc32/braid/8                             18.7 ns         18.6 ns     37574368
crc32/braid/12                            27.8 ns         27.8 ns     25212989
crc32/braid/16                            36.8 ns         36.7 ns     19056127
crc32/braid/32                            73.0 ns         72.9 ns      9607586
crc32/braid/64                             136 ns          135 ns      5170815
crc32/braid/512                            464 ns          463 ns      1505207
crc32/braid/4096                          2956 ns         2952 ns       237152
crc32/braid/32768                        23070 ns        23034 ns        30388
crc32/braid/262144                      184311 ns       184003 ns         3805
crc32/braid/4194304                    2952183 ns      2946962 ns          238

crc32/pclmulqdq/1                         8.65 ns         8.64 ns     80827224
crc32/pclmulqdq/8                         18.6 ns         18.6 ns     37613951
crc32/pclmulqdq/12                        27.8 ns         27.7 ns     25227336
crc32/pclmulqdq/16                        36.8 ns         36.7 ns     19073905
crc32/pclmulqdq/32                        72.9 ns         72.8 ns      9614650
crc32/pclmulqdq/64                        28.1 ns         28.0 ns     24963663
crc32/pclmulqdq/512                       49.2 ns         49.1 ns     14259025
crc32/pclmulqdq/4096                       226 ns          225 ns      3106307
crc32/pclmulqdq/32768                     1607 ns         1604 ns       436387
crc32/pclmulqdq/262144                   13467 ns        13443 ns        52067
crc32/pclmulqdq/4194304                 226517 ns       226075 ns         3098

   text    data     bss     dec     hex filename
 131780    1352       8  133140   20814 libz-ng.so.2
## PR #1650
crc32/pclmulqdq/1                         5.04 ns         5.03 ns    139252670  ## Faster
crc32/pclmulqdq/8                         18.7 ns         18.6 ns     37548313
crc32/pclmulqdq/12                        27.8 ns         27.7 ns     25245168
crc32/pclmulqdq/16                        36.8 ns         36.7 ns     19055822
crc32/pclmulqdq/32                        25.0 ns         24.9 ns     28087357  ## Faster
crc32/pclmulqdq/64                        28.2 ns         28.1 ns     24895247
crc32/pclmulqdq/512                       49.2 ns         49.1 ns     14252094
crc32/pclmulqdq/4096                       226 ns          225 ns      3103245
crc32/pclmulqdq/32768                     1605 ns         1602 ns       436962
crc32/pclmulqdq/262144                   13590 ns        13564 ns        51610
crc32/pclmulqdq/4194304                 226707 ns       226219 ns         3096

   text    data     bss     dec     hex filename
 132868    1352       8  134228   20c54 libz-ng.so.2

@Dead2 Dead2 changed the title Draft: Speed up crc32_[v]pclmulqdq on small strings. Speed up crc32_[v]pclmulqdq on small strings. Jan 25, 2024
@nmoinvaz
Copy link
Member

PR

2024-01-25T13:36:38-08:00
Run on (8 X 3035.05 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 12288 KiB (x1)
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
crc32/braid/1                  3.33 ns         2.57 ns    560000000
crc32/braid/8                  19.4 ns         15.3 ns     40727273
crc32/braid/12                 28.6 ns         22.9 ns     28000000
crc32/braid/16                 36.8 ns         31.1 ns     23578947
crc32/braid/32                 74.9 ns         61.4 ns     11200000
crc32/braid/64                 61.5 ns         54.4 ns     11200000
crc32/braid/512                 198 ns          167 ns      4480000
crc32/braid/4096               1208 ns          977 ns       560000
crc32/braid/32768              9302 ns         7324 ns        89600
crc32/braid/262144            71939 ns        57547 ns         8960
crc32/braid/4194304         1206351 ns      1025391 ns          640
crc32/pclmulqdq/1              3.62 ns         3.05 ns    235789474
crc32/pclmulqdq/8              20.8 ns         18.1 ns     44800000
crc32/pclmulqdq/12             30.4 ns         24.5 ns     49777778
crc32/pclmulqdq/16             39.1 ns         33.0 ns     21333333
crc32/pclmulqdq/32             14.8 ns         12.9 ns     44800000
crc32/pclmulqdq/64             19.7 ns         16.8 ns     34461538
crc32/pclmulqdq/512            35.9 ns         28.3 ns     32000000
crc32/pclmulqdq/4096            168 ns          137 ns      5600000
crc32/pclmulqdq/32768          1243 ns          952 ns       640000
crc32/pclmulqdq/262144        10171 ns         7673 ns        89600
crc32/pclmulqdq/4194304      160583 ns       138108 ns         4978
crc32/vpclmulqdq/1             3.75 ns         3.28 ns    224000000
crc32/vpclmulqdq/8             21.5 ns         16.9 ns     40727273
crc32/vpclmulqdq/12            31.4 ns         26.4 ns     37333333
crc32/vpclmulqdq/16            38.8 ns         33.0 ns     20363636
crc32/vpclmulqdq/32            14.7 ns         12.1 ns     74666667
crc32/vpclmulqdq/64            18.9 ns         15.3 ns     40727273
crc32/vpclmulqdq/512           30.8 ns         24.5 ns     23578947
crc32/vpclmulqdq/4096          81.7 ns         64.1 ns     10000000
crc32/vpclmulqdq/32768          582 ns          459 ns      1600000
crc32/vpclmulqdq/262144        4436 ns         3599 ns       186667
crc32/vpclmulqdq/4194304      82027 ns        66267 ns         8960

DEVELOP

2024-01-25T13:38:44-08:00
Run on (8 X 2996.15 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 12288 KiB (x1)
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
crc32/braid/1                  3.40 ns         2.85 ns    373333333
crc32/braid/8                  20.5 ns         13.3 ns     44800000
crc32/braid/64                 56.7 ns         42.4 ns     16592593
crc32/braid/512                 187 ns          161 ns      4072727
crc32/braid/4096               1148 ns          903 ns       640000
crc32/braid/32768              9590 ns         9062 ns       100000
crc32/braid/262144            74291 ns        62779 ns        11200
crc32/braid/2097152          578376 ns       530134 ns         1120
crc32/braid/4194304         1310545 ns      1171875 ns          560
crc32/pclmulqdq/1              3.87 ns         3.42 ns    224000000
crc32/pclmulqdq/8              19.8 ns         17.6 ns     37333333
crc32/pclmulqdq/64             18.9 ns         16.9 ns     40727273
crc32/pclmulqdq/512            35.5 ns         33.0 ns     21333333
crc32/pclmulqdq/4096            168 ns          142 ns      5600000
crc32/pclmulqdq/32768          1195 ns         1074 ns       640000
crc32/pclmulqdq/262144        10247 ns         8954 ns        92490
crc32/pclmulqdq/2097152       78702 ns        61035 ns         8960
crc32/pclmulqdq/4194304      167554 ns       139509 ns         5600
crc32/vpclmulqdq/1             3.69 ns         3.08 ns    248888889
crc32/vpclmulqdq/8             20.5 ns         15.7 ns     44800000
crc32/vpclmulqdq/64            18.7 ns         15.0 ns     40727273
crc32/vpclmulqdq/512           29.9 ns         24.9 ns     26352941
crc32/vpclmulqdq/4096          84.6 ns         72.5 ns     11200000
crc32/vpclmulqdq/32768          564 ns          516 ns      1000000
crc32/vpclmulqdq/262144        4369 ns         3924 ns       179200
crc32/vpclmulqdq/2097152      39138 ns        29297 ns        21333
crc32/vpclmulqdq/4194304      81050 ns        68011 ns         8960

@nmoinvaz
Copy link
Member

nmoinvaz commented Jan 25, 2024

Hmm using Google benchmarks, it does not go through CRC32 function, so perhaps that is why I am not seeing a difference. Functions that go through functable also wouldn't hit your crc32_small...

@Dead2
Copy link
Member Author

Dead2 commented Jan 25, 2024

Hmm using Google benchmarks, it does not go through CRC32 function, so perhaps that is why I am not seeing a difference.

Your develop benchmarks don't have the change to benchmark lengths (now a separate commit to make that easier).
Therefore you have not benchmarked anything between len 12 and 63, where the speedups are.

@Dead2
Copy link
Member Author

Dead2 commented Jan 25, 2024

Functions that go through functable also wouldn't hit your crc32_small...

They will if they call ft.crc32...

@Dead2
Copy link
Member Author

Dead2 commented Jan 25, 2024

@nmoinvaz Actually you do have the speedup there, just nothing to compare it to.

crc32/pclmulqdq/8              20.8 ns         18.1 ns     44800000
crc32/pclmulqdq/12             30.4 ns         24.5 ns     49777778
crc32/pclmulqdq/16             39.1 ns         33.0 ns     21333333
crc32/pclmulqdq/32             14.8 ns         12.9 ns     44800000  ## faster than above and below
crc32/pclmulqdq/64             19.7 ns         16.8 ns     34461538

crc32/vpclmulqdq/8             21.5 ns         16.9 ns     40727273
crc32/vpclmulqdq/12            31.4 ns         26.4 ns     37333333
crc32/vpclmulqdq/16            38.8 ns         33.0 ns     20363636
crc32/vpclmulqdq/32            14.7 ns         12.1 ns     74666667  ## faster than above and below
crc32/vpclmulqdq/64            18.9 ns         15.3 ns     40727273

The 32 test is the fastest because it is the shortest one now using the [v]pclmulqdq code. Previously the smalles was the 64 size, and the 32 would be around 78ns or so for your platform instead of now 14.7ns, so ~5x faster.

@KungFuJesus
Copy link
Contributor

KungFuJesus commented Jan 25, 2024

Running through the larger 740 something test batteries but it'd probably be good to see it pass CI as well.

Yeah so part of this was due to the fact that we had to do some back bending to make sure that the copy+CRC checksum worked for small lengths, anyway. It was only when we ended up combining the code into the template maze we have now did it seem silly that we weren't accommodating the the 16 < len < 31 cases.

I think I have something here that works but I'd feel better with more testing.

@nmoinvaz
Copy link
Member

I see, I'll rerun the benchmarks in a bit with that one commit pulled into develop.

@nmoinvaz
Copy link
Member

DEVELOP

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
crc32/braid/1                  3.57 ns         1.88 ns    448000000
crc32/braid/8                  20.0 ns         11.7 ns     74666667
crc32/braid/12                 28.4 ns         20.3 ns     64000000
crc32/braid/16                 36.9 ns         24.6 ns     28000000
crc32/braid/32                 70.7 ns         41.0 ns     18666667
crc32/braid/64                 56.4 ns         44.9 ns     16000000
crc32/braid/512                 192 ns          123 ns      4072727
crc32/braid/4096               1201 ns          907 ns       896000
crc32/braid/32768              9975 ns         7500 ns       100000
crc32/braid/262144            74687 ns        53013 ns        11200
crc32/braid/4194304         1155435 ns       892857 ns          560
crc32/pclmulqdq/1              3.62 ns         2.77 ns    298666667
crc32/pclmulqdq/8              19.1 ns         14.2 ns     37333333
crc32/pclmulqdq/12             28.2 ns         20.1 ns     28000000
crc32/pclmulqdq/16             37.1 ns         25.5 ns     25088000
crc32/pclmulqdq/32             65.8 ns         48.1 ns     14933333
crc32/pclmulqdq/64             18.2 ns         13.3 ns    112000000
crc32/pclmulqdq/512            34.5 ns         25.7 ns     24888889
crc32/pclmulqdq/4096            162 ns          129 ns      8960000
crc32/pclmulqdq/32768          1189 ns          912 ns       924903
crc32/pclmulqdq/262144         9575 ns         7150 ns        89600
crc32/pclmulqdq/4194304      154119 ns       118583 ns         4480
crc32/vpclmulqdq/1             3.67 ns         2.73 ns    371674074
crc32/vpclmulqdq/8             19.2 ns         13.0 ns     40727273
crc32/vpclmulqdq/12            28.3 ns         22.6 ns     37333333
crc32/vpclmulqdq/16            37.0 ns         30.0 ns     21333333
crc32/vpclmulqdq/32            65.7 ns         46.5 ns     14451613
crc32/vpclmulqdq/64            18.2 ns         14.0 ns     56000000
crc32/vpclmulqdq/512           29.1 ns         24.9 ns     26352941
crc32/vpclmulqdq/4096          84.6 ns         68.0 ns      8960000
crc32/vpclmulqdq/32768          565 ns          410 ns      1600000
crc32/vpclmulqdq/262144        4604 ns         3575 ns       179200
crc32/vpclmulqdq/4194304      90562 ns        73438 ns        10000

PR

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
crc32/braid/1                  3.71 ns         2.23 ns    560000000
crc32/braid/8                  24.1 ns         13.3 ns     44800000
crc32/braid/12                 29.3 ns         22.0 ns     29866667
crc32/braid/16                 38.0 ns         31.3 ns     19478261
crc32/braid/32                 66.3 ns         56.2 ns     10000000
crc32/braid/64                 54.6 ns         39.8 ns     14933333
crc32/braid/512                 191 ns          157 ns      4480000
crc32/braid/4096               1147 ns          942 ns       746667
crc32/braid/32768              9246 ns         6875 ns       100000
crc32/braid/262144            73018 ns        57812 ns        10000
crc32/braid/4194304         1172960 ns      1049805 ns          640
crc32/pclmulqdq/1              3.54 ns         2.61 ns    263529412
crc32/pclmulqdq/8              19.4 ns         16.5 ns     56000000
crc32/pclmulqdq/12             28.0 ns         24.1 ns     29866667
crc32/pclmulqdq/16             38.4 ns         32.0 ns     24888889
crc32/pclmulqdq/32             15.1 ns         12.0 ns     56000000
crc32/pclmulqdq/64             19.2 ns         13.9 ns     53952688
crc32/pclmulqdq/512            34.6 ns         29.6 ns     26352941
crc32/pclmulqdq/4096            162 ns          126 ns      5600000
crc32/pclmulqdq/32768          1163 ns          928 ns       640000
crc32/pclmulqdq/262144         9518 ns         7673 ns        89600
crc32/pclmulqdq/4194304      151493 ns       127645 ns         7467
crc32/vpclmulqdq/1             3.52 ns         2.79 ns    280000000
crc32/vpclmulqdq/8             19.4 ns         15.7 ns     49777778
crc32/vpclmulqdq/12            28.7 ns         23.1 ns     34461538
crc32/vpclmulqdq/16            38.7 ns         31.1 ns     23578947
crc32/vpclmulqdq/32            28.7 ns         23.7 ns     26352941
crc32/vpclmulqdq/64            29.2 ns         25.5 ns     26352941
crc32/vpclmulqdq/512           44.3 ns         38.5 ns     19478261
crc32/vpclmulqdq/4096          92.9 ns         79.5 ns      7466667
crc32/vpclmulqdq/32768          550 ns          500 ns      1000000
crc32/vpclmulqdq/262144        4144 ns         3606 ns       203636
crc32/vpclmulqdq/4194304      75882 ns        66267 ns         8960

@Dead2
Copy link
Member Author

Dead2 commented Jan 25, 2024

@nmoinvaz Did your laptop overheat and throttle or something?

crc32/pclmulqdq/32             15.1 ns         12.0 ns     56000000
crc32/pclmulqdq/64             19.2 ns         13.9 ns     53952688
crc32/pclmulqdq/512            34.6 ns         29.6 ns     26352941

# Why is vpclmulqdq slower for these all of a sudden, they were faster on your previous run. Just a fluke I hope.
crc32/vpclmulqdq/32            28.7 ns         23.7 ns     26352941
crc32/vpclmulqdq/64            29.2 ns         25.5 ns     26352941
crc32/vpclmulqdq/512           44.3 ns         38.5 ns     19478261

@KungFuJesus
Copy link
Contributor

@Dead2 here's what I have so far:

diff --git a/arch/x86/crc32_fold_pclmulqdq_tpl.h b/arch/x86/crc32_fold_pclmulqdq_tpl.h
index 3e799283..2d43f5aa 100644
--- a/arch/x86/crc32_fold_pclmulqdq_tpl.h
+++ b/arch/x86/crc32_fold_pclmulqdq_tpl.h
@@ -26,9 +26,8 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
     __m128i xmm_t0, xmm_t1, xmm_t2, xmm_t3;
     __m128i xmm_crc0, xmm_crc1, xmm_crc2, xmm_crc3;
     __m128i xmm_crc_part = _mm_setzero_si128();
-#ifdef COPY
     char ALIGNED_(16) partial_buf[16] = { 0 };
-#else
+#ifndef COPY
     __m128i xmm_initial = _mm_cvtsi32_si128(init_crc);
     int32_t first = init_crc != 0;
 
@@ -41,12 +40,12 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
     crc32_fold_load((__m128i *)crc->fold, &xmm_crc0, &xmm_crc1, &xmm_crc2, &xmm_crc3);
 
     if (len < 16) {
-#ifdef COPY
         if (len == 0)
             return;
 
         memcpy(partial_buf, src, len);
         xmm_crc_part = _mm_load_si128((const __m128i *)partial_buf);
+#ifdef COPY
         memcpy(dst, partial_buf, len);
 #endif
         goto partial;
@@ -63,7 +62,12 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
 
         if (algn_diff < 4 && init_crc != 0) {
             xmm_t0 = xmm_crc_part;
-            xmm_crc_part = _mm_loadu_si128((__m128i*)src + 1);
+            if (len >= 32) {
+                xmm_crc_part = _mm_loadu_si128((__m128i*)src + 1);
+            } else {
+                memcpy(partial_buf, src + 16, len);
+                xmm_crc_part = _mm_load_si128((__m128i*)partial_buf);
+            }
             fold_1(&xmm_crc0, &xmm_crc1, &xmm_crc2, &xmm_crc3);
             xmm_crc3 = _mm_xor_si128(xmm_crc3, xmm_t0);
             src += 16;
diff --git a/arch/x86/crc32_pclmulqdq_tpl.h b/arch/x86/crc32_pclmulqdq_tpl.h
index 80a35b03..3a4f6af5 100644
--- a/arch/x86/crc32_pclmulqdq_tpl.h
+++ b/arch/x86/crc32_pclmulqdq_tpl.h
@@ -365,7 +365,7 @@ static inline uint32_t crc32_small(uint32_t crc, const uint8_t *buf, size_t len)
 Z_INTERNAL uint32_t CRC32(uint32_t crc32, const uint8_t *buf, size_t len) {
     /* For lens smaller than ~12, crc32_small method is faster.
      * But there are also minimum requirements for the pclmul functions due to alignment */
-    if (len < 32)
+    if (len < 16)
         return crc32_small(crc32, buf, len);
 
     crc32_fold ALIGNED_(16) crc_state;

The area I'm least confident about is where the user provides an initial CRC into the function and for arbitrarily aligned buffers and lengths > 16 but lengths < 32. Do we have coverage of that in tests?

@KungFuJesus
Copy link
Contributor

I put a manual abort statement in the area I wasn't confident about and it did manage to abort in at least one gtest, so perhaps that's something.

@KungFuJesus
Copy link
Contributor

KungFuJesus commented Jan 26, 2024

@Dead2 this seems to be correct but warrants more testing if you can manage:

diff --git a/arch/x86/crc32_fold_pclmulqdq_tpl.h b/arch/x86/crc32_fold_pclmulqdq_tpl.h
index 3e799283..753417ba 100644
--- a/arch/x86/crc32_fold_pclmulqdq_tpl.h
+++ b/arch/x86/crc32_fold_pclmulqdq_tpl.h
@@ -26,9 +26,8 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
     __m128i xmm_t0, xmm_t1, xmm_t2, xmm_t3;
     __m128i xmm_crc0, xmm_crc1, xmm_crc2, xmm_crc3;
     __m128i xmm_crc_part = _mm_setzero_si128();
-#ifdef COPY
     char ALIGNED_(16) partial_buf[16] = { 0 };
-#else
+#ifndef COPY
     __m128i xmm_initial = _mm_cvtsi32_si128(init_crc);
     int32_t first = init_crc != 0;
 
@@ -41,12 +40,12 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
     crc32_fold_load((__m128i *)crc->fold, &xmm_crc0, &xmm_crc1, &xmm_crc2, &xmm_crc3);
 
     if (len < 16) {
-#ifdef COPY
         if (len == 0)
             return;
 
         memcpy(partial_buf, src, len);
         xmm_crc_part = _mm_load_si128((const __m128i *)partial_buf);
+#ifdef COPY
         memcpy(dst, partial_buf, len);
 #endif
         goto partial;
@@ -63,9 +62,23 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
 
         if (algn_diff < 4 && init_crc != 0) {
             xmm_t0 = xmm_crc_part;
-            xmm_crc_part = _mm_loadu_si128((__m128i*)src + 1);
-            fold_1(&xmm_crc0, &xmm_crc1, &xmm_crc2, &xmm_crc3);
-            xmm_crc3 = _mm_xor_si128(xmm_crc3, xmm_t0);
+            if (len >= 32) {
+                xmm_crc_part = _mm_loadu_si128((__m128i*)src + 1);
+                fold_1(&xmm_crc0, &xmm_crc1, &xmm_crc2, &xmm_crc3);
+                xmm_crc3 = _mm_xor_si128(xmm_crc3, xmm_t0);
+            } else {
+                memcpy(partial_buf, src + 16, len - 16);
+                xmm_crc_part = _mm_load_si128((__m128i*)partial_buf);
+                fold_1(&xmm_crc0, &xmm_crc1, &xmm_crc2, &xmm_crc3);
+                xmm_crc3 = _mm_xor_si128(xmm_crc3, xmm_t0);
+                src += 16;
+                len -= 16;
+#ifdef COPY
+                dst -= algn_diff;
+#endif
+                goto partial;
+            }
+
             src += 16;
             len -= 16;
         }
diff --git a/arch/x86/crc32_pclmulqdq_tpl.h b/arch/x86/crc32_pclmulqdq_tpl.h
index 80a35b03..3a4f6af5 100644
--- a/arch/x86/crc32_pclmulqdq_tpl.h
+++ b/arch/x86/crc32_pclmulqdq_tpl.h
@@ -365,7 +365,7 @@ static inline uint32_t crc32_small(uint32_t crc, const uint8_t *buf, size_t len)
 Z_INTERNAL uint32_t CRC32(uint32_t crc32, const uint8_t *buf, size_t len) {
     /* For lens smaller than ~12, crc32_small method is faster.
      * But there are also minimum requirements for the pclmul functions due to alignment */
-    if (len < 32)
+    if (len < 16)
         return crc32_small(crc32, buf, len);
 
     crc32_fold ALIGNED_(16) crc_state;

Or, feel free to clean the logic up a bit, it's a little messy, I know.

@nmoinvaz
Copy link
Member

nmoinvaz commented Jan 26, 2024

@KungFuJesus you can specify "```diff" markdown and it will color code your diffs for you.

diff --git a/arch/x86/crc32_fold_pclmulqdq_tpl.h b/arch/x86/crc32_fold_pclmulqdq_tpl.h
index 3e799283..753417ba 100644
--- a/arch/x86/crc32_fold_pclmulqdq_tpl.h
+++ b/arch/x86/crc32_fold_pclmulqdq_tpl.h
@@ -26,9 +26,8 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
     __m128i xmm_t0, xmm_t1, xmm_t2, xmm_t3;
     __m128i xmm_crc0, xmm_crc1, xmm_crc2, xmm_crc3;
     __m128i xmm_crc_part = _mm_setzero_si128();
...

@Dead2
Copy link
Member Author

Dead2 commented Jan 27, 2024

Updated benchmarks on my i9-9000K. (Added the last part with patch from @KungFuJesus)

Develop

crc32/braid/1                             7.15 ns         7.14 ns     97662151
crc32/braid/8                             18.7 ns         18.6 ns     37574368
crc32/braid/12                            27.8 ns         27.8 ns     25212989
crc32/braid/16                            36.8 ns         36.7 ns     19056127
crc32/braid/32                            73.0 ns         72.9 ns      9607586
crc32/braid/64                             136 ns          135 ns      5170815
crc32/braid/512                            464 ns          463 ns      1505207
crc32/braid/4096                          2956 ns         2952 ns       237152
crc32/braid/32768                        23070 ns        23034 ns        30388
crc32/braid/262144                      184311 ns       184003 ns         3805
crc32/braid/4194304                    2952183 ns      2946962 ns          238

crc32/pclmulqdq/1                         8.65 ns         8.64 ns     80827224
crc32/pclmulqdq/8                         18.6 ns         18.6 ns     37613951
crc32/pclmulqdq/12                        27.8 ns         27.7 ns     25227336
crc32/pclmulqdq/16                        36.8 ns         36.7 ns     19073905
crc32/pclmulqdq/32                        72.9 ns         72.8 ns      9614650
crc32/pclmulqdq/64                        28.1 ns         28.0 ns     24963663
crc32/pclmulqdq/512                       49.2 ns         49.1 ns     14259025
crc32/pclmulqdq/4096                       226 ns          225 ns      3106307
crc32/pclmulqdq/32768                     1607 ns         1604 ns       436387
crc32/pclmulqdq/262144                   13467 ns        13443 ns        52067
crc32/pclmulqdq/4194304                 226517 ns       226075 ns         3098

   text    data     bss     dec     hex filename
 131780    1352       8  133140   20814 libz-ng.so.2

PR #1650 small-crc32 below len 32

crc32/pclmulqdq/1                         5.04 ns         5.03 ns    139252670  ## Faster
crc32/pclmulqdq/8                         18.7 ns         18.6 ns     37548313
crc32/pclmulqdq/12                        27.8 ns         27.7 ns     25245168
crc32/pclmulqdq/16                        36.8 ns         36.7 ns     19055822
crc32/pclmulqdq/32                        25.0 ns         24.9 ns     28087357  ## Faster
crc32/pclmulqdq/64                        28.2 ns         28.1 ns     24895247
crc32/pclmulqdq/512                       49.2 ns         49.1 ns     14252094
crc32/pclmulqdq/4096                       226 ns          225 ns      3103245
crc32/pclmulqdq/32768                     1605 ns         1602 ns       436962
crc32/pclmulqdq/262144                   13590 ns        13564 ns        51610
crc32/pclmulqdq/4194304                 226707 ns       226219 ns         3096

   text    data     bss     dec     hex filename
 132868    1352       8  134228   20c54 libz-ng.so.2

PR #1650 small-crc32 below len 16 + expanded pclmulqdq handling for len 16-31

crc32/pclmulqdq/1             4.77 ns         4.76 ns    146925163
crc32/pclmulqdq/8             18.7 ns         18.6 ns     37549648
crc32/pclmulqdq/12            27.8 ns         27.7 ns     25245009
crc32/pclmulqdq/16            26.1 ns         26.1 ns     26857766
crc32/pclmulqdq/32            27.0 ns         27.0 ns     25951416
crc32/pclmulqdq/64            29.1 ns         29.0 ns     24132333
crc32/pclmulqdq/512           50.1 ns         50.0 ns     13997529
crc32/pclmulqdq/4096           227 ns          226 ns      3090671
crc32/pclmulqdq/32768         1607 ns         1604 ns       436398
crc32/pclmulqdq/262144       13564 ns        13537 ns        51720
crc32/pclmulqdq/4194304     226624 ns       226151 ns         3095

   text    data     bss     dec     hex filename
 133252    1352       8  134612   20dd4 libz-ng.so.2

Very nice improvement at 16 bytes, and likely much better at 31 bytes. Good work @KungFuJesus.
The penalty for >32 bytes seems pretty small.

@phprus
Copy link
Contributor

phprus commented Jan 28, 2024

CI error:

https://github.com/zlib-ng/zlib-ng/actions/runs/7681876330/job/20935515796?pr=1650

fuzzer_checksum: /home/runner/work/zlib-ng/zlib-ng/arch/x86/crc32_fold_pclmulqdq_tpl.h:38:
void crc32_fold_pclmulqdq(crc32_fold *, const uint8_t *, size_t, uint32_t): Assertion `len >= 31 || first == 0' failed.

@Dead2
Copy link
Member Author

Dead2 commented Jan 28, 2024

CI error:
https://github.com/zlib-ng/zlib-ng/actions/runs/7681876330/job/20935515796?pr=1650

fuzzer_checksum: /home/runner/work/zlib-ng/zlib-ng/arch/x86/crc32_fold_pclmulqdq_tpl.h:38:
void crc32_fold_pclmulqdq(crc32_fold *, const uint8_t *, size_t, uint32_t): Assertion `len >= 31 || first == 0' failed.

Yeah, the assertion is a known failure as it has not been updated yet, nor the comment above it.

@mtl1979
Copy link
Collaborator

mtl1979 commented Jan 31, 2024

Coverity scan seems to suggest the current tests are not hitting all of the changed code... I don't know if it's glitch in Coverity or just that those are corner cases that are not hit in simple tests.

@Dead2
Copy link
Member Author

Dead2 commented Jan 31, 2024

Coverity scan seems to suggest the current tests are not hitting all of the changed code... I don't know if it's glitch in Coverity or just that those are corner cases that are not hit in simple tests.

Codecov sees 100% coverage on all changes from the PR, and an overall +0.28% coverage: https://app.codecov.io/gh/zlib-ng/zlib-ng/pull/1650

Do you have a reference to what you are seeing with Coverity scan?

Copy link
Contributor

@KungFuJesus KungFuJesus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Dead2 Dead2 merged commit 9076047 into develop Feb 2, 2024
271 checks passed
@Dead2 Dead2 deleted the speedup-crc32 branch February 22, 2024 18:57
@Dead2 Dead2 mentioned this pull request May 30, 2024
@Dead2 Dead2 mentioned this pull request Jun 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants