Speed up crc32_[v]pclmulqdq on small strings. #1650
Conversation
I experimented with using DO8 in there as well, but on my CPU at least it made zero difference, only adding code size and slowing down the len=1 case by ~50%.
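(For context: DO8 refers to the classic zlib-style unrolled byte loop. Below is a minimal sketch of what that pattern typically looks like, assuming a precomputed `crc_table`; it is an illustration, not the crc_braid code itself.)

```c
#include <stddef.h>
#include <stdint.h>

extern const uint32_t crc_table[256];   /* assumed: precomputed CRC-32 table */

/* Classic zlib-style byte-at-a-time CRC with 8x unrolling (illustrative). */
#define DO1 (crc = crc_table[(crc ^ *buf++) & 0xff] ^ (crc >> 8))
#define DO8 do { DO1; DO1; DO1; DO1; DO1; DO1; DO1; DO1; } while (0)

static uint32_t crc32_bytewise(uint32_t crc, const uint8_t *buf, size_t len) {
    crc = ~crc;
    while (len >= 8) {   /* unrolled main loop */
        DO8;
        len -= 8;
    }
    while (len--)        /* byte-at-a-time tail */
        DO1;
    return ~crc;
}
```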
Weird, I was almost certain that the pclmulqdq variant of crc32 was slower for anything smaller than 64 bytes. I can't verify anything for the v variants as I don't have a CPU that supports them, but I'll have a look. Oh I see, you wrote a new bypass method.
I actually came across this while attempting to make crc_small, because I wanted to get rid of the crc_braid dependency (for another PR in the works), and when I enabled additional benchmarks to make sure it was not slower, I got quite a surprise.
I also tested using crc_braid with the smaller lengths, and that is actually also faster (only changed the …). I did my testing on an i9-9900K; I also don't have anything capable of vpclmulqdq.
Hmm, the CI failures there are probably what I was referring to with the size requirement. Let me look at that code again, it's been a while.
Yeah, so it's definitely required to be at least 31 bytes for the load. The reason for this is what we do for aligning the loads. Now, unaligned vs aligned loads have made very little difference for a while now, but I wouldn't be shocked if it were something. The CLMUL-capable CPUs are all post-Nehalem, though.
That's what I was afraid of. It means we get a very nice speedup for sizes 32-63 bytes though, while 12-31 remain slow, unless someone has other ideas to improve on that. Updated the PR to always do …
Yeah, the idea behind the aligning load was that every load forward from that point is also aligned. We could do something where we handle extremely short strings by just doing an unaligned version. We do something somewhat similar in the adler checksum for SSSE3, which accommodates CPUs where alignment really makes or breaks it.
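(A rough sketch of why the aligning-load scheme implies the ~31-byte minimum discussed above; variable names here are illustrative, not zlib-ng's.)

```c
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* Sketch: one unaligned 16-byte load covers the buffer start, then the
 * pointer is rounded up to a 16-byte boundary so every later load is aligned. */
static void aligning_load_sketch(const uint8_t *src, size_t len) {
    __m128i first = _mm_loadu_si128((const __m128i *)src);   /* needs len >= 16 */
    (void)first;

    /* Bytes to the next 16-byte boundary: 0..15. */
    size_t algn_diff = (16 - ((uintptr_t)src & 15)) & 15;

    /* The first aligned load reads 16 bytes starting at src + algn_diff,
     * so it needs len >= algn_diff + 16; worst case 15 + 16 = 31 bytes. */
    if (len >= algn_diff + 16) {
        __m128i next = _mm_load_si128((const __m128i *)(src + algn_diff));
        (void)next;
    }
}
```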
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           develop    #1650      +/-  ##
===========================================
+ Coverage    83.13%   83.41%   +0.27%
===========================================
  Files          135      135
  Lines        10895    11126     +231
  Branches      2816     2988     +172
===========================================
+ Hits          9058     9281     +223
- Misses        1127     1130       +3
- Partials       710      715       +5
```

☔ View full report in Codecov by Sentry.
@Dead2 I will benchmark on my laptop that has AVX512.
(Force-pushed from ee7f79b to ebd78ca.)
Removed the extra header file, as it is not really needed any more (already so many crc files 😅).
Rebased and split into two commits.
Updated benchmarks on my i9-9900K.

[Benchmark tables: PR vs DEVELOP]
Hmm, using Google benchmarks, it does not go through …
Your develop benchmarks don't have the change to benchmark lengths (now a separate commit to make that easier).
They will if they call …
@nmoinvaz Actually you do have the speedup there, just nothing to compare it to.
The 32 test is the fastest because it is the shortest one now using the [v]pclmulqdq code. Previously the smallest was the 64 size, and the 32 would be around 78ns or so on your platform instead of the current 14.7ns, so ~5x faster.
Running through the larger 740-something test batteries, but it'd probably be good to see it pass CI as well. Yeah, so part of this was due to the fact that we had to do some back-bending to make sure that the copy+CRC checksum worked for small lengths anyway. It was only when we ended up combining the code into the template maze we have now that it seemed silly we weren't accommodating the 16 < len < 31 cases. I think I have something here that works, but I'd feel better with more testing.
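(For readers following along, the "template maze" mentioned above is roughly an include-twice pattern; the file name and `COPY` macro are taken from the diff later in this thread, but the exact layout here is an assumption, not the actual zlib-ng source.)

```c
/* Sketch only: the shared body lives in crc32_fold_pclmulqdq_tpl.h and is
 * compiled twice, with #ifdef COPY selecting the copy+CRC-specific pieces. */
#define COPY
#include "crc32_fold_pclmulqdq_tpl.h"   /* instantiates the fold-and-copy variant */
#undef COPY
#include "crc32_fold_pclmulqdq_tpl.h"   /* instantiates the CRC-only variant */
```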
I see, I'll rerun the benchmarks in a bit with that one commit pulled into develop.
[Benchmark tables: DEVELOP vs PR]
@nmoinvaz Did your laptop overheat and throttle or something?
@Dead2 here's what I have so far:
The area I'm least confident about is where the user provides an initial CRC to the function, with arbitrarily aligned buffers and lengths > 16 but < 32. Do we have coverage of that in tests?
I put a manual abort statement in the area I wasn't confident about and it did manage to abort in at least one gtest, so perhaps that's something.
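(A hedged sketch of a test for the case described above: nonzero initial CRC, every alignment offset, and lengths 17-31, checked against a plain bitwise CRC-32 oracle. `crc32_under_test` is a stand-in for whichever zlib-ng entry point is being exercised; this is not an existing gtest.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain bitwise reflected CRC-32, used only as a reference oracle. */
static uint32_t crc32_ref(uint32_t crc, const uint8_t *buf, size_t len) {
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
    }
    return ~crc;
}

/* Stand-in for the pclmulqdq-accelerated path under test. */
extern uint32_t crc32_under_test(uint32_t crc, const uint8_t *buf, size_t len);

static void test_small_lengths_with_init_crc(void) {
    uint8_t storage[64];
    for (size_t i = 0; i < sizeof(storage); i++)
        storage[i] = (uint8_t)(i * 131 + 7);

    for (size_t offset = 0; offset < 16; offset++) {       /* every alignment */
        for (size_t len = 17; len <= 31; len++) {           /* the suspect range */
            const uint8_t *p = storage + offset;
            uint32_t init = 0xDEADBEEFu;                     /* nonzero initial CRC */
            assert(crc32_under_test(init, p, len) == crc32_ref(init, p, len));
        }
    }
}
```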
@Dead2 this seems to be correct but warrants more testing if you can manage:

```diff
diff --git a/arch/x86/crc32_fold_pclmulqdq_tpl.h b/arch/x86/crc32_fold_pclmulqdq_tpl.h
index 3e799283..753417ba 100644
--- a/arch/x86/crc32_fold_pclmulqdq_tpl.h
+++ b/arch/x86/crc32_fold_pclmulqdq_tpl.h
@@ -26,9 +26,8 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
     __m128i xmm_t0, xmm_t1, xmm_t2, xmm_t3;
     __m128i xmm_crc0, xmm_crc1, xmm_crc2, xmm_crc3;
     __m128i xmm_crc_part = _mm_setzero_si128();
-#ifdef COPY
     char ALIGNED_(16) partial_buf[16] = { 0 };
-#else
+#ifndef COPY
     __m128i xmm_initial = _mm_cvtsi32_si128(init_crc);
     int32_t first = init_crc != 0;
@@ -41,12 +40,12 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
     crc32_fold_load((__m128i *)crc->fold, &xmm_crc0, &xmm_crc1, &xmm_crc2, &xmm_crc3);
     if (len < 16) {
-#ifdef COPY
         if (len == 0)
             return;
         memcpy(partial_buf, src, len);
         xmm_crc_part = _mm_load_si128((const __m128i *)partial_buf);
+#ifdef COPY
         memcpy(dst, partial_buf, len);
 #endif
         goto partial;
@@ -63,9 +62,23 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
     if (algn_diff < 4 && init_crc != 0) {
         xmm_t0 = xmm_crc_part;
-        xmm_crc_part = _mm_loadu_si128((__m128i*)src + 1);
-        fold_1(&xmm_crc0, &xmm_crc1, &xmm_crc2, &xmm_crc3);
-        xmm_crc3 = _mm_xor_si128(xmm_crc3, xmm_t0);
+        if (len >= 32) {
+            xmm_crc_part = _mm_loadu_si128((__m128i*)src + 1);
+            fold_1(&xmm_crc0, &xmm_crc1, &xmm_crc2, &xmm_crc3);
+            xmm_crc3 = _mm_xor_si128(xmm_crc3, xmm_t0);
+        } else {
+            memcpy(partial_buf, src + 16, len - 16);
+            xmm_crc_part = _mm_load_si128((__m128i*)partial_buf);
+            fold_1(&xmm_crc0, &xmm_crc1, &xmm_crc2, &xmm_crc3);
+            xmm_crc3 = _mm_xor_si128(xmm_crc3, xmm_t0);
+            src += 16;
+            len -= 16;
+#ifdef COPY
+            dst -= algn_diff;
+#endif
+            goto partial;
+        }
+
         src += 16;
         len -= 16;
     }
diff --git a/arch/x86/crc32_pclmulqdq_tpl.h b/arch/x86/crc32_pclmulqdq_tpl.h
index 80a35b03..3a4f6af5 100644
--- a/arch/x86/crc32_pclmulqdq_tpl.h
+++ b/arch/x86/crc32_pclmulqdq_tpl.h
@@ -365,7 +365,7 @@ static inline uint32_t crc32_small(uint32_t crc, const uint8_t *buf, size_t len)
 Z_INTERNAL uint32_t CRC32(uint32_t crc32, const uint8_t *buf, size_t len) {
     /* For lens smaller than ~12, crc32_small method is faster.
      * But there are also minimum requirements for the pclmul functions due to alignment */
-    if (len < 32)
+    if (len < 16)
         return crc32_small(crc32, buf, len);
     crc32_fold ALIGNED_(16) crc_state;
```

Or, feel free to clean the logic up a bit, it's a little messy, I know.
@KungFuJesus you can specify "```diff" markdown and it will color code your diffs for you.

```diff
diff --git a/arch/x86/crc32_fold_pclmulqdq_tpl.h b/arch/x86/crc32_fold_pclmulqdq_tpl.h
index 3e799283..753417ba 100644
--- a/arch/x86/crc32_fold_pclmulqdq_tpl.h
+++ b/arch/x86/crc32_fold_pclmulqdq_tpl.h
@@ -26,9 +26,8 @@ Z_INTERNAL void CRC32_FOLD(crc32_fold *crc, const uint8_t *src, size_t len, uint
     __m128i xmm_t0, xmm_t1, xmm_t2, xmm_t3;
     __m128i xmm_crc0, xmm_crc1, xmm_crc2, xmm_crc3;
     __m128i xmm_crc_part = _mm_setzero_si128();
...
```
Updated benchmarks on my i9-9900K. (Added the last part with the patch from @KungFuJesus.)

[Benchmark tables: Develop · PR #1650 small-crc32 below len 32 · PR #1650 small-crc32 below len 16 + expanded pclmulqdq handling for len 16-31]

Very nice improvement at 16 bytes, and likely much better at 31 bytes. Good work @KungFuJesus.
CI error: https://github.com/zlib-ng/zlib-ng/actions/runs/7681876330/job/20935515796?pr=1650
Yeah, the assertion is a known failure, as it has not been updated yet, nor has the comment above it.
Coverity scan seems to suggest the current tests are not hitting all of the changed code... I don't know if it's a glitch in Coverity or just that those are corner cases that are not hit in simple tests.
Codecov sees 100% coverage on all changes from the PR, and an overall +0.28% coverage: https://app.codecov.io/gh/zlib-ng/zlib-ng/pull/1650. Do you have a reference to what you are seeing with the Coverity scan?
LGTM
Speeds up crc32_[v]pclmulqdq by letting them run on all strings bigger than 32 bytes instead of 64 bytes, and adding a simpler fallback that avoids spending time trying to handle big strings.
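(A minimal sketch of the shape described above: a simple short-input fallback plus the lowered cutoff before handing off to the folding code. The function names echo this thread, but the bodies are illustrative stand-ins, not the PR's implementation.)

```c
#include <stddef.h>
#include <stdint.h>

/* Simple fallback for short inputs: no chunking or braid tables, just a
 * tight bitwise loop (illustrative stand-in for crc32_small). */
static uint32_t crc32_small_sketch(uint32_t crc, const uint8_t *buf, size_t len) {
    crc = ~crc;
    while (len--) {
        crc ^= *buf++;
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
    }
    return ~crc;
}

/* Folding [v]pclmulqdq path, assumed to exist elsewhere. */
extern uint32_t crc32_pclmulqdq_fold(uint32_t crc, const uint8_t *buf, size_t len);

static uint32_t crc32_dispatch(uint32_t crc, const uint8_t *buf, size_t len) {
    /* Cutoff lowered so 32..63-byte inputs also take the folding path. */
    if (len < 32)
        return crc32_small_sketch(crc, buf, len);
    return crc32_pclmulqdq_fold(crc, buf, len);
}
```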