Improve AVX512 hashing #25

Merged: zeebo merged 1 commit into zeebo:master from klauspost:avx512-pipeline on Jul 24, 2025

@klauspost klauspost commented Jul 24, 2025

Combine 2 iterations into one, instead of relying on out-of-order execution.

Roughly 1.2x to 1.4x faster for inputs from 8 KB to 10 MB, with little change at 1 KB and at 100 MB:

```
BenchmarkFixed128/1024-AVX512/default-32          44128.15      44238.53      1.00x
BenchmarkFixed128/1024-AVX512/seed-32             34396.41      34763.09      1.01x
BenchmarkFixed128/8192-AVX512/default-32          96538.92      119978.21     1.24x
BenchmarkFixed128/8192-AVX512/seed-32             90018.94      108670.69     1.21x
BenchmarkFixed128/102400-AVX512/default-32        110115.61     149139.18     1.35x
BenchmarkFixed128/102400-AVX512/seed-32           107460.24     143022.01     1.33x
BenchmarkFixed128/1024000-AVX512/default-32       109639.48     151057.90     1.38x
BenchmarkFixed128/1024000-AVX512/seed-32          110066.84     150997.76     1.37x
BenchmarkFixed128/10240000-AVX512/default-32      108251.15     133473.95     1.23x
BenchmarkFixed128/10240000-AVX512/seed-32         108322.75     133864.06     1.24x
BenchmarkFixed128/102400000-AVX512/default-32     65743.82      67059.45      1.02x
BenchmarkFixed128/102400000-AVX512/seed-32        62060.75      67780.78      1.09x
BenchmarkFixed64/1024-AVX512/default-32           44285.01      44405.68      1.00x
BenchmarkFixed64/1024-AVX512/seed-32              33900.20      34577.14      1.02x
BenchmarkFixed64/8192-AVX512/default-32           95319.96      120186.59     1.26x
BenchmarkFixed64/8192-AVX512/seed-32              82473.29      106373.01     1.29x
BenchmarkFixed64/102400-AVX512/default-32         110099.86     148809.00     1.35x
BenchmarkFixed64/102400-AVX512/seed-32            107085.63     143698.12     1.34x
BenchmarkFixed64/1024000-AVX512/default-32        108111.70     144013.73     1.33x
BenchmarkFixed64/1024000-AVX512/seed-32           108732.88     145428.61     1.34x
BenchmarkFixed64/10240000-AVX512/default-32       109531.61     136144.53     1.24x
BenchmarkFixed64/10240000-AVX512/seed-32          108423.07     136779.23     1.26x
BenchmarkFixed64/102400000-AVX512/default-32      65866.30      67455.55      1.02x
BenchmarkFixed64/102400000-AVX512/seed-32         59265.74      67757.55      1.14x
```
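To illustrate what combining two iterations means, here is a minimal Go sketch. It is not the assembly from this PR: `mixSingle`, `mixPaired`, and the constant `prime` are hypothetical names used only for illustration, and the toy mixing step stands in for the real accumulate step. The idea is that each loop pass handles two inputs with two independent accumulators, so the two dependency chains are explicit in the code rather than left for the CPU to overlap.

```go
package main

import "fmt"

// prime is an arbitrary odd 64-bit multiplier, used for illustration only.
const prime = 0x9E3779B185EBCA87

// mixSingle folds one word per loop pass into a single accumulator.
// Each pass depends on the previous one through acc.
func mixSingle(data []uint64) uint64 {
	var acc uint64
	for _, v := range data {
		acc += v * prime
	}
	return acc
}

// mixPaired handles two words per loop pass with two independent
// accumulators, then combines them at the end. Because wrapping
// addition is associative and commutative, the result matches mixSingle.
func mixPaired(data []uint64) uint64 {
	var acc0, acc1 uint64
	i := 0
	for ; i+2 <= len(data); i += 2 {
		acc0 += data[i] * prime
		acc1 += data[i+1] * prime
	}
	// Handle a trailing element when the length is odd.
	if i < len(data) {
		acc0 += data[i] * prime
	}
	return acc0 + acc1
}

func main() {
	data := make([]uint64, 1001)
	for i := range data {
		data[i] = uint64(i) + 1
	}
	fmt.Println(mixSingle(data) == mixPaired(data)) // prints true
}
```

The paired form needs twice the live state per pass, which is cheap when plenty of registers are available.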

Add AVX512 block hasher.

```
BenchmarkHasher64/1024/avx512/plain-32            29767.29      30678.39      1.03x
BenchmarkHasher64/1024/avx512/seed-32             27718.80      28077.80      1.01x
BenchmarkHasher64/4096/avx512/plain-32            23373.30      53449.36      2.29x
BenchmarkHasher64/4096/avx512/seed-32             22476.05      52516.27      2.34x
BenchmarkHasher64/16384/avx512/plain-32           23920.83      92829.28      3.88x
BenchmarkHasher64/16384/avx512/seed-32            22716.19      78889.04      3.47x
BenchmarkHasher64/65536/avx512/plain-32           23864.11      109047.09     4.57x
BenchmarkHasher64/65536/avx512/seed-32            23268.23      102684.80     4.41x
BenchmarkHasher64/262144/avx512/plain-32          24142.71      114728.56     4.75x
BenchmarkHasher64/262144/avx512/seed-32           22919.09      110129.25     4.81x
BenchmarkHasher64/1048576/avx512/plain-32         23889.66      111943.95     4.69x
BenchmarkHasher64/1048576/avx512/seed-32          23104.38      106299.34     4.60x
BenchmarkHasher64/4194304/avx512/plain-32         24217.09      111626.61     4.61x
BenchmarkHasher64/4194304/avx512/seed-32          23279.68      105657.91     4.54x
BenchmarkHasher64/16777216/avx512/plain-32        24077.75      112962.97     4.69x
BenchmarkHasher64/16777216/avx512/seed-32         23330.17      106772.89     4.58x
BenchmarkHasher64/67108864/avx512/plain-32        23447.80      72003.16      3.07x
BenchmarkHasher64/67108864/avx512/seed-32         22440.39      71853.28      3.20x
BenchmarkHasher64/268435456/avx512/plain-32       23381.81      57520.80      2.46x
BenchmarkHasher64/268435456/avx512/seed-32        22537.73      57488.81      2.55x
BenchmarkHasher128/16/avx512/plain-32             2623.78       2614.76       1.00x
BenchmarkHasher128/16/avx512/seed-32              2247.58       2255.99       1.00x
BenchmarkHasher128/64/avx512/plain-32             8441.35       8569.21       1.02x
BenchmarkHasher128/64/avx512/seed-32              7928.51       8207.56       1.04x
BenchmarkHasher128/256/avx512/plain-32            11685.72      11960.20      1.02x
BenchmarkHasher128/256/avx512/seed-32             8656.83       9076.80       1.05x
BenchmarkHasher128/1024/avx512/plain-32           35400.36      36663.11      1.04x
BenchmarkHasher128/1024/avx512/seed-32            26047.85      27162.07      1.04x
BenchmarkHasher128/4096/avx512/plain-32           22589.17      52415.58      2.32x
BenchmarkHasher128/4096/avx512/seed-32            20721.40      51786.58      2.50x
BenchmarkHasher128/16384/avx512/plain-32          24000.61      89452.72      3.73x
BenchmarkHasher128/16384/avx512/seed-32           23326.01      87501.12      3.75x
BenchmarkHasher128/65536/avx512/plain-32          23894.06      106572.62     4.46x
BenchmarkHasher128/65536/avx512/seed-32           23225.19      104809.13     4.51x
BenchmarkHasher128/262144/avx512/plain-32         24213.35      113692.11     4.70x
BenchmarkHasher128/262144/avx512/seed-32          23499.29      111420.89     4.74x
BenchmarkHasher128/1048576/avx512/plain-32        24279.03      110841.82     4.57x
BenchmarkHasher128/1048576/avx512/seed-32         23345.59      108720.33     4.66x
BenchmarkHasher128/4194304/avx512/plain-32        24398.75      109798.04     4.50x
BenchmarkHasher128/4194304/avx512/seed-32         23379.07      107000.69     4.58x
BenchmarkHasher128/16777216/avx512/plain-32       24183.66      110475.47     4.57x
BenchmarkHasher128/16777216/avx512/seed-32        23469.79      107950.68     4.60x
BenchmarkHasher128/67108864/avx512/plain-32       23300.86      69514.94      2.98x
BenchmarkHasher128/67108864/avx512/seed-32        22750.83      70221.43      3.09x
BenchmarkHasher128/268435456/avx512/plain-32      22398.46      56600.03      2.53x
BenchmarkHasher128/268435456/avx512/seed-32       22372.26      56274.37      2.52x
```
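As a rough picture of how a streaming hasher can feed a bulk block routine, here is a toy Go sketch. Everything in it is an assumption made for illustration: `toyHasher`, `hashBlocks`, and the `blockSize` of 1024 are hypothetical, and the byte-folding function merely stands in for a vectorized routine. It only shows the buffering pattern: top up a partial block, pass whole blocks straight through, keep the tail for later.

```go
package main

import "fmt"

// blockSize is a hypothetical block length chosen for this sketch.
const blockSize = 1024

// toyHasher buffers input so the bulk routine only sees whole blocks.
type toyHasher struct {
	acc uint64
	buf []byte // always shorter than blockSize between calls
}

// hashBlocks stands in for a vectorized bulk routine (toy mixing only).
func hashBlocks(acc uint64, p []byte) uint64 {
	for _, b := range p {
		acc = acc*31 + uint64(b)
	}
	return acc
}

// Write consumes p, dispatching whole blocks and buffering the remainder.
func (h *toyHasher) Write(p []byte) (int, error) {
	n := len(p)
	// Top up a partially filled buffer first.
	if len(h.buf) > 0 {
		need := blockSize - len(h.buf)
		if need > len(p) {
			h.buf = append(h.buf, p...)
			return n, nil
		}
		h.buf = append(h.buf, p[:need]...)
		h.acc = hashBlocks(h.acc, h.buf)
		h.buf = h.buf[:0]
		p = p[need:]
	}
	// Pass all whole blocks through without copying them.
	if whole := len(p) / blockSize * blockSize; whole > 0 {
		h.acc = hashBlocks(h.acc, p[:whole])
		p = p[whole:]
	}
	// Keep the tail for the next Write or for Sum64.
	h.buf = append(h.buf, p...)
	return n, nil
}

// Sum64 folds in the buffered tail without modifying the hasher state.
func (h *toyHasher) Sum64() uint64 {
	return hashBlocks(h.acc, h.buf)
}

func main() {
	data := make([]byte, 5000)
	for i := range data {
		data[i] = byte(i*7 + 3)
	}
	var oneShot, chunked toyHasher
	oneShot.Write(data)
	for i := 0; i < len(data); i += 333 {
		end := i + 333
		if end > len(data) {
			end = len(data)
		}
		chunked.Write(data[i:end])
	}
	fmt.Println(oneShot.Sum64() == chunked.Sum64()) // prints true
}
```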

"avx512" is compared against Go in the table above. It should really be compared against the AVX2 code, where the speedup is still there but more modest:

```
BenchmarkHasher128/65536/go/plain-32              24031.69      23695.38      0.99x
BenchmarkHasher128/65536/go/seed-32               23210.74      23037.17      0.99x
BenchmarkHasher128/65536/avx512/plain-32          23894.06      106572.62     4.46x
BenchmarkHasher128/65536/avx512/seed-32           23225.19      104809.13     4.51x
BenchmarkHasher128/65536/avx2/plain-32            82474.73      82628.79      1.00x
BenchmarkHasher128/65536/avx2/seed-32             81615.29      80617.75      0.99x
BenchmarkHasher128/65536/sse2/plain-32            46018.63      45949.57      1.00x
BenchmarkHasher128/65536/sse2/seed-32             45743.47      45791.77      1.00x
```
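To put a number on "more modest", the short Go sketch below recomputes the ratios from the rows above. It assumes the two numeric columns are throughput before and after this change (higher is better), which matches the speedup column being the second divided by the first. The helper name `speedup` is hypothetical.

```go
package main

import "fmt"

// speedup returns newRate/oldRate, the same ratio as the last table column.
func speedup(oldRate, newRate float64) float64 {
	return newRate / oldRate
}

func main() {
	// AVX512 block hasher versus the Go baseline (65536-byte rows).
	fmt.Printf("avx512 vs go,   plain: %.2fx\n", speedup(23894.06, 106572.62))
	fmt.Printf("avx512 vs go,   seed:  %.2fx\n", speedup(23225.19, 104809.13))

	// AVX512 versus AVX2, using the second column of each row.
	fmt.Printf("avx512 vs avx2, plain: %.2fx\n", speedup(82628.79, 106572.62))
	fmt.Printf("avx512 vs avx2, seed:  %.2fx\n", speedup(80617.75, 104809.13))
}
```

Under that assumption, AVX512 comes out at roughly 1.3x over AVX2 at this size, compared with about 4.5x over the Go baseline.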

Tested on AMD Ryzen 9 9950X 16-Core Processor.

@zeebo zeebo merged commit 77e65e1 into zeebo:master Jul 24, 2025
4 checks passed

zeebo commented Jul 24, 2025

avx512 sure does have a lot of registers 😄

@klauspost klauspost deleted the avx512-pipeline branch July 24, 2025 20:37