Improve AVX512 hashing #25
Merged
Combine 2 iterations into one, instead of relying on out-of-order execution.

Roughly 1.2x to 1.4x faster for sizes 8192 through 10240000, and 1.00x to 1.14x at 1024 and 102400000:

```
BenchmarkFixed128/1024-AVX512/default-32 44128.15 44238.53 1.00x
BenchmarkFixed128/1024-AVX512/seed-32 34396.41 34763.09 1.01x
BenchmarkFixed128/8192-AVX512/default-32 96538.92 119978.21 1.24x
BenchmarkFixed128/8192-AVX512/seed-32 90018.94 108670.69 1.21x
BenchmarkFixed128/102400-AVX512/default-32 110115.61 149139.18 1.35x
BenchmarkFixed128/102400-AVX512/seed-32 107460.24 143022.01 1.33x
BenchmarkFixed128/1024000-AVX512/default-32 109639.48 151057.90 1.38x
BenchmarkFixed128/1024000-AVX512/seed-32 110066.84 150997.76 1.37x
BenchmarkFixed128/10240000-AVX512/default-32 108251.15 133473.95 1.23x
BenchmarkFixed128/10240000-AVX512/seed-32 108322.75 133864.06 1.24x
BenchmarkFixed128/102400000-AVX512/default-32 65743.82 67059.45 1.02x
BenchmarkFixed128/102400000-AVX512/seed-32 62060.75 67780.78 1.09x
BenchmarkFixed64/1024-AVX512/default-32 44285.01 44405.68 1.00x
BenchmarkFixed64/1024-AVX512/seed-32 33900.20 34577.14 1.02x
BenchmarkFixed64/8192-AVX512/default-32 95319.96 120186.59 1.26x
BenchmarkFixed64/8192-AVX512/seed-32 82473.29 106373.01 1.29x
BenchmarkFixed64/102400-AVX512/default-32 110099.86 148809.00 1.35x
BenchmarkFixed64/102400-AVX512/seed-32 107085.63 143698.12 1.34x
BenchmarkFixed64/1024000-AVX512/default-32 108111.70 144013.73 1.33x
BenchmarkFixed64/1024000-AVX512/seed-32 108732.88 145428.61 1.34x
BenchmarkFixed64/10240000-AVX512/default-32 109531.61 136144.53 1.24x
BenchmarkFixed64/10240000-AVX512/seed-32 108423.07 136779.23 1.26x
BenchmarkFixed64/102400000-AVX512/default-32 65866.30 67455.55 1.02x
BenchmarkFixed64/102400000-AVX512/seed-32 59265.74 67757.55 1.14x
```

Add AVX512 block hasher.

```
BenchmarkHasher64/1024/avx512/plain-32 29767.29 30678.39 1.03x
BenchmarkHasher64/1024/avx512/seed-32 27718.80 28077.80 1.01x
BenchmarkHasher64/4096/avx512/plain-32 23373.30 53449.36 2.29x
BenchmarkHasher64/4096/avx512/seed-32 22476.05 52516.27 2.34x
BenchmarkHasher64/16384/avx512/plain-32 23920.83 92829.28 3.88x
BenchmarkHasher64/16384/avx512/seed-32 22716.19 78889.04 3.47x
BenchmarkHasher64/65536/avx512/plain-32 23864.11 109047.09 4.57x
BenchmarkHasher64/65536/avx512/seed-32 23268.23 102684.80 4.41x
BenchmarkHasher64/262144/avx512/plain-32 24142.71 114728.56 4.75x
BenchmarkHasher64/262144/avx512/seed-32 22919.09 110129.25 4.81x
BenchmarkHasher64/1048576/avx512/plain-32 23889.66 111943.95 4.69x
BenchmarkHasher64/1048576/avx512/seed-32 23104.38 106299.34 4.60x
BenchmarkHasher64/4194304/avx512/plain-32 24217.09 111626.61 4.61x
BenchmarkHasher64/4194304/avx512/seed-32 23279.68 105657.91 4.54x
BenchmarkHasher64/16777216/avx512/plain-32 24077.75 112962.97 4.69x
BenchmarkHasher64/16777216/avx512/seed-32 23330.17 106772.89 4.58x
BenchmarkHasher64/67108864/avx512/plain-32 23447.80 72003.16 3.07x
BenchmarkHasher64/67108864/avx512/seed-32 22440.39 71853.28 3.20x
BenchmarkHasher64/268435456/avx512/plain-32 23381.81 57520.80 2.46x
BenchmarkHasher64/268435456/avx512/seed-32 22537.73 57488.81 2.55x
BenchmarkHasher128/16/avx512/plain-32 2623.78 2614.76 1.00x
BenchmarkHasher128/16/avx512/seed-32 2247.58 2255.99 1.00x
BenchmarkHasher128/64/avx512/plain-32 8441.35 8569.21 1.02x
BenchmarkHasher128/64/avx512/seed-32 7928.51 8207.56 1.04x
BenchmarkHasher128/256/avx512/plain-32 11685.72 11960.20 1.02x
BenchmarkHasher128/256/avx512/seed-32 8656.83 9076.80 1.05x
BenchmarkHasher128/1024/avx512/plain-32 35400.36 36663.11 1.04x
BenchmarkHasher128/1024/avx512/seed-32 26047.85 27162.07 1.04x
BenchmarkHasher128/4096/avx512/plain-32 22589.17 52415.58 2.32x
BenchmarkHasher128/4096/avx512/seed-32 20721.40 51786.58 2.50x
BenchmarkHasher128/16384/avx512/plain-32 24000.61 89452.72 3.73x
BenchmarkHasher128/16384/avx512/seed-32 23326.01 87501.12 3.75x
BenchmarkHasher128/65536/avx512/plain-32 23894.06 106572.62 4.46x
BenchmarkHasher128/65536/avx512/seed-32 23225.19 104809.13 4.51x
BenchmarkHasher128/262144/avx512/plain-32 24213.35 113692.11 4.70x
BenchmarkHasher128/262144/avx512/seed-32 23499.29 111420.89 4.74x
BenchmarkHasher128/1048576/avx512/plain-32 24279.03 110841.82 4.57x
BenchmarkHasher128/1048576/avx512/seed-32 23345.59 108720.33 4.66x
BenchmarkHasher128/4194304/avx512/plain-32 24398.75 109798.04 4.50x
BenchmarkHasher128/4194304/avx512/seed-32 23379.07 107000.69 4.58x
BenchmarkHasher128/16777216/avx512/plain-32 24183.66 110475.47 4.57x
BenchmarkHasher128/16777216/avx512/seed-32 23469.79 107950.68 4.60x
BenchmarkHasher128/67108864/avx512/plain-32 23300.86 69514.94 2.98x
BenchmarkHasher128/67108864/avx512/seed-32 22750.83 70221.43 3.09x
BenchmarkHasher128/268435456/avx512/plain-32 22398.46 56600.03 2.53x
BenchmarkHasher128/268435456/avx512/seed-32 22372.26 56274.37 2.52x
```

In the numbers above, "avx512" is compared against Go. These should really be compared against the AVX2 code, where the speedup is still there but more modest (about 1.3x at 65536):

```
BenchmarkHasher128/65536/go/plain-32 24031.69 23695.38 0.99x
BenchmarkHasher128/65536/go/seed-32 23210.74 23037.17 0.99x
BenchmarkHasher128/65536/avx512/plain-32 23894.06 106572.62 4.46x
BenchmarkHasher128/65536/avx512/seed-32 23225.19 104809.13 4.51x
BenchmarkHasher128/65536/avx2/plain-32 82474.73 82628.79 1.00x
BenchmarkHasher128/65536/avx2/seed-32 81615.29 80617.75 0.99x
BenchmarkHasher128/65536/sse2/plain-32 46018.63 45949.57 1.00x
BenchmarkHasher128/65536/sse2/seed-32 45743.47 45791.77 1.00x
```

Tested on AMD Ryzen 9 9950X 16-Core Processor.
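For readers unfamiliar with the first change, here is a rough sketch in plain Go of what "combine 2 iterations into one" means structurally. It is not the assembly from this PR: `blockSize`, `mixBlock`, `accumulate1` and `accumulate2` are hypothetical names, and the mixing step is a toy stand-in rather than the real hash. The point is only that the unrolled loop handles two blocks per trip and must deal with a possible odd block at the end, while producing the same result as the one-block loop.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// blockSize is a hypothetical per-iteration block width (64 bytes, one 512-bit vector).
const blockSize = 64

// mixBlock folds one block into 8 accumulator lanes.
// Toy mixing step for illustration only, not the hash used in this PR.
func mixBlock(acc *[8]uint64, b []byte) {
	for i := 0; i < 8; i++ {
		v := binary.LittleEndian.Uint64(b[i*8:])
		acc[i] += (v & 0xffffffff) * (v >> 32)
	}
}

// accumulate1 processes one block per loop iteration.
func accumulate1(acc *[8]uint64, data []byte) {
	for len(data) >= blockSize {
		mixBlock(acc, data)
		data = data[blockSize:]
	}
}

// accumulate2 processes two blocks per loop iteration, then any odd block.
// Bytes beyond the last full block are ignored, as in accumulate1.
func accumulate2(acc *[8]uint64, data []byte) {
	for len(data) >= 2*blockSize {
		mixBlock(acc, data[:blockSize])
		mixBlock(acc, data[blockSize:2*blockSize])
		data = data[2*blockSize:]
	}
	if len(data) >= blockSize {
		mixBlock(acc, data)
	}
}

func main() {
	// 5 full blocks plus 7 trailing bytes, so the odd-block path is exercised.
	data := make([]byte, 5*blockSize+7)
	for i := range data {
		data[i] = byte(i*31 + 7)
	}
	var a, b [8]uint64
	accumulate1(&a, data)
	accumulate2(&b, data)
	fmt.Println(a == b) // both loops visit the same blocks in the same order
}
```

In hand-written assembly the two block bodies can be interleaved explicitly, so that independent loads and multiplies sit next to each other instead of leaving the overlap to the CPU's out-of-order scheduler; the Go version above only shows the loop shape.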
Owner
avx512 sure does have a lot of registers 😄