[ARM] Override Clang x4 NEON intrinsics for Android #1694
Conversation
* Clang for Android requires 256-bit alignment for x4 loads and stores, which can't be guaranteed and is unnecessary
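As a rough illustration of what such an override might look like (a hedged sketch, not necessarily the exact change in this PR; the wrapper name is invented), the 64-byte load can be composed from plain x1 loads, which carry no alignment hint:

```c
#if defined(__ANDROID__) && defined(__clang__)
#include <arm_neon.h>

/* Build the 64-byte structure from four ordinary vld1q_u8 loads, which
 * carry no alignment annotation, instead of vld1q_u8_x4, which Clang
 * tags with a 256-bit alignment requirement on this target. */
static inline uint8x16x4_t load_u8_x4_noalign(const uint8_t *p) {
    uint8x16x4_t r;
    r.val[0] = vld1q_u8(p);
    r.val[1] = vld1q_u8(p + 16);
    r.val[2] = vld1q_u8(p + 32);
    r.val[3] = vld1q_u8(p + 48);
    return r;
}
#endif
```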
Codecov Report: All modified and coverable lines are covered by tests ✅

Coverage Diff (develop vs. #1694):

Coverage 83.03% → 83.03%
Files 134 → 134
Lines 10336 → 10336
Branches 2813 → 2813
Hits 8583 → 8583
Misses 1054 → 1054
Partials 699 → 699

Additional details and impacted files: view the full report in Codecov by Sentry. |
Looking back on this, I may have missed the subtlety that this is strictly an AArch32 Android issue? Is that really the case, or does AArch64 with Android also suffer this nuisance? |
So far I haven't seen anything that suggests the AArch64 ABI even uses alignment hints... I've been reading the NEON headers for various ARM/AArch64 toolkits and also the source code for the code generator in Clang, since gcc is no longer used for Android. |
Is it not possible to guarantee alignment by simply doing scalar sums to 32-byte alignment instead of 16 in the adler32 code? The slide_hash code already appears to guarantee 64-byte alignment through the |
We're talking about 8-bit reads here, which shouldn't need alignment at all... 16-bit reads and writes should only need a maximum of 16-bit (2-byte) alignment, not 256-bit (32-byte) alignment. |
It certainly is, though for a lot of use cases you may do a scalar run through the entire data before you hit that point. |
When I hear about Android-specific bugs, I always remember that ARM has been bi-endian since ARMv3, and as such enforcing alignment checks in software might have some justification, but in this case Google has gone full bats in the clock tower. Since I started working with optimizations, I have seen quite a few motherboards where the main processor is little-endian and the co-processor is big-endian. CPUs on those motherboards have their own "move" instructions that automatically byteswap to correct the endianness. I completely agree with what KungFuJesus said about some/many buffers being too short to allow aligning before using vector instructions. |
The existing NEON adler32 already does scalar sums to 16 bytes/128 bits, with a comment suggesting that it's for speed, which is why I'm wondering if it's simpler to adjust that on Android instead of overriding the intrinsics. https://github.com/zlib-ng/zlib-ng/blob/develop/arch/arm/adler32_neon.c#L171
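For illustration, the adjustment being floated here might look roughly like the sketch below: widen the prologue's alignment target to 32 bytes only on Android/Clang. This is an assumption-laden sketch, not the actual zlib-ng code; the macro and helper names are invented.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: align to 32 bytes where Clang emits the 256-bit hint,
 * keep the existing 16-byte target everywhere else. */
#if defined(__ANDROID__) && defined(__clang__)
#  define NEON_PROLOGUE_ALIGN 32
#else
#  define NEON_PROLOGUE_ALIGN 16
#endif

/* Number of leading bytes to sum scalarly before the vector loop. */
static size_t scalar_prologue_len(const unsigned char *buf, size_t len) {
    size_t mis  = (uintptr_t)buf % NEON_PROLOGUE_ALIGN;
    size_t need = mis ? NEON_PROLOGUE_ALIGN - mis : 0;
    return need < len ? need : len;   /* never sum past the end of the buffer */
}
```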
Even 32-bit ARM has enough registers that it makes sense to utilize all or most of them instead of using just one or two registers. That way we can delay the expensive arithmetic operations, for example the modulo, as late as possible, resulting in a speed gain. |
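To make the "delay the modulo" point concrete, here is the scalar form of that idea, as a sketch for illustration only (the NEON code does the same thing with vector accumulators): accumulate up to NMAX bytes before reducing, rather than reducing on every iteration.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521U
#define ADLER_NMAX 5552   /* zlib's NMAX: largest block with no 32-bit overflow of s2 */

static uint32_t adler32_deferred_mod(uint32_t adler, const unsigned char *buf, size_t len) {
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = (adler >> 16) & 0xffff;
    while (len > 0) {
        size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
        len -= n;
        while (n--) {            /* cheap adds in the hot loop */
            s1 += *buf++;
            s2 += s1;
        }
        s1 %= ADLER_BASE;        /* one expensive reduction per block */
        s2 %= ADLER_BASE;
    }
    return (s2 << 16) | s1;
}
```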
I suppose we could conditionally force this to be a modulus of 32 bytes under the right conditions, but as I was alluding to earlier, the right conditions to make this happen don't exist on very short strings. The 16-byte alignment does help, but seemingly only on the little CPUs of the big.LITTLE SBCs I've tested; the big cores seem to pipeline the loads well enough that alignment is unimportant. We also do the fake 4x load on platforms that lack this intrinsic, so any workaround that adjusted this aligning scalar sum would also need to take that into account. |
I've only seen Clang and MSVC having the x4 versions; at least my gcc doesn't have them... A lot of people still prefer gcc as it's kind of a self-contained toolchain, while Clang will almost always need parts of another toolchain to maintain ABI compatibility. |
I have it in my version of GCC and I believe I've had it for a while (since maybe v10 or v11?). I imagine the impact of the wider load is felt more on some implementations than others. It is a bit of a balancing act to determine when the alignment is actually helpful and when doing the scalar computations completely nullifies the gains. With pre-Nehalem CPUs, the adler checksum there is doing a bit of calculus to determine that. Everything after Nehalem, the alignment mattered less and less (and really the part where it has any impact, however small, is the stores rather than the loads). Given that this data is fed from a raw stream of bytes and in the worst case we're already doing up to 15 scalar sums, I don't know how I feel about 16 more. Perhaps a viable strategy could be that we do a 16-byte-wide load and jump over the wide load if not aligned. This would probably maximize the benefit while not going too overboard?

It's a little complicated, but to clarify, the loop is already unrolled by a factor of 4. The remainder peeling for the modulo 4 is done after that loop with "if (rem)". The loop could be restructured such that it jumped to that section and then jumped back up to the top of the loop based on both the alignment requirements and whether or not the remaining len had at least 32 bytes left to checksum. |
I have gcc 9.4.0 and a few different versions of Clang... The default gcc version number has lagged behind Clang version numbers a lot, at least on Ubuntu. Clang 12.0.0 is the latest I have without rebooting... |
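If availability really does split along compiler lines as described above, a build-time guard could choose between the x4 intrinsics and a composed fallback. The sketch below is only illustrative: the GCC version cutoff is a guess taken from this thread, not a verified fact, and the macro name is invented.

```c
/* Guess: Clang and MSVC provide vld1q_u8_x4/vld1q_u16_x4, GCC only in
 * newer releases ("maybe v10 or v11" per the discussion above). */
#if defined(__clang__) || defined(_MSC_VER) || \
    (defined(__GNUC__) && __GNUC__ >= 10)
#  define HAVE_NEON_X4_INTRINSICS 1
#else
#  define HAVE_NEON_X4_INTRINSICS 0
#endif
```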
It's not the prettiest and there are some corner cases that are not being caught, but something in the vein of this is what I had been proposing:

adam@pi5:~/zlib-ng/build $ git diff
diff --git a/arch/arm/adler32_neon.c b/arch/arm/adler32_neon.c
index 8e46b380..972480c1 100644
--- a/arch/arm/adler32_neon.c
+++ b/arch/arm/adler32_neon.c
@@ -11,7 +11,7 @@
#include "adler32_p.h"
static void NEON_accum32(uint32_t *s, const uint8_t *buf, size_t len) {
- static const uint16_t ALIGNED_(16) taps[64] = {
+ static const uint16_t ALIGNED_(32) taps[64] = {
64, 63, 62, 61, 60, 59, 58, 57,
56, 55, 54, 53, 52, 51, 50, 49,
48, 47, 46, 45, 44, 43, 42, 41,
@@ -39,8 +39,22 @@ static void NEON_accum32(uint32_t *s, const uint8_t *buf, size_t len) {
uint16x8_t s2_4, s2_5, s2_6, s2_7;
s2_4 = s2_5 = s2_6 = s2_7 = vdupq_n_u16(0);
+loop_unrolled:
size_t num_iter = len >> 2;
int rem = len & 3;
+#if 1
+ int align_rem = (uintptr_t)buf & 31;
+
+ if (align_rem != 0) {
+ /* Determine if the modulus for the aligning loads is greater
+ * than the length of the buffer in 16 byte increments. If so,
+ * only checksum the remaining length of the buffer. If not,
+ * compute the residual number of 16 byte loads needed and at
+ * the end of this loop, jump back to the 4x load loop */
+ rem = 1;
+ goto rem_peel;
+ }
+#endif
for (size_t i = 0; i < num_iter; ++i) {
uint8x16x4_t d0_d3 = vld1q_u8_x4(buf);
@@ -75,10 +89,12 @@ static void NEON_accum32(uint32_t *s, const uint8_t *buf, size_t len) {
adacc_prev = adacc;
buf += 64;
+ len -= 4;
}
s3acc = vshlq_n_u32(s3acc, 6);
+rem_peel:
if (rem) {
uint32x4_t s3acc_0 = vdupq_n_u32(0);
while (rem--) {
@@ -91,10 +107,15 @@ static void NEON_accum32(uint32_t *s, const uint8_t *buf, size_t len) {
s3acc_0 = vaddq_u32(s3acc_0, adacc_prev);
adacc_prev = adacc;
buf += 16;
+ --len;
}
s3acc_0 = vshlq_n_u32(s3acc_0, 4);
s3acc = vaddq_u32(s3acc_0, s3acc);
+
+ if (len) {
+ goto loop_unrolled;
+ }
}
uint16x8x4_t t0_t3 = vld1q_u16_x4(taps);

I don't love having to decrement another counting variable (undoubtedly there might also be a way around that, by decrementing len before the top jump or something). |
Using |
I mean, I get that it's spaghetti; I was just trying to find a way to decrease the potential machine code size and reuse bits of the code. On further inspection, though, this isn't going to be all that helpful, because the code doing the alignment to 16 bytes only does so if there will still be work left over (forgot I did that). For a short enough string, it feeds in arbitrarily aligned data, because that's what the ABI allows so long as the alignment hint doesn't get compiled in. The benefit the 4x loads really buy you varies wildly between ARM implementations, but it's never been super huge (maybe on M1 it's more significant, I'm not sure). I think falling back on Android to use the 16-byte-at-a-time unaligned load is probably acceptable, however silly. It is interesting that it only requires 256-bit alignment, despite it being a 512-bit load. That might indicate that most microarchitectures are only loading 32 bytes at a time anyway. |
I think I said before that the x4 versions might compile to two x2 loads/stores on 32-bit targets... However, I haven't checked if some 32-bit ARM processors allow x4 loads/stores natively. I disassembled a call to vld1q_u8_x4:

vld1.8 {d0-d3}, [r0]!
vld1.8 {d4-d7}, [r0]

I disassembled a call to vld1q_u16_x4:

vld1.16 {d0-d3}, [r0]!
vld1.16 {d4-d7}, [r0] |
Fixes #1343.