[bug] Fixed vdpbf16ps RW classification (destination is read-modify-write)#522
Open
HarshDave-Sithlord wants to merge 1 commit into
Open
Conversation
…rite)
`vdpbf16ps` (AVX-512-BF16, EVEX.xyz.F3.0F38.W0 52 /r) accumulates a
pair-wise bf16 dot product into its destination fp32 lanes:
dst.f32[i] += src1.bf16[2i] * src2.bf16[2i]
+ src1.bf16[2i+1] * src2.bf16[2i+1]
The destination is therefore read-modify-write, not write-only. The
instruction database tagged the destination operand as `W:` (write-only),
which caused the Compiler-mode register allocator to treat the destination
as having no incoming live-in and made it eligible to alias a source
register whose last use was the same instruction. The result was silent
accumulator corruption in code that hand-rolls bf16 GEMM-style inner loops,
where the same zmm holds the accumulator across many `vdpbf16ps` invocations.
Compare with `vfmadd231ps`, which has the same accumulator semantics and
is correctly tagged `X:`. All AVX-10.2 bf16 FMA variants (e.g.
`vfmadd{132,213,231}nepbf16`) are also already correctly tagged `X:` --
only `vdpbf16ps` was misclassified.
Fix: change `W:` to `X:` for the `vdpbf16ps` entry in `db/isa_x86.json`
and regenerate `asmjit/x86/x86instdb.cpp` with `tools/tablegen-x86.js`.
The regen-only churn in `x86instdb.cpp` is the dedup table reordering;
the semantic delta is that `vdpbf16ps`'s row in `rw_info_b_table` now
shares the FMA destination-RW signature (`{ 2, 3, 3, ... }`) instead of
the write-only signature (`{ 10, 5, 5, ... }`).
Co-authored-by: Cursor <cursoragent@cursor.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
vdpbf16ps(AVX-512-BF16, EVEX.xyz.F3.0F38.W0 52 /r) accumulates a pair-wise bf16 dot product into its destination fp32 lanes:The destination is therefore read-modify-write, not write-only. The instruction database tagged the destination operand as
W:(write-only), which caused the Compiler-mode register allocator to treat the destination as having no incoming live-in and made it eligible to alias a source register whose last use was the same instruction.Compare with
vfmadd231ps, which has the same accumulator semantics and is correctly taggedX:. All AVX-10.2 bf16 FMA variants (e.g.vfmadd{132,213,231}nepbf16) are also already correctly taggedX:-- onlyvdpbf16pswas misclassified.Fix: change
W:toX:for thevdpbf16psentry indb/isa_x86.jsonand regenerateasmjit/x86/x86instdb.cppwithtools/tablegen-x86.js. The regen-only churn inx86instdb.cppis the dedup table reordering; the semantic delta is thatvdpbf16ps's row inrw_info_b_tablenow shares the FMA destination-RW signature ({ 2, 3, 3, ... }) instead of the write-only signature ({ 10, 5, 5, ... }).