Skip to content

[bug] Fixed vdpbf16ps RW classification (destination is read-modify-write)#522

Open
HarshDave-Sithlord wants to merge 1 commit into
asmjit:masterfrom
HarshDave-Sithlord:fix-vdpbf16ps-rw-classification
Open

[bug] Fixed vdpbf16ps RW classification (destination is read-modify-write)#522
HarshDave-Sithlord wants to merge 1 commit into
asmjit:masterfrom
HarshDave-Sithlord:fix-vdpbf16ps-rw-classification

Conversation

@HarshDave-Sithlord

Copy link
Copy Markdown

vdpbf16ps (AVX-512-BF16, EVEX.xyz.F3.0F38.W0 52 /r) accumulates a pair-wise bf16 dot product into its destination fp32 lanes:

dst.f32[i] += src1.bf16[2i]   * src2.bf16[2i]
            + src1.bf16[2i+1] * src2.bf16[2i+1]

The destination is therefore read-modify-write, not write-only. The instruction database tagged the destination operand as W: (write-only), which caused the Compiler-mode register allocator to treat the destination as having no incoming live-in and made it eligible to alias a source register whose last use was the same instruction.

Compare with vfmadd231ps, which has the same accumulator semantics and is correctly tagged X:. All AVX-10.2 bf16 FMA variants (e.g. vfmadd{132,213,231}nepbf16) are also already correctly tagged X: -- only vdpbf16ps was misclassified.

Fix: change W: to X: for the vdpbf16ps entry in db/isa_x86.json and regenerate asmjit/x86/x86instdb.cpp with tools/tablegen-x86.js. The regen-only churn in x86instdb.cpp is the dedup table reordering; the semantic delta is that vdpbf16ps's row in rw_info_b_table now shares the FMA destination-RW signature ({ 2, 3, 3, ... }) instead of the write-only signature ({ 10, 5, 5, ... }).

…rite)

`vdpbf16ps` (AVX-512-BF16, EVEX.xyz.F3.0F38.W0 52 /r) accumulates a
pair-wise bf16 dot product into its destination fp32 lanes:

    dst.f32[i] += src1.bf16[2i]   * src2.bf16[2i]
                + src1.bf16[2i+1] * src2.bf16[2i+1]

The destination is therefore read-modify-write, not write-only. The
instruction database tagged the destination operand as `W:` (write-only),
which caused the Compiler-mode register allocator to treat the destination
as having no incoming live-in and made it eligible to alias a source
register whose last use was the same instruction. The result was silent
accumulator corruption in code that hand-rolls bf16 GEMM-style inner loops,
where the same zmm holds the accumulator across many `vdpbf16ps` invocations.

Compare with `vfmadd231ps`, which has the same accumulator semantics and
is correctly tagged `X:`. All AVX-10.2 bf16 FMA variants (e.g.
`vfmadd{132,213,231}nepbf16`) are also already correctly tagged `X:` --
only `vdpbf16ps` was misclassified.

Fix: change `W:` to `X:` for the `vdpbf16ps` entry in `db/isa_x86.json`
and regenerate `asmjit/x86/x86instdb.cpp` with `tools/tablegen-x86.js`.
The regen-only churn in `x86instdb.cpp` is the dedup table reordering;
the semantic delta is that `vdpbf16ps`'s row in `rw_info_b_table` now
shares the FMA destination-RW signature (`{ 2, 3, 3, ... }`) instead of
the write-only signature (`{ 10, 5, 5, ... }`).

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant