ggml-cpu : fix soft_max_back wrong output when dst aliases src1 (y)#1521
Open
devYRPauli wants to merge 1 commit into
Open
ggml-cpu : fix soft_max_back wrong output when dst aliases src1 (y)#1521devYRPauli wants to merge 1 commit into
devYRPauli wants to merge 1 commit into
Conversation
GGML_OP_SOFT_MAX_BACK is listed in ggml_op_can_inplace, so the backend
scheduler may allocate dst over the buffer of either src0 (dy) or src1
(y). The CPU implementation was a five-step sequence:
ggml_vec_dot_f32 (nc, &dot_y_dy, 0, y, 0, dy, 0, 1); // dot(y, dy)
ggml_vec_cpy_f32 (nc, dx, dy); // dx = dy
ggml_vec_acc1_f32 (nc, dx, -dot_y_dy);
ggml_vec_mul_f32 (nc, dx, dx, y); // re-reads y
ggml_vec_scale_f32(nc, dx, scale);
The allocator preferentially picks src0 for in-place reuse, in which
case dst==dy and the cpy is a no-op. But when src0 is ineligible (e.g.
marked GGML_TENSOR_FLAG_OUTPUT, or used by more than one downstream
consumer), the allocator falls through to src1 (y). Then the cpy
overwrites y with dy values, and the subsequent dx = dx * y reads the
corrupted buffer, producing wildly wrong gradients.
Replace the multi-step sequence with a single fused read-before-write
loop. Each element's read of dy[i] and y[i] precedes its write to
dx[i], which is safe under aliasing of dst with either src0 or src1.
Matches the CUDA reference at ggml-cuda/softmax.cu:268.
Adds tests/test-soft-max-back-inplace.cpp comparing
ggml_graph_compute_with_ctx (no aliasing) against
ggml_backend_sched_graph_compute (may alias) across six configurations
including the dst-aliases-src1 trigger (dy marked as output). All
configurations fail without the fix, all pass after.
Contributor
Author
|
Going to close this for the moment to keep to a single open PR as a new contributor. I've left #1519 open since it fixes the same in-place aliasing problem in the backward ops; I'll bring this soft_max_back one back once that lands. Thanks! |
Contributor
Author
|
Reopening now that my first PR has landed, as promised when I trimmed down to a single open PR. Thanks for your patience! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Sibling fix to #1519 (rms_norm_back in-place aliasing). Same root-cause class.
Bug
GGML_OP_SOFT_MAX_BACKis listed inggml_op_can_inplace, so the backend scheduler may allocatedstover the buffer of eithersrc0(dy, the upstream gradient) orsrc1(y, the saved softmax output). On the CPU backend,ggml_compute_forward_soft_max_ext_back_f32was implemented as a five-step sequence:The allocator preferentially picks
src0for in-place reuse, in which casedst == dyand thecpyis a no-op — historically safe. But whensrc0is ineligible (e.g. markedGGML_TENSOR_FLAG_OUTPUT, or used by more than one downstream consumer), the allocator falls through tosrc1. Then thedx = dystep overwritesywithdyvalues, and the subsequentdx = dx * yreads the corrupted buffer — producing wildly wrong gradients (diffs of ~1.5 on values of magnitude ~0.02 in the test cases below).Fix
Replace the multi-step sequence with a single fused read-before-write loop:
Each element's read of
dy[i]andy[i]precedes its write todx[i], which is safe under aliasing ofdstwith eithersrc0orsrc1. Matches the CUDA reference atggml-cuda/softmax.cu:268.Regression test
Adds
tests/test-soft-max-back-inplace.cpp. Comparesggml_graph_compute_with_ctxagainstggml_backend_sched_graph_computeacross six configurations:default-*— allocator picks src0; historically passes on both paths.dy-output-*—ggml_set_output(t_dy)forces allocator to fall through to src1; previously broken, fixed by this PR.y-output-*— allocator picks src0 (src1 ineligible); safe.both-output-*— both ineligible, fresh allocation; safe.Without the fix: the four "default-" and "-output (src0 path)" cases pass, the two
dy-output-*cases fail with large divergences. With the fix: all six pass.How discovered
Surfaced during the audit prompted by #1519 — same pattern, same backward-op surface (
_backops inggml_op_can_inplace). I audited the other entries in that list:ggml_vec_silu_backward_f32/_f16(used by SILU_BACK) is a single-pass loop, in-place safe.ggml_compute_forward_rope_flt(used by ROPE_BACK) uses pair-wise rotations that the existing impl handles correctly. Only SOFT_MAX_BACK had the multi-step pattern with the same trap as RMS_NORM_BACK.Pre-existing unrelated issue noted
test-backend-ops grad -o SUMon Metal still producesinfforSUM(type=f32,ne=[33,256,1,1],permute=[1,0,2,3])— same pre-existing failure noted in #1519's body. Unrelated to this change.