ggml-cpu : fix soft_max_back wrong output when dst aliases src1 (y) by devYRPauli · Pull Request #1521 · ggml-org/ggml

devYRPauli · 2026-05-28T19:35:27Z

Sibling fix to #1519 (rms_norm_back in-place aliasing). Same root-cause class.

Bug

GGML_OP_SOFT_MAX_BACK is listed in ggml_op_can_inplace, so the backend scheduler may allocate dst over the buffer of either src0 (dy, the upstream gradient) or src1 (y, the saved softmax output). On the CPU backend, ggml_compute_forward_soft_max_ext_back_f32 was implemented as a five-step sequence:

ggml_vec_dot_f32  (nc, &dot_y_dy, 0, y, 0, dy, 0, 1);  // dot(y, dy)
ggml_vec_cpy_f32  (nc, dx, dy);                        // dx = dy
ggml_vec_acc1_f32 (nc, dx, -dot_y_dy);
ggml_vec_mul_f32  (nc, dx, dx, y);                     // re-reads y
ggml_vec_scale_f32(nc, dx, scale);

The allocator preferentially picks src0 for in-place reuse, in which case dst == dy and the cpy is a no-op — historically safe. But when src0 is ineligible (e.g. marked GGML_TENSOR_FLAG_OUTPUT, or used by more than one downstream consumer), the allocator falls through to src1. Then the dx = dy step overwrites y with dy values, and the subsequent dx = dx * y reads the corrupted buffer — producing wildly wrong gradients (diffs of ~1.5 on values of magnitude ~0.02 in the test cases below).

Fix

Replace the multi-step sequence with a single fused read-before-write loop:

float dot_y_dy = 0;
ggml_vec_dot_f32(nc, &dot_y_dy, 0, y, 0, dy, 0, 1);
for (int i = 0; i < nc; i++) {
    dx[i] = scale * (dy[i] - dot_y_dy) * y[i];
}

Each element's read of dy[i] and y[i] precedes its write to dx[i], which is safe under aliasing of dst with either src0 or src1. Matches the CUDA reference at ggml-cuda/softmax.cu:268.

Regression test

Adds tests/test-soft-max-back-inplace.cpp. Compares ggml_graph_compute_with_ctx against ggml_backend_sched_graph_compute across six configurations:

default-* — allocator picks src0; historically passes on both paths.
dy-output-* — ggml_set_output(t_dy) forces allocator to fall through to src1; previously broken, fixed by this PR.
y-output-* — allocator picks src0 (src1 ineligible); safe.
both-output-* — both ineligible, fresh allocation; safe.

Without the fix: the four "default-" and "-output (src0 path)" cases pass, the two dy-output-* cases fail with large divergences. With the fix: all six pass.

How discovered

Surfaced during the audit prompted by #1519 — same pattern, same backward-op surface (_back ops in ggml_op_can_inplace). I audited the other entries in that list: ggml_vec_silu_backward_f32/_f16 (used by SILU_BACK) is a single-pass loop, in-place safe. ggml_compute_forward_rope_flt (used by ROPE_BACK) uses pair-wise rotations that the existing impl handles correctly. Only SOFT_MAX_BACK had the multi-step pattern with the same trap as RMS_NORM_BACK.

Pre-existing unrelated issue noted

test-backend-ops grad -o SUM on Metal still produces inf for SUM(type=f32,ne=[33,256,1,1],permute=[1,0,2,3]) — same pre-existing failure noted in #1519's body. Unrelated to this change.

GGML_OP_SOFT_MAX_BACK is listed in ggml_op_can_inplace, so the backend scheduler may allocate dst over the buffer of either src0 (dy) or src1 (y). The CPU implementation was a five-step sequence: ggml_vec_dot_f32 (nc, &dot_y_dy, 0, y, 0, dy, 0, 1); // dot(y, dy) ggml_vec_cpy_f32 (nc, dx, dy); // dx = dy ggml_vec_acc1_f32 (nc, dx, -dot_y_dy); ggml_vec_mul_f32 (nc, dx, dx, y); // re-reads y ggml_vec_scale_f32(nc, dx, scale); The allocator preferentially picks src0 for in-place reuse, in which case dst==dy and the cpy is a no-op. But when src0 is ineligible (e.g. marked GGML_TENSOR_FLAG_OUTPUT, or used by more than one downstream consumer), the allocator falls through to src1 (y). Then the cpy overwrites y with dy values, and the subsequent dx = dx * y reads the corrupted buffer, producing wildly wrong gradients. Replace the multi-step sequence with a single fused read-before-write loop. Each element's read of dy[i] and y[i] precedes its write to dx[i], which is safe under aliasing of dst with either src0 or src1. Matches the CUDA reference at ggml-cuda/softmax.cu:268. Adds tests/test-soft-max-back-inplace.cpp comparing ggml_graph_compute_with_ctx (no aliasing) against ggml_backend_sched_graph_compute (may alias) across six configurations including the dst-aliases-src1 trigger (dy marked as output). All configurations fail without the fix, all pass after.

devYRPauli · 2026-05-29T18:14:44Z

Going to close this for the moment to keep to a single open PR as a new contributor. I've left #1519 open since it fixes the same in-place aliasing problem in the backward ops; I'll bring this soft_max_back one back once that lands. Thanks!

devYRPauli · 2026-06-11T00:07:44Z

Reopening now that my first PR has landed, as promised when I trimmed down to a single open PR. Thanks for your patience!

devYRPauli closed this May 29, 2026

devYRPauli reopened this Jun 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ggml-cpu : fix soft_max_back wrong output when dst aliases src1 (y)#1521

ggml-cpu : fix soft_max_back wrong output when dst aliases src1 (y)#1521
devYRPauli wants to merge 1 commit into
ggml-org:masterfrom
devYRPauli:fix-soft-max-back-inplace

devYRPauli commented May 28, 2026

Uh oh!

devYRPauli commented May 29, 2026

Uh oh!

devYRPauli commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

devYRPauli commented May 28, 2026

Bug

Fix

Regression test

How discovered

Pre-existing unrelated issue noted

Uh oh!

devYRPauli commented May 29, 2026

Uh oh!

devYRPauli commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant