Skip to content

ggml-cpu : fix soft_max_back wrong output when dst aliases src1 (y)#1521

Open
devYRPauli wants to merge 1 commit into
ggml-org:masterfrom
devYRPauli:fix-soft-max-back-inplace
Open

ggml-cpu : fix soft_max_back wrong output when dst aliases src1 (y)#1521
devYRPauli wants to merge 1 commit into
ggml-org:masterfrom
devYRPauli:fix-soft-max-back-inplace

Conversation

@devYRPauli

Copy link
Copy Markdown
Contributor

Sibling fix to #1519 (rms_norm_back in-place aliasing). Same root-cause class.

Bug

GGML_OP_SOFT_MAX_BACK is listed in ggml_op_can_inplace, so the backend scheduler may allocate dst over the buffer of either src0 (dy, the upstream gradient) or src1 (y, the saved softmax output). On the CPU backend, ggml_compute_forward_soft_max_ext_back_f32 was implemented as a five-step sequence:

ggml_vec_dot_f32  (nc, &dot_y_dy, 0, y, 0, dy, 0, 1);  // dot(y, dy)
ggml_vec_cpy_f32  (nc, dx, dy);                        // dx = dy
ggml_vec_acc1_f32 (nc, dx, -dot_y_dy);
ggml_vec_mul_f32  (nc, dx, dx, y);                     // re-reads y
ggml_vec_scale_f32(nc, dx, scale);

The allocator preferentially picks src0 for in-place reuse, in which case dst == dy and the cpy is a no-op — historically safe. But when src0 is ineligible (e.g. marked GGML_TENSOR_FLAG_OUTPUT, or used by more than one downstream consumer), the allocator falls through to src1. Then the dx = dy step overwrites y with dy values, and the subsequent dx = dx * y reads the corrupted buffer — producing wildly wrong gradients (diffs of ~1.5 on values of magnitude ~0.02 in the test cases below).

Fix

Replace the multi-step sequence with a single fused read-before-write loop:

float dot_y_dy = 0;
ggml_vec_dot_f32(nc, &dot_y_dy, 0, y, 0, dy, 0, 1);
for (int i = 0; i < nc; i++) {
    dx[i] = scale * (dy[i] - dot_y_dy) * y[i];
}

Each element's read of dy[i] and y[i] precedes its write to dx[i], which is safe under aliasing of dst with either src0 or src1. Matches the CUDA reference at ggml-cuda/softmax.cu:268.

Regression test

Adds tests/test-soft-max-back-inplace.cpp. Compares ggml_graph_compute_with_ctx against ggml_backend_sched_graph_compute across six configurations:

  • default-* — allocator picks src0; historically passes on both paths.
  • dy-output-*ggml_set_output(t_dy) forces allocator to fall through to src1; previously broken, fixed by this PR.
  • y-output-* — allocator picks src0 (src1 ineligible); safe.
  • both-output-* — both ineligible, fresh allocation; safe.

Without the fix: the four "default-" and "-output (src0 path)" cases pass, the two dy-output-* cases fail with large divergences. With the fix: all six pass.

How discovered

Surfaced during the audit prompted by #1519 — same pattern, same backward-op surface (_back ops in ggml_op_can_inplace). I audited the other entries in that list: ggml_vec_silu_backward_f32/_f16 (used by SILU_BACK) is a single-pass loop, in-place safe. ggml_compute_forward_rope_flt (used by ROPE_BACK) uses pair-wise rotations that the existing impl handles correctly. Only SOFT_MAX_BACK had the multi-step pattern with the same trap as RMS_NORM_BACK.

Pre-existing unrelated issue noted

test-backend-ops grad -o SUM on Metal still produces inf for SUM(type=f32,ne=[33,256,1,1],permute=[1,0,2,3]) — same pre-existing failure noted in #1519's body. Unrelated to this change.

GGML_OP_SOFT_MAX_BACK is listed in ggml_op_can_inplace, so the backend
scheduler may allocate dst over the buffer of either src0 (dy) or src1
(y). The CPU implementation was a five-step sequence:

    ggml_vec_dot_f32  (nc, &dot_y_dy, 0, y, 0, dy, 0, 1);  // dot(y, dy)
    ggml_vec_cpy_f32  (nc, dx, dy);                        // dx = dy
    ggml_vec_acc1_f32 (nc, dx, -dot_y_dy);
    ggml_vec_mul_f32  (nc, dx, dx, y);                     // re-reads y
    ggml_vec_scale_f32(nc, dx, scale);

The allocator preferentially picks src0 for in-place reuse, in which
case dst==dy and the cpy is a no-op. But when src0 is ineligible (e.g.
marked GGML_TENSOR_FLAG_OUTPUT, or used by more than one downstream
consumer), the allocator falls through to src1 (y). Then the cpy
overwrites y with dy values, and the subsequent dx = dx * y reads the
corrupted buffer, producing wildly wrong gradients.

Replace the multi-step sequence with a single fused read-before-write
loop. Each element's read of dy[i] and y[i] precedes its write to
dx[i], which is safe under aliasing of dst with either src0 or src1.
Matches the CUDA reference at ggml-cuda/softmax.cu:268.

Adds tests/test-soft-max-back-inplace.cpp comparing
ggml_graph_compute_with_ctx (no aliasing) against
ggml_backend_sched_graph_compute (may alias) across six configurations
including the dst-aliases-src1 trigger (dy marked as output). All
configurations fail without the fix, all pass after.
@devYRPauli

Copy link
Copy Markdown
Contributor Author

Going to close this for the moment to keep to a single open PR as a new contributor. I've left #1519 open since it fixes the same in-place aliasing problem in the backward ops; I'll bring this soft_max_back one back once that lands. Thanks!

@devYRPauli devYRPauli closed this May 29, 2026
@devYRPauli devYRPauli reopened this Jun 10, 2026
@devYRPauli

Copy link
Copy Markdown
Contributor Author

Reopening now that my first PR has landed, as promised when I trimmed down to a single open PR. Thanks for your patience!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant