ggml-cpu : fix rms_norm_back wrong output under in-place aliasing (#1491) by devYRPauli · Pull Request #1519 · ggml-org/ggml

devYRPauli · 2026-05-28T19:05:18Z

Bug

GGML_OP_RMS_NORM_BACK is listed in ggml_op_can_inplace, so the backend scheduler may allocate dst over the buffer of src0 (= dz, the upstream gradient). On the CPU backend, ggml_compute_forward_rms_norm_back_f32 was implemented as a four-step sequence:

ggml_vec_cpy_f32  (ne00, dx, x);                       // dx = x  (overwrites dz if dst aliases src0)
ggml_vec_scale_f32(ne00, dx, (float)(-sum_xdz)/sum_eps);
ggml_vec_acc_f32  (ne00, dx, dz);                      // re-reads dz from now-overwritten memory
ggml_vec_scale_f32(ne00, dx, rrms);

When dst aliases src0, the += dz step reads the values written in the first step instead of the original gradient. With the issue's exact inputs dz = x = [1, 0, 0, 0], eps = 1e-4, this produces [-3.9976, 0, 0, 0] instead of the correct [0.000799, 0, 0, 0] — sign flipped, magnitude ~5000× off.

Fix

Replace the multi-step sequence with a single-pass FMA loop that reads each element before writing it. This is in-place safe under aliasing of dst with either src0 or src1. The pattern matches the existing CUDA implementation at ggml-cuda/norm.cu:205-207, which has always been correct.

Regression test

Adds tests/test-rms-norm-back-inplace.cpp. Compares ggml_graph_compute_with_ctx (no aliasing) against ggml_backend_sched_graph_compute (may alias) for three input shapes including the issue's exact reproducer. Without the fix, all three cases fail with significant divergences. With the fix, all pass.

Out of scope

Quick audit on M1 Pro turned up two pre-existing unrelated failures that reproduce on master without these changes — flagging here in case they help triage:

test-backend-ops grad -o SUM on Metal: SUM(type=f32,ne=[33,256,1,1],permute=[1,0,2,3]) reports MAA = 0.997 > 1e-4.
test-backend-ops grad -o RMS_NORM on Metal: RMS_NORM(...,inplace=1) hits GGML_ASSERT(!node->view_src || ...) in ggml_build_backward_expand (ggml.c:7037).

SYCL rms_norm_back implementation wasn't audited here; should probably get the same review since it likely uses similar logic.

…ml-org#1491) GGML_OP_RMS_NORM_BACK is listed in ggml_op_can_inplace, so the backend scheduler may allocate dst over the buffer of src0 (= dz, the upstream gradient). The CPU implementation was a four-step sequence: ggml_vec_cpy_f32 (ne00, dx, x); // overwrites dz if dst aliases src0 ggml_vec_scale_f32(ne00, dx, (float)(-sum_xdz)/sum_eps); ggml_vec_acc_f32 (ne00, dx, dz); // re-reads now-corrupted dz ggml_vec_scale_f32(ne00, dx, rrms); When dst aliases src0, the += dz step reads the values written in the first step instead of the original gradient. With the issue's inputs dz=x=[1,0,0,0], eps=1e-4 this produces [-3.9976, 0, 0, 0] instead of the correct [0.000799, 0, 0, 0] (sign flipped, magnitude ~5000x off). Replace the multi-step sequence with a single-pass FMA loop that reads each element before writing it. In-place safe under aliasing of dst with either src0 or src1. Matches the existing CUDA implementation at ggml-cuda/norm.cu:205-207, which has always been correct. Adds tests/test-rms-norm-back-inplace.cpp comparing ggml_graph_compute_with_ctx (no aliasing) against ggml_backend_sched_graph_compute (may alias) for three shapes including the issue's exact reproducer.

ggerganov · 2026-06-08T11:52:29Z

+                // dx[i00] = (dz + x*(-sum_xdz/sum_eps)) * rrms
+                //
+                // GGML_OP_RMS_NORM_BACK is listed in ggml_op_can_inplace, so the
+                // scheduler may alias dst with src0 (dz) or src1 (x). A single
+                // fused read-before-write loop is safe under either aliasing.
+                // A multi-step cpy/scale/acc/scale sequence is NOT safe — if dst
+                // aliases dz, the +=dz step would re-read overwritten memory.
+                // See: https://github.com/ggml-org/ggml/issues/1491
+                const float scale_x = (float) (-sum_xdz) / sum_eps;
+                for (int64_t i00 = 0; i00 < ne00; i00++) {
+                    dx[i00] = (dz[i00] + x[i00] * scale_x) * rrms;
+                }


Could you submit just this change alone in a PR to https://github.com/ggml-org/llama.cpp?

No need for a regression test.

devYRPauli · 2026-06-08T13:38:32Z

Done — opened ggml-org/llama.cpp#24305 with just this change. Thanks!

devYRPauli mentioned this pull request May 28, 2026

ggml-cpu : fix soft_max_back wrong output when dst aliases src1 (y) #1521

Open

ggerganov reviewed Jun 8, 2026

View reviewed changes

devYRPauli mentioned this pull request Jun 8, 2026

ggml-cpu : fix rms_norm_back wrong output under in-place aliasing ggml-org/llama.cpp#24305

Merged

devYRPauli mentioned this pull request Jun 8, 2026

common: use common_cpu_get_num_math() for --threads -1 ggml-org/llama.cpp#23836

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ggml-cpu : fix rms_norm_back wrong output under in-place aliasing (#1491)#1519

ggml-cpu : fix rms_norm_back wrong output under in-place aliasing (#1491)#1519
devYRPauli wants to merge 1 commit into
ggml-org:masterfrom
devYRPauli:fix-rms-norm-back-inplace

devYRPauli commented May 28, 2026

Uh oh!

ggerganov Jun 8, 2026

Uh oh!

devYRPauli commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

devYRPauli commented May 28, 2026

Bug

Fix

Regression test

Out of scope

Uh oh!

ggerganov Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

devYRPauli commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants