ggml-cpu : fix rms_norm_back wrong output under in-place aliasing (#1491)#1519
Open
devYRPauli wants to merge 1 commit into
Open
ggml-cpu : fix rms_norm_back wrong output under in-place aliasing (#1491)#1519devYRPauli wants to merge 1 commit into
devYRPauli wants to merge 1 commit into
Conversation
…ml-org#1491) GGML_OP_RMS_NORM_BACK is listed in ggml_op_can_inplace, so the backend scheduler may allocate dst over the buffer of src0 (= dz, the upstream gradient). The CPU implementation was a four-step sequence: ggml_vec_cpy_f32 (ne00, dx, x); // overwrites dz if dst aliases src0 ggml_vec_scale_f32(ne00, dx, (float)(-sum_xdz)/sum_eps); ggml_vec_acc_f32 (ne00, dx, dz); // re-reads now-corrupted dz ggml_vec_scale_f32(ne00, dx, rrms); When dst aliases src0, the += dz step reads the values written in the first step instead of the original gradient. With the issue's inputs dz=x=[1,0,0,0], eps=1e-4 this produces [-3.9976, 0, 0, 0] instead of the correct [0.000799, 0, 0, 0] (sign flipped, magnitude ~5000x off). Replace the multi-step sequence with a single-pass FMA loop that reads each element before writing it. In-place safe under aliasing of dst with either src0 or src1. Matches the existing CUDA implementation at ggml-cuda/norm.cu:205-207, which has always been correct. Adds tests/test-rms-norm-back-inplace.cpp comparing ggml_graph_compute_with_ctx (no aliasing) against ggml_backend_sched_graph_compute (may alias) for three shapes including the issue's exact reproducer.
ggerganov
reviewed
Jun 8, 2026
Comment on lines
+3977
to
+3988
| // dx[i00] = (dz + x*(-sum_xdz/sum_eps)) * rrms | ||
| // | ||
| // GGML_OP_RMS_NORM_BACK is listed in ggml_op_can_inplace, so the | ||
| // scheduler may alias dst with src0 (dz) or src1 (x). A single | ||
| // fused read-before-write loop is safe under either aliasing. | ||
| // A multi-step cpy/scale/acc/scale sequence is NOT safe — if dst | ||
| // aliases dz, the +=dz step would re-read overwritten memory. | ||
| // See: https://github.com/ggml-org/ggml/issues/1491 | ||
| const float scale_x = (float) (-sum_xdz) / sum_eps; | ||
| for (int64_t i00 = 0; i00 < ne00; i00++) { | ||
| dx[i00] = (dz[i00] + x[i00] * scale_x) * rrms; | ||
| } |
Member
There was a problem hiding this comment.
Could you submit just this change alone in a PR to https://github.com/ggml-org/llama.cpp?
No need for a regression test.
Contributor
Author
|
Done — opened ggml-org/llama.cpp#24305 with just this change. Thanks! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #1491.
Bug
GGML_OP_RMS_NORM_BACKis listed inggml_op_can_inplace, so the backend scheduler may allocatedstover the buffer ofsrc0(=dz, the upstream gradient). On the CPU backend,ggml_compute_forward_rms_norm_back_f32was implemented as a four-step sequence:When
dstaliasessrc0, the+= dzstep reads the values written in the first step instead of the original gradient. With the issue's exact inputsdz = x = [1, 0, 0, 0],eps = 1e-4, this produces[-3.9976, 0, 0, 0]instead of the correct[0.000799, 0, 0, 0]— sign flipped, magnitude ~5000× off.Fix
Replace the multi-step sequence with a single-pass FMA loop that reads each element before writing it. This is in-place safe under aliasing of
dstwith eithersrc0orsrc1. The pattern matches the existing CUDA implementation atggml-cuda/norm.cu:205-207, which has always been correct.Regression test
Adds
tests/test-rms-norm-back-inplace.cpp. Comparesggml_graph_compute_with_ctx(no aliasing) againstggml_backend_sched_graph_compute(may alias) for three input shapes including the issue's exact reproducer. Without the fix, all three cases fail with significant divergences. With the fix, all pass.Out of scope
Quick audit on M1 Pro turned up two pre-existing unrelated failures that reproduce on master without these changes — flagging here in case they help triage:
test-backend-ops grad -o SUMon Metal:SUM(type=f32,ne=[33,256,1,1],permute=[1,0,2,3])reportsMAA = 0.997 > 1e-4.test-backend-ops grad -o RMS_NORMon Metal:RMS_NORM(...,inplace=1)hitsGGML_ASSERT(!node->view_src || ...)inggml_build_backward_expand(ggml.c:7037).SYCL
rms_norm_backimplementation wasn't audited here; should probably get the same review since it likely uses similar logic.