Skip to content

ggml-cpu : fix rms_norm_back wrong output under in-place aliasing (#1491)#1519

Open
devYRPauli wants to merge 1 commit into
ggml-org:masterfrom
devYRPauli:fix-rms-norm-back-inplace
Open

ggml-cpu : fix rms_norm_back wrong output under in-place aliasing (#1491)#1519
devYRPauli wants to merge 1 commit into
ggml-org:masterfrom
devYRPauli:fix-rms-norm-back-inplace

Conversation

@devYRPauli

Copy link
Copy Markdown
Contributor

Fixes #1491.

Bug

GGML_OP_RMS_NORM_BACK is listed in ggml_op_can_inplace, so the backend scheduler may allocate dst over the buffer of src0 (= dz, the upstream gradient). On the CPU backend, ggml_compute_forward_rms_norm_back_f32 was implemented as a four-step sequence:

ggml_vec_cpy_f32  (ne00, dx, x);                       // dx = x  (overwrites dz if dst aliases src0)
ggml_vec_scale_f32(ne00, dx, (float)(-sum_xdz)/sum_eps);
ggml_vec_acc_f32  (ne00, dx, dz);                      // re-reads dz from now-overwritten memory
ggml_vec_scale_f32(ne00, dx, rrms);

When dst aliases src0, the += dz step reads the values written in the first step instead of the original gradient. With the issue's exact inputs dz = x = [1, 0, 0, 0], eps = 1e-4, this produces [-3.9976, 0, 0, 0] instead of the correct [0.000799, 0, 0, 0] — sign flipped, magnitude ~5000× off.

Fix

Replace the multi-step sequence with a single-pass FMA loop that reads each element before writing it. This is in-place safe under aliasing of dst with either src0 or src1. The pattern matches the existing CUDA implementation at ggml-cuda/norm.cu:205-207, which has always been correct.

Regression test

Adds tests/test-rms-norm-back-inplace.cpp. Compares ggml_graph_compute_with_ctx (no aliasing) against ggml_backend_sched_graph_compute (may alias) for three input shapes including the issue's exact reproducer. Without the fix, all three cases fail with significant divergences. With the fix, all pass.

Out of scope

Quick audit on M1 Pro turned up two pre-existing unrelated failures that reproduce on master without these changes — flagging here in case they help triage:

  • test-backend-ops grad -o SUM on Metal: SUM(type=f32,ne=[33,256,1,1],permute=[1,0,2,3]) reports MAA = 0.997 > 1e-4.
  • test-backend-ops grad -o RMS_NORM on Metal: RMS_NORM(...,inplace=1) hits GGML_ASSERT(!node->view_src || ...) in ggml_build_backward_expand (ggml.c:7037).

SYCL rms_norm_back implementation wasn't audited here; should probably get the same review since it likely uses similar logic.

…ml-org#1491)

GGML_OP_RMS_NORM_BACK is listed in ggml_op_can_inplace, so the backend
scheduler may allocate dst over the buffer of src0 (= dz, the upstream
gradient). The CPU implementation was a four-step sequence:

    ggml_vec_cpy_f32  (ne00, dx, x);                          // overwrites dz if dst aliases src0
    ggml_vec_scale_f32(ne00, dx, (float)(-sum_xdz)/sum_eps);
    ggml_vec_acc_f32  (ne00, dx, dz);                         // re-reads now-corrupted dz
    ggml_vec_scale_f32(ne00, dx, rrms);

When dst aliases src0, the += dz step reads the values written in the
first step instead of the original gradient. With the issue's inputs
dz=x=[1,0,0,0], eps=1e-4 this produces [-3.9976, 0, 0, 0] instead of the
correct [0.000799, 0, 0, 0] (sign flipped, magnitude ~5000x off).

Replace the multi-step sequence with a single-pass FMA loop that reads
each element before writing it. In-place safe under aliasing of dst with
either src0 or src1. Matches the existing CUDA implementation at
ggml-cuda/norm.cu:205-207, which has always been correct.

Adds tests/test-rms-norm-back-inplace.cpp comparing
ggml_graph_compute_with_ctx (no aliasing) against
ggml_backend_sched_graph_compute (may alias) for three shapes including
the issue's exact reproducer.
Comment thread src/ggml-cpu/ops.cpp
Comment on lines +3977 to +3988
// dx[i00] = (dz + x*(-sum_xdz/sum_eps)) * rrms
//
// GGML_OP_RMS_NORM_BACK is listed in ggml_op_can_inplace, so the
// scheduler may alias dst with src0 (dz) or src1 (x). A single
// fused read-before-write loop is safe under either aliasing.
// A multi-step cpy/scale/acc/scale sequence is NOT safe — if dst
// aliases dz, the +=dz step would re-read overwritten memory.
// See: https://github.com/ggml-org/ggml/issues/1491
const float scale_x = (float) (-sum_xdz) / sum_eps;
for (int64_t i00 = 0; i00 < ne00; i00++) {
dx[i00] = (dz[i00] + x[i00] * scale_x) * rrms;
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you submit just this change alone in a PR to https://github.com/ggml-org/llama.cpp?

No need for a regression test.

@devYRPauli

Copy link
Copy Markdown
Contributor Author

Done — opened ggml-org/llama.cpp#24305 with just this change. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ggml_rms_norm_back: backend-sched compute produces different (wrong) result than legacy compute_with_ctx

2 participants