sycl: fix soft_max_f32 max reduction by someoneinjd · Pull Request #1534 · ggml-org/ggml

someoneinjd · 2026-06-10T09:43:04Z

Fix the second-stage max reduction in the SYCL softmax kernel when a workgroup contains more subgroups than WARP_SIZE.

Previously, after each subgroup wrote its partial max to buf_iw[warp_id], the cross-subgroup reduction only loaded:

max_val = buf_iw[lane_id];

This reduces only the first WARP_SIZE partial maxima. For example, on Intel GPUs, WARP_SIZE can be 16 while the softmax workgroup size is 1024, giving nwarps = 1024 / 16 = 64. In that case, the partial maxima in buf_iw[16..63] were ignored.
If the true row max lies in one of the ignored subgroups, softmax subtracts a max value that is too small. For large attention logits, exp(x - max) can then overflow to inf, which makes the row sum inf and the normalized output NaN.
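The failure mode can be reproduced on the host without a SYCL device. The sketch below is a minimal model, not the kernel code from this PR: `partial` stands in for `buf_iw`, and the function names `second_stage_max_truncated`, `second_stage_max_strided`, and `softmax_with_max` are hypothetical. The strided loop is only one plausible way to cover all nwarps entries and is not necessarily the fix applied here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

constexpr float NEG_INF = -std::numeric_limits<float>::infinity();

// partial[i] models buf_iw[i]: the partial max written by subgroup i.

// Old second stage: lane_id loads only buf_iw[lane_id], so entries at
// index >= warp_size never take part in the reduction.
float second_stage_max_truncated(const std::vector<float>& partial,
                                 std::size_t warp_size) {
    float max_val = NEG_INF;
    const std::size_t n = std::min(warp_size, partial.size());
    for (std::size_t lane_id = 0; lane_id < n; ++lane_id) {
        max_val = std::max(max_val, partial[lane_id]);
    }
    return max_val;
}

// Assumed alternative (illustration only): each lane walks the buffer with
// stride warp_size, so every partial max is visited exactly once.
float second_stage_max_strided(const std::vector<float>& partial,
                               std::size_t warp_size) {
    assert(warp_size > 0);
    float max_val = NEG_INF;
    for (std::size_t lane_id = 0; lane_id < warp_size; ++lane_id) {
        float lane_max = NEG_INF;
        for (std::size_t i = lane_id; i < partial.size(); i += warp_size) {
            lane_max = std::max(lane_max, partial[i]);
        }
        // Stands in for the subgroup-wide max across lanes.
        max_val = std::max(max_val, lane_max);
    }
    return max_val;
}

// Softmax of one row, using a caller-supplied max for stabilization.
std::vector<float> softmax_with_max(const std::vector<float>& x, float max_val) {
    std::vector<float> out(x.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i] - max_val);  // overflows to inf if max_val is too small
        sum += out[i];
    }
    for (float& v : out) {
        v /= sum;  // inf / inf yields NaN
    }
    return out;
}
```

With 64 partial maxima, WARP_SIZE = 16, and the largest value (200) stored at index 40, the truncated version returns 0 while the strided version returns 200. Feeding the same 64 values through `softmax_with_max` with the truncated max gives exp(200) = inf in float, so the entry at index 40 becomes NaN; with the full max it becomes 1.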

llama.cpp PR: ggml-org/llama.cpp#24451

sycl: fix soft_max_f32 max reduction

4342dce

someoneinjd force-pushed the master branch from 621995d to 4342dce on June 13, 2026 01:10

someoneinjd wants to merge 1 commit into ggml-org:master from someoneinjd:master
