
sycl: fix soft_max_f32 max reduction #1534

Open
someoneinjd wants to merge 1 commit into ggml-org:master from someoneinjd:master


someoneinjd commented Jun 10, 2026


Fix the second-stage max reduction in the SYCL softmax kernel when a workgroup contains more subgroups than WARP_SIZE.

Previously, after each subgroup wrote its partial max to buf_iw[warp_id], the cross-subgroup reduction only loaded:

max_val = buf_iw[lane_id];

This only reduces the first WARP_SIZE partial maxima. For example, on Intel GPUs, WARP_SIZE can be 16 while the softmax workgroup size is 1024, producing nwarps = 64. In that case, partial maxima from buf_iw[16..63] were ignored.
If the true row max lies in one of the ignored subgroups, softmax subtracts a max value that is too small. For large attention logits, exp(x - max) can then overflow to inf, so the row sum is also inf and normalizing by it produces NaN.
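
The failure mode can be reproduced on the CPU with a minimal sketch. This is a hypothetical model of the two-stage reduction, not the SYCL kernel itself: the function names are made up for illustration, and the strided loop in `reduce_all_partials` is only one possible way to cover every entry of buf_iw, not necessarily what this PR does.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Stage 1 (model): every subgroup of `warp_size` lanes reduces its slice of
// the row to one partial max, stored at buf_iw[warp_id].
std::vector<float> partial_maxima(const std::vector<float>& row, std::size_t warp_size) {
    std::vector<float> buf_iw;
    for (std::size_t start = 0; start < row.size(); start += warp_size) {
        const std::size_t end = std::min(start + warp_size, row.size());
        buf_iw.push_back(*std::max_element(row.begin() + start, row.begin() + end));
    }
    return buf_iw;
}

// Stage 2 as described for the old code: each lane loads buf_iw[lane_id],
// so only the first `warp_size` partial maxima take part in the reduction.
float reduce_first_warp_only(const std::vector<float>& buf_iw, std::size_t warp_size) {
    float max_val = -std::numeric_limits<float>::infinity();
    for (std::size_t lane_id = 0; lane_id < warp_size && lane_id < buf_iw.size(); ++lane_id) {
        max_val = std::max(max_val, buf_iw[lane_id]);
    }
    return max_val;
}

// Assumed fix for illustration: each lane strides over buf_iw in steps of
// `warp_size`, so all nwarps partial maxima are visited.
float reduce_all_partials(const std::vector<float>& buf_iw, std::size_t warp_size) {
    float max_val = -std::numeric_limits<float>::infinity();
    for (std::size_t lane_id = 0; lane_id < warp_size; ++lane_id) {
        for (std::size_t i = lane_id; i < buf_iw.size(); i += warp_size) {
            max_val = std::max(max_val, buf_iw[i]);
        }
    }
    return max_val;
}

// Plain softmax that trusts the caller-supplied max, as the kernel does.
std::vector<float> softmax_with_max(const std::vector<float>& row, float max_val) {
    std::vector<float> out(row.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < row.size(); ++i) {
        out[i] = std::exp(row[i] - max_val);  // overflows to inf if max_val is too small
        sum += out[i];
    }
    for (float& v : out) {
        v /= sum;  // inf / inf yields NaN
    }
    return out;
}
```

With a 1024-element row of zeros, a single logit of 200 at index 1000, and a warp size of 16, the partial max for that logit lands in buf_iw[62]. `reduce_first_warp_only` never reads it and returns 0, so `softmax_with_max` computes exp(200), which overflows in single precision and turns that output into NaN. `reduce_all_partials` returns 200 and the same output becomes 1.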

llama.cpp PR: ggml-org/llama.cpp#24451

