perf(hip): enable -funsafe-math-optimizations for the ROCm backend by RapidMark · Pull Request #1526 · ggml-org/ggml

RapidMark · 2026-06-02T21:11:26Z

The CUDA backend compiles with -use_fast_math (ggml/src/ggml-cuda/CMakeLists.txt), so on NVIDIA the
transcendentals in the hot kernels (e.g. expf in attention/softmax) lower to the fast path. The
HIP/ROCm backend sets no fast-math equivalent, so the same expf() calls compile to the slow libm
routine on AMD — measurably slower in attention-bound work.

Why not just `-ffast-math` on HIP

A blanket clang -ffast-math on the HIP backend is not viable, for two concrete reasons (both
verified on gfx1201 / ROCm 7.0.2):

1. It implies -ffinite-math-only, under which any use of INFINITY is UB, and ggml legitimately uses
INFINITY (e.g. softmax -inf masking, common.cuh). The build fails with -Wnan-infinity-disabled.
2. expm1f returns -nan on overflow under -ffinite-math-only. expm1f is inlined as
exp(x)-1 via ldexp(poly(r), k); for large x the ldexp overflows to +inf (correct), but
-ffinite-math-only stamps the function with no-infs-fp-math=true, so the optimizer treats that
inf as poison and the result becomes -nan (e.g. expm1f(90/110/140) on gfx1201). Confirmed in
device LLVM IR: the ldexp is identical with and without the flag; only the function attribute
differs. This is -ffinite-math-only working as documented, not a codegen bug, but it is fatal
for ggml, which legitimately produces inf.
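
To make the two reasons concrete, here is a minimal Python sketch. It is not ggml or ROCm code: the names `masked_softmax` and `expm1_f32_sketch` are hypothetical, `math.exp(r)` stands in for the polynomial, and the float32 overflow is emulated with a simple range check on a double-precision result. The first function shows that -inf masking depends on exp(-inf) being exactly 0; the second mirrors the exp(x)-1 via ldexp(poly(r), k) shape and shows that +inf is the expected float32 result for the inputs mentioned.

```python
import math

FLT_MAX = 3.4028234663852886e38  # largest finite float32 value


def masked_softmax(scores, mask):
    """Softmax over `scores`; positions where mask is False are set to -inf first."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask)]
    peak = max(masked)  # finite as long as at least one position is kept
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) == 0.0 exactly
    total = sum(exps)
    return [e / total for e in exps]


def expm1_f32_sketch(x):
    """Emulate exp(x) - 1 computed as ldexp(poly(r), k) - 1 with a float32 result.

    Illustration only: math.exp(r) replaces the polynomial, and the float32
    overflow is modeled by comparing against FLT_MAX.
    """
    k = round(x / math.log(2.0))         # power-of-two exponent
    r = x - k * math.log(2.0)            # reduced argument, |r| <= ln(2)/2
    scaled = math.ldexp(math.exp(r), k)  # done in double, so no overflow here
    if scaled > FLT_MAX:                 # a float32 ldexp would overflow to +inf
        return math.inf
    return scaled - 1.0


if __name__ == "__main__":
    print(masked_softmax([1.0, 2.0, 3.0], [True, True, False]))
    for x in (10.0, 90.0, 110.0, 140.0):
        print(x, expm1_f32_sketch(x))
```

Both patterns rely on infinities being ordinary values: the -inf mask in the first function and the +inf overflow result in the second. Under -ffinite-math-only the compiler is allowed to assume neither ever occurs.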

The fix: `-funsafe-math-optimizations`

Bisecting the -ffast-math sub-flags on gfx1201 shows that the speedup comes entirely from
-funsafe-math-optimizations (e.g. expf drops from 140 to 85 cycles in a dependency-chained
shader-clock microbenchmark). That flag is inf-safe: it does not imply -ffinite-math-only, so
INFINITY stays valid, and expm1f is correct under it (expm1f(90/110/140) = inf). It therefore
gives the win without either failure mode: the inf-safe half of what NVIDIA already gets from
-use_fast_math.

# Parity with the CUDA backend's -use_fast_math (see ggml-cuda/CMakeLists.txt), minus the
# IEEE-breaking parts. This is intentionally NOT -ffast-math: that implies -ffinite-math-only,
# which both breaks ggml's deliberate INFINITY use (e.g. softmax -inf masking ->
# -Wnan-infinity-disabled) and miscompiles expm1f overflow to -nan on some targets.
# -funsafe-math-optimizations is inf/nan-safe and provides the transcendental speedup.
set(CMAKE_HIP_FLAGS "${CMAKE_HIP_FLAGS} -funsafe-math-optimizations")

Verification (AMD Radeon AI PRO R9700 / Navi 48 / RDNA4 / gfx1201, ROCm 7.0.2)

Correctness: built against pristine master (v0.13.1) with this one commit, test-backend-ops
on the ROCm0 backend reports 12384/12384 tests passed, including EXPM1 (no -nan). No source
changes, no guards.
Kernel perf: FLASH_ATTN_EXT (f16) +8.4% to +14.8% (e.g. 20513µs → 17460µs; 37.76µs → 34.82µs).
Real-workload note: Flux.1 Q4 diffusion is matmul-dominated at 1024², so the gain there is only
~0.5%; it rises with resolution (1536² +1.7%, 2048² +2.2%) as attention's share grows. The win
materializes for attention-bound workloads (LLM decode / long context / higher-res or video diffusion).
With -funsafe-math-optimizations applied globally to the sd-cli build (Flux Q4, 8 steps): 1024² 0%
(20.10s→20.10s), 2048² +3.1% (170.46s→165.19s). This is slightly above the per-file result because
the whole build is fast-math'd. Output is visually identical at a fixed seed. (LLM / video
benchmarks are the regime where this is a larger end-to-end win.)
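
A percentage gain can be quoted either as a reduction in elapsed time or as an increase in throughput, and the two differ for the same pair of timings. The sketch below computes both for the measurements quoted above. The helper names and the "case A/B" labels are hypothetical, and the PR does not state which convention each figure uses, so the output is for orientation only.

```python
def time_reduction_pct(before, after):
    """Percent of elapsed time saved: (before - after) / before."""
    return 100.0 * (before - after) / before


def throughput_gain_pct(before, after):
    """Percent increase in work per unit time: before / after - 1."""
    return 100.0 * (before / after - 1.0)


# (label, before, after) copied from the measurements quoted in this PR
samples = [
    ("FLASH_ATTN_EXT case A (us)", 20513.0, 17460.0),
    ("FLASH_ATTN_EXT case B (us)", 37.76, 34.82),
    ("sd-cli Flux Q4 2048^2 (s)", 170.46, 165.19),
    ("expf microbench (cycles)", 140.0, 85.0),
]

if __name__ == "__main__":
    for label, before, after in samples:
        print(f"{label}: time -{time_reduction_pct(before, after):.1f}%, "
              f"throughput +{throughput_gain_pct(before, after):.1f}%")
```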

Backward compatibility / safety

HIP path only; CUDA/CPU/Vulkan/Metal unaffected.
Inf/NaN semantics preserved (-funsafe-math-optimizations ≠ -ffinite-math-only); accuracy stays
within test-backend-ops NMSE tolerance (all ops pass). It's a less aggressive posture than the
-use_fast_math the CUDA backend already ships.
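
For readers unfamiliar with the metric, normalized mean squared error compares a backend's output with a reference and divides by the reference's energy, which makes the tolerance independent of scale. The sketch below uses that standard definition; the function name, the sample data, and the `tolerance` value are illustrative assumptions, not values taken from test-backend-ops.

```python
def nmse(reference, candidate):
    """Normalized MSE: sum((a - b)^2) / sum(a^2) over paired elements."""
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    err = sum((a - b) ** 2 for a, b in zip(reference, candidate))
    ref = sum(a * a for a in reference)
    return err / ref


if __name__ == "__main__":
    reference = [0.25, 0.5, 1.0, 2.0]
    # Simulate a small uniform relative error, as an approximate routine might add.
    candidate = [v * (1.0 + 1e-5) for v in reference]
    tolerance = 1e-7  # hypothetical threshold, for illustration only
    value = nmse(reference, candidate)
    print(value, value <= tolerance)
```

A uniform relative error of eps yields an NMSE of about eps squared, so small per-element deviations from a faster math routine translate into a much smaller NMSE.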

Related

The expm1f -nan under -ffinite-math-only is not a codegen defect to report upstream. It is the
documented behavior of that flag (assume no Inf/NaN, so Inf-producing code is UB), which ggml
violates on purpose. There is therefore nothing to file with ROCm/LLVM; the correct posture is
simply to use -funsafe-math-optimizations, which does not enable that assumption. A bisection of
the -ffast-math sub-flags isolating -ffinite-math-only as the sole trigger is available if useful.

The CUDA backend compiles with -use_fast_math (ggml-cuda/CMakeLists.txt), so transcendentals in the hot kernels (e.g. expf in attention/softmax) take the fast path on NVIDIA. The HIP/ROCm backend set no fast-math equivalent, leaving the same calls on the slow libm routine. Add -funsafe-math-optimizations: the inf/nan-safe subset of clang's -ffast-math. NOT -ffast-math itself, which implies -ffinite-math-only -- that both breaks ggml's deliberate INFINITY use (softmax -inf masking -> -Wnan-infinity-disabled) and miscompiles expm1f overflow to -nan on some targets. Verified on Radeon AI PRO R9700 (gfx1201, ROCm 7.0.2): builds clean, test-backend-ops 12384/12384 pass (incl EXPM1), FLASH_ATTN_EXT +8-15%.

RapidMark · 2026-06-09T16:30:27Z

Gentle nudge — this has been open about a week with the workflow still awaiting approval, so CI hasn't run yet. It's a one-line change (-funsafe-math-optimizations in the HIP CMakeLists — the inf-safe analog of the -use_fast_math the CUDA backend already uses; no source touched). Validated on RDNA4/gfx1201, ROCm 7.0.2: test-backend-ops 12384/12384 pass incl. EXPM1, FLASH_ATTN +8–15%. Could a maintainer approve the workflow run + take a look when there's a moment? Happy to adjust anything.

RapidMark wants to merge 1 commit into ggml-org:master from CloudhandsAI:cloudhands/hip-funsafe-math
