perf(hip): enable -funsafe-math-optimizations for the ROCm backend #1526

Open
RapidMark wants to merge 1 commit into ggml-org:master from CloudhandsAI:cloudhands/hip-funsafe-math

@RapidMark

The CUDA backend compiles with -use_fast_math (ggml/src/ggml-cuda/CMakeLists.txt), so on NVIDIA the
transcendentals in the hot kernels (e.g. expf in attention/softmax) lower to the fast path. The
HIP/ROCm backend sets no fast-math equivalent, so the same expf() calls compile to the slow libm
routine on AMD — measurably slower in attention-bound work.
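
For context, here is a minimal host-side sketch of where that expf call sits in a softmax row, including the -inf masking that ggml uses. It is not ggml's kernel: the function name `masked_softmax` and the use of `std::vector` are assumptions made only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side reference softmax (hypothetical helper, not ggml code).
// Masked positions are expected to hold -INFINITY.
std::vector<float> masked_softmax(const std::vector<float> & logits) {
    assert(!logits.empty());
    // Subtract the row maximum for numerical stability.
    const float max_val = *std::max_element(logits.begin(), logits.end());
    assert(std::isfinite(max_val)); // at least one position must be unmasked

    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        // This exp is the hot transcendental; exp(-inf) is exactly 0,
        // which is what makes -inf masking work.
        probs[i] = std::exp(logits[i] - max_val);
        sum += probs[i];
    }
    for (float & p : probs) {
        p /= sum;
    }
    return probs;
}
```

One exp call per element is why the cost of expf shows up directly in attention-bound kernels.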

Why not just -ffast-math on HIP

A blanket clang -ffast-math on the HIP backend is not viable, for two concrete reasons (both
verified on gfx1201 / ROCm 7.0.2):

  1. It implies -ffinite-math-only, under which INFINITY is UB — and ggml legitimately uses
    INFINITY (e.g. softmax -inf masking, common.cuh). The build fails with -Wnan-infinity-disabled.
  2. expm1f returns -nan on overflow under -ffinite-math-only: expm1f is inlined as
    exp(x)-1 via ldexp(poly(r), k); for large x the ldexp overflows to +inf (correct), but
    -ffinite-math-only stamps the function with no-infs-fp-math=true, so the optimizer treats that
    inf as poison → -nan (e.g. expm1f(90/110/140) on gfx1201). Confirmed in device LLVM IR (the
    ldexp is identical with/without the flag — only the function attribute differs). This is
    -ffinite-math-only working as documented, not a codegen bug — but it's fatal for ggml, which
    legitimately produces inf.
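
The overflow path in item 2 can be sketched on the host as follows. This is an illustration only: `expm1_via_ldexp` is a hypothetical name, and the degree-7 Taylor polynomial is a stand-in for whatever polynomial the device library actually uses. Compiled without -ffinite-math-only, the ldexp overflow yields +inf as expected; the failure described above arises only once the compiler is allowed to assume that this inf cannot occur.

```cpp
#include <cassert>
#include <cmath>

// Illustrative expm1 built as exp(x) - 1, with exp(x) = ldexp(poly(r), k).
// Not the ROCm device library implementation; accuracy near x = 0 is poor,
// which is one reason real expm1 routines are more elaborate.
float expm1_via_ldexp(float x) {
    assert(!std::isnan(x));
    const float ln2 = 0.69314718056f;
    // Range reduction: x = k*ln2 + r, with |r| <= ln2/2.
    const int   k = static_cast<int>(std::lround(x / ln2));
    const float r = x - static_cast<float>(k) * ln2;
    // poly(r) ~ exp(r): degree-7 Taylor series evaluated in Horner form.
    float p = 1.0f;
    for (int n = 7; n >= 1; --n) {
        p = 1.0f + p * r / static_cast<float>(n);
    }
    // ldexp scales by 2^k; for large x this overflows to +inf,
    // and inf - 1 stays inf under IEEE semantics.
    return std::ldexp(p, k) - 1.0f;
}
```

For the inputs quoted above (90, 110, 140), e^x exceeds the largest finite float, so +inf is the correct result.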

The fix: -funsafe-math-optimizations

Bisecting the -ffast-math sub-flags on gfx1201 shows that the speedup comes entirely from
-funsafe-math-optimizations (e.g. expf 140→85 cyc in a dependency-chained shader-clock
microbench), and that flag is inf-safe: it does not imply -ffinite-math-only, so INFINITY stays
valid, and expm1f is correct under it (expm1f(90/110/140) = inf). So it gives the win without
either failure mode: the inf-safe half of what NVIDIA already gets from -use_fast_math.

# Parity with the CUDA backend's -use_fast_math (see ggml-cuda/CMakeLists.txt), minus the
# IEEE-breaking parts. This is intentionally NOT -ffast-math: that implies -ffinite-math-only,
# which both breaks ggml's deliberate INFINITY use (e.g. softmax -inf masking ->
# -Wnan-infinity-disabled) and miscompiles expm1f overflow to -nan on some targets.
# -funsafe-math-optimizations is inf/nan-safe and provides the transcendental speedup.
set(CMAKE_HIP_FLAGS "${CMAKE_HIP_FLAGS} -funsafe-math-optimizations")
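
To illustrate what "dependency-chained" means for the expf microbenchmark quoted above, here is a host-side analog. It is an assumption-laden sketch, not the benchmark used for the PR: the original ran on the GPU and read the shader clock, whereas this version uses `std::chrono` on the CPU, and the names `ChainResult` and `time_chained_expf` are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

struct ChainResult {
    float  final_value;  // returned so the chain cannot be optimized away
    double ns_per_call;  // average latency of one call in the chain
};

// Each exp call consumes the previous result, so calls cannot overlap and
// the measurement reflects latency rather than throughput.
ChainResult time_chained_expf(int iters) {
    assert(iters > 0);
    float x = 0.5f;
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        x = std::exp(-x); // bounded fixed-point iteration, never overflows
    }
    const auto t1 = std::chrono::steady_clock::now();
    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return {x, ns / iters};
}
```

The iteration x = exp(-x) is chosen because it converges to a fixed point near 0.567, so the input stays in a normal range for the whole run.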

Verification (AMD Radeon AI PRO R9700 / Navi 48 / RDNA4 / gfx1201, ROCm 7.0.2)

  • Correctness: built against pristine master (v0.13.1) with this one commit, test-backend-ops
    on the ROCm0 backend reports 12384/12384 tests passed, including EXPM1 (no -nan). No source
    changes, no guards.
  • Kernel perf: FLASH_ATTN_EXT (f16) +8.4% to +14.8% (e.g. 20513µs → 17460µs; 37.76µs → 34.82µs).
  • Real-workload note: Flux.1 Q4 diffusion is matmul-dominated at 1024², so the gain there is
    only ~0.5%; it rises with resolution (1536² +1.7%, 2048² +2.2%) as attention's share grows.
    The win materializes for attention-bound workloads (LLM decode / long context / higher-res or
    video diffusion). With -funsafe-math-optimizations applied globally to the sd-cli build
    (Flux Q4, 8 steps): 1024² 0% (20.10s→20.10s), 2048² +3.1% (170.46s→165.19s), slightly above
    the per-file figure since the whole build is compiled with the flag. Output is visually
    identical at a fixed seed. (LLM / video workloads, not benchmarked here, are the regime where
    a larger end-to-end win is expected.)
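
When reading the before/after timings above, note that a percentage can be computed in two common ways, and the PR text does not say which convention each figure uses. The sketch below (helper names are hypothetical) computes both from a pair of timings.

```cpp
#include <cassert>

// Share of wall time removed: (before - after) / before.
double time_reduction_pct(double before, double after) {
    assert(before > 0.0 && after > 0.0);
    return 100.0 * (before - after) / before;
}

// Throughput gain: before / after - 1.
double throughput_gain_pct(double before, double after) {
    assert(before > 0.0 && after > 0.0);
    return 100.0 * (before / after - 1.0);
}
```

For the quoted pairs, 20513µs → 17460µs gives about 14.9% time reduction or 17.5% throughput gain, 37.76µs → 34.82µs gives about 7.8% or 8.4%, and 170.46s → 165.19s gives about 3.1% or 3.2%.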

Backward compatibility / safety

  • HIP path only; CUDA/CPU/Vulkan/Metal unaffected.
  • Inf/NaN semantics preserved (-funsafe-math-optimizations does not imply -ffinite-math-only);
    accuracy stays within test-backend-ops NMSE tolerance (all ops pass). It's a less aggressive
    posture than the -use_fast_math the CUDA backend already ships.
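
For readers unfamiliar with the NMSE check mentioned above, here is a minimal sketch assuming the usual definition, sum((a - b)²) / sum(a²), with `a` as the reference output. The helper name and signature are illustrative; the exact formula and per-op thresholds should be taken from the test-backend-ops source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Normalized mean squared error of `b` against reference `a`
// (assumed definition: squared error divided by reference energy).
double nmse(const std::vector<float> & a, const std::vector<float> & b) {
    assert(a.size() == b.size());
    double err = 0.0; // sum of squared differences
    double ref = 0.0; // sum of squared reference values
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        err += d * d;
        ref += static_cast<double>(a[i]) * static_cast<double>(a[i]);
    }
    assert(ref > 0.0); // reference must not be all zeros
    return err / ref;
}
```

Because the error is normalized by the reference energy, small relative deviations from a fast-math expf stay small in NMSE regardless of the tensor's scale.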

Related

The expm1f -nan under -ffinite-math-only is not a codegen defect to report upstream — it's the
documented behavior of that flag (assume no Inf/NaN → Inf-producing code is UB), which ggml violates
on purpose. So there's nothing to file with ROCm/LLVM; the correct posture is simply to use
-funsafe-math-optimizations, which does not enable that assumption. Bisection of the -ffast-math
sub-flags isolating -ffinite-math-only as the sole trigger is available if useful.

The CUDA backend compiles with -use_fast_math (ggml-cuda/CMakeLists.txt), so
transcendentals in the hot kernels (e.g. expf in attention/softmax) take the
fast path on NVIDIA. The HIP/ROCm backend set no fast-math equivalent, leaving
the same calls on the slow libm routine.

Add -funsafe-math-optimizations: the inf/nan-safe subset of clang's -ffast-math.
NOT -ffast-math itself, which implies -ffinite-math-only -- that both breaks
ggml's deliberate INFINITY use (softmax -inf masking -> -Wnan-infinity-disabled)
and miscompiles expm1f overflow to -nan on some targets.

Verified on Radeon AI PRO R9700 (gfx1201, ROCm 7.0.2): builds clean,
test-backend-ops 12384/12384 pass (incl EXPM1), FLASH_ATTN_EXT +8-15%.
@RapidMark


Gentle nudge — this has been open about a week with the workflow still awaiting approval, so CI hasn't run yet. It's a one-line change (-funsafe-math-optimizations in the HIP CMakeLists — the inf-safe analog of the -use_fast_math the CUDA backend already uses; no source touched). Validated on RDNA4/gfx1201, ROCm 7.0.2: test-backend-ops 12384/12384 pass incl. EXPM1, FLASH_ATTN +8–15%. Could a maintainer approve the workflow run + take a look when there's a moment? Happy to adjust anything.
