perf(hip): enable -funsafe-math-optimizations for the ROCm backend #1526
Open
RapidMark wants to merge 1 commit into
The CUDA backend compiles with -use_fast_math (ggml-cuda/CMakeLists.txt), so transcendentals in the hot kernels (e.g. expf in attention/softmax) take the fast path on NVIDIA. The HIP/ROCm backend set no fast-math equivalent, leaving the same calls on the slow libm routine. Add -funsafe-math-optimizations: the inf/nan-safe subset of clang's -ffast-math. NOT -ffast-math itself, which implies -ffinite-math-only -- that both breaks ggml's deliberate INFINITY use (softmax -inf masking -> -Wnan-infinity-disabled) and miscompiles expm1f overflow to -nan on some targets. Verified on Radeon AI PRO R9700 (gfx1201, ROCm 7.0.2): builds clean, test-backend-ops 12384/12384 pass (incl EXPM1), FLASH_ATTN_EXT +8-15%.
Author
Gentle nudge: this has been open about a week with the workflow still awaiting approval, so CI hasn't run yet. It's a one-line change (
The CUDA backend compiles with -use_fast_math (ggml/src/ggml-cuda/CMakeLists.txt), so on NVIDIA the transcendentals in the hot kernels (e.g. expf in attention/softmax) lower to the fast path. The HIP/ROCm backend sets no fast-math equivalent, so the same expf() calls compile to the slow libm routine on AMD, which is measurably slower in attention-bound work.
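For readers less familiar with where expf sits in these kernels, here is a minimal host-side sketch of a masked softmax. It is only an illustration under stated assumptions: the name masked_softmax and the boolean keep mask are hypothetical and are not ggml's API or kernel code. It shows the exp call in the inner loop and the -inf masking pattern mentioned in the commit message, which relies on exp(-inf) evaluating to exactly 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference softmax with masking; not ggml's implementation.
// Masked positions are set to -INFINITY so that exp() maps them to 0.
std::vector<float> masked_softmax(const std::vector<float>& logits,
                                  const std::vector<bool>& keep) {
    assert(logits.size() == keep.size());
    std::vector<float> out(logits.size());

    // Apply the mask and track the maximum for numerical stability.
    float max_val = -INFINITY;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        out[i] = keep[i] ? logits[i] : -INFINITY;
        max_val = std::max(max_val, out[i]);
    }

    // Everything masked (or empty input): return zeros and avoid inf - inf.
    if (max_val == -INFINITY) {
        std::fill(out.begin(), out.end(), 0.0f);
        return out;
    }

    // The exp call below is the transcendental on the hot path.
    float sum = 0.0f;
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = std::exp(out[i] - max_val);  // exp(-inf) == 0 for masked slots
        sum += out[i];
    }
    for (float& v : out) {
        v /= sum;
    }
    return out;
}
```

Calling masked_softmax({1.0f, 2.0f, 3.0f}, {true, false, true}) gives the masked middle entry a weight of exactly zero, and the remaining two weights sum to 1 up to rounding. The sketch only works if the compiler is allowed to treat infinities as real values.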
Why not just -ffast-math on HIP

A blanket clang -ffast-math on the HIP backend is not viable, for two concrete reasons (both verified on gfx1201 / ROCm 7.0.2):

1. It implies -ffinite-math-only, under which INFINITY is UB, and ggml legitimately uses INFINITY (e.g. softmax -inf masking, common.cuh). The build fails with -Wnan-infinity-disabled.
2. expm1f returns -nan on overflow under -ffinite-math-only: expm1f is inlined as exp(x)-1 via ldexp(poly(r), k); for large x the ldexp overflows to +inf (correct), but -ffinite-math-only stamps the function with no-infs-fp-math=true, so the optimizer treats that inf as poison → -nan (e.g. expm1f(90/110/140) on gfx1201). Confirmed in device LLVM IR (the ldexp is identical with/without the flag; only the function attribute differs). This is -ffinite-math-only working as documented, not a codegen bug, but it's fatal for ggml, which legitimately produces inf.

The fix: -funsafe-math-optimizations

Bisecting the -ffast-math sub-flags on gfx1201 shows that the speedup comes entirely from -funsafe-math-optimizations (e.g. expf 140→85 cyc, a dependency-chained shader-clock microbench), and that flag is inf-safe: it does not imply -ffinite-math-only, so INFINITY stays valid, and expm1f is correct under it (expm1f(90/110/140) = inf). So it gives the win without either failure mode: the inf-safe half of what NVIDIA already gets from -use_fast_math.

Verification (AMD Radeon AI PRO R9700 / Navi 48 / RDNA4 / gfx1201, ROCm 7.0.2)
test-backend-ops on the ROCm0 backend reports 12384/12384 tests passed, including EXPM1 (no -nan). No source changes, no guards.

FLASH_ATTN_EXT (f16): +8.4% to +14.8% (e.g. 20513µs → 17460µs; 37.76µs → 34.82µs).

... resolution (1536² +1.7%, 2048² +2.2%) as attention's share grows. The win materializes for attention-bound workloads (LLM decode / long context / higher-res or video diffusion).

Full global -funsafe-math-optimizations, sd-cli, Flux Q4 8-step: 1024² 0% (20.10s→20.10s), 2048² +3.1% (170.46s→165.19s), slightly above per-file since the whole build is fast-math'd. Output visually identical at fixed seed. (LLM / video benchmarks are the regime where this is a larger end-to-end win.)
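How the percentages above were derived from the raw timings is not stated, and two common conventions give slightly different numbers for the same pair of measurements: reduction in runtime, and gain in throughput. The following is a small sketch of both; the function names are hypothetical and chosen only for illustration.

```cpp
#include <cassert>

// Percentage by which the runtime shrank: (before - after) / before.
double runtime_reduction_pct(double before, double after) {
    assert(before > 0.0 && after > 0.0);
    return (before - after) / before * 100.0;
}

// Percentage by which throughput grew: before / after - 1.
double throughput_gain_pct(double before, double after) {
    assert(before > 0.0 && after > 0.0);
    return (before / after - 1.0) * 100.0;
}
```

For example, the 170.46s → 165.19s pair works out to roughly 3.1% as a runtime reduction and roughly 3.2% as a throughput gain. The gap between the two conventions widens as the speedup grows, so it helps to say which one a quoted figure uses.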
Backward compatibility / safety

... (-funsafe-math-optimizations ≠ -ffinite-math-only); accuracy stays within test-backend-ops NMSE tolerance (all ops pass). It's a less aggressive posture than the -use_fast_math the CUDA backend already ships.

Related
The expm1f -nan under -ffinite-math-only is not a codegen defect to report upstream: it's the documented behavior of that flag (assume no Inf/NaN → Inf-producing code is UB), which ggml violates on purpose. So there's nothing to file with ROCm/LLVM; the correct posture is simply to use -funsafe-math-optimizations, which does not enable that assumption. Bisection of the -ffast-math sub-flags isolating -ffinite-math-only as the sole trigger is available if useful.
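To make the overflow path concrete, here is a rough host-side sketch of the exp(x)-1 via ldexp(poly(r), k) shape described in this PR. It is an assumption-laden illustration, not the ROCm device library code: the name expm1_sketch, the degree-4 Taylor polynomial, and the input range check are all hypothetical choices. Compiled with default floating-point flags, the ldexp overflows to +inf for large x and the result stays +inf.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the inlined expm1f expansion; not the real
// device implementation. Computes exp(x) - 1 as ldexp(poly(r), k) - 1.
float expm1_sketch(float x) {
    // Keep the float-to-int conversion below well defined.
    assert(std::fabs(x) < 1000.0f);

    // Range reduction: x = k * ln2 + r, with r small.
    const float ln2 = 0.69314718f;
    const int k = static_cast<int>(std::nearbyint(x / ln2));
    const float r = x - static_cast<float>(k) * ln2;

    // Low-degree Taylor polynomial for exp(r); accuracy is illustrative only.
    const float p = 1.0f + r * (1.0f + r * (0.5f + r * (1.0f / 6.0f + r * (1.0f / 24.0f))));

    // For large x, p * 2^k exceeds FLT_MAX and ldexp returns +inf.
    // Subtracting 1 keeps it +inf when infinities are honored.
    return std::ldexp(p, k) - 1.0f;
}
```

With infinities honored, expm1_sketch(90.0f) evaluates to +inf, matching the expm1f(90/110/140) = inf result reported for -funsafe-math-optimizations. Under -ffinite-math-only the compiler is permitted to assume that intermediate inf never occurs, which is the situation this PR describes as producing -nan on gfx1201.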