Add AMD GPU support via HIP (ROCm) #2
Open
jeffdaily wants to merge 1 commit into
This adds support for building and running CuRast on AMD GPUs with ROCm/HIP, alongside the existing CUDA build. A compatibility header maps the CUDA spellings used in the kernels and host code to their HIP equivalents, and the CUDA driver API plus the nvrtc/nvJitLink runtime compilation path are mapped to the HIP driver API and hiprtc. Build with the USE_HIP CMake option; select the target with CMAKE_HIP_ARCHITECTURES.

We have made every effort to leave the CUDA build unchanged: every HIP change is behind a USE_HIP / __HIP_PLATFORM_AMD__ guard that the CUDA build does not compile, the .cu kernels keep their CUDA spellings, and the compat header is a no-op include of the CUDA headers on the NVIDIA path.

Suggested review order:

1. cuda_to_hip.h -- the compatibility header. Aliases the CUDA runtime, driver, virtual-memory and external-memory APIs to HIP, and provides the HIP_DEVPTR_ADD helper for byte-offset arithmetic on hipDeviceptr_t (a void* on ROCm, which strict-C++ amdclang++ will not do arithmetic on).

2. HipModularProgram.h -- replaces CudaModularProgram for runtime compilation. HIP compiles directly to code objects via hiprtc; there is no LTO-IR intermediate or separate nvJitLink step, so multiple kernel sources are combined into one compilation unit. hiprtc has no equivalent of nvrtc's -default-device flag, so unannotated device helpers are wrapped in RTC-guarded `#pragma clang attribute push((device))` regions.

   Note that -ffast-math is NOT passed: clang fast-math implies -ffinite-math-only, which miscompiles this renderer's Infinity depth sentinels and NaN clear/compare values; nvcc --use_fast_math makes no finite-math assumption, so the CUDA build is unaffected.

3. CMakeLists.txt, cmake/common.cmake -- the USE_HIP option, enable_language (HIP), HIP source-file language assignment, and HIP library linking.

4. The kernel sources and CuRast_render.h -- HIP cooperative-groups and warp handling. The pipeline uses width-32 logical-warp operations (tiled_partition<32>, ballot, shfl), which are correct on both wave64 (CDNA) and wave32 (RDNA) without an architecture-specific code path.

   On the HIP build, stage1 and stage3 use occupancy-sized non-cooperative launches: their per-block counter initialization was moved to host-side memset so the grid-wide sync they previously needed is unnecessary, which also avoids a cooperative-launch failure observed on RDNA4. Texture mipmap generation likewise builds the pyramid one level at a time with ordinary launches in a host loop, the kernel boundary providing the grid-wide ordering grid.sync() gave. These launch changes are confined to #if defined(USE_HIP); the CUDA build keeps upstream's cooperative launches and in-kernel synchronization unchanged.

5. The Linux platform support (mmap in MappedFile.h, O_DIRECT unbuffered IO in unsuck_platform_specific.cpp), which the upstream README listed as a TODO, and the Windows amdclang++ (MSVC ABI) build fixes.

6. A headless `--bench <file.glb> [w h frames]` mode that renders to bench_render.png without a window or Vulkan, used to validate on machines without a display.

Two limitations of this build are documented in the README under Known Issues:

- HIP-Vulkan texture interop is unavailable. The CUDA-to-Vulkan mipmapped array export relies on hipExternalMemoryGetMappedMipmappedArray, which the ROCm 7.2 HIP runtime does not export, so that path is stubbed. Core rasterization and the headless bench path work without it.

- Kernel launches dispatched from inside the module-launch helper intermittently memory-fault on gfx90a (ROCm 7.2.1), while the same dispatch sequence inlined at the call site is reliable. This appears to be a ROCm runtime issue; until it is resolved, the rasterization hot paths dispatch their launches inline as a workaround.

Test Plan:

Built and run on three AMD architectures. Each renders the bundled example scene correctly to bench_render.png with no GPU faults.

Linux, AMD Instinct MI250X (gfx90a, CDNA2, wave64, ROCm 7.2.1) and AMD Radeon Pro W7800 (gfx1100, RDNA3, wave32):

```
cmake -S . -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a
cmake --build build -j
./build/CuRast --bench ./example_donaukanal_urania.glb 1920 1080 30
```

Windows, AMD Radeon RX 9070 XT (gfx1201, RDNA4, wave32), amdclang++:

```
cmake -S . -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1201 `
  -DCMAKE_HIP_COMPILER=<rocm>/lib/llvm/bin/amdclang++.exe -G Ninja
cmake --build build -j
.\build\CuRast.exe --bench .\example_donaukanal_urania.glb 1920 1080 30
```

All three render 966,461 triangles per frame at 1920x1080 with the visbuffer pipeline in well under a millisecond, and bench_render.png shows the scene correctly.

This work was authored with the assistance of Claude (an AI assistant by Anthropic).
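Reviewer note on item 1: the description says HIP_DEVPTR_ADD exists because hipDeviceptr_t is a void* on ROCm and arithmetic on void* is rejected in strict C++. The macro body is not shown in this description, so the following is only a minimal host-side sketch of the usual technique (round-tripping through char*). The names `DevPtr`, `devptr_add` and `devptr_diff` are hypothetical and are not taken from cuda_to_hip.h.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for hipDeviceptr_t, which the PR describes as a void* on ROCm.
// The alias name DevPtr is hypothetical and used only in this sketch.
using DevPtr = void*;

// Hypothetical equivalent of a byte-offset helper such as HIP_DEVPTR_ADD.
// Pointer arithmetic on void* is ill-formed in standard C++, so the offset
// is applied on a char* view and the result is converted back to void*.
inline DevPtr devptr_add(DevPtr base, std::size_t byteOffset) {
    return static_cast<void*>(static_cast<char*>(base) + byteOffset);
}

// Byte distance between two pointers into the same allocation (a - b).
inline std::ptrdiff_t devptr_diff(DevPtr a, DevPtr b) {
    return static_cast<char*>(a) - static_cast<char*>(b);
}
```

On the CUDA path CUdeviceptr is an integer type, so plain `+` already works there; a helper of this shape would only be needed on the HIP side.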
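Reviewer note on the -ffast-math remark in item 2: the renderer's actual sentinel code is not part of this description, so the sketch below is an assumed, simplified illustration of the kind of logic that depends on Infinity and NaN behaving per IEEE 754. `DepthBuffer`, `testAndWrite`, `isCovered` and `isCleared` are hypothetical names. Under -ffinite-math-only the compiler is allowed to assume such values never occur, so checks like these may be folded away; compiled without that flag they behave as the comments state.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical depth buffer that uses +Infinity as the "nothing drawn yet"
// sentinel, so that any finite fragment depth passes the first depth test.
struct DepthBuffer {
    std::vector<float> depth;

    explicit DepthBuffer(std::size_t n)
        : depth(n, std::numeric_limits<float>::infinity()) {}

    // Standard less-than depth test; relies on z < +Infinity being true.
    bool testAndWrite(std::size_t i, float z) {
        if (z < depth[i]) {
            depth[i] = z;
            return true;
        }
        return false;
    }

    // A pixel counts as covered once its depth is no longer the sentinel.
    bool isCovered(std::size_t i) const { return !std::isinf(depth[i]); }
};

// NaN used as a "cleared" marker is detected by self-inequality, which only
// holds when the compiler honors IEEE NaN semantics.
inline bool isCleared(float v) { return v != v; }
```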
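Reviewer note on the mipmap change in item 4: the description says the HIP build generates the pyramid one level at a time from a host loop, with each kernel boundary providing the ordering that grid.sync() gave. The sketch below models that control flow on the CPU only, assuming a 2x2 box filter and halving dimensions clamped to 1; the real kernels, filter and texture layout are not shown in this description, and `Level`, `downsample` and `buildPyramid` are hypothetical names. Each call to `downsample` stands in for one ordinary kernel launch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Level {
    int width;
    int height;
    std::vector<float> texels;  // row-major, width * height entries
};

// Stand-in for one ordinary (non-cooperative) launch: reads only the fully
// written source level and produces the next, half-resolution level.
Level downsample(const Level& src) {
    Level dst;
    dst.width = std::max(1, src.width / 2);
    dst.height = std::max(1, src.height / 2);
    dst.texels.resize(static_cast<std::size_t>(dst.width) * dst.height);
    for (int y = 0; y < dst.height; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            // Clamp so that 1-texel-wide or 1-texel-tall levels stay in range.
            int x0 = std::min(2 * x, src.width - 1);
            int x1 = std::min(2 * x + 1, src.width - 1);
            int y0 = std::min(2 * y, src.height - 1);
            int y1 = std::min(2 * y + 1, src.height - 1);
            float sum = src.texels[y0 * src.width + x0] +
                        src.texels[y0 * src.width + x1] +
                        src.texels[y1 * src.width + x0] +
                        src.texels[y1 * src.width + x1];
            dst.texels[y * dst.width + x] = 0.25f * sum;
        }
    }
    return dst;
}

// Host loop: level N+1 is started only after level N has finished, which is
// the ordering a grid-wide sync would otherwise provide inside one kernel.
std::vector<Level> buildPyramid(Level base) {
    std::vector<Level> levels;
    levels.push_back(std::move(base));
    while (levels.back().width > 1 || levels.back().height > 1) {
        levels.push_back(downsample(levels.back()));
    }
    return levels;
}
```

The trade-off is one launch per level instead of a single cooperative launch, in exchange for not depending on cooperative-launch support.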
jeffdaily added a commit to jeffdaily/moat that referenced this pull request Jun 11, 2026