Add AMD GPU support via HIP (ROCm) by jeffdaily · Pull Request #2 · m-schuetz/CuRast

jeffdaily · 2026-06-11T22:03:00Z

This adds support for building and running CuRast on AMD GPUs with
ROCm/HIP, alongside the existing CUDA build. A compatibility header maps
the CUDA spellings used in the kernels and host code to their HIP
equivalents, and the CUDA driver API plus the nvrtc/nvJitLink runtime
compilation path are mapped to the HIP driver API and hiprtc. Build with
the USE_HIP CMake option; select the target with CMAKE_HIP_ARCHITECTURES.

We have made every effort to leave the CUDA build unchanged: every HIP
change is behind a USE_HIP / __HIP_PLATFORM_AMD__ guard that the CUDA build
does not compile, the .cu kernels keep their CUDA spellings, and the compat
header is a no-op include of the CUDA headers on the NVIDIA path.
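
A minimal sketch of that guard pattern, using the byte-offset helper that cuda_to_hip.h provides as the guarded item. The names `DevPtr` and `devptrAdd` are hypothetical stand-ins, not the PR's actual definitions, and the two typedefs only mirror hipDeviceptr_t (a void*) and CUdeviceptr (an unsigned integer) so the sketch compiles without either toolkit installed:

```cpp
#include <cstddef>

#if defined(USE_HIP)
// HIP path (only compiled when USE_HIP is defined): the device pointer is a
// void*, so step through a char* to add a byte offset.
using DevPtr = void*;  // stand-in for hipDeviceptr_t
inline DevPtr devptrAdd(DevPtr base, std::size_t bytes) {
    return static_cast<char*>(base) + bytes;
}
#else
// CUDA path: the device pointer is an unsigned integer, so plain addition
// already works and nothing HIP-specific is compiled.
using DevPtr = unsigned long long;  // stand-in for CUdeviceptr
inline DevPtr devptrAdd(DevPtr base, std::size_t bytes) {
    return base + bytes;
}
#endif
```

Call sites would then use `devptrAdd(ptr, offset)` on both paths, so the byte-offset arithmetic itself needs no per-platform branches.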

Suggested review order:

1. cuda_to_hip.h -- the compatibility header. Aliases the CUDA runtime,
   driver, virtual-memory and external-memory APIs to HIP, and provides the
   HIP_DEVPTR_ADD helper for byte-offset arithmetic on hipDeviceptr_t (a
   void* on ROCm, on which amdclang++ in strict C++ mode rejects pointer
   arithmetic).
2. HipModularProgram.h -- replaces CudaModularProgram for runtime
   compilation. HIP compiles directly to code objects via hiprtc; there is
   no LTO-IR intermediate or separate nvJitLink step, so multiple kernel
   sources are combined into one compilation unit. hiprtc has no equivalent
   of nvrtc's -default-device flag, so unannotated device helpers are
   wrapped in RTC-guarded `#pragma clang attribute push((device))` regions.
   Note that -ffast-math is NOT passed: clang fast-math implies
   -ffinite-math-only, which miscompiles this renderer's Infinity depth
   sentinels and NaN clear/compare values; nvcc --use_fast_math makes no
   finite-math assumption, so the CUDA build is unaffected.
3. CMakeLists.txt, cmake/common.cmake -- the USE_HIP option,
   enable_language(HIP), HIP source-file language assignment, and HIP
   library linking.
4. The kernel sources and CuRast_render.h -- HIP cooperative-groups and warp
   handling. The pipeline uses width-32 logical-warp operations
   (tiled_partition<32>, ballot, shfl), which are correct on both wave64
   (CDNA) and wave32 (RDNA) without an architecture-specific code path.
   On the HIP build, stage1 and stage3 use occupancy-sized non-cooperative
   launches: their per-block counter initialization was moved to a host-side
   memset, so the grid-wide sync they previously needed is unnecessary. This
   also avoids a cooperative-launch failure observed on RDNA4. Texture mipmap
   generation likewise builds the pyramid one level at a time with ordinary
   launches in a host loop, with each kernel boundary providing the grid-wide
   ordering that grid.sync() previously gave. These launch changes are
   confined to #if defined(USE_HIP); the CUDA build keeps upstream's
   cooperative launches and in-kernel synchronization unchanged.
5. The Linux platform support (mmap in MappedFile.h, O_DIRECT unbuffered IO
   in unsuck_platform_specific.cpp), which the upstream README listed as a
   TODO, and the Windows amdclang++ (MSVC ABI) build fixes.
6. A headless `--bench <file.glb> [w h frames]` mode that renders to
   bench_render.png without a window or Vulkan, used to validate on machines
   without a display.
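
The -ffast-math note under HipModularProgram.h can be made concrete with a small host-side sketch. The function names `clearDepth`, `depthTest`, `isUnwritten` and `isCleared` are illustrative and not taken from CuRast; the point is only that checks of this kind depend on IEEE Infinity and NaN semantics:

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Depth buffer cleared to +Infinity: any finite fragment depth passes the
// first depth test against it.
inline std::vector<float> clearDepth(std::size_t pixels) {
    return std::vector<float>(pixels, std::numeric_limits<float>::infinity());
}

// "Closer wins" depth test. Returns true if the fragment was written.
inline bool depthTest(std::vector<float>& depth, std::size_t pixel, float fragDepth) {
    if (fragDepth < depth[pixel]) {
        depth[pixel] = fragDepth;
        return true;
    }
    return false;
}

// A pixel is untouched while it still holds the Infinity sentinel.
inline bool isUnwritten(const std::vector<float>& depth, std::size_t pixel) {
    return std::isinf(depth[pixel]);
}

// NaN used as a "no value" marker; it can only be detected with isnan,
// because NaN compares unequal to everything, including itself.
inline bool isCleared(float value) {
    return std::isnan(value);
}
```

Under -ffinite-math-only the compiler is allowed to assume that no value is Infinity or NaN, so checks like `isUnwritten` and `isCleared` may be folded to `false`. That is the class of failure avoided by not passing -ffast-math on the HIP path.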
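
The per-level mipmap restructuring described for the HIP build can be modelled on the host. In this sketch each call to the hypothetical `downsampleLevel` stands in for one ordinary kernel launch; launches issued to the same stream execute in order, so a level is complete before the next one reads it. The 2x2 box filter and the single-channel float layout are assumptions for illustration, not CuRast's actual filter or texel format:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct MipLevel {
    int width;
    int height;
    std::vector<float> texels;  // row-major, one channel
};

// Stands in for one kernel launch: reads only `src`, writes only the result.
inline MipLevel downsampleLevel(const MipLevel& src) {
    MipLevel dst;
    dst.width = std::max(1, src.width / 2);
    dst.height = std::max(1, src.height / 2);
    dst.texels.resize(static_cast<std::size_t>(dst.width) * dst.height);
    for (int y = 0; y < dst.height; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            // Clamp so odd-sized or 1-texel-wide sources stay in bounds.
            int x0 = std::min(2 * x, src.width - 1);
            int x1 = std::min(2 * x + 1, src.width - 1);
            int y0 = std::min(2 * y, src.height - 1);
            int y1 = std::min(2 * y + 1, src.height - 1);
            float sum = src.texels[y0 * src.width + x0] + src.texels[y0 * src.width + x1]
                      + src.texels[y1 * src.width + x0] + src.texels[y1 * src.width + x1];
            dst.texels[y * dst.width + x] = 0.25f * sum;
        }
    }
    return dst;
}

// Host loop: one "launch" per level, each depending on the previous level.
inline std::vector<MipLevel> buildPyramid(MipLevel base) {
    std::vector<MipLevel> levels;
    levels.push_back(std::move(base));
    while (levels.back().width > 1 || levels.back().height > 1) {
        levels.push_back(downsampleLevel(levels.back()));
    }
    return levels;
}
```

The dependency between loop iterations is what a single cooperative kernel had to enforce with grid.sync(); splitting the work at level boundaries moves that ordering to the launch sequence.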

Two limitations of this build are documented in the README under Known
Issues:

- HIP-Vulkan texture interop is unavailable. The CUDA-to-Vulkan mipmapped
  array export relies on hipExternalMemoryGetMappedMipmappedArray, which the
  ROCm 7.2 HIP runtime does not export, so that path is stubbed. Core
  rasterization and the headless bench path work without it.
- Kernel launches dispatched from inside the module-launch helper
  intermittently memory-fault on gfx90a (ROCm 7.2.1), while the same
  dispatch sequence inlined at the call site is reliable. This appears to be
  a ROCm runtime issue; until it is resolved, the rasterization hot paths
  dispatch their launches inline as a workaround.

Test Plan:

Built and run on three AMD architectures. Each renders the bundled
example scene correctly to bench_render.png with no GPU faults.

Linux, AMD Instinct MI250X (gfx90a, CDNA2, wave64, ROCm 7.2.1) and
AMD Radeon Pro W7800 (gfx1100, RDNA3, wave32):

```
cmake -S . -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a
cmake --build build -j
./build/CuRast --bench ./example_donaukanal_urania.glb 1920 1080 30
```

Windows, AMD Radeon RX 9070 XT (gfx1201, RDNA4, wave32), amdclang++:

```
cmake -S . -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1201 `
  -DCMAKE_HIP_COMPILER=<rocm>/lib/llvm/bin/amdclang++.exe -G Ninja
cmake --build build -j
.\build\CuRast.exe --bench .\example_donaukanal_urania.glb 1920 1080 30
```

All three render 966,461 triangles per frame at 1920x1080 with the
visbuffer pipeline in well under a millisecond, and bench_render.png shows
the scene correctly.
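
For reference, the `--bench <file.glb> [w h frames]` invocations above could be parsed roughly as follows. This is a sketch, not the PR's parser: the `BenchOptions` struct, the `parseBenchArgs` function, the fallback values (simply the ones used in the commands above), and the rule that the three numeric arguments are only accepted together are all assumptions:

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

struct BenchOptions {
    std::string scenePath;
    int width = 1920;   // assumed default, not confirmed by the PR
    int height = 1080;  // assumed default
    int frames = 30;    // assumed default
};

// Fills `out` and returns true if `--bench <file>` is present in `args`.
// Returns false (leaving `out` untouched) if the flag or its path is missing.
inline bool parseBenchArgs(const std::vector<std::string>& args, BenchOptions& out) {
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (args[i] != "--bench") continue;
        if (i + 1 >= args.size()) return false;  // flag given without a scene path
        BenchOptions opts;
        opts.scenePath = args[i + 1];
        // Optional trailing group: width, height, frame count.
        if (i + 4 < args.size()) {
            opts.width = std::atoi(args[i + 2].c_str());
            opts.height = std::atoi(args[i + 3].c_str());
            opts.frames = std::atoi(args[i + 4].c_str());
        }
        out = opts;
        return true;
    }
    return false;
}
```

A caller would typically build `args` from `argv` and take the headless path only when the function returns true.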

This work was authored with the assistance of Claude (an AI assistant by
Anthropic).

jeffdaily added a commit to jeffdaily/moat that referenced this pull request Jun 11, 2026

CuRast: lead -> pr-open (m-schuetz/CuRast#2)

9fb8178

jeffdaily wants to merge 1 commit into m-schuetz:main from jeffdaily:moat-port
