Add AMD GPU support via HIP (ROCm) #2

Open
jeffdaily wants to merge 1 commit into m-schuetz:main from jeffdaily:moat-port
@jeffdaily

This adds support for building and running CuRast on AMD GPUs with
ROCm/HIP, alongside the existing CUDA build. A compatibility header maps
the CUDA spellings used in the kernels and host code to their HIP
equivalents, and the CUDA driver API plus the nvrtc/nvJitLink runtime
compilation path are mapped to the HIP driver API and hiprtc. Build with
the USE_HIP CMake option; select the target with CMAKE_HIP_ARCHITECTURES.

We have made every effort to leave the CUDA build unchanged: every HIP
change is behind a USE_HIP / __HIP_PLATFORM_AMD__ guard that the CUDA build
does not compile, the .cu kernels keep their CUDA spellings, and the compat
header is a no-op include of the CUDA headers on the NVIDIA path.
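
As a rough illustration of why a byte-offset helper is needed once the device pointer type becomes a `void*` (see item 1 of the review order below), here is a minimal host-only sketch. `devptr_t` and `devptr_add` are hypothetical stand-ins: they are not the PR's `hipDeviceptr_t` alias or its `HIP_DEVPTR_ADD` macro, whose actual definition is not reproduced here. The sketch assumes only that the pointer type is a plain `void*`, as the description states for ROCm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for a device pointer type that is a plain void* (as on ROCm).
// Hypothetical name; a real build would take the type from the HIP headers.
using devptr_t = void*;

// Hypothetical helper: advance a void* by a byte count.
// `base + bytes` on a void* is ill-formed in standard C++ (some compilers
// accept it only as an extension), so the arithmetic goes through a
// byte-typed pointer and is converted back afterwards.
inline devptr_t devptr_add(devptr_t base, std::size_t bytes) {
    return static_cast<void*>(static_cast<unsigned char*>(base) + bytes);
}
```

For example, calling `devptr_add(base, 3 * sizeof(std::uint32_t))` on the start of a `std::uint32_t` array yields the address of element 3. On the CUDA side the driver pointer type is an integer, so plain `+` works there; a helper of this kind lets call sites be written once for both builds.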

Suggested review order:

1. cuda_to_hip.h -- the compatibility header. Aliases the CUDA runtime,
   driver, virtual-memory and external-memory APIs to HIP, and provides the
   HIP_DEVPTR_ADD helper for byte-offset arithmetic on hipDeviceptr_t, which
   is a void* on ROCm; amdclang++ in strict C++ mode rejects arithmetic on
   void*.

2. HipModularProgram.h -- replaces CudaModularProgram for runtime
   compilation. HIP compiles directly to code objects via hiprtc; there is
   no LTO-IR intermediate or separate nvJitLink step, so multiple kernel
   sources are combined into one compilation unit. hiprtc has no equivalent
   of nvrtc's -default-device flag, so unannotated device helpers are
   wrapped in RTC-guarded `#pragma clang attribute push((device))` regions.
   Note that -ffast-math is NOT passed: clang's -ffast-math implies
   -ffinite-math-only, which lets the optimizer assume Inf and NaN never
   occur and so breaks this renderer's Infinity depth sentinels and NaN
   clear/compare values; nvcc --use_fast_math makes no finite-math
   assumption, so the CUDA build is unaffected.

3. CMakeLists.txt, cmake/common.cmake -- the USE_HIP option,
   enable_language(HIP), HIP source-file language assignment, and HIP library
   linking.

4. The kernel sources and CuRast_render.h -- HIP cooperative-groups and warp
   handling. The pipeline uses width-32 logical-warp operations
   (tiled_partition<32>, ballot, shfl), which are correct on both wave64
   (CDNA) and wave32 (RDNA) without an architecture-specific code path.
   On the HIP build, stage1 and stage3 use occupancy-sized non-cooperative
   launches: their per-block counter initialization was moved to host-side
   memset so the grid-wide sync they previously needed is unnecessary, which
   also avoids a cooperative-launch failure observed on RDNA4. Texture mipmap
   generation likewise builds the pyramid one level at a time with ordinary
   launches in a host loop; each kernel boundary provides the grid-wide
   ordering that grid.sync() previously gave. These launch changes are confined to
   #if defined(USE_HIP); the CUDA build keeps upstream's cooperative launches
   and in-kernel synchronization unchanged.

5. The Linux platform support (mmap in MappedFile.h, O_DIRECT unbuffered IO
   in unsuck_platform_specific.cpp), which the upstream README listed as a
   TODO, and the Windows amdclang++ (MSVC ABI) build fixes.

6. A headless `--bench <file.glb> [w h frames]` mode that renders to
   bench_render.png without a window or Vulkan, used to validate on machines
   without a display.
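
The mipmap change in item 4 can be pictured with a host-only simulation. Each call to `downsample_level` below stands in for one ordinary kernel launch. Launches issued to the same stream execute in order, so a level is fully written before the next level reads it, which is the ordering `grid.sync()` otherwise provides inside a single cooperative kernel. The function names, the 2x2 box filter, and the single-channel float image are all assumptions for illustration, not the PR's kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Level {
    int width = 0;
    int height = 0;
    std::vector<float> texels;  // row-major, one channel
};

// Stand-in for one kernel launch: produces the next level from a source
// level that is already complete. Assumes a 2x2 box filter and clamps reads
// at the edge so odd sizes stay in bounds.
inline Level downsample_level(const Level& src) {
    Level dst;
    dst.width = std::max(1, src.width / 2);
    dst.height = std::max(1, src.height / 2);
    dst.texels.resize(static_cast<std::size_t>(dst.width) * dst.height);
    for (int y = 0; y < dst.height; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            float sum = 0.0f;
            for (int dy = 0; dy < 2; ++dy) {
                for (int dx = 0; dx < 2; ++dx) {
                    const int sx = std::min(2 * x + dx, src.width - 1);
                    const int sy = std::min(2 * y + dy, src.height - 1);
                    sum += src.texels[static_cast<std::size_t>(sy) * src.width + sx];
                }
            }
            dst.texels[static_cast<std::size_t>(y) * dst.width + x] = sum * 0.25f;
        }
    }
    return dst;
}

// Host loop: one "launch" per level. The return of each call plays the role
// of the kernel boundary, so no in-kernel grid-wide barrier is needed.
inline std::vector<Level> build_pyramid(Level base) {
    std::vector<Level> pyramid;
    pyramid.push_back(std::move(base));
    while (pyramid.back().width > 1 || pyramid.back().height > 1) {
        pyramid.push_back(downsample_level(pyramid.back()));
    }
    return pyramid;
}
```

The trade-off in this shape is one launch per level instead of a single launch, in exchange for not depending on cooperative-launch support.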

Two limitations of this build are documented in the README under Known
Issues:

- HIP-Vulkan texture interop is unavailable. The CUDA-to-Vulkan mipmapped
  array export relies on hipExternalMemoryGetMappedMipmappedArray, which the
  ROCm 7.2 HIP runtime does not export, so that path is stubbed. Core
  rasterization and the headless bench path work without it.

- Kernel launches dispatched from inside the module-launch helper
  intermittently memory-fault on gfx90a (ROCm 7.2.1), while the same
  dispatch sequence inlined at the call site is reliable. This appears to be
  a ROCm runtime issue; until it is resolved, the rasterization hot paths
  dispatch their launches inline as a workaround.

Test Plan:

Built and run on three AMD architectures. Each renders the bundled
example scene correctly to bench_render.png with no GPU faults.

Linux, AMD Instinct MI250X (gfx90a, CDNA2, wave64, ROCm 7.2.1) and
AMD Radeon Pro W7800 (gfx1100, RDNA3, wave32). The commands below target
gfx90a; for the W7800, set CMAKE_HIP_ARCHITECTURES=gfx1100:

```
cmake -S . -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a
cmake --build build -j
./build/CuRast --bench ./example_donaukanal_urania.glb 1920 1080 30
```

Windows, AMD Radeon RX 9070 XT (gfx1201, RDNA4, wave32), amdclang++:

```
cmake -S . -B build -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1201 `
  -DCMAKE_HIP_COMPILER=<rocm>/lib/llvm/bin/amdclang++.exe -G Ninja
cmake --build build -j
.\build\CuRast.exe --bench .\example_donaukanal_urania.glb 1920 1080 30
```

All three render 966,461 triangles per frame at 1920x1080 with the
visbuffer pipeline in well under a millisecond, and bench_render.png shows
the scene correctly.
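
Since the three test targets cover both wave64 and wave32 hardware, it may help to picture why width-32 logical-warp code (item 4 of the review order) needs no per-architecture path. The host-only simulation below models a wavefront's votes as a 64-bit mask and shows each 32-lane tile seeing only its own 32 bits. `tile_ballot` and `tile_prefix_count` are hypothetical helpers for illustration; they are not HIP APIs and do not describe how the runtime implements `tiled_partition<32>`.

```cpp
#include <cassert>
#include <cstdint>

// Ballot as seen by one 32-lane tile carved out of a hardware wavefront.
// Bit i of `wave_mask` is set when hardware lane i voted true; `lane` is any
// hardware lane index that belongs to the tile of interest.
inline std::uint32_t tile_ballot(std::uint64_t wave_mask, unsigned lane) {
    const unsigned tile_index = lane / 32;  // always 0 on wave32; 0 or 1 on wave64
    return static_cast<std::uint32_t>(wave_mask >> (32 * tile_index));
}

// Number of true votes among lower-ranked lanes of the same tile, a common
// building block for compaction. The rank is at most 31, so the shift is
// always well defined.
inline unsigned tile_prefix_count(std::uint64_t wave_mask, unsigned lane) {
    const unsigned rank = lane % 32;
    std::uint32_t lower = tile_ballot(wave_mask, lane) & ((1u << rank) - 1u);
    unsigned count = 0;
    while (lower != 0) {
        count += lower & 1u;
        lower >>= 1;
    }
    return count;
}
```

In this model, code that only ever reasons about a 32-bit tile mask and a rank in 0..31 produces the same per-tile results whether the hardware wavefront holds one tile or two.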

This work was authored with the assistance of Claude (an AI assistant by
Anthropic).
jeffdaily added a commit to jeffdaily/moat that referenced this pull request Jun 11, 2026