Drop-in CUDA Graph → HIP Graph translation layer for AMD gfx1030/1031 (RDNA2).
Bridges all 4 CUDA Graph parity gaps on ROCm, enabling transparent graph capture/replay with eager fallback, VRAM safety, and validation.
| Component | Requirement |
|---|---|
| GPU | AMD Radeon RX 6700 XT / 6800 / 6900 (gfx1030/gfx1031, RDNA2) |
| ROCm | 7.2.0+ |
| PyTorch | 2.9+ (ROCm build) |
| Python | 3.10+ |
gfxGRAPH works in two tiers depending on which dependencies you install. Most users only need Tier 1 — it provides the full Python-level integration including the monkey-patch that makes CUDA graphs work transparently on RDNA2.
What you get:
- `torch.cuda.CUDAGraph` → `BridgedCUDAGraph` monkey-patch (transparent to callers)
- Eager fallback — capture/replay failures never crash, just run slower
- Shape bucketing — reduced graph captures for dynamic batch sizes (see the sketch below)
- VRAM safety cap — prevents graph capture OOM (`GFXGRAPH_VRAM_CAP`)
- Validation mode — catches silent HIP Graph correctness bugs (PyTorch #155684)
- Thread-safe stats: `gfxgraph.stats()` → capture/replay/fallback counts
- Health check: `gfxgraph.health_check()` → GPU info + smoke test
- Structured logging: `HGB_LOG_LEVEL=debug|info|warn|error`
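To make the shape-bucketing idea concrete, here is a minimal sketch of the general technique; the bucket sizes and helper name below are illustrative, not part of the gfxgraph API. Dynamic batch sizes are rounded up to a small, fixed set of bucket sizes, so one captured graph per bucket covers many real shapes.

```python
# Sketch of shape bucketing (illustrative only; bucket sizes and the helper
# are not gfxgraph APIs). Round dynamic batch sizes up to a fixed bucket so
# one captured graph per bucket replaces one graph per exact batch size.
import bisect

BUCKETS = [1, 2, 4, 8, 16, 32]  # hypothetical bucket sizes

def bucket_for(batch_size: int) -> int:
    """Smallest bucket that fits the batch (largest bucket as a fallback)."""
    idx = bisect.bisect_left(BUCKETS, batch_size)
    return BUCKETS[min(idx, len(BUCKETS) - 1)]

# Batches of 3 and 4 both replay the bucket-4 graph; 9..16 the bucket-16 graph.
# Inputs are padded up to the bucket size before replay.
assert bucket_for(3) == 4 and bucket_for(9) == 16
```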
Dependencies:
```bash
# That's it — just PyTorch (ROCm build) and Python
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2  # or your ROCm version
```

Install gfxGRAPH:

```bash
# From source (editable)
pip install -e /path/to/gfxGRAPH/python/

# Or standard install
pip install /path/to/gfxGRAPH/python/
```

Verify:

```bash
python3 -c "import gfxgraph; print(gfxgraph.__version__); print(gfxgraph.health_check())"
```

You'll see `native_bridge: False` — that's expected and fine. All Python-level features work without the native library.
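If you want to check programmatically which tier you are running, the fields returned by `gfxgraph.health_check()` (shown in full in the monitoring examples further down) can be inspected directly. A minimal sketch; the tier mapping below simply mirrors the Tier 1 / Tier 2 distinction in this README and is not itself a gfxgraph API:

```python
# Illustrative tier check using the documented health_check() fields
# ('ok', 'gpu', 'native_bridge'); the tier mapping is this README's
# Tier 1 / Tier 2 convention, not a gfxgraph API.
import gfxgraph

info = gfxgraph.health_check()
tier = 2 if info.get("native_bridge") else 1
print(f"gfxGRAPH tier {tier}: ok={info.get('ok')}, gpu={info.get('gpu')}")
```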
What you get additionally:
- C-level HIP Graph gap bridges (conditional nodes, device-side launch, nested capture)
- `libhipgraph_bridge.so` — loaded automatically when present
- Full 54/54 CUDA Graph parity matrix (vs 52/54 Python-only)
System dependencies (Ubuntu/Debian):
```bash
# ROCm SDK — the big one. Follow AMD's official guide:
# https://rocm.docs.amd.com/projects/install-on-linux/en/latest/
#
# Key packages needed:
sudo apt-get install -y \
    rocm-dev \
    hip-dev \
    hipcc \
    rocm-cmake

# Build tools
sudo apt-get install -y cmake ninja-build
```
⚠️ ROCm SDK installation is non-trivial. It requires kernel-level drivers, specific package repositories, and careful version matching. Plan for 30-60 min on a fresh system. If you're running PyTorch ROCm builds, you likely already have `libamdhip64.so` — but you still need the `hip-dev` headers and `hipcc` to compile the bridge.
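Before running CMake, a quick pre-flight check can confirm that the compiler and headers are in place. This is a convenience sketch only; it assumes the default `/opt/rocm` layout, and the exact header path is an assumption:

```python
# Convenience pre-flight check for the native-bridge build (not part of
# gfxGRAPH). Assumes the default /opt/rocm install; the hip_runtime.h path
# is an assumption about where hip-dev places its headers.
import os
import shutil

checks = {
    "hipcc compiler": shutil.which("hipcc") is not None or os.path.exists("/opt/rocm/bin/hipcc"),
    "HIP headers (hip-dev)": os.path.exists("/opt/rocm/include/hip/hip_runtime.h"),
    "HIP runtime (libamdhip64.so)": os.path.exists("/opt/rocm/lib/libamdhip64.so"),
    "cmake": shutil.which("cmake") is not None,
    "ninja": shutil.which("ninja") is not None,
}
for name, present in checks.items():
    print(f"{'OK     ' if present else 'MISSING'} {name}")
```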
Build the native bridge:
```bash
cd /path/to/gfxGRAPH

cmake -B build -GNinja \
    -DCMAKE_HIP_COMPILER=/opt/rocm/bin/hipcc \
    -DCMAKE_PREFIX_PATH=/opt/rocm \
    -DCMAKE_HIP_ARCHITECTURES=gfx1030

cmake --build build -j$(nproc)

# Run tests
ctest --test-dir build --output-on-failure
```

The built `libhipgraph_bridge.so` will be in `build/`. gfxGRAPH auto-discovers it via the build directory, `LD_LIBRARY_PATH`, or you can set `GFXGRAPH_LIB=/path/to/libhipgraph_bridge.so`.
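The discovery order can be pictured roughly as follows; this is an illustrative sketch of the search logic described above, not gfxGRAPH's actual loader code:

```python
# Illustrative sketch of the discovery order described above: explicit
# GFXGRAPH_LIB first, then the build directory, then the dynamic linker
# (which honors LD_LIBRARY_PATH). Not gfxGRAPH's actual loader code.
import ctypes
import os

def load_bridge(build_dir: str = "build"):
    candidates = []
    explicit = os.environ.get("GFXGRAPH_LIB")
    if explicit:
        candidates.append(explicit)                                      # explicit override
    candidates.append(os.path.join(build_dir, "libhipgraph_bridge.so"))  # build directory
    candidates.append("libhipgraph_bridge.so")                           # LD_LIBRARY_PATH lookup
    for path in candidates:
        try:
            return ctypes.CDLL(path)
        except OSError:
            continue
    return None  # no native bridge: fall back to pure-Python Tier 1
```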
Verify native bridge loaded:
```bash
python3 -c "import gfxgraph; print(gfxgraph.health_check())"
# Should show: native_bridge: True
```

Usage:

```python
import torch
import gfxgraph
gfxgraph.enable() # patches torch.cuda.CUDAGraph globally
# Your existing CUDA graph code works unchanged:
graph = torch.cuda.CUDAGraph() # actually BridgedCUDAGraph
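# --- Illustrative only: a minimal capture/replay round trip with the stock
# --- torch.cuda.graph API (the toy model and shapes are arbitrary):
model = torch.nn.Linear(128, 128).cuda().eval()
static_in = torch.randn(8, 128, device="cuda")
model(static_in)                                   # warm-up pass before capture
torch.cuda.synchronize()
with torch.cuda.graph(graph):                      # wraps capture_begin / capture_end
    static_out = model(static_in)
static_in.copy_(torch.randn(8, 128, device="cuda"))
graph.replay()                                     # re-runs the captured kernels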
# ... capture_begin / capture_end / replay all delegate correctly
```

gfxGRAPH integrates transparently with SGLang's CUDA graph runner. Set these environment variables before launching:
```bash
# Required: enable RDNA2 kernel paths (activates gfxGRAPH)
export SGLANG_RDNA2_KERNELS=1

# Required for gfx1031 (RX 6700 XT)
export HSA_OVERRIDE_GFX_VERSION=10.3.0
export PYTORCH_ROCM_ARCH=gfx1030

# Optional: validation mode (catches silent graph correctness bugs)
export GFXGRAPH=validate

# Optional: debug logging
export GFXGRAPH=debug

# Optional: VRAM cap for graph capture scratch (default 0.90 = 90% of total)
export GFXGRAPH_VRAM_CAP=0.90

# Optional: disable gfxGRAPH while keeping RDNA2 kernels
export SGLANG_DISABLE_GFXGRAPH=1

# Launch SGLang
python3 -m sglang.launch_server --model-path <model> ...
```

SGLang logs gfxGRAPH status at startup:

```
INFO: gfxGRAPH v0.3.0 enabled (mode=normal, vram_cap=0.90)
INFO: gfxGRAPH health check passed: AMD Radeon RX 6700 XT (gfx1030), VRAM 10240MB free / 12288MB total
```
```bash
GFXGRAPH=1 python3 my_script.py           # standard mode
GFXGRAPH=debug python3 my_script.py       # verbose logging
GFXGRAPH=validate python3 my_script.py    # correctness checking
```

```
┌──────────────────────────────────────────────────────┐
│                   User Application                   │
├──────────────┬───────────────────┬───────────────────┤
│   PyTorch    │   Direct HIP C    │  Unmodified CUDA  │
├──────────────┼───────────────────┼───────────────────┤
│   Layer 2    │                   │      Layer 3      │
│  hipgraph_   │                   │   libcudagraph_   │
│   bridge/    │                   │     compat.so     │
│   (Python)   │                   │   (LD_PRELOAD)    │
├──────────────┴───────────────────┴───────────────────┤
│            Layer 1: libhipgraph_bridge.so            │
│      Gap bridges · Routing logic · Kernel pool       │
├──────────────────────────────────────────────────────┤
│         libamdhip64.so (ROCm · 104 symbols)          │
├──────────────────────────────────────────────────────┤
│               gfx1030 · RDNA2 Hardware               │
└──────────────────────────────────────────────────────┘
```
| # | Gap | Bridge Strategy | Perf | Tier |
|---|---|---|---|---|
| 51 | Conditional nodes | Per-branch dispatch (Python; sketch below) / `hipGraphNodeSetEnabled` (native) | ~90% | 1/2 |
| 52 | Device-side launch | `hipGraphUpload` + rapid host pipeline | ~95% | 2 |
| 53 | Dynamic input shapes | Shape bucketing + param update | ~90-95% | 1 |
| 54 | Nested capture | Sequential capture + child graph nodes | ~95% | 2 |
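To make the gap 51 strategy concrete, the Python-side per-branch dispatch can be sketched as capturing one graph per branch and selecting on the host which graph to replay. This is an illustrative sketch: the helper names are not gfxGRAPH APIs, and it assumes `gfxgraph.enable()` has already been called.

```python
# Illustrative per-branch dispatch for conditional nodes (gap 51): one captured
# graph per branch, with host-side selection at replay time. Helper names are
# not gfxGRAPH APIs; gfxgraph.enable() is assumed to have been called.
import torch

static_in = torch.randn(8, 128, device="cuda")
branch_if_true = torch.nn.Linear(128, 128).cuda().eval()   # toy branch bodies
branch_if_false = torch.nn.Linear(128, 64).cuda().eval()

def capture_branch(fn, inp):
    """Capture one graph for a single branch body."""
    g = torch.cuda.CUDAGraph()          # BridgedCUDAGraph once gfxgraph.enable() ran
    fn(inp)                             # warm-up pass
    torch.cuda.synchronize()
    with torch.cuda.graph(g):
        out = fn(inp)
    return g, out

g_true, out_true = capture_branch(branch_if_true, static_in)
g_false, out_false = capture_branch(branch_if_false, static_in)

def run_conditional(cond: bool) -> torch.Tensor:
    # Host-side branch selection stands in for the missing conditional node.
    (g_true if cond else g_false).replay()
    return out_true if cond else out_false
```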
| Tier | Stack | Capabilities |
|---|---|---|
| 0 | `torch.compile` only | 31/54 |
| 1 | HIP Graph + gfxGRAPH (Python-only) | 52/54 |
| 2 | HIP Graph + gfxGRAPH (full native) | 54/54 |
```python
import gfxgraph

# Performance counters
gfxgraph.stats()
# → {'enabled_at': 1712..., 'capture_count': 32, 'replay_count': 1847,
#    'fallback_count': 0, 'validation_failures': 0, 'avg_replay_us': 42.3}

# Health check
gfxgraph.health_check()
# → {'ok': True, 'gpu': 'AMD Radeon RX 6700 XT', 'rocm': 'gfx1030',
#    'native_bridge': False, 'vram_total_mb': 12288, 'vram_free_mb': 10240,
#    'details': 'Graph capture/replay OK, output verified'}

# Status
gfxgraph.is_enabled()  # → True
```

`native_bridge: False` is expected in Tier 1: gfxGRAPH runs in pure-Python mode and all key features work. Build `libhipgraph_bridge.so` (see Tier 2 above) only if you need the 2 extra native-only gaps.
- Verify ROCm is working: `rocminfo | grep gfx`
- Check HSA override: `echo $HSA_OVERRIDE_GFX_VERSION` (should be `10.3.0` for gfx1031)
- Test PyTorch: `python3 -c "import torch; print(torch.cuda.is_available())"`
- Check for PyTorch #155684 (HIP Graph correctness bug) — use `GFXGRAPH=validate`
- Set `AMD_SERIALIZE_KERNEL=3` and `AMD_SERIALIZE_COPY=3` (SGLang sets these automatically)
- Reduce `GFXGRAPH_VRAM_CAP` if running near VRAM limits
- Try `SGLANG_DISABLE_GFXGRAPH=1` to isolate whether gfxGRAPH is the issue
- Some graph shapes may genuinely fail on HIP — eager fallback is intentional
- Check `HGB_LOG_LEVEL=debug` for detailed failure reasons
- If all captures fail, the underlying HIP Graph support may be broken (the snippet below shows one way to check)
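One way to check, using the documented `gfxgraph.stats()` counters (a convenience sketch; it assumes fallbacks are counted separately from successful captures):

```python
# Convenience check built on the documented stats() fields; assumes
# fallback_count and capture_count are disjoint (fallbacks vs. successes).
import gfxgraph

s = gfxgraph.stats()
attempts = s["capture_count"] + s["fallback_count"]
if attempts and s["fallback_count"] == attempts:
    print("Every capture fell back to eager; HIP Graph support may be broken.")
elif s["fallback_count"]:
    print(f"{s['fallback_count']}/{attempts} captures fell back (some fallbacks are expected).")
else:
    print("No fallbacks recorded.")
```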
| Config | Decode (tok/s) | Prefill (tok/s) | VRAM |
|---|---|---|---|
| GemLite AWQ + gfxGRAPH | 36.82 | 644.06 | 5.58 GB |
| GemLite AWQ, no graphs | 23.31 | 640.84 | 5.58 GB |
| Improvement | +58% | +0.5% | — |
CUDA graphs primarily accelerate decode (kernel launch overhead dominates at bs=1).
MIT