gfxGRAPH logo

gfxGRAPH v0.3.0

Drop-in CUDA Graph → HIP Graph translation layer for AMD gfx1030/1031 (RDNA2).

Bridges all 4 CUDA Graph parity gaps on ROCm, enabling transparent graph capture/replay with eager fallback, VRAM safety, and validation.

Target Hardware

| Component | Requirement |
| --- | --- |
| GPU | AMD Radeon RX 6700 XT / 6800 / 6900 (gfx1030, RDNA2) |
| ROCm | 7.2.0+ |
| PyTorch | 2.9+ (ROCm build) |
| Python | 3.10+ |

Two Operating Tiers

gfxGRAPH works in two tiers depending on which dependencies you install. Most users only need Tier 1 — it provides the full Python-level integration including the monkey-patch that makes CUDA graphs work transparently on RDNA2.

Tier 1: Python-Only Mode (recommended starting point)

What you get:

  • torch.cuda.CUDAGraph → BridgedCUDAGraph monkey-patch (transparent to callers)
  • Eager fallback — capture/replay failures never crash, just run slower
  • Shape bucketing — reduced graph captures for dynamic batch sizes
  • VRAM safety cap — prevents graph capture OOM (GFXGRAPH_VRAM_CAP)
  • Validation mode — catches silent HIP Graph correctness bugs (PyTorch #155684)
  • Thread-safe stats: gfxgraph.stats() → capture/replay/fallback counts
  • Health check: gfxgraph.health_check() → GPU info + smoke test
  • Structured logging: HGB_LOG_LEVEL=debug|info|warn|error
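To illustrate the eager-fallback idea from the list above, here is a minimal, hypothetical sketch — not gfxGRAPH's actual implementation — of a graph wrapper that swallows capture failures so callers can fall back to eager execution:

import torch

class FallbackGraph:
    """Hypothetical sketch: wrap torch.cuda.CUDAGraph so a failed HIP capture
    degrades to eager execution instead of raising to the caller."""

    def __init__(self):
        self._graph = torch.cuda.CUDAGraph()
        self.fell_back = False  # set once capture fails; replay becomes a no-op

    def capture_begin(self, *args, **kwargs):
        try:
            self._graph.capture_begin(*args, **kwargs)
        except RuntimeError:
            # Capture could not start; the work between begin/end still runs
            # eagerly on the current stream, just without graph speedups.
            self.fell_back = True

    def capture_end(self):
        if not self.fell_back:
            self._graph.capture_end()

    def replay(self):
        if self.fell_back:
            return  # caller re-executes its eager path instead
        self._graph.replay()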

Dependencies:

# That's it — just PyTorch (ROCm build) and Python
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2  # or your ROCm version

Install gfxGRAPH:

# From source (editable)
pip install -e /path/to/gfxGRAPH/python/

# Or standard install
pip install /path/to/gfxGRAPH/python/

Verify:

python3 -c "import gfxgraph; print(gfxgraph.__version__); print(gfxgraph.health_check())"

You'll see native_bridge: False — that's expected and fine. All Python-level features work without the native library.

Tier 2: Full Native Mode (advanced — requires ROCm SDK)

What you get additionally:

  • C-level HIP Graph gap bridges (conditional nodes, device-side launch, nested capture)
  • libhipgraph_bridge.so — loaded automatically when present
  • Full 54/54 CUDA Graph parity matrix (vs 52/54 Python-only)

System dependencies (Ubuntu/Debian):

# ROCm SDK — the big one. Follow AMD's official guide:
# https://rocm.docs.amd.com/projects/install-on-linux/en/latest/
#
# Key packages needed:
sudo apt-get install -y \
    rocm-dev \
    hip-dev \
    hipcc \
    rocm-cmake

# Build tools
sudo apt-get install -y cmake ninja-build

⚠️ ROCm SDK installation is non-trivial. It requires kernel-level drivers, specific package repositories, and careful version matching. Plan for 30-60 min on a fresh system. If you're running PyTorch ROCm builds, you likely already have libamdhip64.so — but you still need hip-dev headers and hipcc for compiling the bridge.
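As a quick sanity check before investing in the full SDK (these are generic commands, not gfxGRAPH tooling), you can confirm whether a HIP runtime is already present on your system:

# Does the installed PyTorch build report a HIP version? (prints None on CUDA builds)
python3 -c "import torch; print(torch.version.hip)"

# Is libamdhip64.so visible to the dynamic linker?
ldconfig -p | grep libamdhip64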

Build the native bridge:

cd /path/to/gfxGRAPH

cmake -B build -GNinja \
    -DCMAKE_HIP_COMPILER=/opt/rocm/bin/hipcc \
    -DCMAKE_PREFIX_PATH=/opt/rocm \
    -DCMAKE_HIP_ARCHITECTURES=gfx1030

cmake --build build -j$(nproc)

# Run tests
ctest --test-dir build --output-on-failure

The built libhipgraph_bridge.so will be in build/. gfxGRAPH auto-discovers it via the build directory or LD_LIBRARY_PATH; alternatively, set GFXGRAPH_LIB=/path/to/libhipgraph_bridge.so to point at it explicitly.
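For example, if auto-discovery does not pick the library up (the path below is illustrative), export the override before importing gfxgraph and then run the verification step that follows:

export GFXGRAPH_LIB=/path/to/gfxGRAPH/build/libhipgraph_bridge.so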

Verify native bridge loaded:

python3 -c "import gfxgraph; print(gfxgraph.health_check())"
# Should show: native_bridge: True

Usage

Standalone (any PyTorch code)

import torch
import gfxgraph
gfxgraph.enable()  # patches torch.cuda.CUDAGraph globally

# Your existing CUDA graph code works unchanged:
graph = torch.cuda.CUDAGraph()  # actually BridgedCUDAGraph
# ... capture_begin / capture_end / replay all delegate correctly
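For a complete, minimal end-to-end example, the standard PyTorch graph-capture pattern below should work unchanged once gfxgraph.enable() has run; the model, shapes, and warm-up count are arbitrary placeholders, not gfxGRAPH requirements:

import torch
import gfxgraph

gfxgraph.enable()

model = torch.nn.Linear(128, 128).cuda()
static_in = torch.randn(8, 128, device="cuda")

# Warm up on a side stream before capture (standard CUDA Graphs practice).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()  # a BridgedCUDAGraph after enable()
with torch.cuda.graph(g):
    static_out = model(static_in)

# Replay with new data by copying into the static input buffer.
static_in.copy_(torch.randn_like(static_in))
g.replay()
print(static_out.shape)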

With SGLang

gfxGRAPH integrates transparently with SGLang's CUDA graph runner. Set these environment variables before launching:

# Required: enable RDNA2 kernel paths (activates gfxGRAPH)
export SGLANG_RDNA2_KERNELS=1

# Required for gfx1031 (RX 6700 XT)
export HSA_OVERRIDE_GFX_VERSION=10.3.0
export PYTORCH_ROCM_ARCH=gfx1030

# Optional: validation mode (catches silent graph correctness bugs)
export GFXGRAPH=validate

# Optional: debug logging
export GFXGRAPH=debug

# Optional: VRAM cap for graph capture scratch (default 0.90 = 90% of total)
export GFXGRAPH_VRAM_CAP=0.90

# Optional: disable gfxGRAPH while keeping RDNA2 kernels
export SGLANG_DISABLE_GFXGRAPH=1

# Launch SGLang
python3 -m sglang.launch_server --model-path <model> ...

SGLang logs gfxGRAPH status at startup:

INFO: gfxGRAPH v0.3.0 enabled (mode=normal, vram_cap=0.90)
INFO: gfxGRAPH health check passed: AMD Radeon RX 6700 XT (gfx1030), VRAM 10240MB free / 12288MB total

Via Environment Variable (auto-enables on import)

GFXGRAPH=1 python3 my_script.py        # standard mode
GFXGRAPH=debug python3 my_script.py    # verbose logging
GFXGRAPH=validate python3 my_script.py # correctness checking

Architecture

┌──────────────────────────────────────────────────────┐
│                   User Application                    │
├──────────────┬───────────────────┬───────────────────┤
│   PyTorch    │   Direct HIP C   │  Unmodified CUDA  │
├──────────────┼───────────────────┼───────────────────┤
│  Layer 2     │                   │  Layer 3          │
│  hipgraph_   │                   │  libcudagraph_    │
│  bridge/     │                   │  compat.so        │
│  (Python)    │                   │  (LD_PRELOAD)     │
├──────────────┴───────────────────┴───────────────────┤
│            Layer 1: libhipgraph_bridge.so             │
│     Gap bridges · Routing logic · Kernel pool         │
├──────────────────────────────────────────────────────┤
│         libamdhip64.so  (ROCm · 104 symbols)          │
├──────────────────────────────────────────────────────┤
│              gfx1030 · RDNA2 Hardware                 │
└──────────────────────────────────────────────────────┘

Gaps Bridged

| # | Gap | Bridge Strategy | Perf | Tier |
| --- | --- | --- | --- | --- |
| 51 | Conditional nodes | Per-branch dispatch (Python) / hipGraphNodeSetEnabled (native) | ~90% | 1/2 |
| 52 | Device-side launch | hipGraphUpload + rapid host pipeline | ~95% | 2 |
| 53 | Dynamic input shapes | Shape bucketing + param update | ~90-95% | 1 |
| 54 | Nested capture | Sequential capture + child graph nodes | ~95% | 2 |
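Gap 53's shape bucketing can be pictured with a small sketch (a hypothetical helper, not the library's API): batch sizes are rounded up to a fixed set of bucket sizes so only one graph per bucket needs capturing, and inputs are padded to the bucket at replay time.

# Hypothetical illustration of shape bucketing; bucket sizes are arbitrary.
BUCKETS = (1, 2, 4, 8, 16, 32)

def bucket_for(batch_size: int) -> int:
    """Smallest bucket that can hold the batch; inputs get padded to it."""
    for b in BUCKETS:
        if batch_size <= b:
            return b
    raise ValueError(f"batch size {batch_size} exceeds largest bucket {BUCKETS[-1]}")

# One captured graph per bucket instead of one per distinct batch size:
# graphs = {b: capture_graph(padded_batch=b) for b in BUCKETS}
assert bucket_for(5) == 8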

Routing Strategy

| Tier | Stack | Capabilities |
| --- | --- | --- |
| 0 | torch.compile only | 31/54 |
| 1 | HIP Graph + gfxGRAPH (Python-only) | 52/54 |
| 2 | HIP Graph + gfxGRAPH (full native) | 54/54 |

Observability

import gfxgraph

# Performance counters
gfxgraph.stats()
# → {'enabled_at': 1712..., 'capture_count': 32, 'replay_count': 1847,
#     'fallback_count': 0, 'validation_failures': 0, 'avg_replay_us': 42.3}

# Health check
gfxgraph.health_check()
# → {'ok': True, 'gpu': 'AMD Radeon RX 6700 XT', 'rocm': 'gfx1030',
#     'native_bridge': False, 'vram_total_mb': 12288, 'vram_free_mb': 10240,
#     'details': 'Graph capture/replay OK, output verified'}

# Status
gfxgraph.is_enabled()  # → True
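For example, a serving process could poll these counters in a background thread to spot a rising fallback count; the field names come from the stats() output above, and the polling loop itself is just a sketch:

import logging
import threading
import time

import gfxgraph

def watch_gfxgraph(interval_s: float = 60.0) -> None:
    """Log capture/replay/fallback counters every interval_s seconds."""
    while True:
        s = gfxgraph.stats()
        logging.info("gfxGRAPH captures=%d replays=%d fallbacks=%d",
                     s["capture_count"], s["replay_count"], s["fallback_count"])
        time.sleep(interval_s)

threading.Thread(target=watch_gfxgraph, daemon=True).start()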

Troubleshooting

"Native bridge not available" message at startup

Expected in Tier 1. gfxGRAPH runs in pure-Python mode — all key features work. Build libhipgraph_bridge.so (see Tier 2 above) only if you need the 2 extra native-only gaps.

Health check returns ok: False

  • Verify ROCm is working: rocminfo | grep gfx
  • Check HSA override: echo $HSA_OVERRIDE_GFX_VERSION (should be 10.3.0 for gfx1031)
  • Test PyTorch: python3 -c "import torch; print(torch.cuda.is_available())"
  • Check for PyTorch #155684 (HIP Graph correctness bug) — use GFXGRAPH=validate

CUDA graphs fail during SGLang model loading

  • Set AMD_SERIALIZE_KERNEL=3 and AMD_SERIALIZE_COPY=3 (SGLang sets these automatically)
  • Reduce GFXGRAPH_VRAM_CAP if running near VRAM limits
  • Try SGLANG_DISABLE_GFXGRAPH=1 to isolate whether gfxGRAPH is the issue

Fallback count keeps increasing

  • Some graph shapes may genuinely fail on HIP — eager fallback is intentional
  • Check HGB_LOG_LEVEL=debug for detailed failure reasons
  • If all captures fail, the underlying HIP Graph support may be broken

Performance (SGLang + GemLite AWQ 7B, bs=1, gfx1030)

| Config | Decode t/s | Prefill t/s | VRAM |
| --- | --- | --- | --- |
| GemLite AWQ + gfxGRAPH | 36.82 | 644.06 | 5.58 GB |
| GemLite AWQ, no graphs | 23.31 | 640.84 | 5.58 GB |
| Improvement | +58% | +0.5% | |

CUDA graphs primarily accelerate decode (kernel launch overhead dominates at bs=1).


Documentation

License

MIT
