CUDA backend (DGX-Spark) — refactored into modular .cuh files mirroring ROCm structure #398

Open
gundemirbas wants to merge 1 commit into antirez:main from gundemirbas:pr/cuda-backend

@gundemirbas gundemirbas commented Jun 11, 2026

Summary

Add a full CUDA backend for NVIDIA GPU support (targeting DGX Spark / NVIDIA Grace Blackwell), refactored into modular .cuh files under a new cuda/ directory, mirroring the existing ROCm backend structure. The monolithic ds4_cuda.cu is replaced with a thin dispatcher that includes the appropriate .cuh modules.

Changes

New cuda/ directory — 18 modular .cuh files

| File | Description |
| --- | --- |
| ds4_cuda_runtime.cuh | CUDA runtime, stream management, memory allocation, model loading, device discovery |
| ds4_cuda_common.cuh | Shared macros, device properties, common helpers |
| ds4_cuda_attention.cuh / ds4_cuda_attention_launch.cuh | Flash attention kernels and launch wrappers |
| ds4_cuda_moe.cuh / ds4_cuda_moe_launch.cuh | Mixture-of-Experts compute and launch kernels |
| ds4_cuda_matmul.cuh | Matrix multiplication (cuBLAS, custom kernels) |
| ds4_cuda_norm_rope.cuh | Layer normalization and RoPE embeddings |
| ds4_cuda_fp8_kv.cuh | FP8 KV-cache support |
| ds4_cuda_compressor.cuh | KV-cache compression |
| ds4_cuda_hc.cuh / ds4_cuda_hc_output_launch.cuh | Heavy cache (HC) operations |
| ds4_cuda_indexer.cuh | Indexing and scatter/gather operations |
| ds4_cuda_router.cuh | Expert routing |
| ds4_cuda_q8_K.cuh | Q8_K quantization kernels |
| ds4_cuda_embedding.cuh / ds4_cuda_embedding_launch.cuh | Embedding lookup kernels |
| ds4_cuda_misc.cuh | Miscellaneous utility kernels |

Refactored ds4_cuda.cu

  • Reduced from ~11,800 lines to a clean dispatcher that conditionally includes the modular .cuh files.
  • Uses the same pattern as ds4_rocm.cu, which includes rocm/*.cuh.
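
For readers unfamiliar with the pattern, a thin dispatcher of this kind might look roughly like the sketch below. The include order, the subset of modules shown, and the `ds4_sketch_cuda_modules_included` flag are assumptions made for illustration only; the actual ds4_cuda.cu in this PR may be organized differently.

```cpp
// ds4_cuda.cu (illustrative sketch, not the file from this PR).
// The whole backend is pulled in through includes, and only when the
// build defines DS4_CUDA. Without that define, this translation unit
// is nearly empty and compiles with any C++ compiler.

#if defined(DS4_CUDA)

// Assumed order: shared helpers first, then the runtime, then kernel
// families. Only a few of the 18 modules are listed here.
#include "cuda/ds4_cuda_common.cuh"
#include "cuda/ds4_cuda_runtime.cuh"
#include "cuda/ds4_cuda_matmul.cuh"
#include "cuda/ds4_cuda_attention.cuh"
#include "cuda/ds4_cuda_attention_launch.cuh"
// ... remaining cuda/*.cuh modules would follow here ...

// Hypothetical flag, present only so the sketch has something to check.
constexpr bool ds4_sketch_cuda_modules_included = true;

#else

constexpr bool ds4_sketch_cuda_modules_included = false;

#endif
```

Built without the define, the sketch includes nothing from cuda/, which is what keeps non-CUDA builds independent of the NVIDIA toolchain.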

Makefile

  • Added CUDA target: make CUDA=1 builds the CUDA backend.
  • Conditional compilation with DS4_CUDA preprocessor define.

shell.nix

  • Updated Nix development environment with both ROCm and CUDA toolchains (HIP, cuBLAS, cuDNN, etc.).

Motivation

The DGX Spark (NVIDIA Grace Blackwell) is a compact AI workstation with 128 GB unified memory and a Blackwell GPU. Adding a first-class CUDA backend enables ds4 to run on this platform without the ROCm compatibility layer. The modular structure mirrors the existing ROCm backend (rocm/*.cuh), making future cross-backend maintenance straightforward.

Design Decisions

  • Modular .cuh files — each kernel family lives in its own file, matching the ROCm layout. This makes it easy to compare CUDA and ROCm implementations side by side.
  • ds4_cuda_runtime.cuh — contains all CUDA-specific runtime code (streams, events, memory management, device queries), analogous to ds4_rocm_runtime.cuh.
  • Same API surface — all public functions keep their original signatures. The backend is selected at compile time via DS4_CUDA / DS4_ROCM defines.
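
The "same API surface" point can be sketched as a single declaration with one body per backend. The function `ds4_backend_name` and the `cpu-stub` fallback are hypothetical and exist only so that the example compiles on its own; they are not taken from the ds4 sources.

```cpp
// Sketch: one public signature, backend body chosen at compile time.
// Build with -DDS4_CUDA or -DDS4_ROCM. With neither define, a stub
// body is used (an assumption, so the sketch compiles anywhere).
#include <cassert>
#include <cstring>

#if defined(DS4_CUDA) && defined(DS4_ROCM)
#error "Define at most one of DS4_CUDA and DS4_ROCM"
#endif

// Hypothetical public function. Callers see only this declaration,
// which stays identical regardless of the selected backend.
const char *ds4_backend_name(void);

#if defined(DS4_CUDA)
const char *ds4_backend_name(void) { return "cuda"; }
#elif defined(DS4_ROCM)
const char *ds4_backend_name(void) { return "rocm"; }
#else
const char *ds4_backend_name(void) { return "cpu-stub"; }
#endif

// Caller code is written once and does not mention any backend.
inline bool ds4_sketch_backend_is(const char *expected) {
    assert(expected != nullptr);
    return std::strcmp(ds4_backend_name(), expected) == 0;
}
```

Because the selection happens in the preprocessor, exactly one body ends up in the binary and no runtime dispatch is needed.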

Testing

  • Successfully builds and runs on DGX Spark (NVIDIA Grace Blackwell, 128 GB).
  • Inference tested with llama.cpp-format Q4_K_M models.
  • Attention, MoE, matmul, norm/rope, and FP8 KV-cache paths verified.

CUDA backend refactored into modular .cuh files under /cuda/
mirroring ROCm structure. Includes ds4_cuda_runtime.cuh,
attention, moe, matmul, norm/rope, fp8_kv, compressor,
hc, indexer, router, q8_K, embedding, misc kernels.

Makefile: CUDA target selection
shell.nix: ROCm + CUDA dev environment
@yiakwy-xpu-ml-framework-team

@gundemirbas could you rebase your code on #402, since my change is lighter?

I am not sure whether you have added the quantization functions and the UVA prefetch cache for the CUDA backend, but I hope you have some tests to verify them.

Could you post some snapshots and logs to confirm that the critical path works as expected?
