CUDA backend (DGX-Spark) — refactored into modular .cuh files mirroring ROCm structure by gundemirbas · Pull Request #398 · antirez/ds4

gundemirbas · 2026-06-11T20:13:58Z

Summary

Add a full CUDA backend for NVIDIA GPU support (targeting DGX Spark / NVIDIA Grace Blackwell), refactored into modular .cuh files under a new cuda/ directory, mirroring the existing ROCm backend structure. The monolithic ds4_cuda.cu is replaced with a thin dispatcher that includes the appropriate .cuh modules.

Changes

New `cuda/` directory — 18 modular `.cuh` files

| File | Description |
| --- | --- |
| `ds4_cuda_runtime.cuh` | CUDA runtime, stream management, memory allocation, model loading, device discovery |
| `ds4_cuda_common.cuh` | Shared macros, device properties, common helpers |
| `ds4_cuda_attention.cuh` / `ds4_cuda_attention_launch.cuh` | Flash attention kernels and launch wrappers |
| `ds4_cuda_moe.cuh` / `ds4_cuda_moe_launch.cuh` | Mixture-of-Experts compute and launch kernels |
| `ds4_cuda_matmul.cuh` | Matrix multiplication (cuBLAS, custom kernels) |
| `ds4_cuda_norm_rope.cuh` | Layer normalization and RoPE embeddings |
| `ds4_cuda_fp8_kv.cuh` | FP8 KV-cache support |
| `ds4_cuda_compressor.cuh` | KV-cache compression |
| `ds4_cuda_hc.cuh` / `ds4_cuda_hc_output_launch.cuh` | Heavy cache (HC) operations |
| `ds4_cuda_indexer.cuh` | Indexing and scatter/gather operations |
| `ds4_cuda_router.cuh` | Expert routing |
| `ds4_cuda_q8_K.cuh` | Q8_K quantization kernels |
| `ds4_cuda_embedding.cuh` / `ds4_cuda_embedding_launch.cuh` | Embedding lookup kernels |
| `ds4_cuda_misc.cuh` | Miscellaneous utility kernels |

Refactored `ds4_cuda.cu`

Reduced from ~11,800 lines to a clean dispatcher that conditionally includes the modular .cuh files.
Uses the same pattern as ds4_rocm.cu → rocm/*.cuh.
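
For readers who have not seen the ROCm layout, the dispatcher pattern can be sketched roughly as follows. This is an illustration, not the actual contents of `ds4_cuda.cu`: the include order, the use of `DS4_CUDA` as the include guard, and the `ds4_cuda_modules_included()` helper are all assumptions made for the example.

```cpp
// Hypothetical sketch of a thin dispatcher translation unit.
// Not the PR's real ds4_cuda.cu; names marked "hypothetical" are invented here.

#if defined(DS4_CUDA)
// Assumed order: shared helpers first, then the runtime, then kernel families.
#include "cuda/ds4_cuda_common.cuh"
#include "cuda/ds4_cuda_runtime.cuh"
#include "cuda/ds4_cuda_matmul.cuh"
#include "cuda/ds4_cuda_attention.cuh"
#include "cuda/ds4_cuda_attention_launch.cuh"
// ... the remaining modules from the file table would follow here ...
#define DS4_CUDA_MODULES_INCLUDED 1
#else
// Without DS4_CUDA nothing is pulled in, so this file stays an empty shell.
#define DS4_CUDA_MODULES_INCLUDED 0
#endif

// Hypothetical helper: lets callers check whether the modules were compiled in.
inline bool ds4_cuda_modules_included() {
    return DS4_CUDA_MODULES_INCLUDED != 0;
}
```

Built without `DS4_CUDA`, the sketch compiles as plain C++ and the helper returns `false`. With the define set (which `make CUDA=1` is described as doing), the kernel modules become part of the single translation unit.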

`Makefile`

Added CUDA target: make CUDA=1 builds the CUDA backend.
Conditional compilation with DS4_CUDA preprocessor define.

`shell.nix`

Updated Nix development environment with both ROCm and CUDA toolchains (HIP, cuBLAS, cuDNN, etc.).

Motivation

The DGX Spark (NVIDIA Grace Blackwell) is a compact AI workstation with 128 GB unified memory and a Blackwell GPU. Adding a first-class CUDA backend enables ds4 to run on this platform without the ROCm compatibility layer. The modular structure mirrors the existing ROCm backend (rocm/*.cuh), making future cross-backend maintenance straightforward.

Design Decisions

Modular .cuh files — each kernel family lives in its own file, matching the ROCm layout. This makes it easy to compare CUDA and ROCm implementations side by side.
ds4_cuda_runtime.cuh — contains all CUDA-specific runtime code (streams, events, memory management, device queries), analogous to ds4_rocm_runtime.cuh.
Same API surface — all public functions keep their original signatures. The backend is selected at compile time via DS4_CUDA / DS4_ROCM defines.

Testing

Successfully builds and runs on DGX Spark (NVIDIA Grace Blackwell, 128 GB).
Inference tested with llama.cpp-format Q4_K_M models.
Attention, MoE, matmul, norm/rope, and FP8 KV-cache paths verified.

CUDA backend refactored into modular .cuh files under /cuda/ mirroring ROCm structure. Includes ds4_cuda_runtime.cuh, attention, moe, matmul, norm/rope, fp8_kv, compressor, hc, indexer, router, q8_K, embedding, misc kernels.
Makefile: CUDA target selection
shell.nix: ROCm + CUDA dev environment

yiakwy-xpu-ml-framework-team · 2026-06-13T08:41:16Z

@gundemirbas could you rebase your code onto #402? My change is more lightweight.

I am not sure whether you have added the quantization functions and the UVA prefetch cache for the CUDA backend, but I hope you have some tests to verify them.

Could you post some snapshots and logs to confirm that the critical path works as expected?

gundemirbas wants to merge 1 commit into antirez:main from gundemirbas:pr/cuda-backend
