thearyanahmed/gpuor-debian

gpu-operator-ready Debian driver container

A reference NVIDIA driver container for Debian 13 worker nodes that expose the gpu-operator-ready worker-image variant (i.e. nodes that ship doca-roce / RDMA fabric userland but no NVIDIA driver / CUDA / DCGM / fabric-manager / nvidia-container-toolkit, on the assumption that the customer installs them via the NVIDIA GPU Operator).

NVIDIA does not publish driver-container images for Debian 12/13 on nvcr.io/nvidia/driver, only Ubuntu / RHEL / CoreOS / SLES. This repo fills that gap so the GPU Operator's driver DaemonSet can pull a working image on Debian-based GPU node pools.

⚠️ Reference / unsupported. This is a working spike, not a productized driver container. Don't ship arbitrary workloads on this without doing your own validation, especially for GPUDirect RDMA (the nvidia_peermem ↔ MOFED/DOCA story; see Why this works below).

What's in here

| Path | Purpose |
| --- | --- |
| Dockerfile | Builds the driver-container image. Pinned to one driver version, one DOCA version, one host kernel. |
| nvidia-driver | Entrypoint (matches the GPU Operator's expectations: init / reload_nvidia_peermem / probe_nvidia_peermem). |
| Makefile | Build / push / verify helpers using docker buildx. |
| k8s/daemonset.yaml | Standalone privileged DaemonSet that runs the driver container WITHOUT requiring the GPU Operator (useful for validating driver bring-up in isolation). |
| k8s/cuda-test-pod.yaml | Privileged Pod that runs the Rust cuda-smoketest binary against the driver loaded by the DaemonSet above. |
| cuda-test/ | Small Rust project (cudarc-based) that exercises the CUDA driver API end-to-end: device enumeration, host↔device memory roundtrip, NVRTC vector-add kernel. |

Quick start

1. Build & push the driver image

Set REGISTRY to a registry you can push to, then:

make build-push REGISTRY=<your-registry>/<your-namespace>

This forces --platform=linux/amd64 (so it works from an Apple Silicon dev machine via buildx + qemu — slow on first run, fine after). The build will fail loudly at the post-install assertion step if nvidia_peermem.ko isn't present or isn't linked against ib_register_peer_memory_client — that's intentional (see Why this works).

make print-tag REGISTRY=... prints the tag without building. Tags encode (driver_version, kernel, doca_version) so you can't pull a "looks right but isn't" image onto a node with a different kernel/DOCA.
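
As a rough illustration of how such a tag could be composed, here is a minimal shell sketch. The Makefile's real format is not shown in this README, so the function name `image_tag`, the field order, and the `doca` prefix are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper: compose an image tag from the three values the tag is
# said to encode. Separator, ordering, and the "doca" prefix are assumptions.
image_tag() {
    driver_version=$1
    kernel=$2
    doca_version=$3
    # Docker tags only allow [A-Za-z0-9_.-]; replace anything else (for
    # example the "+" in some Debian kernel release strings) with "_".
    printf '%s-%s-doca%s\n' "$driver_version" "$kernel" "$doca_version" \
        | sed 's/[^A-Za-z0-9_.-]/_/g'
}
```

Calling `image_tag "$DRIVER_VERSION" "$(uname -r)" "$DOCA_VERSION"` on a worker node and comparing the result with the tag the DaemonSet references is one quick way to spot a kernel or DOCA mismatch before the pod is scheduled.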

2. Build & push the cuda-smoketest image

cd cuda-test
make build-push REGISTRY=<your-registry>/<your-namespace>
cd ..

3. Deploy

Create a namespace and (if your registry is private) install a pull secret in it. Edit k8s/daemonset.yaml and k8s/cuda-test-pod.yaml to match your image references and pull-secret name, then:

kubectl create namespace nvidia-driver
kubectl apply -f k8s/daemonset.yaml
kubectl -n nvidia-driver rollout status ds/nvidia-driver --timeout=10m
kubectl -n nvidia-driver logs -l app.kubernetes.io/name=nvidia-driver --tail=200

kubectl apply -f k8s/cuda-test-pod.yaml
kubectl -n nvidia-driver wait --for=condition=Ready pod/cuda-smoketest --timeout=2m || true
kubectl -n nvidia-driver logs -f pod/cuda-smoketest

The DaemonSet's nodeSelector is gpu-operator-ready: "true" — adjust to whatever label your fabric-connected NVIDIA nodes carry.

Why this works (and why an off-the-shelf NVIDIA driver container doesn't)

The hard part of "NVIDIA driver container on Debian" isn't installing the driver — nvidia-kernel-open-dkms from NVIDIA's Debian repo handles that. The hard part is nvidia_peermem, the kernel module that bridges NVIDIA GPU memory and the IB / RoCE stack so GPUDirect RDMA works.

nvidia_peermem is built by DKMS as part of the NVIDIA driver source tree, and it links against the IB peer-memory client API exported by ib_core (ib_register_peer_memory_client and friends). Those symbols exist in:

  • the in-tree kernel build (stock ib_core.ko from linux-image)
  • MOFED / DOCA's replacement ib_core (from the doca-host / mlnx-ofed-kernel-dkms packages)

…with different symbol-version CRCs. If you build nvidia_peermem against stock kernel headers but the running host uses MOFED's ib_core, the kernel returns -EINVAL at modprobe time because the CRCs don't match.

This Dockerfile installs DOCA + mlnx-ofed-kernel-dkms BEFORE the NVIDIA driver, so DKMS picks up MOFED's Module.symvers when it builds nvidia_peermem.ko. The post-install assertion confirms the resulting .ko actually declares the IB peer-memory symbols.

# 1. Install DOCA / MOFED kernel sources first.
RUN wget -q https://www.mellanox.com/downloads/DOCA/DOCA_v3.3.0/host/${DOCA_HOST_DEB} && \
    dpkg -i ${DOCA_HOST_DEB} && \
    apt-get update && \
    apt-get install -y doca-roce mlnx-ofed-kernel-dkms

# 2. THEN install the NVIDIA driver — DKMS now sees MOFED Module.symvers.
RUN apt-get install -y nvidia-kernel-open-dkms=${DRIVER_VERSION}-1 ...

# 3. Fail the build if peermem didn't link against the IB peer-memory API.
RUN modprobe --dump-modversions /lib/modules/${TARGET_KERNEL}/updates/dkms/nvidia-peermem.ko* \
        | grep -E 'ib_register_peer_memory_client' \
        || (echo "FATAL: nvidia-peermem has no ib_register_peer_memory_client symbol" >&2; exit 1)

That's the central trick. Everything else in the Dockerfile is either making the install reproducible (pinning every package to ${DRIVER_VERSION}-1, only installing headers for one target kernel), or shaping the install to fit the GPU Operator's contract (staging GSP firmware to a non-overlaid path; extracting nvidia-smi without dragging in the Debian display stack).
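
The build-time assertion only proves that the symbol is referenced. To compare the actual CRCs, a sketch along the following lines may help. It relies on the fact that both `modprobe --dump-modversions` output and `Module.symvers` list the CRC in the first column and the symbol name in the second. The function names `symbol_crc` and `crc_matches` are hypothetical and not part of this repo.

```shell
#!/bin/sh
# Print the CRC recorded for one symbol in a "CRC<whitespace>symbol ..." listing.
symbol_crc() {
    awk -v sym="$2" '$2 == sym { print $1; exit }' "$1"
}

# Hypothetical check: succeed only if the CRC the module expects for a symbol
# equals the CRC that the exporting ib_core was built with.
crc_matches() {
    modversions_file=$1   # saved output of: modprobe --dump-modversions <module.ko>
    symvers_file=$2       # Module.symvers of the ib_core the host actually runs
    symbol=$3
    want=$(symbol_crc "$symvers_file" "$symbol")
    have=$(symbol_crc "$modversions_file" "$symbol")
    # An empty "want" means the exporter does not provide the symbol at all.
    [ -n "$want" ] && [ "$want" = "$have" ]
}
```

A typical use would be to save the `modprobe --dump-modversions` output for nvidia-peermem.ko to a file and run `crc_matches` against whichever `Module.symvers` the host's MOFED / DOCA build produced, once for each of the two peer-memory symbols.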

Entrypoint behavior

nvidia-driver init (the GPU Operator's main driver-pod entrypoint) does:

  1. assert_host_kernel_matches_image — refuses to start if uname -r on the host doesn't match TARGET_KERNEL baked into the image. The image only ships DKMS-built modules for one kernel; if the host diverged, fail fast with an actionable message rather than fail confusingly inside a runtime DKMS rebuild that can't succeed.
  2. build_modules — sanity-checks nvidia.ko, nvidia_uvm.ko, nvidia_modeset.ko, nvidia_peermem.ko are present (DKMS-built at image-build time).
  3. publish_firmware — stages GSP firmware into the operator-mounted hostPath at /lib/firmware/nvidia/<driver> (otherwise per-GPU adapter init fails with RmInitAdapter failed! and nvidia-smi reports "No devices were found").
  4. load_modules: runs modprobe for nvidia / nvidia_uvm / nvidia_modeset.
  5. create_device_nodes — creates /dev/nvidia* on the host's /dev (via /proc/1/root/dev, since the container's /dev is its own tmpfs).
  6. publish_driver_root — bind-mounts / to /run/nvidia/driver so other pods (notably the GPU Operator's toolkit / device-plugin / workload pods) can reach libcuda.so.1 etc.
  7. start_persistenced — NVIDIA Persistence Daemon for stable PCI state.
  8. start_fabricmanager — currently SKIPS with a warning because the nvidia-fabricmanager-start.sh wrapper isn't shipped in the Debian packages (see Known gaps below).
  9. load_peermem: runs modprobe nvidia_peermem. FATAL on failure (with a three-cause diagnostic dump), because silent peermem absence is exactly the failure mode this image can't ship with.
  10. signal_ready — touches /run/nvidia/validations/.driver-ctr-ready for the GPU Operator's validator.
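
To make the fail-fast idea in step 1 concrete, here is a minimal sketch of what such a guard could look like. It is not copied from the nvidia-driver script. The optional positional arguments exist only so the function can be exercised without touching the host, and reading TARGET_KERNEL from the environment is an assumption.

```shell
#!/bin/sh
# Hypothetical sketch of the kernel guard; the real entrypoint may differ.
# $1 (optional): host kernel release, defaults to `uname -r`.
# $2 (optional): kernel the image was built for, defaults to $TARGET_KERNEL.
assert_host_kernel_matches_image() {
    host_kernel=${1:-$(uname -r)}
    target_kernel=${2:-${TARGET_KERNEL:-}}
    if [ -z "$target_kernel" ]; then
        echo "FATAL: TARGET_KERNEL is not set in this image" >&2
        return 1
    fi
    if [ "$host_kernel" != "$target_kernel" ]; then
        echo "FATAL: host kernel $host_kernel != image kernel $target_kernel;" \
             "rebuild the image for this kernel" >&2
        return 1
    fi
}
```

Doing the comparison before any module work keeps the error message close to the cause: the operator sees a kernel mismatch, not a DKMS or modprobe failure several steps later.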

nvidia-driver reload_nvidia_peermem (sidecar pod) and nvidia-driver probe_nvidia_peermem (startup probe) match the GPU Operator's standard sidecar contract.
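
A startup probe of this kind essentially asks whether the module is loaded. The sketch below shows one way to answer that by reading /proc/modules. Whether the actual probe_nvidia_peermem subcommand works this way is an assumption, and `peermem_loaded` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical probe: succeed if nvidia_peermem appears as a loaded module.
# The path argument defaults to /proc/modules and exists only so the function
# can be tested against a fixture file.
peermem_loaded() {
    modules_file=${1:-/proc/modules}
    # /proc/modules lists one module per line with the name in column 1.
    awk '$1 == "nvidia_peermem" { found = 1 } END { exit !found }' "$modules_file"
}
```

Matching on the first column only avoids a false positive from lines where nvidia_peermem merely shows up in another module's dependents list.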

What cuda-smoketest verifies

In order, each step proves the next layer of the stack works:

  1. libcuda.so.1 is reachable + dlopen-able. Validates that the driver DaemonSet's bind-mount of / to /run/nvidia/driver worked and that LD_LIBRARY_PATH resolves the host's libcuda.
  2. cuInit + device enumeration succeed. Validates that the nvidia kernel module is loaded and /dev/nvidia* device nodes exist with correct majors/minors.
  3. Each device reports name / compute capability / memory. Validates GSP firmware was published correctly (this is exactly the surface where "RmInitAdapter failed → No devices were found" hits when firmware is missing).
  4. Host → device → host memory roundtrip on every GPU. Validates libcuda is actually talking to the kernel module, not just loading.
  5. (--all) NVRTC vector-add kernel. Validates libnvrtc.so resolves and the JIT path works — i.e. CUDA user-space is not just responding but executing real kernels on the device.

What cuda-smoketest does not test:

  • GPUDirect RDMA / nvidia_peermem end-to-end. The driver DaemonSet's load_peermem() asserts the module loads, but proving RDMA actually moves bytes between GPU memory and the NIC needs an NCCL all-reduce across two nodes — nccl-tests all_reduce_perf with NCCL_DEBUG=INFO, watching for via NET/IB rather than via NET/Socket in the NCCL log.
  • Fabric Manager / NVSwitch. See Known gaps.
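
For the NCCL check described in the first bullet, a small helper can classify a captured log. This is a sketch that assumes the all_reduce_perf output was saved to a file with NCCL_DEBUG=INFO enabled; the function name `nccl_transport` is hypothetical.

```shell
#!/bin/sh
# Classify the network transport reported in a saved NCCL_DEBUG=INFO log.
# Prints "ib" if any channel went via NET/IB, "socket" if only NET/Socket
# was seen, and "unknown" if neither marker appears.
nccl_transport() {
    log_file=$1
    if grep -q 'via NET/IB' "$log_file"; then
        echo ib
    elif grep -q 'via NET/Socket' "$log_file"; then
        echo socket
    else
        echo unknown
    fi
}
```

A result of `socket` on nodes that are expected to use the RDMA fabric is a sign that NCCL fell back to TCP, which is worth investigating before trusting any bandwidth numbers.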

Known gaps

These are real and will bite before the spike is production-ready; they're out of scope for this image as it stands.

  1. nvidia-fabricmanager-start.sh is not in the Debian packages. The entrypoint references it for NVL5+ NVLink Subnet Manager bring-up; today it falls through with WARN: ... — skipping fabricmanager. Either source the script from NVIDIA's HGX bundle or hand-roll the equivalent (lspci VPD → port GUID → nvlsm → nv-fabricmanager). On NVL5+, multi-GPU NVLink P2P will be degraded until this is solved.
  2. nvidia-imex is installed but never started. Required for cross-node memory exchange on multi-node Blackwell. Needs a small supervised process in the entrypoint.
  3. nvlsm is not version-pinned. All NVIDIA packages are pinned to ${DRIVER_VERSION}-1; nvlsm floats. Pin it to the matching version.
  4. cuda/repos/debian12 on Debian 13 + [trusted=yes]. Calculated risk that works today (packages built for Debian 12 run on Debian 13 because its newer glibc is backward-compatible). Revisit when NVIDIA publishes cuda/repos/debian13 and refreshes the keyring's signature algorithm (the SHA1 self-signature issue is on NVIDIA's side).
  5. Driver branch. 590.48.01 is on NVIDIA's New Feature Branch (NFB), not the Production Branch. Fine for a spike; flag for any wider rollout.
  6. Cross-MOFED-version drift between builder and host. The mlnx-ofed-kernel-dkms we install in the builder must be the same sub-revision as the doca-host running on the worker node. Both are DOCA 3.3.0 today, but a host-side rebuild of MOFED for a kernel security update could introduce CRC drift even within DOCA 3.3.0. The entrypoint's load_peermem diagnostic dump distinguishes this case from "wrong major version" via the dmesg messages.
  7. MIG / driver-config / containerd interaction with the GPU Operator. Untested. The image only handles the driver part of the Operator's responsibilities; toolkit, device-plugin, validator, gpu-feature-discovery, etc. all come from the Operator chart and should Just Work but haven't been individually verified here.

Build verification

After make build-push, you can re-run the same DKMS / peermem-symbol assertions out-of-band against any pulled image:

docker pull <your-registry>/nvidia-driver-debian13:<tag>
make verify REGISTRY=<your-registry>

Expected output:

  1. ls /lib/modules/<kernel>/updates/dkms/ shows nvidia.ko*, nvidia-uvm.ko*, nvidia-modeset.ko*, nvidia-peermem.ko*.
  2. modinfo nvidia-peermem.ko* prints a vermagic matching the target kernel and a non-empty depends: line.
  3. modprobe --dump-modversions prints two lines containing ib_register_peer_memory_client and ib_unregister_peer_memory_client.

If all three pass, the image is built correctly. The only remaining risk surfaces are runtime ones: kernel/MOFED version drift on the host, load order of mlx5_ib, and fabric-manager bring-up on NVSwitched boxes.
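
The vermagic check in item 2 can be scripted in a few lines. The sketch below parses `modinfo` output from stdin; the helper names `vermagic_kernel` and `vermagic_matches` are hypothetical and are not targets of this repo's Makefile.

```shell
#!/bin/sh
# Extract the kernel release from the "vermagic:" line of modinfo output
# read on stdin, e.g.: modinfo nvidia-peermem.ko | vermagic_kernel
vermagic_kernel() {
    awk '$1 == "vermagic:" { print $2; exit }'
}

# Succeed only if the module's vermagic names the expected kernel release.
# $1: expected kernel release (the image's target kernel).
vermagic_matches() {
    [ "$(vermagic_kernel)" = "$1" ]
}
```

For example, `modinfo <path-to>/nvidia-peermem.ko | vermagic_matches "$(uname -r)"` run on a worker node returns non-zero when the module was built for a different kernel than the one that is running.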

License

This is a reference / spike. Use at your own risk. NVIDIA driver and CUDA components are subject to NVIDIA's licensing.
