A reference NVIDIA driver container for Debian 13 worker nodes that
expose the gpu-operator-ready worker-image variant (i.e. nodes that
ship doca-roce / RDMA fabric userland but no NVIDIA driver / CUDA /
DCGM / fabric-manager / nvidia-container-toolkit, on the assumption
that the customer installs them via the NVIDIA GPU Operator).
NVIDIA does not publish driver-container images for Debian 12/13 on
nvcr.io/nvidia/driver, only Ubuntu / RHEL / CoreOS / SLES. This repo
fills that gap so the GPU Operator's driver DaemonSet can pull a
working image on Debian-based GPU node pools.
⚠️ Reference / unsupported. This is a working spike, not a productized driver container. Don't ship arbitrary workloads on this without doing your own validation, especially for GPUDirect RDMA (the nvidia_peermem ↔ MOFED/DOCA story; see Why this works below).
| Path | Purpose |
|---|---|
| Dockerfile | Builds the driver-container image. Pinned to one driver version, one DOCA version, one host kernel. |
| nvidia-driver | Entrypoint (matches the GPU Operator's expectations: init / reload_nvidia_peermem / probe_nvidia_peermem). |
| Makefile | Build / push / verify helpers using docker buildx. |
| k8s/daemonset.yaml | Standalone privileged DaemonSet that runs the driver container WITHOUT requiring the GPU Operator (useful for validating driver bring-up in isolation). |
| k8s/cuda-test-pod.yaml | Privileged Pod that runs the Rust cuda-smoketest binary against the driver loaded by the DaemonSet above. |
| cuda-test/ | Small Rust project (cudarc-based) that exercises the CUDA driver API end-to-end: device enumeration, host↔device memory roundtrip, NVRTC vector-add kernel. |
Set REGISTRY to a registry you can push to, then:
make build-push REGISTRY=<your-registry>/<your-namespace>
This forces --platform=linux/amd64 (so it works from an Apple Silicon
dev machine via buildx + qemu — slow on first run, fine after). The
build will fail loudly at the post-install assertion step if
nvidia_peermem.ko isn't present or isn't linked against
ib_register_peer_memory_client — that's intentional (see Why this
works).
make print-tag REGISTRY=... prints the tag without building. Tags
encode (driver_version, kernel, doca_version) so you can't pull a
"looks right but isn't" image onto a node with a different kernel/DOCA.
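As a rough illustration of how a tag can encode all three pins, here is a minimal sketch. The `image_tag` helper, the `<driver>-<kernel>-doca<doca>` layout, and the example kernel string are assumptions made for illustration; the authoritative format is whatever `make print-tag` emits.

```shell
#!/bin/sh
# Hypothetical helper: compose an image tag from the three pinned versions.
# The layout <driver>-<kernel>-doca<doca> is an assumption, not the Makefile's
# confirmed format.
image_tag() {
    driver_version=$1
    kernel=$2
    doca_version=$3
    # Image tags may only contain [A-Za-z0-9_.-], so map any other character
    # (for example the '+' in Debian kernel release strings) to '_'.
    printf '%s-%s-doca%s\n' "$driver_version" "$kernel" "$doca_version" \
        | sed 's/[^A-Za-z0-9_.-]/_/g'
}

# Example values only; substitute the versions your nodes actually run.
image_tag 590.48.01 6.12.48+deb13-amd64 3.3.0
```

Because the kernel release is part of the tag, a node running a different kernel would need a different image reference, which makes a mismatch visible at pull time rather than at modprobe time.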
cd cuda-test
make build-push REGISTRY=<your-registry>/<your-namespace>
cd ..
Create a namespace and (if your registry is private) install a pull
secret in it. Edit k8s/daemonset.yaml and k8s/cuda-test-pod.yaml to
match your image references and pull-secret name, then:
kubectl create namespace nvidia-driver
kubectl apply -f k8s/daemonset.yaml
kubectl -n nvidia-driver rollout status ds/nvidia-driver --timeout=10m
kubectl -n nvidia-driver logs -l app.kubernetes.io/name=nvidia-driver --tail=200
kubectl apply -f k8s/cuda-test-pod.yaml
kubectl -n nvidia-driver wait --for=condition=Ready pod/cuda-smoketest --timeout=2m || true
kubectl -n nvidia-driver logs -f pod/cuda-smoketest
The DaemonSet's nodeSelector is gpu-operator-ready: "true" — adjust
to whatever label your fabric-connected NVIDIA nodes carry.
The hard part of "NVIDIA driver container on Debian" isn't installing
the driver — nvidia-kernel-open-dkms from NVIDIA's Debian repo
handles that. The hard part is nvidia_peermem, the kernel module
that bridges NVIDIA GPU memory and the IB / RoCE stack so GPUDirect
RDMA works.
nvidia_peermem is built by DKMS as part of the NVIDIA driver source
tree, and it links against the IB peer-memory client API exported
by ib_core (ib_register_peer_memory_client and friends). Those
symbols exist in:
- the in-tree kernel build (stock ib_core.ko from linux-image)
- MOFED / DOCA's replacement ib_core (from the doca-host / mlnx-ofed-kernel-dkms packages)
…with different symbol-version CRCs. If you build nvidia_peermem
against stock kernel headers but the running host uses MOFED's
ib_core, the kernel returns -EINVAL at modprobe time because the
CRCs don't match.
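The CRC comparison described above can be reproduced by hand. The sketch below assumes two text files: one holding the output of `modprobe --dump-modversions` for nvidia-peermem.ko, and one being the Module.symvers of the ib_core the host will actually load. Both formats start with a CRC column followed by the symbol name. The helper names `crc_of` and `check_symbol_crc` are hypothetical and not part of this repo.

```shell
#!/bin/sh
# crc_of SYMBOL FILE
# Print the CRC for SYMBOL from a file whose first two whitespace-separated
# columns are "<crc> <symbol>" (true for both dump-modversions output and
# Module.symvers).
crc_of() {
    awk -v sym="$1" '$2 == sym { print $1; exit }' "$2"
}

# check_symbol_crc SYMBOL MODVERSIONS_FILE SYMVERS_FILE
# Returns 0 on match, 1 on mismatch, 2 if the symbol is absent from either file.
check_symbol_crc() {
    have=$(crc_of "$1" "$2")   # CRC recorded in the module at build time
    want=$(crc_of "$1" "$3")   # CRC exported by the ib_core to be loaded
    if [ -z "$have" ] || [ -z "$want" ]; then
        echo "missing: $1"
        return 2
    fi
    if [ "$have" = "$want" ]; then
        echo "match: $1 $have"
    else
        echo "mismatch: $1 module=$have exporter=$want"
        return 1
    fi
}
```

A mismatch reported here corresponds to the modprobe-time rejection: the module was compiled against one ib_core and is being loaded next to another.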
This Dockerfile installs DOCA + mlnx-ofed-kernel-dkms BEFORE the
NVIDIA driver, so DKMS picks up MOFED's Module.symvers when it
builds nvidia_peermem.ko. The post-install assertion confirms the
resulting .ko actually declares the IB peer-memory symbols.
# 1. Install DOCA / MOFED kernel sources first.
RUN wget -q https://www.mellanox.com/downloads/DOCA/DOCA_v3.3.0/host/${DOCA_HOST_DEB} && \
dpkg -i ${DOCA_HOST_DEB} && \
apt-get update && \
apt-get install -y doca-roce mlnx-ofed-kernel-dkms
# 2. THEN install the NVIDIA driver — DKMS now sees MOFED Module.symvers.
RUN apt-get install -y nvidia-kernel-open-dkms=${DRIVER_VERSION}-1 ...
# 3. Fail the build if peermem didn't link against the IB peer-memory API.
RUN modprobe --dump-modversions /lib/modules/${TARGET_KERNEL}/updates/dkms/nvidia-peermem.ko* \
| grep -E 'ib_register_peer_memory_client' \
|| (echo "FATAL: nvidia-peermem has no ib_register_peer_memory_client symbol" >&2; exit 1)

That's the central trick. Everything else in the Dockerfile is either
making the install reproducible (pinning every package to
${DRIVER_VERSION}-1, only installing headers for one target kernel),
or shaping the install to fit the GPU Operator's contract (staging GSP
firmware to a non-overlaid path; extracting nvidia-smi without
dragging in the Debian display stack).
nvidia-driver init (the GPU Operator's main driver-pod entrypoint) does:
- assert_host_kernel_matches_image: refuses to start if uname -r on the host doesn't match the TARGET_KERNEL baked into the image. The image only ships DKMS-built modules for one kernel; if the host diverged, fail fast with an actionable message rather than fail confusingly inside a runtime DKMS rebuild that can't succeed.
- build_modules: sanity-checks that nvidia.ko, nvidia_uvm.ko, nvidia_modeset.ko, and nvidia_peermem.ko are present (DKMS-built at image-build time).
- publish_firmware: stages GSP firmware into the operator-mounted hostPath at /lib/firmware/nvidia/<driver> (otherwise per-GPU adapter init fails with RmInitAdapter failed! and nvidia-smi reports "No devices were found").
- load_modules: modprobe nvidia / nvidia_uvm / nvidia_modeset.
- create_device_nodes: creates /dev/nvidia* on the host's /dev (via /proc/1/root/dev, since the container's /dev is its own tmpfs).
- publish_driver_root: bind-mounts / to /run/nvidia/driver so other pods (notably the GPU Operator's toolkit / device-plugin / workload pods) can reach libcuda.so.1 etc.
- start_persistenced: NVIDIA Persistence Daemon for stable PCI state.
- start_fabricmanager: currently SKIPS with a warning because the nvidia-fabricmanager-start.sh wrapper isn't shipped in the Debian packages (see Known gaps below).
- load_peermem: modprobe nvidia_peermem. FATAL on failure (with a three-cause diagnostic dump); silent peermem absence is exactly the failure mode this image can't ship with.
- signal_ready: touches /run/nvidia/validations/.driver-ctr-ready for the GPU Operator's validator.
nvidia-driver reload_nvidia_peermem (sidecar pod) and
nvidia-driver probe_nvidia_peermem (startup probe) match the GPU
Operator's standard sidecar contract.
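To make the fail-fast kernel check concrete, here is a minimal sketch of the idea behind the first init step. It is not the entrypoint's actual code: the function name `assert_kernel_matches`, its optional second argument, and the message wording are all assumptions chosen so the logic can be exercised without a GPU node.

```shell
#!/bin/sh
# assert_kernel_matches TARGET_KERNEL [HOST_KERNEL]
# Hypothetical simplification of the host-kernel check. HOST_KERNEL defaults to
# `uname -r`; passing it explicitly makes the function testable off-node.
assert_kernel_matches() {
    target=$1
    host=${2:-$(uname -r)}
    if [ "$host" != "$target" ]; then
        # Actionable message on stderr, non-zero status so the caller can abort.
        echo "FATAL: host kernel $host does not match image kernel $target; rebuild the image for $host" >&2
        return 1
    fi
    echo "kernel OK: $host"
}
```

In an entrypoint this would typically run before any modprobe, for example `assert_kernel_matches "$TARGET_KERNEL" || exit 1`, so a mismatched node fails with one clear line instead of a module-load error.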
In order, each step proves the next layer of the stack works:
- libcuda.so.1 is reachable + dlopen-able. Validates that the driver DaemonSet's bind-mount of / to /run/nvidia/driver worked and that LD_LIBRARY_PATH resolves the host's libcuda.
- cuInit + device enumeration succeed. Validates that the nvidia kernel module is loaded and /dev/nvidia* device nodes exist with correct majors/minors.
- Each device reports name / compute capability / memory. Validates GSP firmware was published correctly (this is exactly the surface where "RmInitAdapter failed → No devices were found" hits when firmware is missing).
- Host → device → host memory roundtrip on every GPU. Validates libcuda is actually talking to the kernel module, not just loading.
- (--all) NVRTC vector-add kernel. Validates libnvrtc.so resolves and the JIT path works, i.e. CUDA user-space is not just responding but executing real kernels on the device.
What cuda-smoketest does not test:
- GPUDirect RDMA / nvidia_peermem end-to-end. The driver DaemonSet's load_peermem() asserts the module loads, but proving RDMA actually moves bytes between GPU memory and the NIC needs an NCCL all-reduce across two nodes: nccl-tests all_reduce_perf with NCCL_DEBUG=INFO, watching for via NET/IB rather than via NET/Socket in the NCCL log.
- Fabric Manager / NVSwitch. See Known gaps.
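The NCCL log check mentioned in the list above can be scripted once the log of an all_reduce_perf run (made with NCCL_DEBUG=INFO) has been saved to a file. This is a sketch under that assumption; `classify_nccl_transport` is a hypothetical helper, and it only looks for the two substrings named above, so it says which network transport NCCL picked, not how fast it ran.

```shell
#!/bin/sh
# classify_nccl_transport LOGFILE
# Prints "socket" (status 1) if any channel fell back to NET/Socket,
# "ib" (status 0) if only NET/IB channels were seen,
# "unknown" (status 2) if neither marker appears in the log.
classify_nccl_transport() {
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    ib=$(grep -c 'via NET/IB' "$1" || true)
    sock=$(grep -c 'via NET/Socket' "$1" || true)
    if [ "${sock:-0}" -gt 0 ]; then
        echo "socket"
        return 1
    elif [ "${ib:-0}" -gt 0 ]; then
        echo "ib"
    else
        echo "unknown"
        return 2
    fi
}
```

Treating any NET/Socket line as a failure is a deliberately strict choice: a run where some channels use IB and others fall back to sockets is still not the configuration this image is meant to enable.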
These are real and will bite before the spike is production-ready; they're out of scope for this image as it stands.
- nvidia-fabricmanager-start.sh is not in the Debian packages. The entrypoint references it for NVL5+ NVLink Subnet Manager bring-up; today it falls through with WARN: ... — skipping fabricmanager. Either source the script from NVIDIA's HGX bundle or hand-roll the equivalent (lspci VPD → port GUID → nvlsm → nv-fabricmanager). On NVL5+ multi-GPU NVLink P2P will be degraded until this is solved.
- nvidia-imex is installed but never started. Required for cross-node memory exchange on multi-node Blackwell. Needs a small supervised process in the entrypoint.
- nvlsm is not version-pinned. All NVIDIA packages are pinned to ${DRIVER_VERSION}-1; nvlsm floats. Pin it to the matching version.
- cuda/repos/debian12 on Debian 13 + [trusted=yes]. Calculated risk that works today (Debian 12 packages run on Debian 13's forward-compatible glibc). Revisit when NVIDIA publishes cuda/repos/debian13 and refreshes the keyring's signature algorithm (the SHA1 self-signature issue is on NVIDIA's side).
- Driver branch. 590.48.01 is on NVIDIA's New Feature Branch (NFB), not the Production Branch. Fine for a spike; flag for any wider rollout.
- Cross-MOFED-version drift between builder and host. The mlnx-ofed-kernel-dkms we install in the builder must be the same sub-revision as the doca-host running on the worker node. Both are DOCA 3.3.0 today, but a host-side rebuild of MOFED for a kernel security update could introduce CRC drift even within DOCA 3.3.0. The entrypoint's load_peermem diagnostic dump distinguishes this case from "wrong major version" via the dmesg messages.
- MIG / driver-config / containerd interaction with the GPU Operator. Untested. The image only handles the driver part of the Operator's responsibilities; toolkit, device-plugin, validator, gpu-feature-discovery, etc. all come from the Operator chart and should Just Work but haven't been individually verified here.
After make build-push, you can re-run the same DKMS / peermem-symbol
assertions out-of-band against any pulled image:
docker pull <your-registry>/nvidia-driver-debian13:<tag>
make verify REGISTRY=<your-registry>
Expected output:
- ls /lib/modules/<kernel>/updates/dkms/ shows nvidia.ko*, nvidia-uvm.ko*, nvidia-modeset.ko*, nvidia-peermem.ko*.
- modinfo nvidia-peermem.ko* prints a vermagic matching the target kernel and a non-empty depends: line.
- modprobe --dump-modversions prints two lines containing ib_register_peer_memory_client and ib_unregister_peer_memory_client.
If all three pass, the image is built correctly. The only remaining
risk surfaces are runtime ones: kernel/MOFED version drift on the host,
load order of mlx5_ib, and fabric-manager bring-up on NVSwitched
boxes.
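The third expectation (both peer-memory symbols present) is easy to automate. The sketch below reads `modprobe --dump-modversions` output on stdin and checks for exact symbol names in the second column. `has_peermem_symbols` is a hypothetical helper, not what `make verify` runs internally.

```shell
#!/bin/sh
# has_peermem_symbols
# Reads "<crc> <symbol>" lines on stdin and succeeds only if both IB
# peer-memory client symbols are listed.
has_peermem_symbols() {
    input=$(cat)
    for sym in ib_register_peer_memory_client ib_unregister_peer_memory_client; do
        # Exact match on column 2 so a longer symbol name cannot satisfy the check.
        printf '%s\n' "$input" \
            | awk -v s="$sym" '$2 == s { found = 1 } END { exit !found }' || {
            echo "FATAL: missing $sym" >&2
            return 1
        }
    done
    echo "peermem symbols OK"
}

# Intended use inside the image (not executed here):
#   modprobe --dump-modversions /lib/modules/<kernel>/updates/dkms/nvidia-peermem.ko* \
#       | has_peermem_symbols
```

This only proves the module was built against an ib_core that exports the peer-memory API; it does not prove the CRCs match the host's ib_core, which remains a runtime concern.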
This is a reference / spike. Use at your own risk. NVIDIA driver and CUDA components are subject to NVIDIA's licensing.