Implement GPU collective communication using a custom PyTorch c10d backend built on libibverbs and softRoCE for distributed model training.
Updated May 21, 2026 - C++
AI cluster debugging lab for distributed LLM and HPC workloads: GPU, NCCL, Kubernetes, failure analysis, and tuning recommendations.
NYDUX — AI Infrastructure Intelligence | GPU Cluster Diagnostics | ML Systems Mastery
Safe Rust wrapper around the CUDA toolkit
CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.
Actor-shaped face for compute acceleration backends (NVIDIA CUDA shipping; ROCm / Metal / oneAPI / Vulkan future). Built on the atomr actor runtime.
Practical AI homelab setup guides for GB10, Mac Studio Ultra, RoCE/RDMA, MikroTik switching, NCCL, and heterogeneous workload experiments.
Mini Distributed Training Framework using NCCL
Quick, simple test to benchmark your PyTorch + NCCL/NCCLX setup
From-scratch RDMA-based PyTorch backend in C++; trained a char-level GPT on TinyShakespeare via DDP through code I wrote.
Experimental Explicit Communications API for Kokkos
Fixes NCCL hangs on NVIDIA L40S GPUs by resolving IOMMU-induced PCIe P2P communication issues. Includes reproducible tests, architecture explanation, and production-ready solution.
A Reliable and Resilient Collective Communication Library for NCCL and others
Multi-threaded Layer-2 stress and hardware evaluation tool for GPU cluster fabric validation
Real multi-node MPI benchmarks for AI infrastructure teams. By NYDUX — nydux.ai
A hybrid testbed for evaluating top open-source LLMs (like gpt-oss-20b and Llama 3.3) on local GPUs, cloud GPUs, and AWS Inferentia2/Trainium instances, focusing on vLLM optimization, capacity management, kernel bypass, and hardware-software co-design, as well as supporting infrastructure such as NCCL, RDMA, and NVMeoF.