Tags · NVIDIA/nccl

Device API Improvements:
- Supports Device API struct versioning for backwards compatibility with future versions.
- Adds ncclCommQueryProperties to allow Device API users to check supported features before creating a DevComm.
- Adds host-accessible device pointer functions from symmetric registered ncclWindows.
- Adds improved GIN documentation to clarify the support matrix.

New One-Sided Host APIs:
- Adds new zero-SM host APIs (ncclPutSignal, ncclWaitSignal, etc.) for both network and NVL transfers.
- A one-sided communication operation writes data from the local buffer to a remote peer’s registered memory window without explicit participation from the target process.
- Uses CopyEngine for NVL transfers and a CPU proxy for network transfers.
- Requires CUDA 12.5 or greater.
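
To make the one-sided semantics above concrete, here is a toy, pure-Python model. It is not the NCCL API: the `Window` class, its `put_signal`/`wait_signal` methods, and the counter-based signal are hypothetical stand-ins chosen for illustration. It only shows the idea that the origin writes into the target's window and raises a signal, while the target does nothing except wait for that signal.

```python
import threading


class Window:
    """Toy stand-in for a registered memory window owned by the target rank."""

    def __init__(self, size):
        self.data = bytearray(size)
        self._signals = 0  # assumption: signals are modelled as a simple counter
        self._cond = threading.Condition()

    def put_signal(self, offset, payload):
        """Origin side: write the payload into the window, then raise a signal."""
        if offset < 0 or offset + len(payload) > len(self.data):
            raise ValueError("write falls outside the window")
        with self._cond:
            self.data[offset:offset + len(payload)] = payload
            self._signals += 1
            self._cond.notify_all()

    def wait_signal(self, expected, timeout=5.0):
        """Target side: block until at least `expected` signals have arrived."""
        with self._cond:
            return self._cond.wait_for(lambda: self._signals >= expected, timeout)


# The target only registers the window and waits; it never issues a receive.
target_window = Window(16)
origin = threading.Thread(target=target_window.put_signal, args=(4, b"nccl"))
origin.start()
arrived = target_window.wait_signal(expected=1)
origin.join()
received = bytes(target_window.data[4:8])
```

In the real library the transfer is carried out by CopyEngine or a CPU proxy as described above; the thread here merely plays the role of the remote origin.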

New Experimental Python language binding (NCCL4Py):
- Pythonic NCCL API for Python applications - native collectives, P2P and other NCCL operations.
- Interoperable with CUDA Python ecosystem: DLPack/CUDA Array Interface, and special support for PyTorch and CuPy.
- Automatic cleanup of NCCL-managed resources (GPU buffers, registered buffers/windows, custom reduction operations).

New LLVM intermediate representation (IR) support:
- Exposes NCCL Device APIs through LLVM IR to enable consumption by diverse code generation systems.
- Example usages include high-level languages, Just-In-Time (JIT) compilers, and domain-specific languages (DSL).
- Build with EMIT_LLVM_IR=1 to generate LLVM IR bitcode.
- Requires CUDA 12 and Clang 21.

Built-in hybrid (LSA+GIN) symmetric kernel for AllGather:
- Adds a new hierarchical kernel using MCRing (NVLS multicast + Ring) to improve performance and scalability of AllGather.
- Requires symmetric memory registration and GIN.

New ncclCommGrow API:
- Adds the ability to dynamically and efficiently add ranks to an existing NCCL communicator.
- Use ncclCommGrow with ncclCommShrink to adjust membership of communicators in response to failing and recovering nodes.
- Also addresses the need for elastic applications to expand a running job by integrating new ranks.

Multi-segment registration:
- Expands buffer registration to support multiple segments of physical memory mapped to one contiguous VA space for the p2p, ib and nvls transports.
- Enables support for expandable segments in PyTorch.
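
As a rough illustration of what "multiple segments mapped to one contiguous VA space" means, the sketch below resolves an offset in the contiguous range to the backing segment and the offset inside it. The function names and the segment sizes are hypothetical, and this is a conceptual model rather than how NCCL implements registration.

```python
from bisect import bisect_right
from itertools import accumulate


def segment_starts(segment_sizes):
    """Return the starting VA offset of each segment in the contiguous range."""
    # Assumes every segment size is a positive integer.
    return [0] + list(accumulate(segment_sizes))[:-1]


def locate(segment_sizes, va_offset):
    """Map an offset in the contiguous VA range to (segment index, offset in segment)."""
    starts = segment_starts(segment_sizes)
    total = sum(segment_sizes)
    if not 0 <= va_offset < total:
        raise ValueError("offset outside the registered range")
    index = bisect_right(starts, va_offset) - 1
    return index, va_offset - starts[index]


# Three hypothetical physical segments backing one 16 KiB virtual range.
sizes = [4096, 8192, 4096]
first = locate(sizes, 4095)
middle = locate(sizes, 5000)
last = locate(sizes, 12288)
```

A buffer that spans several segments is therefore one registration from the user's point of view, even though each transport has to handle every backing segment.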

Improves scalability of the AllGatherV pattern:
- Adds support for a scalable AllGatherV pattern (group of broadcasts).
- Adds new scheduler path and new kernels to improve performance at large scale.
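
The "group of broadcasts" formulation can be checked with a small pure-Python simulation. The function name is hypothetical and no NCCL calls are involved. Each rank acts as the root of one broadcast of its own, possibly differently sized, contribution, and every rank ends up with the concatenation in rank order.

```python
def allgatherv_as_broadcasts(contributions):
    """Simulate AllGatherV as one broadcast per rank.

    contributions[r] is the (variable-length) list that rank r contributes.
    Returns the receive buffer of every rank.
    """
    nranks = len(contributions)
    recv = [[] for _ in range(nranks)]
    for root in range(nranks):       # one broadcast rooted at each rank
        for dst in range(nranks):    # every rank receives the root's data
            recv[dst].extend(contributions[root])
    return recv


# Four ranks with different element counts, including an empty contribution.
per_rank = [[1], [2, 3], [], [4, 5, 6]]
buffers = allgatherv_as_broadcasts(per_rank)
```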

Debuggability & Observability Improvements:
- RAS supports realtime monitoring to continuously track peer status changes.
- Inspector adds support for Prometheus format output (with NCCL_INSPECTOR_PROM_DUMP=1), in addition to the existing JSON format.
- Adds profiler support for CopyEngine (CE) based collectives.

Community Engagement:
- Adds contribution guide: https://github.com/NVIDIA/nccl/blob/master/CONTRIBUTING.md
- Adds NCCL_SOCKET_POLL_TIMEOUT_MSEC, which allows waiting instead of spinning during bootstrap in order to reduce CPU usage. (GitHub PR #1759)
- Fixes a segfault in ncclGin initialization that can happen if ncclGinIbGdaki.devices() fails after init() succeeds. (GitHub PR #1881)
- Fixes a crash that can happen when calling p2p operations and then collectives on the same user buffer. (GitHub Issue #1859)
- Fixes a bug that lowered performance on some sm80 or earlier machines with one NIC per GPU. (GitHub Issue #1876)
- Clears non-fatal CUDA errors so they do not propagate. (PyTorch Issue #164402)

Other Improvements:
- Improves performance of large-size AllGather operations using symmetric memory buffers on Blackwell by transparently switching to CE collectives.
- Improves the default number of channels per net peer for all-to-all, send, and recv to achieve better performance.
- Improves performance tuning of 256M-512M message sizes on Blackwell for AllReduce.
- Enables built-in symmetric kernels only on fully connected NVLink systems, as PCIe systems do not perform as well.
- Prints git branch and commit checksum at the INFO level during NCCL initialization.
- Improves support for symmetric window registrations on CUDA versions prior to 12.1.
- Relaxes symmetric buffer registration requirements for collectives so that users can leverage the symmetric kernels with only one of the buffers being registered, when possible.
- All-to-all, send, and recv now obey NCCL_NETDEVS_POLICY. By default, NCCL now uses only the subset of available network devices dictated by the Network Device Policy for these operations.
- Fixes a hang on GB200/300 + CX8 when the user disables GDR.
- Fixes a bug that could cause AllReduce on ncclFloat8e4m3 to yield “no algorithm/protocol available”.
- ncclCommWindowRegister now returns a NULL window if the system does not support window registration.
- Reports a more prominent error when cuMulticastBind fails and NCCL_NVLS_ENABLE=2 is set.
- Upgrades to DOCA GPUNetIO v1.1.

Known Limitations:
- Because the Device API was experimental in 2.28.x, applications that use the Device API in v2.28 may need modifications to work with v2.29.
- One-sided host APIs (e.g. ncclPutSignal) currently do not support graph capture. Future releases will add CUDA Graph support.
- The improved AllGatherV support breaks the NCCL profiler support for ncclBroadcast operations, limiting visibility to API events. NCCL_ALLGATHERV_ENABLE=0 can be used as a workaround until it is fixed in a future release.
- NCCL4Py (experimental) has a known issue with cuda.core 0.5.0. We currently recommend using cuda.core 0.4.1 with nccl4py.
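
A minimal sketch of how a launcher script might apply two of the mitigations above. The helper names are hypothetical, and the distribution name `cuda-core` used for the installed-version lookup is an assumption about how cuda.core is packaged. Only NCCL_ALLGATHERV_ENABLE=0 and the 0.5.0 / 0.4.1 version numbers come from the notes.

```python
from importlib import metadata


def with_allgatherv_workaround(base_env):
    """Return a copy of base_env with the AllGatherV profiler workaround applied."""
    env = dict(base_env)
    env["NCCL_ALLGATHERV_ENABLE"] = "0"  # workaround named in the notes
    return env


def cuda_core_has_known_issue(version):
    """True for the cuda.core version the notes flag as problematic with nccl4py."""
    return version.strip() == "0.5.0"


def installed_cuda_core_version():
    """Return the installed cuda.core version, or None if it is not installed."""
    try:
        # Assumption: the distribution is published under the name "cuda-core".
        return metadata.version("cuda-core")
    except metadata.PackageNotFoundError:
        return None


# Environment to pass to a child process, e.g. subprocess.run(cmd, env=child_env).
child_env = with_allgatherv_workaround({"PATH": "/usr/bin"})

found = installed_cuda_core_version()
if found is not None and cuda_core_has_known_issue(found):
    print("cuda.core 0.5.0 detected; the notes recommend 0.4.1 for nccl4py")
```

The environment variable has to be set before NCCL initializes in the child process, which is why the sketch builds the environment up front instead of changing it after launch.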

Dec 24, 2025
ebd1e92

nccl4py-v0.3.1

nccl-ep-v0.1.0

v2.30.7-1

nccl4py-v0.2.0

v2.30.4-1

v2.30.3-1

v2.29.7-1

v2.29.3-1

nccl4py-v0.1.1

v2.29.2-1
