Tags: NVIDIA/nccl

nccl4py-v0.3.1

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Merge nccl4py v0.3.1 release to master

Release notes: https://github.com/NVIDIA/nccl/releases/tag/nccl4py-v0.3.1

Signed-off-by: Bharath Ramesh <bhramesh@nvidia.com>

nccl-ep-v0.1.0

Verified

Merge NCCL EP v0.1.0 release into master

Release notes: https://github.com/NVIDIA/nccl/releases/tag/nccl-ep-v0.1.0

Signed-off-by: Bharath Ramesh <bhramesh@nvidia.com>

v2.30.7-1

Verified

Merge v2.30.7-1 release into master

Release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.30.7-1

Signed-off-by: Bharath Ramesh <bhramesh@nvidia.com>

nccl4py-v0.2.0

Verified

This tag was signed with the committer’s verified signature.
xiakun-lu Xiakun Lu
This release adds Python bindings for the new NCCL 2.30 one-sided RMA, Device API (GIN), and elastic communicator features, along with substantially more control over communicator configuration.

- **One-sided RMA (point-to-point)** — New `Communicator.put_signal()`, `Communicator.signal()`, and `Communicator.wait_signal()` methods, plus a `WaitSignalDesc` helper for describing signal values and match operations.
- **NCCL Device API host side setup** — New `Communicator.create_dev_comm()` that produces a `DevCommResource` for use with device-side NCCL kernels. Configure the device communicator through the new `NCCLDevCommRequirements` class, and introspect support via `device_api_support`, `gin_type`, `railed_gin_type`, `host_rma_support`, and `n_lsa_teams` properties.
- **Device pointer access for registered windows** — `RegisteredWindowHandle` now exposes `user_ptr`, `get_lsa_device_pointer()`, `get_lsa_multimem_device_pointer()`, and `get_peer_device_pointer()` for direct access to LSA, multimem, and peer mappings.
- **Elastic and fault-tolerant communicators** — New `Communicator.grow()`, `revoke()`, `suspend()`, and `resume()` methods to support elastic topology changes and error-handling flows. `CommSuspendFlag` added alongside existing `CommShrinkFlag`.
- **More flexible construction** — In addition to `init()`, communicators can now be created with class method `init_all()` and instance method `initialize()`. `Communicator.get_mem_stat()` reports per-communicator memory statistics.

New tuning knobs on `NCCLConfig`:

- `graph_usage_mode`, `num_rma_ctx`, `max_p2p_peers`.

`NCCLDevCommRequirements` — passed to `Communicator.create_dev_comm()` to describe the resources and capabilities a device communicator needs:

- LSA: `lsa_multimem`, `barrier_count`, `lsa_barrier_count`, `rail_gin_barrier_count`, `world_gin_barrier_count`, `lsa_ll_a2a_block_count`, `lsa_ll_a2a_slot_count`.
- GIN: `gin_force_enable`, `gin_context_count`, `gin_signal_count`, `gin_counter_count`, `gin_queue_depth`, `gin_connection_type`, `gin_exclusive_contexts`.

New `Communicator` properties: `cuda_dev`, `nvml_dev`, `device_api_support`, `multimem_support`, `gin_type`, `railed_gin_type`, `n_lsa_teams`, `host_rma_support`.

- `CTAPolicy` is now an `IntFlag` (was `IntEnum`) so multiple policies can be combined.
- Interop submodules `nccl.core.cupy` and `nccl.core.torch` are now lazy-loaded via `__getattr__` and only imported on first attribute access, so `import nccl.core` no longer pulls in CuPy or PyTorch.
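
The two behavior changes above can be sketched with standard-library stand-ins. This is only an illustration: `DemoPolicy`, `OldPolicy`, their member names and values, and the `make_lazy_package` helper are hypothetical and are not taken from nccl4py. The first part shows why an `IntFlag` can be combined while an `IntEnum` degrades to a plain `int`; the second shows the module-level `__getattr__` mechanism that defers an import until first attribute access.

```python
import enum
import importlib
import types


# Hypothetical stand-ins for CTAPolicy; names and values are illustrative only.
class OldPolicy(enum.IntEnum):
    EFFICIENCY = 1
    ZERO = 2


class DemoPolicy(enum.IntFlag):
    DEFAULT = 0
    EFFICIENCY = 1
    ZERO = 2


# IntEnum: OR-ing two members falls back to a plain int and loses the type.
old_combined = OldPolicy.EFFICIENCY | OldPolicy.ZERO

# IntFlag: OR-ing two members yields another DemoPolicy value.
new_combined = DemoPolicy.EFFICIENCY | DemoPolicy.ZERO


def make_lazy_package(name, lazy_submodules):
    """Build a module whose listed attributes are imported on first access.

    lazy_submodules maps attribute name -> importable module name.
    """
    pkg = types.ModuleType(name)

    def __getattr__(attr):
        # Called only when normal attribute lookup on the module fails.
        if attr in lazy_submodules:
            module = importlib.import_module(lazy_submodules[attr])
            setattr(pkg, attr, module)  # cache so later lookups are direct
            return module
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    pkg.__getattr__ = __getattr__
    return pkg


# "fractions" stands in for a heavy optional dependency such as CuPy or PyTorch.
demo_pkg = make_lazy_package("demo_pkg", {"heavy": "fractions"})
loaded_before_access = "heavy" in vars(demo_pkg)
heavy_module = demo_pkg.heavy  # the import happens here
loaded_after_access = "heavy" in vars(demo_pkg)
```

In a real package the `__getattr__` function would simply be defined at the top level of the package's `__init__.py`; the `types.ModuleType` wrapper is used here only so the sketch runs as a single file.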

v2.30.4-1

Verified

Merge NCCL 2.30.4-1 release to master

Release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.30.4-1

v2.30.3-1

Verified

Merge NCCL 2.30.3-1 release to master

Release notes: https://github.com/NVIDIA/nccl/releases/tag/v2.30.3-1

v2.29.7-1

Add missing files for the NCCL 2.29.7-1 release

v2.29.3-1

NCCL v2.29.3-1 Release

Fixes compare-and-swap (CAS) usage in the case of a weak CAS failure, which was causing a hang on ARM.
The issue affects NCCL when compiled with gcc versions prior to 10.
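
The release note does not show the affected code, so the following is only a general illustration of the pattern involved, not NCCL's implementation. A weak compare-and-swap is allowed to fail spuriously (it can report failure even when the stored value equals the expected one), which is why it has to sit inside a retry loop. The sketch models this in single-threaded Python; `AtomicCell`, its failure rate, and `fetch_add` are hypothetical names made up for the example.

```python
import random


class AtomicCell:
    """Toy single-threaded model of an integer with a weak compare-and-swap."""

    def __init__(self, value, spurious_failure_rate=0.5, seed=0):
        self.value = value
        self._rng = random.Random(seed)
        self._rate = spurious_failure_rate

    def compare_exchange_weak(self, expected, desired):
        """Return (success, observed).

        May fail even when the stored value equals `expected`, which models
        the spurious failures a weak CAS is permitted to have.
        """
        observed = self.value
        if observed != expected or self._rng.random() < self._rate:
            return False, observed
        self.value = desired
        return True, observed


def fetch_add(cell, delta):
    """Add delta to the cell with a CAS retry loop; return the previous value."""
    expected = cell.value
    while True:
        ok, observed = cell.compare_exchange_weak(expected, expected + delta)
        if ok:
            return expected
        # On failure (real or spurious) reload the observed value and retry.
        # Treating a single failure as "someone else changed the value" is
        # the kind of assumption that breaks when the failure was spurious.
        expected = observed


cell = AtomicCell(0)
previous_values = [fetch_add(cell, 1) for _ in range(100)]
```

With the retry loop in place, every increment is eventually applied regardless of how many spurious failures occur along the way.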

nccl4py-v0.1.1

fix compatibility issue with cuda.core 0.5.0

cuda.core 0.5.0 removed "experimental" from the module path and added experimental/__init__.py for compatibility, but cuda.core.experimental._stream.IsStreamT and cuda.core.experimental._memory.DevicePointerT are not included, which leads to a compatibility issue.
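
One common way to tolerate a name that moves between module paths is to probe a list of candidate locations in order. The helper below is a minimal sketch of that idea and is not the fix that was actually committed; `resolve_attr` is a hypothetical name, and the new-style module paths shown in the comment are inferred from the note above rather than verified against cuda.core.

```python
import importlib


def resolve_attr(module_names, attr):
    """Return `attr` from the first importable module in `module_names` that defines it."""
    tried = []
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            tried.append(f"{name} (not importable)")
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
        tried.append(f"{name} (no attribute {attr!r})")
    raise ImportError(f"could not resolve {attr!r}; tried: " + ", ".join(tried))


# Assumed usage for the case described above (paths are an assumption):
#   IsStreamT = resolve_attr(
#       ["cuda.core._stream", "cuda.core.experimental._stream"], "IsStreamT")

# Runnable demonstration with standard-library modules only.
sqrt = resolve_attr(["no_such_module_for_demo", "math"], "sqrt")
```

Probing for the attribute, not just the module, matters here because the compatibility package exists in 0.5.0 but does not re-export every name.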

v2.29.2-1

NCCL v2.29.2-1 Release

Device API Improvements:
- Supports Device API struct versioning for backwards compatibility with future versions.
- Adds ncclCommQueryProperties to allow Device API users to check supported features before creating a DevComm.
- Adds host-accessible device pointer functions from symmetric registered ncclWindows.
- Adds improved GIN documentation to clarify the support matrix.

New One-Sided Host APIs:
- Adds new host APIs (ncclPutSignal, ncclWaitSignal, etc.) for both network and NVL using zero-SM.
- A one-sided communication operation writes data from the local buffer to a remote peer’s registered memory window without explicit participation from the target process.
- Utilizes CopyEngine for NVL transfer and CPU proxy for network.
- Requires CUDA 12.5 or greater.
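
The put-then-signal semantics described above can be modeled in a few lines of plain Python. This is a toy, in-process analogy using threads, not the NCCL C API or the nccl4py bindings: `ToyWindow` and the argument lists of its `put_signal` and `wait_signal` methods are hypothetical. It shows the initiator writing into the target's window and raising a signal, while the target only waits on the signal and never posts a matching receive.

```python
import threading


class ToyWindow:
    """In-process stand-in for a registered memory window plus a signal counter."""

    def __init__(self, size):
        self.buffer = bytearray(size)
        self._signal = 0
        self._cond = threading.Condition()

    def put_signal(self, offset, data):
        """Initiator side: write into the target's window, then raise its signal."""
        with self._cond:
            self.buffer[offset:offset + len(data)] = data
            self._signal += 1
            self._cond.notify_all()

    def wait_signal(self, expected, timeout=5.0):
        """Target side: block until the signal counter reaches `expected`."""
        with self._cond:
            return self._cond.wait_for(lambda: self._signal >= expected, timeout)


target_window = ToyWindow(8)

# The initiator runs independently; the target does not call any "recv".
initiator = threading.Thread(target=target_window.put_signal, args=(0, b"hello"))
initiator.start()

arrived = target_window.wait_signal(expected=1)
initiator.join()
payload = bytes(target_window.buffer[:5])
```

The signal is raised only after the data is in place, so a target that observes the signal can safely read the window.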

New Experimental Python language binding (NCCL4Py):
- Pythonic NCCL API for Python applications - native collectives, P2P and other NCCL operations.
- Interoperable with CUDA Python ecosystem: DLPack/CUDA Array Interface, and special support for PyTorch and CuPy.
- Automatic cleanup of NCCL-managed resources (GPU buffers, registered buffers/windows, custom reduction operations).

New LLVM intermediate representation (IR) support:
- Exposes NCCL Device APIs through LLVM IR to enable consumption by diverse code generation systems.
- Example usages include high-level languages, Just-In-Time (JIT) compilers, and domain-specific languages (DSL).
- Build with EMIT_LLVM_IR=1 to generate LLVM IR bitcode.
- Requires CUDA 12 and Clang 21.

Built-in hybrid (LSA+GIN) symmetric kernel for AllGather:
- Adds a new hierarchical kernel using MCRing (NVLS multicast + Ring) to improve performance and scalability of AllGather.
- Requires symmetric memory registration and GIN.

New ncclCommGrow API:
- Adds the ability to dynamically and efficiently add ranks to an existing NCCL communicator.
- Use ncclCommGrow with ncclCommShrink to adjust membership of communicators in response to failing and recovering nodes.
- Also addresses the need for elastic applications to expand a running job by integrating new ranks.

Multi-segment registration:
- Expands buffer registration to support multiple segments of physical memory mapped to one contiguous VA space for the p2p, ib and nvls transports.
- Enables support for expandable segments in PyTorch.

Improves scalability of AllGatherV pattern:
- Adds support for a scalable allgatherv pattern (group of broadcasts).
- Adds new scheduler path and new kernels to improve performance at large scale.
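
To make the "group of broadcasts" formulation concrete, the sketch below simulates it on plain Python lists: every rank contributes a buffer of its own length, and the gathered result on each rank is built by running one broadcast per root. It is a conceptual model only; `broadcast` and `allgatherv_as_broadcasts` are hypothetical helpers and say nothing about the new scheduler path or kernels.

```python
def broadcast(send_buffers, root):
    """Toy broadcast: every rank receives a copy of the root's buffer."""
    return [list(send_buffers[root]) for _ in send_buffers]


def allgatherv_as_broadcasts(send_buffers):
    """Gather variable-length buffers on every rank using one broadcast per root."""
    n_ranks = len(send_buffers)
    recv_buffers = [[] for _ in range(n_ranks)]
    for root in range(n_ranks):
        received = broadcast(send_buffers, root)
        for rank in range(n_ranks):
            # Each rank appends the root's contribution in rank order.
            recv_buffers[rank].extend(received[rank])
    return recv_buffers


# Three ranks contributing 1, 2, and 0 elements respectively.
send = [[10], [20, 21], []]
gathered = allgatherv_as_broadcasts(send)
```

Because each broadcast carries its own length, ranks with different (or empty) contributions need no padding.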

Debuggability & Observability Improvements:
- RAS supports realtime monitoring to continuously track peer status changes.
- Inspector adds support for Prometheus format output (with NCCL_INSPECTOR_PROM_DUMP=1), in addition to the existing JSON format.
- Adds profiler support for CopyEngine(CE) based collectives.

Community Engagement:
- Adds contribution guide: https://github.com/NVIDIA/nccl/blob/master/CONTRIBUTING.md
- Adds NCCL_SOCKET_POLL_TIMEOUT_MSEC, which allows waiting instead of spinning during bootstrap in order to reduce CPU usage. (GitHub PR #1759)
- Fixes a segfault in ncclGin initialization that can happen if ncclGinIbGdaki.devices() fails after init() succeeds. (GitHub PR #1881)
- Fixes a crash that can happen when calling p2p and then collectives while using the same user buffer. (GitHub Issue #1859)
- Fixes a bug that was lowering performance on some sm80 or earlier machines with one NIC per GPU. (GitHub Issue #1876)
- Clears non-fatal CUDA errors so they do not propagate. (PyTorch Issue #164402)
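
The difference between spinning and waiting with a timeout, which is what the NCCL_SOCKET_POLL_TIMEOUT_MSEC item above refers to, can be illustrated with a local socket pair. This is a generic sketch of the technique using Python's `select`, not NCCL's bootstrap code; `wait_readable` is a hypothetical helper, and treating the value as milliseconds is an assumption based on the variable's name.

```python
import select
import socket


def wait_readable(sock, timeout_msec):
    """Sleep in the kernel until `sock` is readable or the timeout expires.

    Unlike a busy loop that repeatedly polls with a zero timeout, the calling
    thread consumes no CPU while it waits.
    """
    readable, _, _ = select.select([sock], [], [], timeout_msec / 1000.0)
    return bool(readable)


left, right = socket.socketpair()
try:
    # Nothing has been sent yet: the call blocks for about 50 ms, then gives up.
    ready_when_idle = wait_readable(left, 50)

    # Once the peer writes, the wait returns as soon as data is available.
    right.sendall(b"x")
    ready_after_send = wait_readable(left, 1000)
finally:
    left.close()
    right.close()
```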

Other Improvements:
- Improves performance of large-size AllGather operations using symmetric memory buffers on Blackwell by transparently switching to CE collectives.
- Improves the default number of channels per net peer for all-to-all, send, and recv to achieve better performance.
- Improves performance tuning of 256M-512M message sizes on Blackwell for AllReduce.
- Enables built-in symmetric kernels only on fully connected NVLink systems, as PCIe systems do not perform as well.
- Prints git branch and commit checksum at the INFO level during NCCL initialization.
- Improves support for symmetric window registrations on CUDA versions prior to 12.1.
- Relaxes symmetric buffer registration requirements for collectives so that users can leverage the symmetric kernels with only one of the buffers being registered, when possible.
- All-to-all, send, and recv now obey NCCL_NETDEVS_POLICY. For these operations, NCCL will now by default use a subset of available network devices as dictated by the Network Device Policy.
- Fixes a hang on GB200/300 + CX8 when the user disables GDR.
- Fixes a bug that could cause AllReduce on ncclFloat8e4m3 to yield “no algorithm/protocol available”.
- ncclCommWindowRegister will now return a NULL window if the system does not support window registration.
- Reports a more prominent error when cuMulticastBind fails and NCCL_NVLS_ENABLE=2.
- Upgrades to DOCA GPUNetIO v1.1.

Known Limitations:
- Since Device API was experimental in 2.28.x, applications that use the Device API in v2.28 may need modifications to work with v2.29.
- One-sided host APIs (e.g. ncclPutSignal) currently do not support graph capture. Future releases will add CUDA graph support.
- The improved AllGatherV support breaks the NCCL profiler support for ncclBroadcast operations, limiting visibility to API events. NCCL_ALLGATHERV_ENABLE=0 can be used as a workaround until it is fixed in a future release.
- NCCL4Py (experimental) has a known issue with cuda.core 0.5.0. We currently recommend using cuda.core 0.4.1 with nccl4py.
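
The environment-variable settings mentioned in these notes can be collected into an environment for a child process. The helper below is a small sketch of that: `nccl_env` is a hypothetical name, the script name in the comment is a placeholder, and only variables and values that appear in the notes above are used.

```python
import os


def nccl_env(base=None, **overrides):
    """Return a copy of an environment mapping with the given variables set.

    Values are converted to strings because environment variables are text.
    """
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        env[name] = str(value)
    return env


# Workaround from the Known Limitations above, plus Prometheus-format
# Inspector output; an empty base keeps the example deterministic.
env = nccl_env(base={}, NCCL_ALLGATHERV_ENABLE=0, NCCL_INSPECTOR_PROM_DUMP=1)

# Assumed usage (placeholder script name):
#   subprocess.run(["python", "train.py"], env=nccl_env(NCCL_ALLGATHERV_ENABLE=0))
```

Passing the environment to the child process, instead of mutating `os.environ` after the program has started, avoids depending on when the library reads its settings.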