Tags: NVIDIA/nccl
Merge nccl4py v0.3.1 release to master

Release notes: https://github.com/NVIDIA/nccl/releases/tag/nccl4py-v0.3.1

Signed-off-by: Bharath Ramesh <bhramesh@nvidia.com>

Merge NCCL EP v0.1.0 release into master

Release notes: https://github.com/NVIDIA/nccl/releases/tag/nccl-ep-v0.1.0

Signed-off-by: Bharath Ramesh <bhramesh@nvidia.com>

This release adds Python bindings for the new NCCL 2.30 one-sided RMA, Device API (GIN), and elastic communicator features, along with substantially more control over communicator configuration.

- **One-sided RMA (point-to-point)**: New `Communicator.put_signal()`, `Communicator.signal()`, and `Communicator.wait_signal()` methods, plus a `WaitSignalDesc` helper for describing signal values and match operations.
- **NCCL Device API host-side setup**: New `Communicator.create_dev_comm()` that produces a `DevCommResource` for use with device-side NCCL kernels. Configure the device communicator through the new `NCCLDevCommRequirements` class, and introspect support via the `device_api_support`, `gin_type`, `railed_gin_type`, `host_rma_support`, and `n_lsa_teams` properties.
- **Device pointer access for registered windows**: `RegisteredWindowHandle` now exposes `user_ptr`, `get_lsa_device_pointer()`, `get_lsa_multimem_device_pointer()`, and `get_peer_device_pointer()` for direct access to LSA, multimem, and peer mappings.
- **Elastic and fault-tolerant communicators**: New `Communicator.grow()`, `revoke()`, `suspend()`, and `resume()` methods to support elastic topology changes and error-handling flows. `CommSuspendFlag` added alongside existing `CommShrinkFlag`.
- **More flexible construction**: In addition to `init()`, communicators can now be created with class method `init_all()` and instance method `initialize()`. `Communicator.get_mem_stat()` reports per-communicator memory statistics.

New tuning knobs on `NCCLConfig`:

- `graph_usage_mode`, `num_rma_ctx`, `max_p2p_peers`.

`NCCLDevCommRequirements`, passed to `Communicator.create_dev_comm()` to describe the resources and capabilities a device communicator needs:

- LSA: `lsa_multimem`, `barrier_count`, `lsa_barrier_count`, `rail_gin_barrier_count`, `world_gin_barrier_count`, `lsa_ll_a2a_block_count`, `lsa_ll_a2a_slot_count`.
- GIN: `gin_force_enable`, `gin_context_count`, `gin_signal_count`, `gin_counter_count`, `gin_queue_depth`, `gin_connection_type`, `gin_exclusive_contexts`.

New `Communicator` properties: `cuda_dev`, `nvml_dev`, `device_api_support`, `multimem_support`, `gin_type`, `railed_gin_type`, `n_lsa_teams`, `host_rma_support`.

- `CTAPolicy` is now an `IntFlag` (was `IntEnum`) so multiple policies can be combined.
- Interop submodules `nccl.core.cupy` and `nccl.core.torch` are now lazy-loaded via `__getattr__` and only imported on first attribute access, so `import nccl.core` no longer pulls in CuPy or PyTorch.

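The last two items rely on standard Python mechanisms, so they can be illustrated without NCCL installed. The sketch below is not nccl4py code: `DemoCTAPolicy`, `DemoOldPolicy`, their member names, and the `make_lazy_package` helper are hypothetical, and `json` merely stands in for a heavy optional dependency such as CuPy or PyTorch. It shows why an `IntFlag` lets values be combined with `|`, and how a module-level `__getattr__` (PEP 562) can defer an import until first attribute access.

```python
import enum
import importlib
import types


# Hypothetical enums; the member names are NOT taken from nccl4py.
class DemoOldPolicy(enum.IntEnum):
    EFFICIENCY = 1
    ZERO = 2


class DemoCTAPolicy(enum.IntFlag):
    DEFAULT = 0
    EFFICIENCY = 1
    ZERO = 2


# IntEnum members degrade to a plain int when OR-ed together ...
old_combined = DemoOldPolicy.EFFICIENCY | DemoOldPolicy.ZERO
# ... whereas IntFlag members stay inside the flag type and support `in`.
combined = DemoCTAPolicy.EFFICIENCY | DemoCTAPolicy.ZERO


def make_lazy_package(name, lazy_submodules):
    """Build a module whose listed attributes are imported on first access.

    `lazy_submodules` maps attribute name -> importable module name.
    """
    module = types.ModuleType(name)

    def __getattr__(attr):
        # Python calls this only when normal lookup on the module fails.
        if attr in lazy_submodules:
            loaded = importlib.import_module(lazy_submodules[attr])
            setattr(module, attr, loaded)  # cache so later lookups skip this hook
            return loaded
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    module.__getattr__ = __getattr__
    return module


# `json` is a placeholder for an expensive optional dependency.
demo_pkg = make_lazy_package("demo_pkg", {"heavy": "json"})
```

Before `demo_pkg.heavy` is touched, the attribute is absent from the module namespace. The first access triggers the import and caches the result, which is the behavior the release note describes for `import nccl.core`.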
Fix compatibility issue with cuda.core 0.5.0

cuda.core 0.5.0 removed "experimental" from the module path and added experimental/__init__.py for compatibility, but cuda.core.experimental._stream.IsStreamT and cuda.core.experimental._memory.DevicePointerT are not included in it, which leads to a compatibility issue.

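One common way to tolerate a module-path change like this is to try several candidate locations in order. The following is a minimal sketch of that pattern, not the fix made in this commit. `import_first` is a hypothetical helper, and the non-experimental paths (`cuda.core._stream`, `cuda.core._memory`) are an assumption about where the names live after 0.5.0; only the `experimental` paths come from the commit message.

```python
import importlib


def import_first(attr, module_candidates):
    """Return `attr` from the first module in `module_candidates` that provides it."""
    errors = []
    for module_name in module_candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            # Remember why this candidate failed, then try the next one.
            errors.append(f"{module_name}: {exc}")
    raise ImportError(f"could not locate {attr!r}; tried " + "; ".join(errors))


# Assumed layouts: the first entry of each tuple is a guess at the post-0.5.0
# location, the second is the pre-0.5.0 path named in the commit message.
STREAM_CANDIDATES = ("cuda.core._stream", "cuda.core.experimental._stream")
MEMORY_CANDIDATES = ("cuda.core._memory", "cuda.core.experimental._memory")

# Usage, when cuda.core is installed:
#   IsStreamT = import_first("IsStreamT", STREAM_CANDIDATES)
#   DevicePointerT = import_first("DevicePointerT", MEMORY_CANDIDATES)
```

Collecting the per-candidate errors keeps the final `ImportError` informative when none of the locations work, which helps when a future release moves the names again.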
NCCL v2.29.2-1 Release

Device API Improvements:

- Supports Device API struct versioning for backwards compatibility with future versions.
- Adds ncclCommQueryProperties to allow Device API users to check supported features before creating a DevComm.
- Adds host-accessible device pointer functions from symmetric registered ncclWindows.
- Adds improved GIN documentation to clarify the support matrix.

New One-Sided Host APIs:

- Adds new host APIs (ncclPutSignal, ncclWaitSignal, etc.) for both network and NVL using zero-SM.
- One-sided communication operation writes data from the local buffer to a remote peer’s registered memory window without explicit participation from the target process.
- Utilizes CopyEngine for NVL transfer and CPU proxy for network.
- Requires CUDA 12.5 or greater.

New Experimental Python language binding (NCCL4Py):

- Pythonic NCCL API for Python applications: native collectives, P2P and other NCCL operations.
- Interoperable with CUDA Python ecosystem: DLPack/CUDA Array Interface, and special support for PyTorch and CuPy.
- Automatic cleanup of NCCL-managed resources (GPU buffers, registered buffers/windows, custom reduction operations).

New LLVM intermediate representation (IR) support:

- Exposes NCCL Device APIs through LLVM IR to enable consumption by diverse code generation systems.
- Example usages include high-level languages, Just-In-Time (JIT) compilers, and domain-specific languages (DSL).
- Build with EMIT_LLVM_IR=1 to generate LLVM IR bitcode.
- Requires CUDA 12 and Clang 21.

Built-in hybrid (LSA+GIN) symmetric kernel for AllGather:

- Adds a new hierarchical kernel using MCRing (NVLS multicast + Ring) to improve performance and scalability of AllGather.
- Requires symmetric memory registration and GIN.

New ncclCommGrow API:

- Adds the ability to dynamically and efficiently add ranks to an existing NCCL communicator.
- Use ncclCommGrow with ncclCommShrink to adjust membership of communicators in response to failing and recovering nodes.
- Also addresses the need for elastic applications to expand a running job by integrating new ranks.

Multi-segment registration:

- Expands buffer registration to support multiple segments of physical memory mapped to one contiguous VA space for the p2p, ib and nvls transports.
- Enables support for expandable segments in PyTorch.

Improves scalability of AllGatherV pattern:

- Adds support for a scalable allgatherv pattern (group of broadcasts).
- Adds new scheduler path and new kernels to improve performance at large scale.

Debuggability & Observability Improvements:

- RAS supports realtime monitoring to continuously track peer status changes.
- Inspector adds support for Prometheus format output (with NCCL_INSPECTOR_PROM_DUMP=1), in addition to the existing JSON format.
- Adds profiler support for CopyEngine (CE) based collectives.

Community Engagement:

- Adds contribution guide: https://github.com/NVIDIA/nccl/blob/master/CONTRIBUTING.md
- Adds NCCL_SOCKET_POLL_TIMEOUT_MSEC which allows waiting instead of spinning during bootstrap in order to reduce CPU usage. (GitHub PR #1759)
- Fixes segfault in ncclGin initialization that can happen if ncclGinIbGdaki.devices() fails after init() succeeds. (GitHub PR #1881)
- Fixes crash that can happen when calling p2p and then collectives while using the same user buffer. (GitHub Issue #1859)
- Fixes bug that was lowering performance on some sm80 or earlier machines with one NIC per GPU. (GitHub Issue #1876)
- Clears non-fatal CUDA errors so they do not propagate. (PyTorch Issue #164402)

Other Improvements:

- Improves performance of large-size AllGather operations using symmetric memory buffers on Blackwell by transparently switching to CE collectives.
- Improves the default number of channels per net peer for all-to-all, send, and recv to achieve better performance.
- Improves performance tuning of 256M-512M message sizes on Blackwell for AllReduce.
- Enables built-in symmetric kernels only on fully connected NVLink systems, as PCIe systems do not perform as well.
- Prints git branch and commit checksum at the INFO level during NCCL initialization.
- Improves support for symmetric window registrations on CUDA versions prior to 12.1.
- Relaxes symmetric buffer registration requirements for collectives so that users can leverage the symmetric kernels with only one of the buffers being registered, when possible.
- All2all, send, recv now obey NCCL_NETDEVS_POLICY. For these operations, NCCL will now by default use a subset of available network devices as dictated by the Network Device Policy.
- Fixes a hang on GB200/300 + CX8 when the user disables GDR.
- Fixes a bug that could cause AllReduce on ncclFloat8e4m3 to yield “no algorithm/protocol available”.
- ncclCommWindowRegister will now return a NULL window if the system does not support window registration.
- More prominent error when cuMulticastBind fails and NCCL_NVLS_ENABLE=2.
- Upgrades to doca gpunetio v1.1.

Known Limitations:

- Since Device API was experimental in 2.28.x, applications that use the Device API in v2.28 may need modifications to work with v2.29.
- One-sided host APIs (e.g. ncclPutSignal) currently do not support graph capture. Future releases will add CUDA graph support.
- The improved AllGatherV support breaks the NCCL profiler support for ncclBroadcast operations, limiting visibility to API events. NCCL_ALLGATHERV_ENABLE=0 can be used as a workaround until it is fixed in a future release.
- NCCL4Py (experimental) has a known issue with cuda.core 0.5.0. We currently recommend using cuda.core 0.4.1 with nccl4py.
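
Several items in these notes are controlled through environment variables, which have to be set before the process that loads NCCL starts. The sketch below shows one way to assemble such an environment from Python. The `nccl_env` helper and its parameter names are hypothetical, and the timeout value in the usage note is an arbitrary placeholder. Only the variable names and the `=1` / `=0` settings come from the notes above, and the Inspector may need further configuration that is not covered here.

```python
import os


def nccl_env(base=None, *, prometheus_dump=False, disable_allgatherv=False,
             socket_poll_timeout_msec=None):
    """Return a copy of `base` (default: os.environ) with selected NCCL variables set."""
    env = dict(os.environ if base is None else base)
    if prometheus_dump:
        # Inspector: Prometheus-format output in addition to JSON.
        env["NCCL_INSPECTOR_PROM_DUMP"] = "1"
    if disable_allgatherv:
        # Workaround named under Known Limitations for the ncclBroadcast profiler gap.
        env["NCCL_ALLGATHERV_ENABLE"] = "0"
    if socket_poll_timeout_msec is not None:
        # Wait instead of spinning during bootstrap; the value is chosen by the caller.
        env["NCCL_SOCKET_POLL_TIMEOUT_MSEC"] = str(int(socket_poll_timeout_msec))
    return env
```

A launcher could then pass the result to a child process, for example `subprocess.run(cmd, env=nccl_env(prometheus_dump=True, socket_poll_timeout_msec=100))`, where `cmd` is the training command and `100` is only an illustrative value. Building a copy avoids mutating the parent process environment.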