uccl-project/uccl

About

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., IBGDA), with two key focuses:

  • Flexibility for high performance in fast-evolving ML workloads
  • Portability for connecting heterogeneous GPUs in ML workloads

An overview of UCCL can be found in this slide deck. UCCL has the following components:

  • UCCL-collective serves as a drop-in replacement for NCCL/RCCL (i.e., it requires no changes to application code), and significantly outperforms them in both latency and throughput across various settings.

    UCCL-collective performance comparison
    • On six HGX servers (across two racks) with 8x400G CX-7 RoCE NICs and 8xH100 GPUs, UCCL-collective outperforms NCCL by up to 2.5x for AllReduce.

    • On two AWS g4dn.8xlarge instances with 1x50G ENA NICs and 1xT4 GPUs within the same cluster placement group, UCCL-collective outperforms NCCL by up to 3.7x for AllReduce.

    More specifically, UCCL-collective aims to:

    • rearchitect the CCL layer (while keeping NCCL APIs) to unleash the full potential of network hardware
    • rearchitect the network transport layer to be fast and extensible
    • support heterogeneous GPU and networking vendors such as Nvidia, AMD, and Broadcom
    • become an open and collaborative platform for GPU communication research

    UCCL-collective implements a fast and extensible transport layer in software, which brings several benefits. For example, the existing network transports under NCCL (i.e., kernel TCP and RDMA) stream large data volumes over one or a few network paths, which makes them prone to congestion in datacenter networks. UCCL-collective instead sprays packets in software across many network paths to avoid a "single path of congestion". Its features include: 1) packet spraying across 256 paths, 2) advanced congestion control, such as latency-based and receiver-driven schemes, 3) efficient loss recovery via selective repeat, and 4) broad usability in public clouds with legacy NICs and Ethernet. Feel free to check out our full technical report.

  • UCCL-P2P provides both NIXL-style initiator-target transfer APIs and NCCL-style collective APIs, with performance that matches or exceeds both. UCCL-P2P is purpose-built for next-generation 800 Gbps NICs, with efficient multi-threaded transfer engines.

  • UCCL-EP allows running DeepEP on heterogeneous hardware platforms, including AMD and Nvidia GPUs and any RDMA NICs such as AWS EFA NICs and Broadcom NICs, while achieving IBGDA-level performance. UCCL-EP also makes DeepEP SM-free, devoting all GPU SMs to compute.
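
As a rough illustration of why spraying packets over many paths helps (the idea described under UCCL-collective above), the toy model below spreads the packets of one message over a set of paths and reports the heaviest per-path load. This is a sketch only: the round-robin policy and the names spray_round_robin and max_path_load are assumptions made for illustration, not UCCL's actual path-selection logic or API.

```python
from collections import Counter


def spray_round_robin(num_packets: int, num_paths: int) -> Counter:
    """Assign packets to paths in round-robin order; return packets per path.

    Hypothetical helper: a real transport would also react to congestion
    signals when choosing a path, which this toy model ignores.
    """
    load = Counter()
    for seq in range(num_packets):
        load[seq % num_paths] += 1
    return load


def max_path_load(num_packets: int, num_paths: int) -> int:
    """Return the number of packets carried by the most loaded path."""
    return max(spray_round_robin(num_packets, num_paths).values())


# Compare one path against 256 paths for the same 10,000-packet message.
single_path = max_path_load(10_000, 1)
sprayed = max_path_load(10_000, 256)
print(single_path, sprayed)
```

With a single path, every packet lands on the same link, while spraying caps the load any one path has to carry; that is the intuition behind avoiding a single congested path.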

UCCL has been adopted as part of the AMD TheRock ecosystem.

Road Map

More UCCL features are under development in this repo, currently including:

  • ✅ More efficient KV cache transfer engine (e.g., better Mooncake)
  • 🚧 Generic and SM-free GPU-initiated P2P (e.g., better DeepEP for MoE)
    • 🚧 Supporting all NIC vendors including Nvidia, AWS EFA, and Broadcom
    • 🚧 Avoiding burning precious GPU SMs
  • 🚧 Re-architecting NCCL to unleash network hardware performance
    • 🚧 Scalable and efficient CPU proxy
    • ☐ Fast async collectives with compute-communication ordering guarantee
    • ☐ Device kernels in vendor-agnostic Triton language
  • ☐ Dynamic membership with GPU servers joining and exiting

Quick Start

The easiest way to use UCCL is to first build it for your platform. The build script automatically detects the Python version of your current environment. To compile UCCL for a specific Python version, pass it as the py_version argument, e.g., 3.10.

git clone https://github.com/uccl-project/uccl.git --recursive && cd uccl
bash build_and_install.sh [cuda|rocm|therock] [all|ccl_rdma|ccl_efa|p2p|ep] [py_version] [rocm_index_url]
# E.g., bash build_and_install.sh cuda ep
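
To see which value you would pass as py_version for the interpreter you are currently using, a minimal sketch is shown below. The helper name current_py_version is hypothetical, and whether the build script detects the version in exactly this way is an assumption.

```python
import sys


def current_py_version() -> str:
    """Return the running interpreter's version as 'major.minor' (e.g. '3.10')."""
    return f"{sys.version_info.major}.{sys.version_info.minor}"


print(current_py_version())
```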

Note:

  • When building for ROCm with Python packaging through TheRock, please specify your ROCm index URL; the default is https://rocm.prereleases.amd.com/whl/gfx94X-dcgpu, which may not be what you want. When installing UCCL wheels for TheRock, please provide pip with the index URL and add the optional extra [rocm] to the wheel, e.g., pip install --extra-index-url https://rocm.prereleases.amd.com/whl/gfx94X-dcgpu wheelhouse-therock/uccl-0.0.1.post4-py3-none-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl[rocm].
  • You can build against a different CUDA or ROCm version by specifying a tag such as cuda13 or rocm6. The default versions are CUDA 12.x for the "cuda" tag and ROCm 7.x for the "rocm" tag.

Then, when running your PyTorch applications, set the environment variables that match your setup:

# NCCL over IB/RoCE on x86 or GH200 ARM hosts
NCCL_NET_PLUGIN=`python -c "import uccl; print(uccl.nccl_plugin_path())"`

# RCCL over IB/RoCE on x86 hosts
NCCL_NET_PLUGIN=`python -c "import uccl; print(uccl.rccl_plugin_path())"`

# NCCL over AWS EFA NICs (p4d and p4de only)
LD_PRELOAD=`python -c "import uccl; print(uccl.efa_nccl_path())"`
NCCL_NET_PLUGIN=`python -c "import uccl; print(uccl.efa_plugin_path())"`

Now, you can just run your PyTorch applications and enjoy UCCL performance benefits!
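
If you launch jobs from Python rather than a shell, the same setting can be prepared as an environment dictionary. The sketch below is illustrative: env_with_plugin is a hypothetical helper and the path is a placeholder; with UCCL installed, the path would come from uccl.nccl_plugin_path() (or uccl.rccl_plugin_path() on ROCm), as in the shell commands above.

```python
import os


def env_with_plugin(plugin_path: str, base_env=None) -> dict:
    """Return a copy of the environment with NCCL_NET_PLUGIN set.

    Hypothetical helper; the caller's environment mapping is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_NET_PLUGIN"] = plugin_path
    return env


# Placeholder path for illustration only; replace it with the value
# returned by uccl.nccl_plugin_path() when UCCL is installed.
env = env_with_plugin("/path/to/plugin.so", base_env={})
print(env["NCCL_NET_PLUGIN"])
```

The resulting dictionary can be passed as the env argument of subprocess.run when starting your training command.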

Dev Guide

First clone the UCCL repo and init submodules:

git clone https://github.com/uccl-project/uccl.git --recursive
export UCCL_HOME=$(pwd)/uccl

To build UCCL for development, you need to install some common dependencies:

# Note: if you are using the docker+wheel build, you do not need to install the following dependencies.
sudo apt update
sudo apt install linux-tools-$(uname -r) clang llvm cmake m4 build-essential \
                 net-tools libgoogle-glog-dev libgtest-dev libgflags-dev \
                 libelf-dev libpcap-dev libc6-dev-i386 libpci-dev \
                 libopenmpi-dev libibverbs-dev clang-format -y

# Install and activate Miniconda (you can choose any recent versions)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash ./Miniconda3-latest-Linux-x86_64.sh -b
source ~/miniconda3/bin/activate
conda init
source ~/.bashrc # or .zshrc and others

# Install the Python SSH library (paramiko) and pybind11
pip install paramiko pybind11
# Upgrade conda's libstdc++ and libgcc to recent versions
conda install -c conda-forge "libstdcxx-ng>=12" "libgcc-ng>=12"

For a quick installation with Docker, you can go directly to the component you need:

  • UCCL-Collective RDMA: Collectives for Nvidia/AMD GPUs + IB/RoCE RDMA NICs (currently supports Nvidia and Broadcom NICs)

  • UCCL-Collective EFA: Collectives for AWS EFA NICs (currently supports p4d.24xlarge)

    On p5/p5e/p5en/p6, the official aws-ofi-nccl NCCL plugin with proper environment variables already makes NCCL perform very well

  • UCCL-Collective AFXDP: Collectives for non-RDMA NICs (currently supports AWS ENA NICs and IBM VirtIO NICs)

  • UCCL-P2P: P2P for RDMA NICs and GPU IPCs (currently supports Nvidia/AMD GPUs and Nvidia/Broadcom NICs)

  • UCCL-EP: EP for MoE training and inference with DeepEP-compatible APIs (currently supports Nvidia/AMD GPUs and Nvidia/Broadcom/EFA NICs)

Python Wheel Build

Run the following to build Python wheels:

cd $UCCL_HOME
./build.sh [cuda|rocm|therock] [all|rdma|p2p|efa|ep] [py_version] [rocm_index_url]

Run the following to install the wheels locally:

cd $UCCL_HOME
pip install wheelhouse-[cuda/rocm]/uccl-*.whl
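
To check which wheels a build produced before installing them, a small sketch like the following can list them. The helper find_wheels and the wheel file name used in the demo are hypothetical; the wheelhouse-<platform> directory layout follows the install command above.

```python
import tempfile
from pathlib import Path


def find_wheels(uccl_home, platform):
    """Return the sorted UCCL wheel paths under wheelhouse-<platform>."""
    wheel_dir = Path(uccl_home) / f"wheelhouse-{platform}"
    return sorted(wheel_dir.glob("uccl-*.whl"))


# Demo against a temporary directory that mimics the layout; the wheel
# file name below is made up for illustration.
with tempfile.TemporaryDirectory() as home:
    wheel_dir = Path(home) / "wheelhouse-cuda"
    wheel_dir.mkdir()
    (wheel_dir / "uccl-0.0.1-py3-none-any.whl").touch()
    (wheel_dir / "notes.txt").touch()
    found = find_wheels(home, "cuda")
    print([p.name for p in found])
```

In a real checkout you would call find_wheels with the value of UCCL_HOME and the platform you built for.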

The cross-compilation matrix is as follows:

Platform/Feature rdma-cuda rdma-rocm rdma-arm p2p-cuda p2p-rocm p2p-arm efa
cuda + x86 x x
cuda + arm (gh200) x x x x x
rocm + x86 x
aws p4d/p4de x x x

Note that you need ARM hosts to build ARM wheels, because the cross-compilation tool qemu-user-static cannot emulate CUDA or ROCm.

On Cloudlab CPU Machines

If you want to build nccl and nccl-tests on CloudLab Ubuntu 22.04 machines, you need to install CUDA and Open MPI:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo apt install ./cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cuda-toolkit -y
sudo apt install nvidia-driver-550 nvidia-utils-550 -y
sudo apt-get install openmpi-bin openmpi-doc libopenmpi-dev -y

Citation

The code in this repository is mostly described in the paper below. Please consider citing this work if you find the repository helpful.

@article{uccl_transport,
  title={An Extensible Software Transport Layer for GPU Networking},
  author={Zhou, Yang and Chen, Zhongjie and Mao, Ziming and Lao, ChonLam and Yang, Shuo and Kannan, Pravein Govindan and Gao, Jiaqi and Zhao, Yilong and Wu, Yongji and You, Kaichao and Ren, Fengyuan and Xu, Zhiying and Raiciu, Costin and Stoica, Ion},
  journal={arXiv preprint arXiv:2504.17307},
  year={2025}
}

Acknowledgement

UCCL is being actively developed at UC Berkeley Sky Computing Lab and UC Davis ArtSy lab. We enthusiastically welcome open-source developers to join us!

UCCL is generously supported by (in alphabetical order): AMD, AWS, Broadcom, CloudLab, Google Cloud, IBM, Lambda, Mibura.

Contact

Feel free to raise GitHub issues if you have any questions or suggestions.
