“What I cannot create I do not understand” - This is why I started Penny, my own version of NCCL.
If you want to read about it, there is a worklog on my blog where I describe the step-by-step process of creating it:
To install Penny, you need to export the NVSHMEM_LIB and NVSHMEM_INC environment variables, pointing to the lib and include directories of your NVSHMEM installation.
Afterwards, clone the repository and install it in editable mode:
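As a minimal sketch of that export step: the snippet below assumes NVSHMEM lives under a single prefix, held in a hypothetical NVSHMEM_HOME variable that defaults to /opt/nvshmem. Both the variable and the default path are placeholders for illustration, not something Penny reads itself, so substitute your own location.

```shell
# Hypothetical install prefix: replace with wherever NVSHMEM lives on your machine.
NVSHMEM_HOME="${NVSHMEM_HOME:-/opt/nvshmem}"

# The two variables Penny's build expects.
export NVSHMEM_LIB="$NVSHMEM_HOME/lib"
export NVSHMEM_INC="$NVSHMEM_HOME/include"

# Warn early if either directory is missing, rather than finding out mid-build.
for dir in "$NVSHMEM_LIB" "$NVSHMEM_INC"; do
    if [ ! -d "$dir" ]; then
        echo "warning: $dir does not exist" >&2
    fi
done
```

The check only prints a warning and does not abort, so it is safe to paste into a shell profile.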
git clone https://github.com/SzymonOzog/Penny.git
cd Penny
pip install -e . --no-build-isolation
Penny provides a drop-in replacement for the vLLM/SGLang custom all-reduce class that allows it to run multi-node. For SGLang, there is a patch you can apply to get it running:
cd YOUR_SGLANG_DIR
git apply YOUR_PENNY_DIR/extra/sglang.patch
You also need to export the number of nodes you're running on (currently up to 4 nodes are templated and tested; for more, edit extra/custom_all_reduce.cuh at your own risk):
export NNODES=2
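If you set NNODES from a launch script, a small guard like the one below can flag values outside the tested range before the server starts. This is only a sketch: the default of 2 and the assumption that 1 through 4 are the accepted values are mine, inferred from the "up to 4 nodes" note above, and Penny does not ship this check.

```shell
# Fall back to 2 nodes for this sketch if NNODES is not already set.
export NNODES="${NNODES:-2}"

# Assumed tested range: 1 to 4 nodes.
case "$NNODES" in
    1|2|3|4)
        echo "NNODES=$NNODES is within the tested range"
        ;;
    *)
        echo "warning: NNODES=$NNODES is outside the tested range (up to 4)" >&2
        ;;
esac
```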
Afterwards, you can serve your favourite model with low-latency allreduce.