We propose an open-source GPU load-balancing framework for applications that exhibit irregular parallelism. These applications and algorithms are fundamental to computing tasks ranging from sparse machine learning and large numerical simulations to graph analytics. The data and data structures that drive these applications exhibit access patterns that map poorly to the GPU's architecture, which is designed for dense, regular patterns.
Prior to this work, the only way to unleash the GPU's full power on these problems was to tightly couple load balancing with work processing. Our proposed load-balancing abstraction decouples the two, supports both static and dynamic schedules, and provides a programmable interface for implementing new load-balancing schedules in the future.
With our open-source framework, we hope not only to improve programmers' productivity when developing irregular-parallel algorithms on the GPU but also to improve the overall performance of such applications by providing a quick path to experimentation with a variety of existing load-balancing techniques. We also hope that separating the concerns of load balancing and work processing within our abstraction makes existing code easier to manage and extend to future architectures.
- OS: Linux (Ubuntu 22.04 / 24.04 tested) or Windows.
- Hardware: NVIDIA GPU with compute capability ≥ 7.0 (Volta or newer).
- Software: CUDA Toolkit ≥ 11.7 and CMake ≥ 3.24.
- CUDA architecture: auto-detected by default (`CMAKE_CUDA_ARCHITECTURES=native`). Override at configure time, e.g. `-DCMAKE_CUDA_ARCHITECTURES=90` for an H100-only build, or `"70;80;90"` for a fat binary.
loops is a header-only library. Thrust, CUB, and libcu++ ship with the CUDA Toolkit and are picked up automatically; no separate fetch step is required. Only cxxopts (CLI parsing) is fetched as an external dependency.
```shell
git clone https://github.com/gunrock/loops.git
cd loops

# Auto-detect the GPU(s) on this host (recommended default).
cmake --preset release-native
cmake --build --preset release-native -j

# Sanity check on the bundled chesapeake matrix.
./build/release-native/bin/loops.spmv.merge_path \
  -m datasets/chesapeake/chesapeake.mtx --validate
```

Other configure presets (`release-h100`, `release-a100`, `release-multi`, `debug-native`, `release-with-tests`, `ci-multi-arch`), a CMake-presets-free fallback, individual example targets, and Docker setup are all covered in the Build documentation.
The four scheduling algorithms (`thread_mapped`, `group_mapped`, `work_oriented`, `merge_path_flat`) talk to the workload through a small layout view contract rather than poking directly at CSR offset arrays. Any struct that exposes the contract (`num_tiles()`, `num_atoms()`, `tile_begin(t)`, `tile_end(t)`, `tile_size(t)`, `tile_end_iter()`) can drive any of the schedules without modification.
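To illustrate why the contract matters, here is a minimal host-side sketch (not the library's in-tree code; `toy_csr_layout` and `count_atoms_per_tile` are hypothetical names) of a layout exposing part of the contract and a serial stand-in for a schedule written only against it:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal layout: tiles are CSR rows, atoms are nonzeros.
// Member names mirror the contract (num_tiles, num_atoms, tile_begin,
// tile_end, tile_size); this is an illustrative host-side sketch, not
// the in-tree layout::csr.
struct toy_csr_layout {
  std::vector<std::size_t> offsets;  // CSR row offsets, size = rows + 1

  std::size_t num_tiles() const { return offsets.size() - 1; }
  std::size_t num_atoms() const { return offsets.back(); }
  std::size_t tile_begin(std::size_t t) const { return offsets[t]; }
  std::size_t tile_end(std::size_t t) const { return offsets[t + 1]; }
  std::size_t tile_size(std::size_t t) const {
    return tile_end(t) - tile_begin(t);
  }
};

// A "schedule" written only against the contract drives any conforming
// layout unchanged; here, a serial loop that visits every atom of every
// tile and counts them.
template <typename layout_t>
std::vector<std::size_t> count_atoms_per_tile(const layout_t& layout) {
  std::vector<std::size_t> counts(layout.num_tiles(), 0);
  for (std::size_t t = 0; t < layout.num_tiles(); ++t)
    for (std::size_t a = layout.tile_begin(t); a < layout.tile_end(t); ++a)
      ++counts[t];  // "process" atom a
  return counts;
}
```

Because `count_atoms_per_tile` never touches `offsets` directly, swapping in a COO- or ELL-style struct with the same members requires no schedule-side changes.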
In-tree layouts:

| Layout | Tile is a... | Atom is a... | Backing container | Example |
|---|---|---|---|---|
| `layout::csr<tile_id_t, atom_id_t>` | row | nonzero | `csr_t` | `spmv.thread_mapped`, `spmv.merge_path`, ... |
| `layout::csc<tile_id_t, atom_id_t>` | column | nonzero | `csc_t` | `spmv.csc_thread_mapped` |
| `layout::coo<tile_id_t, atom_id_t>` | nonzero | nonzero | `coo_t` | `spmv.coo_thread_mapped` |
| `layout::ell<tile_id_t, atom_id_t>` | row | bucketed nonzero (uniform pitch) | `ell_t` | `spmv.ell_thread_mapped`, `spmv.ell_merge_path` |
| `layout::bcsr<tile_id_t, atom_id_t>` | block-row | dense R x C block | `bcsr_t<R, C, ...>` | `spmv.bcsr_thread_mapped` |
| `layout::dia<tile_id_t, atom_id_t>` | row | (row, diagonal) cell | `dia_t` | `spmv.dia_thread_mapped` |
To plug in your own format, write a struct satisfying the contract documented in `include/loops/container/layout.hxx` and pass it as the trailing template argument to `schedule::setup<...>`. A worked example lives in `examples/spmv/custom_layout.cu`.
A partitioner is a layout adaptor that re-bins atoms into a different tile grouping while still satisfying the same contract, so the schedules continue to drive it unchanged. In-tree:
- `loops::layout::flat_uniform_occupancy<K, base_layout_t>`: flatten the base layout's atoms and chunk them into tiles of `K` atoms each (the last tile may be smaller). Tiles can cross the base layout's natural boundaries (e.g., CSR rows), so kernels that need per-atom output addressing should atomic-add via `partitioner.base().tile_of(atom)`.
A worked SpMV example using `flat_uniform_occupancy<8, csr>` lives in `examples/spmv/flat_partitioned.cu`; it shares the standard `thread_mapped` schedule with no schedule-side changes.
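To make the re-binning concrete, here is a hedged host-side sketch of the arithmetic a flat uniform-occupancy partitioner performs: chunking a base layout's flattened atoms into fixed-size tiles of `K` atoms, with the last tile possibly smaller. The struct name `flat_tiles` is hypothetical; the in-tree partitioner's exact members may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of flat uniform-occupancy tiling: ignore the base
// layout's natural tile boundaries (e.g., CSR rows) and re-bin its atoms
// into chunks of K. Names are illustrative, not the in-tree API.
template <std::size_t K>
struct flat_tiles {
  std::size_t total_atoms;  // e.g., the base layout's num_atoms()

  std::size_t num_tiles() const { return (total_atoms + K - 1) / K; }
  std::size_t tile_begin(std::size_t t) const { return t * K; }
  std::size_t tile_end(std::size_t t) const {
    std::size_t end = (t + 1) * K;
    return end < total_atoms ? end : total_atoms;  // last tile may be smaller
  }
  std::size_t tile_size(std::size_t t) const {
    return tile_end(t) - tile_begin(t);
  }
};
```

For `K = 8` and 20 atoms this yields tiles of sizes 8, 8, and 4; a single tile may span several CSR rows, which is why per-atom output addressing must consult the base layout.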
Full documentation is available at gunrock.github.io/loops:
- Building: full CMake-presets table, CUDA-architecture overrides, optional dependencies, and Docker.
- Datasets: fetching the SuiteSparse Matrix Collection.
- Experimentation: running the bundled examples and the sanity check.
- Reproducing Results: re-running the paper's full experiment sweep and regenerating the plots.
- Abstraction: design notes on the tile-atom model.
Thank you for citing our work.
@inproceedings{Osama:2023:APM,
author = {Muhammad Osama and Serban D. Porumbescu and John D. Owens},
title = {A Programming Model for {GPU} Load Balancing},
booktitle = {Proceedings of the 28th ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming},
series = {PPoPP 2023},
year = 2023,
month = feb # "\slash " # mar,
acceptance = {31 of 131 submissions, 23.7\%},
code = {https://github.com/gunrock/loops},
doi = {10.1145/3572848.3577434},
}

@software{Osama:2022:LAP:Code,
author = {Muhammad Osama and Serban D. Porumbescu and John D. Owens},
title = {Loops: A Programming Model for GPU Load Balancing},
month = dec,
year = 2022,
publisher = {Zenodo},
version = {v0.1.0-alpha},
doi = {10.5281/zenodo.7465053},
url = {https://doi.org/10.5281/zenodo.7465053}
}