We propose an open-source GPU load-balancing framework for applications that exhibit irregular parallelism. These applications and algorithms are fundamental to computing tasks ranging from sparse machine learning and large numerical simulations to graph analytics. The data and data structures that drive these applications exhibit access patterns that map poorly to the GPU's architecture, which is designed for dense, regular patterns.
Prior to this work, the only way to unleash the GPU's full power on these problems was to tightly couple load balancing with work processing. Our proposed load-balancing abstraction decouples the two, supports both static and dynamic schedules, and provides a programmable interface for implementing new load-balancing schedules in the future.
With our open-source framework, we hope not only to improve programmers' productivity when developing irregular-parallel algorithms on the GPU but also to improve the overall performance of such applications by providing a quick path to experimentation with a variety of existing load-balancing techniques. We also hope that separating the concerns of load balancing and work processing within our abstraction makes existing code easier to manage and extend to future architectures.
- OS: Linux (Ubuntu 22.04 / 24.04 tested) or Windows.
- Hardware: NVIDIA GPU with compute capability ≥ 7.0 (Volta or newer).
- Software: CUDA Toolkit ≥ 11.7 and CMake ≥ 3.24.
- CUDA architecture: auto-detected by default (`CMAKE_CUDA_ARCHITECTURES=native`). Override at configure time, e.g. `-DCMAKE_CUDA_ARCHITECTURES=90` for an H100-only build, or `"70;80;90"` for a fat binary.
loops is a header-only library. Thrust, CUB, and libcu++ ship with the CUDA Toolkit and are picked up automatically; no separate fetch step is required. Only cxxopts (CLI parsing) is fetched as an external dependency.
```shell
git clone https://github.com/gunrock/loops.git
cd loops

# Auto-detect the GPU(s) on this host (recommended default).
cmake --preset release-native
cmake --build --preset release-native -j

# Sanity check on the bundled chesapeake matrix.
./build/release-native/bin/loops.spmv.merge_path \
  -m datasets/chesapeake/chesapeake.mtx --validate
```

Other configure presets (`release-h100`, `release-a100`, `release-multi`, `debug-native`, `release-with-tests`, `ci-multi-arch`), a CMake-presets-free fallback, individual example targets, and Docker setup are all covered in the Build documentation.
The four scheduling algorithms (`thread_mapped`, `group_mapped`, `work_oriented`, `merge_path_flat`) talk to the workload through a small layout view contract rather than poking directly at CSR offset arrays. Any struct that exposes the contract (`num_tiles()`, `num_atoms()`, `tile_begin(t)`, `tile_end(t)`, `tile_size(t)`, `tile_end_iter()`) can drive any of the schedules without modification.
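To illustrate why the contract matters, here is a minimal host-side sketch (not the library's in-tree code; `toy_csr_layout` and `count_atoms_per_tile` are hypothetical names) of a layout exposing part of the contract and a serial stand-in for a schedule written only against it:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal layout: tiles are CSR rows, atoms are nonzeros.
// Member names mirror the contract (num_tiles, num_atoms, tile_begin,
// tile_end, tile_size); this is an illustrative host-side sketch, not
// the in-tree layout::csr.
struct toy_csr_layout {
  std::vector<std::size_t> offsets;  // CSR row offsets, size = rows + 1

  std::size_t num_tiles() const { return offsets.size() - 1; }
  std::size_t num_atoms() const { return offsets.back(); }
  std::size_t tile_begin(std::size_t t) const { return offsets[t]; }
  std::size_t tile_end(std::size_t t) const { return offsets[t + 1]; }
  std::size_t tile_size(std::size_t t) const {
    return tile_end(t) - tile_begin(t);
  }
};

// A "schedule" written only against the contract drives any conforming
// layout unchanged; here, a serial loop that visits every atom of every
// tile and counts them.
template <typename layout_t>
std::vector<std::size_t> count_atoms_per_tile(const layout_t& layout) {
  std::vector<std::size_t> counts(layout.num_tiles(), 0);
  for (std::size_t t = 0; t < layout.num_tiles(); ++t)
    for (std::size_t a = layout.tile_begin(t); a < layout.tile_end(t); ++a)
      ++counts[t];  // "process" atom a
  return counts;
}
```

Because `count_atoms_per_tile` never touches `offsets` directly, swapping in a COO- or ELL-style struct with the same members requires no schedule-side changes.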
In-tree layouts:

| Layout | Tile is a... | Atom is a... | Backing container | Example |
|---|---|---|---|---|
| `layout::csr<tile_id_t, atom_id_t>` | row | nonzero | `csr_t` | `spmv.thread_mapped`, `spmv.merge_path`, ... |
| `layout::csc<tile_id_t, atom_id_t>` | column | nonzero | `csc_t` | `spmv.csc_thread_mapped` |
| `layout::coo<tile_id_t, atom_id_t>` | nonzero | nonzero | `coo_t` | `spmv.coo_thread_mapped` |
| `layout::ell<tile_id_t, atom_id_t>` | row | bucketed nonzero (uniform pitch) | `ell_t` | `spmv.ell_thread_mapped`, `spmv.ell_merge_path` |
| `layout::bcsr<tile_id_t, atom_id_t>` | block-row | dense R x C block | `bcsr_t<R, C, ...>` | `spmv.bcsr_thread_mapped` |
| `layout::dia<tile_id_t, atom_id_t>` | row | (row, diagonal) cell | `dia_t` | `spmv.dia_thread_mapped` |
To plug in your own format, write a struct satisfying the contract documented in `include/loops/container/layout.hxx` and pass it as the trailing template argument to `schedule::setup<...>`. A worked example lives in `examples/spmv/custom_layout.cu`.
A partitioner is a layout adaptor that re-bins atoms into a different tile grouping while still satisfying the same contract, so the schedules continue to drive it unchanged. In-tree:
- `loops::layout::flat_uniform_occupancy<K, base_layout_t>`: flatten the base layout's atoms and chunk them into tiles of `K` atoms each (the last tile may be smaller). Tiles can cross the base layout's natural boundaries (e.g., CSR rows), so kernels that need per-atom output addressing should atomic-add via `partitioner.base().tile_of(atom)`.
A worked SpMV example using `flat_uniform_occupancy<8, csr>` lives in `examples/spmv/flat_partitioned.cu`; it shares the standard `thread_mapped` schedule with no schedule-side changes.
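To make the re-binning concrete, here is a hedged host-side sketch of the arithmetic a flat uniform-occupancy partitioner performs: chunking a base layout's flattened atoms into fixed-size tiles of `K` atoms, with the last tile possibly smaller. The struct name `flat_tiles` is hypothetical; the in-tree partitioner's exact members may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of flat uniform-occupancy tiling: ignore the base
// layout's natural tile boundaries (e.g., CSR rows) and re-bin its atoms
// into chunks of K. Names are illustrative, not the in-tree API.
template <std::size_t K>
struct flat_tiles {
  std::size_t total_atoms;  // e.g., the base layout's num_atoms()

  std::size_t num_tiles() const { return (total_atoms + K - 1) / K; }
  std::size_t tile_begin(std::size_t t) const { return t * K; }
  std::size_t tile_end(std::size_t t) const {
    std::size_t end = (t + 1) * K;
    return end < total_atoms ? end : total_atoms;  // last tile may be smaller
  }
  std::size_t tile_size(std::size_t t) const {
    return tile_end(t) - tile_begin(t);
  }
};
```

For `K = 8` and 20 atoms this yields tiles of sizes 8, 8, and 4; a single tile may span several CSR rows, which is why per-atom output addressing must consult the base layout.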
Full documentation is available at gunrock.github.io/loops:
- Building: full CMake-presets table, CUDA-architecture overrides, optional dependencies, and Docker.
- Datasets: fetching the SuiteSparse Matrix Collection.
- Experimentation: running the bundled examples and the sanity check.
- Reproducing Results: re-running the paper's full experiment sweep and regenerating the plots.
- Abstraction: design notes on the tile-atom model.
Thank you for citing our work.
@inproceedings{Osama:2023:APM,
author = {Muhammad Osama and Serban D. Porumbescu and John D. Owens},
title = {A Programming Model for {GPU} Load Balancing},
booktitle = {Proceedings of the 28th ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming},
series = {PPoPP 2023},
year = 2023,
month = feb # "\slash " # mar,
acceptance = {31 of 131 submissions, 23.7\%},
code = {https://github.com/gunrock/loops},
doi = {10.1145/3572848.3577434},
}

@software{Osama:2022:LAP:Code,
author = {Muhammad Osama and Serban D. Porumbescu and John D. Owens},
title = {Loops: A Programming Model for GPU Load Balancing},
month = dec,
year = 2022,
publisher = {Zenodo},
version = {v0.1.0-alpha},
doi = {10.5281/zenodo.7465053},
url = {https://doi.org/10.5281/zenodo.7465053}
}