Skip to content

[CUB] Refactor DevicePartition::If to always take an environment#9393

Open
miscco wants to merge 3 commits into
NVIDIA:mainfrom
miscco:device_partition_flagged_env
Open

[CUB] Refactor DevicePartition::If to always take an environment#9393
miscco wants to merge 3 commits into
NVIDIA:mainfrom
miscco:device_partition_flagged_env

Conversation

@miscco

@miscco miscco commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

We want to be able to pass tunings to the APIs that take user provided memory

Make sure we can pass any environment or stream type to them

miscco added 3 commits June 11, 2026 08:28
We want to use env based API to ensure that we take advantage of user provided tunings
We want to be able to pass tunings to the APIs that take user provided memory

Make sure we can pass any environment or stream type to them
@miscco miscco requested review from a team as code owners June 11, 2026 08:52
@miscco miscco requested a review from pciolkosz June 11, 2026 08:52
@miscco miscco requested a review from elstehle June 11, 2026 08:52
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 11, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 11, 2026
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Overview

This PR refactors CUB and libcudacxx's device algorithms to consistently use environment-based execution dispatch instead of explicit CUDA stream parameters. The primary goal is to enable callers to pass tunings and custom execution policies when user-provided temporary storage is supplied.

Key Changes

Core Dispatch Infrastructure (cub/detail/env_dispatch.cuh)

  • Modified dispatch_with_env() and dispatch_with_env_and_tuning() to accept environment as const EnvT& instead of by value, eliminating unnecessary copies.
  • Added new overload variants of dispatch_with_env() and dispatch_with_env_and_tuning() to handle the "user-provided memory" case: these query the stream and tuning from the environment and perform single-phase calls using the provided (d_temp_storage, temp_storage_bytes) parameters.

DevicePartition (cub/device/device_partition.cuh)

  • Breaking change: DevicePartition::If now accepts const EnvT& env = {} instead of cudaStream_t stream in both the two-partition and three-way partition overloads.
  • Both temporary storage sizing and kernel execution paths now route through detail::dispatch_with_env_and_tuning() with a policy selector built from iterator/value/offset types.
  • Environment-based convenience overloads updated to take env as const reference.
  • Enables arbitrary environment and stream types (not just raw cudaStream_t).

DeviceFind (cub/device/device_find.cuh)

  • Updated FindIf, LowerBound, and UpperBound to accept environment-based execution via const EnvT& env parameter.
  • Core overloads now use detail::dispatch_with_env_and_tuning() with appropriate policy selectors.
  • Environment-based convenience overloads updated to take env as const reference.

PSTL Integration (libcudacxx)

Updated all CUDA backend implementations to use the new environment-based APIs:

  • find_if.h: Uses environment-based DeviceFind::FindIf with __policy parameter instead of stream.
  • partition.h: Passes execution __policy to DevicePartition::If instead of raw stream.
  • partition_copy.h and stable_partition.h: Similarly updated to pass __policy instead of __stream.get().

Test Coverage

  • Expanded catch2_test_device_find_env.cu with comprehensive two-stage testing: first query required temporary storage with nullptr, then allocate and rerun with user-provided memory.
  • New environment tests cover multiple environment forms: cudaStream_t, cuda::stream, cuda::stream_ref, cuda::std::execution::env, and cuda::execution::gpu policies.
  • Added new test case to catch2_test_device_partition_if.cu validating user-provided memory and various environment types.

API Compatibility

The changes replace stream-based APIs with environment-based alternatives. Existing code using raw cudaStream_t will need to adapt, but the environment parameter defaults to an empty execution environment, maintaining backward compatibility for simple cases.

Files Modified

  • cub/cub/detail/env_dispatch.cuh (+45/-2)
  • cub/cub/device/device_find.cuh (+83/-45)
  • cub/cub/device/device_partition.cuh (+59/-42)
  • cub/test/catch2_test_device_find_env.cu (+310/-8)
  • cub/test/catch2_test_device_partition_if.cu (+99/-0)
  • libcudacxx/include/cuda/std/__pstl/cuda/find_if.h (+8/-8)
  • libcudacxx/include/cuda/std/__pstl/cuda/partition.h (+2/-2)
  • libcudacxx/include/cuda/std/__pstl/cuda/partition_copy.h (+2/-2)
  • libcudacxx/include/cuda/std/__pstl/cuda/stable_partition.h (+3/-3)

Walkthrough

This PR converts CUB dispatch and device algorithms from explicit CUDA stream parameters to environment-based execution. The infrastructure layer adds const-reference environment parameters and user-memory dispatch overloads, propagated through FindIf/LowerBound/UpperBound and two-way/three-way Partition algorithms, integrated into PSTL backends, and validated by expanded test coverage across multiple environment types.

Changes

Environment-based execution in CUB

Layer / File(s) Summary
Dispatch infrastructure: const-ref environments and user-memory overloads
cub/cub/detail/env_dispatch.cuh
dispatch_with_env and dispatch_with_env_and_tuning changed to accept environment by const reference. New overloads added for user-provided temporary storage: they query stream and tuning from the environment, select policy via DefaultPolicySelector, and invoke the algorithm callable with explicit storage pointers.
DeviceFind: FindIf, LowerBound, UpperBound with environment execution
cub/cub/device/device_find.cuh
DeviceFind::FindIf, LowerBound, and UpperBound main overloads updated to accept const EnvT& env (defaulting to cuda::std::execution::env<>), route through detail::dispatch_with_env_and_tuning with policy selection, and move null-storage handling into env-dispatch lambdas. Convenience overloads pass environment by const reference.
DevicePartition: If overloads with environment execution
cub/cub/device/device_partition.cuh
Two-way and three-way DevicePartition::If device storage overloads updated to accept const EnvT& env instead of cudaStream_t. Default policy selectors built from iterator/value types, invoked via detail::dispatch_with_env_and_tuning, with policy forwarded into partition dispatch. Environment-based overloads pass environment by const reference.
PSTL backends: propagate policy to CUB device algorithms
libcudacxx/include/cuda/std/__pstl/cuda/find_if.h, partition.h, partition_copy.h, stable_partition.h
CUDA PSTL dispatch implementations updated to pass execution __policy instead of __stream.get() to CUB DeviceFind and DevicePartition calls during both temporary storage sizing and kernel invocation. find_if updated to select CUB offset type and use typed nullptr for output pointer.
Tests: DeviceFind and DevicePartition with environment and user-provided memory
cub/test/catch2_test_device_find_env.cu, catch2_test_device_partition_if.cu
Test coverage expanded to validate algorithms with user-provided temporary storage and multiple environment forms (CUDA stream, cuda::stream, cuda::stream_ref, cuda::std::execution::env, cuda::execution::gpu policies). Existing tests refactored into unprovided vs. user-provided memory sections; new C2H_TESTs verify nullptr queries and allocated-storage runs with CUDA error and synchronization checks.

Suggested reviewers

  • shwina
  • gevtushenko

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
cub/cub/detail/env_dispatch.cuh (1)

117-129: suggestion: This new user-memory path now feeds policy selection through the env/tuning dispatcher, so it can change the kernel variant selected for DeviceFind and DevicePartition. Please run the SASS-diff and benchmark flow for the affected Device* benchmarks before merge. As per coding guidelines, **/*.{cpp,cu,cuh}: Do not commit SASS code changes without running benchmarks to check for performance regressions.

Source: Coding guidelines


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 18345e1e-685e-4c8c-9fd7-da9a89f1a7dd

📥 Commits

Reviewing files that changed from the base of the PR and between d2a0a8a and ba1593e.

📒 Files selected for processing (9)
  • cub/cub/detail/env_dispatch.cuh
  • cub/cub/device/device_find.cuh
  • cub/cub/device/device_partition.cuh
  • cub/test/catch2_test_device_find_env.cu
  • cub/test/catch2_test_device_partition_if.cu
  • libcudacxx/include/cuda/std/__pstl/cuda/find_if.h
  • libcudacxx/include/cuda/std/__pstl/cuda/partition.h
  • libcudacxx/include/cuda/std/__pstl/cuda/partition_copy.h
  • libcudacxx/include/cuda/std/__pstl/cuda/stable_partition.h

Comment on lines +200 to +213
auto test_find_if = [&](const auto& env) {
size_t num_bytes = 0;
error = cub::DeviceFind::FindIf(nullptr, num_bytes, d_in.begin(), d_out.begin(), predicate, num_items, env);
REQUIRE(error == cudaSuccess);
REQUIRE(cudaSuccess == cudaPeekAtLastError());
REQUIRE(cudaSuccess == cudaDeviceSynchronize());
REQUIRE(num_bytes == expected_bytes_allocated);

error = cub::DeviceFind::FindIf(temp_storage, num_bytes, d_in.begin(), d_out.begin(), predicate, num_items, env);
REQUIRE(error == cudaSuccess);
REQUIRE(cudaSuccess == cudaPeekAtLastError());
REQUIRE(cudaSuccess == cudaDeviceSynchronize());
REQUIRE(d_out[0] == 5);
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: These tests assume temp-storage size is identical across all environment forms (num_bytes == expected_bytes_allocated) and reuse a buffer sized from a baseline query. The env dispatch path can legally select different policy/tuning per env, so required bytes may differ; this makes the test brittle and can fail valid implementations. Query and allocate temp storage per env invocation instead of enforcing cross-env equality.
As per coding guidelines, cub/**/* reviews should prioritize stream behavior and test-coverage risks, and this assertion over-constrains valid env behavior.

Also applies to: 309-340, 436-467

Source: Coding guidelines

@github-actions

Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 29m: Pass: 100%/343 | Total: 6d 17h | Max: 1h 28m | Hits: 69%/714109

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

1 participant