Skip to content

[CUB] Refactor DeviceSelect::Unique to always take an environment#9386

Open
miscco wants to merge 4 commits into
NVIDIA:mainfrom
miscco:device_select_unique_env
Open

[CUB] Refactor DeviceSelect::Unique to always take an environment#9386
miscco wants to merge 4 commits into
NVIDIA:mainfrom
miscco:device_select_unique_env

Conversation

@miscco

@miscco miscco commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

This refactors cub::DeviceSelect::Unique to always take an environment instead of just a stream.

With that we can take advantage of tunings even for those overloads.

We also refactor the PSTL algorithms that use this API to always pass the policy so a user can tune them

@miscco miscco requested review from a team as code owners June 11, 2026 07:02
@miscco miscco requested review from NaderAlAwar and griwes June 11, 2026 07:02
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 11, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 11, 2026
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 402ed7f8-6bfa-4583-8c67-f391996b7772

📥 Commits

Reviewing files that changed from the base of the PR and between 1a3118a and ad308bb.

📒 Files selected for processing (5)
  • cub/test/catch2_test_device_select_unique.cu
  • libcudacxx/include/cuda/std/__pstl/cuda/unique.h
  • libcudacxx/include/cuda/std/__pstl/cuda/unique_copy.h
  • libcudacxx/include/cuda/std/__pstl/dispatch.h
  • libcudacxx/include/cuda/std/__pstl/unique_copy.h
🚧 Files skipped from review as they are similar to previous changes (5)
  • libcudacxx/include/cuda/std/__pstl/unique_copy.h
  • libcudacxx/include/cuda/std/__pstl/dispatch.h
  • libcudacxx/include/cuda/std/__pstl/cuda/unique_copy.h
  • cub/test/catch2_test_device_select_unique.cu
  • libcudacxx/include/cuda/std/__pstl/cuda/unique.h

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Overview

This PR refactors cub::DeviceSelect::Unique so its API consistently accepts an execution environment (EnvT) instead of a raw CUDA stream. The change enables DeviceSelect and related PSTL call sites to propagate user-provided tunings via the environment/policy. It also updates DeviceFind overloads, the PSTL CUDA backends for find_if/unique, and introduces a CUDA backend specialization for unique_copy. The description notes this should be merged after PR #9318.

Core API & Dispatch Changes

  • env_dispatch.cuh

    • dispatch_with_env and dispatch_with_env_and_tuning now take the environment by const reference (const EnvT&).
    • New overloads accept caller-provided temporary storage (void* d_temp_storage, size_t& temp_storage_bytes, const EnvT& env, ...) and perform only stream/tuning queries before invoking the callable (no allocation phase).
  • DeviceFind (device_find.cuh)

    • FindIf, LowerBound, UpperBound environment overloads now take const EnvT& env = {} (defaulting to cuda::std::execution::env{}) and route through detail::dispatch_with_env / detail::dispatch_with_env_and_tuning.
    • Stream-based signatures replaced by env-based ones; Doxygen updated.
  • DeviceSelect::Unique (device_select.cuh)

    • All environment-based Unique overloads changed from pass-by-value to pass-by-const-reference (const EnvT&).
    • Stream-based device-storage overloads converted to env-based overloads (added EnvT template parameter with SFINAE where applicable).
    • Implementations now use detail::dispatch_with_env_and_tuning(...) and invoke detail::select::dispatch with the selected policy.
    • Documentation updated to reference env parameter.

PSTL & libcudacxx Backends

  • find_if (libcudacxx/include/cuda/std/__pstl/cuda/find_if.h)

    • Uses choose_offset_t for CUB offsets.
    • Passes __policy into DeviceFind::FindIf for temp-storage sizing.
    • Uses __storage.template __get_raw_ptr<0>() and passes __policy instead of stream.
  • unique (libcudacxx/include/cuda/std/__pstl/cuda/unique.h)

    • Simplified __par_impl: now calls CUB DeviceSelect::Unique directly and returns __first + __num_selected.
    • Removed prior SelectImpl plumbing and the unique_copy overload from this dispatcher.
  • unique_copy (new: libcudacxx/include/cuda/std/__pstl/cuda/unique_copy.h)

    • New CUDA backend specialization for __pstl_dispatch<__unique_copy, __cuda>.
    • Implements __par_impl using DeviceSelect::Unique with temporary-storage sizing, allocation, async copy of selected-count, and returns advanced output iterator.
    • Error handling converts cudaErrorMemoryAllocation to std::bad_alloc; other cuda errors rethrown as cuda::cuda_error.
    • Public operator() requires forward-traversal of output iterator and handles random-access iterator cases specially; otherwise contains a dependent static_assert/fallback to cuda::std::unique_copy.
  • Dispatch registrations and headers

    • Added __unique_copy enumerator to __pstl_algorithm (dispatch.h).
    • unique_copy.h now included by the public unique_copy header and dispatch selection corrected to __unique_copy.

Tests

  • catch2_test_device_find_env.cu

    • Expanded to validate both "no user memory" and "user-provided temporary storage" invocation paths across multiple environment types (raw cudaStream_t, cuda::stream, cuda::stream_ref, cuda::std::execution::env, cuda::execution::gpu, and stream-bound policies).
    • Adds SECTION-based coverage and asserts cudaPeekAtLastError()/cudaDeviceSynchronize() and result correctness.
  • catch2_test_device_select_unique.cu

    • Adds a new test "DeviceSelect::Unique works with user provided memory and environments" exercising DeviceSelect::Unique with user-supplied temporary storage across multiple environments and comparator variants; verifies counts and outputs against std::unique.

Public API surface changes

  • Several function signatures updated to accept const EnvT& rather than EnvT or cudaStream_t (DeviceFind and DeviceSelect::Unique overloads). Some previously stream-based overloads were converted to env-based overloads (new EnvT template parameters and SFINAE where applicable).
  • New/additional internal dispatch overloads in env_dispatch.cuh for caller-provided temporary storage.

Notes for reviewers

  • The PR is primarily API-shaping: prefer careful review of overload resolution, SFINAE/enables, and call-site adjustments in PSTL backends.
  • Pay attention to temporary-storage sizing and the new single-phase caller-storage path (no allocation) introduced in env_dispatch.cuh.
  • This change relies on merging after PR #9318 (as noted in the branch/commit message).

suggestion:

Walkthrough

This PR changes env dispatch helpers to accept environments by const-ref and add caller-owned temp-storage overloads; converts DeviceFind/DeviceSelect APIs to env-based signatures; adapts PSTL CUDA backends (find_if, unique) and adds a new CUDA unique_copy backend; and extends tests for user-provided temporary storage across environment variants.

Changes

Environment-based API migration

Layer / File(s) Summary
Dispatch infrastructure: const-ref env and user storage
cub/cub/detail/env_dispatch.cuh
Dispatch helpers now accept const EnvT& and add single-phase overloads that accept caller-provided d_temp_storage/temp_storage_bytes, querying stream/tuning only.
DeviceFind: stream-based to environment-based
cub/cub/device/device_find.cuh, cub/test/catch2_test_device_find_env.cu, libcudacxx/include/cuda/std/__pstl/cuda/find_if.h
FindIf/LowerBound/UpperBound converted to EnvT-based templates (defaulting to cuda::std::execution::env<>), implementations call detail::dispatch_with_env / _and_tuning; PSTL find_if updated to pass policy and use raw storage access; tests expanded to exercise null-sizing and caller-provided storage across env variants.
DeviceSelect::Unique: stream-based to environment-based
cub/cub/device/device_select.cuh, cub/test/catch2_test_device_select_unique.cu
Device-storage Unique overloads converted from stream-based to env-based (adds EnvT template constraints); env parameters are const EnvT&; implementations route through dispatch_with_env_and_tuning and call detail::select::dispatch with the tuned stream/policy; tests added for user-provided storage across environments and comparators.
PSTL unique: refactor to use DeviceSelect::Unique
libcudacxx/include/cuda/std/__pstl/cuda/unique.h
CUDA __par_impl simplified to call DeviceSelect::Unique directly, compute __num_selected using device temp-storage, and return __first + __num_selected; removed the prior unique_copy overload from unique dispatch.
PSTL unique_copy: new CUDA backend dispatch
libcudacxx/include/cuda/std/__pstl/dispatch.h, libcudacxx/include/cuda/std/__pstl/cuda/unique_copy.h, libcudacxx/include/cuda/std/__pstl/unique_copy.h
Adds __unique_copy tag and a CUDA PSTL backend that uses DeviceSelect::Unique with temp-storage management, async selected-count copy, synchronization, iterator advancement, and allocation-to-exception mapping.

Possibly related PRs

  • NVIDIA/cccl#8880: touches select/DeviceSelect dispatch and tuning API interactions relevant to env/tuning forwarding.

Suggested reviewers

  • NaderAlAwar
  • gevtushenko
  • davebayer

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
cub/test/catch2_test_device_select_unique.cu (1)

210-212: ⚡ Quick win

suggestion: Remove unused variables boundary and num_selected_std. The test validates output by comparing vectors (lines 241, 265) but never asserts the count matches the std::unique result. Either remove these variables or add REQUIRE(num_selected_out[0] == num_selected_std) assertions like other tests in this file do (see lines 328, 335).

libcudacxx/include/cuda/std/__pstl/cuda/unique.h (2)

93-102: 💤 Low value

suggestion: Error message on line 95 still references cub::DispatchSelectIf::Dispatch but the code now calls cub::DeviceSelect::Unique. Update for clarity:

       _CCCL_TRY_CUDA_API(
         CUB_NS_QUALIFIER::DeviceSelect::Unique,
-        "__pstl_cuda_unique: kernel launch of cub::DispatchSelectIf::Dispatch failed",
+        "__pstl_cuda_unique: kernel launch of cub::DeviceSelect::Unique failed",
         __storage.__get_temp_storage(),

90-91: 💤 Low value

suggestion: Comment says "as well as a copy of the input sequence as Unique is not inplace" but the code uses the in-place variant of DeviceSelect::Unique. The temporary storage is only for __num_selected and CUB internals. Consider updating the comment:

-    { // Create temporary storage for the return value as well as a copy of the input sequence as Unique is not inplace
+    { // Create temporary storage for the return value (num_selected) and CUB internal scratch
libcudacxx/include/cuda/std/__pstl/cuda/unique_copy.h (1)

83-93: 💤 Low value

suggestion: Error messages reference __pstl_cuda_unique but this is unique_copy. Update for debugging clarity:

  • Line 85: "__pstl_cuda_unique_copy: determination of device storage..."
  • Line 100: "__pstl_cuda_unique_copy: kernel launch..."
  • Line 112: "__pstl_cuda_unique_copy: copy of num_selected..."

Also applies to: 98-108, 110-117


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2a9c4a23-bd8a-4562-b58e-775fd280c548

📥 Commits

Reviewing files that changed from the base of the PR and between 53d1cc1 and d97187e.

📒 Files selected for processing (10)
  • cub/cub/detail/env_dispatch.cuh
  • cub/cub/device/device_find.cuh
  • cub/cub/device/device_select.cuh
  • cub/test/catch2_test_device_find_env.cu
  • cub/test/catch2_test_device_select_unique.cu
  • libcudacxx/include/cuda/std/__pstl/cuda/find_if.h
  • libcudacxx/include/cuda/std/__pstl/cuda/unique.h
  • libcudacxx/include/cuda/std/__pstl/cuda/unique_copy.h
  • libcudacxx/include/cuda/std/__pstl/dispatch.h
  • libcudacxx/include/cuda/std/__pstl/unique_copy.h

Comment thread cub/test/catch2_test_device_select_unique.cu Outdated
Comment thread cub/test/catch2_test_device_select_unique.cu
Comment thread libcudacxx/include/cuda/std/__pstl/cuda/unique_copy.h Outdated
@miscco miscco force-pushed the device_select_unique_env branch 4 times, most recently from dcfa290 to 1a3118a Compare June 11, 2026 07:18
@miscco miscco changed the title Refactor DeviceSelect::Unique to always take an environment [CUB] Refactor DeviceSelect::Unique to always take an environment Jun 11, 2026
@miscco miscco force-pushed the device_select_unique_env branch from 1a3118a to ad308bb Compare June 11, 2026 09:20
@github-actions

Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 27m: Pass: 100%/343 | Total: 3d 08h | Max: 1h 22m | Hits: 96%/494169

See results here.

d_temp_storage,
temp_storage_bytes,
env,
[&]([[maybe_unused]] auto tuning_env, void* d_temp_storage, size_t& temp_storage_bytes, cudaStream_t stream) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, consider variable shadowing otherwise LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

2 participants