Releases: NVIDIA/cccl
CCCL Python Libraries (1.0.0)
Previous release: v0.7.0.
This is the first stable release of the cuda-cccl Python package.
The cuda.compute module is now considered stable, and we will follow semantic versioning for changes to its public API going forward.
The cuda.coop module remains experimental and has been moved to cuda.coop._experimental to reflect that. See breaking changes below.
Installation
Please refer to the install instructions here.
API breaking changes
- cuda.coop cooperative primitives moved to cuda.coop._experimental (#8788)
The block, warp, and StatefulFunction entry points previously exported from cuda.coop have been moved to the cuda.coop._experimental submodule, signaling that their API is not yet stable and is expected to change in a future release. Top-level cuda.coop no longer re-exports these.
Before:
from cuda.coop import block, warp, StatefulFunction
After:
from cuda.coop._experimental import block, warp, StatefulFunction
- cuda.cccl.cooperative legacy namespace removed (#8788)
The deprecated cuda.cccl.cooperative package (previously kept as a transitional alias) has been removed entirely. Migrate any remaining imports to cuda.coop._experimental.
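For code that has to run against both 0.7.0 and 1.0.0, one option is to try the new import path first and fall back to the old one. The following is a minimal sketch: import_first is a hypothetical helper, not part of cuda-cccl, and it assumes the entry points keep the same names in both locations.

```python
import importlib


def import_first(*module_names):
    """Return the first module in module_names that can be imported.

    Hypothetical helper for illustration; it is not provided by cuda-cccl.
    """
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            errors.append(f"{name}: {exc}")
    raise ImportError("no candidate module could be imported: " + "; ".join(errors))


# Intended usage (left commented out so the sketch runs without cuda-cccl installed):
# coop = import_first("cuda.coop._experimental", "cuda.coop")
# block, warp = coop.block, coop.warp
```

Listing the new path first means the old path is only consulted when cuda.coop._experimental does not exist.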
Features
- Python 3.14 support: cuda-cccl is now built and tested against Python 3.14 in addition to 3.10–3.13 (#8870).
Bug Fixes / Packaging
- Avoid incompatible numba-cuda versions: The dependency pin on numba-cuda was tightened to exclude 0.27.x, 0.28.x, 0.29.x, and 0.30.0, which contain regressions that break cuda-cccl (#8831).
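To check an environment against the exclusion list above, a small version test can help. This is only a sketch of the rule as stated in these notes: the function names are hypothetical, and the parser assumes plain X.Y.Z version strings with no pre-release or local suffixes.

```python
def parse_release(version):
    """Parse 'X.Y.Z' into a 3-tuple of ints, padding missing parts with 0."""
    parts = tuple(int(part) for part in version.split("."))
    return (parts + (0, 0, 0))[:3]


def is_excluded_numba_cuda(version):
    """Return True if version is in the set these notes list as excluded."""
    release = parse_release(version)
    # The whole 0.27, 0.28, and 0.29 series are excluded.
    if release[:2] in {(0, 27), (0, 28), (0, 29)}:
        return True
    # From the 0.30 series, only 0.30.0 itself is excluded.
    return release == (0, 30, 0)
```

The installed version string can be obtained with importlib.metadata.version("numba-cuda") and passed to is_excluded_numba_cuda.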
Known issues
- cuda.coop._experimental may fail with "RuntimeError: nvdisasm was not found or could not be executed" if nvdisasm is not discoverable. Follow the suggestion in the error message to install nvdisasm. If it is already installed, set the CUDA_PATH environment variable (not PATH) to the root of the directory containing bin/nvdisasm:
export CUDA_PATH=/path/to/cuda # such that $CUDA_PATH/bin/nvdisasm exists
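Before importing cuda.coop._experimental, it can be useful to confirm that CUDA_PATH points where expected. The sketch below only checks that $CUDA_PATH/bin/nvdisasm exists and is executable. nvdisasm_under is a hypothetical helper, and the library's own discovery logic may differ (for example, this does not handle the .exe suffix on Windows).

```python
import os
from pathlib import Path


def nvdisasm_under(cuda_path):
    """Return the path to bin/nvdisasm under cuda_path, or None if absent.

    Hypothetical helper; it mirrors the layout described in the note above,
    not the actual lookup performed by cuda-cccl.
    """
    if not cuda_path:
        return None
    candidate = Path(cuda_path) / "bin" / "nvdisasm"
    if candidate.is_file() and os.access(candidate, os.X_OK):
        return candidate
    return None


if __name__ == "__main__":
    found = nvdisasm_under(os.environ.get("CUDA_PATH"))
    print(found if found else "nvdisasm not found under $CUDA_PATH/bin")
```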
Notes
cuda.compute itself has no API changes in this release relative to v0.7.0. The 0.7.0 release contained the API cleanup (keyword-only arguments, parameter reordering, d_in_values/d_out_values rename in merge_sort); 1.0.0 is the formal stabilization of that API.
CCCL Python Libraries (v0.7.0)
cuda-cccl Python package — version 0.7.0
Release date: May 5th, 2026. Previous release: v0.6.0.
cuda-cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
API breaking changes
- All cuda.compute functions now require keyword-only arguments (#8772)
Every top-level function and factory (make_*) in cuda.compute now enforces keyword-only call syntax (i.e., all parameters must be passed by name). Positional calls will raise a TypeError.
Before:
reduce_into(d_in, d_out, op, num_items, h_init)
After:
reduce_into(d_in=d_in, d_out=d_out, num_items=num_items, op=op, h_init=h_init)
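The keyword-only behaviour can be reproduced in plain Python with a bare * in the signature. The sketch below uses a hypothetical stand-in, reduce_into_stub, that works on Python lists. It only illustrates the call syntax and the resulting TypeError, not the real cuda.compute implementation, which operates on device arrays.

```python
def reduce_into_stub(*, d_in, d_out, num_items, op, h_init):
    """Hypothetical stand-in: the bare * makes every parameter keyword-only."""
    acc = h_init
    for value in d_in[:num_items]:
        acc = op(acc, value)
    d_out[0] = acc


def add(a, b):
    return a + b


# Keyword call: accepted.
d_out = [None]
reduce_into_stub(d_in=[1, 2, 3], d_out=d_out, num_items=3, op=add, h_init=0)

# Positional call: rejected by Python before the function body runs.
positional_error = None
try:
    reduce_into_stub([1, 2, 3], d_out, add, 3, 0)
except TypeError as exc:
    positional_error = exc
```

Because arguments are matched by name, the order in which they are written at the call site no longer matters.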
Features
- System CUDA toolkit install extras: New pip extras sysctk12/sysctk13 (and minimal-sysctk12/minimal-sysctk13) allow installing cuda-cccl without pulling in cuda-toolkit as a pip dependency, for users who already have CUDA installed system-wide (#8608):
pip install cuda-cccl[sysctk13] # full install, system CTK
pip install cuda-cccl[minimal-sysctk13] # no Numba, system CTK
Performance
- Faster binary search: lower_bound/upper_bound are now implemented via transform with a small linear search for the final steps, improving throughput on modern GPUs (#8642)
- Adaptive warpspeed scan: The scan tuning policy now automatically selects the warpspeed (lookahead) scan path when beneficial for the data type and architecture (#8158)
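The idea of finishing a binary search with a short linear scan can be sketched on the CPU as follows. This is an illustration of the general technique only, not the CCCL implementation: lower_bound_hybrid and its linear_threshold parameter are hypothetical names, and the threshold value is arbitrary.

```python
def lower_bound_hybrid(sorted_values, target, linear_threshold=8):
    """Return the first index i with sorted_values[i] >= target.

    Binary search narrows the window [lo, hi) until it holds at most
    linear_threshold elements; a linear scan then finishes the search.
    """
    lo, hi = 0, len(sorted_values)
    # Binary phase: everything before lo is < target, everything from hi on is >= target.
    while hi - lo > linear_threshold:
        mid = (lo + hi) // 2
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    # Linear phase: walk the small remaining window.
    while lo < hi and sorted_values[lo] < target:
        lo += 1
    return lo
```

The result matches bisect.bisect_left from the standard library. The linear phase replaces the last few binary steps with a short, predictable scan.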
Bug Fixes
- Fix incorrect minimum CUDA architecture targeted when building the cccl.c native extension (#8631)
v3.3.3
What's Changed
🔄 Other Changes
- Bump branch/3.3.x to 3.3.3. by @wmaxey in #8409
- [Backport branch/3.3.x] [libcu++] Add missing braces supression to other mempool types by @github-actions[bot] in #8166
- [Backport branch/3.3.x] Fix order of _CCCL_API and CCCL_DEPRECATED by @github-actions[bot] in #8390
- [backport 3.3] Fix family arch specific feature detection in <nv/target> (#8027) by @davebayer in #8294
- [Backport branch/3.3.x] Fix codegen in 128bit atomic CAS by @github-actions[bot] in #8408
- [Backport branch/3.3.x] [libcu++] Add missing bit_cast in the buffer construction (#8420) by @pciolkosz in #8425
Full Changelog: v3.3.2...v3.3.3
v3.3.2
What's Changed
🔄 Other Changes
- Bump branch/3.3.x to 3.3.2. by @wmaxey in #7992
- [Backport to 3.3]: Support non-copyable stream types in DeviceTransform (#7915) by @bernhardmgruber in #8011
- [Backport branch/3.3.x] Support DLPack inclusion for both <dlpack/dlpack.h> and <dlpack.h> by @github-actions[bot] in #7910
- [Backport branch/3.3.x] Add fallback for _CCCL_BUILTIN_EXPECT by @github-actions[bot] in #8049
- [Backport 3.3] reformulate __as_type_list to avoid MSVC overload resolution bug (#7991) by @miscco in #8062
- [Backport 3.3] Avoid deprecation warning with is_always_equal (#7674) by @miscco in #8078
- [Backport branch/3.3.x] Fix use of EXPAND in token concatenation by @github-actions[bot] in #8077
Full Changelog: v3.3.1...v3.3.2
CCCL Python Libraries v0.6.0
These are the release notes for the cuda-cccl Python package version 0.6.0, dated April 9th, 2026. The previous release was v0.5.1.
cuda-cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
API breaking changes
- cuda.coop refactored to use maker factory functions (#7713)
Features
- ShuffleIterator: New iterator type added to cuda.compute (#7721)
- max_segment_size guarantee: Exposed in the public API (#8284)
- LTO-IR support: Can now directly pass LTO-IR for custom operators (#7625)
- Numba-optional install: Added a path to install cuda.compute without Numba as a dependency (#7633)
Performance
- Faster TransformIterator construction (#7660)
Bug Fixes
v3.3.1
What's Changed
🔄 Other Changes
- Bump 3.3.0 to 3.3.1. by @wmaxey in #7742
- [Backport 3.3] #7787 and #7738 by @miscco in #7800
- [Backport 3.3]: Avoid use of class static variable in device function (#7776) by @miscco in #7825
- [Backport branch/3.3.x] Forward policy hub from dispatch_streaming_arg_reduce_t to reduce::dispatch by @github-actions[bot] in #7814
- [Backport branch/3.3.x] cub: change {Lower,Upper}Bound to accept iterator and number of elements. by @github-actions[bot] in #7816
- [Backport branch/3.3.x] Fix version guard for cudaDevAttrHostNumaMemoryPoolsSupported by @github-actions[bot] in #7842
- [Backport 3.3] Buffer changes by @miscco in #7841
- [Backport branch/3.3.x] [libcu++] Change default pool getters to return memory_pool_ref& by @github-actions[bot] in #7858
- [Backport branch/3.3.x] Avoid compile issue with __iset by @github-actions[bot] in #7879
- [Backport to 3.3] Require CUDA 12.9 for host numa implementation of pinned memory pool (#7856) by @pciolkosz in #7872
- [Backport 3.3] Avoid GCC bug with dependent type template (#7857) by @miscco in #7860
Full Changelog: v3.3.0...v3.3.1
v3.3.0
Full Changelog: v3.3.0...v3.3.0
What's Changed
📚 Libcudacxx
- [libcudacxx] Fix a typo in the documentation by @caugonnet in #7330
- Add a test for <nv/target> to validate old dialect support. by @wmaxey in #7241
🔄 Other Changes
- Implement cudax::cufile by @davebayer in #6122
- Update linear_congruential_generator with constexpr, tests and a fast discard by @RAMitchell in #6402
- Replace _CCCL_HAS_CUDA_COMPILER() with _CCCL_CUDA_COMPILATION() by @davebayer in #6399
- Remove unnecessary casts in complex multiplication/division by @davebayer in #6670
- Add benchmark batch script by @bernhardmgruber in #6661
- Improvements and testing for inspect_changes CI functionality. by @alliepiper in #6535
- Improve clarity of CCCL assert macro documentation by @jrhemstad in #6675
- Fix oversubscription issue with lit precompile, label hack by @alliepiper in #6554
- Make missing sccache nonfatal. by @alliepiper in #6582
- Address pending comments for make_tma_descriptor by @fbusato in #6662
- Add nvhpc 25.9. by @alliepiper in #6003
- Test building for all arches. by @alliepiper in #6113
- Add nvbench_helper tests to CI. by @alliepiper in #6679
- Add more targets to pytorch build. by @alliepiper in #6685
- Add host std lib version detection by @davebayer in #6678
- Improve CUB benchmark docs by @bernhardmgruber in #6640
- Use if consteval in libcu++ by @davebayer in #6424
- Update docs for _CCCL_IF_CONSTEVAL by @davebayer in #6692
- Fixes issue with select close to int_max by @elstehle in #6641
- Update libcudacxx C++ dialect handling. by @alliepiper in #6693
- Simplifies env usage in DeviceTopK tests by @elstehle in #6680
- Switch to S3 preprocessor cache by @alliepiper in #6561
- fix omp scan bug by @charan-003 in #6560
- Refactor out variant from transform tunings by @bernhardmgruber in #6669
- [libcu++] Waive hierarchy constexpr testing on GCC8 by @pciolkosz in #6707
- Use wrapper with void* argument types for iterator advance/dereference signature by @shwina in #6634
- Restore libcudacxx dialect presets. by @alliepiper in #6705
- Refactor error handling in radix sort dispatch by @bernhardmgruber in #6681
- Remove special dialect handling from cudax build system. by @alliepiper in #6702
- Segmented scan followup by @oleksandr-pavlyk in #6706
- Fix electing leader from any group in cuda::memcpy_async by @bernhardmgruber in #6710
- Avoid scaling twice in ReduceNondeterministicPolicy by @bernhardmgruber in #6711
- Remove special handling of C++ dialect in CUB's build system by @alliepiper in #6713
- [libcu++] Use resource test fixture members through this by @pciolkosz in #6717
- Improves top-k examples to illustrate stream usage by @elstehle in #6723
- Tweak sol.py a bit by @bernhardmgruber in #6721
- Implement PCG64 as extension by @RAMitchell in #6292
- Use PDL in cub::DeviceScan by @bernhardmgruber in #6639
- Fix header in libcudacxx test by @alliepiper in #6726
- Remove dead code. by @alliepiper in #6725
- Add deps on thrust/cub to libcudacxx. by @alliepiper in #6694
- Remove special handling for dialect in Thrust's build system. by @alliepiper in #6722
- [libcu++] Automatically bump up the release threshold of default mempools by @pciolkosz in #6718
- Backport cuda::std::reference_wrapper C++20 features by @davebayer in #6709
- Relax error tolerance for deterministic_device_reduce (RFA) test by @srinivasyadav18 in #6720
- [DOC] Add temp_storage_bytes usage guide by @Aminsed in #6208
- Improve charconv test compile times by @davebayer in #6687
- Move source location builtins directly to <cuda/std/source_location> by @davebayer in #6738
- Small improvements for cuda::ipow by @davebayer in #6736
- Add support for clang's alignment builtins by @davebayer in #6741
- Disable test that is failing in multiple configurations by @miscco in #6745
- Implement std::normal_distribution by @RAMitchell in #6585
- Update cuda::std::span concepts by @davebayer in #6744
- Improve bit builtins support by @davebayer in #6737
- Implement ranges::drop_view by @miscco in #5049
- Improve fp decompose by @davebayer in #6749
- Enable caching of advance/dereference methods for Zipiterator and PermutationIterator by @shwina in #6753
- implement indeterminate_domain from P3826R2 by @ericniebler in #6628
- Fix cuda::std::reference_wrapper noexcept test with gcc-8 by @davebayer in #6757
- cuda.compute: In TransformIterator, use type annotations (if available) to determine the output type of user-provided op by @shwina in #6760
- cuda.compute: Fixes and improvements to function caching by @shwina in #6758
- Fix __throw_cuda_error availability with nvrtc by @davebayer in #6759
- Implement ranges::find_if and ranges::find_if_not by @miscco in #6752
- Fix radix_sort tuning namespace by @bernhardmgruber in #6755
- [libcu++] Add sm_62 arch traits by @pciolkosz in #6772
- fix(readme): Update broken Godbolt example link by @miyanyan in #6773
- Implement CUDA backend for parallel cuda::std::for_each by @miscco in #5610
- Ensure that we properly warn about device lambdas that need to query the return type by @miscco in #6765
- Add missing test for thrust::reduce_into by @Pansysk75 in #6572
- cuda.compute: Add select algorithm based on three_way_partition by @shwina in #6766
- Add queries for CUB ptx version as arch_id by @bernhardmgruber in #6776
- Add operator<< to some CUB enums by @bernhardmgruber in #6774
- cuda.compute: Fix caching of functions that call other functions by @shwina in #6770
- Implement std::exponential_distribution by @RAMitchell in #6750
- Fix issue with libcudacxx header tests. by @alliepiper in #6785
- Add a type and operation enum to CUB by @bernhardmgruber in #6780
- Use conventional order of _CCCL_API friend consistently by @miscco in #6781
- Implement std::binomial_distribution by @RAMitchell in #6747
- Fixes i32 overflow for benchmark data generation of more than INT_MAX number of items by @elstehle in #6809
- Temporarily add upper bound to numba-cuda dependency by @shwina in #6815
- Make cuda capabilities part of cccl config by @davebayer in #6806
- Update std::uniform_real_distribution by @RAMitchell in #6798
- [cub] Implement cub::MaxPotentialDynamicSmemBytes by @davebayer in #6818
- libcudacxx: streamline simple trait aliases by @Aminsed in #6740
- Fix a typo in compute.rst by @shwina in #6826
- Improve our WarpReduce implementation by @miscco in #6814
- Implement cuda::sincos by @davebayer in #6742
- Replace inline ptx with intrinsics by @davebayer in https://github.com/NVIDIA/c...
v3.2.1
What's Changed
🔄 Other Changes
- Bump branch/3.2.x to 3.2.1. by @wmaxey in #7329
- [Backport branch/3.2.x] Add accessor methods to shared_resource by @github-actions[bot] in #7322
- [Backport branch/3.2.x] Fix clang warning about missing braces again by @github-actions[bot] in #7324
- [Backport branch/3.2.x] part deux: make the abi of __basic_any compatible between c++17 and c++20 by @github-actions[bot] in #7421
- [backport 3.2] Fix missing c2h symbol when compiling with clang-cuda (#7454) by @davebayer in #7600
- [Backport branch/3.2.x] Remove recursion from __internal_is_address_from by @github-actions[bot] in #7573
- [Backport branch/3.2.x] Fix ranges_overlap for nvc++ -cuda by @github-actions[bot] in #7598
- [Backport branch/3.2.x] Fix cuda::device::current_arch_id by @github-actions[bot] in #7601
- [Backport branch/3.2.x] Check for _GLIBCXX_USE_CXX11_ABI only when compiling with libstdc++ by @github-actions[bot] in #7630
- [Backport branch/3.2.x] Fix cuda::barrier missing accounting of results in try_wait by @github-actions[bot] in #7634
Full Changelog: v3.2.0...v3.2.1
CCCL Python Libraries (v0.5.1)
These are the release notes for the cuda-cccl Python package version 0.5.1, dated February 6th, 2026. The previous release was v0.5.0.
cuda-cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features
Improvements
- Restrict to numba-cuda less than 0.27 (#7529)
Bug Fixes
- Fix caching of functions referencing numpy ufuncs (#7535)
CCCL Python Libraries (v0.5.0)
These are the release notes for the cuda-cccl Python package version 0.5.0, dated February 5th, 2026. The previous release was v0.4.5.
cuda-cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
⚠️ Breaking change
Object-based API requires passing operator to algorithm __call__ method
This API change affects only users of the object-based API (expert mode).
Previously, constructing an algorithm object required passing the operator as an argument, but invoking it did not:
# step 1: create algorithm object
transformer = cuda.compute.make_unary_transform(d_input, d_output, some_unary_op)
# step 2: invoke algorithm
transformer(d_in1, d_out1, num_items1) # NOTE: not passing some_unary_op here
The new behaviour requires passing it in both places:
# step 1: create algorithm object
transformer = cuda.compute.make_unary_transform(d_input, d_output, some_unary_op)
# step 2: invoke algorithm
transformer(d_in1, d_out1, some_unary_op, num_items1) # NOTE: need to pass some_unary_op here
This change is introduced because in many situations (such as in a loop), the operator itself and the globals/closures it references can change between construction and invocation (or between invocations).
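The closure case can be seen in plain Python, independent of cuda.compute. In the sketch below, make_scale_op and captured_values are hypothetical helpers. Two operators built in a loop share the same code object but capture different values, which is the kind of situation the rationale above describes.

```python
def make_scale_op(factor):
    """Return a unary operator that closes over factor."""
    def scale(x):
        return x * factor
    return scale


def captured_values(fn):
    """Return the values held in fn's closure cells (empty tuple if none)."""
    return tuple(cell.cell_contents for cell in (fn.__closure__ or ()))


# Operators created in a loop: identical code, different captured state.
ops = [make_scale_op(factor) for factor in (2, 3)]
same_code = ops[0].__code__ is ops[1].__code__
captures = [captured_values(op) for op in ops]
results = [op(10) for op in ops]
```

Since the compiled code alone does not distinguish the two operators, the operator has to be supplied at invocation time for the captured values to be known.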
Features
Improvements
- Avoid unnecessary recompilation of stateful operators (#7500)
- Improved cache lookup performance (#7501)
Bug Fixes
- Fix handling of boolean types in cuda.compute (#7389)