Releases: NVIDIA/cccl

CCCL 2.8.0

03 Mar 21:03
6d02e11

What's Changed

  • Adds benchmarks for DeviceSelect::Unique by @elstehle in #2359
  • CUB - Enable DPX Reduction by @fbusato in #2286
  • [CUDAX] add a small c++17 implementation of std::execution (aka P2300) by @ericniebler in #2301
  • Add thrust::transform_inclusive_scan with init value by @gonidelis in #2326
  • Widen histogram agent constructor to more types by @bernhardmgruber in #2380
  • Use a constant for the amount of static SMEM by @bernhardmgruber in #2374
  • Add cub::DeviceTransform by @bernhardmgruber in #2086
  • Update toolkit to CTK 12.6 by @miscco in #2348
  • implement make_integer_sequence in terms of intrinsics whenever possible by @ericniebler in #2384
  • Implement cuda::mr::cuda_async_memory_resource by @miscco in #1637
  • Drop implementation of thrust::pair and thrust::tuple by @miscco in #2395
  • Pull out _LIBCUDACXX_UNREACHABLE into its own file by @miscco in #2399
  • Share common compiler flags in new CCCL-level targets. by @alliepiper in #2386
  • conditionally include <crt/host_defines.h> from __cccl/execution_space.h header by @ericniebler in #2406
  • add some simple utilities for manipulating lists of types by @ericniebler in #2370
  • Drop thrusts diagnostic suppression warnings by @miscco in #2392
  • [PoC]: Implement cuda::experimental::uninitialized_async_buffer by @miscco in #1854
  • Fix thrust package to work with newer FindOpenMP.cmake. by @alliepiper in #2421
  • Introduce cccl_configure_target cmake function. by @alliepiper in #2388
  • Fix sccache errors in RAPIDS builds by @trxcllnt in #2417
  • Replace CUDA C++ Core Libraries with CUDA Core Compute Libraries (only in README.md). by @rwgk in #2424
  • Minor cleanup with cuda/atomic by @miscco in #2418
  • uninitialized_buffer::get_resource returns a ref to an any_resource that can be copied by @ericniebler in #2431
  • Refactor cuda::ceil_div to take two different types by @miscco in #2376
  • Reduce PR testing matrix. by @alliepiper in #2436
  • Implement cudax::shared_resource by @miscco in #2398
  • Increase the libcu++ timeout by @miscco in #2435
  • Move c/include/cccl/*.h files to c/include/cccl/c/*.h by @rwgk in #2428
  • Make any_resource emplacable by @miscco in #2425
  • Fix issues with __host__ and __device__ definitions by @miscco in #2413
  • Make bit_cast play nice with extended floating point types by @miscco in #2434
  • Do not include our own string.h file by @miscco in #2444
  • Move nightly time by @bdice in #2437
  • Remove a ton of lines in thrust tests by @gonidelis in #2356
  • [CUDAX] Add placeholder green context type and logical device that can hold both a green ctx and a device by @pciolkosz in #2446
  • Fix typo in CCCLBuildCompilerTargets.cmake by @alliepiper in #2453
  • Drop superfluous compile definition from thrust tests by @miscco in #2450
  • Consolidate packages and install rules by @alliepiper in #2456
  • Prune CUB's ChainedPolicy by CUDA_ARCH_LIST by @bernhardmgruber in #2154
  • fixes merge conflict for policy pruning by @elstehle in #2466
  • Add CCCL_ENABLE_WERROR flag. by @alliepiper in #2463
  • Add CUB tests for segmented sort/radix sort with 64-bit num. items and segments by @fbusato in #2254
  • Propagate compiler flags down to libcu++ LIT tests by @Artem-B in #2420
  • Drop remaining uses of _LIBCUDACXX_COMPILER_* by @miscco in #2467
  • Avoid C++17 extension in c++11 tests by @miscco in #2469
  • Add span to example and templated block size by @Kh4ster in #2470
  • Drop Objective C++ support by @miscco in #2468
  • removes superfluous template keyword in call to Dereference by @andrewcorrigan in #2482
  • Improve build times in several heavyweight libcudacxx tests. by @wmaxey in #2478
  • Drop __availability header by @miscco in #2484
  • Replace a few more instances of CUDA C++ Core Libraries with CUDA Core Compute Libraries. by @rwgk in #2447
  • Fix common_type specialization for extended floating point types by @miscco in #2483
  • Implement some CUDA API calls for async_memory_pool by @miscco in #2455
  • Move cudax example project to CCCL project examples. by @alliepiper in #2462
  • Disable system header for narrowing conversion check by @miscco in #2465
  • Require resources to always provide at least one execution space property by @miscco in #2489
  • Rework builtin handling by @miscco in #2461
  • Disable execution checks for std::equal by @miscco in #2491
  • replace _CCCL_ALWAYS_INLINE with _CCCL_FORCEINLINE by @ericniebler in #2439
  • Drop 2 relative includes that snuck in by @miscco in #2492
  • re-express the cudax::__tupl::__apply member to make nvc++ happy by @ericniebler in #2493
  • Drop badly named _One_of concept by @miscco in #2490
  • Unify assert handling in cccl by @miscco in #2382
  • Reduce scope of Thrust linkage in cudax. by @alliepiper in #2496
  • Centralize CPM logic. by @alliepiper in #2495
  • Fix typo in presets. by @alliepiper in #2497
  • Refactor away per-project TOPLEVEL flags. by @alliepiper in #2498
  • [FEA]: Validate cuda.parallel type matching in build and execution by @rwgk in #2429
  • avoid gcc optimizer bug by not force inlining part of thrust::transform by @ericniebler in #2509
  • Cleanup and modularize <cuda/std/barrier> by @miscco in #2443
  • Consolidate header testing infra. by @alliepiper in #2460
  • Add ForEachN from CUB to cccl/c. by @wmaxey in #2378
  • Adds support for large number of items in DeviceSelect and DevicePartition by @elstehle in #2400
  • Adds support for large number of items to DeviceScan::*ByKey family of algorithms by @elstehle in #2477
  • Integrate c/parallel with CCCL build system and CI. by @alliepiper in #2514
  • Create a command list utility for nvrtc/jitlink steps. by @wmaxey in #2511
  • Fix the example project which the documentation refers to by @caugonnet in #2531
  • Enable tests/headertests for c/parallel in all-dev presets. by @alliepiper in #2566
  • Rename cudax test targets to match CCCL conventions. by @alliepiper in #2568
  • Update project list in issue template by @alliepiper in #2532
  • Disable compiler extensions on CCCL targets. by @alliepiper in #2559
  • Fixes cub::DeviceMemcpy::Batched to be able to copy from const source pointers by @elstehle in #2573
  • Fix documentation error in ci/build_common.sh for -arch by @caugonnet in #2574
  • gcc-14 gained the ability to mangle noexcept expressions by @ericniebler in #2565
  • Miscellaneous simple fixes by @rwgk in #2575
  • Avoid including yvals.h when the compiler is not MSVC. by @wmaxey in #2545
  • Fix popc.h when architecture is not x86 on MSVC. by @wmaxey in #2524
  • test for exceptions support on msvc with the _CPPUNWIND macro by @ericniebler in https://github.co...

CCCL 2.7.0

06 Jan 22:12
v2.7.0
b5fe509

What’s New

C++

Thrust / CUB

  • Inclusive scan now supports initial value #1940
  • Inclusive and exclusive scan now support problem sizes exceeding 2^31 elements #2171
  • New cub::DeviceMerge::MergeKeys and cub::DeviceMerge::MergePairs algorithms #1817
  • New thrust::tabulate_output_iterator fancy iterator #2282
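
To illustrate what an inclusive scan with an initial value computes, here is a minimal host-side sketch in plain C++. It is a reference for the semantics only: the helper name inclusive_scan_with_init is hypothetical, the operator is fixed to addition, and it does not use the Thrust or CUB API. The standard library's C++17 std::inclusive_scan overload that takes an init value has the same semantics.

```cpp
#include <vector>

// Hypothetical host-side reference, not the Thrust/CUB API:
// out[i] = init + in[0] + ... + in[i]
std::vector<int> inclusive_scan_with_init(const std::vector<int>& in, int init)
{
  std::vector<int> out;
  out.reserve(in.size());

  int running = init; // the initial value seeds the running sum
  for (int x : in)
  {
    running += x;          // include the current element ...
    out.push_back(running); // ... before writing the result
  }
  return out;
}
```

For the input {1, 2, 3, 4} and an initial value of 10, the sketch yields {11, 13, 16, 20}. Without an initial value the first output would be just the first input element.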

Libcudacxx

  • Assertions can now be enabled on host and device, depending on the user's choice
  • C++26 inplace_vector has been implemented and backported to C++14
  • Improved support for extended floating point types __half and __nv_bfloat16 both for cmath functions and complex
  • cuda::std::tuple is now trivially copyable if the stored types are trivially copyable
  • Reworked our atomics implementation
  • Improved <cuda/std/bit> conformance
  • Implemented <cuda/std/bitset> and backported to C++14
  • Implemented and backported C++20 bit_cast. It is available in all standard modes and constexpr with compiler support
  • Various backports and constexpr improvements (bool_constant, cuda::std::max)
  • Moved the experimental memory resources from <cuda/memory_resource> into <cuda/experimental/memory_resource.cuh>
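
As a rough illustration of what bit_cast does, the following host-side sketch reinterprets the bytes of one trivially copyable type as another of the same size. The name bit_cast_sketch and the memcpy-based approach are assumptions made for illustration, not the libcudacxx implementation. A memcpy version like this one cannot be constexpr, which is presumably why the constexpr form depends on compiler support.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical memcpy-based sketch of bit_cast (not the libcudacxx code).
template <class To, class From>
To bit_cast_sketch(const From& from) noexcept
{
  static_assert(sizeof(To) == sizeof(From), "sizes must match");
  static_assert(std::is_trivially_copyable<To>::value, "To must be trivially copyable");
  static_assert(std::is_trivially_copyable<From>::value, "From must be trivially copyable");
  static_assert(std::is_default_constructible<To>::value, "this sketch default-constructs To");

  To to;
  std::memcpy(&to, &from, sizeof(To)); // copy the object representation byte for byte
  return to;
}

// Example helper: the raw bits of a float
// (assumes float is a 32-bit IEEE-754 type, as on common platforms).
std::uint32_t float_bits(float f)
{
  return bit_cast_sketch<std::uint32_t>(f);
}
```

Under the IEEE-754 assumption, float_bits(1.0f) returns 0x3f800000.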

Python

cuda.cooperative

The CCCL best practices that make CUDA kernels easier to write and faster to execute are now available in Python through the cuda.cooperative module. This module currently supports block- and warp-level algorithms within numba.cuda kernels, offering speed-of-light reductions, prefix sums, radix sort, and merge sort. You can customize cuda.cooperative algorithms with user-defined data types and operators, implemented directly in Python.

Block and warp-level cooperative algorithms are now available in Python #1973.
Experimental versions of reduce, scan, merge and radix sort are available in numba.cuda kernels.

cuda.parallel

Apart from device-side cooperative algorithms, CCCL 2.7 provides an experimental version of host-side parallel algorithms as part of the cuda.parallel module. This release includes parallel reduction.

What's Changed


CCCL 2.6.1

10 Sep 18:45
v2.6.1
9019a6a

This release includes backports for PRs #2332 and #2341. Please see release 2.6.0 for the full list of changes included in the release.

What's Changed

Full Changelog: v2.6.0...v2.6.1

CCCL 2.6.0

04 Sep 17:42
c67b1c3

What's Changed

Full Changelog: v2.5.0...v2.6.0

CCCL 2.5.0

17 Jun 18:00
69be18c

What's New

This release includes several notable improvements and new features:

  • CUB device-level algorithms now support NVTX ranges in Nsight Systems. This integration makes it easier to identify and analyze the time spent in CUB algorithms. Please note that profiling with this feature requires at least C++14.
  • We have added a new cub::DeviceSelect::FlaggedIf API, which lets you select items by applying a predicate to their flags. This addition provides more flexibility and control over item selection.
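
The selection rule described for FlaggedIf can be sketched on the host as follows. This is a sequential reference for the semantics only: the function name select_flagged_if and its vector-based signature are hypothetical and do not mirror the actual CUB interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: keep items[i] whenever pred(flags[i]) is true.
template <class T, class Flag, class Pred>
std::vector<T> select_flagged_if(const std::vector<T>& items,
                                 const std::vector<Flag>& flags,
                                 Pred pred)
{
  assert(items.size() == flags.size()); // one flag per item

  std::vector<T> out;
  for (std::size_t i = 0; i < items.size(); ++i)
  {
    if (pred(flags[i])) // the predicate is applied to the flag, not the item
    {
      out.push_back(items[i]);
    }
  }
  return out;
}
```

With items {10, 20, 30, 40}, flags {1, 4, 3, 8}, and a predicate that accepts even flags, the sketch selects {20, 40}, preserving the input order.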

What's Changed


v2.4.0

23 Apr 21:30
1c009d2

What’s New

We are still hard at work in CCCL on paying down lots of technical debt, improving infrastructure, and various other simplifications as part of the unification of Thrust/CUB/libcu++. In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.

Thrust

As part of our kernel consolidation effort, the kernels of the thrust::unique_by_key, thrust::copy_if, and thrust::partition algorithms are now consolidated in CUB. Kernel consolidation achieves two goals. First, it delivers the latest optimizations of CUB algorithms to Thrust users. Second, apart from the performance improvements, it introduces support for large problem sizes (64-bit offsets) in Thrust algorithms.

CUB

  • cub::DeviceSelect::UniqueByKey now supports equality operator and large problem sizes.
  • New cub::DeviceFor family of algorithms goes beyond conventional cub::DeviceFor::ForEach. cub::DeviceFor::ForEachCopy can provide you with additional performance benefits from vectorized memory accesses.
  • Many CUB algorithms now support CUDA graph capture mode.

libcudacxx

  • Added new cuda::ptx namespace with wrappers for inline-PTX instructions
  • Added cuda::std::complex specializations for the CUDA types bfloat and half.

What's Changed


v2.3.2

12 Mar 20:22
64d3a5f

What's Changed

Full Changelog: v2.3.1...v2.3.2

v2.3.1

23 Apr 21:29
299eb62

What's Changed

  • [BACKPORT]: Fix bug in stream_ref::wait by @miscco in #1283
  • Revert "Refactor thrust::complex as a struct derived from cuda::std::complex (#454)" by @miscco in #1286
  • Create patch 2.3.1 by @wmaxey in #1287

Full Changelog: v2.3.0...v2.3.1

CCCL 2.3.0

28 Feb 18:36
c4eda1a

What’s New

In addition to various fixes and documentation improvements, the following notable improvements have been made to Thrust, CUB, and libcudacxx.

System Headers and Warnings

Users don't want to see warnings from CCCL headers. The typical way to accomplish this with header libraries is to use -isystem. However, this causes problems when using CCCL from GitHub, because it will conflict with the CCCL headers in the CTK. Therefore, you should always include CCCL headers via -I.

To achieve the same effect as -isystem, CCCL headers will now use the system_header pragma. For more information, see #527.

TL;DR: You should never see warnings emitted from a CCCL header ever again!

Linkage Issues

Using CUB and Thrust in shared libraries is a known source of issues. Previously, the solution to these issues consisted of using the THRUST_CUB_WRAPPED_NAMESPACE macro so that different shared libraries have different symbol names. However, this solution has poor discoverability, since the issues present themselves as segmentation faults, hangs, wrong results, etc. As of the 2.3 release, linkage issues are addressed by default without the need for THRUST_CUB_WRAPPED_NAMESPACE. Although the fix is API compatible, it might cause ABI compatibility issues. For more details, see issue #443.

Thrust

thrust::tuple, thrust::pair, and thrust::complex have been replaced with cuda::std alternatives. This can be a breaking change, but should be source compatible.

CUB

Up to 60% performance improvements of cub::DeviceSelect::UniqueByKey, cub::DeviceScan::ExclusiveSumByKey, and cub::DeviceReduce::ReduceByKey on A100. cub::DeviceSegmentedReduce now supports 64-bit indexing.

libcudacxx

  • The cuda::ptx namespace and <cuda/ptx> header are now available and provide access to various inline PTX functions that cover async memcpy and barrier intrinsics.
  • #379 - Added experimental bulk TMA memcpy under <cuda/barrier>

What's Changed

  • Port cub::DeviceSegmentedReduce tests to catch2 by @elstehle in #303
  • Branch/2.2.x by @gevtushenko in #305
  • Tune unique by key on A100 by @gevtushenko in #306
  • Merge branch/2.2.x to main by @jrhemstad in #308
  • Add example cmake project by @jrhemstad in #177
  • Adds catch2 tests for reduce-by-key by @elstehle in #311
  • Tune scan by key on A100 by @gevtushenko in #325
  • Replace diag_suppress by nv_diag_suppress in documentation by @ahendriksen in #281
  • Fix MSVC / CUB tests build by @gevtushenko in #336
  • gdb pretty printer: handle non-cuda device vectors by @siboehm in #264
  • Add a nvrtc configuration for libcu++ by @miscco in #202
  • GH Infra: project automation and issue template fixes by @jarmak-nv in #297
  • Tune reduce by key on A100 by @gevtushenko in #346
  • Merge commits from 2.2 branch by @miscco in #350
  • Fix a shadow warning in thrust's execute_with_dependencies.h by @hageboeck in #334
  • Assorted fixes for MSVC 2017 by @miscco in #341
  • [skip-tests] Guard inline variables with _LIBCUDACXX_INLINE_VAR macro by @miscco in #355
  • Port cub::DeviceScan tests to catch2 by @elstehle in #347
  • Remove _NOEXCEPT macro in favor of noexcept in libcu++ by @Blonck in #349
  • Project Automation: add conditional steps due to context errors by @jarmak-nv in #353
  • Work around strange gcc bug by @miscco in #363
  • Implement iter_swap CPO by @miscco in #332
  • Replace default, constexpr, and delete macros by original keywords by @Blonck in #360
  • Add clang16 devcontainer and CI job by @miscco in #362
  • [skip-tests] Skip merge conflict from old iter_swap PR by @miscco in #369
  • [skip-tests] Also skip all CI runs that require a GPU when [skip-tests] is set by @miscco in #370
  • Remove _LIBCUDACXX_CXX03_LANG macro and all encapsulated code by @Blonck in #368
  • Remove checks against _LIBCUDACXX_STD_VER < 11 by @Blonck in #375
  • Use copy-pr-bot by @ajschmidt8 in #381
  • Implement the permutable concept by @miscco in #367
  • [NFC] We missed some _NOEXCEPT_ macro uses by @miscco in #371
  • Implement identity changes for c++20 by @miscco in #383
  • Hide third party cmake options in our cmake developer builds. by @allisonvacanti in #300
  • Port cub::DeviceScanByKey tests to Catch2 by @elstehle in #380
  • Fixes a race in DeviceRunLengthEncode::NonTrivialRuns by @elstehle in #399
  • Add commit information to the test output by @miscco in #401
  • Project Automation: Handle PRs opened as non-draft + multiple bug fixes by @jarmak-nv in #387
  • Project Automation: set Roadmap project value on issue/pr close and Auto-type new issues by @jarmak-nv in #389
  • Add support for tests that should fail at runtime by @ahendriksen in #418
  • Port DeviceAdjacentDifference::SubtractRight tests to catch2 by @miscco in #390
  • Project automation - Fix indentation for continue-on-error by @jarmak-nv in #425
  • [BUG] Ensure that all headers build on their own by @miscco in #200
  • Remove util_device.cuh from iterator headers to enable online compilation by @leofang in #412
  • Fix ci-overview example by @gevtushenko in #428
  • Port cub::DeviceRunLengthEncode tests to catch2 by @miscco in #411
  • Add cuda::device::barrier_arrive tx by @ahendriksen in #358
  • Fix CubDebug by @gevtushenko in #430
  • Do not use static member functions to initialize static member variables. by @miscco in #438
  • Implement the projected helper struct by @miscco in #385
  • Add PTX wrapping functions for TMA features by @ahendriksen in #379
  • Clarify docstring for num_items parameter of DeviceSegmentedRadixSort by @HapeMask in #320
  • Enable lit to determine the compute architectures by @miscco in #447
  • Add NVRTC_SKIP_KERNEL_RUN tag to compile, but skip running NVRTC test by @ahendriksen in #434
  • Improve documentation of cuda::barrier by @ahendriksen in #440
  • Extend thrust::complex unit tests to prepare for upcoming replacement with std::complex by @Blonck in #413
  • Remove having two install rules for -header-search.cmake by @robertmaynard in #298
  • Run .devcontainer/launch.sh with bash + add error checking by @wence- in #407
  • Remove C++03 compatibility from unit tests by @Blonck in #378
  • [libcu++] Fix use of __ppc64__ by @miscco in #451
  • Update the README by @jrhemstad in #291
  • [libcu++] Try to avoid gcc miscompilation issues by @miscco in #452
  • Consolidate matrix logic into single script/job by @jrhemstad in #361
  • Implement the indirectly_comparable concept by @miscco in #445
  • Fix compute matrix dropping trailing zeros by @jrhemstad in #466
  • Avoid integer promotion warnings with MSVC by @miscco in #460
  • Implement ranges comparison objects by @miscco in #464
  • Fix CUB/MSVC/RDC tests by @gevtushenko in #469
  • Fix Thrust/CUB Linkage Issues by @gevtushenko in #443
  • Script for Running CUB Benchmarks by @gevtushenko in #472
  • [skip ci] Add list of CCCL users to README by @jrhemstad in #474
  • constexpr all the things by @pb-dseifert in #476
  • Add Gonzalo/Allard to trustees by @jrhemstad in #482
  • Implement the sortable concept by @miscco in #471
  • [libcu++] Add _LIBCUDACXX_CUDACC_BELOW_12_3 macro by @gonzalobg in #479
  • Refactor thrust::complex as a struct derived from cuda::std::complex by @Blonck in #454
  • Add ci scripts for windows by...

CCCL 2.2.0

07 Sep 19:09
36f379f

(Note that these release notes are not yet finalized. They do not reflect any PRs that were merged to Thrust/CUB/libcudacxx before migrating to the nvidia/cccl repo).

What's Changed

New Contributors

Full Changelog: https://github.com/NVIDIA/cccl/commits/v2.2.0