[Thrust]: New single-pass run length decoding example #6470

brycelelbach · 2025-11-04T02:28:27Z

Adds a new Thrust example demonstrating single-pass run-length decoding using inclusive_scan with a transform_output_iterator.
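
A host-only sketch of the idea, assuming standard C++17 in place of Thrust: each run carries its start offset, a scan chains those offsets together, and the scan's output iterator expands every run as it is written, so no intermediate offset array is stored. The names `Run`, `ChainRuns`, `ExpandIterator`, and `decode` are hypothetical and are not taken from the PR's source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <vector>

// Hypothetical run record: a value, its repeat count, and its start offset
// in the decoded output.
struct Run
{
  int value;
  std::size_t count;
  std::size_t offset;
};

// Scan operator: the right-hand run starts where the left-hand run ends.
// Offsets only ever add up, so the operator is associative.
struct ChainRuns
{
  Run operator()(Run const& left, Run const& right) const
  {
    return Run{right.value, right.count, left.offset + left.count + right.offset};
  }
};

// Output iterator that expands each scanned run into `out` instead of storing it.
class ExpandIterator
{
public:
  using iterator_category = std::output_iterator_tag;
  using value_type        = void;
  using difference_type   = std::ptrdiff_t;
  using pointer           = void;
  using reference         = void;

  explicit ExpandIterator(std::vector<int>* out) : out_(out) {}

  ExpandIterator& operator*() { return *this; }
  ExpandIterator& operator++() { return *this; }
  ExpandIterator operator++(int) { return *this; }

  ExpandIterator& operator=(Run const& r)
  {
    // Clamp the end of the run so it never writes past the output buffer.
    std::size_t const end = std::min(r.offset + r.count, out_->size());
    for (std::size_t i = r.offset; i < end; ++i)
    {
      (*out_)[i] = r.value;
    }
    return *this;
  }

private:
  std::vector<int>* out_;
};

// Decodes (values, counts) into a buffer of `out_size` elements.
std::vector<int> decode(std::vector<int> const& values,
                        std::vector<std::size_t> const& counts,
                        std::size_t out_size)
{
  assert(values.size() == counts.size());

  // Materialized here for brevity; a lazy iterator over both inputs would avoid this copy.
  std::vector<Run> runs;
  runs.reserve(values.size());
  for (std::size_t i = 0; i < values.size(); ++i)
  {
    runs.push_back(Run{values[i], counts[i], 0});
  }

  std::vector<int> out(out_size);
  std::inclusive_scan(runs.begin(), runs.end(), ExpandIterator{&out}, ChainRuns{});
  return out;
}
```

With this sketch, `decode({7, 8, 9}, {2, 0, 3}, 5)` yields `7 7 9 9 9`, and a smaller `out_size` truncates the last run rather than overflowing.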

Conor thought it couldn't be done, but once again I have prevailed!

Episode 255: 🇩🇰 C++ Copenhagen Meetup & Replicate
Episode 256: 🇩🇰 Algorithms: Replicate, Scatter, Gather & RLD (Part 2)
Episode 257: 🇳🇴 Live from Norway! Replicate, Scatter, Gather & RLD (Part 3)

If you code review this PR, you might be featured in part 4!

brycelelbach · 2025-11-04T02:33:41Z

Here's a reproducer for the NVCC 13 + GCC 14 bug. I filed it upstream.

Examples should be exemplars, so I think we should leave the lovely ranges code and live with masking off the bad config instead of doing a raw for loop or something.

// Minimal reproducer for CUDA 13.0 nvcc + GCC 14 libstdc++ ranges incompatibility
//
// BUG DESCRIPTION:
// ================
// CUDA 13.0's nvcc fails to compile C++20 ranges code involving join_view with
// nested transform_view when using GCC 14's libstdc++. The same code compiles
// successfully with GCC 13's libstdc++, and GCC 14 alone can compile it without nvcc.
//
// ENVIRONMENT:
// ============
// - CUDA: 13.0.88
// - Host Compiler: GCC 14.2.0
// - Platform: Linux x86_64
// - libstdc++: GCC 14.2.0
//
// COMPILATION COMMANDS:
// =====================
// FAILS with nvcc + GCC 14:
//   nvcc -std=c++20 nvcc_ranges_join_view_bug.cu
//
// WORKS with nvcc + GCC 13:
//   nvcc -std=c++20 nvcc_ranges_join_view_bug.cu
//   (using GCC 13.3.0 as host compiler)
//
// WORKS with GCC 14 alone (not through nvcc):
//   g++ -std=c++20 nvcc_ranges_join_view_bug.cu
//
// ERROR MESSAGE:
// ==============
// /usr/include/c++/14/ranges(3164): error: no instance of constructor
// "std::ranges::join_view<_Vp>::_Iterator<_Const>::_Iterator [with
// _Vp=std::ranges::transform_view<std::ranges::iota_view<unsigned long,
// std::unreachable_sentinel_t>, lambda [](auto)->auto>, _Const=false]"
// matches the argument list
//
// The error indicates that nvcc's concepts evaluation fails to match any of the
// constrained constructors for join_view::_Iterator, even though one should match
// (the one with `requires forward_range<_Base>`).
//
// ROOT CAUSE:
// ===========
// CUDA 13.0's nvcc has incomplete C++20 concepts support. When evaluating the
// constraints on join_view::_Iterator constructors with complex nested types
// (transform_view containing lambdas), nvcc's concepts subsumption logic fails.
// GCC 14's libstdc++ has a more sophisticated ranges implementation than GCC 13,
// which triggers this bug.

#include <ranges>
#include <iostream>

int main() {
    using CountType = unsigned long;
    constexpr CountType repeat = 4;
    
    // Create a nested range: each element of the outer range is itself a range
    // This creates a 2D structure that needs to be flattened with join_view
    auto nested = std::views::iota(CountType{0}, CountType{8})
                  | std::views::transform([=](auto x) {
                      return std::views::iota(CountType{0}, repeat)
                          | std::views::transform([=](auto){ return x; });
                    });
    
    // join_view should flatten the nested ranges
    // This line triggers the compilation error with nvcc + GCC 14
    auto flattened = nested | std::views::join;
    
    // Attempting to get begin() iterator instantiates the problematic constructor
    auto it = flattened.begin();
    
    std::cout << "First element: " << *it << std::endl;
    return 0;
}

// EXPECTED BEHAVIOR:
// ==================
// The code should compile successfully and print "First element: 0"
//
// ACTUAL BEHAVIOR:
// ================
// Compilation fails with nvcc + GCC 14 due to failed concepts constraint evaluation
// in join_view::_Iterator constructor selection.
//
// WORKAROUND:
// ===========
// Use GCC 13 or earlier as the host compiler with CUDA 13.0.

copy-pr-bot · 2025-11-04T06:21:35Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

miscco · 2025-11-04T07:13:08Z

thrust/examples/run_length_decoding_single_pass.cu

+  template <typename ValueType>
+  __host__ __device__ CountType operator()(run<ValueType, CountType> r) const
+  {
+    cuda::std::size_t end = cuda::minimum()(r.offset + r.count, out_size);


This could just be

Suggested change

-    cuda::std::size_t end = cuda::minimum()(r.offset + r.count, out_size);
+    cuda::std::size_t end = cuda::std::min(r.offset + r.count, out_size);

miscco · 2025-11-04T07:14:00Z

thrust/examples/run_length_decoding_single_pass.cu

+  run(run const& other)            = default;
+  run& operator=(run const& other) = default;


Nitpick: What is the purpose here? Do you want to delete the move operations?
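
For context on this question, a host-only sketch with hypothetical types (not the PR's `run`): user-declaring the copy operations, even as `= default`, means no move operations are implicitly declared, so rvalues fall back to the copy operations instead of hitting a deleted function. The `std::string` member is an assumption added only to make the difference observable through type traits.

```cpp
#include <string>
#include <type_traits>

// Hypothetical type mirroring the pattern under review: copy operations are
// explicitly defaulted, so no move constructor or move assignment is declared.
struct WithDeclaredCopy
{
  std::string payload;
  WithDeclaredCopy()                                   = default;
  WithDeclaredCopy(WithDeclaredCopy const&)            = default;
  WithDeclaredCopy& operator=(WithDeclaredCopy const&) = default;
};

// Hypothetical counterpart with no user-declared special members.
struct RuleOfZero
{
  std::string payload;
};

// Still "move constructible": an rvalue binds to the copy constructor.
static_assert(std::is_move_constructible_v<WithDeclaredCopy>);
// That "move" is really a copy, and copying a std::string may throw.
static_assert(!std::is_nothrow_move_constructible_v<WithDeclaredCopy>);
// The rule-of-zero type gets a genuine, noexcept move constructor.
static_assert(std::is_nothrow_move_constructible_v<RuleOfZero>);
```

For a struct that holds only scalars, copying and moving cost the same, so the declarations change nothing observable and can simply be dropped.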

miscco · 2025-11-04T07:15:05Z

thrust/examples/run_length_decoding_single_pass.cu

+  auto runs = thrust::make_transform_iterator(
+    thrust::make_zip_iterator(values, counts), [] __host__ __device__(thrust::tuple<ValueType, CountType> tup) {
+      return run{thrust::get<0>(tup), thrust::get<1>(tup)};
+    });


We do have cuda::zip_transform_iterator, just saying

miscco · 2025-11-04T07:16:52Z

thrust/examples/run_length_decoding_single_pass.cu

+  thrust::device_vector<CountType> counts(size);
+  thrust::fill(thrust::device, counts.begin(), counts.end(), repeat);


This should rather be

Suggested change

-  thrust::device_vector<CountType> counts(size);
-  thrust::fill(thrust::device, counts.begin(), counts.end(), repeat);
+  thrust::device_vector<CountType> counts(size, repeat);

miscco · 2025-11-04T07:18:00Z

thrust/examples/run_length_decoding_single_pass.cu

+  }
+  std::cout << std::endl;
+
+  auto gold = std::views::iota(CountType{0}) | std::views::transform([=](auto x) {


We do have cuda::std::iota_view and cuda::std::transform_view at home

They are even available in C++17

github-actions · 2025-11-04T07:21:08Z

🥳 CI Workflow Results

🟩 Finished in 2h 51m: Pass: 100%/70 | Total: 9h 57m | Max: 52m 58s | Hits: 99%/115298

See results here.

bernhardmgruber · 2025-11-04T07:52:33Z

thrust/examples/run_length_decoding_single_pass.cu

+  __host__ __device__ run(ValueType value = ValueType{}, CountType count = 0, CountType offset = 0, CountType run_id = 1)
+      : value(value)
+      , count(count)
+      , offset(offset)
+      , run_id(run_id)
+  {}
+  run(run const& other)            = default;
+  run& operator=(run const& other) = default;


Suggestion: delete special ops and just use aggregate init
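
A minimal sketch of what this suggestion could look like, assuming the constructor's default arguments become default member initializers. `run_record` and `make_example_run` are hypothetical names; the `__host__ __device__` annotations are omitted so the snippet compiles as plain C++17.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for the example's run type: no constructors and no
// special member functions, only default member initializers.
template <typename ValueType, typename CountType>
struct run_record
{
  ValueType value{};
  CountType count  = 0;
  CountType offset = 0;
  CountType run_id = 1;
};

// Without user-declared constructors the type is an aggregate.
static_assert(std::is_aggregate_v<run_record<int, std::size_t>>);

// Members left out of the braces take their default member initializers.
constexpr run_record<char, std::size_t> make_example_run()
{
  return run_record<char, std::size_t>{'a', 3};
}
```

One caveat: writing `run{v, c}` without template arguments relies on class template argument deduction, which works for aggregates only from C++20 onward, so this sketch spells the arguments out.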

bernhardmgruber · 2025-11-04T07:52:47Z

thrust/examples/run_length_decoding_single_pass.cu

+  __host__ __device__
+  expand(OutputIterator out, CountType out_size, CountType runs_size, ExpandedSizeIterator expanded_size)
+      : out(out)
+      , out_size(out_size)
+      , runs_size(runs_size)
+      , expanded_size(expanded_size)
+  {}


bernhardmgruber · 2025-11-04T07:54:31Z

thrust/examples/run_length_decoding_single_pass.cu

+  auto runs = thrust::make_transform_iterator(
+    thrust::make_zip_iterator(values, counts), [] __host__ __device__(thrust::tuple<ValueType, CountType> tup) {
+      return run{thrust::get<0>(tup), thrust::get<1>(tup)};
+    });


Suggestion: use zip_function to unpack the tuple

Suggested change

-  auto runs = thrust::make_transform_iterator(
-    thrust::make_zip_iterator(values, counts), [] __host__ __device__(thrust::tuple<ValueType, CountType> tup) {
-      return run{thrust::get<0>(tup), thrust::get<1>(tup)};
-    });
+  auto runs = thrust::make_transform_iterator(
+    thrust::make_zip_iterator(values, counts), thrust::make_zip_function([] __host__ __device__(ValueType v, CountType c) {
+      return run{v, c};
+    }));

bernhardmgruber · 2025-11-04T07:55:39Z

thrust/examples/run_length_decoding_single_pass.cu

+  auto expand_out = thrust::make_transform_output_iterator(
+    thrust::make_discard_iterator(), expand(out, out_size, runs_size, expanded_size.begin()));


Suggestion: I would find a braced-init list easier to read here:

Suggested change

-  auto expand_out = thrust::make_transform_output_iterator(
-    thrust::make_discard_iterator(), expand(out, out_size, runs_size, expanded_size.begin()));
+  auto expand_out = thrust::make_transform_output_iterator(
+    thrust::make_discard_iterator(), expand{out, out_size, runs_size, expanded_size.begin()});

Plus, you can drop the ctor of expand now.

bernhardmgruber · 2025-11-04T07:57:13Z

thrust/examples/run_length_decoding_single_pass.cu

+
+  if (static_cast<CountType>(out_end - output.begin()) != size * repeat)
+  {
+    throw int{};


Important: let's not show bad practices like throwing exceptions not derived from std::exception. Since this is in main, can we just print an error message and return something nonzero?

Applies one more time below.
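
A small sketch of the alternative described above, assuming the check stays in `main`: report the mismatch on `stderr` and hand back a nonzero exit code instead of throwing. `check_decoded_size` is a hypothetical helper name, not part of the PR.

```cpp
#include <cstddef>
#include <cstdlib>
#include <iostream>

// Hypothetical helper: prints a diagnostic on mismatch and returns a process
// exit code that main can forward directly.
int check_decoded_size(std::size_t actual, std::size_t expected)
{
  if (actual != expected)
  {
    std::cerr << "error: decoded " << actual << " elements, expected " << expected << '\n';
    return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}
```

In `main` this would replace `throw int{};` with something like `if (int rc = check_decoded_size(n, size * repeat); rc != EXIT_SUCCESS) return rc;`, where `n` stands for the decoded element count.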

brycelelbach requested review from a team as code owners November 4, 2025 02:28

brycelelbach requested a review from gonidelis November 4, 2025 02:28

github-project-automation bot added this to CCCL Nov 4, 2025

brycelelbach requested a review from elstehle November 4, 2025 02:28

github-project-automation bot moved this to Todo in CCCL Nov 4, 2025

cccl-authenticator-app bot moved this from Todo to In Review in CCCL Nov 4, 2025

brycelelbach force-pushed the pr/thrust/run-length-decode-single-pass-example branch from a450ace to 72a84ff Compare November 4, 2025 02:50

brycelelbach enabled auto-merge (squash) November 4, 2025 04:52

brycelelbach added 5 commits November 4, 2025 01:21

8c60a06 Initial implementation of a single pass run length decode.

5b31a3c Compute the expanded size.

832d700 [Thrust]: Move run length decoding prototype to examples, add output and build infrastructure, and disable the build for CUDA 13 + GCC 14.

dc68e5a [Thrust]: Fix typos and formatting for pre-commit checks.

1930824 [Thrust]: Disable run length decoding single pass example for C++17 builds, requires C++20 ranges.

brycelelbach force-pushed the pr/thrust/run-length-decode-single-pass-example branch from 415ac66 to 1930824 Compare November 4, 2025 06:21

miscco requested changes Nov 4, 2025

github-project-automation bot moved this from In Review to In Progress in CCCL Nov 4, 2025

bernhardmgruber reviewed Nov 4, 2025
