@brycelelbach brycelelbach commented Nov 4, 2025

Adds a new Thrust example demonstrating single-pass run-length decoding using inclusive_scan with a transform_output_iterator.
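
For readers who want the shape of the idea before opening the diff, here is a host-only, sequential sketch. It is not the code in this PR: `std::inclusive_scan` stands in for the Thrust scan, a hand-written output iterator stands in for `transform_output_iterator` over a discard iterator, and the names `Run`, `CombineRuns`, `ExpandingIterator` and `run_length_decode` are hypothetical. The scan operator accumulates output offsets, and each scanned run is expanded at the moment it is written, so no separate offsets pass is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <vector>

// Hypothetical run record (names are illustrative, not taken from the PR).
struct Run {
    int value;           // value to repeat
    std::size_t count;   // how many times to repeat it
    std::size_t offset;  // start position in the output (0 before the scan)
};

// Associative scan operator: keep the right-hand run and accumulate offsets,
// so after the scan every run knows where it starts in the output.
struct CombineRuns {
    Run operator()(Run const& lhs, Run const& rhs) const {
        return Run{rhs.value, rhs.count, lhs.offset + lhs.count + rhs.offset};
    }
};

// Output iterator that expands each scanned run into `out` as it is assigned.
class ExpandingIterator {
public:
    using iterator_category = std::output_iterator_tag;
    using value_type        = void;
    using difference_type   = std::ptrdiff_t;
    using pointer           = void;
    using reference         = void;

    explicit ExpandingIterator(std::vector<int>& out) : out_(&out) {}

    ExpandingIterator& operator*() { return *this; }
    ExpandingIterator& operator++() { return *this; }
    ExpandingIterator operator++(int) { return *this; }

    ExpandingIterator& operator=(Run const& r) {
        for (std::size_t i = 0; i < r.count; ++i) {
            std::size_t pos = r.offset + i;
            if (pos < out_->size()) {  // clamp to the output size
                (*out_)[pos] = r.value;
            }
        }
        return *this;
    }

private:
    std::vector<int>* out_;
};

// Decodes (values[i], counts[i]) pairs into at most `out_size` elements.
std::vector<int> run_length_decode(std::vector<int> const& values,
                                   std::vector<std::size_t> const& counts,
                                   std::size_t out_size) {
    assert(values.size() == counts.size());
    std::vector<Run> runs;
    for (std::size_t i = 0; i < values.size(); ++i) {
        runs.push_back(Run{values[i], counts[i], 0});
    }
    std::vector<int> out(out_size);
    std::inclusive_scan(runs.begin(), runs.end(), ExpandingIterator(out), CombineRuns{});
    return out;
}
```

For example, `run_length_decode({7, 8, 9}, {2, 0, 3}, 5)` yields `7 7 9 9 9`. The offset term `rhs.offset` in the operator is what keeps it associative, which matters once the scan runs in parallel.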

Conor thought it couldn't be done, but once again I have prevailed!

Episode 255: 🇩🇰 C++ Copenhagen Meetup & Replicate
Episode 256: 🇩🇰 Algorithms: Replicate, Scatter, Gather & RLD (Part 2)
Episode 257: 🇳🇴 Live from Norway! Replicate, Scatter, Gather & RLD (Part 3)

If you code review this PR, you might be featured in part 4!

@brycelelbach brycelelbach requested review from a team as code owners November 4, 2025 02:28
@brycelelbach brycelelbach requested a review from elstehle November 4, 2025 02:28
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 4, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Nov 4, 2025
@brycelelbach commented

Here's a reproducer for the NVCC 13 + GCC 14 bug. I filed it upstream.

Examples should be exemplars, so I think we should leave the lovely ranges code and live with masking off the bad config instead of doing a raw for loop or something.

// Minimal reproducer for CUDA 13.0 nvcc + GCC 14 libstdc++ ranges incompatibility
//
// BUG DESCRIPTION:
// ================
// CUDA 13.0's nvcc fails to compile C++20 ranges code involving join_view with
// nested transform_view when using GCC 14's libstdc++. The same code compiles
// successfully with GCC 13's libstdc++, and GCC 14 alone can compile it without nvcc.
//
// ENVIRONMENT:
// ============
// - CUDA: 13.0.88
// - Host Compiler: GCC 14.2.0
// - Platform: Linux x86_64
// - libstdc++: GCC 14.2.0
//
// COMPILATION COMMANDS:
// =====================
// FAILS with nvcc + GCC 14:
//   nvcc -std=c++20 nvcc_ranges_join_view_bug.cu
//
// WORKS with nvcc + GCC 13:
//   nvcc -std=c++20 nvcc_ranges_join_view_bug.cu
//   (using GCC 13.3.0 as host compiler)
//
// WORKS with GCC 14 alone (not through nvcc):
//   g++ -std=c++20 nvcc_ranges_join_view_bug.cu
//
// ERROR MESSAGE:
// ==============
// /usr/include/c++/14/ranges(3164): error: no instance of constructor
// "std::ranges::join_view<_Vp>::_Iterator<_Const>::_Iterator [with
// _Vp=std::ranges::transform_view<std::ranges::iota_view<unsigned long,
// std::unreachable_sentinel_t>, lambda [](auto)->auto>, _Const=false]"
// matches the argument list
//
// The error indicates that nvcc's concepts evaluation fails to match any of the
// constrained constructors for join_view::_Iterator, even though one should match
// (the one with `requires forward_range<_Base>`).
//
// ROOT CAUSE:
// ===========
// CUDA 13.0's nvcc has incomplete C++20 concepts support. When evaluating the
// constraints on join_view::_Iterator constructors with complex nested types
// (transform_view containing lambdas), nvcc's concepts subsumption logic fails.
// GCC 14's libstdc++ has a more sophisticated ranges implementation than GCC 13,
// which triggers this bug.

#include <ranges>
#include <iostream>

int main() {
    using CountType = unsigned long;
    constexpr CountType repeat = 4;
    
    // Create a nested range: each element of the outer range is itself a range
    // This creates a 2D structure that needs to be flattened with join_view
    auto nested = std::views::iota(CountType{0}, CountType{8})
                  | std::views::transform([=](auto x) {
                      return std::views::iota(CountType{0}, repeat)
                          | std::views::transform([=](auto){ return x; });
                    });
    
    // join_view should flatten the nested ranges
    // This line triggers the compilation error with nvcc + GCC 14
    auto flattened = nested | std::views::join;
    
    // Attempting to get begin() iterator instantiates the problematic constructor
    auto it = flattened.begin();
    
    std::cout << "First element: " << *it << std::endl;
    return 0;
}

// EXPECTED BEHAVIOR:
// ==================
// The code should compile successfully and print "First element: 0"
//
// ACTUAL BEHAVIOR:
// ================
// Compilation fails with nvcc + GCC 14 due to failed concepts constraint evaluation
// in join_view::_Iterator constructor selection.
//
// WORKAROUND:
// ===========
// Use GCC 13 or earlier as the host compiler with CUDA 13.0.
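
A rough sketch of what "masking off the bad config" could look like, under some assumptions: the macro name `EXAMPLE_SKIP_RANGES_CHECK` and the function `check_decoded` are hypothetical, and the plain loop is only a stand-in for the ranges-based reference so that the sketch builds without C++20. `__CUDACC_VER_MAJOR__` (set by nvcc) and `__GNUC__` (set by GCC) are the real predefined macros.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical guard: true only for the nvcc 13 + GCC 14 combination.
#if defined(__CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ == 13) \
    && defined(__GNUC__) && (__GNUC__ == 14)
#  define EXAMPLE_SKIP_RANGES_CHECK 1
#else
#  define EXAMPLE_SKIP_RANGES_CHECK 0
#endif

// Returns true when the decoded output matches the reference, or when the
// check is masked off for the affected toolchain.
bool check_decoded(std::vector<int> const& decoded, std::size_t repeat) {
    assert(repeat != 0);
#if EXAMPLE_SKIP_RANGES_CHECK
    (void) decoded;
    return true;  // affected toolchain: skip the check instead of rewriting it
#else
    // Stand-in for the ranges-based reference: element i should be i / repeat.
    for (std::size_t i = 0; i < decoded.size(); ++i) {
        if (decoded[i] != static_cast<int>(i / repeat)) {
            return false;
        }
    }
    return true;
#endif
}
```

The guard keeps the check intact on every other configuration, at the cost of one unverified toolchain until the upstream fix lands.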

@brycelelbach brycelelbach force-pushed the pr/thrust/run-length-decode-single-pass-example branch from a450ace to 72a84ff Compare November 4, 2025 02:50
@brycelelbach brycelelbach enabled auto-merge (squash) November 4, 2025 04:52
@brycelelbach brycelelbach force-pushed the pr/thrust/run-length-decode-single-pass-example branch from 415ac66 to 1930824 Compare November 4, 2025 06:21
copy-pr-bot bot commented Nov 4, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

template <typename ValueType>
__host__ __device__ CountType operator()(run<ValueType, CountType> r) const
{
cuda::std::size_t end = cuda::minimum()(r.offset + r.count, out_size);

This could just be

Suggested change
- cuda::std::size_t end = cuda::minimum()(r.offset + r.count, out_size);
+ cuda::std::size_t end = cuda::std::min(r.offset + r.count, out_size);

Comment on lines +38 to +39
run(run const& other) = default;
run& operator=(run const& other) = default;

Nitpick: What is the purpose here? Do you want to suppress the implicit move operations?
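
To make the effect concrete, here is a small host-only sketch; `Probe`, `CopyDeclared`, `RuleOfZero` and `demonstrate_move_fallback` are hypothetical names, not types from the PR. User-declared copy operations, even defaulted ones, stop the compiler from generating move operations, so moving from such a type quietly falls back to a copy.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Records which constructor produced it.
struct Probe {
    int copies = 0;
    int moves  = 0;
    Probe() = default;
    Probe(Probe const& other) : copies(other.copies + 1), moves(other.moves) {}
    Probe(Probe&& other) noexcept : copies(other.copies), moves(other.moves + 1) {}
    Probe& operator=(Probe const&) = default;
    Probe& operator=(Probe&&) = default;
};

// Copy operations are user-declared (although defaulted), so no implicit
// move constructor or move assignment operator is generated.
struct CopyDeclared {
    Probe probe;
    CopyDeclared() = default;
    CopyDeclared(CopyDeclared const&) = default;
    CopyDeclared& operator=(CopyDeclared const&) = default;
};

// Nothing declared: the compiler generates copy and move operations.
struct RuleOfZero {
    Probe probe;
};

// Still constructible from an rvalue, but only through the copy constructor.
static_assert(std::is_move_constructible<CopyDeclared>::value,
              "rvalues bind to the copy constructor");

inline void demonstrate_move_fallback() {
    CopyDeclared a;
    CopyDeclared b(std::move(a));  // copies
    assert(b.probe.copies == 1 && b.probe.moves == 0);

    RuleOfZero c;
    RuleOfZero d(std::move(c));  // moves
    assert(d.probe.copies == 0 && d.probe.moves == 1);
}
```

For a struct of trivially copyable members the difference is not observable, which is why dropping the two declarations is usually the simpler choice.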

Comment on lines +83 to +86
auto runs = thrust::make_transform_iterator(
thrust::make_zip_iterator(values, counts), [] __host__ __device__(thrust::tuple<ValueType, CountType> tup) {
return run{thrust::get<0>(tup), thrust::get<1>(tup)};
});

We do have cuda::zip_transform_iterator, just saying

Comment on lines +114 to +115
thrust::device_vector<CountType> counts(size);
thrust::fill(thrust::device, counts.begin(), counts.end(), repeat);

This should rather be

Suggested change
- thrust::device_vector<CountType> counts(size);
- thrust::fill(thrust::device, counts.begin(), counts.end(), repeat);
+ thrust::device_vector<CountType> counts(size, repeat);

}
std::cout << std::endl;

auto gold = std::views::iota(CountType{0}) | std::views::transform([=](auto x) {
@miscco miscco Nov 4, 2025


We do have cuda::std::iota_view and cuda::std::transform_view at home


They are even available in C++17

@github-project-automation github-project-automation bot moved this from In Review to In Progress in CCCL Nov 4, 2025
github-actions bot commented Nov 4, 2025

🥳 CI Workflow Results

🟩 Finished in 2h 51m: Pass: 100%/70 | Total: 9h 57m | Max: 52m 58s | Hits: 99%/115298

See results here.

Comment on lines +32 to +39
__host__ __device__ run(ValueType value = ValueType{}, CountType count = 0, CountType offset = 0, CountType run_id = 1)
: value(value)
, count(count)
, offset(offset)
, run_id(run_id)
{}
run(run const& other) = default;
run& operator=(run const& other) = default;

Suggestion: delete special ops and just use aggregate init
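
A minimal sketch of the aggregate version, assuming the same four members and defaults as the constructor above; the template parameters and the helper `aggregate_init_demo` are assumptions made for illustration. Default member initializers replace the constructor's default arguments, and the copy operations come for free.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical aggregate form of the run record: no constructors, no
// user-declared copy operations, defaults expressed as member initializers.
template <typename ValueType, typename CountType>
struct run {
    ValueType value{};
    CountType count  = 0;
    CountType offset = 0;
    CountType run_id = 1;
};

static_assert(std::is_trivially_copyable<run<int, std::size_t>>::value,
              "copying stays trivial without user-declared copy operations");

inline void aggregate_init_demo() {
    // Trailing members that are not listed take their default initializers.
    run<char, std::size_t> r{'a', 3};
    assert(r.value == 'a' && r.count == 3 && r.offset == 0 && r.run_id == 1);

    run<char, std::size_t> empty{};
    assert(empty.count == 0 && empty.run_id == 1);
}
```

One caveat: without a constructor, writing `run{v, c}` with no template arguments relies on class template argument deduction for aggregates, which is a C++20 feature, or on an explicit deduction guide.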

Comment on lines +55 to +61
__host__ __device__
expand(OutputIterator out, CountType out_size, CountType runs_size, ExpandedSizeIterator expanded_size)
: out(out)
, out_size(out_size)
, runs_size(runs_size)
, expanded_size(expanded_size)
{}

Same here

Comment on lines +83 to +86
auto runs = thrust::make_transform_iterator(
thrust::make_zip_iterator(values, counts), [] __host__ __device__(thrust::tuple<ValueType, CountType> tup) {
return run{thrust::get<0>(tup), thrust::get<1>(tup)};
});

Suggestion: use zip_function to unpack the tuple

Suggested change
- auto runs = thrust::make_transform_iterator(
-   thrust::make_zip_iterator(values, counts), [] __host__ __device__(thrust::tuple<ValueType, CountType> tup) {
-     return run{thrust::get<0>(tup), thrust::get<1>(tup)};
-   });
+ auto runs = thrust::make_transform_iterator(
+   thrust::make_zip_iterator(values, counts), thrust::make_zip_function([] __host__ __device__(ValueType v, CountType c) {
+     return run{v, c};
+   }));
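
The unpacking idea can be sketched on the host with `std::apply`. This is only an analogy for what a zip-function adaptor does, not Thrust's implementation; `unpacking_function`, `make_unpacking_function`, `run_pair` and `make_run_from_tuple` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>

// Hypothetical adaptor: wraps an n-ary callable so that it can be invoked
// with a single tuple argument (host-only, C++17).
template <typename F>
struct unpacking_function {
    F f;

    template <typename Tuple>
    decltype(auto) operator()(Tuple&& t) const {
        return std::apply(f, std::forward<Tuple>(t));
    }
};

template <typename F>
unpacking_function<F> make_unpacking_function(F f) {
    return unpacking_function<F>{std::move(f)};
}

// Illustrative record built from the unpacked tuple elements.
struct run_pair {
    int value;
    std::size_t count;
};

inline run_pair make_run_from_tuple(std::tuple<int, std::size_t> const& t) {
    // The lambda names its arguments directly instead of calling get<N>.
    auto to_run = make_unpacking_function(
        [](int v, std::size_t c) { return run_pair{v, c}; });
    return to_run(t);
}

inline void unpacking_demo() {
    run_pair r = make_run_from_tuple(std::make_tuple(7, std::size_t{3}));
    assert(r.value == 7 && r.count == 3);
}
```

The payoff is readability: the callable's parameter list documents what each zipped element means.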

Comment on lines +93 to +94
auto expand_out = thrust::make_transform_output_iterator(
thrust::make_discard_iterator(), expand(out, out_size, runs_size, expanded_size.begin()));

Suggestion: I would find a braced-init list easier to read here:

Suggested change
- auto expand_out = thrust::make_transform_output_iterator(
-   thrust::make_discard_iterator(), expand(out, out_size, runs_size, expanded_size.begin()));
+ auto expand_out = thrust::make_transform_output_iterator(
+   thrust::make_discard_iterator(), expand{out, out_size, runs_size, expanded_size.begin()});

Plus, you can drop the ctor of expand now.
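
One thing to watch when dropping the constructor: if `expand` is a class template and its arguments are deduced at the call site, a constructor-less aggregate only gets implicit deduction in C++20. A sketch of how an explicit deduction guide covers that, assuming a trimmed-down two-member `expand` that is not the functor in this PR (`braced_init_demo` is likewise hypothetical):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, reduced functor: an aggregate with no constructor.
template <typename OutputIterator, typename CountType>
struct expand {
    OutputIterator out;
    CountType out_size;

    // Writes `value` at position `i` if it fits; reports whether it did.
    template <typename T>
    bool operator()(CountType i, T value) const {
        if (i >= out_size) {
            return false;
        }
        out[i] = value;
        return true;
    }
};

// Explicit deduction guide: lets `expand{ptr, n}` deduce its template
// arguments in C++17, where aggregates get no implicit guides.
template <typename OutputIterator, typename CountType>
expand(OutputIterator, CountType) -> expand<OutputIterator, CountType>;

inline void braced_init_demo() {
    int buffer[3] = {0, 0, 0};
    auto e = expand{buffer, std::size_t{3}};  // deduces expand<int*, std::size_t>
    assert(e(1, 42));
    assert(buffer[1] == 42);
    assert(!e(5, 1));  // out of range, nothing written
}
```

If the template arguments are always spelled out, the guide is unnecessary and plain aggregate initialization is enough.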


if (static_cast<CountType>(out_end - output.begin()) != size * repeat)
{
throw int{};

Important: let's not show bad practices like throwing exceptions not derived from std::exception. Since this is in main, can we just print an error message and return something nonzero?

Applies one more time below.
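
Something along these lines would do, as a sketch; the helper name `check_output_size` and the message wording are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <iostream>

// Hypothetical helper: reports a size mismatch on stderr and returns an exit
// code that main can pass straight through.
int check_output_size(std::size_t actual, std::size_t expected) {
    if (actual != expected) {
        std::cerr << "error: decoded " << actual << " elements, expected "
                  << expected << '\n';
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}

inline void exit_code_demo() {
    assert(check_output_size(32, 32) == EXIT_SUCCESS);
    assert(check_output_size(31, 32) == EXIT_FAILURE);  // also prints the message
}
```

In `main` this becomes `if (int rc = check_output_size(n, size * repeat); rc != EXIT_SUCCESS) return rc;`, which gives the reader a message to act on instead of an unhandled `int` exception.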
