[FEA]: Tabulate Output Iterator #1486

gevtushenko · 2024-03-05T09:05:33Z

Is this a duplicate?

I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

Thrust

Is your feature request related to a problem? Please describe.

There's an open question if Kokkos scan API can be implemented in Thrust/CUB. Kokkos provides operator with is_final parameter and an index in input sequence. This allows implementing prefix sum algorithms that pack data into multiple arrays based on runtime index. This is somewhat similar to N-way partition, but is not limited to it.

Describe the solution you'd like

I believe this functionality can be implemented with a new output iterator on Thrust end. This iterator should provide a binary operator with an index in the output sequence and an incoming value:

auto it = make_tabulate_output_iterator([](int index, double value) { print(index, value); });

*(it + 1) = 42; // prints 1, 42
*(it + 2) = 43; // prints 2, 43

This iterator might be used to implement N-way partition. After discussing this with @elstehle, it might also be helpful in our tests of large problem sizes.

Describe alternatives you've considered

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

jrhemstad · 2024-03-05T13:35:33Z

auto it = make_tabulate_output_iterator([](int index, double value) { print(index, value); });
*(it + 1) = 42; // prints 1, 42
*(it + 2) = 43; // prints 2, 43

I understand this example, but I don't grok the connection to the Kokkos scan. Can you give a sketch of how one would use this with scan to do something you can't do today?

gevtushenko · 2024-03-05T23:56:11Z

@jrhemstad here's a use case for is_final that I had in mind:

constexpr int N = 3;
using offsets_t = cuda::std::array<int, N>;
using pointers_t = cuda::std::array<int*, N>;

__host__ __device__ int pack_index(int index) {
  // Some data-dependent computation here
  return index % N; 
}

struct output_op_t {
  int n;
  int *input;
  offsets_t *counts;
  pointers_t pointers;

  __host__ __device__ void operator()(int index, offsets_t offsets) {
    pointers[pack_index(index)][offsets[pack_index(index)]] = input[index];

    if (index == n - 1) {
      counts[0] = offsets;
    }
  }
};

int main() {
  const int n = 8;
  pointers_t pointers;
  thrust::device_vector<int> vi = { 7, 6, 5, 4, 3, 2, 1, 0 };
  cuda::std::array<thrust::device_vector<int>, N> packs;
  thrust::device_vector<offsets_t> counter(1);
  int *pi = thrust::raw_pointer_cast(vi.data());
  offsets_t *pc = thrust::raw_pointer_cast(counter.data());
  for (int i = 0; i < N; i++) {
    packs[i].resize(n);
    pointers[i] = thrust::raw_pointer_cast(packs[i].data());
  }

  auto input_iterator = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                        [] __host__ __device__(int index) {
                                                          offsets_t offsets{};
                                                          offsets[pack_index(index)] = 1;
                                                          return offsets;
                                                        });

  thrust::exclusive_scan(thrust::device,
                         input_iterator,
                         input_iterator + n,
                         make_tabulate_output_iterator(output_op_t{n, pi, pc, pointers}),
                         offsets_t{},
                         [] __host__ __device__(offsets_t lhs, offsets_t rhs) -> offsets_t {
                           offsets_t result{};
                           for (int i = 0; i < N; i++) {
                             result[i] = lhs[i] + rhs[i];
                           }
                           return result;
                         });

  offsets_t counts = counter[0];
  std::cout << counts[0] << " / " << counts[1] << " / " << counts[2] << std::endl;

  for (int j = 0; j < N; j++) {
    for (int i = 0; i < counts[j]; i++) {
      std::cout << "p" << j << "[" << i << "]: " << static_cast<int>(packs[j][i]) << std::endl;
    }
  }
}

// prints:
// 3 / 2 / 2
// p0[0]: 7
// p0[1]: 4
// p0[2]: 1
// p1[0]: 6
// p1[1]: 3
// p2[0]: 5
// p2[1]: 2

The code packs input vector into a number of output buffers based on some query:

This is somewhat similar to N-way partition, but is not limited to it.

I saw is_final being used in this context. We can't place this packing inside binary operator, because we don't know when final accumulator is being computed by CUB. Since we write into output iterator once, transform output iterator is equivalent to injecting user code under is_final. What's missing is the output index. We can't solve this using transform output iterator, because we don't know the index in the output sequence.

I can imagine other use cases, for instance, discarding values for invalid elements.

jrhemstad · 2024-03-06T14:35:52Z

@karthikeyann this is very reiminiscent of what you had proposed a long time ago here: NVIDIA/thrust#1575

I'm curious for your thoughts on this. Would the tabulate_output_iterator described above fit the use case you'd originally had in mind for NVIDIA/thrust#1575 ?

karthikeyann · 2024-03-12T16:43:31Z

Yes. That name would be good fit.

gevtushenko added the feature request New feature or request. label Mar 5, 2024

github-project-automation bot added this to CCCL Mar 5, 2024

github-project-automation bot moved this to Todo in CCCL Mar 5, 2024

elstehle self-assigned this Aug 21, 2024

elstehle mentioned this issue Aug 23, 2024

Adds thrust::tabulate_output_iterator #2282

Merged

2 tasks

cccl-authenticator-app bot moved this from Todo to In Review in CCCL Aug 23, 2024

elstehle closed this as completed in #2282 Aug 25, 2024

github-project-automation bot moved this from In Review to Done in CCCL Aug 25, 2024

This was referenced Aug 28, 2024

Add support for large num_items to device_select.cuh #1422

Closed

Implement streaming approach for DeviceSelect and DevicePartition #2238

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA]: Tabulate Output Iterator #1486

[FEA]: Tabulate Output Iterator #1486

gevtushenko commented Mar 5, 2024

jrhemstad commented Mar 5, 2024

gevtushenko commented Mar 5, 2024

jrhemstad commented Mar 6, 2024

karthikeyann commented Mar 12, 2024

[FEA]: Tabulate Output Iterator #1486

[FEA]: Tabulate Output Iterator #1486

Comments

gevtushenko commented Mar 5, 2024

Is this a duplicate?

Area

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

jrhemstad commented Mar 5, 2024

gevtushenko commented Mar 5, 2024

jrhemstad commented Mar 6, 2024

karthikeyann commented Mar 12, 2024