Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA]: Tabulate Output Iterator #1486

Closed
1 task done
Tracked by #1422
gevtushenko opened this issue Mar 5, 2024 · 4 comments · Fixed by #2282
Closed
1 task done
Tracked by #1422

[FEA]: Tabulate Output Iterator #1486

gevtushenko opened this issue Mar 5, 2024 · 4 comments · Fixed by #2282
Assignees
Labels
feature request New feature or request.

Comments

@gevtushenko
Copy link
Collaborator

Is this a duplicate?

Area

Thrust

Is your feature request related to a problem? Please describe.

There's an open question if Kokkos scan API can be implemented in Thrust/CUB. Kokkos provides operator with is_final parameter and an index in input sequence. This allows implementing prefix sum algorithms that pack data into multiple arrays based on runtime index. This is somewhat similar to N-way partition, but is not limited to it.

Describe the solution you'd like

I believe this functionality can be implemented with a new output iterator on Thrust end. This iterator should provide a binary operator with an index in the output sequence and an incoming value:

auto it = make_tabulate_output_iterator([](int index, double value) { print(index, value); });

*(it + 1) = 42; // prints 1, 42
*(it + 2) = 43; // prints 2, 43

This iterator might be used to implement N-way partition. After discussing this with @elstehle, it might also be helpful in our tests of large problem sizes.

Describe alternatives you've considered

No response

Additional context

No response

@gevtushenko gevtushenko added the feature request New feature or request. label Mar 5, 2024
@github-project-automation github-project-automation bot moved this to Todo in CCCL Mar 5, 2024
@jrhemstad
Copy link
Collaborator

auto it = make_tabulate_output_iterator([](int index, double value) { print(index, value); });
*(it + 1) = 42; // prints 1, 42
*(it + 2) = 43; // prints 2, 43

I understand this example, but I don't grok the connection to the Kokkos scan. Can you give a sketch of how one would use this with scan to do something you can't do today?

@gevtushenko
Copy link
Collaborator Author

@jrhemstad here's a use case for is_final that I had in mind:

constexpr int N = 3;
using offsets_t = cuda::std::array<int, N>;
using pointers_t = cuda::std::array<int*, N>;

__host__ __device__ int pack_index(int index) {
  // Some data-dependent computation here
  return index % N; 
}

struct output_op_t {
  int n;
  int *input;
  offsets_t *counts;
  pointers_t pointers;

  __host__ __device__ void operator()(int index, offsets_t offsets) {
    pointers[pack_index(index)][offsets[pack_index(index)]] = input[index];

    if (index == n - 1) {
      counts[0] = offsets;
    }
  }
};

int main() {
  const int n = 8;
  pointers_t pointers;
  thrust::device_vector<int> vi = { 7, 6, 5, 4, 3, 2, 1, 0 };
  cuda::std::array<thrust::device_vector<int>, N> packs;
  thrust::device_vector<offsets_t> counter(1);
  int *pi = thrust::raw_pointer_cast(vi.data());
  offsets_t *pc = thrust::raw_pointer_cast(counter.data());
  for (int i = 0; i < N; i++) {
    packs[i].resize(n);
    pointers[i] = thrust::raw_pointer_cast(packs[i].data());
  }

  auto input_iterator = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                        [] __host__ __device__(int index) {
                                                          offsets_t offsets{};
                                                          offsets[pack_index(index)] = 1;
                                                          return offsets;
                                                        });

  thrust::exclusive_scan(thrust::device,
                         input_iterator,
                         input_iterator + n,
                         make_tabulate_output_iterator(output_op_t{n, pi, pc, pointers}),
                         offsets_t{},
                         [] __host__ __device__(offsets_t lhs, offsets_t rhs) -> offsets_t {
                           offsets_t result{};
                           for (int i = 0; i < N; i++) {
                             result[i] = lhs[i] + rhs[i];
                           }
                           return result;
                         });

  offsets_t counts = counter[0];
  std::cout << counts[0] << " / " << counts[1] << " / " << counts[2] << std::endl;

  for (int j = 0; j < N; j++) {
    for (int i = 0; i < counts[j]; i++) {
      std::cout << "p" << j << "[" << i << "]: " << static_cast<int>(packs[j][i]) << std::endl;
    }
  }
}

// prints:
// 3 / 2 / 2
// p0[0]: 7
// p0[1]: 4
// p0[2]: 1
// p1[0]: 6
// p1[1]: 3
// p2[0]: 5
// p2[1]: 2

The code packs input vector into a number of output buffers based on some query:

This is somewhat similar to N-way partition, but is not limited to it.

I saw is_final being used in this context. We can't place this packing inside binary operator, because we don't know when final accumulator is being computed by CUB. Since we write into output iterator once, transform output iterator is equivalent to injecting user code under is_final. What's missing is the output index. We can't solve this using transform output iterator, because we don't know the index in the output sequence.

I can imagine other use cases, for instance, discarding values for invalid elements.

@jrhemstad
Copy link
Collaborator

@karthikeyann this is very reiminiscent of what you had proposed a long time ago here: NVIDIA/thrust#1575

I'm curious for your thoughts on this. Would the tabulate_output_iterator described above fit the use case you'd originally had in mind for NVIDIA/thrust#1575 ?

@karthikeyann
Copy link
Contributor

Yes. That name would be good fit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request.
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants