-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA]: Tabulate Output Iterator #1486
Comments
I understand this example, but I don't grok the connection to the Kokkos scan. Can you give a sketch of how one would use this with scan to do something you can't do today? |
@jrhemstad here's a use case for constexpr int N = 3;
using offsets_t = cuda::std::array<int, N>;
using pointers_t = cuda::std::array<int*, N>;
__host__ __device__ int pack_index(int index) {
// Some data-dependent computation here
return index % N;
}
struct output_op_t {
int n;
int *input;
offsets_t *counts;
pointers_t pointers;
__host__ __device__ void operator()(int index, offsets_t offsets) {
pointers[pack_index(index)][offsets[pack_index(index)]] = input[index];
if (index == n - 1) {
counts[0] = offsets;
}
}
};
int main() {
const int n = 8;
pointers_t pointers;
thrust::device_vector<int> vi = { 7, 6, 5, 4, 3, 2, 1, 0 };
cuda::std::array<thrust::device_vector<int>, N> packs;
thrust::device_vector<offsets_t> counter(1);
int *pi = thrust::raw_pointer_cast(vi.data());
offsets_t *pc = thrust::raw_pointer_cast(counter.data());
for (int i = 0; i < N; i++) {
packs[i].resize(n);
pointers[i] = thrust::raw_pointer_cast(packs[i].data());
}
auto input_iterator = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
[] __host__ __device__(int index) {
offsets_t offsets{};
offsets[pack_index(index)] = 1;
return offsets;
});
thrust::exclusive_scan(thrust::device,
input_iterator,
input_iterator + n,
make_tabulate_output_iterator(output_op_t{n, pi, pc, pointers}),
offsets_t{},
[] __host__ __device__(offsets_t lhs, offsets_t rhs) -> offsets_t {
offsets_t result{};
for (int i = 0; i < N; i++) {
result[i] = lhs[i] + rhs[i];
}
return result;
});
offsets_t counts = counter[0];
std::cout << counts[0] << " / " << counts[1] << " / " << counts[2] << std::endl;
for (int j = 0; j < N; j++) {
for (int i = 0; i < counts[j]; i++) {
std::cout << "p" << j << "[" << i << "]: " << static_cast<int>(packs[j][i]) << std::endl;
}
}
}
// prints:
// 3 / 2 / 2
// p0[0]: 7
// p0[1]: 4
// p0[2]: 1
// p1[0]: 6
// p1[1]: 3
// p2[0]: 5
// p2[1]: 2 The code packs input vector into a number of output buffers based on some query:
I saw I can imagine other use cases, for instance, discarding values for invalid elements. |
@karthikeyann this is very reiminiscent of what you had proposed a long time ago here: NVIDIA/thrust#1575 I'm curious for your thoughts on this. Would the |
Yes. That name would be good fit. |
Is this a duplicate?
Area
Thrust
Is your feature request related to a problem? Please describe.
There's an open question if Kokkos scan API can be implemented in Thrust/CUB. Kokkos provides operator with
is_final
parameter and an index in input sequence. This allows implementing prefix sum algorithms that pack data into multiple arrays based on runtime index. This is somewhat similar to N-way partition, but is not limited to it.Describe the solution you'd like
I believe this functionality can be implemented with a new output iterator on Thrust end. This iterator should provide a binary operator with an index in the output sequence and an incoming value:
This iterator might be used to implement N-way partition. After discussing this with @elstehle, it might also be helpful in our tests of large problem sizes.
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: