Priority Queue #105

Open · wants to merge 56 commits into dev

Conversation

andrewbriand

Adds a GPU-accelerated priority queue

Allows for multiple concurrent insertions as well as multiple concurrent
deletions.

The implementation of the priority queue is based on https://arxiv.org/pdf/1906.06504.pdf.

The queue supports two operations:

  • push: add elements to the queue
  • pop: remove the element(s) with the lowest (when Max == false) or highest (when Max == true) keys

The priority queue supports bulk host-side operations and more fine-grained
device-side operations.

The host-side bulk operations push and pop allow an arbitrary number of
elements to be pushed to or popped from the queue.
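
For illustration, a host-side usage sketch. The names here (the constructor taking a capacity, push/pop taking iterator ranges) are assumptions inferred from this PR's description, not a confirmed API:

```cuda
#include <cuco/priority_queue.cuh>
#include <thrust/device_vector.h>

int main()
{
  // Assumed constructor taking the fixed capacity; the real
  // signature may differ in this PR.
  cuco::priority_queue<int> queue{10000};

  thrust::device_vector<int> keys(1000);
  // ... fill keys on the device ...

  // Bulk host-side push of an arbitrary number of elements.
  queue.push(keys.begin(), keys.end());

  // Bulk host-side pop of the 100 lowest keys (Max == false).
  thrust::device_vector<int> lowest(100);
  queue.pop(lowest.begin(), lowest.end());
}
```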

The device-side operations allow a cooperative group to push or pop
up to node_size elements. These device-side operations are invoked
through a trivially copyable device view, device_mutable_view, which
can be obtained with the host function get_mutable_device_view and
passed to the device.
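
A device-side sketch of how such a view might be used from a kernel. The view's member-function names and the shared-memory argument are assumptions based on this description and on get_shmem_size, not the PR's actual signatures:

```cuda
#include <cooperative_groups.h>

// Hypothetical kernel: one thread block cooperatively pushes up to
// node_size elements through the trivially-copyable view.
template <typename View, typename InputIt>
__global__ void push_one_node(View view, InputIt first, std::size_t n)
{
  extern __shared__ char shmem[];  // sized via get_shmem_size(block_size)
  auto g = cooperative_groups::this_thread_block();
  view.push(g, first, first + n, shmem);  // requires n <= node_size
}
```

This might be launched as push_one_node<<<1, block_size, queue.get_shmem_size(block_size)>>>(queue.get_mutable_device_view(), d_keys, n), again assuming the host-side names used in this PR.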

Current limitations:

  • Only supports trivially comparable key types
  • Does not support insertion and deletion at the same time
  • Capacity is fixed and the queue does not automatically resize
  • Deletion from the queue is much slower than insertion into the queue due to congestion at the underlying heap's root node

TODO: Port tests to Catch2 and benchmarks to Google Benchmark

@GPUtester

Can one of the admins verify this patch?

namespace cuco {

/*
* @brief A GPU-accelerated priority queue of key-value pairs
Collaborator

Is there a reason for this to be hardcoded for key-value pairs? Can't it be for any trivially copyable type T? e.g., with std::priority_queue I could have a std::priority_queue<int> or a std::priority_queue<std::pair<int,int>>.

Collaborator

Update docs now that this has been updated.

@jrhemstad
Collaborator

I reviewed the top level header at this point and gave some thoughts/questions on how to make this a little more generic.

@PointKernel added the labels topic: build, topic: performance, and type: feature request on Dec 3, 2021
@PointKernel
Member

ok to test

@andrewbriand
Author

@PointKernel Thanks for your comments! I believe that I have addressed or responded to them all. Please let me know what you think and what other comments you might have.

@PointKernel (Member) left a comment

Another round of review.

Thanks @andrewbriand for your effort and persistence on this PR! We are almost there.

~priority_queue();

class device_mutable_view {
public:
Member

Suggested change:
-  public:
+  public:
+   using value_type = T;

Author

Should I also replace references to T with value_type in device_mutable_view?

Member

That will be great!

Comment on lines +79 to +88
detail::push_kernel<<<num_blocks, block_size, get_shmem_size(block_size), stream>>>(
first,
last - first,
d_heap_,
d_size_,
node_size_,
d_locks_,
d_p_buffer_size_,
lowest_level_start_,
compare_);
Member

Suggested change:
- detail::push_kernel<<<num_blocks, block_size, get_shmem_size(block_size), stream>>>(
-   first,
-   last - first,
-   d_heap_,
-   d_size_,
-   node_size_,
-   d_locks_,
-   d_p_buffer_size_,
-   lowest_level_start_,
-   compare_);
+ auto view = get_device_mutable_view();
+ detail::push_kernel<<<num_blocks, block_size, get_shmem_size(block_size), stream>>>(
+   first, num_elements, view);

This is a great example showing the power of "view". Accordingly, the push_kernel would look like:

template <typename OutputIt, typename viewT>
__global__ void push_kernel(OutputIt elements,
                            std::size_t const num_elements,
                            viewT view)
{
  using T = typename viewT::value_type;
  ...
}

If you want, push_n_kernel instead of push_kernel would be a more descriptive name in this case.

Comment on lines +130 to +138
detail::push_single_node(g,
                         first + i * node_size_,
                         d_heap_,
                         d_size_,
                         node_size_,
                         d_locks_,
                         lowest_level_start_,
                         shmem,
                         compare_);
Member

push_single_node, push_partial_node, and related utilities should be member functions of device_mutable_view.

Member

The same applies to pop_single_node and pop_partial_node.

Comment on lines +162 to +174
/*
* @brief Return the amount of temporary storage required for operations
* on the queue with a cooperative group size of block_size
*
* @param block_size Size of the cooperative groups to calculate storage for
* @return The amount of temporary storage required in bytes
*/
__device__ int get_shmem_size(int block_size) const
{
  int intersection_bytes = 2 * (block_size + 1) * sizeof(int);
  int node_bytes         = node_size_ * sizeof(T);
  return intersection_bytes + 2 * node_bytes;
}
Member

This seems to be unused.

* @param shmem The shared memory layout for this cooperative group
* @param compare Comparison operator ordering the elements in the heap
*/
template <typename InputIt, typename T, typename Compare, typename CG>
Member

OutputIt instead of InputIt

* @param lowest_level_start The first index of the heap's lowest layer
* @param compare Comparison operator ordering the elements in the heap
*/
template <typename OutputIt, typename T, typename Compare>
Member

Suggested change:
- template <typename OutputIt, typename T, typename Compare>
+ template <typename InputIt, typename viewT>

Comment on lines +1120 to +1126
T* heap,
int* size,
std::size_t node_size,
int* locks,
std::size_t* p_buffer_size,
int lowest_level_start,
Compare compare)
Member

Suggested change:
- T* heap,
- int* size,
- std::size_t node_size,
- int* locks,
- std::size_t* p_buffer_size,
- int lowest_level_start,
- Compare compare)
+ viewT view)

The kernel implementation can also be simplified with view.
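
For example, a simplified kernel along these lines. The view members used here (value_type, node_size(), pop_single_node) are assumptions following the suggestions in this review, not code from the PR:

```cuda
#include <cooperative_groups.h>

template <typename OutputIt, typename viewT>
__global__ void pop_kernel(OutputIt out, std::size_t num_elements, viewT view)
{
  using T = typename viewT::value_type;
  extern __shared__ char shmem[];
  auto g = cooperative_groups::this_thread_block();

  // Grid-stride over full nodes; the heap pointers, node size, locks,
  // etc. now travel inside the view instead of as individual
  // kernel parameters.
  for (std::size_t i = blockIdx.x; i * view.node_size() < num_elements;
       i += gridDim.x) {
    view.pop_single_node(g, out + i * view.node_size(), shmem);
  }
}
```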

@PointKernel
Member

@andrewbriand Can you please also merge with the latest dev branch and fix any build warnings?

Labels: Needs Review, topic: build, topic: performance, type: feature request
4 participants