Add configuration option for choosing number of GPU streams when they're not per-thread #1222
pika-org/pika#1294 will be part of pika 0.31.0, and changes the meaning of the "number of streams" parameters passed to the `cuda_pool`. Instead of signaling the number of streams per worker thread, they now mean the number of streams in total. This is unfortunately a silent breaking change in pika, but this PR attempts to make it somewhat loud in DLA-Future.

This PR introduces two new configuration options: `num_np_gpu_streams` and `num_hp_gpu_streams`. These will be used when pika is version 0.31.0 or newer. The old options will be used when using a version before 0.31.0. When attempting to use an option with the wrong version of pika, DLA-Future will print a warning that the option will be ignored. For compatibility one can set both the old and the new options at the same time to cover any version of pika, at the cost of a warning.

I'm not too worried about this causing problems, since so far we've never had the need to change the defaults on different systems.
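To make the selection rule above concrete, here is a minimal sketch of version-gated option handling. It is not DLA-Future's actual implementation: `Version`, `StreamOptions`, `Selection` and `select_streams` are hypothetical names invented for illustration, and the defaults (3 per thread, 32 in total) are the ones mentioned in this description.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a pika version triple.
struct Version {
  int ver_major;
  int ver_minor;
  int ver_patch;
};

// True when the pool interprets the stream count as a total
// (pika 0.31.0 or newer, as described in this PR).
constexpr bool streams_are_total(Version v) {
  return v.ver_major > 0 || (v.ver_major == 0 && v.ver_minor >= 31);
}

// Hypothetical view of the user's settings; an unset option is empty.
struct StreamOptions {
  std::optional<std::size_t> per_thread;  // old-style option
  std::optional<std::size_t> total;       // new-style option
};

struct Selection {
  std::size_t pool_argument;          // value that would be forwarded to the pool
  std::vector<std::string> warnings;  // one entry per ignored option
};

// Pick the option matching the pika version; warn about the other one if set.
Selection select_streams(Version v, const StreamOptions& opts,
                         std::size_t default_per_thread = 3,
                         std::size_t default_total = 32) {
  Selection s{0, {}};
  if (streams_are_total(v)) {
    s.pool_argument = opts.total.value_or(default_total);
    if (opts.per_thread)
      s.warnings.push_back("per-thread stream option ignored with pika >= 0.31.0");
  } else {
    s.pool_argument = opts.per_thread.value_or(default_per_thread);
    if (opts.total)
      s.warnings.push_back("total stream option ignored with pika < 0.31.0");
  }
  return s;
}
```

Setting both options works for any version in this sketch and always produces exactly one warning, which mirrors the compatibility trade-off described above.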
32 streams (each for normal and high priority) is the same as the default in pika. DLA-Future's miniapps show no meaningful performance difference with the new option compared to the old per-thread streams. 32 was chosen as a reasonable middle ground. Going to something low like 4 or 8 showed a small slowdown, and going to something really high like 128 has no use since the GPUs don't support that much concurrency anyway. Note that with the previous setup we would actually create 192 normal and high priority streams on e.g. Grace (with 64 worker threads), which was clearly overkill. Despite creating so many streams, we could still be limited by the three streams per worker thread.
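The stream counts quoted above follow from simple arithmetic, sketched here for reference. The function names are hypothetical and only illustrate the two interpretations of the parameter; the counts apply separately to the normal and the high priority streams.

```cpp
#include <cassert>
#include <cstddef>

// Old interpretation: the parameter is a per-thread count, so the number of
// streams created grows with the number of worker threads.
constexpr std::size_t streams_created_per_thread_scheme(std::size_t worker_threads,
                                                        std::size_t streams_per_thread) {
  return worker_threads * streams_per_thread;
}

// New interpretation: the parameter is the total, independent of thread count.
constexpr std::size_t streams_created_total_scheme(std::size_t total_streams) {
  return total_streams;
}

// 64 worker threads with the previous default of three streams per thread.
static_assert(streams_created_per_thread_scheme(64, 3) == 192, "64 * 3 streams");
// The new default is fixed at 32 regardless of the number of worker threads.
static_assert(streams_created_total_scheme(32) == 32, "fixed total");
```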
Note that the change in pika is really meant as a conceptual simplification (rather than a performance improvement), since it's easier to reason about how much concurrency the pool provides when the number of streams given is the total, rather than varying with the number of worker threads. It now also matches how we deal with the cuBLAS and cuSOLVER handles. However, it may allow corner cases to exploit more concurrency as well. In the case that @albestro encountered, where different continuations (launching CUDA work) end up running on the same worker thread, this new option allows the same worker thread to use all streams instead of being limited to the previous default of three streams per worker thread.
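The corner case can be illustrated with a toy model. This sketch assumes streams are handed out round-robin, which is an assumption made for illustration and not something stated in this description; the function names are hypothetical and no real CUDA streams are involved. It counts how many distinct streams a single worker thread can reach when it issues several launches in a row.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Old semantics (toy model): a worker thread cycles only through its own
// private slice of `streams_per_thread` streams.
std::size_t distinct_streams_per_thread_scheme(std::size_t launches,
                                               std::size_t streams_per_thread) {
  assert(streams_per_thread > 0);
  std::set<std::size_t> used;
  for (std::size_t i = 0; i < launches; ++i)
    used.insert(i % streams_per_thread);
  return used.size();
}

// New semantics (toy model): the same thread cycles through the shared pool
// of `total_streams` streams.
std::size_t distinct_streams_total_scheme(std::size_t launches,
                                          std::size_t total_streams) {
  assert(total_streams > 0);
  std::set<std::size_t> used;
  for (std::size_t i = 0; i < launches; ++i)
    used.insert(i % total_streams);
  return used.size();
}
```

In this model, 100 launches from one worker thread touch 3 streams under the old defaults and 32 under the new one. Other threads using the shared pool at the same time are not modeled.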
Note that I've updated `test_init` to test with a different configuration option. This is simply to avoid having to do tests conditional on the pika version there. The actual configuration option used for testing was never the important part, just that some configuration option is used.

Until pika 0.31.0 is released, pika main identifies itself as 0.30.1, so DLA-Future will not use the correct option, despite pika main already having the change from pika-org/pika#1294. I recommend staying with pika 0.30.1 until then.