Add configuration option for choosing number of GPU streams when they're not per-thread #1222

Open

msimberg
Collaborator

pika-org/pika#1294 will be part of pika 0.31.0, and changes the meaning of the "number of streams" parameters passed to the cuda_pool. Instead of signaling the number of streams per worker thread, they now mean the number of streams in total. This is unfortunately a silent breaking change in pika, but this PR attempts to make it somewhat loud in DLA-Future.

This PR introduces two new configuration options: num_np_gpu_streams and num_hp_gpu_streams. These will be used when pika is version 0.31.0 or newer. The old options will be used when using a version before 0.31.0. When attempting to use an option with the wrong version of pika, DLA-Future will print a warning that the option will be ignored. For compatibility one can set both the old and the new options at the same time to cover any version of pika, at the cost of a warning.
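The selection logic described above could look roughly like the following. This is a minimal sketch under stated assumptions, not DLA-Future's actual implementation: `PikaVersion`, `StreamOptions`, `Resolved`, `resolveNpStreams`, the old option's member name `np_streams_per_thread`, and the warning texts are all hypothetical. Only the new option name `num_np_gpu_streams`, the 0.31.0 cutoff, and the defaults of 3 per thread and 32 in total come from this description.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical version triple; not a pika type.
struct PikaVersion {
  int ver_major;
  int ver_minor;
  int ver_patch;
};

// True when pika interprets the stream count as a total (0.31.0 or newer).
constexpr bool streamsAreTotal(PikaVersion v) {
  return v.ver_major > 0 || (v.ver_major == 0 && v.ver_minor >= 31);
}

// Hypothetical container for the user-supplied options (normal priority only).
// An empty optional means the user did not set the option.
struct StreamOptions {
  std::optional<std::size_t> np_streams_per_thread;  // old option (name assumed)
  std::optional<std::size_t> num_np_gpu_streams;     // new option
};

struct Resolved {
  std::size_t value = 0;              // number that would be passed to the pool
  std::vector<std::string> warnings;  // one entry per ignored option
};

// Picks the option matching the pika version and warns about the other one.
Resolved resolveNpStreams(PikaVersion v, const StreamOptions& opts) {
  constexpr std::size_t default_per_thread = 3;
  constexpr std::size_t default_total = 32;

  Resolved r;
  if (streamsAreTotal(v)) {
    r.value = opts.num_np_gpu_streams.value_or(default_total);
    if (opts.np_streams_per_thread)
      r.warnings.push_back("per-thread stream option is ignored with pika >= 0.31.0");
  } else {
    r.value = opts.np_streams_per_thread.value_or(default_per_thread);
    if (opts.num_np_gpu_streams)
      r.warnings.push_back("num_np_gpu_streams is ignored with pika < 0.31.0");
  }
  return r;
}
```

With both options set, the sketch returns the value matching the detected version and exactly one warning, which mirrors the "cover any version of pika, at the cost of a warning" behaviour.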

I'm not too worried about this causing problems, since so far we've never had the need to change the defaults on different systems.

The default of 32 streams (each for normal and high priority) is the same as the default in pika. DLA-Future's miniapps show no meaningful performance difference between the new option and the old per-thread streams. 32 was chosen as a reasonable middle ground: going as low as 4 or 8 showed a small slowdown, while going as high as 128 brings no benefit since the GPUs don't support that much concurrency anyway. Note that the previous setup would actually create 192 normal and high priority streams on e.g. Grace (with 64 worker threads), which was clearly overkill. Despite creating so many streams, each worker thread could still be limited by its own three streams.
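The stream counts above follow from simple arithmetic, sketched here. The function names are hypothetical, and the sketch assumes the figure of 192 is counted per priority level (64 worker threads times the old default of three streams per thread).

```cpp
#include <cstddef>

// Old meaning (pika < 0.31.0): the parameter is a per-thread count, so the
// number of streams created grows with the number of worker threads.
constexpr std::size_t streamsCreatedPerThreadScheme(std::size_t worker_threads,
                                                    std::size_t streams_per_thread) {
  return worker_threads * streams_per_thread;
}

// New meaning (pika >= 0.31.0): the parameter is already the total, so the
// number of worker threads no longer matters.
constexpr std::size_t streamsCreatedTotalScheme(std::size_t /*worker_threads*/,
                                                std::size_t total_streams) {
  return total_streams;
}

// Grace example from the text: 64 worker threads, 3 streams per thread.
static_assert(streamsCreatedPerThreadScheme(64, 3) == 192, "old scheme on 64 threads");
// New default: 32 streams regardless of the thread count.
static_assert(streamsCreatedTotalScheme(64, 32) == 32, "new scheme on 64 threads");
static_assert(streamsCreatedTotalScheme(8, 32) == 32, "new scheme on 8 threads");
```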

Note that the change in pika is really meant as a conceptual simplification (rather than a performance improvement): it's easier to reason about how much concurrency the pool provides when the number of streams given is the total, rather than a number that varies with the number of worker threads. It also now matches how we handle the cuBLAS and cuSOLVER handles. However, it may allow corner cases to exploit more concurrency as well. In the case that @albestro encountered, where different continuations (launching CUDA work) end up running on the same worker thread, the new option allows that worker thread to use all streams instead of being limited to the previous default of three streams per worker thread.
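To illustrate the corner case, the following toy model counts how many distinct streams a single worker thread can reach when several continuations land on it. It is a sketch only: it assumes streams are handed out round-robin from whatever set is visible to the thread, which is an assumption about the pool and not a statement about pika's implementation, and `distinctStreamsUsed` is a hypothetical helper.

```cpp
#include <cstddef>
#include <set>

// Number of distinct streams touched when one worker thread issues `launches`
// pieces of GPU work, picking streams round-robin among `streams_visible`.
// Under the old scheme `streams_visible` is the per-thread count (default 3);
// under the new scheme it is the pool total (default 32).
std::size_t distinctStreamsUsed(std::size_t launches, std::size_t streams_visible) {
  if (streams_visible == 0) return 0;  // no streams, nothing to use
  std::set<std::size_t> used;
  for (std::size_t i = 0; i < launches; ++i) {
    used.insert(i % streams_visible);  // round-robin stream index
  }
  return used.size();
}
```

In this model, ten continuations on one worker thread are spread over at most three streams under the old default, while under the new default each of the ten can get its own stream.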

Note that I've updated test_init to test with a different configuration option. This is simply to avoid having to do tests conditional on the pika version there. The actual configuration option used for testing was never the important part, just that some configuration option is used.

Until pika 0.31.0 is released, pika main identifies itself as 0.30.1, so DLA-Future will not use the correct option, despite pika main already having the change from pika-org/pika#1294. I recommend staying with pika 0.30.1 until then.

@msimberg
Collaborator Author

cscs-ci run

@msimberg msimberg added this to the v0.7.0 milestone Nov 27, 2024
@msimberg
Collaborator Author

cscs-ci run

@msimberg msimberg requested review from rasolca, albestro and RMeli and removed request for rasolca November 27, 2024 13:41