Ability to dedicate accelerator directive (GPU) over range #5570
Comments
I think this needs to be addressed by the executor, i.e. AWS Batch or SLURM should allow multiple tasks to request 1 GPU each, pack them onto a multi-GPU node as it would for CPUs, and set the Otherwise Nextflow would basically have to become the executor by tracking the VM assignments for each task in order to figure out which GPUs are available at any given time.
@bentsherman SLURM is already doing this I believe? Did some testing yesterday and setting the following in the nextflow.config will tell SLURM to ask for 1 GPU for each task.
If submitted to a multi-GPU node, different GPUs will be assigned but they will all be mapped to device 0 in each task. So
For SLURM it may depend on the individual cluster setup. The sysadmin can (probably) use cgroups to isolate GPUs just like you would for CPUs and memory, so that a job only sees the requested resources even if the underlying node has more. Setting the
NVIDIA article on cgroups: https://developer.nvidia.com/blog/improving-cuda-initialization-times-using-cgroups-in-certain-scenarios/
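To check from inside a task which GPUs it has actually been given, one option is to look at the `CUDA_VISIBLE_DEVICES` environment variable, which the CUDA runtime honours when enumerating devices. The snippet below is only a minimal sketch: the helper name `visible_gpus` is made up for illustration, and whether the scheduler sets this variable at all depends on the cluster or compute environment configuration.

```python
import os


def visible_gpus(environ=None):
    """Return the GPU identifiers listed in CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration only. Returns None when the
    variable is unset, and an empty list when it is set but empty.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: no restriction is expressed through this variable
    # Entries are comma-separated; they may be indices or device UUIDs.
    return [token.strip() for token in raw.split(",") if token.strip()]
```

Calling `visible_gpus()` at the start of a task script and logging the result is a cheap way to confirm whether two tasks on the same node were handed different devices.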
New feature
When submitting to nodes with >1 GPU, Nextflow has very limited capabilities to split the work over separate GPUs.
Usage scenario
Let's imagine running on AWS Batch. We submit multiple GPU-enabled tasks to the Batch service. AWS allocates them to a single, large instance with multiple GPUs, as per its allocation strategy, which prioritises the lowest price per CPU.
In this instance, all tasks would be able to use all GPUs at the same time, leading to collisions and GPU memory issues.
We have some strategies to deal with this:
However, we lack a way of saying "for a queue of x GPUs, assign each task to 1 available GPU".
What I want is for each task to know which GPU it can use, and then use only that GPU.
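To make the "only use that GPU" half concrete: if a task somehow knew its GPU id, restricting a command to that device could look roughly like the sketch below. It assumes the tool being run respects `CUDA_VISIBLE_DEVICES`; `run_on_gpu` and `DEMO_COMMAND` are hypothetical names, and how `gpu_id` gets chosen is exactly the part that is missing today.

```python
import os
import subprocess
import sys


def run_on_gpu(command, gpu_id):
    """Run `command` with CUDA_VISIBLE_DEVICES restricted to a single GPU.

    Hypothetical helper: `gpu_id` is assumed to be known already.
    """
    env = dict(os.environ)  # copy, so the parent environment is untouched
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return subprocess.run(
        command, env=env, check=True, capture_output=True, text=True
    )


# Demo command: a child Python process that prints the GPU id it was given.
DEMO_COMMAND = [
    sys.executable,
    "-c",
    "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])",
]
```

For example, `run_on_gpu(DEMO_COMMAND, 2).stdout` contains the id seen by the child process. In a real task the command would be the GPU tool itself.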
Suggest implementation
I don't actually have a good fix here. Perhaps using a process array with an index might help? Perhaps it's specific to each executor? But I feel like Nextflow could expose a variable to help us here.
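As a rough illustration of the index idea, assuming each task can see some integer index and the number of GPUs per node is known up front (both are assumptions, and `gpu_for_index` is a hypothetical name), the mapping could be a simple modulo:

```python
def gpu_for_index(task_index, gpus_per_node):
    """Map a task index to a GPU id by round-robin (modulo).

    Hypothetical sketch: it does not know which tasks the executor
    places on the same node, so it cannot rule out collisions.
    """
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    if task_index < 0:
        raise ValueError("task_index must be non-negative")
    return task_index % gpus_per_node


# With 4 GPUs per node, tasks 0 and 4 both map to GPU 0.
assignments = [gpu_for_index(i, 4) for i in range(6)]
```

The weakness is visible in `assignments`: two tasks with the same remainder land on the same GPU if the executor happens to co-locate them, which is why this probably needs cooperation from the executor rather than an index alone.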