
Out-of-date info re. GPUs and exclusivity #36

Open
msegado opened this issue Dec 6, 2023 · 0 comments

msegado commented Dec 6, 2023

(Following up on an email thread)

It looks like the page *Running your AI training jobs on Satori using Slurm* contains some incorrect info on GPUs and exclusivity. I'm guessing this might be left over from a time when GPUs were exposed to jobs differently. E.g.:

> exclusive. That means that unless you ask otherwise, the GPUs on the node(s)
> you are assigned may already be in use by another user. That means if you
> request a node with 2GPU's the 2 other GPUs on that node may be engaged by
> another job. This allows us to more efficently allocate all of the GPU
> resources. This may require some additional checking to make sure you can
> uniquely use all of the GPU's on a machine. If you're in doubt, you can request
> the node to be 'exclusive' . See below on how to request exclusive access in
> an interactive and batch situation.

I don't think any additional checking is required, nor is it necessary to request exclusive use of the node. My understanding of the current behavior (per @adamdrucker) is that a job gets exclusive use of any GPUs requested via the `--gres` flag, and my experience is that any additional unallocated GPUs are simply not exposed to a job at all.

> ```
> srun --gres=gpu:4 -N 1 --mem=1T --time 1:00:00 -I --pty /bin/bash
> ```
> This will request an AC922 node with 4x GPUs from the Satori (normal
> queue) for 1 hour.
> If you need to make sure no one else can allocate the unused GPU's on the machine you can use
> ```
> srun --gres=gpu:4 -N 1 --exclusive --mem=1T --time 1:00:00 -I --pty /bin/bash
> ```
> this will request exclusive use of an interactive node with 4GPU's

I believe the first command above is sufficient to ensure that nobody else can allocate the four GPUs on the node, right?

> - line 13: ``--exclusive`` means that you want full use of the GPUS on the nodes you are reserving. Leaving this out allows
>   the GPU resources you're not using on the node to be shared.

Again, my understanding is that this isn't necessary and may be detrimental: requesting all GPUs on a node is sufficient to ensure the job has exclusive access to them, and omitting the `--exclusive` flag unless it's really needed (e.g. when you need all resources available on a node, not just all GPUs) gives the scheduler more flexibility to combine GPU-heavy, CPU-light jobs with jobs that need only CPU cores.
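For what it's worth, this is easy to check from inside a job. A quick sketch (assuming a node with `nvidia-smi` on the path; whether Slurm sets `CUDA_VISIBLE_DEVICES` and constrains devices depends on the cluster's gres/cgroup configuration, so treat the expected results as what I'd anticipate on Satori, not a guarantee):

```bash
# Request 2 of a node's GPUs, without --exclusive:
srun --gres=gpu:2 -N 1 --time 0:10:00 -I --pty /bin/bash

# Then, inside the job shell:
nvidia-smi -L                 # should list only the 2 allocated GPUs
echo "$CUDA_VISIBLE_DEVICES"  # should name only the GPUs Slurm granted
```

If the remaining GPUs don't show up here, that would confirm that `--gres` alone already isolates a job's GPUs and `--exclusive` is only needed to reserve the whole node.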

Don't have the bandwidth to open a PR at the moment, but hope the above helps! (And please let me know if I misunderstood any of this...)
