(Following up on an email thread)
It looks like the page Running your AI training jobs on Satori using Slurm
contains some incorrect info on GPUs and exclusivity. I'm guessing this might be left over from a time when GPUs were exposed to jobs differently? E.g.:
getting-started/satori-workload-manager-using-slurm.rst, lines 33 to 40 in 940cdd6:

> exclusive. That means that unless you ask otherwise, the GPUs on the node(s)
> you are assigned may already be in use by another user. That means if you
> request a node with 2 GPUs, the 2 other GPUs on that node may be engaged by
> another job. This allows us to more efficiently allocate all of the GPU
> resources. This may require some additional checking to make sure you can
> uniquely use all of the GPUs on a machine. If you're in doubt, you can request
> the node to be 'exclusive'. See below on how to request exclusive access in
> an interactive and batch situation.
I don't think any additional checking is required, nor is it necessary to request exclusive use of the node... my understanding of the current behavior (per @adamdrucker) is that a job gets exclusive use of any GPUs requested via the `--gres` flag, and my experience is that any additional unallocated GPUs are simply not exposed to the job at all.

getting-started/satori-workload-manager-using-slurm.rst, lines 65 to 78 in 940cdd6:
I believe the first command above is sufficient to ensure that nobody else can allocate the four GPUs on the node, right?
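For what it's worth, here's roughly the kind of check behind my claim above that unallocated GPUs aren't exposed. It's just a sketch, not the docs' own example: the job name and time limit are placeholders, and I've left out any partition or module setup.

```bash
#!/bin/bash
# Ask for 2 of the node's 4 GPUs -- note: no --exclusive flag.
#SBATCH --job-name=gres-check
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --time=00:05:00

# If only the allocated GPUs are exposed to the job, both of these
# should list exactly two devices even though the node has four.
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi -L
```

In my experience the output only ever lists the GPUs that were allocated, which is what I meant by the unallocated GPUs not being exposed to the job.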
getting-started/satori-workload-manager-using-slurm.rst, lines 178 to 179 in 940cdd6:

> - line 13: ``--exclusive`` means that you want full use of the GPUs on the nodes you are reserving. Leaving this out allows
>   the GPU resources you're not using on the node to be shared.
Again, my understanding is that this isn't necessary and may be detrimental; requesting all GPUs on a node is sufficient to ensure exclusive access to the job, and omitting the `--exclusive` flag unless it's really needed (e.g. you need all of the resources available on a node, not just all of the GPUs) would give the scheduler more flexibility to combine GPU-heavy, CPU-light jobs with jobs that need only CPU cores.
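Concretely, I'd expect something like the following to be enough on its own. Again just a sketch: the job name, CPU count, time limit, and `train.py` are made-up placeholders, and the four-GPUs-per-node figure is from the snippet quoted above.

```bash
#!/bin/bash
# Request every GPU on the node, but not --exclusive: no other job can be
# allocated a GPU here, while the leftover cores and memory stay schedulable
# for CPU-only work. (Job name, CPU count, time, and train.py are placeholders.)
#SBATCH --job-name=train-all-gpus
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --time=12:00:00

srun python train.py
```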
Don't have the bandwidth to open a PR at the moment, but hope the above helps! (And please let me know if I misunderstood any of this...)