
Out-of-date info re. GPUs and exclusivity #36

Open
msegado opened this issue Dec 6, 2023 · 0 comments

msegado commented Dec 6, 2023

(Following up on an email thread)

It looks like the page *Running your AI training jobs on Satori using Slurm* contains some incorrect info on GPUs and exclusivity. I'm guessing this might be left over from a time when GPUs were exposed to jobs differently. E.g.:

> exclusive. That means that unless you ask otherwise, the GPUs on the node(s)
> you are assigned may already be in use by another user. That means if you
> request a node with 2GPU's the 2 other GPUs on that node may be engaged by
> another job. This allows us to more efficently allocate all of the GPU
> resources. This may require some additional checking to make sure you can
> uniquely use all of the GPU's on a machine. If you're in doubt, you can request
> the node to be 'exclusive' . See below on how to request exclusive access in
> an interactive and batch situation.

I don't think any additional checking is required, nor is it necessary to request exclusive use of the node. My understanding of the current behavior (per @adamdrucker) is that a job gets exclusive use of any GPUs requested via the `--gres` flag, and my experience is that any additional unallocated GPUs are simply not exposed to a job at all.

> ```
> srun --gres=gpu:4 -N 1 --mem=1T --time 1:00:00 -I --pty /bin/bash
> ```
> This will request an AC922 node with 4x GPUs from the Satori (normal
> queue) for 1 hour.
> If you need to make sure no one else can allocate the unused GPU's on the machine you can use
> ```
> srun --gres=gpu:4 -N 1 --exclusive --mem=1T --time 1:00:00 -I --pty /bin/bash
> ```
> this will request exclusive use of an interactive node with 4GPU's

I believe the first command above is sufficient to ensure that nobody else can allocate the four GPUs on the node, right?

> - line 13: ``--exclusive`` means that you want full use of the GPUS on the nodes you are reserving. Leaving this out allows
>   the GPU resources you're not using on the node to be shared.

Again, my understanding is that this isn't necessary and may be detrimental: requesting all GPUs on a node is sufficient to ensure the job has exclusive access to them, and omitting the `--exclusive` flag unless it's really needed (e.g. when you need all resources available on a node, not just all GPUs) gives the scheduler more flexibility to combine GPU-heavy, CPU-light jobs with jobs that need only CPU cores.
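For what it's worth, this is easy to check from inside a job. A quick sketch (assuming a node with `nvidia-smi` on the path; whether Slurm sets `CUDA_VISIBLE_DEVICES` and constrains devices depends on the cluster's gres/cgroup configuration, so treat the expected results as what I'd anticipate on Satori, not a guarantee):

```bash
# Request 2 of a node's GPUs, without --exclusive:
srun --gres=gpu:2 -N 1 --time 0:10:00 -I --pty /bin/bash

# Then, inside the job shell:
nvidia-smi -L                 # should list only the 2 allocated GPUs
echo "$CUDA_VISIBLE_DEVICES"  # should name only the GPUs Slurm granted
```

If the remaining GPUs don't show up here, that would confirm that `--gres` alone already isolates a job's GPUs and `--exclusive` is only needed to reserve the whole node.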

Don't have the bandwidth to open a PR at the moment, but hope the above helps! (And please let me know if I misunderstood any of this...)
