
PySlurm ignoring some batch job options #169

Open
cahartsell opened this issue Feb 29, 2020 · 4 comments

Comments

@cahartsell

Details

  • Slurm Version: 19.05.0
  • Python Version: 2.7
  • Cython Version: 0.29.15
  • PySlurm Branch: 19-05-0
  • Linux Distribution: Ubuntu 16.04

Issue

PySlurm seems to ignore some valid "sbatch" parameters when submitting a batch job. Example Python code:

import pyslurm

job_opts = {"wrap": "sleep 60",
            "ntasks": 1,
            "cpus_per_task": 8,
            "gres": "gpu:turing:1"}
pyslurm.job().submit_batch_job(job_opts)

Equivalent "sbatch" command line call:

sbatch --wrap="sleep 60" --ntasks=1 --cpus-per-task=8 --gres="gpu:turing:1"

The sbatch command line call behaves as expected (allocates 8 cores and 1 Turing GPU), but the PySlurm code seems to ignore some of the parameters ("cpus_per_task" and "gres" in this example) and only allocates the standard 2 cores with no GPU. I've tested a few other parameters (e.g. "job_name", "partition") and they appear to work correctly, so the problem seems limited to certain arguments.

I do not see any errors or warnings in any of the slurm logs when run with either method, and no exceptions are thrown when run through PySlurm.

After looking through the pyslurm.pyx file (and the "fill_job_desc_from_opts" function in particular), it seems like the "gres" parameter may not be supported. However, "cpus_per_task" does appear to be supported, yet still does not work.

Any help with this would be greatly appreciated. Also, if I'm right that "gres" is not supported, are there any workarounds or alternative methods for allocating GPUs to batch jobs?

Thanks,
Charlie

@giovtorres
Member

I think the issue or bug is on L2679:

pyslurm/pyslurm/pyslurm.pyx

Lines 2675 to 2683 in c50467c

if job_opts.get("overcommit"):
    desc.min_cpus = max(job_opts.get("min_nodes", 1), 1)
    desc.overcommit = job_opts.get("overcommit")
elif job_opts.get("cpus_per_task"):
    desc.min_cpus = job_opts.get("ntasks", 1) * job_opts.get("cpus_per_task")
elif job_opts.get("nodelist") and job_opts.get("min_nodes") == 0:
    desc.min_cpus = 0
else:
    desc.min_cpus = job_opts.get("ntasks", 1)
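For what it's worth, the branch logic above can be exercised in plain Python (this is a sketch of the mapping only, not the actual Cython code):

```python
def min_cpus_from_opts(job_opts):
    # Sketch of the min_cpus branch logic from fill_job_desc_from_opts
    if job_opts.get("overcommit"):
        return max(job_opts.get("min_nodes", 1), 1)
    elif job_opts.get("cpus_per_task"):
        return job_opts.get("ntasks", 1) * job_opts.get("cpus_per_task")
    elif job_opts.get("nodelist") and job_opts.get("min_nodes") == 0:
        return 0
    else:
        return job_opts.get("ntasks", 1)

# The options from the original report give min_cpus = 1 * 8 = 8,
# so the arithmetic itself looks correct; the open question is whether
# setting min_cpus alone is still enough to request 8 CPUs in 19.05.
print(min_cpus_from_opts({"ntasks": 1, "cpus_per_task": 8}))  # -> 8
```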

I think this might be a carry-over from a previous version that no longer works in 19.05. If you could help me track down what it should be, we should be able to fix it.

@cahartsell
Author

I don't see any obvious problems with the code snippet you posted, and the job_desc_msg_t datatype defined in both the PySlurm and Slurm repositories still contains the desc.min_cpus field. I'm not very experienced with Slurm, and I didn't dig deep enough to find out whether this min_cpus field is actually used by the Slurm API code. One thing I did notice is that a cpus_per_task field is also defined in the job_desc_msg_t structure, but it doesn't seem to be used by PySlurm. Instead of translating this argument to min_cpus, could the field be set directly with:

desc.cpus_per_task = job_opts.get("cpus_per_task", 1)

In the meantime, I'm getting around this by translating the dictionary of job options into a command line call and invoking sbatch with the subprocess library. This is not ideal, but is sufficient for my uses. Example code below for anyone else having a similar issue.

from future.utils import iteritems
import subprocess

def submit_job(job_info):
    # Construct the sbatch command line
    slurm_cmd = ["sbatch"]
    for key, value in iteritems(job_info):
        # Check for special-case keys
        if key == "cpus_per_task":
            key = "cpus-per-task"
        elif key == "job_name":
            key = "job-name"
        elif key == "script":
            continue
        slurm_cmd.append("--%s=%s" % (key, value))
    slurm_cmd.append(job_info["script"])
    print("Generated slurm batch command: '%s'" % slurm_cmd)

    # Run sbatch command as a subprocess
    try:
        sbatch_output = subprocess.check_output(slurm_cmd)
    except subprocess.CalledProcessError as e:
        # Print the captured sbatch output for easier debugging, then re-raise.
        # Note: the output lives on the exception object; sbatch_output is
        # never assigned when check_output raises.
        if e.output:
            print("ERROR: Subprocess call output: %s" % e.output)
        raise

    # Parse the job id from sbatch output (e.g. "Submitted batch job 12345")
    job_id = None
    for s in sbatch_output.strip("\n ").split():
        if s.isdigit():
            job_id = int(s)
            break
    return job_id
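A more general variant of the key translation (a hypothetical helper, not part of the snippet above) just swaps every underscore for a dash, which covers all of sbatch's long options at once instead of special-casing each one:

```python
def to_sbatch_flag(key, value):
    # sbatch long options use dashes; Python-friendly dict keys use underscores
    return "--%s=%s" % (key.replace("_", "-"), value)

print(to_sbatch_flag("cpus_per_task", 8))  # -> --cpus-per-task=8
print(to_sbatch_flag("job_name", "test"))  # -> --job-name=test
```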

@alanhoyle

alanhoyle commented Jul 11, 2022

I am also having trouble submitting a job with anything other than 1 CPU core:

    mem = 32000
    cpus = 4
    partition = 'mypartition'
    job_name = "sweet_job_name" 

    awesome_job_opts = {
        'script': sweet_script_name,
        'realmem': mem,
        'cpus-per-task': cpus,
        'partition': partition,
        'job_name': job_name,
    }

    pyslurm.job().submit_batch_job(awesome_job_opts)

This results in a job that has 32,000 MB of memory but only 1 CPU core.

$ squeue -o "jobid: %A name: %j cpus: %C ram:%m %P %t %M %o" | grep sweet_job_name
jobid: 12345 name: sweet_job_name cpus: 1 ram:32000M mypartition R 1-02:51:10 (null)

@alanhoyle

AHHH! I figured it out from @cahartsell's comment above.

it should be:

'cpus_per_task': cpus,

Underscores work, dashes don't!
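For anyone else hitting this: the silent failure is consistent with a plain dictionary lookup, as in the fill_job_desc_from_opts snippet earlier in the thread. A quick sketch of why a dashed key is dropped without any error:

```python
# PySlurm looks options up by their underscored name, so a dashed
# key simply never matches and the option is silently ignored.
opts_dashed = {"cpus-per-task": 4}
opts_underscored = {"cpus_per_task": 4}

print(opts_dashed.get("cpus_per_task"))       # -> None (option dropped)
print(opts_underscored.get("cpus_per_task"))  # -> 4
```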
