
PySlurm ignoring some batch job options #169

Open
cahartsell opened this issue Feb 29, 2020 · 4 comments

Comments

@cahartsell

Details

  • Slurm Version: 19.05.0
  • Python Version: 2.7
  • Cython Version: 0.29.15
  • PySlurm Branch: 19-05-0
  • Linux Distribution: Ubuntu 16.04

Issue

PySlurm seems to ignore some valid "sbatch" parameters when submitting a batch job. Example Python code:

import pyslurm

job_opts = {"wrap": "sleep 60",
            "ntasks": 1,
            "cpus_per_task": 8,
            "gres": "gpu:turing:1"}
pyslurm.job().submit_batch_job(job_opts)

Equivalent "sbatch" command line call:

sbatch --wrap="sleep 60" --ntasks=1 --cpus-per-task=8 --gres="gpu:turing:1"

The sbatch command line call behaves as expected (allocates 8 cores and 1 Turing GPU), but the PySlurm code seems to ignore some of the parameters ("cpus_per_task" and "gres" in this example) and only allocates the standard 2 cores with no GPU. I've tested a few other parameters (e.g. "job_name", "partition") and they appear to work correctly, so the problem seems limited to certain arguments.

I do not see any errors or warnings in any of the slurm logs when run with either method, and no exceptions are thrown when run through PySlurm.

After looking through the pyslurm.pyx file (and the "fill_job_desc_from_opts" function in particular), it seems like the "gres" parameter may not be supported. However, "cpus_per_task" does appear to be supported, yet still does not work.

Any help with this would be greatly appreciated. Also, if I'm right that "gres" is not supported, are there any workarounds or alternative methods for allocating GPUs to batch jobs?

Thanks,
Charlie

@giovtorres
Member

I think the issue or bug is on L2679:

pyslurm/pyslurm/pyslurm.pyx

Lines 2675 to 2683 in c50467c

if job_opts.get("overcommit"):
    desc.min_cpus = max(job_opts.get("min_nodes", 1), 1)
    desc.overcommit = job_opts.get("overcommit")
elif job_opts.get("cpus_per_task"):
    desc.min_cpus = job_opts.get("ntasks", 1) * job_opts.get("cpus_per_task")
elif job_opts.get("nodelist") and job_opts.get("min_nodes") == 0:
    desc.min_cpus = 0
else:
    desc.min_cpus = job_opts.get("ntasks", 1)
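For what it's worth, the branch logic above can be exercised in plain Python (this is a sketch of the mapping only, not the actual Cython code):

```python
def min_cpus_from_opts(job_opts):
    # Sketch of the min_cpus branch logic from fill_job_desc_from_opts
    if job_opts.get("overcommit"):
        return max(job_opts.get("min_nodes", 1), 1)
    elif job_opts.get("cpus_per_task"):
        return job_opts.get("ntasks", 1) * job_opts.get("cpus_per_task")
    elif job_opts.get("nodelist") and job_opts.get("min_nodes") == 0:
        return 0
    else:
        return job_opts.get("ntasks", 1)

# The options from the original report give min_cpus = 1 * 8 = 8,
# so the arithmetic itself looks correct; the open question is whether
# setting min_cpus alone is still enough to request 8 CPUs in 19.05.
print(min_cpus_from_opts({"ntasks": 1, "cpus_per_task": 8}))  # -> 8
```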

I think this might be a carry-over from a previous version that no longer works in 19.05. If you could help me track down what it should be, we should be able to fix it.

@cahartsell
Author

I don't see any obvious problems with the code snippet you posted, and the job_desc_msg_t datatype defined in both the PySlurm and Slurm repositories still contains the desc.min_cpus field. I'm not very experienced with Slurm, and I didn't dig deep enough to find out whether this min_cpus field is actually used by the Slurm API code. One thing I did notice is that a cpus_per_task field is also defined in the job_desc_msg_t structure, but it doesn't seem to be used by PySlurm. Instead of translating this argument to min_cpus, could the field be set directly with:

desc.cpus_per_task = job_opts.get("cpus_per_task", 1)

In the meantime, I'm getting around this by translating the dictionary of job options into a command line call and invoking sbatch with the subprocess library. This is not ideal, but is sufficient for my uses. Example code below for anyone else having a similar issue.

from future.utils import iteritems
import subprocess

def submit_job(job_info):
    # Construct the sbatch command line
    slurm_cmd = ["sbatch"]
    for key, value in iteritems(job_info):
        # Check for special-case keys
        if key == "cpus_per_task":
            key = "cpus-per-task"
        elif key == "job_name":
            key = "job-name"
        elif key == "script":
            continue
        slurm_cmd.append("--%s=%s" % (key, value))
    slurm_cmd.append(job_info["script"])
    print("Generated slurm batch command: '%s'" % slurm_cmd)

    # Run sbatch command as a subprocess
    try:
        sbatch_output = subprocess.check_output(slurm_cmd)
    except subprocess.CalledProcessError as e:
        # Print the captured sbatch output for easier debugging, then re-raise.
        # Note: the output lives on the exception object; sbatch_output is
        # never assigned when check_output raises.
        if e.output:
            print("ERROR: Subprocess call output: %s" % e.output)
        raise

    # Parse the job id from sbatch output (e.g. "Submitted batch job 12345")
    job_id = None
    for s in sbatch_output.strip("\n ").split():
        if s.isdigit():
            job_id = int(s)
            break
    return job_id
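A more general variant of the key translation (a hypothetical helper, not part of the snippet above) just swaps every underscore for a dash, which covers all of sbatch's long options at once instead of special-casing each one:

```python
def to_sbatch_flag(key, value):
    # sbatch long options use dashes; Python-friendly dict keys use underscores
    return "--%s=%s" % (key.replace("_", "-"), value)

print(to_sbatch_flag("cpus_per_task", 8))  # -> --cpus-per-task=8
print(to_sbatch_flag("job_name", "test"))  # -> --job-name=test
```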

@alanhoyle

alanhoyle commented Jul 11, 2022

I am also having trouble submitting a job with anything other than 1 CPU core:

    mem = 32000
    cpus = 4
    partition = 'mypartition'
    job_name = "sweet_job_name" 

    awesome_job_opts = {
        'script': sweet_script_name,
        'realmem': mem,
        'cpus-per-task': cpus,
        'partition': partition,
        'job_name': job_name,
    }

    pyslurm.job().submit_batch_job(awesome_job_opts)

This results in a job that has 32,000 MB of memory but only 1 CPU core.

$ squeue -o "jobid: %A name: %j cpus: %C ram:%m %P %t %M %o" | grep sweet_job_name
jobid: 12345 name: sweet_job_name cpus: 1 ram:32000M mypartition R 1-02:51:10 (null)

@alanhoyle

AHHH! I figured it out from @cahartsell's comment above.

it should be:

'cpus_per_task': cpus,

Underscores work, dashes don't!
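For anyone else hitting this: the silent failure is consistent with a plain dictionary lookup, as in the fill_job_desc_from_opts snippet earlier in the thread. A quick sketch of why a dashed key is dropped without any error:

```python
# PySlurm looks options up by their underscored name, so a dashed
# key simply never matches and the option is silently ignored.
opts_dashed = {"cpus-per-task": 4}
opts_underscored = {"cpus_per_task": 4}

print(opts_dashed.get("cpus_per_task"))       # -> None (option dropped)
print(opts_underscored.get("cpus_per_task"))  # -> 4
```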
