
toil issuing too many jobs at a time #1465

Open
macmanes opened this issue Aug 17, 2024 · 4 comments
@macmanes

Thanks in advance for the support. I am trying to align approx. 90 mammal genomes on a slurm-enabled cluster. Running like this:

TOIL_SLURM_ARGS="--partition=macmanes --exclude=node116,node144,node145,node146,node147,node115" \
cactus $HOME/jobs $HOME/final_cactus_input.txt mammals2.hal \
--batchSystem slurm \
--batchLogsDir batch-logs --coordinationDir $HOME/cactus_jobs \
--consCores 40 --maxMemory 500G --doubleMem true

I am getting an error in the run_lastz phase of the workflow. I believe this is because too many jobs are being issued at once: about 46k jobs had been issued for this dataset at the time of failure.

[2024-08-17T10:17:56-0400] [Thread-2] [E] [toil.lib.retry] Got a <class 'OSError'>: [Errno 7] Argument list too long: 'sacct' which is not retriable according to <function AbstractGridEngineBatchSystem.with_retries.<locals>.<lambda> at 0x7f23d6838dc0>
[2024-08-17T10:17:56-0400] [Thread-2] [E] [toil.batchSystems.abstractGridEngineBatchSystem] GridEngine like batch system failure: [Errno 7] Argument list too long: 'sacct'

Any way I can throttle this for instance, to permit 5k or 10k jobs to be issued at a time?

@macmanes
Author

macmanes commented Aug 17, 2024

I do see that there is a toil flag --maxJobs that might help, but I am not sure how to pass arguments through to toil (except for slurm):

TOIL_ARGS="--maxJobs=5000"??

@glennhickey
Collaborator

--maxJobs sounds about right. It's an option for any cactus command; see cactus --help:

--maxJobs MAX_JOBS    Specifies the maximum number of jobs to submit to the
                        backing scheduler at once. Not supported on Mesos or
                        AWS Batch. Use 0 for unlimited. Defaults to unlimited.
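Applied to the original invocation from this issue, the throttle would look like the following (5000 is only an illustrative value; tune it to your cluster):

```shell
# Same invocation as above, capped at 5000 concurrently submitted jobs
TOIL_SLURM_ARGS="--partition=macmanes --exclude=node116,node144,node145,node146,node147,node115" \
cactus $HOME/jobs $HOME/final_cactus_input.txt mammals2.hal \
    --batchSystem slurm \
    --batchLogsDir batch-logs --coordinationDir $HOME/cactus_jobs \
    --consCores 40 --maxMemory 500G --doubleMem true \
    --maxJobs 5000
```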

@adamnovak
Collaborator

It looks like Toil needs to add OSError (with errno 7, "Argument list too long") to the exception list here that makes us fall back from sacct to scontrol, where we don't list all jobs in the command. We also probably need some machinery to limit the maximum number of jobs asked about at a time (or maybe just the maximum command-line length directly).

But as a workaround, limiting the max jobs in flight ought to work.
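The chunked-query idea can be sketched in shell (query_in_chunks and the 1000-ID batch size are illustrative assumptions, not Toil's actual code):

```shell
# Illustrative sketch only (not Toil's implementation): split a long
# list of job IDs into fixed-size batches so each sacct invocation's
# argument list stays well under the kernel's ARG_MAX limit.
query_in_chunks() {
    # $1: file with one Slurm job ID per line; remaining args: the
    # status command to run (e.g. sacct -n --format=JobIDRaw,State)
    ids_file=$1
    shift
    # 1000 IDs per call keeps each argument list a few KB long;
    # xargs groups the IDs, tr joins each group with commas
    xargs -n 1000 < "$ids_file" | tr ' ' ',' |
    while read -r ids; do
        "$@" -j "$ids"
    done
}

# Usage against a real cluster (assumed file name):
# query_in_chunks job_ids.txt sacct -n --format=JobIDRaw,State,ExitCode
```

With 46k jobs this would issue 46 short sacct calls instead of one enormous one.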

@macmanes
Author

Thanks @glennhickey and @adamnovak. I can confirm that --maxJobs does work.

For the Toil developers: it would be great to be able to change maxJobs after submission, like you can with Slurm array jobs via scontrol update ArrayTaskThrottle=20 JobId=12345. This would let a user expand and contract the throttle based on available cluster resources.
