
Improvements to Workflow Queue Tools #2351

Merged
merged 6 commits into from
Sep 30, 2024

Conversation


@akremin akremin commented Aug 30, 2024

This PR addresses Issue #2350 and Issue #2329. It also improves queue querying by removing COMPLETED jobs from the list of Slurm job IDs we request information about, since we already know what happened to those. This could be expanded to all "final" states, but for now I have limited it to COMPLETED, which is the most common final state in the processing tables; I'd like to think a little harder about the ramifications of not querying for the other "final" states.
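The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the actual desispec code; the function name and the dict-based state lookup are hypothetical:

```python
# Hypothetical sketch: before calling sacct, drop the Slurm job IDs whose
# last recorded state is already COMPLETED, since querying them again
# tells us nothing new.
def jobs_to_query(qids, known_states):
    """Return only the QIDs still worth asking sacct about.

    qids: Slurm job IDs from the processing table.
    known_states: dict mapping QID -> last recorded state string.
    """
    return [qid for qid in qids if known_states.get(qid) != 'COMPLETED']
```

Extending this to all final states would just mean testing membership in a larger set instead of comparing against the single string 'COMPLETED'.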

This solves #2350 by identifying failed dependencies and not submitting the jobs that depend on them. Instead, it prints a message notifying the user of the failed dependency and shows what it would have attempted had the dependency not failed. It assigns a STATUS of "UNSUBMITTED", which is the same outcome as the code in main, except that main first spends 3 minutes attempting to submit to Slurm and being refused.
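The dependency check can be illustrated with the sketch below. The set of "bad final states", the function names, and the row/dict structures are all assumptions for illustration, not the real desispec implementation:

```python
# Assumed set of Slurm final states that should block submission; the
# actual set used by the code may differ.
BAD_FINAL_STATES = {'CANCELLED', 'FAILED', 'TIMEOUT'}

def bad_dependencies(dep_states):
    """Return the dependency QIDs that ended in a bad final state.

    dep_states: dict mapping dependency QID -> state string.
    """
    return [qid for qid, state in dep_states.items()
            if state in BAD_FINAL_STATES]

def maybe_submit(row, dep_states, submit_fn):
    """Submit via submit_fn unless a dependency failed; otherwise mark
    the row UNSUBMITTED immediately instead of letting Slurm refuse it."""
    bad = bad_dependencies(dep_states)
    if bad:
        print(f"Found dependencies in a bad final state: {bad}; "
              "not submitting this job.")
        row['STATUS'] = 'UNSUBMITTED'
        return row
    return submit_fn(row)
```

The point of the early check is to reach the same UNSUBMITTED outcome without the 3-minute round of refused Slurm submissions.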

This solves #2329 by including "FAILED" by default in desispec.workflow.processing.update_and_recursively_submit() and in any code that calls desispec.workflow.queue.get_resubmission_states(). I added a new variable no_resub_failed, False by default, that can be provided at the command line as --no-resub-failed to both desi_proc_night and desi_resubmit_queue_failures to restore the old behavior, in which FAILED jobs are not resubmitted by desi_proc_night.
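The new default can be sketched like this. The base list of resubmission states here is an assumption for illustration; only the FAILED/no_resub_failed behavior is taken from the PR description:

```python
def get_resubmission_states(no_resub_failed=False):
    """Sketch of the described behavior: FAILED is included in the
    resubmission states by default, and --no-resub-failed opts out.

    The base set of states below is assumed, not the actual desispec list.
    """
    states = ['UNSUBMITTED', 'BOOT_FAIL', 'DEADLINE', 'NODE_FAIL',
              'OUT_OF_MEMORY', 'PREEMPTED', 'TIMEOUT']
    if not no_resub_failed:
        states.append('FAILED')
    return states
```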

Lastly, this cleans a few things up. For instance, if no QIDs are provided, sacct returns the three most recent jobs; that is benign, but it is better to intercept the empty list and return an empty table. Also, desispec.workflow.queue.update_from_queue() was modifying the processing table in place; I've updated the code to first make a copy, which is then returned.
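Both cleanups follow a common pattern, sketched below. The real code operates on astropy Tables; this illustration uses a list of row dicts, and the function bodies are hypothetical:

```python
import copy

def queue_states(qids):
    """Guard sketch: with no QIDs, return an empty result immediately
    instead of letting sacct fall back to the most recent jobs."""
    if len(qids) == 0:
        return {}
    # ... here the real code would query sacct for the given QIDs ...
    raise NotImplementedError("sacct query omitted in this sketch")

def update_from_queue(ptable, qstates):
    """Copy-first sketch: return an updated copy of the processing table
    rather than modifying the caller's table in place.

    ptable: list of row dicts (stand-in for an astropy Table).
    qstates: dict mapping QID -> current Slurm state.
    """
    updated = copy.deepcopy(ptable)
    for row in updated:
        qid = row.get('LATEST_QID')
        if qid in qstates:
            row['STATUS'] = qstates[qid]
    return updated
```

The copy-first approach means callers who ignore the return value keep an unmodified table, which avoids surprising side effects.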

I believe I have tested all of the new functions in an ipython session, running the various codes and confirming that they do what I expected. This includes making a fake processing table row with a CANCELLED dependency and verifying that the code chooses not to submit the job:

INFO:processing.py:647:submit_batch_script: Found a dependency in a bad final state: CANCELLED for depjobid=29948166, not submitting this job.
INFO:processing.py:750:submit_batch_script: Would have submitted: ['sbatch', '--parsable', '--dependency=afterok:29948166', '/global/cfs/cdirs/desi/spectro/redux/test_kibo/run/scripts/night/20221220/nightlyflat-20221220-00159151-a0123456789.slurm']

@akremin akremin requested a review from sbailey August 30, 2024 04:59
@sbailey sbailey merged commit 67201d7 into main Sep 30, 2024
26 checks passed
@sbailey sbailey deleted the queue_deps_handling branch September 30, 2024 22:23