
Automated job launcher #2292

Closed
sbailey opened this issue Jul 15, 2024 · 3 comments

@sbailey
Contributor

sbailey commented Jul 15, 2024

One of the hassles of running a production is that it requires more jobs than the NERSC regular-queue limit of 5000. Currently this is handled by humans monitoring the queue and launching additional nights/months when the number of pending jobs drops below some threshold. This could be automated with something like an hourly scronjob script (sketched after the list below) to

  • Check the queue; if there are already more than N jobs then exit, otherwise proceed
  • Check the exposure tables to determine the possible set of nights that need to be submitted
  • Check the processing tables to determine the set of nights that have already been submitted
  • Submit additional nights (desi_proc_night) in order until the queue is above N jobs again or we've run out of nights
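A minimal sketch of that loop, assuming Slurm's squeue for the queue check and hypothetical helpers for the exposure-table and processing-table queries; the function names and threshold are illustrative, not an actual implementation:

```python
"""Hourly scronjob sketch: top up the regular queue with additional nights."""
import subprocess

MAX_PENDING = 4500  # illustrative threshold N, below the NERSC 5000-job limit


def count_queued_jobs(user):
    """Count this user's pending+running jobs via Slurm's squeue."""
    out = subprocess.run(
        ["squeue", "-u", user, "-h", "-t", "pending,running", "-o", "%i"],
        capture_output=True, text=True, check=True)
    return len(out.stdout.split())


def launch_nights(user, nights_with_exposures, nights_already_submitted):
    """Submit unsubmitted nights in order until the queue is above MAX_PENDING."""
    todo = sorted(set(nights_with_exposures) - set(nights_already_submitted))
    for night in todo:
        if count_queued_jobs(user) >= MAX_PENDING:
            break  # queue is full enough; the next hourly run will continue
        # desi_proc_night is the existing per-night launcher referenced above
        subprocess.run(["desi_proc_night", "-n", str(night)], check=True)
```

The two night lists would come from the exposure tables and processing tables respectively, and the whole thing would run hourly from scrontab.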

Ideally this would be a wrapper around existing job-launching infrastructure and wouldn't require any additional changes to the processing tables or other bookkeeping. Items to work out:

  • How to handle a failure of desi_proc_night -n NIGHT ..., e.g. due to a transient sbatch hiccup. If it got far enough to write a partial processing table, the above algorithm would think the night had already been submitted and semi-incorrectly proceed to the next night on its following run. But determining whether a night is only partially submitted is potentially tricky, and it would be problematic if the launcher kept retrying and failing on the same night.
  • Should the launcher also automatically resubmit nights that previously ran with failures (desi_resubmit_queue_failures -n NIGHT ...)? Could be nice, but it would need additional bookkeeping to limit the number of attempts (see the sketch after this list). This feels like a beyond-the-baseline feature that could jeopardize getting the original feature done in time for Kibo.
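If resubmission were ever added, one possible way to cap attempts without touching the processing tables would be a small side file of per-night counters; purely a sketch with a hypothetical file path, reusing the desi_resubmit_queue_failures command mentioned above:

```python
import json
import subprocess
from pathlib import Path

ATTEMPTS_FILE = Path("resubmit_attempts.json")  # hypothetical side-car bookkeeping file
MAX_ATTEMPTS = 2  # illustrative cap on retries per night


def resubmit_failures(night):
    """Resubmit a night's failed jobs at most MAX_ATTEMPTS times."""
    attempts = json.loads(ATTEMPTS_FILE.read_text()) if ATTEMPTS_FILE.exists() else {}
    if attempts.get(str(night), 0) >= MAX_ATTEMPTS:
        return False  # give up; leave this night for a human to inspect
    subprocess.run(["desi_resubmit_queue_failures", "-n", str(night)], check=True)
    attempts[str(night)] = attempts.get(str(night), 0) + 1
    ATTEMPTS_FILE.write_text(json.dumps(attempts, indent=2))
    return True
```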
@weaverba137
Member

I have a very basic shell script that performs some of this functionality. I can provide details tomorrow.

@weaverba137
Member

This script handles the basic logic of:

  • Given a list of batch jobs
  • See how many existing jobs are in the queue
  • If there is space available, submit jobs up to some defined limit

The script can be wrapped in a batch job submitted to the workflow queue.
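In Python rather than shell, that logic might look roughly like the following; this is only an illustrative sketch of the described behavior, not the contents of the actual script:

```python
import subprocess

QUEUE_LIMIT = 100  # illustrative limit on how many jobs to keep in the queue


def submit_with_headroom(batch_scripts, user):
    """Submit batch scripts only while there is room under QUEUE_LIMIT."""
    in_queue = len(subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%i"],
        capture_output=True, text=True, check=True).stdout.split())
    for script in batch_scripts:
        if in_queue >= QUEUE_LIMIT:
            break  # no space left; remaining scripts wait for the next pass
        subprocess.run(["sbatch", script], check=True)
        in_queue += 1
```

As noted, such a wrapper could itself be submitted as a batch job in the workflow queue.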

akremin self-assigned this Aug 6, 2024
@sbailey
Contributor Author

sbailey commented Aug 16, 2024

Implemented in PR #2322; closing ticket.

sbailey closed this as completed Aug 16, 2024