
Automated job launcher #2292

Closed
sbailey opened this issue Jul 15, 2024 · 3 comments

@sbailey
Contributor

sbailey commented Jul 15, 2024

One of the hassles of running a production is that it requires more jobs than the NERSC regular-queue limit of 5000. Currently this is handled by humans monitoring the queue and launching additional nights/months when the number of pending jobs drops below some threshold. This could be automated with something like an hourly scronjob script (sketched after the list below) to

  • Check the queue; if there are already more than N jobs then exit, otherwise proceed
  • Check the exposure tables to determine the possible set of nights that need to be submitted
  • Check the processing tables to determine the set of nights that have already been submitted
  • Submit additional nights (desi_proc_night) in order until the queue is above N jobs again or we've run out of nights
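A minimal sketch of that loop, assuming Slurm's squeue for the queue check and hypothetical helpers for the exposure-table and processing-table queries; the function names and threshold are illustrative, not an actual implementation:

```python
"""Hourly scronjob sketch: top up the regular queue with additional nights."""
import subprocess

MAX_PENDING = 4500  # illustrative threshold N, below the NERSC 5000-job limit


def count_queued_jobs(user):
    """Count this user's pending+running jobs via Slurm's squeue."""
    out = subprocess.run(
        ["squeue", "-u", user, "-h", "-t", "pending,running", "-o", "%i"],
        capture_output=True, text=True, check=True)
    return len(out.stdout.split())


def launch_nights(user, nights_with_exposures, nights_already_submitted):
    """Submit unsubmitted nights in order until the queue is above MAX_PENDING."""
    todo = sorted(set(nights_with_exposures) - set(nights_already_submitted))
    for night in todo:
        if count_queued_jobs(user) >= MAX_PENDING:
            break  # queue is full enough; the next hourly run will continue
        # desi_proc_night is the existing per-night launcher referenced above
        subprocess.run(["desi_proc_night", "-n", str(night)], check=True)
```

The two night lists would come from the exposure tables and processing tables respectively, and the whole thing would run hourly from scrontab.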

Ideally this would be a wrapper around existing job-launching infrastructure and wouldn't require any additional changes to the processing tables or other bookkeeping. Items to work out:

  • How to handle a failure of desi_proc_night -n NIGHT ..., e.g. due to a transient sbatch hiccup. If it got far enough to write a partial processing table, the above algorithm would think the night had already been submitted and semi-incorrectly proceed to the next night on its following run. But determining whether a night is only partially submitted is potentially tricky, and it would be problematic if the launcher kept retrying and failing on the same night.
  • Should the launcher also automatically resubmit nights that previously ran with failures (desi_resubmit_queue_failures -n NIGHT ...)? Could be nice, but it would need additional bookkeeping to limit the number of attempts (see the sketch after this list). This feels like a beyond-the-baseline feature that could jeopardize getting the original feature done in time for Kibo.
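If resubmission were ever added, one possible way to cap attempts without touching the processing tables would be a small side file of per-night counters; purely a sketch with a hypothetical file path, reusing the desi_resubmit_queue_failures command mentioned above:

```python
import json
import subprocess
from pathlib import Path

ATTEMPTS_FILE = Path("resubmit_attempts.json")  # hypothetical side-car bookkeeping file
MAX_ATTEMPTS = 2  # illustrative cap on retries per night


def resubmit_failures(night):
    """Resubmit a night's failed jobs at most MAX_ATTEMPTS times."""
    attempts = json.loads(ATTEMPTS_FILE.read_text()) if ATTEMPTS_FILE.exists() else {}
    if attempts.get(str(night), 0) >= MAX_ATTEMPTS:
        return False  # give up; leave this night for a human to inspect
    subprocess.run(["desi_resubmit_queue_failures", "-n", str(night)], check=True)
    attempts[str(night)] = attempts.get(str(night), 0) + 1
    ATTEMPTS_FILE.write_text(json.dumps(attempts, indent=2))
    return True
```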
@weaverba137
Member

I have a very basic shell script that performs some of this functionality. I can provide details tomorrow.

@weaverba137
Member

This script handles the basic logic of:

  • Given a list of batch jobs
  • See how many existing jobs are in the queue
  • If there is space available, submit jobs up to some defined limit

The script can be wrapped in a batch job submitted to the workflow queue.
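In Python rather than shell, that logic might look roughly like the following; this is only an illustrative sketch of the described behavior, not the contents of the actual script:

```python
import subprocess

QUEUE_LIMIT = 100  # illustrative limit on how many jobs to keep in the queue


def submit_with_headroom(batch_scripts, user):
    """Submit batch scripts only while there is room under QUEUE_LIMIT."""
    in_queue = len(subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%i"],
        capture_output=True, text=True, check=True).stdout.split())
    for script in batch_scripts:
        if in_queue >= QUEUE_LIMIT:
            break  # no space left; remaining scripts wait for the next pass
        subprocess.run(["sbatch", script], check=True)
        in_queue += 1
```

As noted, such a wrapper could itself be submitted as a batch job in the workflow queue.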

akremin self-assigned this Aug 6, 2024
@sbailey
Contributor Author

sbailey commented Aug 16, 2024

Implemented in PR #2322; closing ticket.

sbailey closed this as completed Aug 16, 2024