One of the hassles of running a production is that it requires more jobs than the NERSC limit of 5000 in the regular queue. Currently this is handled by humans monitoring the queue and launching additional nights/months when it drops below some threshold of pending jobs. This could be automated with something like an hourly scronjob script that would:
1. Check the queue; if there are already more than N jobs then exit, otherwise proceed
2. Check the exposure tables to determine the possible set of nights that need to be submitted
3. Check the processing tables to determine the set of nights that have already been submitted
4. Submit additional nights (`desi_proc_night`) in order until the queue is above N jobs again or we've run out of nights
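The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not existing desispec code: `launch_nights`, `count_queued_jobs`, and `submit_night` are hypothetical names, the sets of nights are assumed to have already been derived from the exposure and processing tables, and the remaining `desi_proc_night` options (elided as `...` in this issue) are left out.

```python
import subprocess
from typing import Callable, Iterable, List


def count_queued_jobs(user: str) -> int:
    """Count a user's jobs in the Slurm queue (assumes one job ID per output token)."""
    out = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%i"],
        check=True, capture_output=True, text=True,
    ).stdout
    return len(out.split())


def submit_night(night: int) -> None:
    """Submit one night; a real call would need the production's other options."""
    subprocess.run(["desi_proc_night", "-n", str(night)], check=True)


def launch_nights(
    exposure_nights: Iterable[int],
    submitted_nights: Iterable[int],
    threshold: int,
    count_jobs: Callable[[], int],
    submit: Callable[[int], None],
) -> List[int]:
    """Submit unsubmitted nights in order until the queue exceeds `threshold`.

    `exposure_nights` and `submitted_nights` are assumed to come from the
    exposure and processing tables, respectively (steps 2 and 3).
    Returns the list of nights submitted during this call.
    """
    # Step 1: exit early if the queue is already above the threshold.
    if count_jobs() > threshold:
        return []

    # Steps 2-3: nights with exposures that have not been submitted yet.
    already = set(submitted_nights)
    pending = sorted(n for n in set(exposure_nights) if n not in already)

    # Step 4: submit in order, rechecking the queue after each night.
    launched = []
    for night in pending:
        submit(night)
        launched.append(night)
        if count_jobs() > threshold:
            break
    return launched
```

Passing `count_jobs` and `submit` as callables keeps the selection logic testable without touching Slurm; an hourly scronjob wrapper would pass something like `lambda: count_queued_jobs(user)` and `submit_night`.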
Ideally this is a wrapper around the existing job-launching infrastructure and doesn't require any additional changes to the processing tables or other bookkeeping. Items to work out:
- How to handle a failure of `desi_proc_night -n NIGHT ...`, e.g. due to a transient sbatch hiccup. If it got far enough to write a partial processing table, the above algorithm would think the night was already submitted and, semi-incorrectly, proceed with the next night on its next run. But deriving whether a night is only partially submitted is potentially tricky, and potentially problematic if the launcher keeps trying and failing on the same night.
- Should the launcher also automatically resubmit nights that previously ran with failures (`desi_resubmit_queue_failures -n NIGHT ...`)? Could be nice, but it would need additional bookkeeping to limit the number of attempts. Feels like a beyond-the-baseline feature that could jeopardize getting the original feature done in time for Kibo.
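One possible shape for the attempt-limiting bookkeeping mentioned in both items is a small side file kept separate from the processing tables. The sketch below is an assumption for discussion, not an existing desispec feature: the JSON file, `load_attempts`, and `try_night` are all hypothetical, and `action` stands in for whatever call submits or resubmits a night.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def load_attempts(path: Path) -> Dict[str, int]:
    """Read the per-night attempt counts, or return an empty dict if no file yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {}


def try_night(
    night: int,
    attempts_file: Path,
    max_attempts: int,
    action: Callable[[int], None],
) -> str:
    """Run `action(night)` unless the night already used up its attempts.

    Returns "skipped", "failed", or "ok".
    """
    attempts = load_attempts(attempts_file)
    key = str(night)  # JSON object keys are strings
    if attempts.get(key, 0) >= max_attempts:
        return "skipped"

    # Record the attempt before running, so a crash mid-submission still counts.
    attempts[key] = attempts.get(key, 0) + 1
    attempts_file.write_text(json.dumps(attempts, indent=2, sort_keys=True))

    try:
        action(night)
    except Exception:
        return "failed"
    return "ok"
```

With this, a night that keeps failing stops being retried after `max_attempts` runs and can be flagged for a human, at the cost of one extra file to maintain.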