auto-resubmit FAILED jobs too? #2329
Comments
As hypothesized, the distinction between the two code paths was intentional: the resubmission script was meant as a tool to be run by a human after failed jobs had been "cleaned up," whereas the resubmission code built into the pipeline runs without human intervention. At the time, all known FAILED states were caused by issues that could not be resolved by simply retrying the same code on the same data (e.g. data-quality problems, bad input data, code bugs). To directly address one of the questions: the code submits a job initially and then will resubmit it two additional times (for a total of 3 attempts) automatically.

This new situation does make it seem as though there is a use case for resubmitting even FAILED jobs. In most cases this will be a waste of computing in which the job simply fails again, but few jobs genuinely fail anymore, so it would be a negligible part of our computing budget.

Here is my take on the situation:

Cons: The code gives up if any job fails to successfully process after 3 attempts. So the downside is that if a FAILED job is a genuine pipeline failure, it will fail two more times, and that will then prevent any later jobs on that night from being resubmitted due to e.g. timeouts. Whereas now we don't resubmit failures and therefore rarely reach 3 attempts on any given job, since timeouts are often resolved by resubmission (except in very bad NERSC conditions).

I have no strong opinion on this either way. We can always experiment with the resubmission of FAILED jobs and turn it off if it becomes more of a problem than a solution.
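To make the retry bound concrete, here is a minimal sketch, not the actual desispec implementation, of the "submit once, then automatically resubmit up to two more times" behavior described above; the helper names and the exact set of retryable states are assumptions:

```python
# Hypothetical sketch; submit_job and wait_for_final_state stand in for the
# real submission and queue-query helpers.
MAX_ATTEMPTS = 3  # 1 initial submission + 2 automatic resubmissions

# Adding "FAILED" to this set is what this issue proposes.
RETRYABLE_STATES = {"TIMEOUT", "CANCELLED"}

def submit_with_retries(job, submit_job, wait_for_final_state):
    """Submit `job`; resubmit while it lands in a retryable state, up to MAX_ATTEMPTS."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        jobid = submit_job(job)
        state = wait_for_final_state(jobid)  # e.g. COMPLETED, TIMEOUT, FAILED, ...
        if state == "COMPLETED":
            return jobid
        if state not in RETRYABLE_STATES:
            break  # non-retryable failure: give up immediately
    raise RuntimeError(f"{job!r} did not complete after {attempt} attempt(s)")
```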
I don't understand this point. I thought the "up to 3 times" tracking was per job, not per night. It seems that even if one job fails 3 times in a row, that should count only against that job, and the script should give up only on that job; any other jobs that don't depend on it should still be (re)submitted regardless. Please clarify. I think we want a situation where a repeatedly failing job only blocks its own dependents, not the rest of the night.
The code logic is that if any job has failed 3 times, the script does not attempt to resubmit any job.

tl;dr: We can add FAILED to the list without much problem. Task (2) that you're asking for could be done, but it is not a trivial change and would require careful code changes and thorough testing. I don't think it is worth it for the small corner case that spurred this ticket.

The reason is that these jobs don't exist in a vacuum. Often a job is dropped because of a dependency failure. If that is the case, then we first resubmit the dependency and only then submit the later job that depended on it, using the new Slurm jobid of the resubmitted dependency so that Slurm tracks the dependencies. This is done recursively, since a dependency might itself have failed because its own dependency failed. If we were to check per job, then if e.g. an arc failed, we would resubmit that arc 3 times, then submit it again each time one of its downstream jobs is resubmitted with it as a dependency, then again for the next downstream job, and so on. So for calibrations, we could end up submitting a job ~3xN times, where N is the number of jobs.

We could change how the code recursively submits jobs, but that would be a much larger update and would require extensive testing. We could also consider moving the "failed 3 times" check to a much lower level and have the code give up at that point, but that would have implications for […]
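For illustration, here is a hedged, self-contained sketch of the recursive dependency-resubmission pattern described above. The Job class and slurm_submit helper are hypothetical stand-ins, not the real desispec workflow API; the point is that a failed dependency is resubmitted first so its fresh Slurm jobid can be wired into the downstream job's dependency list:

```python
from dataclasses import dataclass, field
from typing import List, Optional

RESUBMIT_STATES = ("TIMEOUT", "CANCELLED")  # this issue proposes adding "FAILED"

@dataclass
class Job:
    name: str
    dependencies: List["Job"] = field(default_factory=list)
    slurm_jobid: Optional[int] = None
    state: str = "PENDING"

def slurm_submit(job: Job, afterok: List[int]) -> int:
    """Pretend sbatch call; prints the dependency string and returns a fake jobid."""
    deps = ":".join(str(i) for i in afterok)
    print(f"sbatch {job.name} --dependency=afterok:{deps}" if deps else f"sbatch {job.name}")
    return abs(hash(job.name)) % 10_000_000

def resubmit(job: Job) -> int:
    """Resubmit `job`, first recursively resubmitting any failed dependencies
    so the new submission references their fresh Slurm jobids."""
    dep_ids = []
    for dep in job.dependencies:
        if dep.state in RESUBMIT_STATES:
            resubmit(dep)  # recursion: the dependency's own dependencies are handled too
        if dep.slurm_jobid is not None:
            dep_ids.append(dep.slurm_jobid)
    # In the real pipeline, the "any job failed 3 times -> stop resubmitting the
    # whole night" check sits outside this recursion, which is why a purely
    # per-job counter would let a bad arc be resubmitted once per downstream job.
    job.slurm_jobid = slurm_submit(job, afterok=dep_ids)
    job.state = "PENDING"
    return job.slurm_jobid
```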
Implemented in #2351
desispec.workflow.queue.get_resubmission_states does not include FAILED in the list of states to resubmit. desi_resubmit_queue_failures adds FAILED unless --dont-resub-failed is specified, but the auto-resubmission in desi_proc_night will not resubmit FAILED jobs.

The original reasoning might have been that we want to auto-resubmit jobs that fail for NERSC reasons, which usually end up in the CANCELLED or TIMEOUT state, but not auto-resubmit jobs that failed for algorithmic reasons, which usually end up in the FAILED state and would likely just fail again. But 20240818 got into a bad state where arc job 29535157 failed with "Fatal error in PMPI_Init_thread: Other MPI error" and ended up in a FAILED state, which the desi_proc_night scronjob wouldn't resubmit.
Consider adding FAILED to the list of auto-retry states, since desi_proc_night only retries N times (N=2?) before giving up, so it won't get into an infinite resubmit-and-fail loop even if something is genuinely wrong.
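A rough sketch of what the proposed change could look like is below; the exact contents of the current state list and the keyword name are assumptions, and #2351 contains the actual implementation:

```python
def get_resubmission_states(resub_failed=True):
    """Return the Slurm states that the pipeline considers resubmittable.

    Illustrative only: the real desispec.workflow.queue.get_resubmission_states
    may differ in both its signature and its default state list.
    """
    states = ["UNSUBMITTED", "BOOT_FAIL", "DEADLINE", "NODE_FAIL",
              "OUT_OF_MEMORY", "PREEMPTED", "TIMEOUT", "CANCELLED"]
    if resub_failed:
        # FAILED usually means an algorithmic error a retry won't fix, but
        # transient problems (e.g. the MPI_Init error on 20240818) can also
        # land here; the existing cap on attempts keeps this from looping forever.
        states.append("FAILED")
    return states
```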