Fixes random job failures in kubernetes #19001
Conversation
This is a great find, and ultimately also a simple fix!
Makes sense, thanks for figuring this out!
Nice work! It would have taken some doing to figure this out!
Co-authored-by: Marius van den Beek <[email protected]>
Co-authored-by: Nuwan Goonasekera <[email protected]>
This PR was merged without a "kind/" label, please correct.
Thank you everyone for your support. I tested now
This fix addresses the random crashes of k8s jobs in Galaxy reported in galaxyproject/galaxy-helm#490 and the issues mentioned therein. The problem is that the k8s job status may not yet be final while already providing some information:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.30/#jobstatus-v1-batch
The previous code checked
if len(job.obj["status"]) == 0:
i.e., whether any status information was present at all; if so, it was treated as the final state of the job and processing continued. However, when the job status contains the field `uncountedTerminatedPods`, k8s has not yet finished determining whether the job failed or succeeded. The code then used this incomplete information (in my case 0 succeeded, 0 active, and 0 failed) and went through the decision tree to decide what to do. My suggestion is to instead wait until k8s has finished counting the terminated pods, and only then decide what to do. This reduced the failure rate from 2-5% to 0% :)
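The waiting logic described above could be sketched as follows. This is a hedged illustration, not the actual Galaxy patch: the function name `job_status_is_final` is hypothetical, and the dict access mirrors the pykube-style `job.obj["status"]` shown above, with field names taken from the Kubernetes JobStatus API.

```python
def job_status_is_final(job_status: dict) -> bool:
    """Return True only once k8s has finished attributing terminated pods.

    Hypothetical helper illustrating the fix: a non-empty
    `uncountedTerminatedPods` entry means the succeeded/failed counters
    in the status are not yet reliable, so the caller should keep polling.
    """
    if not job_status:
        # No status reported yet; definitely not final.
        return False
    # Per the JobStatus API, uncountedTerminatedPods lists pod UIDs whose
    # termination has not yet been counted into succeeded/failed.
    uncounted = job_status.get("uncountedTerminatedPods") or {}
    if uncounted.get("succeeded") or uncounted.get("failed"):
        # k8s is still deciding whether the job failed or succeeded.
        return False
    return True
```

A caller would poll the job until this returns True before walking the success/failure decision tree, instead of acting on a status that still reports 0 succeeded, 0 active, and 0 failed.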
How to test the changes?
License