Multiple/many parallel jobs lead to "random" failures #490

Open
mapk-amazon opened this issue Jul 23, 2024 · 18 comments

Comments

@mapk-amazon
Contributor

Setup

The setup is deployed on AWS on EKS:

  • Version k8s: 1.28
  • Version Helm Chart: v5.9.0

Issue

Galaxy "usually" deploys jobs just fine. We started importing with Batch files into Galaxy and experience random failures of pods.

Logs

galaxy.jobs.runners.kubernetes ERROR 2024-07-23 15:35:40,109 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] No Jobs are available under expected selector app=gxy-galaxy-g674v
galaxy.jobs.runners.kubernetes ERROR 2024-07-23 15:35:40,120 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] No Jobs are available under expected selector app=gxy-galaxy-zpgqx
galaxy.jobs.runners.kubernetes ERROR 2024-07-23 15:35:40,130 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] No Jobs are available under expected selector app=gxy-galaxy-7kl4g
galaxy.jobs.runners.kubernetes ERROR 2024-07-23 15:35:40,159 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] No Jobs are available under expected selector app=gxy-galaxy-g9ts6

and

requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://172.20.0.1:443/apis/batch/v1/namespaces/galaxy/jobs/gxy-galaxy-f4b62
pykube.exceptions.HTTPError: Operation cannot be fulfilled on jobs.batch "gxy-galaxy-f4b62": the object has been modified; please apply your changes to the latest version and try again

In the k8s log we also see that the pod was launched around that time:

gxy-galaxy-f4b62-95mlz               0/1     ContainerCreating   0          1s
gxy-galaxy-f4b62-95mlz               1/1     Running             0          4s
gxy-galaxy-f4b62-95mlz               0/1     Completed           0          7s

Ideas/Hypothesis

The current hypothesis is that the hash suffix (e.g. f4b62) occasionally collides, leading to resource conflicts for the pods and to failures of some jobs.
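
For a rough sense of scale, here is a birthday-problem estimate of a collision among ~100 concurrently named jobs, assuming the 5-character suffix is drawn uniformly from a 27-symbol alphabet like the one Kubernetes uses for generated names (an assumption on my part; Galaxy may derive the suffix differently):

import math

alphabet_size = 27          # assumption: k8s-style generated-name alphabet
suffix_length = 5           # e.g. "f4b62"
n_jobs = 100

space = alphabet_size ** suffix_length                      # ~14.3 million possible suffixes
p = 1 - math.exp(-n_jobs * (n_jobs - 1) / (2 * space))      # birthday approximation
print(f"collision probability for {n_jobs} jobs: {p:.4%}")  # roughly 0.03%

Under that assumption a genuine name collision would be fairly rare at this scale, though not impossible.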

Does the team have any experience with this? Any fixes? Thank you :)

@ksuderman
Contributor

We recently received a similar report and I originally thought it may be related to Kubernetes 1.30 and the pykube-ng version we use. However, you are using 1.28 and I have been unable to recreate the problem. The one common thread is EKS. I will investigate that next.

See galaxyproject/galaxy#18567

@mapk-amazon
Contributor Author

I can test with various EKS versions; however, I am not sure how to build a minimal example with pykube-ng. If you have a snippet that produces a similar effect to Galaxy job scheduling, I can test it and report back :)

@nuwang
Member

nuwang commented Jul 29, 2024

Is there a stack trace? Or can the verbosity level be increased to produce one? If not, I think we have a problem with the error being inadequately logged, and we need to figure out which line of code is generating the exception.

Most likely, this is caused by a race condition between k8s modifying the job status and the runner attempting to read and modify the manifest itself. As mentioned earlier, the resulting resourceVersion conflict would cause this error. So if we re-queue the current task whenever this error is encountered, I would expect the runner thread to eventually fetch the latest version and succeed.
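
A rough sketch of that re-read-and-retry idea against pykube-ng's public API (not the runner's actual code; the helper name and back-off are mine):

import time

import pykube
from pykube.exceptions import HTTPError


def scale_down_with_retry(api, name, namespace="galaxy", attempts=5):
    """Retry the Job scale-down, re-reading the object on 409 Conflict."""
    for attempt in range(attempts):
        try:
            # Re-fetch so the patch is applied against the current resourceVersion.
            job = pykube.Job.objects(api).filter(namespace=namespace).get(name=name)
            job.scale(replicas=0)  # the same call that raises the 409 in the traceback
            return True
        except HTTPError as e:
            if getattr(e, "code", None) != 409:
                raise
            time.sleep(0.5 * (attempt + 1))  # short back-off before the next read
    return False


# Usage sketch:
# api = pykube.HTTPClient(pykube.KubeConfig.from_env())
# scale_down_with_retry(api, "gxy-galaxy-f4b62")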

@mapk-amazon
Contributor Author

This is "the most" detailed log I get:

galaxy-job-0 galaxy.jobs.runners.kubernetes ERROR 2024-07-29 13:46:58,387 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] Could not clean up k8s batch job. Ignoring...
galaxy-job-0 Traceback (most recent call last):
galaxy-job-0   File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/http.py", line 403, in raise_for_status
galaxy-job-0     resp.raise_for_status()
galaxy-job-0   File "/galaxy/server/.venv/lib/python3.12/site-packages/requests/models.py", line 1021, in raise_for_status
galaxy-job-0     raise HTTPError(http_error_msg, response=self)
galaxy-job-0 requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://172.20.0.1:443/apis/batch/v1/namespaces/galaxy/jobs/gxy-galaxy-4db2n
galaxy-job-0
galaxy-job-0 During handling of the above exception, another exception occurred:
galaxy-job-0
galaxy-job-0 Traceback (most recent call last):
galaxy-job-0   File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 872, in _handle_job_failure
galaxy-job-0     self.__cleanup_k8s_job(job)
galaxy-job-0   File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 879, in __cleanup_k8s_job
galaxy-job-0     delete_job(job, k8s_cleanup_job)
galaxy-job-0   File "/galaxy/server/lib/galaxy/jobs/runners/util/pykube_util.py", line 108, in delete_job
galaxy-job-0     job.scale(replicas=0)
galaxy-job-0   File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/mixins.py", line 31, in scale
galaxy-job-0     self.update()
galaxy-job-0   File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/objects.py", line 165, in update
galaxy-job-0     self.patch(self.obj, subresource=subresource)
galaxy-job-0   File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/objects.py", line 157, in patch
galaxy-job-0     self.api.raise_for_status(r)
galaxy-job-0   File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/http.py", line 410, in raise_for_status
galaxy-job-0     raise HTTPError(resp.status_code, payload["message"])
galaxy-job-0 pykube.exceptions.HTTPError: Operation cannot be fulfilled on jobs.batch "gxy-galaxy-4db2n": the object has been modified; please apply your changes to the latest version and try again

@nuwang
Member

nuwang commented Jul 30, 2024

Thanks. That helps with narrowing things down.

@ksuderman
Contributor

Changing/updating the pykube-ng version requires building a new galaxy-min Docker image. I have limited internet connectivity at the moment, so it is not easy for me to build and push a new image right now, but I'll try to get that done in the next few days.

@mapk-amazon
Contributor Author

How do you build the galaxy-min docker image?

Is it building this as-is, or is there a "min" configuration somewhere?

@nuwang
Member

nuwang commented Aug 6, 2024

@mapk-amazon That's the right image. Building it as is will do the job. If you'd like to test the changes, please try this branch: galaxyproject/galaxy#18514
This has some fixes, including the pykube upgrade that may solve this issue.

@almahmoud
Member

FWIW @mapk-amazon, you can also use ghcr.io/bioconductor/galaxy:dev, which is the image built from that PR.

@mapk-amazon
Contributor Author

Thank you all. I used ghcr.io/bioconductor/galaxy:dev, otherwise the same setup as at the start. I uploaded 100x 1 MB files with random content (see the sketch after the traceback below). It failed for 2 of them with the same error:

galaxy.jobs.runners.kubernetes ERROR 2024-08-06 18:06:32,493 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] Could not clean up k8s batch job. Ignoring...
Traceback (most recent call last):
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/http.py", line 437, in raise_for_status
    resp.raise_for_status()
  File "/galaxy/server/.venv/lib/python3.12/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://172.20.0.1:443/apis/batch/v1/namespaces/galaxy/jobs/gxy-galaxy-vnjqk

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 912, in _handle_job_failure
    self.__cleanup_k8s_job(job)
  File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 919, in __cleanup_k8s_job
    delete_job(job, k8s_cleanup_job)
  File "/galaxy/server/lib/galaxy/jobs/runners/util/pykube_util.py", line 115, in delete_job
    job.scale(replicas=0)
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/mixins.py", line 30, in scale
    self.update()
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/objects.py", line 165, in update
    self.patch(self.obj, subresource=subresource)
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/objects.py", line 157, in patch
    self.api.raise_for_status(r)
  File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/http.py", line 444, in raise_for_status
    raise HTTPError(resp.status_code, payload["message"])

@ksuderman
Contributor

Thanks @mapk-amazon, it sure looks like a race condition. How did you upload the 100 files? Through the UI, API, or other means (bioblend etc)?

@pcm32
Member

pcm32 commented Aug 16, 2024 via email

@pcm32
Member

pcm32 commented Aug 16, 2024 via email

@mapk-amazon
Contributor Author

Thank you for your input!

@ksuderman I use the web interface. I can try the API if you think it makes a difference.
@pcm32 Yes, the job fails. It looks like this in the UI:

[screenshots of the failed job in the Galaxy UI]

@pcm32
Member

pcm32 commented Aug 16, 2024

But yes, I do see this error every now and then in our logs; maybe I don't see it as an error in the UI because of the resubmissions.

@ksuderman
Contributor

> When running hundreds of jobs, you're always bound to get some arbitrary errors; we mitigate that in our use of the setup with aggressive resubmission policies.

True, but we are getting reports of the 409 Client Error from other users even with only a handful of jobs, and I've never been able to recreate the error myself. I do get occasional failures when running lots of jobs, but I don't recall them being a 409. I am hoping to find a common underlying cause.

@mapk-amazon no need to try the API; I just want to make sure I am using the same procedure when I try to recreate the problem.

@mapk-amazon
Contributor Author

Update: I believe I now know what is happening. In my understanding, the aggressive "retries" are the root cause of the issues.

The job handler pod (the one scheduling the pods) shows, for failing pods, that Galaxy receives the information about the pod's completion twice.

DEBUG 2024-10-14 20:12:36,480 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] Job id: gxy-galaxy-dkpc5 with k8s id: gxy-galaxy-dkpc5 succeeded
DEBUG 2024-10-14 20:12:38,484 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] Job id: gxy-galaxy-dkpc5 with k8s id: gxy-galaxy-dkpc5 succeeded

Then it starts cleaning up (twice) and one attempt fails, as the other one has already deleted the job or started deleting it. Finally, it shows tool_stdout and tool_stderr twice:

DEBUG 2024-10-14 20:12:54,185 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (3464/gxy-galaxy-dkpc5) tool_stdout: 
DEBUG 2024-10-14 20:12:54,186 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (3464/gxy-galaxy-dkpc5) job_stdout: 
DEBUG 2024-10-14 20:12:54,186 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (3464/gxy-galaxy-dkpc5) tool_stderr: 
DEBUG 2024-10-14 20:12:54,186 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (3464/gxy-galaxy-dkpc5) job_stderr: Job output not returned from cluster

It seems the first pass had already moved the data, and the second no longer found the file.

The result is a technically successful job (the container finished and the results were processed successfully once), but the second, later iteration responds with an error and Galaxy believes the job failed.
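
A minimal sketch of the de-duplication that would avoid this, assuming the monitor can report the same succeeded Job twice (function names are hypothetical, not Galaxy's actual runner code):

import threading

_handled = set()
_handled_lock = threading.Lock()


def collect_outputs(job_id):
    """Placeholder: move tool_stdout/tool_stderr and datasets off the node."""


def cleanup_k8s_job(job_id):
    """Placeholder: scale the batch Job to 0 and delete it."""


def on_job_succeeded(job_id):
    # Process a succeeded Job exactly once; ignore duplicate notifications.
    with _handled_lock:
        if job_id in _handled:
            # Second notification: outputs were already moved, so re-processing
            # would find nothing and wrongly mark the job as failed.
            return
        _handled.add(job_id)
    collect_outputs(job_id)
    cleanup_k8s_job(job_id)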

@mapk-amazon
Contributor Author

Update 2: I believe I was wrong (yet again). Please take a look at the PR galaxyproject/galaxy#19001 :)
