Workflows with ContainerSet template stuck forever in case of pod deletion #13951
argo-workflows/workflow/controller/operator.go, lines 1236 to 1247 (at 1f304ba)

The issue lies here: when the pod is deleted, only its children are marked as …
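The failure mode described above can be sketched in a few lines of Go. This is an illustrative model only, not Argo's actual `operator.go` types: the point is that failing only the child (container) nodes while leaving the parent ContainerSet node untouched leaves the workflow running forever, whereas also failing the parent lets the retry logic proceed.

```go
package main

import "fmt"

// NodePhase is a simplified stand-in for Argo's node phase enum.
type NodePhase string

const (
	NodeRunning NodePhase = "Running"
	NodeFailed  NodePhase = "Failed"
)

// Node is a simplified stand-in for a workflow node with children.
type Node struct {
	Name     string
	Phase    NodePhase
	Children []*Node
}

// markPodDeletedBuggy mimics the reported behavior: when the backing pod
// disappears, only the children are failed; the parent ContainerSet node
// stays Running, so the workflow never completes.
func markPodDeletedBuggy(parent *Node) {
	for _, c := range parent.Children {
		c.Phase = NodeFailed
	}
	// parent.Phase is never updated here -> workflow stuck forever.
}

// markPodDeletedFixed also fails the parent node, which is what would
// allow retries (and mutex release) to kick in.
func markPodDeletedFixed(parent *Node) {
	for _, c := range parent.Children {
		c.Phase = NodeFailed
	}
	parent.Phase = NodeFailed
}

func main() {
	cs := &Node{Name: "containerset", Phase: NodeRunning,
		Children: []*Node{{Name: "main", Phase: NodeRunning}}}
	markPodDeletedBuggy(cs)
	fmt.Println("buggy parent phase:", cs.Phase)

	markPodDeletedFixed(cs)
	fmt.Println("fixed parent phase:", cs.Phase)
}
```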
@jswxstw Thank you for the clarification. The title or description can be corrected if necessary.
…argoproj#13951 Signed-off-by: oninowang <[email protected]>
This issue is similar to #12210, and #12756 has not fully resolved it.
Thank you, @jswxstw! I hope it gets merged sooner rather than later. It has been a recurring issue for our workflows with containersets.
Pre-requisites

- I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.

What happened? What did you expect to happen?
We are using cheap GKE Autopilot spot instances to run Argo Workflows jobs, which means a node can disappear at any moment, with the following consequences:

We see a bug when using the ContainerSet template with retries: the workflow gets stuck forever. This is critical when mutexes are used, because new workflows remain in Pending status until the stuck workflow is manually stopped.
Steps to reproduce:
Thanks for the great tool and fixes!
Version(s)
v3.5.?, v3.5.10, v3.6.0
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
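The reporter's workflow is not preserved in this capture. A minimal ContainerSet-with-retries workflow of the kind described might look like the following; this is a reconstruction for illustration, not the reporter's original manifest, and the names, image, and retry limit are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: containerset-retry-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "2"
      containerSet:
        containers:
          - name: main
            image: busybox
            # Sleep long enough that the pod (or its node) can be
            # deleted mid-run to reproduce the stuck state.
            command: ["sh", "-c", "sleep 300"]
```

Deleting the pod (or the spot node backing it) while the step is running should reproduce the stuck workflow.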
Logs from the workflow controller
Logs from your workflow's wait container