Fail workflow and release resources when a DAG pod fails #13979
Comments
It seems that the actual issue you are concerned about is that …
Yes, and also shut down running nodes, update their status and messages, save their logs if the pod is running, and delete the pod if it is pending.
Name: delightful-whale
Namespace: argo
ServiceAccount: unset (will run with the default ServiceAccount)
Status: Failed
Conditions:
PodRunning False
Completed True
Created: Mon Dec 09 15:43:23 +0800 (2 hours ago)
Started: Mon Dec 09 15:43:23 +0800 (2 hours ago)
Finished: Mon Dec 09 15:46:30 +0800 (2 hours ago)
Duration: 3 minutes 7 seconds
Progress: 2/3
ResourcesDuration: 1m14s*(1 cpu),10m16s*(100Mi memory)
STEP TEMPLATE PODNAME DURATION MESSAGE
✖ delightful-whale entrypoint
├─✖ sleep1 sleep-deadline delightful-whale-sleep-deadline-3517234507 16s Pod was active on the node longer than the specified deadline
├─✔ sleep2 sleep delightful-whale-sleep-3534012126 3m
└─✔ sleep3 sleep delightful-whale-sleep-3550789745 3m

Have you encountered any other issues?
If there are pods that keep running, resources are wasted, especially GPUs, so we expect the pods to be shut down gracefully. Will that fix resolve this?
To clarify, in the case you provided, …
So what I need may be …
I see, you want all nodes to fail fast immediately. This is not a bug; you need to propose a feature for it.
#5612 (comment) describes this with …
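As a stopgap while no such feature exists, one possible mitigation is to also set activeDeadlineSeconds at the workflow spec level, where exceeding the deadline causes the controller to terminate the whole workflow rather than only the offending pod. The sketch below is illustrative only (names, image, and durations are made up) and assumes a whole-workflow time budget is acceptable:

```yaml
# Sketch: contrasts the two levels at which activeDeadlineSeconds can be set.
# All names, images, and durations here are illustrative, not from this issue.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: deadline-levels-
spec:
  # Workflow-level deadline: when exceeded, the controller terminates the
  # entire workflow, including any still-running pods.
  activeDeadlineSeconds: 300
  entrypoint: main
  templates:
    - name: main
      # Template-level deadline: when exceeded, only pods of this template
      # are failed; sibling nodes keep running, which is the behaviour
      # reported in this issue.
      activeDeadlineSeconds: 15
      container:
        image: alpine:3.19
        command: [sleep, "60"]
```

This does not give per-node fail-fast semantics, so a feature request is still the right path for that.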
Pre-requisites
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.

What happened? What did you expect to happen?
When we configure activeDeadlineSeconds at the template level and a pod runs past the deadline, the pod fails with the message "Pod was active on the node longer than the specified deadline", but the workflow keeps running and waits for the other pods to finish. We expect the workflow to fail as well, and to clean up the still-running pods so that the related resources are released.

Version(s)
v3.4.8
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
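A minimal sketch of such a workflow, reconstructed from the argo get output above; the sleep and sleep-deadline template names and the rough durations come from that output, while the image and command are assumptions:

```yaml
# Hypothetical reproduction reconstructed from the `argo get` output above;
# the image, command, and exact durations are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: deadline-repro-
spec:
  entrypoint: entrypoint
  templates:
    - name: entrypoint
      dag:
        tasks:
          - name: sleep1
            template: sleep-deadline
          - name: sleep2
            template: sleep
          - name: sleep3
            template: sleep
    - name: sleep-deadline
      # Template-level deadline: this pod is failed after ~15s, but the
      # workflow keeps waiting for sleep2 and sleep3 to finish.
      activeDeadlineSeconds: 15
      container:
        image: alpine:3.19
        command: [sleep, "180"]
    - name: sleep
      container:
        image: alpine:3.19
        command: [sleep, "180"]
```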
Logs from the workflow controller
Logs from your workflow's wait container