Fail workflow and release resources when dag pod was failed #13979

Closed
3 of 4 tasks
imliuda opened this issue Dec 9, 2024 · 8 comments
imliuda commented Dec 9, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

When we configure activeDeadlineSeconds at the template level and a pod exceeds that deadline, the pod fails with the message "Pod was active on the node longer than the specified deadline", but the workflow keeps running and waits for the other pods to finish. We expect the workflow to fail as well and to clean up the running pods so that their resources are released.
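For comparison, a deadline set at the workflow level does fail the whole workflow when exceeded and terminates its remaining pods; a minimal sketch, reusing the reproduction below:

spec:
  activeDeadlineSeconds: 15   # workflow-level deadline: on expiry the workflow is marked Failed and its running pods are terminated
  entrypoint: entrypoint      # remaining fields as in the reproduction below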

Version(s)

v3.4.8

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

metadata:
  name: delightful-whale
  namespace: default-argo-managed
  labels:
    example: 'true'
spec:
  entrypoint: entrypoint
  templates:
    - name: entrypoint
      dag:
        failFast: true
        tasks:
          - name: sleep1
            template: sleep-deadline
          - name: sleep2
            template: sleep
          - name: sleep3
            template: sleep
    - name: sleep-deadline
      activeDeadlineSeconds: 15
      container:
        name: main
        image: busybox
        command:
          - sh
        args:
          - -c
          - sleep 180
    - name: sleep
      container:
        name: main
        image: busybox
        command:
          - sh
        args:
          - -c
          - sleep 180
  ttlStrategy:
    secondsAfterCompletion: 300
  podGC:
    strategy: OnPodCompletion

Logs from the workflow controller

time="2024-12-09T06:51:14.917Z" level=info msg="Processing workflow" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:14.921Z" level=info msg="Updated phase  -> Running" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:14.923Z" level=info msg="DAG node delightful-whale11 initialized Running" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:14.923Z" level=info msg="All of node delightful-whale11.sleep1 dependencies [] completed" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:14.923Z" level=info msg="Pod node delightful-whale11-754256265 initialized Pending" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:14.941Z" level=info msg="Created pod: delightful-whale11.sleep1 (delightful-whale11-sleep-deadline-754256265)" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:14.941Z" level=info msg="All of node delightful-whale11.sleep2 dependencies [] completed" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:14.941Z" level=info msg="Pod node delightful-whale11-703923408 initialized Pending" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:15.031Z" level=info msg="Created pod: delightful-whale11.sleep2 (delightful-whale11-sleep-703923408)" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:15.031Z" level=info msg="All of node delightful-whale11.sleep3 dependencies [] completed" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:15.031Z" level=info msg="Pod node delightful-whale11-720701027 initialized Pending" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:15.142Z" level=info msg="Created pod: delightful-whale11.sleep3 (delightful-whale11-sleep-720701027)" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:15.143Z" level=info msg="TaskSet Reconciliation" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:15.143Z" level=info msg=reconcileAgentPod namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:15.244Z" level=info msg="Workflow update successful" namespace=default-argo-managed phase=Running resourceVersion=400592186 workflow=delightful-whale11
time="2024-12-09T06:51:24.942Z" level=info msg="Processing workflow" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:24.943Z" level=info msg="Task-result reconciliation" namespace=default-argo-managed numObjs=0 workflow=delightful-whale11
time="2024-12-09T06:51:24.943Z" level=info msg="node changed" namespace=default-argo-managed new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=delightful-whale11-720701027 old.message= old.phase=Pending old.progress=0/1 workflow=delightful-whale11
time="2024-12-09T06:51:24.943Z" level=info msg="node changed" namespace=default-argo-managed new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=delightful-whale11-754256265 old.message= old.phase=Pending old.progress=0/1 workflow=delightful-whale11
time="2024-12-09T06:51:24.943Z" level=info msg="node changed" namespace=default-argo-managed new.message= new.phase=Running new.progress=0/1 nodeID=delightful-whale11-703923408 old.message= old.phase=Pending old.progress=0/1 workflow=delightful-whale11
time="2024-12-09T06:51:24.943Z" level=info msg="TaskSet Reconciliation" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:24.943Z" level=info msg=reconcileAgentPod namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:24.952Z" level=info msg="Workflow update successful" namespace=default-argo-managed phase=Running resourceVersion=400592324 workflow=delightful-whale11
time="2024-12-09T06:51:34.954Z" level=info msg="Processing workflow" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:34.954Z" level=info msg="Task-result reconciliation" namespace=default-argo-managed numObjs=0 workflow=delightful-whale11
time="2024-12-09T06:51:34.955Z" level=info msg="node unchanged" namespace=default-argo-managed nodeID=delightful-whale11-720701027 workflow=delightful-whale11
time="2024-12-09T06:51:34.955Z" level=info msg="node unchanged" namespace=default-argo-managed nodeID=delightful-whale11-754256265 workflow=delightful-whale11
time="2024-12-09T06:51:34.955Z" level=info msg="node unchanged" namespace=default-argo-managed nodeID=delightful-whale11-703923408 workflow=delightful-whale11
time="2024-12-09T06:51:34.955Z" level=info msg="TaskSet Reconciliation" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:34.955Z" level=info msg=reconcileAgentPod namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:47.514Z" level=info msg="Processing workflow" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:47.515Z" level=info msg="Task-result reconciliation" namespace=default-argo-managed numObjs=1 workflow=delightful-whale11
time="2024-12-09T06:51:47.515Z" level=info msg="task-result changed" namespace=default-argo-managed nodeID=delightful-whale11-754256265 workflow=delightful-whale11
time="2024-12-09T06:51:47.515Z" level=info msg="node changed" namespace=default-argo-managed new.message= new.phase=Running new.progress=0/1 nodeID=delightful-whale11-720701027 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=delightful-whale11
time="2024-12-09T06:51:47.515Z" level=info msg="Pod failed: Pod was active on the node longer than the specified deadline" displayName=sleep1 namespace=default-argo-managed pod=delightful-whale11-sleep-deadline-754256265 templateName=sleep-deadline workflow=delightful-whale11
time="2024-12-09T06:51:47.515Z" level=info msg="node changed" namespace=default-argo-managed new.message="Pod was active on the node longer than the specified deadline" new.phase=Failed new.progress=0/1 nodeID=delightful-whale11-754256265 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=delightful-whale11
time="2024-12-09T06:51:47.515Z" level=info msg="node unchanged" namespace=default-argo-managed nodeID=delightful-whale11-703923408 workflow=delightful-whale11
time="2024-12-09T06:51:47.515Z" level=info msg="TaskSet Reconciliation" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:47.515Z" level=info msg=reconcileAgentPod namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:47.760Z" level=info msg="Workflow update successful" namespace=default-argo-managed phase=Running resourceVersion=400592605 workflow=delightful-whale11
time="2024-12-09T06:51:52.761Z" level=info msg="cleaning up pod" action=deletePod key=default-argo-managed/delightful-whale11-sleep-deadline-754256265/deletePod
time="2024-12-09T06:51:57.761Z" level=info msg="Processing workflow" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:57.762Z" level=info msg="Task-result reconciliation" namespace=default-argo-managed numObjs=0 workflow=delightful-whale11
time="2024-12-09T06:51:57.762Z" level=info msg="node unchanged" namespace=default-argo-managed nodeID=delightful-whale11-720701027 workflow=delightful-whale11
time="2024-12-09T06:51:57.762Z" level=info msg="node unchanged" namespace=default-argo-managed nodeID=delightful-whale11-703923408 workflow=delightful-whale11
time="2024-12-09T06:51:57.763Z" level=info msg="TaskSet Reconciliation" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:51:57.763Z" level=info msg=reconcileAgentPod namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:54:32.651Z" level=info msg="Processing workflow" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:54:32.651Z" level=info msg="Task-result reconciliation" namespace=default-argo-managed numObjs=1 workflow=delightful-whale11
time="2024-12-09T06:54:32.651Z" level=info msg="task-result changed" namespace=default-argo-managed nodeID=delightful-whale11-703923408 workflow=delightful-whale11
time="2024-12-09T06:54:32.651Z" level=info msg="node unchanged" namespace=default-argo-managed nodeID=delightful-whale11-720701027 workflow=delightful-whale11
time="2024-12-09T06:54:32.651Z" level=info msg="node changed" namespace=default-argo-managed new.message= new.phase=Succeeded new.progress=0/1 nodeID=delightful-whale11-703923408 old.message= old.phase=Running old.progress=0/1 workflow=delightful-whale11
time="2024-12-09T06:54:32.652Z" level=info msg="TaskSet Reconciliation" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:54:32.652Z" level=info msg=reconcileAgentPod namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:54:32.669Z" level=info msg="Workflow update successful" namespace=default-argo-managed phase=Running resourceVersion=400594621 workflow=delightful-whale11
time="2024-12-09T06:54:37.670Z" level=info msg="cleaning up pod" action=deletePod key=default-argo-managed/delightful-whale11-sleep-703923408/deletePod
time="2024-12-09T06:54:47.570Z" level=info msg="Processing workflow" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:54:47.570Z" level=info msg="Task-result reconciliation" namespace=default-argo-managed numObjs=1 workflow=delightful-whale11
time="2024-12-09T06:54:47.570Z" level=info msg="task-result changed" namespace=default-argo-managed nodeID=delightful-whale11-720701027 workflow=delightful-whale11
time="2024-12-09T06:54:47.571Z" level=info msg="node changed" namespace=default-argo-managed new.message= new.phase=Succeeded new.progress=0/1 nodeID=delightful-whale11-720701027 old.message= old.phase=Running old.progress=0/1 workflow=delightful-whale11
time="2024-12-09T06:54:47.571Z" level=info msg="Outbound nodes of delightful-whale11 set to [delightful-whale11-754256265 delightful-whale11-703923408 delightful-whale11-720701027]" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:54:47.571Z" level=info msg="node delightful-whale11 phase Running -> Failed" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:54:47.571Z" level=info msg="node delightful-whale11 finished: 2024-12-09 06:54:47.571470078 +0000 UTC" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:54:47.571Z" level=info msg="Checking daemoned children of delightful-whale11" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:54:47.571Z" level=info msg="TaskSet Reconciliation" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:54:47.571Z" level=info msg=reconcileAgentPod namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:54:47.571Z" level=info msg="Updated phase Running -> Failed" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:54:47.571Z" level=info msg="Marking workflow completed" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:54:47.571Z" level=info msg="Marking workflow as pending archiving" namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:54:47.571Z" level=info msg="Checking daemoned children of " namespace=default-argo-managed workflow=delightful-whale11
time="2024-12-09T06:54:47.577Z" level=info msg="cleaning up pod" action=deletePod key=default-argo-managed/delightful-whale11-1340600742-agent/deletePod
time="2024-12-09T06:54:47.579Z" level=info msg="Workflow update successful" namespace=default-argo-managed phase=Failed resourceVersion=400594811 workflow=delightful-whale11
time="2024-12-09T06:54:47.585Z" level=info msg="archiving workflow" namespace=default-argo-managed uid=30ae13c9-518f-457f-8f27-c591099ecb60 workflow=delightful-whale11
time="2024-12-09T06:54:47.604Z" level=info msg="Queueing Failed workflow default-argo-managed/delightful-whale11 for delete in 48h0m0s due to TTL"
time="2024-12-09T06:54:52.585Z" level=info msg="cleaning up pod" action=deletePod key=default-argo-managed/delightful-whale11-sleep-720701027/deletePod

Logs from your workflow's wait container

time="2024-12-09T06:51:21.572Z" level=info msg="Starting Workflow Executor" version=+ee51d78.dirty
time="2024-12-09T06:51:21.575Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-12-09T06:51:21.575Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=default-argo-managed podName=delightful-whale11-sleep-703923408 template="{\"name\":\"sleep\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"main\",\"image\":\"busybox\",\"command\":[\"sh\"],\"args\":[\"-c\",\"sleep 180\"],\"resources\":{}},\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"10.0.64.143:9000\",\"bucket\":\"default-argo\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"s3-credential\",\"key\":\"accessKey\"},\"secretKeySecret\":{\"name\":\"s3-credential\",\"key\":\"secretKey\"},\"key\":\"default-argo/delightful-whale11/delightful-whale11-sleep-703923408\"}}}" version="&Version{Version:+ee51d78.dirty,BuildDate:2024-10-14T09:36:47Z,GitCommit:ee51d78e75c02126ed656b98a464927ff32c4475,GitTag:2.1.2.1-3.4.8-fyp,GitTreeState:dirty,GoVersion:go1.20.4,Compiler:gc,Platform:linux/amd64,}"
time="2024-12-09T06:51:21.575Z" level=info msg="Starting deadline monitor"
Error from server (BadRequest): container "wait" in pod "delightful-whale11-sleep-720701027" is waiting to start: PodInitializing
imliuda changed the title from "Fail workflow and release resources when pod was failed due to activeDeadlineSeconds" to "Fail workflow and release resources when dag pod was failed" on Dec 9, 2024
jswxstw commented Dec 9, 2024

It seems that the actual issue you are concerned about is that failFast is not working, right?
If so, there is already a duplicate issue: #10312.

imliuda commented Dec 9, 2024

It seems that the actual issue you are concerned about is that failFast is not working, right? If so, there is already a duplicate issue: #10312.

Yes, and it should also shut down the running nodes, update their status and messages, save their logs if the pod is running, and delete the pod if it is still pending.

jswxstw commented Dec 9, 2024

Apart from the issue of failFast not working, everything seems to be as expected.

Name:                delightful-whale
Namespace:           argo
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Failed
Conditions:          
 PodRunning          False
 Completed           True
Created:             Mon Dec 09 15:43:23 +0800 (2 hours ago)
Started:             Mon Dec 09 15:43:23 +0800 (2 hours ago)
Finished:            Mon Dec 09 15:46:30 +0800 (2 hours ago)
Duration:            3 minutes 7 seconds
Progress:            2/3
ResourcesDuration:   1m14s*(1 cpu),10m16s*(100Mi memory)

STEP                 TEMPLATE        PODNAME                                     DURATION  MESSAGE
 ✖ delightful-whale  entrypoint                                                                                                                           
 ├─✖ sleep1          sleep-deadline  delightful-whale-sleep-deadline-3517234507  16s       Pod was active on the node longer than the specified deadline  
 ├─✔ sleep2          sleep           delightful-whale-sleep-3534012126           3m                                                                       
 └─✔ sleep3          sleep           delightful-whale-sleep-3550789745           3m

Have you encountered any other issues?

imliuda commented Dec 9, 2024

If pods keep running, they waste resources, especially GPUs, so we expect the pods to be shut down gracefully. Will that fix resolve this?

jswxstw commented Dec 9, 2024

Yes, and it should also shut down the running nodes, update their status and messages, save their logs if the pod is running, and delete the pod if it is still pending.

To clarify, in the case you provided, sleep1, sleep2, and sleep3 are running simultaneously. Even with failFast enabled, it will not interrupt the normal execution of sleep2 and sleep3.
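Until a feature like that exists, releasing those resources means shutting the workflow down externally once a task fails. A minimal sketch, assuming the reproduction above: setting the workflow's shutdown strategy (which is what `argo stop` and `argo terminate` do) makes the controller stop or delete the pods that are still running.

spec:
  shutdown: Terminate   # Terminate deletes the remaining pods immediately; Stop halts them but still lets exit handlers run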

imliuda commented Dec 9, 2024

So what I need may be a failUltraFast: when one of these nodes fails, the whole DAG or steps should fail immediately, rather than waiting for the others to finish running. That would really make sense.

jswxstw commented Dec 9, 2024

So what I need may be a failUltraFast: when one of these nodes fails, the whole DAG or steps should fail immediately, rather than waiting for the others to finish running. That would really make sense.

I see, you want all nodes to fail fast immediately. This is not a bug; you need to propose a feature for it.

imliuda closed this as completed on Dec 10, 2024
agilgur5 commented Dec 10, 2024

you need to propose a feature for it

#5612 (comment) describes this with failFastStrategy: Terminate (vs Stop vs the default of "run to completion"). #5398 mentions termination on failure as well. IIRC, it's been mentioned in a few other places too
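A purely hypothetical sketch of how that could look on the reproduction in this issue; failFastStrategy is not an implemented field, and the name and values are taken from the linked #5612 comment:

      dag:
        failFast: true
        failFastStrategy: Terminate   # hypothetical field: Terminate or Stop still-running sibling tasks instead of letting them run to completion
        tasks:
          - name: sleep1
            template: sleep-deadline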
