Workflows with ContainerSet template stuck forever in case of pod deletion #13951

Closed
3 of 4 tasks
artem-zherdiev-ingio opened this issue Nov 28, 2024 · 4 comments · Fixed by #13978

Comments

@artem-zherdiev-ingio

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

We are using cheap GKE Autopilot spot instances to run Argo Workflows jobs. This means that a node can disappear at any moment, with the following consequences:

The maximum grace period given to Spot Pods during preemption is 15 seconds. Requesting more than 15 seconds in terminationGracePeriodSeconds doesn't grant more than 15 seconds during preemption. On eviction, your Pod is sent the SIGTERM signal, and should take steps to shutdown during the grace period.

We see a bug when using the ContainerSet template with retries: the workflow gets stuck forever. This is critical when mutexes are used, because new workflows stay in pending status until the stuck workflow is stopped manually.

Steps to reproduce:

  1. Run the provided workflow.
  2. During execution, manually delete the running pod of set1 to trigger the "pod deleted" event.
  3. Wait until the workflow steps pass. The workflow itself stays stuck in Running forever.


Thanks for the great tool and fixes!

Version(s)

v3.5.?, v3.5.10, v3.6.0

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: containerset-retry-bug
  namespace: workflows
spec:
  serviceAccountName: workflows-executor

  ttlStrategy:
    secondsAfterCompletion: 360

  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: set1
            template: set
          - name: set2
            depends: set1.Succeeded
            template: set

    - name: set
      retryStrategy:
        limit: 2
        expression: 'asInt(lastRetry.exitCode) < 0 || lastRetry.message matches "node shutdown|pod deleted"'
      containerSet:
        containers:
          - name: some-container
            image: alpine
            command: [sh, -c]
            args: ["exit 0"]
          - name: main
            dependencies:
              - some-container
            image: alpine
            command: [sh, -c]
            args: ["sleep 30s"]
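
For readers unfamiliar with the retry expression, here is a rough Go sketch of the condition it encodes. It assumes that `matches` behaves like an unanchored Go regular-expression match and that the exit code is already an integer. `shouldRetry` is a hypothetical helper written for illustration, not part of Argo Workflows.

```go
package main

import (
	"fmt"
	"regexp"
)

// retryMessage is the same pattern used in the retryStrategy expression.
var retryMessage = regexp.MustCompile(`node shutdown|pod deleted`)

// shouldRetry is a hypothetical helper that mirrors the expression:
//   asInt(lastRetry.exitCode) < 0 || lastRetry.message matches "..."
func shouldRetry(exitCode int, message string) bool {
	return exitCode < 0 || retryMessage.MatchString(message)
}

func main() {
	fmt.Println(shouldRetry(0, "pod deleted"))   // true: message matches
	fmt.Println(shouldRetry(-1, "unrelated"))    // true: negative exit code
	fmt.Println(shouldRetry(1, "out of memory")) // false: neither condition holds
}
```

With the "pod deleted" message from step 2, this condition is true, which is consistent with the controller retrying set1 in the logs below.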

Logs from the workflow controller

{"Phase":"","ResourceVersion":"770926970","level":"info","msg":"Processing workflow","namespace":"workflows","time":"2024-11-28T13:34:58.518Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Task-result reconciliation","namespace":"workflows","numObjs":0,"time":"2024-11-28T13:34:58.545Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Updated phase  -\u003e Running","namespace":"workflows","time":"2024-11-28T13:34:58.545Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"warning","msg":"Node was nil, will be initialized as type Skipped","namespace":"workflows","time":"2024-11-28T13:34:58.547Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"was unable to obtain node for , letting display name to be nodeName","namespace":"workflows","time":"2024-11-28T13:34:58.547Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"DAG node containerset-retry-bug-xx6jj initialized Running","namespace":"workflows","time":"2024-11-28T13:34:58.547Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:34:58.547Z"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-488433360, taskName set1","time":"2024-11-28T13:34:58.547Z"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-488433360, taskName set1","time":"2024-11-28T13:34:58.547Z"}
{"level":"info","msg":"All of node containerset-retry-bug-xx6jj.set1 dependencies [] completed","namespace":"workflows","time":"2024-11-28T13:34:58.547Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"warning","msg":"Node was nil, will be initialized as type Skipped","namespace":"workflows","time":"2024-11-28T13:34:58.547Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Retry node containerset-retry-bug-xx6jj-488433360 initialized Running","namespace":"workflows","time":"2024-11-28T13:34:58.547Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Pod node containerset-retry-bug-xx6jj-144675699 initialized Pending","namespace":"workflows","time":"2024-11-28T13:34:58.548Z","workflow":"containerset-retry-bug-xx6jj"}
W1128 13:34:58.602222       1 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Pod workflows/containerset-retry-bug-xx6jj-set-144675699: defaulted unspecified 'cpu' resource for containers [init, wait, some-container, main] (see http://g.co/gke/autopilot-defaults).
{"level":"info","msg":"Created pod: containerset-retry-bug-xx6jj.set1(0) (containerset-retry-bug-xx6jj-set-144675699)","namespace":"workflows","time":"2024-11-28T13:34:58.602Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Container node containerset-retry-bug-xx6jj-2565252177 initialized Pending","namespace":"workflows","time":"2024-11-28T13:34:58.603Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Container node containerset-retry-bug-xx6jj-3751293222 initialized Pending","namespace":"workflows","time":"2024-11-28T13:34:58.603Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:34:58.603Z"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:34:58.603Z"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"workflows","time":"2024-11-28T13:34:58.603Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"reconcileAgentPod","namespace":"workflows","time":"2024-11-28T13:34:58.603Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Workflow update successful","namespace":"workflows","phase":"Running","resourceVersion":"770926974","time":"2024-11-28T13:34:58.616Z","workflow":"containerset-retry-bug-xx6jj"}
{"Phase":"Running","ResourceVersion":"770926974","level":"info","msg":"Processing workflow","namespace":"workflows","time":"2024-11-28T13:35:08.604Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Task-result reconciliation","namespace":"workflows","numObjs":1,"time":"2024-11-28T13:35:08.604Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-3751293222 phase Pending -\u003e Running","namespace":"workflows","time":"2024-11-28T13:35:08.604Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-2565252177 phase Pending -\u003e Succeeded","namespace":"workflows","time":"2024-11-28T13:35:08.604Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-2565252177 finished: 2024-11-28 13:35:08.604844582 +0000 UTC","namespace":"workflows","time":"2024-11-28T13:35:08.604Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node changed","namespace":"workflows","new.message":"","new.phase":"Running","new.progress":"0/1","nodeID":"containerset-retry-bug-xx6jj-144675699","old.message":"","old.phase":"Pending","old.progress":"0/1","time":"2024-11-28T13:35:08.604Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:35:08.605Z"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:35:08.605Z"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:35:08.605Z"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"workflows","time":"2024-11-28T13:35:08.605Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"reconcileAgentPod","namespace":"workflows","time":"2024-11-28T13:35:08.605Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Workflow update successful","namespace":"workflows","phase":"Running","resourceVersion":"770927227","time":"2024-11-28T13:35:08.617Z","workflow":"containerset-retry-bug-xx6jj"}
{"Phase":"Running","ResourceVersion":"770927227","level":"info","msg":"Processing workflow","namespace":"workflows","time":"2024-11-28T13:35:18.618Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Task-result reconciliation","namespace":"workflows","numObjs":1,"time":"2024-11-28T13:35:18.618Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Workflow pod is missing","namespace":"workflows","nodeName":"containerset-retry-bug-xx6jj.set1(0)","nodePhase":"Running","recentlyStarted":false,"time":"2024-11-28T13:35:18.618Z","workflow":"containerset-retry-bug-xx6jj"}
{"error":"pod deleted","level":"error","msg":"Mark error node","namespace":"workflows","nodeName":"containerset-retry-bug-xx6jj.set1(0)","time":"2024-11-28T13:35:18.618Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-144675699 phase Running -\u003e Error","namespace":"workflows","time":"2024-11-28T13:35:18.618Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-144675699 message: pod deleted","namespace":"workflows","time":"2024-11-28T13:35:18.618Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-144675699 finished: 2024-11-28 13:35:18.618568584 +0000 UTC","namespace":"workflows","time":"2024-11-28T13:35:18.618Z","workflow":"containerset-retry-bug-xx6jj"}
{"error":"container deleted","level":"error","msg":"Mark error node","namespace":"workflows","nodeName":"containerset-retry-bug-xx6jj.set1(0).some-container","time":"2024-11-28T13:35:18.618Z","workflow":"containerset-retry-bug-xx6jj"}
{"fromPhase":"Succeeded","level":"error","msg":"node is already fulfilled","namespace":"workflows","nodeName":"containerset-retry-bug-xx6jj.set1(0).some-container","time":"2024-11-28T13:35:18.618Z","toPhase":"Error","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-2565252177 phase Succeeded -\u003e Error","namespace":"workflows","time":"2024-11-28T13:35:18.618Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-2565252177 message: container deleted","namespace":"workflows","time":"2024-11-28T13:35:18.618Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:35:18.618Z"}
{"level":"info","msg":"Retry Policy: Always (onFailed: true, onError true)","namespace":"workflows","time":"2024-11-28T13:35:18.619Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"1 child nodes of containerset-retry-bug-xx6jj.set1 failed. Trying again...","namespace":"workflows","time":"2024-11-28T13:35:18.619Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Pod node containerset-retry-bug-xx6jj-748817078 initialized Pending","namespace":"workflows","time":"2024-11-28T13:35:18.621Z","workflow":"containerset-retry-bug-xx6jj"}
W1128 13:35:18.680033       1 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Pod workflows/containerset-retry-bug-xx6jj-set-748817078: defaulted unspecified 'cpu' resource for containers [init, wait, some-container, main] (see http://g.co/gke/autopilot-defaults).
{"level":"info","msg":"Created pod: containerset-retry-bug-xx6jj.set1(1) (containerset-retry-bug-xx6jj-set-748817078)","namespace":"workflows","time":"2024-11-28T13:35:18.680Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Container node containerset-retry-bug-xx6jj-2055230882 initialized Pending","namespace":"workflows","time":"2024-11-28T13:35:18.680Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Container node containerset-retry-bug-xx6jj-1684633093 initialized Pending","namespace":"workflows","time":"2024-11-28T13:35:18.680Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:35:18.680Z"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:35:18.680Z"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"workflows","time":"2024-11-28T13:35:18.680Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"reconcileAgentPod","namespace":"workflows","time":"2024-11-28T13:35:18.680Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Workflow update successful","namespace":"workflows","phase":"Running","resourceVersion":"770927778","time":"2024-11-28T13:35:18.692Z","workflow":"containerset-retry-bug-xx6jj"}
{"Phase":"Running","ResourceVersion":"770927778","level":"info","msg":"Processing workflow","namespace":"workflows","time":"2024-11-28T13:35:28.680Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Task-result reconciliation","namespace":"workflows","numObjs":2,"time":"2024-11-28T13:35:28.680Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-1684633093 phase Pending -\u003e Running","namespace":"workflows","time":"2024-11-28T13:35:28.681Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-2055230882 phase Pending -\u003e Succeeded","namespace":"workflows","time":"2024-11-28T13:35:28.681Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-2055230882 finished: 2024-11-28 13:35:28.681112696 +0000 UTC","namespace":"workflows","time":"2024-11-28T13:35:28.681Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node changed","namespace":"workflows","new.message":"","new.phase":"Running","new.progress":"0/1","nodeID":"containerset-retry-bug-xx6jj-748817078","old.message":"","old.phase":"Pending","old.progress":"0/1","time":"2024-11-28T13:35:28.681Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:35:28.681Z"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:35:28.681Z"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:35:28.681Z"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"workflows","time":"2024-11-28T13:35:28.681Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"reconcileAgentPod","namespace":"workflows","time":"2024-11-28T13:35:28.681Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Workflow update successful","namespace":"workflows","phase":"Running","resourceVersion":"770928093","time":"2024-11-28T13:35:28.693Z","workflow":"containerset-retry-bug-xx6jj"}
{"Phase":"Running","ResourceVersion":"770928093","level":"info","msg":"Processing workflow","namespace":"workflows","time":"2024-11-28T13:35:38.696Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Task-result reconciliation","namespace":"workflows","numObjs":2,"time":"2024-11-28T13:35:38.696Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node unchanged","namespace":"workflows","nodeID":"containerset-retry-bug-xx6jj-748817078","time":"2024-11-28T13:35:38.696Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:35:38.696Z"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:35:38.697Z"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:35:38.697Z"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"workflows","time":"2024-11-28T13:35:38.697Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"reconcileAgentPod","namespace":"workflows","time":"2024-11-28T13:35:38.697Z","workflow":"containerset-retry-bug-xx6jj"}
{"Phase":"Running","ResourceVersion":"770928093","level":"info","msg":"Processing workflow","namespace":"workflows","time":"2024-11-28T13:36:03.178Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Task-result reconciliation","namespace":"workflows","numObjs":2,"time":"2024-11-28T13:36:03.178Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-1684633093 phase Running -\u003e Succeeded","namespace":"workflows","time":"2024-11-28T13:36:03.178Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-1684633093 finished: 2024-11-28 13:36:03.178698798 +0000 UTC","namespace":"workflows","time":"2024-11-28T13:36:03.178Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node changed","namespace":"workflows","new.message":"","new.phase":"Succeeded","new.progress":"0/1","nodeID":"containerset-retry-bug-xx6jj-748817078","old.message":"","old.phase":"Running","old.progress":"0/1","time":"2024-11-28T13:36:03.178Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:36:03.178Z"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-488433360 phase Running -\u003e Succeeded","namespace":"workflows","time":"2024-11-28T13:36:03.179Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-488433360 finished: 2024-11-28 13:36:03.179218767 +0000 UTC","namespace":"workflows","time":"2024-11-28T13:36:03.179Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:36:03.179Z"}
{"level":"warning","msg":"was unable to obtain the node for containerset-retry-bug-xx6jj-538766217, taskName set2","time":"2024-11-28T13:36:03.179Z"}
{"level":"info","msg":"All of node containerset-retry-bug-xx6jj.set2 dependencies [set1] completed","namespace":"workflows","time":"2024-11-28T13:36:03.179Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"warning","msg":"Node was nil, will be initialized as type Skipped","namespace":"workflows","time":"2024-11-28T13:36:03.179Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Retry node containerset-retry-bug-xx6jj-538766217 initialized Running","namespace":"workflows","time":"2024-11-28T13:36:03.181Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Pod node containerset-retry-bug-xx6jj-141125360 initialized Pending","namespace":"workflows","time":"2024-11-28T13:36:03.181Z","workflow":"containerset-retry-bug-xx6jj"}
W1128 13:36:03.239577       1 warnings.go:70] autopilot-default-resources-mutator:Autopilot updated Pod workflows/containerset-retry-bug-xx6jj-set-141125360: defaulted unspecified 'cpu' resource for containers [init, wait, some-container, main] (see http://g.co/gke/autopilot-defaults).
{"level":"info","msg":"Created pod: containerset-retry-bug-xx6jj.set2(0) (containerset-retry-bug-xx6jj-set-141125360)","namespace":"workflows","time":"2024-11-28T13:36:03.240Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Container node containerset-retry-bug-xx6jj-869016352 initialized Pending","namespace":"workflows","time":"2024-11-28T13:36:03.240Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Container node containerset-retry-bug-xx6jj-2545150835 initialized Pending","namespace":"workflows","time":"2024-11-28T13:36:03.240Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"workflows","time":"2024-11-28T13:36:03.240Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"reconcileAgentPod","namespace":"workflows","time":"2024-11-28T13:36:03.240Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Workflow update successful","namespace":"workflows","phase":"Running","resourceVersion":"770929250","time":"2024-11-28T13:36:03.253Z","workflow":"containerset-retry-bug-xx6jj"}
{"action":"labelPodCompleted","key":"workflows/containerset-retry-bug-xx6jj-set-748817078/labelPodCompleted","level":"info","msg":"cleaning up pod","time":"2024-11-28T13:36:03.260Z"}
{"Phase":"Running","ResourceVersion":"770929250","level":"info","msg":"Processing workflow","namespace":"workflows","time":"2024-11-28T13:36:13.241Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Task-result reconciliation","namespace":"workflows","numObjs":3,"time":"2024-11-28T13:36:13.241Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-2545150835 phase Pending -\u003e Running","namespace":"workflows","time":"2024-11-28T13:36:13.241Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-869016352 phase Pending -\u003e Succeeded","namespace":"workflows","time":"2024-11-28T13:36:13.241Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-869016352 finished: 2024-11-28 13:36:13.241917948 +0000 UTC","namespace":"workflows","time":"2024-11-28T13:36:13.241Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node changed","namespace":"workflows","new.message":"","new.phase":"Running","new.progress":"0/1","nodeID":"containerset-retry-bug-xx6jj-141125360","old.message":"","old.phase":"Pending","old.progress":"0/1","time":"2024-11-28T13:36:13.241Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"workflows","time":"2024-11-28T13:36:13.242Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"reconcileAgentPod","namespace":"workflows","time":"2024-11-28T13:36:13.242Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Workflow update successful","namespace":"workflows","phase":"Running","resourceVersion":"770929528","time":"2024-11-28T13:36:13.254Z","workflow":"containerset-retry-bug-xx6jj"}
{"Phase":"Running","ResourceVersion":"770929528","level":"info","msg":"Processing workflow","namespace":"workflows","time":"2024-11-28T13:36:47.503Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Task-result reconciliation","namespace":"workflows","numObjs":3,"time":"2024-11-28T13:36:47.503Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-2545150835 phase Running -\u003e Succeeded","namespace":"workflows","time":"2024-11-28T13:36:47.503Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-2545150835 finished: 2024-11-28 13:36:47.503852095 +0000 UTC","namespace":"workflows","time":"2024-11-28T13:36:47.503Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node changed","namespace":"workflows","new.message":"","new.phase":"Succeeded","new.progress":"0/1","nodeID":"containerset-retry-bug-xx6jj-141125360","old.message":"","old.phase":"Running","old.progress":"0/1","time":"2024-11-28T13:36:47.503Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-538766217 phase Running -\u003e Succeeded","namespace":"workflows","time":"2024-11-28T13:36:47.504Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"node containerset-retry-bug-xx6jj-538766217 finished: 2024-11-28 13:36:47.504492725 +0000 UTC","namespace":"workflows","time":"2024-11-28T13:36:47.504Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"workflows","time":"2024-11-28T13:36:47.504Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"reconcileAgentPod","namespace":"workflows","time":"2024-11-28T13:36:47.504Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Workflow update successful","namespace":"workflows","phase":"Running","resourceVersion":"770930349","time":"2024-11-28T13:36:47.517Z","workflow":"containerset-retry-bug-xx6jj"}
{"action":"labelPodCompleted","key":"workflows/containerset-retry-bug-xx6jj-set-141125360/labelPodCompleted","level":"info","msg":"cleaning up pod","time":"2024-11-28T13:36:47.524Z"}
{"Phase":"Running","ResourceVersion":"770930349","level":"info","msg":"Processing workflow","namespace":"workflows","time":"2024-11-28T13:36:57.579Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Task-result reconciliation","namespace":"workflows","numObjs":3,"time":"2024-11-28T13:36:57.579Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"workflows","time":"2024-11-28T13:36:57.580Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"reconcileAgentPod","namespace":"workflows","time":"2024-11-28T13:36:57.580Z","workflow":"containerset-retry-bug-xx6jj"}
{"Phase":"Running","ResourceVersion":"770930349","level":"info","msg":"Processing workflow","namespace":"workflows","time":"2024-11-28T13:38:17.305Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Task-result reconciliation","namespace":"workflows","numObjs":3,"time":"2024-11-28T13:38:17.305Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"workflows","time":"2024-11-28T13:38:17.305Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"reconcileAgentPod","namespace":"workflows","time":"2024-11-28T13:38:17.305Z","workflow":"containerset-retry-bug-xx6jj"}
{"Phase":"Running","ResourceVersion":"770930349","level":"info","msg":"Processing workflow","namespace":"workflows","time":"2024-11-28T13:43:46.308Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Task-result reconciliation","namespace":"workflows","numObjs":3,"time":"2024-11-28T13:43:46.309Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"workflows","time":"2024-11-28T13:43:46.309Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"reconcileAgentPod","namespace":"workflows","time":"2024-11-28T13:43:46.309Z","workflow":"containerset-retry-bug-xx6jj"}
{"Phase":"Running","ResourceVersion":"770930349","level":"info","msg":"Processing workflow","namespace":"workflows","time":"2024-11-28T13:47:25.804Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Task-result reconciliation","namespace":"workflows","numObjs":3,"time":"2024-11-28T13:47:25.805Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"workflows","time":"2024-11-28T13:47:25.805Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"reconcileAgentPod","namespace":"workflows","time":"2024-11-28T13:47:25.805Z","workflow":"containerset-retry-bug-xx6jj"}
{"Phase":"Running","ResourceVersion":"770930349","level":"info","msg":"Processing workflow","namespace":"workflows","time":"2024-11-28T13:52:40.314Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"Task-result reconciliation","namespace":"workflows","numObjs":3,"time":"2024-11-28T13:52:40.314Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"TaskSet Reconciliation","namespace":"workflows","time":"2024-11-28T13:52:40.314Z","workflow":"containerset-retry-bug-xx6jj"}
{"level":"info","msg":"reconcileAgentPod","namespace":"workflows","time":"2024-11-28T13:52:40.314Z","workflow":"containerset-retry-bug-xx6jj"}

Logs from your workflow's wait container

-
@jswxstw
Member

jswxstw commented Nov 29, 2024

woc.markNodeError(node.Name, errors.New("", "pod deleted"))
// Set pod's child(container) error if pod deleted
for _, childNodeID := range node.Children {
	childNode, err := woc.wf.Status.Nodes.Get(childNodeID)
	if err != nil {
		woc.log.Errorf("was unable to obtain node for %s", childNodeID)
		continue
	}
	if childNode.Type == wfv1.NodeTypeContainer {
		woc.markNodeError(childNode.Name, errors.New("", "container deleted"))
	}
}

The issue lies here: when the pod is deleted, only its direct children are marked as container deleted. However, node main depends on node some-container and is therefore a child of some-container, not a direct child of node set1, so it is never marked as errored and stays in Running.
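
One possible way to cover this case is to walk all descendants of the pod node rather than only its direct children. The following is a minimal, self-contained sketch under stated assumptions: `Node`, `markContainersDeleted` and `buildExample` are simplified, hypothetical stand-ins and not the controller's real types. It is not necessarily how the linked fix is implemented.

```go
package main

import "fmt"

// Node is a simplified, hypothetical stand-in for a workflow node status.
type Node struct {
	ID       string
	Type     string // "Pod" or "Container"
	Phase    string
	Message  string
	Children []string
}

// markContainersDeleted visits every descendant of the pod node breadth-first
// and marks unfinished container nodes as Error. Skipping nodes that have
// already finished is a choice made for this sketch.
func markContainersDeleted(nodes map[string]*Node, podID string) {
	pod, ok := nodes[podID]
	if !ok {
		return
	}
	visited := map[string]bool{}
	queue := append([]string{}, pod.Children...)
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		if visited[id] {
			continue
		}
		visited[id] = true
		n, ok := nodes[id]
		if !ok {
			continue
		}
		if n.Type == "Container" && (n.Phase == "Pending" || n.Phase == "Running") {
			n.Phase = "Error"
			n.Message = "container deleted"
		}
		// Containers that depend on this one are its children, so keep walking.
		queue = append(queue, n.Children...)
	}
}

// buildExample reproduces the node layout from this issue:
// pod set1(0) -> some-container -> main.
func buildExample() map[string]*Node {
	return map[string]*Node{
		"set1(0)":        {ID: "set1(0)", Type: "Pod", Phase: "Error", Message: "pod deleted", Children: []string{"some-container"}},
		"some-container": {ID: "some-container", Type: "Container", Phase: "Succeeded", Children: []string{"main"}},
		"main":           {ID: "main", Type: "Container", Phase: "Running"},
	}
}

func main() {
	nodes := buildExample()
	markContainersDeleted(nodes, "set1(0)")
	fmt.Println(nodes["some-container"].Phase, nodes["main"].Phase) // Succeeded Error
}
```

In this layout the main container is reached through some-container, so it no longer stays in Running after the pod is deleted.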

@artem-zherdiev-ingio
Author

@jswxstw Thank you for the clarification. The title or description can be corrected if necessary.

@jswxstw
Member

jswxstw commented Dec 9, 2024

This issue is similar to #12210, and #12756 has not fully resolved it.
I have submitted a new PR for it. Please take a look when you have time, @shuangkun.

@mostaphaRoudsari
Contributor

Thank you, @jswxstw! I hope it gets merged sooner rather than later. It has been a recurring issue for our workflows with containersets.
