node stuck Running when Parallelism and FailFast is enabled during parallel execution #13806

Open
jswxstw opened this issue Oct 24, 2024 · 1 comment · May be fixed by #13827
Labels
area/controller · area/parallelism · type/bug

Comments

@jswxstw
Member

jswxstw commented Oct 24, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Case 1: parallel steps with failFast
(Steps workflow: screenshot omitted)

Case 2: parallel tasks with failFast
(DAG workflow: screenshot omitted)

The official example can reproduce this issue as well.

These issues are all introduced by checkParallelism; some scenarios have not been taken into account.
The FailFast feature has two serious flaws:

  • If parallelism is greater than 1 and one node fails while other nodes are still incomplete, it cannot fail fast.
  • It only marks the Steps node as Failed; the last StepGroup node is left Running.

// Check failFast
if tmpl.IsFailFast() && woc.getUnsuccessfulChildren(node.ID) > 0 {
	woc.markNodePhase(node.Name, wfv1.NodeFailed, "template has failed or errored children and failFast enabled")
	return ErrParallelismReached
}
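
A minimal sketch of how the second flaw could be addressed (illustrative only, not the actual change in #13827; it assumes node is the boundary node's status and reuses the existing markNodePhase and Fulfilled helpers): when failFast triggers, also fail any child nodes that are still unfinished, such as the last StepGroup node, instead of only the boundary Steps node.

// Illustrative sketch only: also fail unfinished children when failFast triggers.
if tmpl.IsFailFast() && woc.getUnsuccessfulChildren(node.ID) > 0 {
	msg := "template has failed or errored children and failFast enabled"
	// Fail the boundary node itself, as the current code already does.
	woc.markNodePhase(node.Name, wfv1.NodeFailed, msg)
	// Also fail children that are not yet fulfilled (e.g. the last StepGroup
	// node), so they are not left stuck in Running.
	for _, childID := range node.Children {
		if child, ok := woc.wf.Status.Nodes[childID]; ok && !child.Fulfilled() {
			woc.markNodePhase(child.Name, wfv1.NodeFailed, msg)
		}
	}
	return ErrParallelismReached
}

This only targets the second flaw (the stuck StepGroup node); the first flaw, where failFast cannot trigger while other children are still incomplete, presumably needs a separate change to the order of checks inside checkParallelism.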

Version(s)

2cc6b32

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Case 1: parallel steps with failFast

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: main
  templates:
  - name: main
    parallelism: 2
    failFast: true
    steps:
    - - name: a
        template: fail
      - name: b
        template: sleep
  - name: fail
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["exit 1"]
  - name: sleep
    container:
      image: alpine:latest
      command: [ sh, -c ]
      args: [ "sleep 30; echo hello" ]

Case 2: parallel tasks with failFast

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-
spec:
  entrypoint: main
  templates:
  - name: main
    parallelism: 2
    failFast: true
    dag:
      tasks:
      - name: A
        template: fail
      - name: B
        template: sleep
  - name: fail
    container:
      image: alpine:latest
      command: [ sh, -c ]
      args: ["echo intentional failure; exit 1"]
  - name: sleep
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 30; echo hello"]

Logs from the workflow controller

Case 1: parallel steps with failFast

INFO[2024-10-25T15:22:47.945Z] Processing workflow                           Phase= ResourceVersion=11770372 namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:47.950Z] resolved artifact repository                  artifactRepositoryRef="argo/#"
INFO[2024-10-25T15:22:47.950Z] Task-result reconciliation                    namespace=argo numObjs=0 workflow=steps-pd6hl
INFO[2024-10-25T15:22:47.950Z] Updated phase  -> Running                     namespace=argo workflow=steps-pd6hl
WARN[2024-10-25T15:22:47.950Z] Node was nil, will be initialized as type Skipped  namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:47.950Z] was unable to obtain node for , letting display name to be nodeName  namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:47.950Z] Steps node steps-pd6hl initialized Running    namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:47.950Z] StepGroup node steps-pd6hl-2011400267 initialized Running  namespace=argo workflow=steps-pd6hl
WARN[2024-10-25T15:22:47.951Z] Node was nil, will be initialized as type Skipped  namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:47.951Z] Pod node steps-pd6hl-1226603194 initialized Pending  namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:47.976Z] Created pod: steps-pd6hl[0].a (steps-pd6hl-fail-1226603194)  namespace=argo workflow=steps-pd6hl
WARN[2024-10-25T15:22:47.977Z] Node was nil, will be initialized as type Skipped  namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:47.977Z] Pod node steps-pd6hl-1209825575 initialized Pending  namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:47.988Z] Created pod: steps-pd6hl[0].b (steps-pd6hl-sleep-1209825575)  namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:47.988Z] Workflow step group node steps-pd6hl-2011400267 not yet completed  namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:47.988Z] TaskSet Reconciliation                        namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:47.988Z] reconcileAgentPod                             namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:47.988Z] Workflow to be dehydrated                     Workflow Size=1547
INFO[2024-10-25T15:22:47.996Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=11770378 workflow=steps-pd6hl
INFO[2024-10-25T15:22:48.946Z] Processing workflow                           Phase=Running ResourceVersion=11770378 namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:48.947Z] Task-result reconciliation                    namespace=argo numObjs=0 workflow=steps-pd6hl
INFO[2024-10-25T15:22:48.947Z] node changed                                  namespace=argo new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=steps-pd6hl-1209825575 old.message= old.phase=Pending old.progress=0/1 workflow=steps-pd6hl
INFO[2024-10-25T15:22:48.948Z] node changed                                  namespace=argo new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=steps-pd6hl-1226603194 old.message= old.phase=Pending old.progress=0/1 workflow=steps-pd6hl
INFO[2024-10-25T15:22:48.948Z] template (node steps-pd6hl) active children parallelism exceeded 2  namespace=argo workflow=steps-pd6hl
ERRO[2024-10-25T15:22:48.948Z] error in entry template execution             error="Max parallelism reached" namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:48.948Z] Workflow to be dehydrated                     Workflow Size=1908
INFO[2024-10-25T15:22:48.954Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=11770391 workflow=steps-pd6hl
INFO[2024-10-25T15:22:50.168Z] Processing workflow                           Phase=Running ResourceVersion=11770391 namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:50.170Z] Task-result reconciliation                    namespace=argo numObjs=2 workflow=steps-pd6hl
INFO[2024-10-25T15:22:50.170Z] node unchanged                                namespace=argo nodeID=steps-pd6hl-1209825575 workflow=steps-pd6hl
INFO[2024-10-25T15:22:50.170Z] node unchanged                                namespace=argo nodeID=steps-pd6hl-1226603194 workflow=steps-pd6hl
INFO[2024-10-25T15:22:50.170Z] template (node steps-pd6hl) active children parallelism exceeded 2  namespace=argo workflow=steps-pd6hl
ERRO[2024-10-25T15:22:50.170Z] error in entry template execution             error="Max parallelism reached" namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:53.182Z] Processing workflow                           Phase=Running ResourceVersion=11770391 namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:53.183Z] Task-result reconciliation                    namespace=argo numObjs=2 workflow=steps-pd6hl
INFO[2024-10-25T15:22:53.184Z] node changed                                  namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=steps-pd6hl-1209825575 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=steps-pd6hl
INFO[2024-10-25T15:22:53.184Z] node changed                                  namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=steps-pd6hl-1226603194 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=steps-pd6hl
INFO[2024-10-25T15:22:53.184Z] template (node steps-pd6hl) active children parallelism exceeded 2  namespace=argo workflow=steps-pd6hl
ERRO[2024-10-25T15:22:53.184Z] error in entry template execution             error="Max parallelism reached" namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:53.184Z] Workflow to be dehydrated                     Workflow Size=1935
INFO[2024-10-25T15:22:53.191Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=11770433 workflow=steps-pd6hl
INFO[2024-10-25T15:22:54.188Z] Processing workflow                           Phase=Running ResourceVersion=11770433 namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:54.189Z] Task-result reconciliation                    namespace=argo numObjs=2 workflow=steps-pd6hl
INFO[2024-10-25T15:22:54.190Z] task-result changed                           namespace=argo nodeID=steps-pd6hl-1226603194 workflow=steps-pd6hl
INFO[2024-10-25T15:22:54.190Z] node unchanged                                namespace=argo nodeID=steps-pd6hl-1209825575 workflow=steps-pd6hl
INFO[2024-10-25T15:22:54.190Z] node changed                                  namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=steps-pd6hl-1226603194 old.message= old.phase=Running old.progress=0/1 workflow=steps-pd6hl
INFO[2024-10-25T15:22:54.190Z] template (node steps-pd6hl) active children parallelism exceeded 2  namespace=argo workflow=steps-pd6hl
ERRO[2024-10-25T15:22:54.190Z] error in entry template execution             error="Max parallelism reached" namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:54.191Z] Workflow to be dehydrated                     Workflow Size=2035
INFO[2024-10-25T15:22:54.195Z] cleaning up pod                               action=terminateContainers key=argo/steps-pd6hl-fail-1226603194/terminateContainers
INFO[2024-10-25T15:22:54.197Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=11770442 workflow=steps-pd6hl
INFO[2024-10-25T15:22:55.192Z] Processing workflow                           Phase=Running ResourceVersion=11770442 namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:55.194Z] Task-result reconciliation                    namespace=argo numObjs=2 workflow=steps-pd6hl
INFO[2024-10-25T15:22:55.194Z] node unchanged                                namespace=argo nodeID=steps-pd6hl-1209825575 workflow=steps-pd6hl
INFO[2024-10-25T15:22:55.195Z] node unchanged                                namespace=argo nodeID=steps-pd6hl-1226603194 workflow=steps-pd6hl
INFO[2024-10-25T15:22:55.195Z] template (node steps-pd6hl) active children parallelism exceeded 2  namespace=argo workflow=steps-pd6hl
ERRO[2024-10-25T15:22:55.195Z] error in entry template execution             error="Max parallelism reached" namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:55.200Z] cleaning up pod                               action=terminateContainers key=argo/steps-pd6hl-fail-1226603194/terminateContainers
INFO[2024-10-25T15:22:56.289Z] Processing workflow                           Phase=Running ResourceVersion=11770442 namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:56.290Z] Task-result reconciliation                    namespace=argo numObjs=2 workflow=steps-pd6hl
INFO[2024-10-25T15:22:56.290Z] node unchanged                                namespace=argo nodeID=steps-pd6hl-1209825575 workflow=steps-pd6hl
INFO[2024-10-25T15:22:56.290Z] Pod failed: Error (exit code 1)               displayName=a namespace=argo pod=steps-pd6hl-fail-1226603194 templateName=fail workflow=steps-pd6hl
INFO[2024-10-25T15:22:56.290Z] node changed                                  namespace=argo new.message="Error (exit code 1)" new.phase=Failed new.progress=0/1 nodeID=steps-pd6hl-1226603194 old.message= old.phase=Running old.progress=0/1 workflow=steps-pd6hl
INFO[2024-10-25T15:22:56.291Z] node steps-pd6hl phase Running -> Failed      namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:56.291Z] node steps-pd6hl message: template has failed or errored children and failFast enabled  namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:56.291Z] node steps-pd6hl finished: 2024-10-25 07:22:56.291074256 +0000 UTC  namespace=argo workflow=steps-pd6hl
ERRO[2024-10-25T15:22:56.291Z] error in entry template execution             error="Max parallelism reached" namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:56.291Z] Workflow to be dehydrated                     Workflow Size=2196
INFO[2024-10-25T15:22:56.302Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=11770450 workflow=steps-pd6hl
INFO[2024-10-25T15:22:56.309Z] cleaning up pod                               action=labelPodCompleted key=argo/steps-pd6hl-fail-1226603194/labelPodCompleted
INFO[2024-10-25T15:22:57.196Z] cleaning up pod                               action=killContainers key=argo/steps-pd6hl-fail-1226603194/killContainers
INFO[2024-10-25T15:22:57.304Z] Processing workflow                           Phase=Running ResourceVersion=11770450 namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:57.305Z] Task-result reconciliation                    namespace=argo numObjs=2 workflow=steps-pd6hl
INFO[2024-10-25T15:22:57.305Z] node unchanged                                namespace=argo nodeID=steps-pd6hl-1209825575 workflow=steps-pd6hl
INFO[2024-10-25T15:22:57.306Z] TaskSet Reconciliation                        namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:22:57.306Z] reconcileAgentPod                             namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:23:23.273Z] Processing workflow                           Phase=Running ResourceVersion=11770450 namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:23:23.275Z] Task-result reconciliation                    namespace=argo numObjs=2 workflow=steps-pd6hl
INFO[2024-10-25T15:23:23.275Z] task-result changed                           namespace=argo nodeID=steps-pd6hl-1209825575 workflow=steps-pd6hl
INFO[2024-10-25T15:23:23.275Z] node changed                                  namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=steps-pd6hl-1209825575 old.message= old.phase=Running old.progress=0/1 workflow=steps-pd6hl
INFO[2024-10-25T15:23:23.275Z] TaskSet Reconciliation                        namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:23:23.275Z] reconcileAgentPod                             namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:23:23.275Z] Updated phase Running -> Failed               namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:23:23.276Z] Updated message  -> template has failed or errored children and failFast enabled  namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:23:23.276Z] Marking workflow completed                    namespace=argo workflow=steps-pd6hl
INFO[2024-10-25T15:23:23.276Z] Workflow to be dehydrated                     Workflow Size=2382
INFO[2024-10-25T15:23:23.280Z] cleaning up pod                               action=terminateContainers key=argo/steps-pd6hl-sleep-1209825575/terminateContainers
INFO[2024-10-25T15:23:23.284Z] Workflow update successful                    namespace=argo phase=Failed resourceVersion=11770528 workflow=steps-pd6hl
INFO[2024-10-25T15:23:26.281Z] cleaning up pod                               action=killContainers key=argo/steps-pd6hl-sleep-1209825575/killContainers

Case 2: parallel tasks with failFast

INFO[2024-10-25T14:57:41.442Z] Processing workflow                           Phase=Running ResourceVersion=11767004 namespace=argo workflow=dag-fcfmn
INFO[2024-10-25T14:57:41.443Z] Task-result reconciliation                    namespace=argo numObjs=2 workflow=dag-fcfmn
INFO[2024-10-25T14:57:41.443Z] node unchanged                                namespace=argo nodeID=dag-fcfmn-908173470 workflow=dag-fcfmn
INFO[2024-10-25T14:57:41.443Z] Pod failed: Error (exit code 1)               displayName=A namespace=argo pod=dag-fcfmn-fail-891395851 templateName=fail workflow=dag-fcfmn
INFO[2024-10-25T14:57:41.443Z] node changed                                  namespace=argo new.message="Error (exit code 1)" new.phase=Failed new.progress=0/1 nodeID=dag-fcfmn-891395851 old.message= old.phase=Running old.progress=0/1 workflow=dag-fcfmn
INFO[2024-10-25T14:57:41.443Z] node dag-fcfmn phase Running -> Failed        namespace=argo workflow=dag-fcfmn
INFO[2024-10-25T14:57:41.443Z] node dag-fcfmn message: template has failed or errored children and failFast enabled  namespace=argo workflow=dag-fcfmn
INFO[2024-10-25T14:57:41.443Z] node dag-fcfmn finished: 2024-10-25 06:57:41.443862121 +0000 UTC  namespace=argo workflow=dag-fcfmn
ERRO[2024-10-25T14:57:41.443Z] error in entry template execution             error="Max parallelism reached" namespace=argo workflow=dag-fcfmn
INFO[2024-10-25T14:57:41.443Z] Workflow to be dehydrated                     Workflow Size=1975
INFO[2024-10-25T14:57:41.451Z] Workflow update successful                    namespace=argo phase=Running resourceVersion=11767009 workflow=dag-fcfmn
INFO[2024-10-25T14:57:41.458Z] cleaning up pod                               action=labelPodCompleted key=argo/dag-fcfmn-fail-891395851/labelPodCompleted
INFO[2024-10-25T14:57:42.319Z] cleaning up pod                               action=killContainers key=argo/dag-fcfmn-fail-891395851/killContainers
INFO[2024-10-25T14:57:42.471Z] Processing workflow                           Phase=Running ResourceVersion=11767009 namespace=argo workflow=dag-fcfmn
INFO[2024-10-25T14:57:42.472Z] Task-result reconciliation                    namespace=argo numObjs=2 workflow=dag-fcfmn
INFO[2024-10-25T14:57:42.472Z] node unchanged                                namespace=argo nodeID=dag-fcfmn-908173470 workflow=dag-fcfmn
INFO[2024-10-25T14:57:42.472Z] TaskSet Reconciliation                        namespace=argo workflow=dag-fcfmn
INFO[2024-10-25T14:57:42.472Z] reconcileAgentPod                             namespace=argo workflow=dag-fcfmn
INFO[2024-10-25T14:58:08.391Z] Processing workflow                           Phase=Running ResourceVersion=11767009 namespace=argo workflow=dag-fcfmn
INFO[2024-10-25T14:58:08.392Z] Task-result reconciliation                    namespace=argo numObjs=2 workflow=dag-fcfmn
INFO[2024-10-25T14:58:08.392Z] task-result changed                           namespace=argo nodeID=dag-fcfmn-908173470 workflow=dag-fcfmn
INFO[2024-10-25T14:58:08.392Z] node changed                                  namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=dag-fcfmn-908173470 old.message= old.phase=Running old.progress=0/1 workflow=dag-fcfmn
INFO[2024-10-25T14:58:08.393Z] TaskSet Reconciliation                        namespace=argo workflow=dag-fcfmn
INFO[2024-10-25T14:58:08.393Z] reconcileAgentPod                             namespace=argo workflow=dag-fcfmn
INFO[2024-10-25T14:58:08.393Z] Updated phase Running -> Failed               namespace=argo workflow=dag-fcfmn
INFO[2024-10-25T14:58:08.393Z] Updated message  -> template has failed or errored children and failFast enabled  namespace=argo workflow=dag-fcfmn
INFO[2024-10-25T14:58:08.393Z] Marking workflow completed                    namespace=argo workflow=dag-fcfmn
INFO[2024-10-25T14:58:08.393Z] Workflow to be dehydrated                     Workflow Size=2156
INFO[2024-10-25T14:58:08.397Z] cleaning up pod                               action=terminateContainers key=argo/dag-fcfmn-sleep-908173470/terminateContainers
INFO[2024-10-25T14:58:08.401Z] Workflow update successful                    namespace=argo phase=Failed resourceVersion=11767069 workflow=dag-fcfmn
INFO[2024-10-25T14:58:11.398Z] cleaning up pod                               action=killContainers key=argo/dag-fcfmn-sleep-908173470/killContainers

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
jswxstw added the type/bug, area/controller, area/looping, and area/parallelism labels and removed the area/looping label on Oct 24, 2024
jswxstw changed the title from "StepGroup node stuck Running when looping with FailFast" to "StepGroup node stuck Running when using template-level parallelism and FailFast" on Oct 25, 2024
jswxstw changed the title from "StepGroup node stuck Running when using template-level parallelism and FailFast" to "node stuck Running when FailFast is enabled during parallel execution" on Oct 25, 2024
jswxstw changed the title from "node stuck Running when FailFast is enabled during parallel execution" to "node stuck Running when Parallelism and FailFast is enabled during parallel execution" on Oct 25, 2024
jswxstw added a commit to jswxstw/argo-workflows that referenced this issue Oct 28, 2024
jswxstw added a commit to jswxstw/argo-workflows that referenced this issue Oct 31, 2024
@jswxstw
Member Author

jswxstw commented Nov 5, 2024

Also, the pod for node b cannot be deleted because it still has the finalizer workflows.argoproj.io/status.
