Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: ensure that nodes complete when workflow fails with parallelism and failFast. Fixes #13806 #13827

Merged
merged 1 commit into from
Dec 22, 2024

Conversation

jswxstw
Copy link
Member

@jswxstw jswxstw commented Oct 28, 2024

Fixes #13806

Motivation

The FailFast feature has two serious flaws:

  • If parallelism is greater than 1 and one node fails while there are still other incomplete nodes, it cannot fail fast.
  • It only mark the Steps node as Failed, the last StepGroup node is still Running.

Modifications

  • Do not fail fast when there are incomplete nodes.
  • Mark the last StepGroup node as Failed.

Verification

local test and e2e tests

Test workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: demo
spec:
  entrypoint: main
  templates:
    - name: main
      parallelism: 2
      failFast: true
      steps:
        - - name: step0
            template: sleep
        - - name: step1
            template: fail
          - name: step2
            template: sleep
    - name: fail
      container:
        image: alpine:latest
        command: [sh, -c]
        args: ["exit 1"]
    - name: sleep
      container:
        image: alpine:latest
        command: [ sh, -c ]
        args: [ "sleep 5" ]

demo

@jswxstw jswxstw added area/controller Controller issues, panics area/parallelism `parallelism` for the Controller, Workflows, or templates labels Oct 28, 2024
@agilgur5 agilgur5 changed the title fix: ensure that nodes complete when workflow fails with parallelism and failFast enabled. Fixes #13806 fix: ensure that nodes complete when workflow fails with parallelism and failFast. Fixes #13806 Oct 28, 2024
@agilgur5
Copy link

Have you seen #10312 and #11992?

@jswxstw
Copy link
Member Author

jswxstw commented Oct 29, 2024

Have you seen #10312 and #11992?

@agilgur5 Not yet, but #10312 is not related to this issue, failFast in these two issues is not the same, and the problems caused are also different:

@jswxstw jswxstw added the prioritized-review For members of the Sustainability Effort label Dec 11, 2024
@jswxstw
Copy link
Member Author

jswxstw commented Dec 11, 2024

We usually set global workflow-level parallelism to limit resource usage, so once the workflow enables failFast during looping, issue #13806 becomes inevitable, making feature #5315 almost unusable.
Therefore, I believe this issue is quite serious and requires urgent resolution.

Copy link
Member

@MasonM MasonM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I ran both examples from #13806 and verified both correctly failed with this fix:

$ ./dist/argo list -o wide
NAME                STATUS      AGE   DURATION   PRIORITY   MESSAGE                                                        P/R/C   PARAMETERS
dag-4rrxm           Failed      55s   37s        0          template has failed or errored children and failFast enabled   0/0/2   
steps-gnh5p         Failed      1m    41s        0          template has failed or errored children and failFast enabled   0/0/2   

@jswxstw
Copy link
Member Author

jswxstw commented Dec 20, 2024

@terrytangyuan @Joibel @isubasinghe Could you please take some time to review this PR? Thank you very much!

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@terrytangyuan terrytangyuan merged commit 1d85e68 into argoproj:main Dec 22, 2024
28 checks passed
isubasinghe pushed a commit to pipekit/argo-workflows that referenced this pull request Jan 30, 2025
isubasinghe pushed a commit to pipekit/argo-workflows that referenced this pull request Jan 31, 2025
Joibel pushed a commit that referenced this pull request Feb 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics area/parallelism `parallelism` for the Controller, Workflows, or templates prioritized-review For members of the Sustainability Effort
Projects
None yet
Development

Successfully merging this pull request may close these issues.

node stuck Running when Parallelism and FailFast is enabled during parallel execution
5 participants