Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: ensure that nodes complete when workflow fails with parallelism and failFast. Fixes #13806 #13827

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jswxstw
Copy link
Member

@jswxstw jswxstw commented Oct 28, 2024

Fixes #13806

Motivation

The FailFast feature has two serious flaws:

  • If parallelism is greater than 1 and one node fails while there are still other incomplete nodes, it cannot fail fast.
  • It only mark the Steps node as Failed, the last StepGroup node is still Running.

Modifications

  • Do not fail fast when there are incomplete nodes.
  • Mark the last StepGroup node as Failed.

Verification

local test and e2e tests

Test workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: demo
spec:
  entrypoint: main
  templates:
    - name: main
      parallelism: 2
      failFast: true
      steps:
        - - name: step0
            template: sleep
        - - name: step1
            template: fail
          - name: step2
            template: sleep
    - name: fail
      container:
        image: alpine:latest
        command: [sh, -c]
        args: ["exit 1"]
    - name: sleep
      container:
        image: alpine:latest
        command: [ sh, -c ]
        args: [ "sleep 5" ]

demo

@jswxstw jswxstw added area/controller Controller issues, panics area/parallelism `parallelism` for the Controller, Workflows, or templates labels Oct 28, 2024
@agilgur5 agilgur5 changed the title fix: ensure that nodes complete when workflow fails with parallelism and failFast enabled. Fixes #13806 fix: ensure that nodes complete when workflow fails with parallelism and failFast. Fixes #13806 Oct 28, 2024
@agilgur5
Copy link
Contributor

Have you seen #10312 and #11992?

@jswxstw
Copy link
Member Author

jswxstw commented Oct 29, 2024

Have you seen #10312 and #11992?

@agilgur5 Not yet, but #10312 is not related to this issue, failFast in these two issues is not the same, and the problems caused are also different:

workflow/controller/operator.go Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics area/parallelism `parallelism` for the Controller, Workflows, or templates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

node stuck Running when Parallelism and FailFast is enabled during parallel execution
3 participants