Two Workflow Instances Open on Failure #284

Open
lironleizer opened this issue Oct 14, 2024 · 1 comment

Comments

@lironleizer

Description:
Since upgrading to Conductor 3.16.0, we have encountered unusual behavior in one of our workflows. The workflow is defined as follows:
[image: conductor issue]

When the WAIT EVENT receives a message, the workflow proceeds to the TERMINATE TASK. Occasionally, however, two failure workflows are started for a single failure. The two are nearly identical, except that one has ownerApp set to "conductor" while the other has an empty ownerApp.

Main workflow:

{
  "ownerApp": "",
  "createTime": 1728474259443,
  "updateTime": 1728477038331,
  "status": "FAILED",
  "endTime": 1728477038331,
  "workflowId": "179dd481-8e4b-4e9d-905d-8f37f9b7c577",
  "tasks": […]
}

Output:

{
  "output": "",
  "conductor.failure_workflow": "5974405e-e4b6-4924-b0bf-fbcec3827e2b"
}

Failure workflow 1:

{
  "ownerApp": "conductor",
  "createTime": 1728477038313,
  "updateTime": 1728477039058,
  "status": "COMPLETED",
  "endTime": 1728477039058,
  "workflowId": "5974405e-e4b6-4924-b0bf-fbcec3827e2b",
  "tasks": […]
}

Failure workflow 2:

{
  "ownerApp": "",
  "createTime": 1728477038229,
  "updateTime": 1728477039145,
  "status": "COMPLETED",
  "endTime": 1728477039145,
  "workflowId": "c02b2cb8-d6c4-4aaa-bc1a-3c04a1585d80",
  "tasks": […]
}

The main workflow output references the failure workflow whose ownerApp is "conductor". According to the createTime values, the two failure workflows were created about 84 ms apart (1728477038229 and 1728477038313), and the one with the empty ownerApp was created first.
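For reference, the timing can be worked out directly from the values in the JSON above. This is only arithmetic on the reported epoch-millisecond timestamps; the variable names are made up for illustration.

```python
# Timestamps (epoch milliseconds) copied from the JSON in this report.
main_end = 1728477038331          # main workflow endTime
failure_1_create = 1728477038313  # ownerApp "conductor", referenced by the main workflow output
failure_2_create = 1728477038229  # empty ownerApp

# Gap between the two failure workflows, and how long before the main
# workflow's endTime each of them was created.
gap_ms = failure_1_create - failure_2_create
lead_1_ms = main_end - failure_1_create
lead_2_ms = main_end - failure_2_create

print(f"failure workflow 2 was created {gap_ms} ms before failure workflow 1")
print(f"failure workflow 1 was created {lead_1_ms} ms before the main workflow's endTime")
print(f"failure workflow 2 was created {lead_2_ms} ms before the main workflow's endTime")
```

Both failure workflows have a createTime earlier than the main workflow's recorded endTime, and the one that the main workflow output does not reference is the earlier of the two.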

Here are the relevant logs for further insight:

[image: image (1), screenshot of the logs]

Based on these logs, we suspect a race condition on the workflow's status: the sweeper thread appears to act on the workflow while the main flow is still processing the event.

Expected Behavior: Only one failure workflow instance should be opened when the workflow fails.
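A small check along these lines could flag the extra instance once the candidate failure workflow IDs have been collected. It is a sketch only: the helper name `unreferenced_failure_workflows` is hypothetical, and how the candidate IDs are gathered (UI, search, logs) is left to the caller. The only key it relies on is `conductor.failure_workflow`, as it appears in the main workflow output above.

```python
def unreferenced_failure_workflows(main_output, candidate_ids):
    """Return the candidate IDs that the main workflow output does not reference.

    Hypothetical helper for this report; it does not call any Conductor API.
    """
    referenced = main_output.get("conductor.failure_workflow")
    return [wid for wid in candidate_ids if wid != referenced]


# Values copied from the JSON in this report.
main_output = {
    "output": "",
    "conductor.failure_workflow": "5974405e-e4b6-4924-b0bf-fbcec3827e2b",
}
candidates = [
    "5974405e-e4b6-4924-b0bf-fbcec3827e2b",
    "c02b2cb8-d6c4-4aaa-bc1a-3c04a1585d80",
]

extras = unreferenced_failure_workflows(main_output, candidates)
print(extras)  # an empty list would match the expected behavior
```

With the IDs from this report, the list contains the failure workflow with the empty ownerApp.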

Potential Cause: The issue appears to be a race condition in the decider queue around status updates as the workflow moves from the WAIT EVENT to the TERMINATE TASK. The sweeper thread may be acting on the workflow prematurely, while the main flow is still processing the event.
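To make the suspected race concrete, here is a minimal, self-contained sketch of two concurrent callers (standing in for the event-processing path and the sweeper) that both try to start a failure workflow for the same parent, together with a check-and-set guard that lets only one of them do so. Everything here is hypothetical: `FailureWorkflowStarter`, `start_once`, and `simulate` are illustration names, not Conductor code, and an in-process lock like this would not be enough across multiple server nodes.

```python
import threading


class FailureWorkflowStarter:
    """Hypothetical guard; not Conductor's actual implementation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._started = {}  # parent workflow id -> failure workflow id

    def start_once(self, parent_id, start_fn):
        # The check and the start happen under one lock, so two callers
        # cannot both see "not started yet" for the same parent.
        with self._lock:
            existing = self._started.get(parent_id)
            if existing is not None:
                return existing, False
            failure_id = start_fn(parent_id)
            self._started[parent_id] = failure_id
            return failure_id, True


def simulate(callers=2):
    starter = FailureWorkflowStarter()
    started = []  # every failure workflow that actually got started

    def start_fn(parent_id):
        failure_id = f"failure-of-{parent_id}-{len(started) + 1}"
        started.append(failure_id)
        return failure_id

    barrier = threading.Barrier(callers)
    results = []

    def worker():
        barrier.wait()  # release all callers at the same moment
        results.append(starter.start_once("parent-1", start_fn))

    threads = [threading.Thread(target=worker) for _ in range(callers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return started, results


started, results = simulate()
print(f"{len(started)} failure workflow(s) started for {len(results)} callers")
```

Without the lock around the check and the start, both callers could pass the "not started yet" check and each start a failure workflow, which would match the two instances described in this report.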

@lironleizer
Author

@dilip-lukose @v1r3n
Please advise.
