
[BUG] Schedules: fail fast when a reference workflow can't be read #4688

Open
katrogan opened this issue Jan 8, 2024 · 1 comment
Labels
backlogged For internal use. Reserved for contributor team workflow. bug Something isn't working

Comments

katrogan (Contributor) commented Jan 8, 2024

Describe the bug

If an incident causes the offloaded workflow closure to be deleted from the blobstore while an active launch plan schedule is still associated with it, the scheduler will loop endlessly attempting to create the workflow execution, but each attempt to read the workflow at CreateExecution time will fail with:

Failed to get workflow with id resource_type:WORKFLOW project:\"flytesnacks\" domain:\"development\" name:\"productionizing.lp_schedules.positive_wf\" version:\"v0.3.231\"  with err Unable to read WorkflowClosure from location 

Expected behavior

Fail fast in unrecoverable schedule scenarios instead of retrying indefinitely.
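The desired behavior could be sketched as follows. This is a hypothetical, self-contained Go sketch, not Flyte's actual scheduler code: `ErrClosureNotFound`, `isUnrecoverable`, and `createExecution` are illustrative names; the idea is simply to classify errors so that a permanently missing workflow closure aborts the schedule rather than being retried forever.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrClosureNotFound stands in for the "Unable to read WorkflowClosure
// from location" error seen when the offloaded closure has been deleted.
var ErrClosureNotFound = errors.New("unable to read WorkflowClosure from location")

// isUnrecoverable reports whether retrying the scheduled execution can
// ever succeed. A deleted closure cannot reappear, so it is terminal.
func isUnrecoverable(err error) bool {
	return errors.Is(err, ErrClosureNotFound)
}

// createExecution simulates CreateExecution failing because the
// blobstore read of the workflow closure fails.
func createExecution() error {
	return fmt.Errorf("failed to get workflow: %w", ErrClosureNotFound)
}

// runSchedule retries transient failures but fails fast on
// unrecoverable ones, returning the number of attempts made.
func runSchedule(maxRetries int) (attempts int, err error) {
	for attempts = 1; attempts <= maxRetries; attempts++ {
		err = createExecution()
		if err == nil {
			return attempts, nil
		}
		if isUnrecoverable(err) {
			// Fail fast: abort the schedule instead of looping.
			return attempts, err
		}
	}
	return attempts - 1, err
}

func main() {
	attempts, err := runSchedule(30)
	fmt.Printf("aborted after %d attempt(s): %v\n", attempts, err)
}
```

With this classification, the scheduler above gives up after a single attempt on a missing closure rather than exhausting all 30 retries.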

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@katrogan katrogan added bug Something isn't working and untriaged This issue has not yet been looked at by the Maintainers labels Jan 8, 2024

dosubot bot commented Jan 8, 2024

🤖

Hi there @katrogan! I'm Dosu, a bot here to help you with Flyte related issues, answer your questions, and guide you if you'd like to contribute. While we're waiting for a human maintainer, feel free to ask me anything you need. I'm here to make your experience smoother!

Thank you for bringing this issue to our attention. Flyte's current mechanism for handling scenarios where a reference workflow can't be read is to transition the workflow to a failing or failed state. This is done through the handleRunningWorkflow and handleFailureNode functions in the workflowExecutor struct. If an error occurs during execution of the workflow's start node or failure node, these functions return a failing status with an ExecutionError describing what went wrong. The error is then propagated up to the HandleFlyteWorkflow function, which transitions the workflow to the failing or failed phase and records the event. You can find more details in the flytepropeller/pkg/controller/workflow/executor.go file.

However, to address the specific issue you're facing, where the scheduler loops endlessly attempting to create the workflow when a reference workflow can't be read, Flyte could potentially be modified to fail fast. This could be achieved with the NewFailFastLaunchPlanExecutor() function from the launchplan package, which creates a launch plan executor that fails fast when it encounters an error. This function is already used in the TestWorkflowExecutor_HandleFlyteWorkflow_Failing and BenchmarkWorkflowExecutor functions to create such an executor. You can find more details in the flytepropeller/pkg/controller/workflow/executor_test.go file.

Regarding the Flyte scheduler, it handles errors differently depending on their type. For instance, if a schedule is activated but the launch plan is not, the scheduler will keep calling admin with new future scheduled times, so the launch plan problem can be fixed in the meantime. If a scheduled-time execution fails, the goroutine executing for that time will retry 30 times before aborting the run. You can find more details in the flyteadmin/scheduler/doc.go file.

I hope this information helps. If you have any further questions or need more clarification, please let me know.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

@eapolinario eapolinario added backlogged For internal use. Reserved for contributor team workflow. and removed untriaged This issue has not yet been looked at by the Maintainers labels Jan 18, 2024