[BUG] Etcdserver: request is too large error causing never-ending tasks and log spamming #4349
Comments
Thank you for opening your first issue here! 🛠
@deadlycoconuts, thanks for the detailed report. A few things to kick off this investigation:
@hamersaw, can you help answer the questions in #4349 (comment)?
@eapolinario that is right, it only works for new workflows, and how to enable it is well documented here: https://docs.flyte.org/en/latest/deployment/configuration/performance.html#offloading-static-workflow-information-from-crd
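For anyone following along, enabling that flag looks roughly like the sketch below. This is only an illustration based on the linked performance docs: the flag name comes from this issue, but the surrounding section nesting is an assumption and may differ in your Helm values or configmap layout.

```yaml
# Sketch only: enabling offloaded workflow closures in the FlyteAdmin config,
# per the performance docs linked above. The "flyteadmin:" nesting is an
# assumption and may differ depending on how your deployment is configured.
flyteadmin:
  useOffloadedWorkflowClosure: true
```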
Thanks @eapolinario and @kumare3 for getting back to us. We're using

What do you mean by 'clearing' workflows, though? Are there docs that we can refer to for this? Or does 'clearing' simply mean creating a new workflow version?
We have set the below and registered a new version, but the error is still happening. Must it be registered under a new workflow name? Are there any steps we can follow to validate that this is configured correctly?
@leonlnj, is this still happening? Can you confirm which version of Flyte you're using?
Describe the bug
We’ve observed an issue where FlytePropeller doesn’t fail workflows whose CRDs cannot be written to the etcd store. These workflows get stuck in an enqueue/resync/fail loop that repeats roughly every minute, indefinitely, even though they are shown as ‘failed’ on the FlyteConsole. This also floods our logging services with redundant logs.
We tried toggling the useOffloadedWorkflowClosure flag, but unfortunately it didn’t help resolve the etcd errors that we’ve been seeing from FlytePropeller. We restarted both the FlyteAdmin and FlytePropeller deployments after updating the configs, but that didn’t stop the existing executions from throwing the etcd errors continuously. In the end we had to delete the affected executions one by one manually to stop them from being scheduled endlessly.

There are some related issues and fixes that we found from 2-3 years ago which no longer seem to work, so we’re wondering what might have changed since:
Related Issues
Related Fixes
In particular, it seems like this if condition is no longer working as expected:
flyte/flytepropeller/pkg/controller/workflowstore/passthrough.go, line 99 (commit c6476cc)
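For context, here is a minimal sketch (not the actual code at that line) of the kind of guard being referred to: when the API server rejects the FlyteWorkflow CRD update because it exceeds etcd's request size limit, the error should be recognized and the workflow failed terminally instead of being re-enqueued forever. IsRequestEntityTooLargeError is the real helper from k8s.io/apimachinery; the update and failTerminally callbacks are placeholders standing in for Flyte's own client and workflow store calls.

```go
package main

import (
	"context"
	"errors"
	"fmt"

	kubeerrors "k8s.io/apimachinery/pkg/api/errors"
)

// guardWorkflowUpdate is a simplified, hypothetical sketch of the pattern the
// referenced check is meant to implement: if the API server rejects a
// FlyteWorkflow CRD update because it exceeds etcd's request size limit,
// fail the workflow terminally rather than letting it be retried forever.
func guardWorkflowUpdate(
	ctx context.Context,
	update func(context.Context) error,
	failTerminally func(context.Context, error) error,
) error {
	err := update(ctx)
	if err == nil {
		return nil
	}

	// IsRequestEntityTooLargeError (k8s.io/apimachinery/pkg/api/errors)
	// matches the "etcdserver: request is too large" family of failures.
	if kubeerrors.IsRequestEntityTooLargeError(err) {
		// Abort with a terminal failure instead of re-enqueueing the workflow.
		return failTerminally(ctx, fmt.Errorf("workflow CRD too large to store: %w", err))
	}

	// Any other error is surfaced so the controller can retry as usual.
	return err
}

func main() {
	// Toy usage: a transient error is simply returned for a normal retry
	// rather than triggering a terminal failure.
	err := guardWorkflowUpdate(
		context.Background(),
		func(ctx context.Context) error { return errors.New("simulated transient update error") },
		func(ctx context.Context, failure error) error { return failure },
	)
	fmt.Println(err)
}
```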
Expected behavior
Tasks should never end up in a state where they are retried endlessly; they should be terminated with an error if there is an issue updating their state.
Additional context to reproduce
No response
Screenshots
Here are some images to illustrate the problem that I’ve described:
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?