
[BUG] Etcdserver: request is too large error causing never-ending tasks and log spamming #4349

deadlycoconuts opened this issue Nov 2, 2023 · 7 comments


deadlycoconuts commented Nov 2, 2023

Describe the bug

We’ve observed this issue whereby the FlytePropeller isn’t failing workflows that cannot be written to the etcd store. These workflows get stuck in an enqueue/resync/fail loop which repeats itself every minute or so indefinitely, even though they are shown as ‘failed’ on the FlyteConsole. This also floods our logging services with redundant logs.

We tried toggling the useOffloadedWorkflowClosure flag but unfortunately it didn’t help resolve the etcd errors that we’ve been seeing from the FlytePropeller. We’ve restarted both the FlyteAdmin and the FlytePropeller deployments after updating the configs but that didn’t seem to stop the existing executions from throwing the etcd errors continuously. In the end we had to delete the affected executions one by one manually to stop them from being scheduled endlessly.

There are some related issues and fixes that we found from 2-3 years ago which do not seem to work anymore, so we’re wondering what might’ve changed since:

Related Issues

Related Fixes

In particular, it seems like this if condition is no longer working as expected:

if kubeerrors.IsRequestEntityTooLargeError(err) {
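
For illustration, here is a minimal sketch of the kind of guard we would expect around the CRD update path (this is not the actual FlytePropeller code; the helper names are made up): a 413 RequestEntityTooLarge error from the apiserver, which is how etcd's request size limit surfaces, should be treated as terminal and fail the workflow instead of being re-enqueued forever.

package main

import (
	"errors"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// errTerminal marks update failures for which re-enqueueing can never help.
var errTerminal = errors.New("terminal workflow update error")

// classifyUpdateError is a hypothetical helper: a 413 RequestEntityTooLarge
// response from the apiserver (backed by etcd's request size limit) is treated
// as terminal, while anything else is assumed to be retryable.
func classifyUpdateError(err error) error {
	if err == nil {
		return nil
	}
	if apierrors.IsRequestEntityTooLargeError(err) {
		return fmt.Errorf("%w: %v", errTerminal, err)
	}
	return err
}

func main() {
	// Simulate the apiserver error we see in the propeller logs.
	tooLarge := apierrors.NewRequestEntityTooLargeError("etcdserver: request is too large")
	if errors.Is(classifyUpdateError(tooLarge), errTerminal) {
		fmt.Println("would fail the workflow instead of re-enqueueing it")
	}
}

In our case, however, the affected executions keep getting re-enqueued, which is what produces the endless retry loop and the log spam described above.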

Expected behavior

Tasks should never end up in a state where they are retried endlessly and should be terminated with an error if there is an issue updating their state.

Additional context to reproduce

No response

Screenshots

Here are some images to illustrate the problem that I’ve described:

  1. Affected workflow - it was first executed at 2:17 AM (SGT/HKT) on the 7th of October and failed at around 2:34 AM, but it has been restarted endlessly ever since
    [screenshot]
  2. FlytePropeller logs when the workflow first failed
    [screenshot]
  3. FlytePropeller logs 11 days after the initial failure
    [screenshot]

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
deadlycoconuts added the bug and untriaged labels on Nov 2, 2023

welcome bot commented Nov 2, 2023

Thank you for opening your first issue here! 🛠

@deadlycoconuts (Author)

Here are a couple of questions we have related to this bug:

  1. What happens if, even with the aforementioned flag turned on, the etcd request for a certain workflow still ends up being greater than 10MB?
  2. We noticed that this issue occurs not only for executions which are stuck in the ‘Failing’ state, but also for those stuck in the ‘Running’ state (see screenshot below); is this expected?
    [screenshot]
  3. Is there any built-in mechanism to stop such executions from being retried indefinitely? It’s incredibly easy for such jobs to go unnoticed, and this has a huge impact on our metrics/logs, especially when these executions are triggered as part of a regular cron job (the number of errors raised grows every time a new execution gets triggered only to get stuck). It’s also painstaking work to dig through our logs to identify these executions and terminate them manually. One of the executions we found came from slightly over 2 months ago (see screenshot below).
    [screenshot]
  4. Are there any metrics in the FlytePropeller (or any other components) that would allow us to identify these stuck executions? These would help us set up alerts to bring our attention to any affected executions.

@eapolinario (Contributor)

@deadlycoconuts , thanks for the detailed report. A few things to kick off this investigation:

  1. what version of Flyte are you running? Specifically which version of flytepropeller?
  2. the useOffloadedWorkflowClosure flag only affects new workflows.

@hamersaw , can you help in answering the questions in #4349 (comment)?

eapolinario removed the untriaged label on Nov 3, 2023
eapolinario self-assigned this on Nov 3, 2023
kumare3 (Contributor) commented Nov 6, 2023

@eapolinario that is right, it only works for new workflows, and how to enable it is well documented here: https://docs.flyte.org/en/latest/deployment/configuration/performance.html#offloading-static-workflow-information-from-crd
For already-running workflows you will have to go in and clear them manually.

@deadlycoconuts (Author)

Thanks @eapolinario and @kumare3 for getting back to us. We're using v1.9.1 of the FlytePropeller. Thanks too for highlighting that the useOffloadedWorkflowClosure flag only affects new workflows (it doesn't seem like the docs specifically mention that the flag doesn't change the behaviour of existing workflows).

What do you mean by 'clearing' workflows, though? Are there docs we can refer to for this? Or does 'clearing' simply mean creating a new workflow version?

hamersaw added the exo and backlogged labels on Nov 8, 2023
eapolinario assigned hamersaw and unassigned eapolinario on Nov 9, 2023

leonlnj commented Jan 12, 2024

We have set the config below and registered a new version, but the error is still happening. Must it be registered under a new workflow name? Are there any steps we can follow to validate that this is configured correctly?

configmap:
  adminServer:
    flyteadmin:
      useOffloadedWorkflowClosure: true

@eapolinario (Contributor)

@leonlnj , is this still happening? Can you confirm which version of Flyte you're using?
