
[BUG] Etcdserver: request is too large error causing never-ending tasks and log spamming #4349

deadlycoconuts opened this issue Nov 2, 2023 · 7 comments


deadlycoconuts commented Nov 2, 2023

Describe the bug

We’ve observed this issue whereby the FlytePropeller isn’t failing workflows that cannot be written to the etcd store. These workflows get stuck in an enqueue/resync/fail loop which repeats itself every minute or so indefinitely, even though they are shown as ‘failed’ on the FlyteConsole. This also floods our logging services with redundant logs.

We tried toggling the useOffloadedWorkflowClosure flag but unfortunately it didn’t help resolve the etcd errors that we’ve been seeing from the FlytePropeller. We’ve restarted both the FlyteAdmin and the FlytePropeller deployments after updating the configs but that didn’t seem to stop the existing executions from throwing the etcd errors continuously. In the end we had to delete the affected executions one by one manually to stop them from being scheduled endlessly.

There are some related issues and fixes that we found from 2-3 years ago which do not seem to work anymore, so we’re wondering what might’ve changed since:

Related Issues

Related Fixes

In particular, it seems like this if condition is no longer working as expected:

if kubeerrors.IsRequestEntityTooLargeError(err) {
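
For illustration, here is a minimal sketch of the kind of guard we would expect around the CRD update path (this is not the actual FlytePropeller code; the helper names are made up): a 413 RequestEntityTooLarge error from the apiserver, which is how etcd's request size limit surfaces, should be treated as terminal and fail the workflow instead of being re-enqueued forever.

package main

import (
	"errors"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// errTerminal marks update failures for which re-enqueueing can never help.
var errTerminal = errors.New("terminal workflow update error")

// classifyUpdateError is a hypothetical helper: a 413 RequestEntityTooLarge
// response from the apiserver (backed by etcd's request size limit) is treated
// as terminal, while anything else is assumed to be retryable.
func classifyUpdateError(err error) error {
	if err == nil {
		return nil
	}
	if apierrors.IsRequestEntityTooLargeError(err) {
		return fmt.Errorf("%w: %v", errTerminal, err)
	}
	return err
}

func main() {
	// Simulate the apiserver error we see in the propeller logs.
	tooLarge := apierrors.NewRequestEntityTooLargeError("etcdserver: request is too large")
	if errors.Is(classifyUpdateError(tooLarge), errTerminal) {
		fmt.Println("would fail the workflow instead of re-enqueueing it")
	}
}

In our case, however, the affected executions keep getting re-enqueued, which is what produces the endless retry loop and the log spam described above.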

Expected behavior

Tasks should never end up in a state where they are retried endlessly and should be terminated with an error if there is an issue updating their state.

Additional context to reproduce

No response

Screenshots

Here are some images to illustrate the problem that I’ve described:

  1. Affected workflow - it was first executed at 2:17 AM (SGT/HKT) on the 7th of October and failed at around 2:34 AM, but it has been restarted endlessly ever since
    [screenshot]
  2. FlytePropeller logs when the workflow first failed
    [screenshot]
  3. FlytePropeller logs 11 days after the initial failure
    [screenshot]

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
deadlycoconuts added the bug and untriaged labels on Nov 2, 2023

welcome bot commented Nov 2, 2023

Thank you for opening your first issue here! 🛠

@deadlycoconuts (Author)

Here are a couple of questions we have related to this bug:

  1. What happens if, even with the aforementioned flag turned on, the etcd request for a certain workflow still ends up being greater than 10MB?
  2. We noticed that this issue occurs not only for executions which are stuck in the ‘Failing’ state, but also for those stuck in the ‘Running’ state (see screenshot below); is this expected?
    [screenshot]
  3. Is there any built-in mechanism to stop such executions from being retried indefinitely? It’s incredibly easy for such jobs to go unnoticed, and this has a huge impact on our metrics/logs, especially when these executions are triggered as part of a regular cron job (the number of errors raised grows every time a new execution gets triggered only to get stuck). It’s also painstaking work to dig through our logs to identify these executions and terminate them manually. One of the executions we found came from slightly over 2 months ago (see screenshot below).
    [screenshot]
  4. Are there any metrics in the FlytePropeller (or any other components) that would allow us to identify these stuck executions? These would help us set up alerts to bring our attention to any affected executions.

@eapolinario (Contributor)

@deadlycoconuts , thanks for the detailed report. A few things to kick off this investigation:

  1. what version of Flyte are you running? Specifically which version of flytepropeller?
  2. the useOffloadedWorkflowClosure flag only affects new workflows.

@hamersaw , can you help in answering the questions in #4349 (comment)?

eapolinario removed the untriaged label on Nov 3, 2023
eapolinario self-assigned this on Nov 3, 2023
kumare3 (Contributor) commented Nov 6, 2023

@eapolinario that is right, it only works for new workflows, and how to enable it is well documented here: https://docs.flyte.org/en/latest/deployment/configuration/performance.html#offloading-static-workflow-information-from-crd
For already-running workflows you will have to go in and clear them manually.

@deadlycoconuts (Author)

Thanks @eapolinario and @kumare3 for getting back to us. We're using v1.9.1 of the FlytePropeller. Thanks too for highlighting that the useOffloadedWorkflowClosure flag only affects new workflows (it doesn't seem like the docs specifically mention that the flag doesn't change the behaviour of existing workflows).

What do you mean by 'clearing' workflows, though? Are there docs we can refer to for this? Or does 'clearing' simply mean creating a new workflow version?

hamersaw added the exo and backlogged labels on Nov 8, 2023
eapolinario assigned hamersaw and unassigned eapolinario on Nov 9, 2023

leonlnj commented Jan 12, 2024

We have set the config below and registered a new version, but the error is still happening. Must it be registered under a new workflow name? Are there any steps we can follow to validate that this is configured correctly?

configmap:
  adminServer:
    flyteadmin:
      useOffloadedWorkflowClosure: true

@eapolinario (Contributor)

@leonlnj , is this still happening? Can you confirm which version of Flyte you're using?
