Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Inconsistent number of retries on failing tasks. Always 4 attempts. #5217

Closed
2 tasks done
dansola opened this issue Apr 10, 2024 · 3 comments
Closed
2 tasks done
Assignees
Labels
bug Something isn't working exo

Comments

@dansola
Copy link
Contributor

dansola commented Apr 10, 2024

Describe the bug

There arrear to be some inconsistencies with how retries are handled when set in the task decorator. According to the docs, recoverable exceptions should be respected by retries, but non-recoverable exceptions should not. I am able to see this behavior respected when I directly raise FlyteRecoverableException in a task as compared to directly raising something like RuntimeError or even FlyteAssertion. The inconsistency appears to arise when certain flyte exceptions arise (e.g. FlyteAssertion) and >1 attempt occur. Specifically, four attempts appear to occur regardless of what retries is set to.

Here is an example:

from flytekit import task, workflow
from flytekit.types.file import FlyteFile


@task(retries=2)
def my_task() -> FlyteFile:
    return FlyteFile("/not/a/real/file.txt")


@workflow
def wf():
    ff = my_task()
    print(ff)

The following error occurs here after four attempts:

TypeError: Failed to convert outputs of task 'workflows.retry_test.my_task' at position 0:
  USER:AssertionError: error=File /not/a/real/file.txt does not exist

which I would expect to not retry as it is not a recoverable exception. Regardless, if it is meant to be retired, I would expect it to respect retries=2 rather than making four attempts.

It is also worth noting that when looking at the task details in the UI, the specified number of retries is populated.

Expected behavior

I would expect the task to either not retry given that the exception is non-recoverable, or if the exception is meant to be recoverable, I would expect it to respect the number of retries specified in the decorator.

Additional context to reproduce

These examples were run on union cloud.

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@dansola dansola added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Apr 10, 2024
@pvditt pvditt added the exo label Apr 11, 2024
@pvditt pvditt self-assigned this Apr 11, 2024
@wild-endeavor wild-endeavor removed the untriaged This issues has not yet been looked at by the Maintainers label Apr 11, 2024
@pvditt
Copy link
Contributor

pvditt commented Apr 11, 2024

@dansola That error maps to a system error. This is expected behavior for system errors.

There's a check for retries for system errors here . MaxNodeRetriesOnSystemFailures defaults to 3, so the 4 total tries is expected.

Is the concern that the particular errors should not be considered system errors?

There's also an option to configure IgnoreRetryCause so that all retryable errors count the same against a node's retries/max attempts.

@kumare3
Copy link
Contributor

kumare3 commented Apr 17, 2024

Shall we close this issue? @dansola

@dansola
Copy link
Contributor Author

dansola commented Apr 17, 2024

Yes, thank you. I will close the issue.

@dansola dansola closed this as completed Apr 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working exo
Projects
None yet
Development

No branches or pull requests

4 participants