
[BUG] Outputs of a subworkflow in a remote launch plan is handled incorrectly #4369

Closed
2 tasks done
honnix opened this issue Nov 6, 2023 · 7 comments
Assignees: pvditt
Labels: backlogged (For internal use. Reserved for contributor team workflow.), bug (Something isn't working), exo

Comments

honnix (Member) commented Nov 6, 2023

Describe the bug

flytepropeller may not receive the full outputs of a subworkflow launched via a remote launch plan, depending on how storage.limits.maxDownloadMBs is configured in flyteadmin, and it does not fail loudly when the outputs size is larger than storage.limits.maxDownloadMBs. What's worse is that flytepropeller happily stores the 0-sized protobuf file to be used as the inputs of the following task.

Currently the error message is impossible for users to understand and also very hard for Flyte admins to investigate:

failed at Node[wxetj-n1]. BindingResolutionError: Error binding Var [wf].[asi], caused by: failed at Node[wxetj-n0]. OutputsNotFoundError: Outputs not found at [gs://.../wxetj-n0/data/0/outputs.pb]

Expected behavior

flytepropeller should fail loudly to indicate that it could not receive expected outputs.
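The requested behavior could look roughly like the following sketch. This is purely illustrative: `validate_outputs` and `OutputsTruncatedError` are hypothetical names, not real Flyte APIs, and the actual flytepropeller code paths differ. The point is to reject an empty payload instead of silently persisting it as `outputs.pb`.

```python
class OutputsTruncatedError(Exception):
    """Raised when remote outputs come back empty, e.g. because the real
    payload exceeded storage.limits.maxDownloadMBs in flyteadmin."""


def validate_outputs(payload: bytes) -> bytes:
    """Reject a zero-sized outputs protobuf instead of storing it for
    downstream tasks to consume as their inputs."""
    if not payload:
        raise OutputsTruncatedError(
            "received empty outputs; the actual size may exceed "
            "storage.limits.maxDownloadMBs configured in flyteadmin"
        )
    return payload
```

With such a check, the failure would surface at the node that fetched the outputs, rather than as an `OutputsNotFoundError` on a later node.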

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
honnix added the bug (Something isn't working) and untriaged (This issue has not yet been looked at by the Maintainers) labels on Nov 6, 2023
honnix (Member Author) commented Nov 6, 2023

By the way, it is also very confusing to have two config items with overlapping purposes:

  • remoteData.maxSizeInBytes
  • storage.limits.maxDownloadMBs
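For context, the two settings live in different sections of the flyteadmin configuration. The fragment below is an illustrative sketch with example values only, not a recommended configuration:

```yaml
# Illustrative flyteadmin config fragment (example values only).
remoteData:
  # Roughly: outputs larger than this are served by reference (URL)
  # rather than inlined in the API response.
  maxSizeInBytes: 2097152  # 2 MiB
storage:
  limits:
    # Cap on how much flyteadmin will download from blob storage.
    maxDownloadMBs: 2
```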

honnix (Member Author) commented Nov 6, 2023

I think that, in general, configuring smaller storage.limits.maxDownloadMBs values (in flyteadmin, flytepropeller, and datacatalog) than max-output-size-bytes does not make sense, because the same execution would then fail at different places.

hamersaw added the exo and backlogged (For internal use. Reserved for contributor team workflow.) labels and removed the untriaged (This issue has not yet been looked at by the Maintainers) label on Nov 8, 2023
pvditt self-assigned this on Dec 13, 2023
hamersaw (Contributor) commented
@honnix this fix is rather confusing given the plethora of configuration options. Do you know if you are using signed URLs? It seems the proto support for this flow is deprecated, but none of the configuration options are labeled as such.

The overall API design suggests that fetching output data within admin is "best effort", and if it is not returned then the client should retry. This leads me to believe that rather than attempting to return the error value here (in only the correct scenarios), we should attempt to retrieve the data if it's missing from flytepropeller when we retrieve the execution. Thoughts?

@pvditt please comment as well!

honnix (Member Author) commented Dec 21, 2023

@hamersaw The signed URLs won't help in cases where flytepropeller needs to fetch outputs from flyteadmin for the following task executions, and currently it doesn't use signed URLs for that, if I'm not mistaken. Yes, the outputs field is deprecated, but not full_outputs, which is what flytepropeller reads from.

Regarding retries: yes, if anything is in the critical path, a proper retry sounds like a great idea. In this case, however, a retry would not make any difference, because there was no RPC failure; flytepropeller was not able to tell whether the workflow indeed didn't output anything or flyteadmin was just refusing to fetch the data due to the max size config.
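The ambiguity described above can be shown with a toy sketch (hypothetical names, not Flyte APIs): after a successful RPC, an empty payload carries no signal that distinguishes "no outputs" from "silently truncated", so retrying the same call cannot help.

```python
def interpret_outputs(payload: bytes) -> str:
    """Mimics what a caller sees after a *successful* RPC with no error."""
    if not payload:
        # Either of these looks identical from the caller's side:
        #  - the workflow genuinely produced no outputs, or
        #  - flyteadmin skipped fetching because the blob exceeded
        #    storage.limits.maxDownloadMBs.
        return "ambiguous: empty outputs or silently truncated"
    return "outputs present"
```

An explicit error (or a reported size) in the response is what would break the ambiguity, which is why failing loudly matters more than retrying here.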

pvditt (Contributor) commented Dec 21, 2023

@hamersaw I believe the initial API design intention for retries was for the console. This PR removed the usage of UrlBlob in the console. I don't believe there has been an expectation of retries on these calls for a while. I made an issue to clean up UrlBlob.

Good call-out on retries from propeller. I'll add that to this PR.

pvditt (Contributor) commented Dec 22, 2023

@hamersaw actually, would it make sense to add retries on the propeller side when we already have system error retries?

pvditt (Contributor) commented Jan 19, 2024

@honnix this issue should be closed by: #4602

There's a follow-up housekeeping issue, which I'll be resolving soon, to have that GetExecutionData admin endpoint bubble up errors; it doesn't block closing this issue.

Please feel free to re-open this issue and tag me if this change does not fix it.

pvditt closed this as completed on Jan 19, 2024