
[BUG] Outputs of a subworkflow in a remote launch plan is handled incorrectly #4369

Closed
2 tasks done
honnix opened this issue Nov 6, 2023 · 7 comments
Assignees: pvditt
Labels: backlogged (For internal use. Reserved for contributor team workflow.), bug (Something isn't working), exo

Comments

honnix (Member) commented Nov 6, 2023

Describe the bug

flytepropeller may not receive the full outputs of a subworkflow launched via a remote launch plan, depending on how storage.limits.maxDownloadMBs is configured in flyteadmin, and it does not fail loudly when the outputs size is larger than storage.limits.maxDownloadMBs. What's worse is that flytepropeller happily stores the 0-sized protobuf file to be used as the inputs of the following task.

Currently the error message is impossible for users to understand and also very hard for Flyte admins to investigate:

failed at Node[wxetj-n1]. BindingResolutionError: Error binding Var [wf].[asi], caused by: failed at Node[wxetj-n0]. OutputsNotFoundError: Outputs not found at [gs://.../wxetj-n0/data/0/outputs.pb]

Expected behavior

flytepropeller should fail loudly to indicate that it could not receive expected outputs.
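The requested behavior could look roughly like the following sketch. This is purely illustrative: `validate_outputs` and `OutputsTruncatedError` are hypothetical names, not real Flyte APIs, and the actual flytepropeller code paths differ. The point is to reject an empty payload instead of silently persisting it as `outputs.pb`.

```python
class OutputsTruncatedError(Exception):
    """Raised when remote outputs come back empty, e.g. because the real
    payload exceeded storage.limits.maxDownloadMBs in flyteadmin."""


def validate_outputs(payload: bytes) -> bytes:
    """Reject a zero-sized outputs protobuf instead of storing it for
    downstream tasks to consume as their inputs."""
    if not payload:
        raise OutputsTruncatedError(
            "received empty outputs; the actual size may exceed "
            "storage.limits.maxDownloadMBs configured in flyteadmin"
        )
    return payload
```

With such a check, the failure would surface at the node that fetched the outputs, rather than as an `OutputsNotFoundError` on a later node.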

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
honnix added the bug (Something isn't working) and untriaged (This issue has not yet been looked at by the Maintainers) labels on Nov 6, 2023
honnix (Member Author) commented Nov 6, 2023

By the way, it is also very confusing to have two config items with overlapping purposes:

  • remoteData.maxSizeInBytes
  • storage.limits.maxDownloadMBs
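For context, the two settings live in different sections of the flyteadmin configuration. The fragment below is an illustrative sketch with example values only, not a recommended configuration:

```yaml
# Illustrative flyteadmin config fragment (example values only).
remoteData:
  # Roughly: outputs larger than this are served by reference (URL)
  # rather than inlined in the API response.
  maxSizeInBytes: 2097152  # 2 MiB
storage:
  limits:
    # Cap on how much flyteadmin will download from blob storage.
    maxDownloadMBs: 2
```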

honnix (Member Author) commented Nov 6, 2023

I think that, in general, configuring smaller storage.limits.maxDownloadMBs values (in flyteadmin, flytepropeller, and datacatalog) than max-output-size-bytes does not make sense, because the same execution would then fail at different places.

hamersaw added the exo and backlogged (For internal use. Reserved for contributor team workflow.) labels and removed the untriaged (This issue has not yet been looked at by the Maintainers) label on Nov 8, 2023
pvditt self-assigned this on Dec 13, 2023
hamersaw (Contributor) commented
@honnix this fix is rather confusing given the plethora of configuration options. Do you know if you are using signed URLs? It seems the proto support for this flow is deprecated, but none of the configuration options are labeled as such.

The overall API design suggests that fetching output data within admin is "best effort", and if it is not returned then the client should retry. This leads me to believe that rather than attempting to return the error value here (in only the correct scenarios), we should attempt to retrieve the data if it's missing from flytepropeller when we retrieve the execution. Thoughts?

@pvditt please comment as well!

honnix (Member Author) commented Dec 21, 2023

@hamersaw The signed URLs won't help in cases where flytepropeller needs to fetch outputs from flyteadmin for the following task executions, and currently it doesn't use signed URLs for that, if I'm not mistaken. Yes, the outputs field is deprecated, but not full_outputs, which is what flytepropeller reads from.

Regarding retries: yes, if anything is in the critical path, a proper retry sounds like a great idea. In this case, however, a retry would not make any difference, because there was no RPC failure; flytepropeller was not able to tell whether the workflow indeed didn't output anything or flyteadmin was just refusing to fetch the data due to the max size config.
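The ambiguity described above can be shown with a toy sketch (hypothetical names, not Flyte APIs): after a successful RPC, an empty payload carries no signal that distinguishes "no outputs" from "silently truncated", so retrying the same call cannot help.

```python
def interpret_outputs(payload: bytes) -> str:
    """Mimics what a caller sees after a *successful* RPC with no error."""
    if not payload:
        # Either of these looks identical from the caller's side:
        #  - the workflow genuinely produced no outputs, or
        #  - flyteadmin skipped fetching because the blob exceeded
        #    storage.limits.maxDownloadMBs.
        return "ambiguous: empty outputs or silently truncated"
    return "outputs present"
```

An explicit error (or a reported size) in the response is what would break the ambiguity, which is why failing loudly matters more than retrying here.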

pvditt (Contributor) commented Dec 21, 2023

@hamersaw I believe the initial API design intention for retries was for the console. This PR removed the usage of UrlBlob in the console. I don't believe there has been an expectation of retries on these calls for a while. I made an issue to clean up UrlBlob.

Good call-out on retries from propeller. I'll add that to this PR.

pvditt (Contributor) commented Dec 22, 2023

@hamersaw actually, would it make sense to add retries on the propeller side when we already have system error retries?

pvditt (Contributor) commented Jan 19, 2024

@honnix this issue should be closed by: #4602

There's a follow-up housekeeping issue, which I'll be resolving soon, to have that GetExecutionData admin endpoint bubble up errors; it doesn't block closing this issue.

Please feel free to re-open this issue and tag me if this change does not fix it.

pvditt closed this as completed on Jan 19, 2024