Added literal offloading for array node map tasks #5697
Conversation
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #5697      +/-   ##
==========================================
+ Coverage   36.21%   36.29%   +0.07%
==========================================
  Files        1303     1305       +2
  Lines      109568   109991     +423
==========================================
+ Hits        39683    39924     +241
- Misses      65764    65912     +148
- Partials    4121     4155      +34

Flags with carried-forward coverage won't be shown. View the full report in Codecov by Sentry.
looks great to me!
High-level question: it looks like we're using a semver comparison on the flytekit version of the specific task that is executing. Is this right? For example, suppose we have a workflow with an ArrayNode (n0) that sends data to a TaskNode (n1). We're checking whether the flytekit version of the ArrayNode task for n0 supports offloading. IIUC that task will execute normally and we will transparently offload, so what we're really concerned with is whether the task for n1 supports offloading.
@hamersaw I agree with this concern, and I think we raised a similar concern during the spec review with @katrogan; we decided to allow things to fail for these heterogeneous scenarios. I'm still trying to understand the type system and propeller code, but if we do need to support this, what do you think would be needed here?
This is really difficult, because it's unclear whether we want to fail just because propeller supports offloading and the task (or downstream tasks) do not. If the outputs aren't large enough to offload, we would want the workflow to run successfully rather than fail even though it could succeed, right? If we absolutely need to validate that the tasks' flytekit versions support offloading, the only place that makes sense to me is when propeller begins evaluation (in the handleReadyWorkflow function). We can't do it during registration or compilation because flyteadmin doesn't know whether propeller has offloading enabled. In this function we could iterate through all tasks in the
This would be much easier. In the node executor we copy inputs from previous nodes when transitioning the node from
Thanks @hamersaw. I'd want to avoid complicated logic if possible and rely on a rollout process to solve it. If we can't rely on that process and we want automated checks, then we should think about how to make this fully work (even the dynamic case) for old workflows without failing them.
Yes, this is not supported for plugins. I don't think we have even scoped this for plugins in general, as we are tackling the map task as the first use case of this broad problem with large datasets. cc @katrogan
This sounds reasonable: we don't attempt to type check at the onset of execution, but can lazily decide to fail if a downstream task appears incapable of consuming an offloaded output, and only in the scenario where we have actually offloaded an output. Re: plugins: I believe spark, agents, and flytekit plugins should continue to work (they use flytekit type transformers to unmarshal literals, which is the SDK check we're gating against). For example, spark calls pyflyte execute: https://github.com/flyteorg/flytekit/blob/master/plugins/flytekit-spark/flytekitplugins/spark/task.py#L221 cc @eapolinario would love for you to double-check this. There is follow-up work to make copilot data reading work that is slated for after this (thank you @pmahindrakar-oss).
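The lazy behavior described here could look roughly like the following minimal Go sketch. The `Literal` struct and `checkConsumable` helper are hypothetical simplifications of the flyteidl types, not propeller's actual implementation: the idea is to fail only at consumption time, and only when an input was actually offloaded and the consuming task's SDK cannot read it.

```go
package main

import "fmt"

// Literal is a simplified stand-in for flyteidl's Literal message: either an
// inline value or a URI pointing at an offloaded blob.
type Literal struct {
	Inline       string
	OffloadedURI string // non-empty means the value was offloaded
}

// checkConsumable is a hypothetical version of the lazy check: inline inputs
// always pass; an offloaded input fails only if the consumer's SDK is too old.
func checkConsumable(inputs map[string]Literal, consumerSupportsOffload bool) error {
	for name, lit := range inputs {
		if lit.OffloadedURI != "" && !consumerSupportsOffload {
			return fmt.Errorf("input %q is offloaded to %s but the consuming task's SDK cannot read offloaded literals", name, lit.OffloadedURI)
		}
	}
	return nil
}

func main() {
	small := map[string]Literal{"x": {Inline: "42"}}
	big := map[string]Literal{"x": {OffloadedURI: "s3://bucket/x.pb"}}
	fmt.Println(checkConsumable(small, false) == nil) // inline always consumable
	fmt.Println(checkConsumable(big, false) == nil)   // offloaded + old SDK fails
	fmt.Println(checkConsumable(big, true) == nil)    // offloaded + new SDK passes
}
```

Note that a workflow with small outputs never trips the check, which matches the desire above to let such workflows succeed regardless of SDK versions.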
looks great!
LGTM but please wait for @hamersaw's review too
Signed-off-by: pmahindrakar-oss <[email protected]>
```go
	phaseInfo = handler.PhaseInfoFailure(core.ExecutionError_SYSTEM, "GetTaskIDFailure", err.Error(), nil)
	return &phaseInfo
}
runtimeData := taskNode.CoreTask().GetMetadata().GetRuntime()
```
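For context, the version gate being discussed in this thread could be sketched roughly as follows. `parseVersion`, `atLeast`, and `supportsOffloading` are hypothetical helpers, and the minimum version string is illustrative rather than the value used in this PR; pre-release suffixes are not handled in this sketch.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a "major.minor.patch" string into integers.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.SplitN(strings.TrimPrefix(v, "v"), ".", 3)
	if len(parts) != 3 {
		return out, fmt.Errorf("malformed version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("malformed version %q: %w", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// atLeast reports whether version a >= version b, comparing component-wise.
func atLeast(a, b [3]int) bool {
	for i := 0; i < 3; i++ {
		if a[i] != b[i] {
			return a[i] > b[i]
		}
	}
	return true
}

// supportsOffloading is a hypothetical helper: given the runtime metadata
// recorded at serialization time (SDK flavor and version), decide whether the
// task's SDK is new enough to read offloaded literals.
func supportsOffloading(flavor, version, minVersion string) bool {
	if flavor != "python" { // only flytekit tasks are gated in this sketch
		return false
	}
	v, err := parseVersion(version)
	if err != nil {
		return false
	}
	min, err := parseVersion(minVersion)
	if err != nil {
		return false
	}
	return atLeast(v, min)
}

func main() {
	fmt.Println(supportsOffloading("python", "1.13.5", "1.13.5")) // true
	fmt.Println(supportsOffloading("python", "1.12.0", "1.13.5")) // false
}
```

As the thread below discusses, this check only sees the version recorded at serialization time, which may differ from the flytekit version inside the container image.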
I don't have good context here, but TaskNodes encompass all task executions in Flyte (actors, agents, spark, dask, ray, etc.). How is this runtimeData set for each? For example, if we offload into an agent task, is there a specific check we need to do there?
If my understanding is correct, anything we serialize from flytekit should have the runtime set from the SDK, right?
E.g., here for spark: https://github.com/flyteorg/flytekit/blob/ae9c6f8de21eb716dbda8d1af555c2057655db1d/plugins/flytekit-spark/flytekitplugins/spark/task.py#L153-L159 we call the super method for PythonFunctionTask.
This https://github.com/flyteorg/flytekit/blob/ae9c6f8de21eb716dbda8d1af555c2057655db1d/flytekit/core/python_function_task.py#L81 inherits from PythonAutoContainerTask, which itself inherits from PythonTask (https://github.com/flyteorg/flytekit/blob/ae9c6f8de21eb716dbda8d1af555c2057655db1d/flytekit/core/python_auto_container.py#L29), which sets the metadata: https://github.com/flyteorg/flytekit/blob/ae9c6f8de21eb716dbda8d1af555c2057655db1d/flytekit/core/base_task.py#L167-L169
For agents in particular, we use the pyflyte entrypoint for all task types, so I believe we should get the type transformer changes for reading offloaded inputs for free.
E.g., here https://github.com/flyteorg/flytekit/blob/master/flytekit/extend/backend/agent_service.py#L176 and elsewhere we call LiteralMap.from_flyte_idl in the agent service.
We call agent.do https://github.com/flyteorg/flytekit/blob/master/flytekit/extend/backend/agent_service.py#L178 or agent.create https://github.com/flyteorg/flytekit/blob/master/flytekit/extend/backend/agent_service.py#L124, and then call input_python_value = TypeEngine.literal_map_to_kwargs (https://github.com/flyteorg/flytekit/blob/ae9c6f8de21eb716dbda8d1af555c2057655db1d/plugins/flytekit-openai/flytekitplugins/openai/chatgpt/agent.py#L33), which itself calls to_python_value: https://github.com/flyteorg/flytekit/blob/ae9c6f8de21eb716dbda8d1af555c2057655db1d/flytekit/core/type_engine.py#L1181, which is what Eduardo's PR modifies: https://github.com/flyteorg/flytekit/pull/2685/files#diff-68624e201f644896b015de1a5716eabef9b8b5a106caba1d9f21c51c0708684bR1101
cc @eapolinario let me know if this sounds correct to you, would love someone from OSS to verify 🙂
cc @wild-endeavor @pingsutw
Users may use flytekit 1.13.5 to serialize the workflow, but flytekit 1.12 in the container.
Which should be fine; the version checks ensure compatibility before this feature is used.
Yes, the type transformer is used to serialize/deserialize literals, so we should only need to change the transformer.
Given the path @katrogan pointed out above, I don't see that we need to change anything here, as the base changes already take care of reading the offloaded values.
The more concerning bit is that the runtime metadata doesn't correspond to the container image. Is the only way to verify the SDK version used by the code to introspect the image, then?
I think so. We could probably release flytekit first and ensure most users have upgraded it.
We handle the flytekit version check, so IMO it shouldn't block this PR; if folks are OK with the current version of the code, then let's check this in unless there are any other concerns. @hamersaw @katrogan @pingsutw I can either require a higher flytekit version in this PR or wait on the flytekit release.
sounds good to me, I think this is the best check we have for compatibility
I'm OK with this.
Small nit.
Tracking issue
RFC #5103
This PR tackles mapped output results exceeding the configured minimum offloading size.
Why are the changes needed?
Without these changes, we hit the gRPC limit when propeller tries to create an execution by passing inline the inputs it got from the output of a large map task.
Here you can see the failure where we hit the default 4 MB gRPC message size limit.
This is the line where it fails
flyte/flytepropeller/pkg/controller/nodes/subworkflow/launchplan/admin.go
Line 151 in 7136919
See the testing details for the workflow used, in particular the section where a large task output is returned in the processed results and passed to a launch plan. Propeller sends these inputs inline, so if the value is too large we exceed the gRPC limit.
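The fix can be sketched at a high level as follows. `blobStore`, `maybeOffload`, the bucket path, and the 1 MiB threshold are all illustrative, not the actual propeller configuration: when a serialized literal payload exceeds the minimum offloading size, it is written to blob storage and only a URI reference travels inline over gRPC.

```go
package main

import "fmt"

// Illustrative threshold, comfortably below the 4 MB gRPC default.
const minOffloadBytes = 1 << 20

type blobStore map[string][]byte

// maybeOffload sketches the write path: small payloads stay inline, large
// ones are stored and replaced by a URI reference.
func maybeOffload(payload []byte, key string, s blobStore) (inline []byte, uri string) {
	if len(payload) < minOffloadBytes {
		return payload, "" // small enough to travel inline over gRPC
	}
	u := "s3://bucket/offloaded/" + key
	s[u] = payload
	return nil, u // send only the reference inline
}

func main() {
	s := blobStore{}
	small := make([]byte, 100)
	big := make([]byte, 20<<20) // ~20 MB, like the test workflow's map output
	if _, uri := maybeOffload(small, "small.pb", s); uri == "" {
		fmt.Println("small payload stays inline")
	}
	if inline, uri := maybeOffload(big, "big.pb", s); inline == nil {
		fmt.Println("large payload offloaded to", uri)
	}
}
```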
What changes were proposed in this pull request?
How was this patch tested?
Using the following workflow, which generates roughly 20 MB of map output when launched with input=20:
ref_wf
Workflow used for testing
Before the change, we receive the following error.
After the change, the workflow succeeds.
The following is the subworkflow CRD, which shows the offloaded URI being used as input when propeller launches an execution.
Logs from flytekit reading the offloaded literal and passing it to the subworkflow. This uses the flytekit changes from flyteorg/flytekit#2685, which read the offloaded literal if present.
Pending: unit tests.