-
Notifications
You must be signed in to change notification settings - Fork 300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] Map task caching failures #2113
Conversation
…rface Signed-off-by: Paul Dittamo <[email protected]>
Thank you for opening this pull request! 🙌 These tips will help get your PR across the finish line:
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #2113 +/- ##
==========================================
+ Coverage 82.58% 85.78% +3.19%
==========================================
Files 233 313 +80
Lines 19541 23514 +3973
Branches 3512 3515 +3
==========================================
+ Hits 16138 20171 +4033
+ Misses 2824 2734 -90
- Partials 579 609 +30 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Paul Dittamo <[email protected]>
Signed-off-by: Paul Dittamo <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also update the args also actually?
Another issue is that when flytekit creates the command:
flytekit/flytekit/core/array_node_map_task.py
Line 390 in 6279b81
f'{",".join(t.bound_inputs)}', |
the bound vars can show up in different orders. Can we order these first to make sure the args are always the same?
Same for the old map task:
flytekit/flytekit/core/map_task.py
Line 407 in 6279b81
f'{",".join(t.bound_inputs)}', |
if you sort then this should resolve flyteorg/flyte#4702 as well @pvditt |
@wild-endeavor thanks for pointing that out. Will update this PR |
Signed-off-by: Paul Dittamo <[email protected]>
@@ -77,8 +77,9 @@ def __init__( | |||
f = actual_task.lhs | |||
else: | |||
_, mod, f, _ = tracker.extract_task_module(cast(PythonFunctionTask, actual_task).task_function) | |||
sorted_bounded_inputs = ",".join(sorted(self._bound_inputs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: could also just turn self._bound_inputs into a sorted list as opposed to set - don't feel too strongly either way
Congrats on merging your first pull request! 🎉 |
Tracking issue
flyteorg/flyte#4731
flyteorg/flyte#4702
Why are the changes needed?
Legacy map tasks and new map tasks include a generated task interface as part of task names to support partially binding different parameters of the same mappable task. This introduces issues such as guaranteed cache misses when collection interface has types that include non-deterministic elements.
Example where the output contains a memory address causing for guaranteed cache misses:
The same can be done for inputs.
There's an also issue with map task container args being non-deterministic if there are multiple bounded inputs (flyteorg/flyte#4702) of which can result in an error.
What changes were proposed in this pull request?
Utilize an ordered list of the bounded input names in favor of the entire interface. This is similar to this but disregards any Typed values, opting to only keep the keys/names.
Note, this will cause for all cache enabled map tasks to have cache misses on first runs as task names will all change.
How was this patch tested?
Setup process
Confirmed that intputs/outputs w/ memory address don't impact caching
Confirmed if partially binding different parameters for the same mapped task still worked:
Screenshots
Check all the applicable boxes
Related PRs
Docs link