Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug/abort map task subtasks #4506

Merged
merged 17 commits into from
Dec 15, 2023
Merged

Bug/abort map task subtasks #4506

merged 17 commits into from
Dec 15, 2023

Conversation

pvditt
Copy link
Contributor

@pvditt pvditt commented Nov 30, 2023

Tracking issue

Currently, when a single MapTask subtask fails the other subtasks are not aborted as the task has already been set to a terminal state.

Why are the changes needed?

It is expected behavior for subtasks to abort on a single subtask failure for maptasks.

Docs link

Describe your changes

  • On legacy MapTask/Array plugin failure, abort subtasks
  • Add new phase for legacy MapTask/Array plugin to abort subtasks
  • Persist aborted subtask states to Flyte admin

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Setup Process

Built locally w/ flyte-single-binary-local.yaml

Screenshots

Screenshot 2023-12-09 at 6 27 33 PM Screenshot 2023-12-09 at 6 25 49 PM

Note to reviewers

Related PRs

Copy link

welcome bot commented Nov 30, 2023

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if not, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Nov 30, 2023
Copy link

codecov bot commented Nov 30, 2023

Codecov Report

Attention: 49 lines in your changes are missing coverage. Please review.

Comparison is base (1bed29a) 59.00% compared to head (b34f6d3) 58.97%.

Files Patch % Lines
...lytepropeller/pkg/controller/nodes/task/handler.go 9.67% 27 Missing and 1 partial ⚠️
...lyteplugins/go/tasks/pluginmachinery/core/phase.go 0.00% 7 Missing ⚠️
...lyteplugins/go/tasks/plugins/array/k8s/executor.go 0.00% 5 Missing ⚠️
...propeller/pkg/controller/nodes/task/transformer.go 0.00% 4 Missing ⚠️
...teplugins/go/tasks/plugins/array/k8s/management.go 90.00% 3 Missing ⚠️
...lugins/go/tasks/plugins/array/awsbatch/executor.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4506      +/-   ##
==========================================
- Coverage   59.00%   58.97%   -0.04%     
==========================================
  Files         621      621              
  Lines       52498    52568      +70     
==========================================
+ Hits        30979    31000      +21     
- Misses      19054    19093      +39     
- Partials     2465     2475      +10     
Flag Coverage Δ
unittests 58.97% <40.96%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Dec 4, 2023
@pvditt pvditt requested a review from hamersaw December 5, 2023 01:24
@pvditt pvditt force-pushed the bug/abort-map-task-subtasks branch from cb9c766 to fb4dce2 Compare December 5, 2023 20:32
@pvditt pvditt closed this Dec 6, 2023
@pvditt pvditt reopened this Dec 6, 2023
@hamersaw
Copy link
Contributor

hamersaw commented Dec 7, 2023

A very high-level thought here. And apologies for steering in the wrong direction, these changes (ie. eventing) were much more involved that I anticipated. Does it make more sense to add another state to maptask for something like FailureCleanup. Where, while the task is still technically in the Running phase we abort subtasks that are still running? Then we don't need to add eventing to the abort function, add IsSubtaskAbort toExternalResourceInfo, and update admins event handling (ie. same phase with increased TaskPhaseVersion) for something that will soon be phased out.

@pvditt
Copy link
Contributor Author

pvditt commented Dec 7, 2023

A very high-level thought here. And apologies for steering in the wrong direction, these changes (ie. eventing) were much more involved that I anticipated. Does it make more sense to add another state to maptask for something like FailureCleanup. Where, while the task is still technically in the Running phase we abort subtasks that are still running? Then we don't need to add eventing to the abort function, add IsSubtaskAbort toExternalResourceInfo, and update admins event handling (ie. same phase with increased TaskPhaseVersion) for something that will soon be phased out.

@hamersaw

Good call out - definitely agree that we should minimize changes outside of the array plugin. I also think leaving terminal states as final is probably a safer design decision. Aborting the subtasks while the task is still running would remove the added eventing in the abort function and remove the change in FlyteAdmin enabling the writes on already terminal phases.

I think we will still need to update ExternalResourceInfo to persist subtask state correctly from running to aborted. I'll try to figure another way to persist this state, but I think we should show the correct state for subtasks given they are aborted.

@pvditt
Copy link
Contributor Author

pvditt commented Dec 8, 2023

@hamersaw to persist subtask abort states when a map task is manually terminated, I think we would need to keep the added buffered task eventing in the abort function since propellor won't go through the plugin Handle method.

We can avoid the changes to flyteadmin by having the recorded phaseinfo set to running & bumping phase version. This would work when a maptask is manually terminated and on terminating subtasks when the task is still running. Have this working locally but need to clean up some of the eventing/phase versioning to minimize debugger output on manual terminations. Will push those changes up tomorrow.

Paul Dittamo and others added 13 commits December 13, 2023 09:10
Signed-off-by: Paul Dittamo <[email protected]>
Signed-off-by: Paul Dittamo <[email protected]>
Signed-off-by: Paul Dittamo <[email protected]>
Signed-off-by: Paul Dittamo <[email protected]>
Signed-off-by: Paul Dittamo <[email protected]>
Signed-off-by: Paul Dittamo <[email protected]>
@pvditt pvditt force-pushed the bug/abort-map-task-subtasks branch from 4656256 to ccd31ce Compare December 13, 2023 17:10
hamersaw and others added 4 commits December 14, 2023 09:29
* using Abort phase and cleaning up best-effort

Signed-off-by: Daniel Rammer <[email protected]>

* removed phase updates on Finalize

Signed-off-by: Daniel Rammer <[email protected]>

---------

Signed-off-by: Daniel Rammer <[email protected]>
Signed-off-by: Paul Dittamo <[email protected]>
@hamersaw hamersaw merged commit 1699094 into master Dec 15, 2023
43 of 45 checks passed
@hamersaw hamersaw deleted the bug/abort-map-task-subtasks branch December 15, 2023 15:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:L This PR changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants