Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Select cluster execution when triggering an execution #4956

Closed
wants to merge 4 commits into from

Conversation

RRap0so
Copy link
Contributor

@RRap0so RRap0so commented Feb 26, 2024

RFC to propose a new Cluster Execution implementation.

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Feb 26, 2024
Signed-off-by: Rafael Raposo <[email protected]>
Signed-off-by: Rafael Raposo <[email protected]>
@davidmirror-ops
Copy link
Contributor

This is amazing, I can think of several users who'd like to see a better mechanism to handle executions in multi-cluster scenarios.
Thanks!

Copy link

codecov bot commented Feb 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.97%. Comparing base (94e433b) to head (7f9cb8b).
Report is 31 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #4956   +/-   ##
=======================================
  Coverage   58.96%   58.97%           
=======================================
  Files         645      645           
  Lines       55506    55561   +55     
=======================================
+ Hits        32730    32766   +36     
- Misses      20177    20200   +23     
+ Partials     2599     2595    -4     
Flag Coverage Δ
unittests 58.97% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Co-authored-by: sonjaer <[email protected]>
Signed-off-by: Rafael Raposo <[email protected]>
@davidmirror-ops davidmirror-ops added the rfc A label for RFC issues label Feb 26, 2024
bstadlbauer
bstadlbauer previously approved these changes Feb 29, 2024
Copy link
Member

@bstadlbauer bstadlbauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a great usecase and the required changes shouldn't introduce too much extra complexity 👍


## 8 Unresolved questions

What takes priority Labels or Pools?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would vote for labels to take priority as those seem to be more execution specific. I might be missing something though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the label usage seems more related to "I need this X type of cluster for this <PROJECT/Domain>".

So to be sure, we first check if there's a label set, then under the clusters in that label if there's a pool. If not then default using the weight. (Similar to what is done today.)

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 29, 2024
Co-authored-by: Fabio M. Graetz, Ph.D. <[email protected]>
Signed-off-by: Rafael Raposo <[email protected]>

Proposed configuration:
```
clusters:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As said in the contributors' sync:

Love the proposal, my only wish would be to figure out whether there is a way to simplify this config here by not having to configure a cluster twice, once for the existing project/domain mechanism, once for the new mechanism. Would be nice if the cluster configuration could be shared.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fg91 I think it should be possible by expanding the ExecutionParameters struct. By adding the ExecutionClusterLabel to this struct and making sure we pass it when executing it should make things simpler and avoid having double configs.

This will mean a bit of re-writing, since it's late I'll also cross-check my thinking and get back with some certainties next week.

Copy link
Contributor

@katrogan katrogan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple of edge cases come to mind which would be great to clarify

  • are cluster pools ever reserved or restricted? can any execution target any cluster pool?
  • what happens when a cluster pool AND an execution cluster are specified in an execution spec?
  • what about when a CreateExecution request targets a specific cluster but the hierarchical overrides include a (conflicting) cluster pool (or vice versa) do we always read from the execution spec

also a general question on validation, what happens if flyteadmin is brought up with a config where cluster pools reference non-existent execution clusters? what if a matchable attribute is added with a cluster pool referencing a non-existent cluster pool?

@RRap0so
Copy link
Contributor Author

RRap0so commented Mar 4, 2024

@fg91 I've gave it a shot in this PR. Instead of adding a new field in the ExecutionTargetSpec struct, we're reusing the ClusterAssignment and adding it there.

This is an alternative to what we propose with the RFC above but keeping the same idea, this was much of the code will be reused and minimal changes are required.

@katrogan please take a look at the alternative here. This should solve any question with pools vs labels. We will first honor whatever label was first set in the execution and then project.

@wild-endeavor
Copy link
Contributor

hey @RRap0so,

late to the party on this one, but i'm not sure I understand. Could you help clear up some confusion for me please?

Today admin already has the ability to do:

      labelClusterMap:
        production:
        - id: flyte-1-k8s
          weight: 0.3
        - id: flyte-2-k8s
          weight: 0.7

(And looking at how we do the weighting, you can do 3 and 7 also.)

And then you can set a project or domain to use the production label, and have it randomly pick between them. Does the behavior you're proposing differ from this? If so could you explain more please, I'm missing something.

@RRap0so
Copy link
Contributor Author

RRap0so commented Mar 11, 2024

hey @RRap0so,

late to the party on this one, but i'm not sure I understand. Could you help clear up some confusion for me please?

Today admin already has the ability to do:

      labelClusterMap:
        production:
        - id: flyte-1-k8s
          weight: 0.3
        - id: flyte-2-k8s
          weight: 0.7

(And looking at how we do the weighting, you can do 3 and 7 also.)

And then you can set a project or domain to use the production label, and have it randomly pick between them. Does the behavior you're proposing differ from this? If so could you explain more please, I'm missing something.

Sure thing @wild-endeavor . This RFC is not changing what is done in terms of weighs and picking the cluster, just introducing another way to select a workflow to run in another cluster at execution time. The problem today is that we can only set a project or domain and we also want to define per execution.

We started this RFC by using the cluster_assignment since it was a field that already existed but wasn't fully implemented.

There's an alternative PR that probably explains in a simpler way what exactly we intend to do and if this is the way forward, happy to update the RFC accordingly.

#4998

@wild-endeavor
Copy link
Contributor

oh @RRap0so are you just saying there's no way to pass an execution label at execution kick-off time?

cuz matchable overrides at the workflow level is also possible already, but i agree i'm not seeing the label at execution kickoff time.

@RRap0so
Copy link
Contributor Author

RRap0so commented Mar 12, 2024

@wild-endeavor exactly! We spotted the cluster assignment and noticed some artifacts about clusterPools so we did the RFC to fully implement the concept but probably the alternative PR is the quickest way without duplicating configs and enough for the use-case.

@wild-endeavor
Copy link
Contributor

Sorry for the delay. In that case, can we just add the ExecutionClusterLabel object to the ExecutionSpec message? And then have admin respect that as an override?

Will you be at the contributor meeting today? Maybe we can discuss there.

@RRap0so
Copy link
Contributor Author

RRap0so commented Mar 15, 2024

Sorry for the delay. In that case, can we just add the ExecutionClusterLabel object to the ExecutionSpec message? And then have admin respect that as an override?

Will you be at the contributor meeting today? Maybe we can discuss there.

Sorry it was a bit late for me to attend, so we've done that already if you take a look at this alternative. #4956 if this is the way forward, happy to close this RFC and continue there.

@wild-endeavor
Copy link
Contributor

This is the way forward yes. Can you point me to where the alternative is? The link to #4956 just points back to this pull request. Did you already make a PR that adds ExecutionClusterLabel to ExecutionSpec?

@RRap0so
Copy link
Contributor Author

RRap0so commented Mar 19, 2024

This is the way forward yes. Can you point me to where the alternative is? The link to #4956 just points back to this pull request. Did you already make a PR that adds ExecutionClusterLabel to ExecutionSpec?

@wild-endeavor 🤦 here's the correct link. We've added it in the cluster_assignment to make this a bit more clean but happy to change.

@wild-endeavor
Copy link
Contributor

hey @RRap0so this is the ExecutionClusterLabel message that we talked about

@RRap0so
Copy link
Contributor Author

RRap0so commented Mar 20, 2024

We've decided to take the alternative approach of adding the ExecutionClusterLabel into the execution spec.

For more details see Issue #5081 and PR #4998

@RRap0so RRap0so closed this Mar 21, 2024
@RRap0so RRap0so deleted the rfc-cluster-pool branch March 21, 2024 06:26
@fg91
Copy link
Member

fg91 commented May 16, 2024

@RRap0so @iaroslav-ciupin I would appreciate your input on this 🙏

Based on your PR which exposes ExecutionClusterLabel when creating an execution, this PR is exposing this option in pyflyte run. Because we really need this functionality as well, I just tested this and can confirm it works - all good here. (Thank you for your work in the backend!)

My question is this:

This PR flyteorg/flytekit#1208 a few months ago exposed --cluster-pool in pyflyte which is what you @RRap0so proposed in this RFC, correct?.

If I understood you correctly in the discussions in the contrib sync, the cluster pool logic was never fully implemented in the backend, is this correct or did I misunderstand you?

I'm wondering whether all this "cluster pool" logic is dead code that never worked and should be cleaned up.
@davidmirror-ops FYI would be good to revisit this in the contrib sync.

@RRap0so
Copy link
Contributor Author

RRap0so commented May 16, 2024

This is correct as it stands right now there's a "placeholder" but not implemented in the flyteadmin backend.

Even tough we don't use it in the backend, it is possible other plugins (outside the flyteadmin backend) are actually using it? Maybe we can start by opening an issue, remove that code and bring it up the next contrib sync.

Great that it's growing into pyflyte, I'm also looking at potentially adding it to flytectl so might grab some inspiration :)

@davidmirror-ops
Copy link
Contributor

Thanks for bringing this up!
We can discuss this further. That code is being used by Union, so should not be deleted. We're looking for a way to better label/signal this type of work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request lgtm This PR has been approved by a maintainer rfc A label for RFC issues size:L This PR changes 100-499 lines, ignoring generated files.
Projects
Status: Implementation in progress
Development

Successfully merging this pull request may close these issues.

7 participants