[RFC] Select cluster execution when triggering an execution #4956

RRap0so · 2024-02-26T13:57:57Z

RFC to propose a new Cluster Execution implementation.

Signed-off-by: Rafael Raposo <[email protected]>

davidmirror-ops · 2024-02-26T14:14:09Z

This is amazing, I can think of several users who'd like to see a better mechanism to handle executions in multi-cluster scenarios.
Thanks!

rfc/core language/4956-cluster-pools.md

codecov · 2024-02-26T14:18:20Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.97%. Comparing base (94e433b) to head (7f9cb8b).
Report is 31 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #4956   +/-   ##
=======================================
  Coverage   58.96%   58.97%           
=======================================
  Files         645      645           
  Lines       55506    55561   +55     
=======================================
+ Hits        32730    32766   +36     
- Misses      20177    20200   +23     
+ Partials     2599     2595    -4

Flag	Coverage Δ
unittests	`58.97% <ø> (+<0.01%)`	⬆️

Co-authored-by: sonjaer <[email protected]> Signed-off-by: Rafael Raposo <[email protected]>

rfc/core language/4956-cluster-pools.md

bstadlbauer

This seems like a great use case, and the required changes shouldn't introduce too much extra complexity 👍

bstadlbauer · 2024-02-29T13:23:20Z

rfc/core language/4956-cluster-pools.md

+
+## 8 Unresolved questions
+
+What takes priority Labels or Pools? 


I would vote for labels to take priority as those seem to be more execution specific. I might be missing something though

Yes, the label usage seems more related to "I need this X type of cluster for this <PROJECT/Domain>".

So, to be sure: we first check whether a label is set, then check whether a pool is set among the clusters under that label. If not, we fall back to picking by weight (similar to what is done today).
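The order described above (label first, then pool within that label's clusters, then weights) can be sketched roughly as follows. This is only an illustration of the proposed precedence, not flyteadmin code: the `Cluster` class, its `pool` field, and `select_cluster` are hypothetical names, and the error raised when nothing matches is an assumption.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Cluster:
    # Hypothetical stand-in for a configured execution cluster.
    id: str
    weight: float
    pool: Optional[str] = None  # hypothetical: pool this cluster belongs to


def select_cluster(
    clusters: List[Cluster],
    label_cluster_ids: Dict[str, List[str]],
    label: Optional[str] = None,
    pool: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> Cluster:
    rng = rng or random.Random()
    candidates = list(clusters)
    # 1. A label, when set, narrows the candidates first.
    if label is not None:
        allowed = set(label_cluster_ids.get(label, []))
        candidates = [c for c in candidates if c.id in allowed]
    # 2. A pool, when set, narrows the remaining clusters further.
    if pool is not None:
        candidates = [c for c in candidates if c.pool == pool]
    if not candidates:
        # Assumption: an unmatched label/pool is treated as an error.
        raise ValueError("no cluster matches the requested label/pool")
    # 3. Among whatever is left, pick by weight.
    weights = [c.weight for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With a label only, the pick is weighted across that label's clusters; adding a pool restricts the pick to the label's clusters that are also in that pool.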

Co-authored-by: Fabio M. Graetz, Ph.D. <[email protected]> Signed-off-by: Rafael Raposo <[email protected]>

fg91 · 2024-02-29T15:25:23Z

rfc/core language/4956-cluster-pools.md

+
+Proposed configuration:
+```
+    clusters:


As said in the contributors' sync:

Love the proposal, my only wish would be to figure out whether there is a way to simplify this config here by not having to configure a cluster twice, once for the existing project/domain mechanism, once for the new mechanism. Would be nice if the cluster configuration could be shared.

@fg91 I think it should be possible by expanding the ExecutionParameters struct. Adding the ExecutionClusterLabel to this struct and making sure we pass it when executing should make things simpler and avoid duplicate configs.

This will mean a bit of rewriting. Since it's late, I'll cross-check my thinking and get back with more certainty next week.

katrogan

A couple of edge cases come to mind which would be great to clarify:

- Are cluster pools ever reserved or restricted? Can any execution target any cluster pool?
- What happens when a cluster pool AND an execution cluster are specified in an execution spec?
- What about when a CreateExecution request targets a specific cluster but the hierarchical overrides include a (conflicting) cluster pool (or vice versa)? Do we always read from the execution spec?

Also a general question on validation: what happens if flyteadmin is brought up with a config where cluster pools reference non-existent execution clusters? What if a matchable attribute is added with a cluster pool referencing a non-existent cluster pool?

RRap0so · 2024-03-04T14:04:12Z

@fg91 I gave it a shot in this PR. Instead of adding a new field in the ExecutionTargetSpec struct, we're reusing the ClusterAssignment and adding it there.

This is an alternative to what we propose in the RFC above, but it keeps the same idea. This way much of the code will be reused and minimal changes are required.

@katrogan please take a look at the alternative here. This should resolve any question about pools vs labels. We will first honor whatever label was set on the execution, and then the one set on the project.

wild-endeavor · 2024-03-09T00:33:05Z

hey @RRap0so,

late to the party on this one, but i'm not sure I understand. Could you help clear up some confusion for me please?

Today admin already has the ability to do:

      labelClusterMap:
        production:
        - id: flyte-1-k8s
          weight: 0.3
        - id: flyte-2-k8s
          weight: 0.7

(And looking at how we do the weighting, you can do 3 and 7 also.)

And then you can set a project or domain to use the production label, and have it randomly pick between them. Does the behavior you're proposing differ from this? If so could you explain more please, I'm missing something.
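A minimal sketch of the weighted random pick described above, assuming weights are normalized by their sum (which is why 0.3/0.7 and 3/7 behave the same). It is not flyteadmin's actual implementation; `pick_cluster` and the dictionary layout are illustrative only.

```python
import random
from typing import Dict, List, Optional, Union

Entry = Dict[str, Union[str, float]]


def pick_cluster(entries: List[Entry], rng: Optional[random.Random] = None) -> str:
    """Pick a cluster id with probability proportional to its weight."""
    rng = rng or random.Random()
    total = sum(float(e["weight"]) for e in entries)
    # Scale the draw by the total so weights need not sum to 1.
    threshold = rng.random() * total
    cumulative = 0.0
    for entry in entries:
        cumulative += float(entry["weight"])
        if threshold < cumulative:
            return str(entry["id"])
    # Guard against floating-point rounding on the last bucket.
    return str(entries[-1]["id"])


production = [
    {"id": "flyte-1-k8s", "weight": 0.3},
    {"id": "flyte-2-k8s", "weight": 0.7},
]
production_int = [
    {"id": "flyte-1-k8s", "weight": 3},
    {"id": "flyte-2-k8s", "weight": 7},
]
```

Over many draws, both `production` and `production_int` should send roughly 70% of executions to `flyte-2-k8s`.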

RRap0so · 2024-03-11T12:43:41Z

Sure thing @wild-endeavor. This RFC does not change how weights are applied or how the cluster is picked; it just introduces a way to select which cluster a workflow runs in at execution time. The problem today is that we can only set this per project or domain, and we also want to set it per execution.

We started this RFC by using the cluster_assignment since it was a field that already existed but wasn't fully implemented.

There's an alternative PR that probably explains in a simpler way what exactly we intend to do and if this is the way forward, happy to update the RFC accordingly.

#4998

wild-endeavor · 2024-03-12T20:54:06Z

oh @RRap0so are you just saying there's no way to pass an execution label at execution kick-off time?

cuz matchable overrides at the workflow level is also possible already, but i agree i'm not seeing the label at execution kickoff time.

RRap0so · 2024-03-12T21:16:17Z

@wild-endeavor exactly! We spotted the cluster assignment and noticed some artifacts about clusterPools, so we wrote the RFC to fully implement the concept. The alternative PR is probably the quickest way forward: it avoids duplicating configs and is enough for the use case.

wild-endeavor · 2024-03-14T17:20:20Z

Sorry for the delay. In that case, can we just add the ExecutionClusterLabel object to the ExecutionSpec message? And then have admin respect that as an override?

Will you be at the contributor meeting today? Maybe we can discuss there.

RRap0so · 2024-03-15T11:32:06Z

Sorry, it was a bit late for me to attend. We've already done that; take a look at this alternative: #4956. If this is the way forward, happy to close this RFC and continue there.

wild-endeavor · 2024-03-19T18:31:10Z

This is the way forward yes. Can you point me to where the alternative is? The link to #4956 just points back to this pull request. Did you already make a PR that adds ExecutionClusterLabel to ExecutionSpec?

RRap0so · 2024-03-19T18:35:19Z

@wild-endeavor 🤦 here's the correct link. We've added it in the cluster_assignment to make this a bit more clean but happy to change.

wild-endeavor · 2024-03-20T17:08:53Z

hey @RRap0so this is the ExecutionClusterLabel message that we talked about

RRap0so · 2024-03-20T19:03:36Z

We've decided to take the alternative approach of adding the ExecutionClusterLabel into the execution spec.

For more details see Issue #5081 and PR #4998
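As a rough illustration of the override behaviour agreed on in this thread (a label on the execution spec wins over the one resolved from project/domain/workflow matchable attributes), here is a small sketch. The `ExecutionSpec` dataclass, its `execution_cluster_label` field, and `resolve_cluster_label` are simplified, hypothetical stand-ins, not the real flyteidl message or flyteadmin code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutionSpec:
    # Hypothetical, simplified stand-in for the execution spec message.
    execution_cluster_label: Optional[str] = None


def resolve_cluster_label(
    spec: ExecutionSpec, matchable_label: Optional[str] = None
) -> Optional[str]:
    """Return the label to use for cluster selection.

    A label set at execution kick-off acts as an override; otherwise the
    label already resolved from matchable attributes (if any) is used.
    """
    if spec.execution_cluster_label:
        return spec.execution_cluster_label
    return matchable_label
```

Returning `None` here stands for "no label at all", in which case the existing default cluster selection would apply.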

fg91 · 2024-05-16T10:32:56Z

@RRap0so @iaroslav-ciupin I would appreciate your input on this 🙏

Based on your PR which exposes ExecutionClusterLabel when creating an execution, this PR is exposing this option in pyflyte run. Because we really need this functionality as well, I just tested this and can confirm it works - all good here. (Thank you for your work in the backend!)

My question is this:

This PR flyteorg/flytekit#1208 a few months ago exposed --cluster-pool in pyflyte, which is what you @RRap0so proposed in this RFC, correct?

If I understood you correctly in the discussions in the contrib sync, the cluster pool logic was never fully implemented in the backend, is this correct or did I misunderstand you?

I'm wondering whether all this "cluster pool" logic is dead code that never worked and should be cleaned up.
@davidmirror-ops FYI would be good to revisit this in the contrib sync.

RRap0so · 2024-05-16T10:56:42Z

This is correct. As it stands right now, there's a "placeholder", but it is not implemented in the flyteadmin backend.

Even though we don't use it in the backend, is it possible that other plugins (outside the flyteadmin backend) are actually using it? Maybe we can start by opening an issue, removing that code, and bringing it up at the next contrib sync.

Great that it's growing into pyflyte, I'm also looking at potentially adding it to flytectl so might grab some inspiration :)

davidmirror-ops · 2024-05-16T18:21:11Z

Thanks for bringing this up!
We can discuss this further. That code is being used by Union, so should not be deleted. We're looking for a way to better label/signal this type of work.

dosubot bot added the size:L and enhancement labels Feb 26, 2024

rfc-cluster-pools

77a14d1

Signed-off-by: Rafael Raposo <[email protected]>

RRap0so force-pushed the rfc-cluster-pool branch from 7054c72 to 77a14d1 February 26, 2024 13:58

Pull Request number

bc1de09

Signed-off-by: Rafael Raposo <[email protected]>

sonjaer reviewed Feb 26, 2024

Update rfc/core language/4956-cluster-pools.md

203f0e0

Co-authored-by: sonjaer <[email protected]> Signed-off-by: Rafael Raposo <[email protected]>

davidmirror-ops added the rfc label Feb 26, 2024

fg91 reviewed Feb 29, 2024

bstadlbauer previously approved these changes Feb 29, 2024

dosubot bot added the lgtm label Feb 29, 2024

RRap0so dismissed bstadlbauer’s stale review via 7f9cb8b February 29, 2024 13:51

Update rfc/core language/4956-cluster-pools.md

7f9cb8b

Co-authored-by: Fabio M. Graetz, Ph.D. <[email protected]> Signed-off-by: Rafael Raposo <[email protected]>

fg91 reviewed Feb 29, 2024

katrogan reviewed Feb 29, 2024

RRap0so mentioned this pull request Mar 4, 2024

Allow setting a ExecutionClusterLabel when triggering a Launchplan/Workflow/Task #4998

Merged

3 tasks

RRap0so mentioned this pull request Mar 20, 2024

[Core feature] Allow setting a ExecutionClusterLabel when triggering a Launchplan/Workflow/Task #5081

Closed

2 tasks

RRap0so closed this Mar 21, 2024

RRap0so deleted the rfc-cluster-pool branch March 21, 2024 06:26

runllm bot mentioned this pull request May 20, 2024

[Core feature] Allow flytectl to set a targetExecutionCluster #5395

Closed

2 tasks

fg91 mentioned this pull request May 22, 2024

Add support for specifying execution cluster labels in pyflyte flyteorg/flytekit#2422

Merged

3 tasks
