# [RFC] Select cluster execution when triggering an execution #4956

## Conversation
This is amazing; I can think of several users who'd like to see a better mechanism to handle executions in multi-cluster scenarios.
**Codecov Report:** All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff            @@
##           master    #4956    +/-  ##
========================================
  Coverage   58.96%   58.97%
========================================
  Files         645      645
  Lines       55506    55561      +55
========================================
+ Hits        32730    32766      +36
- Misses      20177    20200      +23
+ Partials     2599     2595       -4
```
This seems like a great use case, and the required changes shouldn't introduce too much extra complexity 👍
> ## 8 Unresolved questions
>
> What takes priority Labels or Pools?
I would vote for labels to take priority, as those seem to be more execution-specific. I might be missing something, though.
Yes, the label usage seems more related to "I need this X type of cluster for this <PROJECT/Domain>".
So, to be sure: we first check whether a label is set; then, among the clusters under that label, whether a pool is set; if neither, we default to selection by weight (similar to what is done today).
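A minimal sketch of that precedence in Go, with hypothetical types and helper names (the RFC doesn't prescribe this exact shape):

```go
package clusterselection

import "math/rand"

// Cluster is a hypothetical type for illustration only.
type Cluster struct {
	Name   string
	Pool   string
	Weight float32
}

// Spec carries the (optional) label and pool requested for one execution.
type Spec struct {
	Label string
	Pool  string
}

// pickCluster applies the precedence discussed above: honor an execution
// cluster label first; within that label's clusters, narrow by pool if one
// is set; otherwise fall back to weighted random selection, as done today.
func pickCluster(spec Spec, byLabel map[string][]Cluster, all []Cluster) Cluster {
	candidates := all
	if spec.Label != "" {
		candidates = byLabel[spec.Label]
		if spec.Pool != "" {
			candidates = filterByPool(candidates, spec.Pool)
		}
	}
	return weightedPick(candidates)
}

func filterByPool(clusters []Cluster, pool string) []Cluster {
	var out []Cluster
	for _, c := range clusters {
		if c.Pool == pool {
			out = append(out, c)
		}
	}
	return out
}

// weightedPick selects a cluster at random, proportionally to its weight.
func weightedPick(clusters []Cluster) Cluster {
	if len(clusters) == 0 {
		return Cluster{} // no candidates; real code would return an error
	}
	var total float32
	for _, c := range clusters {
		total += c.Weight
	}
	r := rand.Float32() * total
	for _, c := range clusters {
		if r -= c.Weight; r <= 0 {
			return c
		}
	}
	return clusters[len(clusters)-1]
}
```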
> Proposed configuration:
> ```
> clusters:
> ```
As said in the contributors' sync: I love the proposal; my only wish would be to figure out whether there is a way to simplify this config by not having to configure a cluster twice, once for the existing project/domain mechanism and once for the new mechanism. It would be nice if the cluster configuration could be shared.
@fg91 I think it should be possible by expanding the ExecutionParameters struct. By adding the ExecutionClusterLabel to this struct and making sure we pass it along when executing, things should become simpler and we'd avoid having duplicate configs.
This will mean a bit of rewriting; since it's late, I'll cross-check my thinking and get back with some certainty next week.
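A rough sketch of what that expansion might look like, assuming flyteadmin's ExecutionParameters struct and the existing admin.ExecutionClusterLabel message from flyteidl; the field placement and import path are illustrative:

```go
package interfaces

import (
	// Import path may vary with the flyteidl version in use.
	"github.com/flyteorg/flyteidl/gen/pb-go/flyteidl/admin"
)

// Hypothetical excerpt; the real ExecutionParameters struct in
// flyteadmin carries many more fields.
type ExecutionParameters struct {
	// ... existing fields (inputs, labels, annotations, etc.) ...

	// When set, this is passed through at execution time so the
	// cluster selector can honor it directly, instead of duplicating
	// cluster configuration for a second selection mechanism.
	ExecutionClusterLabel *admin.ExecutionClusterLabel
}
```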
A couple of edge cases come to mind that would be great to clarify:
- Are cluster pools ever reserved or restricted? Can any execution target any cluster pool?
- What happens when a cluster pool AND an execution cluster are both specified in an execution spec?
- What about when a CreateExecution request targets a specific cluster but the hierarchical overrides include a (conflicting) cluster pool, or vice versa? Do we always read from the execution spec?

Also a general question on validation: what happens if flyteadmin is brought up with a config where cluster pools reference non-existent execution clusters? What if a matchable attribute is added with a cluster pool referencing a non-existent cluster pool?
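On the validation point, a startup-time check along these lines would catch pools referencing unknown clusters; this is a hypothetical sketch, not existing flyteadmin code:

```go
package config

import "fmt"

// validateClusterPools is a hypothetical startup check: every cluster a
// pool references must exist in the clusters config, so that flyteadmin
// refuses to come up rather than failing later at execution time.
func validateClusterPools(clusters map[string]struct{}, pools map[string][]string) error {
	for pool, members := range pools {
		for _, cluster := range members {
			if _, ok := clusters[cluster]; !ok {
				return fmt.Errorf("cluster pool %q references non-existent cluster %q", pool, cluster)
			}
		}
	}
	return nil
}
```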
@fg91 I've given it a shot in this PR. It's an alternative to what we propose in the RFC above, but it keeps the same idea; this way much of the code will be reused and minimal changes are required. @katrogan please take a look at the alternative here. This should also resolve the questions about pools vs. labels: we first honor whatever label was set on the execution, and then the project-level one.
Hey @RRap0so, late to the party on this one, but I'm not sure I understand. Could you help clear up some confusion for me please? Today admin already has the ability to do:
(And looking at how we do the weighting, you can do 3 and 7 also.) And then you can set a project or domain to use the
Sure thing @wild-endeavor. This RFC doesn't change what is done in terms of weights and picking the cluster; it just introduces another way to select the cluster a workflow runs in at execution time. The problem today is that we can only set this per project or domain, and we also want to define it per execution. We started this RFC by using the cluster_assignment since it was a field that already existed but wasn't fully implemented. There's an alternative PR that probably explains more simply what exactly we intend to do, and if that's the way forward, I'm happy to update the RFC accordingly.
Oh @RRap0so, are you just saying there's no way to pass an execution label at execution kick-off time? Because matchable overrides at the workflow level are also possible already, but I agree I'm not seeing the label at execution kick-off time.
@wild-endeavor exactly! We spotted the cluster assignment and noticed some artifacts about clusterPools, so we did the RFC to fully implement the concept, but the alternative PR is probably the quickest way: no duplicated configs, and enough for the use case.
Sorry for the delay. In that case, can we just add the ExecutionClusterLabel object to the ExecutionSpec message, and then have admin respect that as an override? Will you be at the contributor meeting today? Maybe we can discuss it there.
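For concreteness, a caller-side sketch of that suggestion using the flyteidl Go bindings, assuming a new ExecutionClusterLabel field on admin.ExecutionSpec (hypothetical at this point in the discussion) and an illustrative label value:

```go
package execution

import (
	"github.com/flyteorg/flyteidl/gen/pb-go/flyteidl/admin"
	"github.com/flyteorg/flyteidl/gen/pb-go/flyteidl/core"
)

// buildSpec is a hypothetical helper: it sets the proposed
// ExecutionClusterLabel field, which admin would treat as a
// per-execution override of project/domain matchable attributes.
func buildSpec(launchPlanID *core.Identifier) *admin.ExecutionSpec {
	return &admin.ExecutionSpec{
		LaunchPlan: launchPlanID,
		ExecutionClusterLabel: &admin.ExecutionClusterLabel{
			Value: "gpu-cluster", // illustrative label value
		},
	}
}
```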
Sorry, it was a bit too late for me to attend. We've done that already; take a look at this alternative: #4956. If this is the way forward, I'm happy to close this RFC and continue there.
This is the way forward, yes. Can you point me to where the alternative is? The link to #4956 just points back to this pull request. Did you already make a PR that adds
@wild-endeavor 🤦 here's the correct link. We've added it in the cluster_assignment to make this a bit cleaner, but I'm happy to change it.
Hey @RRap0so, this is the ExecutionClusterLabel message that we talked about.
We've decided to take the alternative approach of adding the ExecutionClusterLabel into the execution spec.
@RRap0so @iaroslav-ciupin I would appreciate your input on this 🙏 Based on your PR, which exposes the ExecutionClusterLabel, my question is this: PR flyteorg/flytekit#1208 a few months ago exposed cluster pools. If I understood you correctly in the discussions in the contrib sync, the cluster pool logic was never fully implemented in the backend; is this correct, or did I misunderstand you? I'm wondering whether all this "cluster pool" logic is dead code that never worked and should be cleaned up.
This is correct; as it stands right now there's a "placeholder", but it's not implemented in the flyteadmin backend. Even though we don't use it in the backend, is it possible other plugins (outside the flyteadmin backend) are actually using it? Maybe we can start by opening an issue, removing that code, and bringing it up at the next contrib sync. Great that it's growing into pyflyte; I'm also looking at potentially adding it to flytectl, so I might grab some inspiration :)
Thanks for bringing this up!
RFC to propose a new Cluster Execution implementation.