KEP-5313: Placement Decision API for multicluster scheduling #5314
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: mikeshng. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Hi @mikeshng. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@mikeshng: GitHub didn't allow me to assign the following users: zhiying-lin. Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/cc @awels
@iholder101: GitHub didn't allow me to request PR reviews from the following users: awels. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed 9ca10ab to 3406d3d
Trying to simplify it a bit.
At the same time, I will try to suggest sharing our MCO one as the Placement.
* Support continuous rescheduling: decision list may be updated.
* Guarantee that every `clusterName` entry matches a `ClusterProfile.metadata.name` in the same inventory.
* Guarantee that every `clusterName` entry is in the same namespace as `PlacementDecision.metadata.namespace`.
* Provide label conventions so consumers can retrieve all slices of one placement.
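For illustration, here is a minimal sketch of how a consumer could use the label convention in the last bullet to gather every slice of one placement and aggregate the selected clusters, in the style of EndpointSlice consumers. The API group/version, the `multicluster.x-k8s.io/placement` label key, and the location of `decisions` under `status` are assumptions for the sketch, not text from the KEP.

```go
package consumer

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// selectedClusters aggregates clusterName entries across every
// PlacementDecision slice labeled for one placement, the same way
// EndpointSlice consumers aggregate the slices of one Service.
func selectedClusters(ctx context.Context, c client.Client, namespace, placement string) ([]string, error) {
	list := &unstructured.UnstructuredList{}
	list.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "multicluster.x-k8s.io", // assumed API group
		Version: "v1alpha1",              // assumed version
		Kind:    "PlacementDecisionList",
	})
	if err := c.List(ctx, list,
		client.InNamespace(namespace),
		// Hypothetical label key identifying the owning placement.
		client.MatchingLabels{"multicluster.x-k8s.io/placement": placement},
	); err != nil {
		return nil, err
	}

	var clusters []string
	for _, slice := range list.Items {
		// Assumed field path: status.decisions[].clusterName.
		decisions, _, _ := unstructured.NestedSlice(slice.Object, "status", "decisions")
		for _, d := range decisions {
			if m, ok := d.(map[string]interface{}); ok {
				if name, ok := m["clusterName"].(string); ok {
					clusters = append(clusters, name)
				}
			}
		}
	}
	return clusters, nil
}
```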
I wonder if we actually need this added complexity yet? How many clusters would a placement REALLY need to target? If we get more than 100, maybe we're not using the right grouping/abstraction?
Yeah, it sounds like a bit of overdesign here.
Mike, can you give an example of what kind of workload needs to be placed on more than 100 clusters?
There are fleets running across thousands of clusters, especially in edge and telco cases. While not an everyday case, we have customers who push configs to all of those clusters at once.
The core issue with dumping large lists into a single object is that every change results in a large write. K8s API authors already designed APIs like EndpointSlice to handle watch/write churn and work around etcd limits, so it makes sense to follow established conventions. Expecting users to manually shard their Placement objects to avoid etcd limits or expensive writes feels like a step backward in API design. CC @deads2k
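To make the EndpointSlice analogy concrete, a scheduler could shard one large decision list into fixed-size chunks and write each chunk as its own PlacementDecision labeled for the placement. A minimal sketch; the chunk size of 100 simply follows the number discussed in this thread and is not settled API:

```go
// chunkClusters splits the full list of selected cluster names into
// fixed-size chunks, each of which would become one PlacementDecision
// slice, so no single object (and no single write) grows unbounded.
func chunkClusters(clusters []string, maxPerSlice int) [][]string {
	var chunks [][]string
	for len(clusters) > 0 {
		n := maxPerSlice
		if len(clusters) < n {
			n = len(clusters)
		}
		chunks = append(chunks, clusters[:n])
		clusters = clusters[n:]
	}
	return chunks
}
```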
I think a "daemonset" type of workload could be placed on a large number of clusters, e.g. something like NetworkPolicy or FlowSchema.
forcing downstream tools such as GitOps engines, workload orchestrators, progressive rollout controllers, or AI/ML pipelines to understand a scheduler-specific API.

This KEP introduces a vendor-neutral `PlacementDecision` API that standardizes
Let's actually call it Placement? That gives us a chance to align the Spec later.
We are using the name PlacementDecision on purpose to show it is only responsible for the scheduler's data-only answer to "which clusters should be used?" It's different from any future standard Placement API (or current vendor-specific Placement/Scheduling APIs) that defines the request/spec driving that decision.
### Non-Goals

* Describing how a scheduler made its choice (Placement API spec).
One of our plans for MCO is to publish events on the "why"; so while the placement itself shouldn't care, the end user may care (for debugging purposes).
Added the Reason field for end users.
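For reference, the shape under discussion might look roughly like the sketch below; the field names and comments are assumptions drawn from this thread, not the final API:

```go
// ClusterDecision names one selected cluster.
type ClusterDecision struct {
	// ClusterName matches a ClusterProfile.metadata.name in the same
	// inventory and namespace.
	ClusterName string `json:"clusterName"`

	// Reason optionally records a human-readable explanation of why the
	// scheduler selected this cluster, for end users debugging a placement.
	// +optional
	Reason string `json:"reason,omitempty"`
}
```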
* Describing how a scheduler made its choice (Placement API spec).
* Describing how consumers access selected clusters.
* Embedding orchestration logic or consumer feedback in `PlacementDecision`.
For orchestration, we realized in MCO that one state that was really interesting was "drain", e.g. when we want to get out of a cluster, but slowly.
What did you have in mind for consumer feedback?
I remember Liqian asked the question at the last community meeting: "How do we know the status, i.e. whether the decision has been consumed?"
Even though we haven't seriously talked about the Placement API, the Placement API should reference which workload goes where, and the status. If there is no status on PlacementDecision, where do we get the status?
AFAIK, the overall design has another layer, a "syncer", which takes the placement decision combined with the workloads and executes it. I guess the status would be on that syncer.
With that said, while this design is flexible, I feel that the e2e UX may not be ideal since users need to jump from one place to another again and again.
Just to elaborate more, the flow seems to be like this:
- the user creates a Placement API object that somehow contains the placement policy reflecting what the workload needs
- the placement controller emits a PlacementDecision object
- the user then feeds the decision and the workload definition to a syncer API
- the user then monitors the output of the syncer object and adjusts the placement policy accordingly
PlacementDecision intentionally omits any scheduling/orchestration spec. Added a "Consumer Feedback" section to clarify that feedback is out of scope for this resource and should be handled by a separate mechanism.
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`

// Up to 100 ClusterDecisions per object (slice) to stay well below the etcd limit.
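Filling out the quoted fragment, the 100-decision cap could be expressed with a validation marker along these lines. This is only a sketch: the placement of the list under status and the exact markers are assumptions, and it reuses the ClusterDecision shape sketched earlier in this thread.

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// PlacementDecision carries one slice of the scheduler's answer for a
// placement; consumers aggregate all slices sharing the placement label.
type PlacementDecision struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Status PlacementDecisionStatus `json:"status,omitempty"`
}

// PlacementDecisionStatus lists the clusters selected in this slice.
type PlacementDecisionStatus struct {
	// Up to 100 ClusterDecisions per object (slice) to stay well below
	// the etcd object-size limit.
	// +kubebuilder:validation:MaxItems=100
	// +optional
	Decisions []ClusterDecision `json:"decisions,omitempty"`
}
```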
Should we remove this limitation of 100 clusters? That way we could avoid the whole idea of having to compose multiple PlacementDecision CRs together.
I am not sure we can avoid multiple PlacementDecision CRs given the etcd limit.
I agree with Ryan. See #5314 (comment)
Force-pushed 3406d3d to 881956f
Force-pushed 11ecf7b to 280c3b3
Signed-off-by: Mike Ng <[email protected]>
Force-pushed 280c3b3 to 971facb
/sig multicluster