KEP-5313: Placement Decision API for multicluster scheduling #5314

Open · wants to merge 1 commit into master
Conversation

mikeshng
Contributor

  • One-line PR description: Add a new KEP to introduce the Placement Decision API for multicluster scheduling
  • Other comments:

/sig multicluster

@k8s-ci-robot k8s-ci-robot added the sig/multicluster Categorizes an issue or PR as relevant to SIG Multicluster. label May 17, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mikeshng
Once this PR has been reviewed and has the lgtm label, please assign skitt for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 17, 2025
@k8s-ci-robot k8s-ci-robot requested review from JeremyOT and skitt May 17, 2025 23:48
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 17, 2025
@k8s-ci-robot
Contributor

Hi @mikeshng. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 17, 2025
@mikeshng
Contributor Author

@k8s-ci-robot
Contributor

@mikeshng: GitHub didn't allow me to assign the following users: zhiying-lin.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @deads2k @RainbowMango @zhiying-lin

CC @corentone @elgnay @haoqing0110 @jnpacker @qiujian16 @ryanzhang-oss

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@iholder101
Contributor

/cc @awels
FYI

@k8s-ci-robot
Contributor

@iholder101: GitHub didn't allow me to request PR reviews from the following users: awels.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @awels
FYI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mikeshng mikeshng force-pushed the placement-decision-api branch from 9ca10ab to 3406d3d (May 19, 2025 16:02)
@corentone corentone left a comment

Trying to simplify it a bit.

At the same time, will try to suggest sharing our MCO one as the placement.

* Support continuous rescheduling: decision list may be updated.
* Guarantee that every `clusterName` entry matches a `ClusterProfile.metadata.name` in the same inventory.
* Guarantee that every `clusterName` entry is in the same namespace as `PlacementDecision.metadata.namespace`.
* Provide label conventions so consumers can retrieve all slices of one placement.


I wonder if we actually need this added complexity yet? How many clusters would a placement REALLY need to target? If we get to more than 100, maybe we're not using the right grouping/abstraction?

Member

Yeah, it sounds a bit over-designed here.
Mike, can you give an example of a workload that needs to be placed on more than 100 clusters?

Contributor Author

There are fleets running across thousands of clusters, especially in edge and telco cases. While not an everyday case, we have customers who push configs to all of those clusters at once.
The core issue with dumping large lists into a single object is that every change results in a large write. K8s API authors already designed APIs like EndpointSlice to handle watch/write churn and work around etcd limits, so it makes sense to follow established conventions. Expecting users to manually shard their Placement objects to avoid etcd limits or expensive writes feels like a step backward in API design. CC @deads2k

Contributor

I think a "daemonset" type of workload could be placed on a large number of clusters, e.g. something like NetworkPolicy/FlowSchema.

forcing downstream tools such as a GitOps engine, workload orchestrator, progressive rollout controller,
or AI/ML pipeline to understand a scheduler-specific API.

This KEP introduces a vendor-neutral `PlacementDecision` API that standardizes


Let's actually call it Placement? That gives us a chance to align the Spec later.

Contributor Author

We are using the name PlacementDecision on purpose to show it is responsible only for the scheduler's data-only answer to "which clusters should be used?" It's different from any future standard Placement API (or current vendor-specific Placement/Scheduling APIs) that defines the request/spec driving that decision.


### Non-Goals

* Describing how a scheduler made its choice (Placement API spec).


One of our plans for MCO is to publish events on the why; so while the placement itself shouldn't care, the end user may care (for debugging purposes).

Contributor Author

Added the Reason field for end users.


* Describing how a scheduler made its choice (Placement API spec).
* Describing how consumers access selected clusters.
* Embedding orchestration logic or consumer feedback in `PlacementDecision`.


For orchestration, we realized in MCO that one state that was really interesting was "drain". E.g. we want to get out of a cluster but slowly.

what did you have in mind for consumer feedback?

Member

I remember Liqian asked a question at the last community meeting: "How do we know the status, i.e. whether the decision has been consumed?"

Even though we haven't seriously talked about the Placement API, the Placement API should reference which workload goes where, and its status. If there is no status on PlacementDecision, where do we get the status?

Contributor

AFAIK, the overall design has another layer, a "syncer", which takes the placement decision combined with the workloads and executes it. I guess the status would be on that syncer.
That said, while this design is flexible, I feel the e2e UX may not be ideal, since users need to jump from one place to another again and again.

Contributor

just to elaborate more, the flow seems to be like this:

  1. the user creates a Placement API object that somehow contains the placement policy reflecting what the workload needs
  2. the placement controller emits a PlacementDecision object
  3. the user then feeds the decision and the workload definition to a syncer API
  4. the user then monitors the output of the syncer object and adjusts the placement policy accordingly

Contributor Author

PlacementDecision intentionally omits any scheduling/orchestration spec. Added a "Consumer Feedback" section to clarify that feedback is out of scope for this resource and should be handled by a separate mechanism.

metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`

// Up to 100 ClusterDecisions per object (slice) to stay well below the etcd limit.


Should we remove this limitation of 100 clusters? That way we could avoid the whole idea of having to compose multiple PlacementDecision CRs together.

Contributor

I am not sure we can get away from multiple PlacementDecision CRs given the etcd limit.

Contributor Author

I agree with Ryan. See #5314 (comment)

@mikeshng mikeshng force-pushed the placement-decision-api branch from 3406d3d to 881956f (May 25, 2025 15:34)
@mikeshng mikeshng force-pushed the placement-decision-api branch 3 times, most recently from 11ecf7b to 280c3b3 (May 27, 2025 20:49)
@mikeshng mikeshng force-pushed the placement-decision-api branch from 280c3b3 to 971facb (May 27, 2025 21:56)