Then there's also the issue of co-scheduling resources that depend on each other -- resources don't exist in a vacuum. For example, a Deployment might depend on a ServiceAccount and a Namespace with the "as-needed" strategy, but also on a PVC with an "any" strategy that has already been scheduled to some cluster. This means the Deployment isn't actually "split"; it's "split among wherever my dependencies are available" -- which might end up being just one cluster. To handle this we need to be able to detect those dependencies, ideally without encoding any knowledge about what a "Deployment" is or depends on, and obviously without modifying the type. We could have some hacky heuristics, like searching for fields with reference-like names.

When we detect a dependency, we should persist and report it, probably in an annotation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  annotations:
    kcp.dev/dependencies: |
      [{
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "name": "my-pvc",
        "detected": true
      }]
```

Resource authors should also be able to explicitly describe an object's dependencies:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  annotations:
    # put this Pod in the same cluster as another one
    kcp.dev/dependencies: |
      [{
        "apiVersion": "v1",
        "kind": "Pod",
        "name": "pod-friend",
        "detected": false
      }]
```

...which could tie in to non-cluster resource dependencies:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  annotations:
    kcp.dev/dependencies: |
      [{
        "apiVersion": "crossplane.io/v1alpha1",
        "kind": "Database",
        "name": "db-prod",
        "detected": false
      }]
```
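For illustration, here's a minimal Go sketch of how a controller might read and append entries in that annotation. The `Dependency` struct mirrors the fields shown above; the function names (`parseDependencies`, `recordDependency`) are made up for this sketch, not existing kcp APIs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Dependency mirrors one entry of the kcp.dev/dependencies annotation above.
type Dependency struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	Name       string `json:"name"`
	// Detected is true when the dependency was inferred by the scheduler,
	// false when it was declared explicitly by the resource author.
	Detected bool `json:"detected"`
}

const dependenciesAnnotation = "kcp.dev/dependencies"

// parseDependencies reads the annotation value (a JSON array) into structs.
func parseDependencies(annotations map[string]string) ([]Dependency, error) {
	raw, ok := annotations[dependenciesAnnotation]
	if !ok {
		return nil, nil
	}
	var deps []Dependency
	if err := json.Unmarshal([]byte(raw), &deps); err != nil {
		return nil, fmt.Errorf("parsing %s: %w", dependenciesAnnotation, err)
	}
	return deps, nil
}

// recordDependency appends a dependency (e.g. one found by a heuristic) and
// writes the merged list back to the annotation, skipping duplicates.
func recordDependency(annotations map[string]string, dep Dependency) error {
	deps, err := parseDependencies(annotations)
	if err != nil {
		return err
	}
	for _, d := range deps {
		if d.APIVersion == dep.APIVersion && d.Kind == dep.Kind && d.Name == dep.Name {
			return nil // already recorded
		}
	}
	out, err := json.Marshal(append(deps, dep))
	if err != nil {
		return err
	}
	annotations[dependenciesAnnotation] = string(out)
	return nil
}

func main() {
	annotations := map[string]string{}
	_ = recordDependency(annotations, Dependency{
		APIVersion: "v1", Kind: "PersistentVolumeClaim", Name: "my-pvc", Detected: true,
	})
	fmt.Println(annotations[dependenciesAnnotation])
}
```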
Discussed in #91
A generalized, type-unaware splitter/scheduler will need some general set of strategies it can use to schedule a resource to underlying clusters.
### Example strategies
- For the Deployment splitter as it is today, that strategy is "split", based on `.spec.replicas` -- i.e., when given a Deployment, create N other Deployments (where N is # of clusters), where each gets `.spec.replicas / N`. This ignores scheduling constraints for now. (A sketch of the arithmetic follows this list.)
- For a DaemonSet scheduler, where we want to run one replica on each node of each cluster, the strategy might be "copy" -- when given a DaemonSet, create N copies (N = # of clusters), where each is an exact copy of the original DaemonSet.
- For a Pod scheduler, the strategy might be "any" -- when given a Pod, select a cluster (at random, maybe) and label the Pod to be synced to that cluster.
- For a Namespace or ServiceAccount, the strategy could be "as-needed" -- when they're created in a kcp, don't do anything, but before syncing something that needs them down to a cluster (e.g., a Pod with `namespace: foo` and `serviceAccountName: bot`), ensure those resources are also created. This also implies cleaning them up when the dependent object is deleted.

This is an incomplete list (suggest more!), and CRD authors will inevitably want to define their own, but we can start with some common ones to get to 85% of real use cases.
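As a rough sketch of the "split" arithmetic -- the `splitReplicas` helper below is hypothetical, not the actual splitter code -- dividing replicas across N clusters and handing out the remainder so the per-cluster counts sum to the original:

```go
package main

import "fmt"

// splitReplicas divides a Deployment's .spec.replicas across n clusters.
// Clusters listed first receive the remainder, so the per-cluster counts
// always sum to the original total.
func splitReplicas(replicas int32, n int) []int32 {
	out := make([]int32, n)
	if n == 0 {
		return out
	}
	base := replicas / int32(n)
	rem := replicas % int32(n)
	for i := range out {
		out[i] = base
		if int32(i) < rem {
			out[i]++
		}
	}
	return out
}

func main() {
	// 10 replicas across 3 clusters -> [4 3 3]
	fmt.Println(splitReplicas(10, 3))
}
```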
CRD authors (and K8s built-in types, which are CRDs now) should be able to choose the kcp scheduling strategy for their type, and we should try to find some sane default that won't surprise people too much, if possible.
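One possible shape for that, sketched in Go -- the registry, the hard-coded entries, and the default below are illustrative assumptions for this sketch, not an existing kcp API:

```go
package main

import "fmt"

// Strategy names match the examples above.
type Strategy string

const (
	StrategySplit    Strategy = "split"
	StrategyCopy     Strategy = "copy"
	StrategyAny      Strategy = "any"
	StrategyAsNeeded Strategy = "as-needed"
)

// strategyRegistry maps a type (group/kind) to its scheduling strategy.
// In practice this could be populated per type by its author; hard-coded here.
var strategyRegistry = map[string]Strategy{
	"apps/Deployment":     StrategySplit,
	"apps/DaemonSet":      StrategyCopy,
	"core/Pod":            StrategyAny,
	"core/Namespace":      StrategyAsNeeded,
	"core/ServiceAccount": StrategyAsNeeded,
}

// strategyFor returns the registered strategy, falling back to a default
// ("copy" is an arbitrary choice for this sketch) for unknown types.
func strategyFor(groupKind string) Strategy {
	if s, ok := strategyRegistry[groupKind]; ok {
		return s
	}
	return StrategyCopy
}

func main() {
	fmt.Println(strategyFor("apps/Deployment"))    // split
	fmt.Println(strategyFor("example.com/Widget")) // copy (default)
}
```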
### Syncing / aggregating status
Splitting objects/specs is only half of the story; once the syncer updates the status of the split/copied/whatever resource, the scheduler will also need to know how it should aggregate/summarize that status back to the original object in kcp. For Pods ("any" strategy) and anything without status, that's pretty trivial.
For DaemonSets ("copy" strategy) and Deployments, the status needs to aggregate, for example,
.status.numberReady
,.status.readyReplicas
, by adding each cluster's observed ready replicas. The name of the field(s) that need to be aggregated, and how, needs to be described by the type author.@smarterclayton
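A minimal sketch of that aggregation, with the types trimmed to just the field being summed (nothing below is actual kcp or Kubernetes API code):

```go
package main

import "fmt"

// deploymentStatus is trimmed to the one field this sketch aggregates.
type deploymentStatus struct {
	ReadyReplicas int32
}

// aggregateReadyReplicas sums each cluster's observed ready replicas so the
// original object in kcp reports the total across all clusters.
func aggregateReadyReplicas(perCluster map[string]deploymentStatus) deploymentStatus {
	var total deploymentStatus
	for _, s := range perCluster {
		total.ReadyReplicas += s.ReadyReplicas
	}
	return total
}

func main() {
	statuses := map[string]deploymentStatus{
		"cluster-east": {ReadyReplicas: 4},
		"cluster-west": {ReadyReplicas: 3},
	}
	fmt.Println(aggregateReadyReplicas(statuses).ReadyReplicas) // 7
}
```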