Shoot Control Plane Migration

Motivation

Currently, moving the control plane of a shoot cluster from one seed to another can only be done manually and requires deep knowledge of exactly which resources and state have to be transferred between the seeds. This makes the operation slow and error-prone.

Automatic migration can be very useful in a couple of scenarios:

  • A seed goes down and can't be repaired (fast enough, or at all), and its control planes need to be brought to another seed
  • A seed needs to be changed in a way that requires its recreation (e.g. turning a single-AZ seed into a multi-AZ seed)
  • Seeds need to be rebalanced
  • New seeds become available closer to (or in) the region of the workers, and the control plane should be moved there to improve latency
  • Building a Gardener ring: a self-supporting setup/underlay for a highly available (usually cross-region) Gardener deployment

Goals

  • Provide a mechanism to migrate the control plane of a shoot cluster from one seed to another
  • The mechanism should support migration from a seed which is no longer reachable (Disaster Recovery)
  • The shoot cluster nodes are preserved and continue to run the workload, but will talk to the new control plane after the migration completes
  • Extension controllers implement a mechanism which allows them to store their state or to be restored from an already existing state on a different seed cluster.
  • The already existing shoot reconciliation flow is reused for migration with minimal changes

Terminology

Source Seed is the seed which currently hosts the control plane of a Shoot Cluster

Destination Seed is the seed to which the control plane is being migrated

Resources and controller state which have to be migrated between two seeds:

Note: The following lists are informational and show the resources which currently need to be moved to the Destination Seed

Secrets

Gardener has preconfigured lists of needed secrets which are generated when a shoot is created and deployed in the seed. Following is a minimum set of secrets which must be migrated to the Destination Seed. Other secrets can be regenerated from them.

  • ca
  • ca-front-proxy
  • static-token
  • ca-kubelet
  • ca-metrics-server
  • etcd-encryption-secret
  • kube-aggregator
  • kube-apiserver-basic-auth
  • kube-apiserver
  • service-account-key
  • ssh-keypair

Custom Resources and state of extension controllers

Gardenlet deploys custom resources in the Source Seed cluster during shoot reconciliation which are reconciled by extension controllers. The state of these controllers, and any additional resources they create, is independent of the gardenlet and must also be migrated to the Destination Seed. The following is a list of the custom resources and the state generated for them which has to be migrated.

  • BackupBucket: nothing relevant for migration
  • BackupEntry: nothing relevant for migration
  • ControlPlane: nothing relevant for migration
  • DNSProvider/DNSEntry: nothing relevant for migration
  • Extensions: migration of state needs to be handled individually
  • Infrastructure: terraform state
  • Network: nothing relevant for migration
  • OperatingSystemConfig: nothing relevant for migration
  • Worker: Machine-Controller-Manager related objects: machineclasses, machinedeployments, machinesets, machines

This list depends on the currently installed extensions and can change in the future.

Proposal

Custom Resource on the garden cluster

The Garden cluster has a new Custom Resource which is stored in the project namespace of the Shoot called ShootState. It contains all the required data described above so that the control plane can be recreated on the Destination Seed.

This data is separated into two sections. The first is generated by the gardenlet and then either used to generate new resources (e.g. secrets) or is directly deployed to the Shoot's control plane on the Destination Seed.

The second is generated by the extension controllers in the seed.

apiVersion: core.gardener.cloud/v1alpha1
kind: ShootState
metadata:
  name: my-shoot
  namespace: garden-core
  ownerReferences:
  - apiVersion: core.gardener.cloud/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Shoot
    name: my-shoot
    uid: ...
  finalizers:
  - gardener
gardenlet:
  secrets:
  - name: ca
    data:
      ca.crt: ...
      ca.key: ...
  - name: ssh-keypair
    data:
      id_rsa: ...
  - name:
...
extensions:
- kind: Infrastructure
  state: ... (Terraform state)
- kind: ControlPlane
  purpose: normal
  state: ... (Certificates generated by the extension)
- kind: Worker
  state: ... (Machine objects)

The state data is saved as a runtime.RawExtension type, which can be encoded/decoded by the corresponding extension controller.

There can be sensitive data in the ShootState which has to be hidden from end-users. Hence, it is recommended to provide an etcd encryption configuration to the Gardener API server in order to encrypt the ShootState resource at rest.

Size limitations

There are limits on the size of the request bodies sent to the kubernetes API server when creating or updating resources:

  • By default, ETCD only accepts request bodies which do not exceed 1.5 MiB (configurable with the --max-request-bytes flag).
  • The kubernetes API server has a request body limit of 3 MiB which cannot be changed from the outside (there is no command line flag for it).
  • The gRPC configuration used by the API server to talk to ETCD has a limit of 2 MiB per request body which cannot be configured from the outside.
  • Watch requests have a 16 MiB limit on the buffer used to stream resources.

This means that if ShootState is bigger than 1.5 MiB, the ETCD max request bytes will have to be increased. However, there is still an upper limit of 2 MiB imposed by the gRPC configuration.

If ShootState exceeds this size limitation it must make use of configmap/secret references to store the state of extension controllers. This is an implementation detail of Gardener and can be done at a later time if necessary as extensions will not be affected.

Splitting the ShootState into multiple resources could also benefit performance, as the Gardener API Server and Gardener Controller Manager would handle multiple small resources instead of one big resource.

Gardener extensions changes

All extension controllers which require state migration must save their state in a new status.state field and act on an annotation gardener.cloud/operation=restore in the respective Custom Resources which should trigger a restoration operation instead of reconciliation. A restoration operation means that the extension has to restore its state in the Shoot's namespace on the Destination Seed from the status.state field.

As an example, the Infrastructure resource must save the terraform state:

apiVersion: extensions.gardener.cloud/v1alpha1
kind: Infrastructure
metadata:
  name: infrastructure
  namespace: shoot--foo--bar
spec:
  type: azure
  region: eu-west-1
  secretRef:
    name: cloudprovider
    namespace: shoot--foo--bar
  providerConfig:
    apiVersion: azure.provider.extensions.gardener.cloud/v1alpha1
    kind: InfrastructureConfig
    resourceGroup:
      name: mygroup
    networks:
      vnet: # specify either 'name' or 'cidr'
      # name: my-vnet
        cidr: 10.250.0.0/16
      workers: 10.250.0.0/19
status:
  state: |
      {
          "version": 3,
          "terraform_version": "0.11.14",
          "serial": 2,
          "lineage": "3a1e2faa-e7b6-f5f0-5043-368dd8ea6c10",
          "modules": [
              {
              }
          ]
          ...
      }

Extensions which do not require state migration should set status.state=nil in their Custom Resources and trigger a normal reconciliation operation if the CR contains the gardener.cloud/operation=restore annotation.

Similar to the contract for the reconcile operation, the extension controller has to remove the restore annotation after the restoration operation has finished.

An additional annotation gardener.cloud/operation=migrate is added to the Custom Resources. It is used to tell the extension controllers in the Source Seed that they must stop reconciling resources (in case they are requeued due to errors) and should perform cleanup activities in the Shoot's control plane. These cleanup activities involve removing the finalizers on Custom Resources and deleting them without actually deleting any infrastructure resources.
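Taken together, the restore and migrate annotations give extension controllers a three-way dispatch. A minimal sketch under stated assumptions (the helper and return values are illustrative; real controllers build on the Gardener extensions library):

```go
package main

import "fmt"

// Annotation key and values from this proposal.
const (
	opAnnotation = "gardener.cloud/operation"
	opRestore    = "restore"
	opMigrate    = "migrate"
)

// decideOperation sketches how an extension controller dispatches on the
// operation annotation of its Custom Resource.
func decideOperation(annotations map[string]string) string {
	switch annotations[opAnnotation] {
	case opRestore:
		// Restore state in the Shoot's namespace from status.state, then
		// remove the annotation (same contract as for reconcile).
		return "restore"
	case opMigrate:
		// Stop reconciling; remove finalizers and delete the CRs without
		// deleting any infrastructure resources.
		return "migrate"
	default:
		return "reconcile"
	}
}

func main() {
	fmt.Println(decideOperation(map[string]string{opAnnotation: opRestore})) // restore
	fmt.Println(decideOperation(nil))                                        // reconcile
}
```

An extension without persisted state would simply treat the restore case as a normal reconciliation, as described above.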

Note: The same size limitations from the previous section are relevant here as well.

Shoot reconciliation flow changes

The only data which must be stored in the ShootState by the gardenlet is secrets (e.g. the ca secret for the API server). Therefore, the botanist.DeploySecrets step is changed: it is split into two functions which take a list of secrets that have to be generated.

  • botanist.GenerateSecretState: generates certificate authorities and other secrets which have to be persisted in the ShootState and must not be regenerated on the Destination Seed
  • botanist.DeploySecrets: takes the secret data from the ShootState, generates new secrets from it (e.g. client TLS certificates signed by the saved certificate authorities), and deploys everything in the Shoot's control plane on the Destination Seed
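The split can be sketched as follows. The helper names and the classification set are illustrative, not the actual botanist implementation:

```go
package main

import "fmt"

// mustPersist lists secrets that have to be stored in the ShootState and
// reused on the Destination Seed (the minimum set from this proposal); all
// other secrets can be derived from them. Illustrative only.
var mustPersist = map[string]bool{
	"ca":                     true,
	"ca-front-proxy":         true,
	"static-token":           true,
	"etcd-encryption-secret": true,
	"service-account-key":    true,
	"ssh-keypair":            true,
}

// generateSecretState keeps only the secrets that must survive a migration,
// mirroring what botanist.GenerateSecretState would persist.
func generateSecretState(all map[string][]byte) map[string][]byte {
	state := map[string][]byte{}
	for name, data := range all {
		if mustPersist[name] {
			state[name] = data
		}
	}
	return state
}

// deploySecrets reuses persisted secrets unchanged and regenerates the
// derivable ones, mirroring botanist.DeploySecrets on the Destination Seed.
func deploySecrets(state map[string][]byte, regenerate func(name string) []byte, derived []string) map[string][]byte {
	out := map[string][]byte{}
	for name, data := range state {
		out[name] = data // persisted data is restored as-is
	}
	for _, name := range derived {
		out[name] = regenerate(name) // e.g. client TLS certs signed by the saved CA
	}
	return out
}

func main() {
	state := generateSecretState(map[string][]byte{
		"ca":             []byte("ca-bytes"),
		"kube-apiserver": []byte("server-cert"), // derivable, not persisted
	})
	fmt.Println(len(state)) // only "ca" survives
}
```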

ShootState synchronization controller

The ShootState synchronization controller will become part of the gardenlet. It syncs the state of extension custom resources from the shoot namespace to the garden cluster by updating the corresponding entry in the extensions section of the ShootState resource. The controller can watch the Custom Resources used by the extensions and update the ShootState only when changes occur.
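A minimal sketch of that sync loop, with a channel standing in for watch events (this is not real client-go code; all names are illustrative):

```go
package main

import "fmt"

// stateUpdate is a simplified stand-in for a watch event on an extension
// Custom Resource whose status.state changed in the shoot namespace.
type stateUpdate struct {
	Kind  string
	State string
}

// syncShootState folds events into an in-memory stand-in for the ShootState
// resource and counts the writes, updating only when the stored state
// actually changed; this is the "only when changes occur" behaviour.
func syncShootState(shootState map[string]string, updates <-chan stateUpdate) int {
	writes := 0
	for u := range updates {
		if shootState[u.Kind] != u.State {
			shootState[u.Kind] = u.State
			writes++
		}
	}
	return writes
}

func main() {
	updates := make(chan stateUpdate, 3)
	updates <- stateUpdate{"Infrastructure", "tf-state-v1"}
	updates <- stateUpdate{"Infrastructure", "tf-state-v1"} // duplicate event, no write
	updates <- stateUpdate{"Worker", "machines-v1"}
	close(updates)

	shootState := map[string]string{}
	fmt.Println(syncShootState(shootState, updates)) // 2
}
```

Skipping no-op updates keeps write traffic to the Gardener API server proportional to real state changes rather than to watch event volume.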

Migration workflow

  1. Starting migration
    • Migration can only be started after a Shoot cluster has been successfully created, so that the status.seed field in the Shoot resource has been set
    • The spec.seedName field of the Shoot resource is edited to hold the name of the Destination Seed (e.g. spec.seedName="new-seed"), and reconciliation is triggered automatically
    • The Gardener Controller Manager compares spec.seedName with status.seed, detects that they differ, and triggers a migration
  2. The Gardener Controller Manager waits for the Destination Seed to be ready
  3. Shoot's API server is stopped
  4. The Shoot's ETCD is backed up.
  5. Extension resources in the Source Seed are annotated with gardener.cloud/operation=migrate
  6. The Shoot's control plane in the Source Seed is scaled down.
  7. The gardenlet in the Destination Seed fetches the state of extension resources from the ShootState resource in the garden cluster.
  8. Normal reconciliation flow is resumed in the Destination Seed. Extension resources are annotated with gardener.cloud/operation=restore to instruct the extension controllers to reconstruct their state.
  9. The Shoot's namespace in Source Seed is deleted.
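The trigger in step 1 boils down to a comparison of the two seed fields; a sketch with an illustrative helper:

```go
package main

import "fmt"

// migrationNeeded mirrors the trigger check: migration starts only when the
// shoot has been scheduled once (status.seed is set) and spec.seedName now
// points somewhere else.
func migrationNeeded(specSeedName, statusSeed string) bool {
	return statusSeed != "" && specSeedName != "" && specSeedName != statusSeed
}

func main() {
	fmt.Println(migrationNeeded("new-seed", "old-seed")) // true: triggers migration
	fmt.Println(migrationNeeded("old-seed", "old-seed")) // false: normal reconciliation
	fmt.Println(migrationNeeded("new-seed", ""))         // false: shoot never created
}
```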

Leader Election and Control Plane Termination

During migration, a "split brain" scenario must be avoided. This means that a Shoot's control plane in the Source Seed must be scaled down before it is scaled up in the Destination Seed.

Note: This section is still under discussion. The plan is to first implement the ShootState and modify the reconciliation flow and extensions accordingly. Additionally, multiple scenarios need to be considered, depending on the reachability of the Garden cluster from the Source Seed components, of the Source Seed's API server from the Garden cluster, and of the Source Seed's API server from the controllers running on the seed. The initial implementation will only cover the case where everything is running.

Extension controllers do not need leader election functionality because they only reconcile extension resources if the reconcile operation annotation is specified on the resource. Since this annotation is set only during a reconciliation triggered by the Gardener Controller Manager, such a reconciliation cannot happen during migration.

For other controllers in the control plane (e.g. MCM, etcd-backup-restore, kube-apiserver), leader election has to be implemented. We plan to introduce the gardenlet soon; therefore, the garden cluster is expected to be reachable from all seeds.

Another flag might be needed to tell the gardenlet in the Destination Seed that the control plane in the Source Seed has been scaled down and the Shoot reconciliation can begin.

Horizontal Pod Autoscalers also need to be considered and removed.

Garden cluster and Source Seed are healthy and there are no network problems

If both the Garden cluster and the Source Seed cluster are healthy, the Gardener Controller Manager (or the gardenlet, after checking spec.seedName) can directly scale down the Shoot's control plane as part of the migration flow.

If components in the Source Seed cannot reliably read who the leader is from the Garden cluster

Currently we have come up with two ideas to handle this case:

DNS leader election: A DNS TXT record with TTL=60s and value seed='Source Seed' is used. The record is created and maintained by the Gardener Controller Manager (using the DNS Controller Manager and its DNSEntry resource). When a control plane migration is detected, the Gardener Controller Manager changes the value of the DNS record to seed='Destination Seed' and waits for 2*TTL + 1 = 121 seconds to ensure that the change has propagated to all controllers in the old seed. This relies on DNS being highly available (e.g. AWS Route53 has a 100% availability SLA) and on the control plane components in the Source Seed being able to see the change.

Control plane components have to be shut down even when there is no access to the Source Seed's API server. To be able to do that, a daemonset is deployed in each Seed cluster. When the daemonset in the Source Seed sees that it is no longer the leader (by checking the DNS record) and there is no connection to the Source Seed's API server, it kills the Shoot's control plane pods by talking directly to the Kubelet API. If the Source Seed's API server comes back up, the gardenlet should take care of scaling down the deployments and statefulsets in the Shoot's control plane. This could be problematic if the gardenlet is in a crashloop backoff or takes too long to do the scaling.

As an alternative to the daemonset, a sidecar container can be added to each control plane component. The sidecar checks the DNS record to see if it is still the leader; if not, it shuts down the entire pod. This way there is no risk of deployments and statefulsets recreating the control plane pods after the seed's API server comes back up.
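Under the stated assumptions (a resolvable leader record and a known local seed name), the daemonset/sidecar decision could be sketched as follows; the helper and its messages are illustrative:

```go
package main

import "fmt"

// shouldShutDown sketches the decision made by the per-seed daemonset (or the
// sidecar variant) in the DNS leader election idea described above.
func shouldShutDown(dnsLeader, localSeed string, apiServerReachable bool) (bool, string) {
	if dnsLeader == localSeed {
		return false, "still leader, keep running"
	}
	if apiServerReachable {
		return false, "leadership lost, but gardenlet can scale the control plane down"
	}
	return true, "leadership lost and no API server: kill control plane pods via the Kubelet API"
}

func main() {
	kill, reason := shouldShutDown("destination-seed", "source-seed", false)
	fmt.Println(kill, reason)
}
```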

The main problem with using DNS for leader election is caching; additionally, not all DNS servers respect TTL settings.

Using timestamps in the ETCD backup entries: Once a Shoot is successfully created on the Source Seed, a timestamp is saved in the Cluster resource and/or in the etcd-backup-restore sidecar (either as an environment variable or additional configuration). The timestamp must not be modified afterwards and is used when the backup-restore container writes data to the Shoot's backup entry in the following way:

  • If there is no timestamp in the backup entry, the current timestamp is uploaded.
  • If there is a timestamp in the backup entry, it is compared to the backup-restore container's timestamp:
    1. If it is the same, nothing is done
    2. If it is older, it is replaced with the timestamp of the current backup-restore container
    3. If it is newer, the current backup-restore container does not have ownership of the backup entry

When case 3 happens, it means that the shoot has been migrated and the backup-restore container in the Destination Seed has started using the Shoot's backup entry. The backup-restore container on the Source Seed should be configured to shut itself and the etcd container down once it sees the newer timestamp. Shutting down etcd will cause the kubernetes control plane components to go into crashloop backoff, and the MCM will not be able to do anything as it will not be able to list nodes (this has to be verified with MCM).

For this approach to work, backups must be enabled for the Shoot being migrated. Additionally, synchronization based on the timestamps in the backup entry depends on the frequency of backups made by the backup-restore container.