Add in-place upgrade proposal #30

Merged 7 commits on Sep 12, 2024
`docs/proposals/000-template.md` (137 additions, 0 deletions)
<!--
To start a new proposal, create a copy of this template on this directory and
fill out the sections below.
-->

# Proposal information

<!-- Index number -->
- **Index**: 000

<!-- Status -->
- **Status**: <!-- **DRAFTING**/**ACCEPTED**/**REJECTED** -->

<!-- Short description for the feature -->
- **Name**: Feature name

<!-- Owner name and github handle -->
- **Owner**: FirstName LastName / <!-- [@name](https://github.com/name) -->

# Proposal Details

## Summary
<!--
In a short paragraph, explain what the proposal is about and what problem
it is attempting to solve.
-->

## Rationale
<!--
This section COULD be as short or as long as needed. In the appropriate amount
of detail, you SHOULD explain how this proposal improves k8s providers, what
problem it is trying to solve, and how it makes the user experience better.

You can do this by describing user scenarios, and how this feature helps them.
You can also provide examples of how this feature may be used.
-->

## User facing changes
<!--
This section MUST describe any user-facing changes that this feature brings, if
any. If an API change is required, the affected endpoints MUST be mentioned. If
the output of any k8s command changes, the difference MUST be mentioned, with a
clear example of "before" and "after".
-->

none

## Alternative solutions
<!--
This section SHOULD list any possible alternative solutions that have been or
should be considered. If required, add more details about why these alternative
solutions were discarded.
-->

none

## Out of scope
<!--
This section MUST reference any work that is out of scope for this proposal.
Out of scope items are typically unknowns that we do not yet have a clear idea
of how to solve, so we explicitly do not tackle them until we have more
information.

This section is very useful to help guide the implementation details section
below, or serve as reference for future proposals.
-->

none

# Implementation Details

## API Changes
<!--
This section MUST mention any changes to the k8sd API, or any additional API
endpoints (and messages) that are required for this proposal.

Unless there is a particularly strong reason, it is preferable to add new v2/v3
API endpoints instead of breaking the existing APIs, so that API clients are
not affected.
-->
none

## Bootstrap Provider Changes
<!--
This section MUST mention any changes to the bootstrap provider.
-->
none

## ControlPlane Provider Changes
<!--
This section MUST mention any changes to the controlplane provider.
-->
none

## Configuration Changes
<!--
This section MUST mention any new configuration options or service arguments
that are introduced.
-->
none

## Documentation Changes
<!--
This section MUST mention any new documentation that is required for the new
feature. Most features are expected to come with at least a How-To and an
Explanation page.

In this section, it is useful to think about any existing pages that need to be
updated (e.g. command outputs).
-->
none

## Testing
<!--
This section MUST explain how the new feature will be tested.
-->

## Considerations for backwards compatibility
<!--
In this section, you MUST mention any breaking changes that are introduced by
this feature. Some examples:

- In case of deleting a database table, how do older k8sd instances handle it?
- In case of a changed API endpoint, how do existing clients handle it?
- etc
-->

## Implementation notes and guidelines
<!--
In this section, you SHOULD go into detail about how the proposal can be
implemented. If needed, link to specific parts of the code (link against
particular commits, not branches, such that any links remain valid going
forward).

This is useful as it allows the proposal owner to not be the person that
implements it.
-->
`docs/proposals/001-in-place-upgrades.md` (257 additions, 0 deletions)
<!--
To start a new proposal, create a copy of this template on this directory and
fill out the sections below.
-->

# Proposal information

<!-- Index number -->
- **Index**: 001

<!-- Status -->
- **Status**: **DRAFTING** <!-- **DRAFTING**/**ACCEPTED**/**REJECTED** -->

<!-- Short description for the feature -->
- **Name**: ClusterAPI In-Place Upgrades

<!-- Owner name and github handle -->
- **Owner**: Berkay Tekin Oz [@berkayoz](https://github.com/berkayoz) <!-- [@name](https://github.com/name) -->

# Proposal Details

## Summary
<!--
In a short paragraph, explain what the proposal is about and what problem
it is attempting to solve.
-->

Canonical Kubernetes CAPI providers should reconcile workload clusters and perform in-place upgrades based on the metadata in the cluster manifest.

This can be used in environments where rolling upgrades are not a viable option such as edge deployments and non-HA clusters.

## Rationale
<!--
This section COULD be as short or as long as needed. In the appropriate amount
of detail, you SHOULD explain how this proposal improves k8s-snap, what
problem it is trying to solve, and how it makes the user experience better.

You can do this by describing user scenarios, and how this feature helps them.
You can also provide examples of how this feature may be used.
-->

The current Cluster API implementation does not provide a way of updating machines in-place and instead follows a rolling upgrade strategy.

This means that a version upgrade triggers a rolling upgrade: new machines are created with the desired configuration and the old ones are removed. This strategy is acceptable in most cases for clusters provisioned on public or private clouds, where provisioning extra resources is not a concern.

However, this strategy is not viable for smaller bare-metal or edge deployments where resources are limited. This makes Cluster API unsuitable out of the box for many use cases in industries such as telco.

We can enable the use of Cluster API in these use-cases by updating our providers to perform in-place upgrades.


## User facing changes
<!--
This section MUST describe any user-facing changes that this feature brings, if
any. If an API change is required, the affected endpoints MUST be mentioned. If
the output of any k8s command changes, the difference MUST be mentioned, with a
clear example of "before" and "after".
-->

Users will be able to perform in-place upgrades on a per-machine basis by running:
```sh
kubectl annotate machine <machine-name> k8sd.io/in-place-upgrade-to={upgrade-option}
```

Users can also perform in-place upgrades on the entire cluster by running:
```sh
kubectl annotate cluster <cluster-name> k8sd.io/in-place-upgrade-to={upgrade-option}
```
This would upgrade machines belonging to `<cluster-name>` one by one.

`{upgrade-option}` can be one of:
* `channel=<channel>` which would refresh the machine to the provided channel e.g. `channel=1.31-classic/stable`
* `revision=<revision>` which would refresh the machine to the provided revision e.g. `revision=640`
* `localPath=<absolute-path-to-file>` which would refresh the machine to the provided local `*.snap` file e.g. `localPath=/path/to/k8s.snap`
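As a rough sketch (the helper name and error handling here are illustrative, not part of the proposal), the annotation value could be parsed into its kind and value like this:

```go
package main

import (
	"fmt"
	"strings"
)

// parseUpgradeOption is a hypothetical helper that splits an
// "{upgrade-option}" value such as "channel=1.31-classic/stable"
// into its kind ("channel", "revision" or "localPath") and value.
func parseUpgradeOption(option string) (kind, value string, err error) {
	parts := strings.SplitN(option, "=", 2)
	if len(parts) != 2 || parts[1] == "" {
		return "", "", fmt.Errorf("invalid upgrade option %q", option)
	}
	switch parts[0] {
	case "channel", "revision", "localPath":
		return parts[0], parts[1], nil
	default:
		return "", "", fmt.Errorf("unknown upgrade option kind %q", parts[0])
	}
}

func main() {
	kind, value, err := parseUpgradeOption("channel=1.31-classic/stable")
	fmt.Println(kind, value, err)
}
```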

## Alternative solutions
<!--
This section SHOULD list any possible alternative solutions that have been or
should be considered. If required, add more details about why these alternative
solutions were discarded.
-->

We could alternatively use the `version` fields defined in the `ControlPlane` and `MachineDeployment` manifests instead of annotations, which would arguably be a better, more native user experience.

However, at the time of writing, CAPI does not support changing upgrade strategies, which means changes to the `version` fields trigger a rolling update.

This behaviour can be adjusted on `ControlPlane` objects, as our provider has full control over them, but cannot easily be adjusted on `MachineDeployment` objects, which causes issues for worker nodes.

Switching to the `version` field should happen once upstream implements support for different upgrade strategies.

## Out of scope
<!--
This section MUST reference any work that is out of scope for this proposal.
Out of scope items are typically unknowns that we do not yet have a clear idea
of how to solve, so we explicitly do not tackle them until we have more
information.

This section is very useful to help guide the implementation details section
below, or serve as reference for future proposals.
-->

In-place upgrades only address upgrades of Canonical Kubernetes and its dependencies. Changes to the OS image are not handled, since the underlying machine image stays the same; these would be handled by a rolling upgrade as usual.

# Implementation Details

## API Changes
<!--
This section MUST mention any changes to the k8sd API, or any additional API
endpoints (and messages) that are required for this proposal.

Unless there is a particularly strong reason, it is preferable to add new v2/v3
API endpoints instead of breaking the existing APIs, so that API clients are
not affected.
-->
### `POST /x/capi/snap-refresh`

```go
type SnapRefreshRequest struct {
// Channel is the channel to refresh the snap to.
Channel string
// Revision is the revision number to refresh the snap to.
Revision string
// LocalPath is the local path to use to refresh the snap.
LocalPath string
}
```

`POST /x/capi/snap-refresh` performs the in-place upgrade with the given options.

The upgrade can be done with either a `Channel`, a `Revision`, or a local `*.snap` file provided via `LocalPath`. The value of `LocalPath` must be an absolute path.

This endpoint should use `ValidateCAPIAuthTokenAccessHandler("capi-auth-token")` for authentication.
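A minimal sketch of how the endpoint handler might translate the request into a snap command follows; the helper and the exact argument order are assumptions for illustration, not the actual k8sd implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// SnapRefreshRequest mirrors the request message above.
type SnapRefreshRequest struct {
	Channel   string
	Revision  string
	LocalPath string
}

// refreshArgs is an illustrative helper mapping a request to the snap
// command the handler would run; exactly one field is expected to be set.
func refreshArgs(req SnapRefreshRequest) ([]string, error) {
	switch {
	case req.Channel != "":
		return []string{"snap", "refresh", "k8s", "--channel", req.Channel}, nil
	case req.Revision != "":
		return []string{"snap", "refresh", "k8s", "--revision", req.Revision}, nil
	case req.LocalPath != "":
		return []string{"snap", "install", req.LocalPath, "--classic", "--dangerous", "--name", "k8s"}, nil
	default:
		return nil, errors.New("one of Channel, Revision or LocalPath must be set")
	}
}

func main() {
	args, _ := refreshArgs(SnapRefreshRequest{Revision: "640"})
	fmt.Println(args)
}
```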

## Bootstrap Provider Changes
<!--
This section MUST mention any changes to the bootstrap provider.
-->

A machine controller called `MachineReconciler` is added, which performs the in-place upgrade if the `k8sd.io/in-place-upgrade-to` annotation is set on the machine.

The controller uses the value of this annotation to call the `/x/capi/snap-refresh` endpoint through `k8sd-proxy`.

The result of this operation is communicated back to the user via the `k8sd.io/in-place-upgrade-status` annotation, which can take one of the following values:

* `in-progress` for an upgrade currently in progress
* `done` for a successful upgrade
* `failed` for a failed upgrade

After an upgrade process begins:
* `k8sd.io/in-place-upgrade-status` annotation on the `Machine` would be added/updated with `in-progress`

After a successful upgrade:
* `k8sd.io/in-place-upgrade-to` annotation on the `Machine` would be removed
* `k8sd.io/in-place-upgrade-current` annotation on the `Machine` would be added/updated with the used `{upgrade-option}`.
* `k8sd.io/in-place-upgrade-status` annotation on the `Machine` would be added/updated with `done`

After a failed upgrade:
* `k8sd.io/in-place-upgrade-failure` annotation on the `Machine` would be added/updated with the failure message
* `k8sd.io/in-place-upgrade-status` annotation on the `Machine` would be added/updated with `failed`

The reconciler should ignore the upgrade if `k8sd.io/in-place-upgrade-status` is already set to `in-progress` on the machine.
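The gating logic above can be sketched as a pure function over the machine's annotations (the function name is illustrative; the annotation keys are from this proposal):

```go
package main

import "fmt"

// shouldStartUpgrade sketches the MachineReconciler's gate: only start
// when the "to" annotation is present and no upgrade is in progress.
func shouldStartUpgrade(annotations map[string]string) bool {
	if annotations["k8sd.io/in-place-upgrade-to"] == "" {
		return false
	}
	return annotations["k8sd.io/in-place-upgrade-status"] != "in-progress"
}

func main() {
	fmt.Println(shouldStartUpgrade(map[string]string{
		"k8sd.io/in-place-upgrade-to": "revision=640",
	}))
}
```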
> **Reviewer note:** it would probably be wise to have a configurable parallelism for upgrading multiple machines at once. With rolling replace, machines are rotated one by one; we might want to mimic the same behaviour here (i.e. if any other in-place upgrade is already in progress, just requeue the request).

#### Changes for Rolling Upgrades and Creating New Machines
In case of a rolling upgrade, or when creating new machines, the `CK8sConfigReconciler` should check for the `k8sd.io/in-place-upgrade-current` annotation both on the `Machine` and on the owner `Cluster` object.

The value of one of these annotations should be used instead of the `version` field when generating the cloud-init script for a machine. The precedence of the version sources is:
1. Annotation on the `Machine`
2. Annotation on the `Cluster`
3. The `version` field

That is, the value from the annotation on the `Machine` is used first, if present.
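This precedence could be implemented along these lines (a sketch; the function name is an assumption, the annotation key is from this proposal):

```go
package main

import "fmt"

const currentKey = "k8sd.io/in-place-upgrade-current"

// effectiveVersion returns the version source to use when generating
// cloud-init: Machine annotation first, then Cluster annotation, then
// the spec's version field as the fallback.
func effectiveVersion(machineAnn, clusterAnn map[string]string, specVersion string) string {
	if v := machineAnn[currentKey]; v != "" {
		return v
	}
	if v := clusterAnn[currentKey]; v != "" {
		return v
	}
	return specVersion
}

func main() {
	fmt.Println(effectiveVersion(nil,
		map[string]string{currentKey: "channel=1.31-classic/stable"}, "v1.30.0"))
}
```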

Using an annotation value requires changing the `install.sh` file to perform the relevant snap operation based on the option.
* `snap install k8s --classic --channel <channel>` for `Channel`
* `snap install k8s --classic --revision <revision>` for `Revision`
* `snap install <path-to-snap> --classic --dangerous --name k8s` for `LocalPath`

When a rolling upgrade is triggered, the `LocalPath` option requires the newly created machine to contain the local `*.snap` file. This usually means the machine image used by the infrastructure provider should be updated to contain this file. Alternatively, the file could be sideloaded in the cloud-init script before installation.

This operation should not be performed if the `install.sh` script is overridden by the user in the manifests.

This would prevent adding nodes with an outdated version and possibly breaking the cluster due to a version mismatch.

## ControlPlane Provider Changes
<!--
This section MUST mention any changes to the controlplane provider.
-->
A cluster controller called `ClusterReconciler` is added, which performs the one-by-one in-place upgrade of the entire workload cluster.

The controller propagates the `k8sd.io/in-place-upgrade-to` annotation on the `Cluster` object by adding it, one machine at a time, to all of the machines owned by the cluster.

A Kubernetes API call listing objects of type `Machine` and filtering by `ownerRef` produces the list of machines owned by the cluster. The controller then iterates over this list, annotating one machine per iteration and waiting for the operation to complete before moving on.

The reconciler should ignore a machine if `k8sd.io/in-place-upgrade-status` is already set to `in-progress`.
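The selection logic can be sketched as follows (the types and function name are illustrative): requeue while any machine is in progress, otherwise pick the next machine that has not yet been upgraded to the target option.

```go
package main

import "fmt"

type machine struct {
	Name        string
	Annotations map[string]string
}

// nextUpgradeTarget returns the next machine to annotate, or false when
// the reconciler should requeue (an upgrade is in progress) or is done.
func nextUpgradeTarget(machines []machine, option string) (string, bool) {
	for _, m := range machines {
		if m.Annotations["k8sd.io/in-place-upgrade-status"] == "in-progress" {
			return "", false // requeue: one-by-one rollout
		}
	}
	for _, m := range machines {
		if m.Annotations["k8sd.io/in-place-upgrade-current"] != option {
			return m.Name, true
		}
	}
	return "", false // all machines already upgraded
}

func main() {
	ms := []machine{
		{Name: "cp-0", Annotations: map[string]string{"k8sd.io/in-place-upgrade-current": "revision=640"}},
		{Name: "cp-1", Annotations: map[string]string{}},
	}
	name, ok := nextUpgradeTarget(ms, "revision=640")
	fmt.Println(name, ok)
}
```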

Once upgrades of the underlying machines are finished:
* `k8sd.io/in-place-upgrade-to` annotation on the `Cluster` would be removed
* `k8sd.io/in-place-upgrade-current` annotation on the `Cluster` would be added/updated with the used `{upgrade-option}`.

## Configuration Changes
<!--
This section MUST mention any new configuration options or service arguments
that are introduced.
-->
none

## Documentation Changes
<!--
This section MUST mention any new documentation that is required for the new
feature. Most features are expected to come with at least a How-To and an
Explanation page.

In this section, it is useful to think about any existing pages that need to be
updated (e.g. command outputs).
-->
A `How-To` page on performing in-place upgrades should be created.

A `Reference` page listing the annotations and their possible values should be created/updated.

## Testing
<!--
This section MUST explain how the new feature will be tested.
-->
The new feature can be tested manually by applying the annotation on a machine, waiting for the process to finish by checking the `k8sd.io/in-place-upgrade-status` annotation, and then checking the version of the node through the Kubernetes API, e.g. `kubectl get node`. A timeout should be set for waiting on the upgrade process.

The tests can be integrated into CI in the same way as with the CAPD infrastructure provider.

The upgrade should be performed with the `localPath` option. Under Pebble, the process would replace the `kubernetes` binary with the binary provided in the annotation value.

This means a Docker image containing both versions should be created: the new version of the `kubernetes` binary would be built and placed at a known path.


## Considerations for backwards compatibility
<!--
In this section, you MUST mention any breaking changes that are introduced by
this feature. Some examples:

- In case of deleting a database table, how do older k8sd instances handle it?
- In case of a changed API endpoint, how do existing clients handle it?
- etc
-->

## Implementation notes and guidelines
<!--
In this section, you SHOULD go into detail about how the proposal can be
implemented. If needed, link to specific parts of the code (link against
particular commits, not branches, such that any links remain valid going
forward).

This is useful as it allows the proposal owner to not be the person that
implements it.
-->

The annotation method was chosen due to the "immutable infrastructure" assumption CAPI currently makes: updates are always done by creating new machines, and fields are immutable. This might also pose some challenges for displaying accurate Kubernetes version information through CAPI.

We should be aware of the [metadata propagation](https://cluster-api.sigs.k8s.io/developer/architecture/controllers/metadata-propagation) performed by the upstream controllers. Some metadata is propagated in-place, which can ultimately propagate all the way down to the `Machine` objects. This could potentially flood the cluster with upgrades if machines get annotated at the same time. For this reason, the cluster-wide upgrade is handled through the annotation on the actual `Cluster` object.

Updating the `version` field would still trigger rolling updates by default, the only difference from upstream being the precedence of the version value provided in the annotations.