Add Short Rotation Period For Certificates #1670

176 changes: 176 additions & 0 deletions enhancements/certificate-short-rotation.md
@@ -0,0 +1,176 @@
---
title: certificate-short-rotation
authors:
- vrutkovs
reviewers:
- deads2k
approvers:
- deads2k
api-approvers:
- deads2k
creation-date: 2024-08-24
last-updated: 2024-08-24
tracking-link:
- https://issues.redhat.com/browse/API-1688
---

# Short Rotation Period For Certificates

## Summary

Add a new feature gate to the DevPreview feature set so that components issue certificates with a
shorter validity period: hours instead of days.

## Motivation

Currently, OpenShift issues certificates with various validity durations, none shorter than 15
days. This makes testing certificate rotation in CI complicated: we have to emulate the passage of
time using time skewing. This method shows how the cluster recovers after certificates have expired,
but it doesn't help us test the happy path, where certificates rotate during the standard cluster
lifecycle.

Some components (e.g. cluster-kube-apiserver-operator) issue certificates with a shorter lifetime in
the development branch. This requires us to revert the change every time we branch for a new release.
It also doesn't help us in CI, as CI would need a similar change in the installer.
In addition, most components don't do this, so we end up with a few certificates valid for hours
while most are valid for days. No test currently verifies that certificates have actually been
rotated and that rotation didn't cause additional disruptions.

Since reverting this setting requires a manual pull request, there is a chance that the
setting will leak into supported releases.

This enhancement describes a new feature gate that would enable short-lived certificates for all
components. Because it relies on FeatureGates, stable releases cannot have it enabled accidentally.
The enhancement also describes how certificate rotation would be tested in order to
avoid disruptions during rotation.

### User Stories

> As an OpenShift developer, I want a setting that makes components issue shorter-lived
> certificates so that I can verify that certificate rotation doesn't cause issues.

Note that there are no customer user stories: this is a developer-only feature, and customers are
not expected to use it.

### Goals

* Create a new FeatureGate in the DevPreview feature set
* Allow each component to decide its own shortened certificate duration
* Create e2e tests that enable this featuregate and check that certificates rotate correctly
* Run these e2e tests periodically to ensure that a cluster with this featuregate enabled stays functional

### Non-Goals

* Change validity duration for existing certificates

## Proposal
Contributor

Details are scant. To achieve the same benefit, this featuregate needs to be enabled in Default during pre-RC builds and moved to TechPreview after RC.0. Can we get this spelled out?

Member Author

There wasn't any benefit, "developer branch fast rotation" has never worked in CI - see https://github.com/openshift/enhancements/pull/1670/files#diff-1695a5e93f0f7e139919d0e0fac08ce0ea6d442932ff1cc7fab792ef1c616cf1R31-R35

Added more details on how this will be tested in CI and promotion criteria

Contributor

> There wasn't any benefit, "developer branch fast rotation" has never worked in CI - see https://github.com/openshift/enhancements/pull/1670/files#diff-1695a5e93f0f7e139919d0e0fac08ce0ea6d442932ff1cc7fab792ef1c616cf1R31-R35
>
> Added more details on how this will be tested in CI and promotion criteria

The point is not coverage in CI. The point is every pre-production cluster in existence that runs longer than a day. Which covers a wide variety of clusters internal to this group and external to this group including education, TAMs, test platform, and various test environments. Losing this capability is enormous. CI was an objective, but we gained tremendous benefit without ever having a rotation in CI.

Member Author

It never materialized into bugs though (especially into availability tracking), has it? Testing fast rotation in CI, however, will have the same effect, along with a better set of debugging information.

In any case, this featuregate doesn't prevent specific components from doing shorter pre-release cert rotations.


Update components to read enabled FeatureGates and update certificate issuing code in all OpenShift
components.

The featuregate would make components generate certificates with a shorter validity period (hours
instead of days), so that we can verify that most certificates are rotated within the duration of
an e2e test. This would allow developers to verify that certificates get rotated without breaking
cluster features.

So far we've identified the following components which should use this featuregate:
* installer
* cluster-kube-apiserver-operator
* cluster-kube-controller-manager-operator
* cluster-etcd-operator
* cluster-network-operator
* service-ca-operator
* OLM

Component developers may also exclude some certificates from short rotation.
For example, some signers are meant to last "indefinitely" (currently set to 10 years)
to support specific cluster features, e.g. the CSR signer is not meant to expire so that
new nodes can join.
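
As an illustration only, the per-certificate decision described above could look roughly like the
sketch below. It is written in Python for brevity rather than taken from any operator; the function
name, its parameters and the example durations are hypothetical, and only the gate name
`ShortCertificateRotation` comes from this proposal.

```python
from datetime import timedelta

# Gate name as used in this proposal; everything else here is a hypothetical illustration.
SHORT_ROTATION_GATE = "ShortCertificateRotation"


def certificate_validity(default, short, enabled_gates, exempt=False):
    """Pick the validity period a component would use for one certificate.

    default       -- the validity the component uses today (timedelta)
    short         -- the shortened validity the component chose for this certificate
    enabled_gates -- set of feature gate names enabled on the cluster
    exempt        -- True for "indefinite" certificates such as the CSR signer
    """
    if exempt:
        # Excluded certificates keep their current validity regardless of the gate.
        return default
    if SHORT_ROTATION_GATE in enabled_gates:
        # Never lengthen a certificate: use whichever period is shorter.
        return min(default, short)
    return default
```

With the gate enabled, `certificate_validity(timedelta(days=30), timedelta(hours=2), {"ShortCertificateRotation"})`
yields two hours, while the same call with `exempt=True` keeps the 30-day default.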

In order to avoid additional disruptions caused by fast certificate rotation, a new set of
periodics should be created. These periodics would start a cluster with this featuregate enabled,
ensuring that the installer generates short-duration certificates on day 1.
Afterwards, a standard minimal conformance test runs to verify that the cluster is functional. Test
monitors would record API / service disruptions along with certificate rotation events so that
we can measure the effect of rotation on availability.
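
One possible way to relate the two recorded signals is sketched below. This is not how the existing
test monitors work; the data shapes (a mapping of certificate name to rotation time, and a list of
disruption intervals) and the five-minute window are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta


def disruptions_near_rotations(rotations, disruptions, window=timedelta(minutes=5)):
    """Return disruptions that overlap the period shortly after a certificate rotation.

    rotations   -- dict mapping a certificate name to the datetime it was rotated (assumed shape)
    disruptions -- list of (start, end) datetime tuples (assumed shape)
    window      -- how long after a rotation a disruption is still attributed to it (assumption)
    """
    hits = []
    for name, rotated_at in sorted(rotations.items()):
        for start, end in disruptions:
            # The disruption interval overlaps [rotated_at, rotated_at + window].
            if start <= rotated_at + window and end >= rotated_at:
                hits.append((name, start, end))
    return hits
```

An empty result for a whole run would suggest that none of the recorded disruptions coincided with
a rotation; a non-empty result names the certificates worth investigating first.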

The new test job should run for a reasonable amount of time (e.g. 8 hours) and finish with a test
that makes sure the content has changed for every certificate (except "indefinite" certs, see above).
This test will ensure that all certificates are being rotated and that the effect on availability is
being measured.
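
The final check could be as simple as comparing certificate content captured at the start and at
the end of the job. The sketch below assumes the test has already collected PEM bytes keyed by a
certificate name; how they are collected, and the names used, are hypothetical.

```python
import hashlib


def fingerprint(pem_bytes):
    """SHA-256 digest of the certificate content, used only to detect change."""
    return hashlib.sha256(pem_bytes).hexdigest()


def unrotated_certificates(before, after, exceptions=frozenset()):
    """Return sorted names of certificates whose content did not change.

    before, after -- dicts mapping a certificate name to its PEM bytes (assumed shape)
    exceptions    -- names of "indefinite" certificates that are allowed to stay the same
    """
    stale = []
    for name, old in before.items():
        if name in exceptions:
            continue
        new = after.get(name)
        # Certificates missing from the second snapshot are ignored in this sketch.
        if new is not None and fingerprint(new) == fingerprint(old):
            stale.append(name)
    return sorted(stale)
```

The test would pass only when this function returns an empty list, which keeps the list of known
exceptions explicit and reviewable.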

### Workflow Description

N/A

### API Extensions

N/A

### Topology Considerations

#### Hypershift / Hosted Control Planes

N/A

#### Standalone Clusters

N/A

#### Single-node Deployments or MicroShift

Not applicable to MicroShift, as it doesn't issue certificates via operators.

### Implementation Details/Notes/Constraints


### Risks and Mitigations


### Drawbacks


## Open Questions [optional]


## Test Plan

End-to-end testing of this feature would:
* enable the ShortCertificateRotation featuregate
* run a minimal test suite to ensure that main cluster functions are not affected
* observe the cluster for 6-8 hours
* run a new test which verifies that certificates have rotated

Some certificates, e.g. the ingress or CSR signer certificates, are expected to remain unrotated,
so the test would have a list of known exceptions.

## Graduation Criteria

This featuregate is not meant to graduate to GA; it is intended to be a developer-only setting.

### Dev Preview -> Tech Preview
A new set of periodics would show whether the cluster is functional and adheres to availability
requirements. Once this is achieved, the featuregate may be promoted to TechPreview. This will give
us additional signal, as TechPreview jobs run during nightly verification.

### Tech Preview -> GA
N/A

### Removing a deprecated feature


## Upgrade / Downgrade Strategy

Enabling the DevPreview feature set is permanent: there is no way to upgrade or downgrade the cluster afterwards.

## Version Skew Strategy

N/A

## Operational Aspects of API Extensions

N/A

## Support Procedures

This setting is unsupported

## Alternatives