From e859aeb52b477a344aaa40513b9176d7d851d1d4 Mon Sep 17 00:00:00 2001 From: Richard Wall Date: Tue, 9 Apr 2024 08:51:21 +0100 Subject: [PATCH 1/4] Memory and CPU optimizations and recommendations Signed-off-by: Richard Wall --- .spelling | 1 + .../docs/devops-tips/scaling-cert-manager.md | 101 ++++++++++++++++++ content/docs/manifest.json | 4 + 3 files changed, 106 insertions(+) create mode 100644 content/docs/devops-tips/scaling-cert-manager.md diff --git a/.spelling b/.spelling index 0f12073102..86a2f76eb1 100644 --- a/.spelling +++ b/.spelling @@ -250,6 +250,7 @@ RRs RSA Ramlot RinkiyaKeDad +Robusta Route53 Runtime roadmap diff --git a/content/docs/devops-tips/scaling-cert-manager.md b/content/docs/devops-tips/scaling-cert-manager.md new file mode 100644 index 0000000000..d2a3e6432e --- /dev/null +++ b/content/docs/devops-tips/scaling-cert-manager.md @@ -0,0 +1,101 @@ +--- +title: Scaling cert-manager +description: | + Learn how to optimize cert-manager for your cluster. +--- + +Learn how to optimize cert-manager for your cluster. + +## Overview + +The defaults in the Helm chart and YAML manifests are intended for general use. +You may want to modify the configuration to suit the size and usage of your Kubernetes cluster. + +## Set appropriate memory requests and limits + +**When Certificate resources are the dominant use-case**, +such as when workloads need to mount the TLS Secret or when gateway-shim is used, +the memory consumption of the cert-manager controller will be roughly +proportional to the total size of those Secret resources that contain the TLS +key pairs. +Why? Because the cert-manager controller caches the entire content of these Secret resources in memory. +If large TLS keys are used (e.g. RSA 4096) the memory use will be higher than if smaller TLS keys are used (e.g. ECDSA). + +The other Secrets in the cluster, such as those used for Helm chart configurations or for other workloads, +will not significantly increase the memory consumption, because cert-manager will only cache the metadata of these Secrets. + +**When CertificateRequest resources are the dominant use-case**, +such as with csi-driver or with istio-csr, +the memory consumption of the cert-manager controller will be much lower, +because there will be fewer TLS Secrets and fewer resources to be cached. + +> 📖️ Read [What Everyone Should Know About Kubernetes Memory Limits](https://home.robusta.dev/blog/kubernetes-memory-limit), +> to learn how to right-size the memory requests. + +## Disable client-side rate limiting for Kubernetes API requests + +By default cert-manager [throttles the rate of requests to the Kubernetes API server](https://github.com/cert-manager/cert-manager/blob/b61de55abda95a4c273be0c8d3e6025fe8511573/internal/apis/config/controller/v1alpha1/defaults.go#L59-L60) to 20 queries per second. +Historically this was intended to prevent cert-manager from overwhelming the Kubernetes API server, +but modern versions of Kubernetes implement [API Priority and Fairness](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/), +which obviates the need for client side throttling. +You can increase the threshold of the the client-side rate limiter using the following helm values: + +```yaml +# helm-values.yaml +config: + apiVersion: controller.config.cert-manager.io/v1alpha1 + kind: ControllerConfiguration + kubernetesAPIQPS: 10000 + kubernetesAPIBurst: 10000 +``` + +> ℹ️ This does not technically disable the client-side rate-limiting but configures the QPS and Burst values high enough that they are never reached. +> +> 🔗 Read [`cert-manager#6890`: Allow client-side rate-limiting to be disabled](https://github.com/cert-manager/cert-manager/issues/6890); +> a proposal for a cert-manager configuration option to disable client-side rate-limiting. +> +> 🔗 Read [`kubernetes#111880`: Disable client-side rate-limiting when AP&F is enabled](https://github.com/kubernetes/kubernetes/issues/111880); +> a proposal that the `kubernetes.io/client-go` module should automatically use server-side rate-limiting when it is enabled. +> +> 🔗 Read about other projects that disable client-side rate limiting: [Flux](https://github.com/fluxcd/pkg/issues/269). +> +> 📖 Read [API documentation for ControllerConfiguration](../reference/api-docs.md#controller.config.cert-manager.io/v1alpha1.ControllerConfiguration) for a description of the `kubernetesAPIQPS` and `kubernetesAPIBurst` configuration options. + +## Restrict the use of large RSA keys + +Certificates with large RSA keys cause cert-manager to use more CPU resources. +When there are insufficient CPU resources, the reconcile queue length grows, +which delays the reconciliation of all Certificates. +A user who has permission to create a large number of RSA 4096 certificates, +might accidentally or maliciously cause a denial of service for other users on the cluster. + +> 📖 Learn [how to enforce an Approval Policy](../policy/approval/README.md), to prevent the use of large RSA keys. +> +> 📖 Learn [how to set Certificate defaults automatically](../tutorials/certificate-defaults/README.md), using tools like Kyverno. + + +## Set `revisionHistoryLimit: 1` on all Certificate resources + +By default, cert-manager will keep all the CertificateRequest resources that **it** creates +([`revisionHistoryLimit`](../reference/api-docs.md#cert-manager.io/v1.CertificateSpec)): + +> The maximum number of CertificateRequest revisions that are maintained in +> the Certificate's history. Each revision represents a single +> `CertificateRequest` created by this Certificate, either when it was +> created, renewed, or Spec was changed. Revisions will be removed by oldest +> first if the number of revisions exceeds this number. +> If set, `revisionHistoryLimit` must be a value of `1` or greater. If unset +> (`nil`), revisions will not be garbage collected. Default value is `nil`. + +On a busy cluster these will eventually overwhelm your Kubernetes API server; +because of the memory and CPU required to cache them all and the storage required to save them. + +Use a tool like Kyverno to override the `Certificate.spec.revisionHistoryLimit` for all namespaces. + +> 📖 Adapt [the Kyverno policies in the tutorial: how to set Certificate defaults automatically](../tutorials/certificate-defaults/README.md), +> to override rather than default the `revisionHistoryLimit` field. +> +> 📖 Learn [how to set `revisionHistoryLimit` when using Annotated Ingress resources](../usage/ingress.md#supported-annotations). +> +> 🔗 Read [`cert-manager#3773`: Certificate revision history limit](https://github.com/cert-manager/cert-manager/pull/3773), +> to learn why stale CertificateRequests resources are not automatically deleted. diff --git a/content/docs/manifest.json b/content/docs/manifest.json index 09cd5945a6..2ebe66e0ff 100644 --- a/content/docs/manifest.json +++ b/content/docs/manifest.json @@ -646,6 +646,10 @@ { "title": "Best Practice Installation Options", "path": "/docs/installation/best-practice.md" + }, + { + "title": "Scaling cert-manager", + "path": "/docs/devops-tips/scaling-cert-manager.md" } ] }, From 6592015fa8a995c60f2afdb0e9daa4c6dbf52d1b Mon Sep 17 00:00:00 2001 From: Richard Wall Date: Fri, 19 Apr 2024 15:53:05 +0100 Subject: [PATCH 2/4] Recommend server-side apply Signed-off-by: Richard Wall --- .../docs/devops-tips/scaling-cert-manager.md | 33 +++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/content/docs/devops-tips/scaling-cert-manager.md b/content/docs/devops-tips/scaling-cert-manager.md index d2a3e6432e..b760577635 100644 --- a/content/docs/devops-tips/scaling-cert-manager.md +++ b/content/docs/devops-tips/scaling-cert-manager.md @@ -99,3 +99,36 @@ Use a tool like Kyverno to override the `Certificate.spec.revisionHistoryLimit` > > 🔗 Read [`cert-manager#3773`: Certificate revision history limit](https://github.com/cert-manager/cert-manager/pull/3773), > to learn why stale CertificateRequests resources are not automatically deleted. + +## Enable Server-Side Apply + +By default, cert-manager [uses Update requests](https://kubernetes.io/docs/reference/using-api/api-concepts/#update-mechanism-update) +to create and modify resources like CertificateRequest and Secret, +but on a busy cluster there will be frequent conflicts as the control loops in cert-manager each try to update the status of various resources. + +You will see errors, like this one, in the logs: + +> `I0419 14:11:51.325377 1 controller.go:162] "re-queuing item due to optimistic locking on resource" logger="cert-manager.certificates-trigger" key="team-864-p6ts6/app-7" error="Operation cannot be fulfilled on certificates.cert-manager.io \"app-7\": the object has been modified; please apply your changes to the latest version and try again"` + +This error is relatively harmless because the update attempt is retried, +but it slows down the reconciliation because each error triggers an exponential back off mechanism, +which causes increasing delays between retries. + +The solution is to turn on the [Server-Side Apply Feature](../installation/configuring-components.md#feature-gates), +which causes cert-manager to use [HTTP PATCH using Server-Side Apply](https://kubernetes.io/docs/reference/using-api/api-concepts/#update-mechanism-server-side-apply) when ever it needs to modify an API resource. +This avoids all conflicts because each cert-manager controller sets only the fields that it owns. + +You can enable the server-side apply feature gate with the following Helm chart values: + +```yaml +# helm-values.yaml +config: + apiVersion: controller.config.cert-manager.io/v1alpha1 + kind: ControllerConfiguration + featureGates: + AllBeta: true + ServerSideApply: true +``` + +> 📖 Read [Using Server-Side Apply in a controller](https://kubernetes.io/docs/reference/using-api/server-side-apply/#using-server-side-apply-in-a-controller), +> to learn about the advantages of server-side apply for software like cert-manager. From cb3053b638e72dd61838a84acf09761932767101 Mon Sep 17 00:00:00 2001 From: Richard Wall Date: Tue, 30 Apr 2024 11:21:43 +0100 Subject: [PATCH 3/4] Update content/docs/devops-tips/scaling-cert-manager.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-authored-by: Maël Valais Signed-off-by: Richard Wall --- content/docs/devops-tips/scaling-cert-manager.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/devops-tips/scaling-cert-manager.md b/content/docs/devops-tips/scaling-cert-manager.md index b760577635..1aca0e133b 100644 --- a/content/docs/devops-tips/scaling-cert-manager.md +++ b/content/docs/devops-tips/scaling-cert-manager.md @@ -38,7 +38,7 @@ By default cert-manager [throttles the rate of requests to the Kubernetes API se Historically this was intended to prevent cert-manager from overwhelming the Kubernetes API server, but modern versions of Kubernetes implement [API Priority and Fairness](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/), which obviates the need for client side throttling. -You can increase the threshold of the the client-side rate limiter using the following helm values: +You can increase the threshold of the client-side rate limiter using the following helm values: ```yaml # helm-values.yaml From cc066dc780956a1d53f0802853106cdb708b899e Mon Sep 17 00:00:00 2001 From: Richard Wall Date: Tue, 30 Apr 2024 13:52:59 +0100 Subject: [PATCH 4/4] Address code review feedback Signed-off-by: Richard Wall --- content/docs/devops-tips/scaling-cert-manager.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/content/docs/devops-tips/scaling-cert-manager.md b/content/docs/devops-tips/scaling-cert-manager.md index 1aca0e133b..c12185dc19 100644 --- a/content/docs/devops-tips/scaling-cert-manager.md +++ b/content/docs/devops-tips/scaling-cert-manager.md @@ -24,7 +24,7 @@ If large TLS keys are used (e.g. RSA 4096) the memory use will be higher than if The other Secrets in the cluster, such as those used for Helm chart configurations or for other workloads, will not significantly increase the memory consumption, because cert-manager will only cache the metadata of these Secrets. -**When CertificateRequest resources are the dominant use-case**, +**When `CertificateRequest` resources are the dominant use-case**, such as with csi-driver or with istio-csr, the memory consumption of the cert-manager controller will be much lower, because there will be fewer TLS Secrets and fewer resources to be cached. @@ -76,10 +76,10 @@ might accidentally or maliciously cause a denial of service for other users on t ## Set `revisionHistoryLimit: 1` on all Certificate resources -By default, cert-manager will keep all the CertificateRequest resources that **it** creates +By default, cert-manager will keep all the `CertificateRequest` resources that **it** creates ([`revisionHistoryLimit`](../reference/api-docs.md#cert-manager.io/v1.CertificateSpec)): -> The maximum number of CertificateRequest revisions that are maintained in +> The maximum number of `CertificateRequest` revisions that are maintained in > the Certificate's history. Each revision represents a single > `CertificateRequest` created by this Certificate, either when it was > created, renewed, or Spec was changed. Revisions will be removed by oldest @@ -97,13 +97,13 @@ Use a tool like Kyverno to override the `Certificate.spec.revisionHistoryLimit` > > 📖 Learn [how to set `revisionHistoryLimit` when using Annotated Ingress resources](../usage/ingress.md#supported-annotations). > -> 🔗 Read [`cert-manager#3773`: Certificate revision history limit](https://github.com/cert-manager/cert-manager/pull/3773), -> to learn why stale CertificateRequests resources are not automatically deleted. +> 🔗 Read [`cert-manager#3958`: Sane defaults for Certificate revision history limit](https://github.com/cert-manager/cert-manager/issues/3958); +> a proposal to change the default `revisionHistoryLimit`, which will obviate this particular recommendation. ## Enable Server-Side Apply By default, cert-manager [uses Update requests](https://kubernetes.io/docs/reference/using-api/api-concepts/#update-mechanism-update) -to create and modify resources like CertificateRequest and Secret, +to create and modify resources like `CertificateRequest` and `Secret`, but on a busy cluster there will be frequent conflicts as the control loops in cert-manager each try to update the status of various resources. You will see errors, like this one, in the logs: @@ -126,7 +126,6 @@ config: apiVersion: controller.config.cert-manager.io/v1alpha1 kind: ControllerConfiguration featureGates: - AllBeta: true ServerSideApply: true ```