diff --git a/content/docs/devops-tips/large-clusters.md b/content/docs/devops-tips/large-clusters.md
new file mode 100644
index 00000000000..b64c77d778b
--- /dev/null
+++ b/content/docs/devops-tips/large-clusters.md
@@ -0,0 +1,77 @@
---
title: Deploying cert-manager on Large Clusters
description: |
  Learn how to optimize cert-manager for deployment on large clusters,
  with thousands of Certificate and Secret resources.
---

Learn how to optimize cert-manager for deployment on large clusters,
with thousands of Certificate and Secret resources.

## Overview

The defaults in the Helm chart and in the YAML manifests are intended for general use.
You will need to modify the configuration if your Kubernetes cluster has thousands of Certificate resources and TLS Secrets.

## Memory

### Recommendations

Here are some `memory.request` recommendations for each of the cert-manager components in different scenarios:

| Scenario                   | controller (Mi) | cainjector (Mi) | webhook (Mi) |
|----------------------------|-----------------|-----------------|--------------|
| 2000 RSA 4096 Certificates | 350             | 150             | 50           |

> 📖️ Read [What Everyone Should Know About Kubernetes Memory Limits](https://home.robusta.dev/blog/kubernetes-memory-limit)
> to learn why the best practice is to set the memory limit equal to the memory request.

### Rationale

**When Certificate resources are the dominant use-case**,
such as when workloads need to mount the TLS Secret or when gateway-shim is used,
the memory consumption of the cert-manager controller will be roughly
proportional to the total size of the Secret resources that contain the TLS
key pairs.
Why? Because the cert-manager controller caches the entire content of these Secret resources in memory.
If large TLS keys are used (e.g. RSA 4096), memory use will be higher than if smaller TLS keys are used (e.g. ECDSA).
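To apply the `memory.request` recommendations from the table above via the Helm chart, a values fragment along these lines could be used. This is a hedged sketch: the top-level `resources`, `cainjector.resources` and `webhook.resources` keys are assumed to match your chart version, and the limits are set equal to the requests per the best practice linked above.

```yaml
# Hypothetical values.yaml fragment for the cert-manager Helm chart.
# Verify the key names against the default values of your chart version.
resources:            # cert-manager controller
  requests:
    memory: 350Mi
  limits:
    memory: 350Mi
cainjector:
  resources:
    requests:
      memory: 150Mi
    limits:
      memory: 150Mi
webhook:
  resources:
    requests:
      memory: 50Mi
    limits:
      memory: 50Mi
```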
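To estimate this cached size for your own cluster, you can sum the base64-encoded `data` fields of the TLS-type Secrets. A rough sketch using `jq`: in a real cluster you would feed it the output of `kubectl get secrets --all-namespaces --field-selector type=kubernetes.io/tls -o json`; a canned two-Secret document (with hypothetical Secret names) is piped in here so the example is self-contained:

```shell
# Sum the sizes (in bytes, base64-encoded) of the data fields of TLS Secrets.
# The canned JSON below stands in for real `kubectl get secrets ... -o json` output.
echo '{
  "items": [
    {"metadata": {"name": "web-tls"}, "data": {"tls.crt": "QUJDRA==", "tls.key": "RUZHSA=="}},
    {"metadata": {"name": "api-tls"}, "data": {"tls.crt": "SUpLTA=="}}
  ]
}' | jq '[.items[].data // {} | to_entries[].value | length] | add // 0'
# → 24
```

Note that this sums the base64-encoded sizes, so it slightly overestimates the raw key-pair sizes; it is only meant as an order-of-magnitude guide.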

The other Secrets in the cluster, such as those used for Helm chart configuration or for other workloads,
will not significantly increase memory consumption, because cert-manager caches only the metadata of these Secrets.

**When CertificateRequest resources are the dominant use-case**,
such as with csi-driver or with istio-csr,
the memory consumption of the cert-manager controller will be much lower,
because there will be fewer TLS Secrets and fewer resources to cache.

### Evidence

This chart shows the memory consumption of the cert-manager controller (1.14)
during an experiment where 2000 RSA 4096 Certificates are created, signed and
then deleted.

![Scatter chart showing cert-manager memory usage and cluster resource counts over time](/docs/devops-tips/large-clusters/default-memory-1.png)

The pattern of memory consumption can be explained as follows:

1. `0min`: `~50MiB`: There are 0 Certificates.
   There are 13 incidental Secret resources:
   Helm chart configuration Secrets, and
   Secrets belonging to the metrics-server and the Prometheus stack, which are also installed in the test cluster.
   All the cert-manager Deployments were restarted before the experiment, so the components have cached only these resources.
1. `33min`: `~260MiB`: All 2000 Certificate resources have been reconciled.
   Every Certificate now has a corresponding CertificateRequest and TLS Secret.
   There are `~3600` Secret resources -- `~1600` more than can be explained by the 2000 TLS Secrets.
   **Why?**
   Possibly because cert-manager creates a temporary Secret resource for each Certificate.
   The temporary Secret is where cert-manager stores the private key when it is first generated.
   After the TLS certificate has been signed, the temporary Secret is deleted.
1. `40min`: `~280MiB`: The remaining temporary Secret resources are deleted.
   This causes a spike in memory. **Why?**
1. `42min`: `~225MiB`: The Go garbage collector eventually frees the memory which had been allocated for the recently deleted Secrets.
1. `46min`: `~300MiB`: The Certificates and Secrets are now being rapidly deleted.
   This causes another spike in memory. **Why?**
1. `48min`: `~280MiB`: All the Certificate, CertificateRequest and Secret resources have now been deleted,
   but memory consumption remains at roughly its peak.
   **Why?**
   The memory which had been allocated to cache the resources is not immediately freed.
diff --git a/content/docs/manifest.json b/content/docs/manifest.json
index 09cd5945a6a..922bb18816f 100644
--- a/content/docs/manifest.json
+++ b/content/docs/manifest.json
@@ -646,6 +646,10 @@
           {
             "title": "Best Practice Installation Options",
             "path": "/docs/installation/best-practice.md"
+          },
+          {
+            "title": "Large Cluster Configuration",
+            "path": "/docs/devops-tips/large-clusters.md"
           }
         ]
       },
diff --git a/public/docs/devops-tips/large-clusters/default-memory-1.png b/public/docs/devops-tips/large-clusters/default-memory-1.png
new file mode 100644
index 00000000000..8762ee3d0a2
Binary files /dev/null and b/public/docs/devops-tips/large-clusters/default-memory-1.png differ
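As a footnote to the evidence above: if you want to watch the memory consumption of the cert-manager components on your own cluster while running a similar experiment, `kubectl top` gives a quick view. This is a hedged sketch: it assumes metrics-server is installed, and that cert-manager was installed into the `cert-manager` namespace with the default Helm release name, so adjust the namespace and label selector to your setup.

```shell
# Show current working-set memory of the cert-manager pods.
# Requires metrics-server; the namespace and label selector are
# assumptions based on a default Helm installation.
kubectl top pods \
  --namespace cert-manager \
  --selector app.kubernetes.io/instance=cert-manager
```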