[SURE-6645] Fleet creating large number of resources for each cluster #1651
Comments
See also #1615.
See this script for a possible workaround that deletes obsolete cluster registrations for each downstream cluster.
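The linked workaround script itself is not reproduced in this thread. As a minimal sketch of the idea, assuming registrations are read via `kubectl -n fleet-default get clusterregistrations -o json` and that each registration records its owning cluster in `status.clusterName` (verify both against your Fleet version):

```python
# Sketch only, not the linked script: keep the newest clusterregistration per
# cluster and report the rest as candidates for deletion.
# Assumptions (verify for your Fleet version): input items come from
# `kubectl -n fleet-default get clusterregistrations -o json`, and the owning
# cluster is recorded in status.clusterName.
from itertools import groupby


def obsolete_registrations(items):
    """Return the names of all but the newest registration per cluster."""

    def cluster(item):
        return item.get("status", {}).get("clusterName", "")

    obsolete = []
    for _, regs in groupby(sorted(items, key=cluster), key=cluster):
        # Oldest first; keep the last (newest) registration for each cluster.
        regs = sorted(regs, key=lambda r: r["metadata"]["creationTimestamp"])
        obsolete.extend(r["metadata"]["name"] for r in regs[:-1])
    return obsolete
```

The returned names could then be fed to `kubectl -n fleet-default delete clusterregistration`, though on a cluster this size the deletions should be batched to avoid adding more API-server load.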
This is the same issue as #1615.
Additional QA

Problem: Re-registering clusters can leave clusterregistrations and their child resources (service account, roles, role bindings) behind.

Solution: The agent is only re-registered if the API server or CA changed; a new status field stores the old values.

Testing: Restart the controller and verify that no agents are re-registered. Force agent re-registration, e.g. by running

Engineering Testing

Manual Testing: Tested manually with multiple clusters.

Automated Testing: We have no automation that can create and register clusters.

QA Testing Considerations: Agents will still re-register, e.g. when updating or when the agent configuration changes. A re-deploy, in contrast to a full re-registration, will always happen, e.g. during upgrades when the agent image changes.

Regressions Considerations: Not all agent configuration changes are propagated automatically. For example, if the Fleet controller is stopped and the agent resources are modified on the cluster (https://fleet.rancher.io/ref-crds#clusterspec), starting the controller will not automatically redeploy the cluster's agent.
Detailed testing information can be found in #1690 (comment).
Internal reference: SURE-6645
Environment
Rancher Cluster:
Rancher version: 2.7.5
Number of nodes: 37 (30 for ARGOcd)
Downstream Cluster:
Number of Downstream clusters: 1132
RKE/RKE2/K3S version: 2.7.5
Kubernetes version: 1.25.6
CNI: calico + traefik
Other:
Underlying Infrastructure: Rancher running in managed AKS cluster
Issue description:
etcd has filled with a large number of Fleet resources, i.e. resources in the fleet-default namespace. The number of resources is not proportionate to the number of downstream clusters.
For example, the cluster s0409 has 170 similar requests.
Some clusters have 300+ requests for clusterregistration, none of which are cleaned up.
Business impact:
Kube-apiserver in AKS is hitting its limits and is causing a large number of timeouts when all clusters attempt to connect at once on startup. It takes ~30 minutes for clusters to be able to connect. etcd is getting overloaded and is around 2 GB after compact/defrag.
Troubleshooting steps:
Worked with AKS team to reduce load and pinpoint what is filling ETCD. Argo takes up a large amount of space, but most API calls are coming from Rancher. See images for API volume to kube-apiserver, as well as ETCD object count and size. You can see 80k cluster registrations which is large for the 1132 clusters they have registered. Why would that be?
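To pinpoint which clusters account for the 80k registrations, the objects can be tallied per cluster. A hedged diagnostic sketch, again assuming JSON from `kubectl -n fleet-default get clusterregistrations -o json` with the owning cluster in `status.clusterName` (registrations not yet granted may lack it and are counted under a hypothetical `<pending>` bucket):

```python
# Diagnostic sketch: count clusterregistration objects per cluster to find the
# worst offenders. Assumption: items come from
# `kubectl -n fleet-default get clusterregistrations -o json`, with the owning
# cluster recorded in status.clusterName; items without one are grouped under
# the placeholder name "<pending>".
from collections import Counter


def registrations_per_cluster(items):
    """Map cluster name -> number of clusterregistration objects."""
    return Counter(
        item.get("status", {}).get("clusterName") or "<pending>"
        for item in items
    )
```

Printing `registrations_per_cluster(data["items"]).most_common(20)` would surface clusters like s0409 with hundreds of stale registrations.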
Actual behavior:
A large number of Fleet resources is created for each cluster (roles, rolebindings, clusterregistrations, secrets, serviceaccounts) in the fleet-default namespace. The customer sees a large number of timeouts to downstream clusters, possibly causing Fleet to treat them as new clusters.
Expected behavior:
Smaller, appropriate number of resources created for each cluster.