[Backport v2.7.6][SURE-6645] Fleet creating large number of resources for each cluster #1651 #1656
I have followed the Additional QA Template from the parent issue. Cluster information: …
I kept the cluster alive for around 4.8 days. While it was running I kept patching …; in the meantime I did not see any increase in the ….
Simultaneously I was monitoring the fleet-controller pod logs:
time="2023-07-27T10:18:15Z" level=info msg="Deleted old agent for cluster (fleet-local/local) in namespace cattle-fleet-local-system"
Regressions: in order to test regressions, the steps below were performed.
I don't see any large number of resources being created in the cluster. The fix is working as expected.
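As a sketch of the log monitoring described above (assuming a default Rancher-managed Fleet install, where the controller runs in the cattle-fleet-system namespace under the app=fleet-controller label), the agent-cleanup messages can be followed with:

```shell
# Follow fleet-controller logs and surface agent-cleanup events.
# Namespace and label selector are assumptions based on a default install.
kubectl logs -n cattle-fleet-system -l app=fleet-controller --since=24h -f \
  | grep "Deleted old agent"
```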
While testing there was no …. Later I upgraded Fleet to …. Re-testing this issue with Rancher 2.7.6 will require the cluster to keep running for at least …. I am currently waiting for the RC of Fleet which includes this fix (#1692); in any case I need ….
I have re-tested this issue again on the ….
This is a backport issue of #1651 to go into v2.7.6.
Internal reference: SURE-6645
Environment
Rancher Cluster:
Rancher version: 2.7.5
Number of nodes: 37 (30 for ArgoCD)
Downstream Cluster:
Number of Downstream clusters: 1132
RKE/RKE2/K3S version: 2.7.5
Kubernetes version: 1.25.6
CNI: calico + traefik
Other:
Underlying Infrastructure: Rancher running in managed AKS cluster
Issue description:
etcd has filled up with a large number of Fleet resources, specifically resources in the fleet-default namespace. The number of resources is disproportionate to the number of downstream clusters.
ROLE:
request-6s7nb 2023-07-13T12:46:51Z
ROLEBINDING:
request-6s7nb Role/request-6s7nb 25h
CLUSTERREGISTRATION:
request-6s7nb s0409 {"management.cattle.io/cluster-display-name":"s0409","management.cattle.io/cluster-name":"c-m-zcllc9kx","objectset.rio.cattle.io/hash":"d7128da63a4e07ede9c4d36b1a5fd60b31ce3d45","provider.cattle.io":"k3s"}
For example, the cluster s0409 has 170 similar requests.
Some clusters have 300+ ClusterRegistration requests, none of which are cleaned up.
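To quantify the leak, the registrations can be grouped by cluster name. A minimal sketch, assuming the cluster name is the second column of the default kubectl output, as in the listing above:

```shell
# Count ClusterRegistration objects per downstream cluster in fleet-default.
# Assumes the CLUSTER-NAME printer column is the second output field.
kubectl get clusterregistrations.fleet.cattle.io -n fleet-default --no-headers \
  | awk '{print $2}' | sort | uniq -c | sort -rn | head -20
```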
Business impact:
The kube-apiserver in AKS is hitting its limits and causing a large number of timeouts when all clusters attempt to connect at once on startup. It takes ~30 minutes for clusters to be able to connect. etcd is getting overloaded and is around 2 GB even after compaction/defragmentation.
Troubleshooting steps:
Worked with the AKS team to reduce load and pinpoint what is filling etcd. Argo takes up a large amount of space, but most API calls are coming from Rancher. See the images for API call volume to the kube-apiserver, as well as etcd object count and size. There are ~80k ClusterRegistrations, which is disproportionately large for the 1,132 clusters they have registered. Why would that be?
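One way to confirm per-resource object counts without direct etcd access is via apiserver metrics (a sketch; on Kubernetes ≥ 1.22 the apiserver exposes apiserver_storage_objects, and reading /metrics requires sufficient RBAC):

```shell
# Show per-resource object counts as reported by the kube-apiserver.
# Metric name varies by version: apiserver_storage_objects (>= 1.22)
# replaced the deprecated etcd_object_counts.
kubectl get --raw /metrics | grep '^apiserver_storage_objects' | grep -i fleet
```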
Actual behavior:
A large number of Fleet resources (Roles, RoleBindings, ClusterRegistrations, Secrets, ServiceAccounts) is created for each cluster in the fleet-default namespace. The customer sees a large number of timeouts to downstream clusters, possibly causing Fleet to treat them as new clusters.
Expected behavior:
Smaller, appropriate number of resources created for each cluster.