
[Backport v2.7.6][SURE-6645] Fleet creating large number of resources for each cluster #1651 #1656

Closed
Jono-SUSE-Rancher opened this issue Jul 18, 2023 · 4 comments


@Jono-SUSE-Rancher

This is a backport issue of #1651 to go into v2.7.6.

Internal reference: SURE-6645

Environment
Rancher Cluster:
Rancher version: 2.7.5
Number of nodes: 37 (30 for ArgoCD)

Downstream Cluster:
Number of Downstream clusters: 1132
RKE/RKE2/K3S version: 2.7.5
Kubernetes version: 1.25.6
CNI: calico + traefik

Other:
Underlying Infrastructure: Rancher running in managed AKS cluster

Issue description:
etcd has filled up with a large number of Fleet resources in the fleet-default namespace. The number of resources is not proportionate to the number of downstream clusters.

ROLE:
request-6s7nb 2023-07-13T12:46:51Z
ROLEBINDING:
request-6s7nb Role/request-6s7nb 25h
CLUSTERREGISTRATION:
request-6s7nb s0409 {"management.cattle.io/cluster-display-name":"s0409","management.cattle.io/cluster-name":"c-m-zcllc9kx","objectset.rio.cattle.io/hash":"d7128da63a4e07ede9c4d36b1a5fd60b31ce3d45","provider.cattle.io":"k3s"}
For example, the cluster s0409 has 170 similar requests.
Some clusters have 300+ requests for clusterregistration, none of which are cleaned up.
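
A rough per-cluster count of these registrations can be produced with a one-liner like the following. This is a sketch: it assumes the registrations live in the fleet-default workspace and that status.clusterName on each ClusterRegistration holds the Fleet cluster name.

```sh
# Count ClusterRegistration objects per cluster in fleet-default.
# Adjust the namespace if clusters are registered into a different workspace.
kubectl get clusterregistrations.fleet.cattle.io -n fleet-default \
  -o jsonpath='{range .items[*]}{.status.clusterName}{"\n"}{end}' \
  | sort | uniq -c | sort -rn | head
```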

Business impact:
Kube-apiserver in AKS is hitting its limits and is causing a large number of timeouts when all clusters attempt to connect at once on startup. It takes ~30 minutes for clusters to be able to connect. etcd is getting overloaded and is around 2 GB even after compact/defrag.

Troubleshooting steps:
Worked with the AKS team to reduce load and pinpoint what is filling etcd. Argo takes up a large amount of space, but most API calls are coming from Rancher. See images for API volume to kube-apiserver, as well as etcd object count and size. You can see ~80k cluster registrations, which is far too many for the 1,132 clusters they have registered. Why would that be?
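
Since direct etcd access is not available on a managed AKS control plane, a similar survey can be done through the API server. The loop below is a sketch assuming the default fleet-default workspace:

```sh
# Rough object counts for the resource types involved, via the API server.
for kind in clusterregistrations.fleet.cattle.io roles rolebindings \
            serviceaccounts secrets; do
  echo -n "$kind: "
  kubectl get "$kind" -n fleet-default --no-headers 2>/dev/null | wc -l
done
```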

Actual behavior:
A large number of Fleet resources (roles, rolebindings, clusterregistrations, secrets, serviceAccounts) is created for each cluster in the fleet-default namespace. The customer sees a large number of timeouts to downstream clusters, possibly causing Fleet to treat them as new clusters.

Expected behavior:
A smaller, appropriate number of resources is created for each cluster.

@sbulage
Contributor

sbulage commented Aug 1, 2023

I have followed the Additional QA Template from the parent issue.

Cluster information:

Rancher: 2.7.5
Fleet: 0.7.1-rc.1
1 Upstream (local) cluster and 2 downstream clusters (imported 1 after another)

I kept the cluster alive for around 4.8 days. While it was running I kept patching clusterregistrations as mentioned in the QA template.

During that time I did not see any increase in ClusterRoles, Roles, RoleBindings, or clusterregistrations.
Each time, I forcefully re-registered the cluster using different values in the command:

kubectl patch clusters.fleet.cattle.io -n fleet-local local --type=json -p '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": 2}]'

I was also monitoring the fleet-controller pod logs and saw clear messages there:

fleet-controller pod logs

time="2023-07-27T10:18:15Z" level=info msg="Deleted old agent for cluster (fleet-local/local) in namespace cattle-fleet-local-system"
time="2023-07-27T10:18:15Z" level=info msg="Cluster import for 'fleet-local/local'. Deployed new agent"
time="2023-07-27T10:18:16Z" level=info msg="Cluster registration request 'fleet-local/request-lkcdz', cluster 'fleet-local/local' granted [false], creating cluster and request service account"
time="2023-07-27T10:18:16Z" level=info msg="Waiting for service account token key to be populated for secret cluster-fleet-local-local-1a3d67d0a899/request-lkcdz-7f700864-93e7-44f0-90b5-8f07e4af96fd-token"
time="2023-07-27T10:18:17Z" level=info msg="Namespace assigned to cluster 'fleet-local/local' enqueues cluster registration 'fleet-local/request-lkcdz'"
time="2023-07-27T10:18:18Z" level=info msg="Cluster registration request 'fleet-local/request-lkcdz', cluster 'fleet-local/local' granted [true], creating cluster and request service account"
time="2023-07-27T10:18:18Z" level=info msg="Deleting old clusterregistration 'fleet-local/request-8q74c', now at 'request-lkcdz'"
time="2023-07-27T10:20:39Z" level=info msg="Deleted old agent for cluster (fleet-local/local) in namespace cattle-fleet-local-system"
time="2023-07-27T10:20:39Z" level=info msg="Cluster import for 'fleet-local/local'. Deployed new agent"
time="2023-07-27T10:20:39Z" level=info msg="Cluster registration request 'fleet-local/request-b4fj6', cluster 'fleet-local/local' granted [false], creating cluster and request service account"
time="2023-07-27T10:20:39Z" level=info msg="Cluster registration request 'fleet-local/request-g7hqs', cluster 'fleet-local/local' granted [false], creating cluster and request service account"
time="2023-07-27T10:20:39Z" level=info msg="Waiting for service account token key to be populated for secret cluster-fleet-local-local-1a3d67d0a899/request-b4fj6-f59824a5-ce0a-4626-94df-fa9a627ccd93-token"
time="2023-07-27T10:20:39Z" level=info msg="Waiting for service account token key to be populated for secret cluster-fleet-local-local-1a3d67d0a899/request-g7hqs-7c42397f-42de-4fce-b18c-53709390786e-token"
time="2023-07-27T10:20:41Z" level=info msg="Namespace assigned to cluster 'fleet-local/local' enqueues cluster registration 'fleet-local/request-b4fj6'"
time="2023-07-27T10:20:41Z" level=info msg="Namespace assigned to cluster 'fleet-local/local' enqueues cluster registration 'fleet-local/request-g7hqs'"
time="2023-07-27T10:20:41Z" level=info msg="Cluster registration request 'fleet-local/request-b4fj6', cluster 'fleet-local/local' granted [true], creating cluster and request service account"
time="2023-07-27T10:20:41Z" level=info msg="Deleting old clusterregistration 'fleet-local/request-lkcdz', now at 'request-b4fj6'"
time="2023-07-27T10:20:41Z" level=info msg="Cluster registration request 'fleet-local/request-g7hqs', cluster 'fleet-local/local' granted [true], creating cluster and request service account"
time="2023-07-27T10:20:41Z" level=info msg="Cluster registration request 'fleet-local/request-b4fj6', cluster 'fleet-local/local' granted [true], creating cluster and request service account"
time="2023-07-27T10:20:41Z" level=info msg="Cluster registration request 'fleet-local/request-b4fj6', cluster 'fleet-local/local' granted [true], creating cluster and request service account"
time="2023-07-27T10:20:41Z" level=error msg="error syncing 'fleet-local/request-lkcdz': handler cluster-registration: failed to delete fleet-local/request-lkcdz rbac.authorization.k8s.io/v1, Kind=Role for cluster-registration fleet-local/request-lkcdz: roles.rbac.authorization.k8s.io "request-lkcdz" not found, requeuing"

Regressions

To test for regressions, the steps below were performed.

  • Scaled the fleet-controller deployment down to 0 replicas
  • Updated the Cluster spec (ClusterSpec) of the imported clusters
  • After every ClusterSpec update, I started fleet-controller again and saw the fleet-agent re-created on the imported clusters with the updated spec configuration (see the command sketch below)
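
For reference, those steps roughly correspond to commands like the following. This is a sketch: it assumes fleet-controller runs in cattle-fleet-system (the Rancher default), uses redeployAgentGeneration as one example spec change, and `<cluster-name>` is a placeholder for the imported cluster's Cluster object.

```sh
# 1. Stop the fleet-controller (it runs in cattle-fleet-system by default).
kubectl scale deployment fleet-controller -n cattle-fleet-system --replicas=0

# 2. Update the Cluster spec of an imported cluster while the controller is down;
#    redeployAgentGeneration is just one example field, <cluster-name> is a placeholder.
kubectl patch clusters.fleet.cattle.io -n fleet-default <cluster-name> --type=json \
  -p '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": 3}]'

# 3. Start the controller again and watch the fleet-agent get redeployed.
kubectl scale deployment fleet-controller -n cattle-fleet-system --replicas=1
kubectl logs -n cattle-fleet-system deploy/fleet-controller -f
```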

@sbulage
Contributor

sbulage commented Aug 2, 2023

I don't see a large number of resources being created in the cluster. The fix is working as expected.

@sbulage
Contributor

sbulage commented Aug 9, 2023

I have followed the Additional QA Template from the parent issue.

Cluster information:

Rancher: 2.7.5
Fleet: 0.7.1-rc.1
1 Upstream (local) cluster and 2 downstream clusters (imported 1 after another)

At the time of testing there was no 2.7.6 RC available, so I tested with Fleet 0.7.1-rc.1.
To reproduce the issue, I installed Rancher 2.7.5 with Fleet 0.7.0.

I then upgraded Fleet to 0.7.1-rc.1 and verified that Fleet no longer created a large number of resources for each cluster. This scenario took ~5 days.

If this issue needs to be re-tested with Rancher 2.7.6, the cluster will have to keep running for at least 2-3 days to check whether more resources are created or not.

I am currently waiting for the Fleet RC that includes the fix from #1692. In any case I need to upgrade the cluster from Rancher 2.7.5 to 2.7.6, check whether extra cluster registrations are created, and repeat the regression testing described in this issue.
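
When testing that upgrade path, the installed Fleet version and registration count can be confirmed before and after with commands like these (a sketch, assuming Fleet is installed as a Helm chart in cattle-fleet-system and registrations live in fleet-default, the Rancher defaults):

```sh
# Confirm the deployed Fleet chart/app version before and after the upgrade.
helm list -n cattle-fleet-system

# Re-check that registrations are not piling up.
kubectl get clusterregistrations.fleet.cattle.io -n fleet-default --no-headers | wc -l
```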

I kept the cluster alive for around 4.8 days. While it was running I kept patching clusterregistrations as mentioned in the QA template.

During that time I did not see any increase in ClusterRoles, Roles, RoleBindings, or clusterregistrations. Each time, I forcefully re-registered the cluster using different values in the command:

kubectl patch clusters.fleet.cattle.io -n fleet-local local --type=json -p '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": 2}]'

I was also monitoring the fleet-controller pod logs and saw clear messages there:

**fleet-controller pod logs**
Regressions

To test for regressions, the steps below were performed.

  • Scaled the fleet-controller deployment down to 0 replicas
  • Updated the Cluster spec (ClusterSpec) of the imported clusters
  • After every ClusterSpec update, I started fleet-controller again and saw the fleet-agent re-created on the imported clusters with the updated spec configuration

@sbulage
Contributor

sbulage commented Aug 16, 2023

I have re-tested this issue on Rancher 2.7.6-rc2 and Fleet 0.7.1-rc.2.
