
[Backport v2.7.6][SURE-6645] Fleet creating large number of resources for each cluster #1651 #1656

Closed
Jono-SUSE-Rancher opened this issue Jul 18, 2023 · 4 comments


@Jono-SUSE-Rancher

This is a backport issue of #1651 to go into v2.7.6.

Internal reference: SURE-6645

Environment
Rancher Cluster:
Rancher version: 2.7.5
Number of nodes: 37 (30 for ArgoCD)

Downstream Cluster:
Number of Downstream clusters: 1132
RKE/RKE2/K3S version: 2.7.5
Kubernetes version: 1.25.6
CNI: calico + traefik

Other:
Underlying Infrastructure: Rancher running in managed AKS cluster

Issue description:
etcd has filled up with a large number of Fleet resources in the fleet-default namespace. The number of resources is not proportionate to the number of downstream clusters.

ROLE:
request-6s7nb 2023-07-13T12:46:51Z
ROLEBINDING:
request-6s7nb Role/request-6s7nb 25h
CLUSTERREGISTRATION:
request-6s7nb s0409 {"management.cattle.io/cluster-display-name":"s0409","management.cattle.io/cluster-name":"c-m-zcllc9kx","objectset.rio.cattle.io/hash":"d7128da63a4e07ede9c4d36b1a5fd60b31ce3d45","provider.cattle.io":"k3s"}
For example, the cluster s0409 has 170 similar requests.
Some clusters have 300+ requests for clusterregistration, none of which are cleaned up.
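
A rough per-cluster count of these registrations can be produced with a one-liner like the following. This is a sketch: it assumes the registrations live in the fleet-default workspace and that status.clusterName on each ClusterRegistration holds the Fleet cluster name.

```sh
# Count ClusterRegistration objects per cluster in fleet-default.
# Adjust the namespace if clusters are registered into a different workspace.
kubectl get clusterregistrations.fleet.cattle.io -n fleet-default \
  -o jsonpath='{range .items[*]}{.status.clusterName}{"\n"}{end}' \
  | sort | uniq -c | sort -rn | head
```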

Business impact:
Kube-apiserver in AKS is hitting its limits and is causing a large number of timeouts when all clusters attempt to connect at once on startup. It takes ~30 minutes for clusters to be able to connect. etcd is getting overloaded and is around 2 GB even after compact/defrag.

Troubleshooting steps:
Worked with the AKS team to reduce load and pinpoint what is filling etcd. Argo takes up a large amount of space, but most API calls are coming from Rancher. See images for API volume to kube-apiserver, as well as etcd object count and size. You can see ~80k cluster registrations, which is far too many for the 1,132 clusters they have registered. Why would that be?
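
Since direct etcd access is not available on a managed AKS control plane, a similar survey can be done through the API server. The loop below is a sketch assuming the default fleet-default workspace:

```sh
# Rough object counts for the resource types involved, via the API server.
for kind in clusterregistrations.fleet.cattle.io roles rolebindings \
            serviceaccounts secrets; do
  echo -n "$kind: "
  kubectl get "$kind" -n fleet-default --no-headers 2>/dev/null | wc -l
done
```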

Actual behavior:
A large number of Fleet resources (roles, rolebindings, clusterregistrations, secrets, serviceAccounts) is created for each cluster in the fleet-default namespace. The customer sees a large number of timeouts to downstream clusters, possibly causing Fleet to treat them as new clusters.

Expected behavior:
A smaller, appropriate number of resources is created for each cluster.

@sbulage
Contributor

sbulage commented Aug 1, 2023

I have followed the Additional QA Template from the parent issue.

Cluster information:

Rancher: 2.7.5
Fleet: 0.7.1-rc.1
1 Upstream (local) cluster and 2 downstream clusters (imported 1 after another)

I kept the cluster alive for around 4.8 days. While it was running I kept patching clusterregistrations as mentioned in the QA template.

During that time I did not see any increase in ClusterRoles, Roles, RoleBindings, or clusterregistrations.
Each time, I forcefully re-registered the cluster using different values in the command:

kubectl patch clusters.fleet.cattle.io -n fleet-local local --type=json -p '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": 2}]'

I was also monitoring the fleet-controller pod logs and saw clear messages there:

fleet-controller pod logs

time="2023-07-27T10:18:15Z" level=info msg="Deleted old agent for cluster (fleet-local/local) in namespace cattle-fleet-local-system"
time="2023-07-27T10:18:15Z" level=info msg="Cluster import for 'fleet-local/local'. Deployed new agent"
time="2023-07-27T10:18:16Z" level=info msg="Cluster registration request 'fleet-local/request-lkcdz', cluster 'fleet-local/local' granted [false], creating cluster and request service account"
time="2023-07-27T10:18:16Z" level=info msg="Waiting for service account token key to be populated for secret cluster-fleet-local-local-1a3d67d0a899/request-lkcdz-7f700864-93e7-44f0-90b5-8f07e4af96fd-token"
time="2023-07-27T10:18:17Z" level=info msg="Namespace assigned to cluster 'fleet-local/local' enqueues cluster registration 'fleet-local/request-lkcdz'"
time="2023-07-27T10:18:18Z" level=info msg="Cluster registration request 'fleet-local/request-lkcdz', cluster 'fleet-local/local' granted [true], creating cluster and request service account"
time="2023-07-27T10:18:18Z" level=info msg="Deleting old clusterregistration 'fleet-local/request-8q74c', now at 'request-lkcdz'"
time="2023-07-27T10:20:39Z" level=info msg="Deleted old agent for cluster (fleet-local/local) in namespace cattle-fleet-local-system"
time="2023-07-27T10:20:39Z" level=info msg="Cluster import for 'fleet-local/local'. Deployed new agent"
time="2023-07-27T10:20:39Z" level=info msg="Cluster registration request 'fleet-local/request-b4fj6', cluster 'fleet-local/local' granted [false], creating cluster and request service account"
time="2023-07-27T10:20:39Z" level=info msg="Cluster registration request 'fleet-local/request-g7hqs', cluster 'fleet-local/local' granted [false], creating cluster and request service account"
time="2023-07-27T10:20:39Z" level=info msg="Waiting for service account token key to be populated for secret cluster-fleet-local-local-1a3d67d0a899/request-b4fj6-f59824a5-ce0a-4626-94df-fa9a627ccd93-token"
time="2023-07-27T10:20:39Z" level=info msg="Waiting for service account token key to be populated for secret cluster-fleet-local-local-1a3d67d0a899/request-g7hqs-7c42397f-42de-4fce-b18c-53709390786e-token"
time="2023-07-27T10:20:41Z" level=info msg="Namespace assigned to cluster 'fleet-local/local' enqueues cluster registration 'fleet-local/request-b4fj6'"
time="2023-07-27T10:20:41Z" level=info msg="Namespace assigned to cluster 'fleet-local/local' enqueues cluster registration 'fleet-local/request-g7hqs'"
time="2023-07-27T10:20:41Z" level=info msg="Cluster registration request 'fleet-local/request-b4fj6', cluster 'fleet-local/local' granted [true], creating cluster and request service account"
time="2023-07-27T10:20:41Z" level=info msg="Deleting old clusterregistration 'fleet-local/request-lkcdz', now at 'request-b4fj6'"
time="2023-07-27T10:20:41Z" level=info msg="Cluster registration request 'fleet-local/request-g7hqs', cluster 'fleet-local/local' granted [true], creating cluster and request service account"
time="2023-07-27T10:20:41Z" level=info msg="Cluster registration request 'fleet-local/request-b4fj6', cluster 'fleet-local/local' granted [true], creating cluster and request service account"
time="2023-07-27T10:20:41Z" level=info msg="Cluster registration request 'fleet-local/request-b4fj6', cluster 'fleet-local/local' granted [true], creating cluster and request service account"
time="2023-07-27T10:20:41Z" level=error msg="error syncing 'fleet-local/request-lkcdz': handler cluster-registration: failed to delete fleet-local/request-lkcdz rbac.authorization.k8s.io/v1, Kind=Role for cluster-registration fleet-local/request-lkcdz: roles.rbac.authorization.k8s.io "request-lkcdz" not found, requeuing"

Regressions

To test for regressions, the steps below were performed.

  • Scaled the fleet-controller deployment down to 0 replicas
  • Updated the Cluster spec (ClusterSpec) of the imported clusters
  • After every ClusterSpec update, I started fleet-controller again and saw the fleet-agent re-created on the imported clusters with the updated spec configuration (see the command sketch below)
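
For reference, those steps roughly correspond to commands like the following. This is a sketch: it assumes fleet-controller runs in cattle-fleet-system (the Rancher default), uses redeployAgentGeneration as one example spec change, and `<cluster-name>` is a placeholder for the imported cluster's Cluster object.

```sh
# 1. Stop the fleet-controller (it runs in cattle-fleet-system by default).
kubectl scale deployment fleet-controller -n cattle-fleet-system --replicas=0

# 2. Update the Cluster spec of an imported cluster while the controller is down;
#    redeployAgentGeneration is just one example field, <cluster-name> is a placeholder.
kubectl patch clusters.fleet.cattle.io -n fleet-default <cluster-name> --type=json \
  -p '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": 3}]'

# 3. Start the controller again and watch the fleet-agent get redeployed.
kubectl scale deployment fleet-controller -n cattle-fleet-system --replicas=1
kubectl logs -n cattle-fleet-system deploy/fleet-controller -f
```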

@sbulage
Contributor

sbulage commented Aug 2, 2023

I don't see a large number of resources being created in the cluster. The fix is working as expected.

@sbulage
Contributor

sbulage commented Aug 9, 2023

I have followed the Additional QA Template from the parent issue.

Cluster information:

Rancher: 2.7.5
Fleet: 0.7.1-rc.1
1 Upstream (local) cluster and 2 downstream clusters (imported 1 after another)

At the time of testing there was no 2.7.6 RC available, so I tested with Fleet 0.7.1-rc.1.
To reproduce the issue, I installed Rancher 2.7.5 with Fleet 0.7.0.

I then upgraded Fleet to 0.7.1-rc.1 and verified that Fleet no longer created a large number of resources for each cluster. This scenario took ~5 days.

If this issue needs to be re-tested with Rancher 2.7.6, the cluster will have to keep running for at least 2-3 days to check whether more resources are created or not.

I am currently waiting for the Fleet RC that includes the fix from #1692. In any case I need to upgrade the cluster from Rancher 2.7.5 to 2.7.6, check whether extra cluster registrations are created, and repeat the regression testing described in this issue.
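
When testing that upgrade path, the installed Fleet version and registration count can be confirmed before and after with commands like these (a sketch, assuming Fleet is installed as a Helm chart in cattle-fleet-system and registrations live in fleet-default, the Rancher defaults):

```sh
# Confirm the deployed Fleet chart/app version before and after the upgrade.
helm list -n cattle-fleet-system

# Re-check that registrations are not piling up.
kubectl get clusterregistrations.fleet.cattle.io -n fleet-default --no-headers | wc -l
```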

I kept the cluster alive for around 4.8 days. While it was running I kept patching clusterregistrations as mentioned in the QA template.

During that time I did not see any increase in ClusterRoles, Roles, RoleBindings, or clusterregistrations. Each time, I forcefully re-registered the cluster using different values in the command:

kubectl patch clusters.fleet.cattle.io -n fleet-local local --type=json -p '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": 2}]'

I was also monitoring the fleet-controller pod logs and saw clear messages there:

**fleet-controller pod logs**
Regressions

To test for regressions, the steps below were performed.

  • Scaled the fleet-controller deployment down to 0 replicas
  • Updated the Cluster spec (ClusterSpec) of the imported clusters
  • After every ClusterSpec update, I started fleet-controller again and saw the fleet-agent re-created on the imported clusters with the updated spec configuration

@sbulage
Contributor

sbulage commented Aug 16, 2023

I have re-tested this issue on Rancher 2.7.6-rc2 and Fleet 0.7.1-rc.2.
