
[SURE-6645] Fleet creating large number of resources for each cluster #1651

Closed
kkaempf opened this issue Jul 17, 2023 · 5 comments
kkaempf commented Jul 17, 2023

Internal reference: SURE-6645

Environment

Rancher Cluster:
Rancher version: 2.7.5
Number of nodes: 37 (30 for Argo CD)

Downstream Cluster:
Number of Downstream clusters: 1132
RKE/RKE2/K3S version: 2.7.5
Kubernetes version: 1.25.6
CNI: calico + traefik

Other:
Underlying Infrastructure: Rancher running in managed AKS cluster

Issue description:

etcd has filled up with a large number of Fleet resources, i.e. resources in the fleet-default namespace. Their number is disproportionate to the number of downstream clusters.

```
ROLE:
request-6s7nb        2023-07-13T12:46:51Z
ROLEBINDING:
request-6s7nb        Role/request-6s7nb        25h
CLUSTERREGISTRATION:
request-6s7nb   s0409   {"management.cattle.io/cluster-display-name":"s0409","management.cattle.io/cluster-name":"c-m-zcllc9kx","objectset.rio.cattle.io/hash":"d7128da63a4e07ede9c4d36b1a5fd60b31ce3d45","provider.cattle.io":"k3s"}
```

For example, the cluster s0409 has 170 similar requests.
Some clusters have 300+ clusterregistration requests, none of which are cleaned up.
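
A quick way to gauge the buildup is to count registrations per cluster. This is an illustrative sketch, not from the issue; it assumes kubectl and jq are available and that each registration's status.clusterName holds the cluster name, as the second column in the output above suggests:

```sh
# Count clusterregistrations per downstream cluster in fleet-default;
# healthy clusters should only have a handful each.
kubectl get clusterregistrations.fleet.cattle.io -n fleet-default -o json \
  | jq -r '.items[].status.clusterName' \
  | sort | uniq -c | sort -rn | head
```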

Business impact:

The kube-apiserver in AKS is hitting its limits, causing a large number of timeouts when all clusters attempt to connect at once on startup. It takes ~30 minutes for clusters to be able to connect. etcd is getting overloaded and is around 2 GB even after compact/defrag.

Troubleshooting steps:

Worked with the AKS team to reduce load and pinpoint what is filling etcd. Argo CD takes up a large amount of space, but most API calls are coming from Rancher. See the attached images for API volume to the kube-apiserver, as well as etcd object count and size. They show 80k cluster registrations, which is large for the 1,132 clusters registered. Why would that be?

Actual behavior:

A large number of Fleet resources (roles, rolebindings, clusterregistrations, secrets, serviceaccounts) is created for each cluster in the fleet-default namespace. The customer sees a large number of timeouts to downstream clusters, possibly causing Fleet to treat them as new clusters.

Expected behavior:

Smaller, appropriate number of resources created for each cluster.

@kkaempf kkaempf added this to the 2023-Q3-v2.7x milestone Jul 17, 2023
@kkaempf kkaempf added the JIRA Must shout label Jul 17, 2023

kkaempf commented Jul 17, 2023

See also #1615.

@moio moio changed the title Fleet creating large number of resources for each cluster [SURE-6645] Fleet creating large number of resources for each cluster Jul 18, 2023

weyfonk commented Jul 18, 2023

See this script as a possible workaround, deleting obsolete cluster registrations for each downstream cluster.
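
The linked script isn't reproduced in this thread; a minimal sketch of the same idea might look like the following. It assumes the newest registration per cluster should survive and that status.clusterName identifies the cluster; untested, so dry-run before using it against a real installation:

```sh
#!/bin/sh
# Sketch: delete all but the newest clusterregistration per cluster in
# fleet-default. Registrations are sorted oldest-to-newest, so the last
# name collected per cluster is the one we keep.
NS=fleet-default
kubectl get clusterregistrations.fleet.cattle.io -n "$NS" \
    --sort-by=.metadata.creationTimestamp \
    -o jsonpath='{range .items[*]}{.status.clusterName} {.metadata.name}{"\n"}{end}' \
  | awk '{names[$1] = names[$1] " " $2}
         END {for (c in names) {n = split(names[c], a, " ")
              for (i = 1; i < n; i++) print a[i]}}' \
  | xargs -r kubectl delete clusterregistrations.fleet.cattle.io -n "$NS"
```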


manno commented Jul 24, 2023

This is the same issue as #1615


manno commented Jul 24, 2023

Additional QA

Problem

Re-registering clusters can leave clusterregistrations and their child resources (service accounts, roles, role bindings) behind.
Previously, agents were re-registered whenever the fleet controller started. This made sure they used the right config, such as the CA and API server URL.

Solution

The agent is only re-registered if the API server URL or CA changed; a new status field stores the old values.
Whenever an agent is granted a registration, the old clusterregistration resources are removed. Not all of them, since multiple registrations may be active at the same time, but enough to avoid amassing many resources.
The Fleet controller and Kubernetes then remove the child resources.
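
To check the stored values, inspecting the cluster status should work. The exact field names below (apiServerURL, apiServerCAHash) are an assumption about how the fix records them:

```sh
# Show the API server values remembered on the Cluster resource;
# re-registration should only happen when these differ from the config.
# NOTE: field names are assumed, verify against the installed CRD.
kubectl get clusters.fleet.cattle.io -n fleet-local local \
  -o jsonpath='{.status.apiServerURL}{"\n"}{.status.apiServerCAHash}{"\n"}'
```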

Testing

Restart the controller, see that no agents are re-registered.
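
For the restart, something like this should do (deployment name and namespace assume a default Fleet install in Rancher):

```sh
# Restart the Fleet controller; with the fix in place this should NOT
# trigger a re-registration of every downstream agent.
kubectl rollout restart deployment/fleet-controller -n cattle-fleet-system
```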

Force agent re-registration, e.g. by bumping the redeployAgentGeneration field:

```sh
kubectl patch clusters.fleet.cattle.io -n fleet-local local --type=json -p '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": 2}]'
kubectl patch clusters.fleet.cattle.io -n fleet-local local --type=json -p '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": 1}]'
kubectl patch clusters.fleet.cattle.io -n fleet-local local --type=json -p '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": 3}]'
```
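
To verify that registrations are cleaned up rather than amassed, compare the count before and after forcing re-registrations; a simple check:

```sh
# Total clusterregistrations across namespaces; this number should stay
# roughly constant instead of growing with every re-registration.
kubectl get clusterregistrations.fleet.cattle.io -A --no-headers | wc -l
```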

Engineering Testing

Manual Testing

Tested manually with multiple clusters.

Automated Testing

We have no automation that can create and register clusters.

QA Testing Considerations

Agents will still re-register, e.g. when Fleet is updated and the agent configuration changes. A re-deploy, in contrast to a full re-registration, will always happen, e.g. during upgrades when the agent image changes.

Regressions Considerations

Not all agent configuration changes are propagated automatically. For example, if the Fleet controller is stopped and the agent-related fields of the cluster resource are modified (https://fleet.rancher.io/ref-crds#clusterspec), starting the controller will not automatically redeploy the cluster's agent.


sbulage commented Aug 9, 2023

Detailed testing information can be found in #1690 (comment).

@sbulage sbulage closed this as completed Aug 9, 2023