
[SURE-6645] Fleet creating large number of resources for each cluster #1651

Closed
kkaempf opened this issue Jul 17, 2023 · 5 comments
kkaempf commented Jul 17, 2023

Internal reference: SURE-6645

Environment

Rancher Cluster:
Rancher version: 2.7.5
Number of nodes: 37 (30 for Argo CD)

Downstream Cluster:
Number of Downstream clusters: 1132
RKE/RKE2/K3S version: 2.7.5
Kubernetes version: 1.25.6
CNI: calico + traefik

Other:
Underlying Infrastructure: Rancher running in managed AKS cluster

Issue description:

etcd has filled up with a large number of Fleet resources, i.e. resources in the fleet-default namespace. Their number is disproportionate to the number of downstream clusters.

```
ROLE:
request-6s7nb        2023-07-13T12:46:51Z
ROLEBINDING:
request-6s7nb        Role/request-6s7nb        25h
CLUSTERREGISTRATION:
request-6s7nb   s0409   {"management.cattle.io/cluster-display-name":"s0409","management.cattle.io/cluster-name":"c-m-zcllc9kx","objectset.rio.cattle.io/hash":"d7128da63a4e07ede9c4d36b1a5fd60b31ce3d45","provider.cattle.io":"k3s"}
```

For example, the cluster s0409 has 170 similar requests.
Some clusters have 300+ clusterregistration requests, none of which are cleaned up.
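
A quick way to gauge the buildup is to count registrations per cluster. This is an illustrative sketch, not from the issue; it assumes kubectl and jq are available and that each registration's status.clusterName holds the cluster name, as the second column in the output above suggests:

```sh
# Count clusterregistrations per downstream cluster in fleet-default;
# healthy clusters should only have a handful each.
kubectl get clusterregistrations.fleet.cattle.io -n fleet-default -o json \
  | jq -r '.items[].status.clusterName' \
  | sort | uniq -c | sort -rn | head
```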

Business impact:

The kube-apiserver in AKS is hitting its limits, causing a large number of timeouts when all clusters attempt to connect at once on startup. It takes ~30 minutes for clusters to be able to connect. etcd is getting overloaded and is around 2 GB even after compact/defrag.

Troubleshooting steps:

Worked with the AKS team to reduce load and pinpoint what is filling etcd. Argo CD takes up a large amount of space, but most API calls are coming from Rancher. See the attached images for API volume to the kube-apiserver, as well as etcd object count and size. They show 80k cluster registrations, which is large for the 1,132 clusters registered. Why would that be?

Actual behavior:

A large number of Fleet resources (roles, rolebindings, clusterregistrations, secrets, serviceaccounts) is created for each cluster in the fleet-default namespace. The customer sees a large number of timeouts to downstream clusters, possibly causing Fleet to treat them as new clusters.

Expected behavior:

Smaller, appropriate number of resources created for each cluster.

@kkaempf kkaempf added this to the 2023-Q3-v2.7x milestone Jul 17, 2023
@kkaempf kkaempf added the JIRA Must shout label Jul 17, 2023

kkaempf commented Jul 17, 2023

See also #1615.

@moio moio changed the title Fleet creating large number of resources for each cluster [SURE-6645] Fleet creating large number of resources for each cluster Jul 18, 2023

weyfonk commented Jul 18, 2023

See this script as a possible workaround, deleting obsolete cluster registrations for each downstream cluster.
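
The linked script isn't reproduced in this thread; a minimal sketch of the same idea might look like the following. It assumes the newest registration per cluster should survive and that status.clusterName identifies the cluster; untested, so dry-run before using it against a real installation:

```sh
#!/bin/sh
# Sketch: delete all but the newest clusterregistration per cluster in
# fleet-default. Registrations are sorted oldest-to-newest, so the last
# name collected per cluster is the one we keep.
NS=fleet-default
kubectl get clusterregistrations.fleet.cattle.io -n "$NS" \
    --sort-by=.metadata.creationTimestamp \
    -o jsonpath='{range .items[*]}{.status.clusterName} {.metadata.name}{"\n"}{end}' \
  | awk '{names[$1] = names[$1] " " $2}
         END {for (c in names) {n = split(names[c], a, " ")
              for (i = 1; i < n; i++) print a[i]}}' \
  | xargs -r kubectl delete clusterregistrations.fleet.cattle.io -n "$NS"
```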


manno commented Jul 24, 2023

This is the same issue as #1615


manno commented Jul 24, 2023

Additional QA

Problem

Re-registering clusters can leave clusterregistrations and their child resources (service accounts, roles, role bindings) behind.
Previously, agents were re-registered whenever the fleet controller started. This made sure they used the right config, such as the CA and API server URL.

Solution

The agent is only re-registered if the API server URL or CA changed; a new status field stores the old values.
Whenever an agent is granted a registration, the old clusterregistration resources are removed. Not all of them, since multiple registrations may be active at the same time, but enough to avoid amassing many resources.
The Fleet controller and Kubernetes then remove the child resources.
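
To check the stored values, inspecting the cluster status should work. The exact field names below (apiServerURL, apiServerCAHash) are an assumption about how the fix records them:

```sh
# Show the API server values remembered on the Cluster resource;
# re-registration should only happen when these differ from the config.
# NOTE: field names are assumed, verify against the installed CRD.
kubectl get clusters.fleet.cattle.io -n fleet-local local \
  -o jsonpath='{.status.apiServerURL}{"\n"}{.status.apiServerCAHash}{"\n"}'
```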

Testing

Restart the controller, see that no agents are re-registered.
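
For the restart, something like this should do (deployment name and namespace assume a default Fleet install in Rancher):

```sh
# Restart the Fleet controller; with the fix in place this should NOT
# trigger a re-registration of every downstream agent.
kubectl rollout restart deployment/fleet-controller -n cattle-fleet-system
```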

Force agent re-registration, e.g. by bumping the redeployAgentGeneration field:

```sh
kubectl patch clusters.fleet.cattle.io -n fleet-local local --type=json -p '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": 2}]'
kubectl patch clusters.fleet.cattle.io -n fleet-local local --type=json -p '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": 1}]'
kubectl patch clusters.fleet.cattle.io -n fleet-local local --type=json -p '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": 3}]'
```
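
To verify that registrations are cleaned up rather than amassed, compare the count before and after forcing re-registrations; a simple check:

```sh
# Total clusterregistrations across namespaces; this number should stay
# roughly constant instead of growing with every re-registration.
kubectl get clusterregistrations.fleet.cattle.io -A --no-headers | wc -l
```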

Engineering Testing

Manual Testing

Tested manually with multiple clusters.

Automated Testing

We have no automation that can create and register clusters.

QA Testing Considerations

Agents will still re-register, e.g. when Fleet is updated and the agent configuration changes. A re-deploy, in contrast to a full re-registration, will always happen, e.g. during upgrades when the agent image changes.

Regressions Considerations

Not all agent configuration changes are propagated automatically. For example, if the Fleet controller is stopped and the agent-related fields of the cluster resource are modified (https://fleet.rancher.io/ref-crds#clusterspec), starting the controller will not automatically redeploy the cluster's agent.


sbulage commented Aug 9, 2023

Detailed testing information can be found in #1690 (comment).

@sbulage sbulage closed this as completed Aug 9, 2023