Clean up existing ClusterRegistrations on Fleet Upgrade #1690

manno · 2023-08-02T11:01:37Z

This is an extension to #1651.
Should also fix #1674
It needs a backport to 0.7.x.

Implemented by:

Add hook on upgrade to clean up old, duplicate clusterregistrations #1689

Fleet 0.7.0 creates multiple clusterregistration resources and does not clean them up. This adds a Helm hook that runs a clean up script when upgrading Fleet.

We assume each agent only uses its latest clusterregistration, so the script cleans up all older ones. The script does not check whether a registration was granted. It also tries to delete the child resources. If the fleet-controller is running, its clean up handler would delete the orphaned resources as well. The script operates across all namespaces.
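As a rough illustration of the selection rule described above (keep the newest registration per cluster, treat the rest as outdated), here is a minimal sketch. It is not the script from #1689. The function name and the input format (`namespace cluster creationTimestamp name`, one registration per line) are assumptions made for this example, as is the idea that each registration can be mapped to a cluster and ordered by creation timestamp.

```shell
#!/bin/bash
# Hypothetical helper, not the actual clean up script.
# Reads lines of "namespace cluster creationTimestamp name" on stdin and
# prints "namespace name" for every registration except the newest one
# per (namespace, cluster) pair.
outdated_registrations() {
  # Timestamps in the same UTC format (e.g. 2023-08-02T11:01:37Z) sort
  # correctly as plain strings; sort newest first within each cluster.
  LC_ALL=C sort -k1,1 -k2,2 -k3,3r |
    awk '{ key = $1 "/" $2; if (seen[key]++) print $1, $4 }'
}
```

Only the first (newest) line per cluster is skipped, so a cluster with a single registration produces no output.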

The migration job can be disabled via helm values.

Testing

Install a Rancher/Fleet version that does not have the automatic clean up after registration, e.g. Rancher 2.7.5.
Create a situation where there are multiple outdated clusterregistrations, e.g. by forcing agent redeployments a few times:

#!/bin/bash

ns=${1:-fleet-local}
name=${2:-local}
kubectl patch clusters.fleet.cattle.io -n "$ns" "$name" --type=json -p '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": '$RANDOM'}]'

Try to have some outdated registrations for clusters that have been deleted, probably by creating lots of registrations, stopping the fleet controller, and manually deleting the clusters.fleet.cattle.io resource (or the whole cluster in Rancher?).
Upgrade to a Fleet version with the clean up upgrade job and check that all outdated clusterregistrations are removed.
Verify that existing agents are still registered and can connect to the upstream API server. This can be checked by deploying a new bundle.
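The redeployment step above can be repeated in a loop to build up outdated registrations faster. This is a minimal sketch that reuses the same patch. The function names, the default count, and the pause between patches are assumptions for illustration, not part of the original test plan.

```shell
#!/bin/bash
# Build the same JSON patch as the script above for a given generation value.
redeploy_patch() {
  printf '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": %d}]' "$1"
}

# Hypothetical wrapper: force several agent redeployments of one cluster.
# Arguments: namespace, cluster name, number of redeployments.
# The body runs in a subshell so its variables do not leak to the caller.
force_redeploys() (
  ns=${1:-fleet-local}
  name=${2:-local}
  count=${3:-5}
  i=1
  while [ "$i" -le "$count" ]; do
    kubectl patch clusters.fleet.cattle.io -n "$ns" "$name" \
      --type=json -p "$(redeploy_patch "$RANDOM")"
    sleep 10 # arbitrary pause between patches (assumption)
    i=$((i + 1))
  done
)

# Example (requires kubectl access to the upstream cluster):
#   force_redeploys fleet-local local 10
```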

Engineering Testing

Manual Testing

Upgraded standalone Fleet multiple times and watched the job spawn. Checked with helm template that the new value works.

QA Testing Considerations

The clean up script might use a lot of resources and run for a long time if cleaning up lots of (20k+) resources.
It should be fine for smaller fleets (<20 clusters).

Regressions Considerations

Some fleets might have too many resources for an automatic clean up to be effective?


manno · 2023-08-02T11:50:52Z

/backport release/v0.7 fleet-v0.7.1-v2.7.6

rancherbot · 2023-08-02T11:50:53Z

@manno, Not creating backport issue for issue 1690 in repository fleet because milestone release/v0.7 does not exist or is not an open milestone

manno · 2023-08-02T14:14:34Z

/backport fleet-v0.7.1-v2.7.6 release/v0.7

sbulage · 2023-08-09T14:19:53Z

Issues #1651 and #1690 are fixes for resource cleanup during and after a cluster upgrade.

Issue #1651:

Cleanup of cluster registrations when re-registration happens, as well as preventing the creation of a large number of resources in the cluster.
QA template followed: [SURE-6645] Fleet creating large number of resources for each cluster #1651 (comment)

Issue #1690:

Cleanup of old cluster resources while Rancher/Fleet is being upgraded in the cluster.
QA template followed from description.

I followed the steps below to validate both issues, i.e. that cleanup happens while the upgrade is running, and later checked that the cluster registrations and associated resources were removed.

To reproduce the issue, the following steps were performed.

I kept the cluster running for around 5 days and observed the current cluster resources and cluster registrations.
The upgrade was performed from Rancher 2.7.5 to Rancher 2.7.7-rc1.

Observations

Before Upgrade
```
Rancher: v2.7.5
Fleet: v0.7.0
```
- During those days, added 3 GitRepos to the cluster.
- Observed the clusterregistrations before patching.
- Initially there were only a few clusterregistrations. As soon as I executed the command below, their number increased, meaning that old registrations weren't removed.
```
kubectl patch clusters.fleet.cattle.io -n fleet-local local --type=json -p '[{"op": "add", "path": "/spec/redeployAgentGeneration", "value": 2}]'
```
- Observed that the number of Roles and RoleBindings increased significantly.
- Every time I execute the above command, it creates new clusterregistrations without deleting the old ones.
- In my setup the clusterregistrations increased from 4 to 42.
- Other resources created by the clusterregistrations also increased.
- Deleted one of the clusters from clusters.fleet.cattle.io:
```
kubectl delete clusters.fleet.cattle.io -n fleet-default imported-cluster-1
```

After observing this situation over several days, I upgraded to the latest Rancher RC version and Fleet RC version, in which the fix is available.

After Upgrade
```
Rancher: v2.7.7-rc1
Fleet: 0.8.0-rc.7
```
- While the upgrade was happening, saw that the clusterregistrations went down to 4.
- The cluster that was deleted from clusters.fleet.cattle.io before the upgrade got re-added to Fleet.
- Tried to re-register the cluster by using the above command, but no old registrations were left behind.
- Re-registering a cluster deploys a new fleet-agent every time, which can be seen in the fleet-controller logs.
- The upgrade to Rancher 2.7.7-rc1 did no harm to the existing resources added by the GitRepos.
- After the upgrade, the cluster specs of the imported clusters are working as expected.
- Updated the cluster spec (ClusterSpec) of the imported clusters.
- After every ClusterSpec update, I started fleet-controller and saw that fleet-agent is re-created on the imported clusters with the updated spec configuration.
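The before and after counts reported above (42 registrations going down to 4) can be gathered with a small helper like the following. This is a minimal sketch. The function name is hypothetical, and it only assumes that the first column of the `kubectl get ... -A --no-headers` output is the namespace.

```shell
#!/bin/bash
# Hypothetical helper: reads `kubectl get ... -A --no-headers` output on stdin
# (first column is the namespace) and prints "<namespace> <count>" per
# namespace, sorted by namespace.
count_by_namespace() {
  awk '{ n[$1]++ } END { for (ns in n) print ns, n[ns] }' | LC_ALL=C sort
}

# Example (requires kubectl access to the upstream cluster):
#   kubectl get clusterregistrations.fleet.cattle.io -A --no-headers | count_by_namespace
```

Running it before and after the upgrade makes it easy to see whether the clean up job removed the outdated registrations in every namespace.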

P.S. In the above testing, P0 and regression tests were performed on the cluster after the upgrade.

kkaempf · 2023-08-09T14:41:48Z

Can this be closed as fixed now, @sbulage ? 🤔

manno added this to the 2023-Q4-v2.8x milestone Aug 2, 2023

github-actions bot added team/fleet labels Aug 2, 2023

manno self-assigned this Aug 2, 2023

manno mentioned this issue Aug 2, 2023

Add hook on upgrade to clean up old, duplicate clusterregistrations #1689

Merged

2 tasks

manno added the status/waiting-for-fleet-rc-and-chart label Aug 2, 2023

manno modified the milestones: 2023-Q4-v2.8x, 2023-Q3-v2.7x Aug 2, 2023

rancherbot mentioned this issue Aug 2, 2023

[Backport v0.7] Clean up existing ClusterRegistrations on Fleet Upgrade #1692

Closed

manno removed the status/waiting-for-fleet-rc-and-chart label Aug 7, 2023

sbulage self-assigned this Aug 9, 2023

sbulage added the status/dev-validate label Aug 9, 2023

sbulage mentioned this issue Aug 9, 2023

[SURE-6645] Fleet creating large number of resources for each cluster #1651

Closed

sbulage closed this as completed Aug 9, 2023

manno mentioned this issue Aug 21, 2023

Some resources remain even after downstream cluster was deleted #1674

Closed
