The hasConfigChanged has flaws and causes the unnecessary re-deployment of fleet-agent on local cluster #3061

Open
1 task done
w13915984028 opened this issue Nov 8, 2024 · 1 comment

w13915984028 commented Nov 8, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

fleet-controller treats the config as changed if any of the following conditions is true:

		hasConfigChanged := config.APIServerURL != cluster.Status.APIServerURL ||
			hashStatusField(config.APIServerCA) != cluster.Status.APIServerCAHash ||
			config.AgentTLSMode != cluster.Status.AgentTLSMode ||
			hasGarbageCollectionIntervalChanged(config, cluster)

However, the status values are read from the related secret first, and fall back to the ConfigMap only when the secret values are empty:

...
	logrus.Debugf("Cluster import for '%s/%s'. Setting up agent with kubeconfig from secret '%s/%s'", cluster.Namespace, cluster.Name, kubeConfigSecretNamespace, cluster.Spec.KubeConfigSecret)
	var (
		cfg          = config.Get()
		apiServerURL = string(secret.Data[config.APIServerURLKey])
		apiServerCA  = secret.Data[config.APIServerCAKey]
	)

	if apiServerURL == "" {
		if len(cfg.APIServerURL) == 0 {
			return status, fmt.Errorf("missing apiServerURL in fleet config for cluster auto registration")
		}
		logrus.Debugf("Cluster import for '%s/%s'. Using apiServerURL from fleet-controller config", cluster.Namespace, cluster.Name)
		apiServerURL = cfg.APIServerURL
	}

	if len(apiServerCA) == 0 {
		apiServerCA = cfg.APIServerCA
	}


The cluster.fleet status is then updated from these values:

	status.AgentDeployedGeneration = &cluster.Spec.RedeployAgentGeneration
	status.AgentMigrated = true
	status.CattleNamespaceMigrated = true
	status.Agent = fleet.AgentStatus{
		Namespace: cluster.Spec.AgentNamespace,
	}
	status.AgentNamespaceMigrated = true
	status.AgentConfigChanged = false
	status.APIServerURL = apiServerURL
	status.APIServerCAHash = hashStatusField(apiServerCA)
	status.AgentTLSMode = cfg.AgentTLSMode
	status.GarbageCollectionInterval = &cfg.GarbageCollectionInterval
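
The mismatch between the two code paths above can be reproduced in isolation. The following is a minimal sketch, not the Fleet implementation: `fleetConfig`, `clusterStatus`, `resolveStatus` and this reduced `hasConfigChanged` are hypothetical stand-ins, and `hashField` uses plain SHA-224 over the raw bytes purely as an assumption so that the example runs (the real `hashStatusField` may differ).

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// fleetConfig is a hypothetical stand-in for the fleet-controller ConfigMap values.
type fleetConfig struct {
	APIServerURL string
	APIServerCA  []byte
}

// clusterStatus is a hypothetical stand-in for the relevant Cluster status fields.
type clusterStatus struct {
	APIServerURL    string
	APIServerCAHash string
}

// hashField is an assumed hash helper; the real hashStatusField may differ.
func hashField(b []byte) string {
	return fmt.Sprintf("%x", sha256.Sum224(b))
}

// resolveStatus mirrors the import path: secret values win, config is only a fallback.
func resolveStatus(secretURL string, secretCA []byte, cfg fleetConfig) clusterStatus {
	url, ca := secretURL, secretCA
	if url == "" {
		url = cfg.APIServerURL
	}
	if len(ca) == 0 {
		ca = cfg.APIServerCA
	}
	return clusterStatus{APIServerURL: url, APIServerCAHash: hashField(ca)}
}

// hasConfigChanged mirrors the quoted check: it compares the config directly with the status.
func hasConfigChanged(cfg fleetConfig, st clusterStatus) bool {
	return cfg.APIServerURL != st.APIServerURL ||
		hashField(cfg.APIServerCA) != st.APIServerCAHash
}

func main() {
	cfg := fleetConfig{APIServerURL: "https://10.53.47.173", APIServerCA: []byte("config-ca")}

	// Secret carries its own URL and CA: the status never matches the config.
	st := resolveStatus("https://10.53.0.1:443", []byte("secret-ca"), cfg)
	fmt.Println(hasConfigChanged(cfg, st)) // true, although nothing changed

	// Secret is empty: the status is derived from the config, so the check is stable.
	st = resolveStatus("", nil, cfg)
	fmt.Println(hasConfigChanged(cfg, st)) // false
}
```

In this sketch the check only behaves as intended in the fallback case; whenever the secret supplies its own values, every evaluation reports a change.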

On a Harvester cluster, an embedded Rancher provisions the local cluster, and as part of that the fleet-controller and fleet-agent are deployed as well.

The fleet-controller ConfigMap:

kubectl get configmap -n cattle-fleet-system fleet-controller -oyaml
apiVersion: v1
data:
  config: |
    agentCheckinInterval: 15m
    agentImage: rancher/fleet-agent:v0.10.2
    agentImagePullPolicy: IfNotPresent
    agentTLSMode: strict
    apiServerCA: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJ2VENDQVdPZ0F3SUJBZ0lCQURBS0JnZ3Foa2pPUFFRREFqQkdNUnd3R2dZRFZRUUtFeE5rZVc1aGJXbGoKYkdsemRHVnVaWEl0YjNKbk1TWXdKQVlEVlFRRERCMWtlVzVoYldsamJHbHpkR1Z1WlhJdFkyRkFNVGN5T1RVNApPREl5TURBZUZ3MHlOREV3TWpJd09URXdNakJhRncwek5ERXdNakF3T1RFd01qQmFNRVl4SERBYUJnTlZCQW9UCkUyUjVibUZ0YVdOc2FYTjBaVzVsY2kxdmNtY3hKakFrQmdOVkJBTU1IV1I1Ym1GdGFXTnNhWE4wWlc1bGNpMWoKWVVBeE56STVOVGc0TWpJd01Ga3dFd1lIS29aSXpqMENBUVlJS29aSXpqMERBUWNEUWdBRXpQdmFKY01CY3RtcgovTTdVdFZIOVlScmVMM0Z2dFhFWnZXOG9TUS9EVHdvNDZ1WmxnSW5wRThCbWM5b3BOaW95ZjhFa21ScGFlWFI3CnVud1VmLzJMRGFOQ01FQXdEZ1lEVlIwUEFRSC9CQVFEQWdLa01BOEdBMVVkRXdFQi93UUZNQU1CQWY4d0hRWUQKVlIwT0JCWUVGRDRFMWwrKzdWVWVOMEdqSCs1WVpaUzR2aFcrTUFvR0NDcUdTTTQ5QkFNQ0EwZ0FNRVVDSVFEQgpFdlNybGZUL2k2VGdIWHhWYXhyQUpGMGxuaW9pSUk3N2VFcUFCUVJTNEFJZ1oyRmpRZCtSQitrWmpXeFVOZG0vCmwzUWpveStDZXlNYkJLcnVuTHg1TjBNPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
    apiServerURL: https://10.53.47.173
    bootstrap:
      agentNamespace: cattle-fleet-local-system
      branch: master
      namespace: fleet-local
      paths: ""
      repo: ""
      secret: ""
    githubURLPrefix: ""
    ignoreClusterRegistrationLabels: false
    systemDefaultRegistry: ""
    webhookReceiverURL: ""
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: fleet
    meta.helm.sh/release-namespace: cattle-fleet-system
  creationTimestamp: "2024-10-22T09:10:42Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: fleet-controller
  namespace: cattle-fleet-system
  resourceVersion: "1728677"
  uid: e903b5dd-c645-4fb0-8880-f0b8eb28ab69

The secret:

kubectl get secret -n fleet-local local-kubeconfig -oyaml
apiVersion: v1
data:
  apiServerCA: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJlRENDQVIrZ0F3SUJBZ0lCQURBS0JnZ3Foa2pPUFFRREFqQWtNU0l3SUFZRFZRUUREQmx5YTJVeUxYTmwKY25abGNpMWpZVUF4TnpJNU5UZzRNRGcyTUI0WERUSTBNVEF5TWpBNU1EZ3dObG9YRFRNME1UQXlNREE1TURndwpObG93SkRFaU1DQUdBMVVFQXd3WmNtdGxNaTF6WlhKMlpYSXRZMkZBTVRjeU9UVTRPREE0TmpCWk1CTUdCeXFHClNNNDlBZ0VHQ0NxR1NNNDlBd0VIQTBJQUJORk5oN2ZKWGdwY2trK1d2QnBGT01UVjJrRlZmVjRVdXRkZnF3dk0Kb2JFdHcvK0RaQ1NJU3Jsc1ZxeHQ0di82S2lOSDZkcnVYbDNqVDdkV1BwdlIrUDJqUWpCQU1BNEdBMVVkRHdFQgovd1FFQXdJQ3BEQVBCZ05WSFJNQkFmOEVCVEFEQVFIL01CMEdBMVVkRGdRV0JCUVQ0a3c2YU1sVUVnWmpiUmpuCmJZc2Z4bnJCMXpBS0JnZ3Foa2pPUFFRREFnTkhBREJFQWlBS3FvNmtvdGlST3dvUTk0aVRTSnRWYnpIS0xFMHkKZFUwRFRGa2RFZkVreGdJZ1pDc1MwSWkvZmtLNHFiZFJrUVk5RU93QkFVYjVUMVR4dm5pMlNOYXVDU0E9Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K
  apiServerURL: aHR0cHM6Ly8xMC41My4wLjE6NDQz
  token: dS1tbzc3M3l0dHQ0Om12OHY3bmdieDVtcjVkOWpnMnNoc3Ridzk1a3cyNm1ud2hxNW5xY21uNzJobGxkZjY2dHg0eg==
  value: YXBpVmVyc2lvbjogdjEKY2x1c3RlcnM6Ci0gY2x1c3RlcjoKICAgIGNlcnRpZmljYXRlLWF1dGhvcml0eS1kYXRhOiBMUzB0TFMxQ1JVZEpUaUJEUlZKVVNVWkpRMEZVUlMwdExTMHRDazFKU1VKMlZFTkRRVmRQWjBGM1NVSkJaMGxDUVVSQlMwSm5aM0ZvYTJwUFVGRlJSRUZxUWtkTlVuZDNSMmRaUkZaUlVVdEZlRTVyWlZjMWFHSlhiR29LWWtkc2VtUkhWblZhV0VsMFlqTktiazFUV1hkS1FWbEVWbEZSUkVSQ01XdGxWelZvWWxkc2FtSkhiSHBrUjFaMVdsaEpkRmt5UmtGTlZHTjVUMVJWTkFwUFJFbDVUVVJCWlVaM01IbE9SRVYzVFdwSmQwOVVSWGROYWtKaFJuY3dlazVFUlhkTmFrRjNUMVJGZDAxcVFtRk5SVmw0U0VSQllVSm5UbFpDUVc5VUNrVXlValZpYlVaMFlWZE9jMkZZVGpCYVZ6VnNZMmt4ZG1OdFkzaEtha0ZyUW1kT1ZrSkJUVTFJVjFJMVltMUdkR0ZYVG5OaFdFNHdXbGMxYkdOcE1Xb0tXVlZCZUU1NlNUVk9WR2MwVFdwSmQwMUdhM2RGZDFsSVMyOWFTWHBxTUVOQlVWbEpTMjlhU1hwcU1FUkJVV05FVVdkQlJYcFFkbUZLWTAxQ1kzUnRjZ292VFRkVmRGWklPVmxTY21WTU0wWjJkRmhGV25aWE9HOVRVUzlFVkhkdk5EWjFXbXhuU1c1d1JUaENiV001YjNCT2FXOTVaamhGYTIxU2NHRmxXRkkzQ25WdWQxVm1MekpNUkdGT1EwMUZRWGRFWjFsRVZsSXdVRUZSU0M5Q1FWRkVRV2RMYTAxQk9FZEJNVlZrUlhkRlFpOTNVVVpOUVUxQ1FXWTRkMGhSV1VRS1ZsSXdUMEpDV1VWR1JEUkZNV3dyS3pkV1ZXVk9NRWRxU0NzMVdWcGFVelIyYUZjclRVRnZSME5EY1VkVFRUUTVRa0ZOUTBFd1owRk5SVlZEU1ZGRVFncEZkbE55YkdaVUwyazJWR2RJV0hoV1lYaHlRVXBHTUd4dWFXOXBTVWszTjJWRmNVRkNVVkpUTkVGSloxb3lSbXBSWkN0U1FpdHJXbXBYZUZWT1pHMHZDbXd6VVdwdmVTdERaWGxOWWtKTGNuVnVUSGcxVGpCTlBRb3RMUzB0TFVWT1JDQkRSVkpVU1VaSlEwRlVSUzB0TFMwdAogICAgc2VydmVyOiBodHRwczovLzEwLjUzLjQ3LjE3My9rOHMvY2x1c3RlcnMvbG9jYWwKICBuYW1lOiBjbHVzdGVyCmNvbnRleHRzOgotIGNvbnRleHQ6CiAgICBjbHVzdGVyOiBjbHVzdGVyCiAgICB1c2VyOiB1c2VyCiAgbmFtZTogZGVmYXVsdApjdXJyZW50LWNvbnRleHQ6IGRlZmF1bHQKa2luZDogQ29uZmlnCnByZWZlcmVuY2VzOiB7fQp1c2VyczoKLSBuYW1lOiB1c2VyCiAgdXNlcjoKICAgIHRva2VuOiB1LW1vNzczeXR0dDQ6bXY4djduZ2J4NW1yNWQ5amcyc2hzdGJ3OTVrdzI2bW53aHE1bnFjbW43MmhsbGRmNjZ0eDR6Cg==
kind: Secret
metadata:
  creationTimestamp: "2024-10-22T09:10:21Z"
  labels:
    cluster.x-k8s.io/cluster-name: local
  name: local-kubeconfig
  namespace: fleet-local
  ownerReferences:
  - apiVersion: provisioning.cattle.io/v1
    kind: Cluster
    name: local
    uid: f03854eb-90fe-4fac-8ccc-cb292dc9a583
  resourceVersion: "2547"
  uid: 422a9004-ab22-4013-811b-2ab94d2c37fd
type: Opaque

The cluster.fleet object:

kubectl get cluster.fleet -n fleet-local local -oyaml
apiVersion: fleet.cattle.io/v1alpha1
kind: Cluster
metadata:
  annotations:
    objectset.rio.cattle.io/applied: H4sIAAAAAAAA/4xSTW/bMAz9KwPPTresX4mBHYquGIoBPbS7FT0wEm1rkSlBopIagf/7ILtJjbYrcpPE9x75nriDlgQ1CkK5A2R2gmIcx3x1q7+kJJKcBONOFIpYOjHuq9FQQmWJZKZsikIBiv+C3ZYpzOrNGkrwwW1MNI4N1xPIZl58+W1Y/7g+Uo2xJSjBOoX2KHD0qOgw9MjrC1CBBrd/TEtRsPVQcrK2AIsrskMGLTLW1BLLRPjF9Uyb6C12b+f5lPMGe6SVBmOT5/+OC1wul0tcnNN8XlWL04vzM7qYK1KneLmYn10uq1V1CcWYtabwKgIlNBg2NETcv2v9SU7Rkxr2oyaWq6oybKTLD+w0Te8+UEUhkP6ZguH6QTWkkzVc39bsDs83z6RSzh3Kxz2HODfOeYtqbp59oBjHPXzcwZq6/VCTTIZpcmaeAooLUMItQwEbtIkyESQkgqf+qS9gS6ZuBMp5/9T3xejkbuJ4lJ1NjM9iF4VaKGCdVnTtuDL1A6lAsg9tlgtqKHyAuvskT0FJ8ZBoPliM8kDE+w2c/kamaIomkL4n1N0vI/fkXYTyWwEvOwjlri8gvCsHii4FRdcusQwtp0oDpDUxGq7Hs9OmMqSHCzt5RbngG+SXSjg8J16z2/Jw3qKRK+/tyM8+U9ti6D7supfo+/5fAAAA//+aI8sBhQQAAA
    objectset.rio.cattle.io/id: fleet-cluster
    objectset.rio.cattle.io/owner-gvk: provisioning.cattle.io/v1, Kind=Cluster
    objectset.rio.cattle.io/owner-name: local
    objectset.rio.cattle.io/owner-namespace: fleet-local
  creationTimestamp: "2024-10-22T09:10:21Z"
  generation: 9
  labels:
    management.cattle.io/cluster-display-name: local
    management.cattle.io/cluster-name: local
    name: local
    objectset.rio.cattle.io/hash: f2a8a9999a85e11ff83654e61cec3a781479fbf7
    provider.cattle.io: harvester
  name: local
  namespace: fleet-local
  resourceVersion: "2289565"
  uid: 299cf0b3-2b6b-433f-b790-4b4754d3fb31
spec:
  agentAffinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: fleet.cattle.io/agent
            operator: In
            values:
            - "true"
        weight: 1
  agentNamespace: cattle-fleet-local-system
  clientID: qqgxq26f8nzkc5xnvdxtt64x8jfjjh9545vwv4t7wnpb99vkjz89ft
  kubeConfigSecret: local-kubeconfig
  kubeConfigSecretNamespace: fleet-local
  redeployAgentGeneration: 7
status:
  agent:
    lastSeen: "2024-11-08T11:23:50Z"
    namespace: cattle-fleet-local-system
  agentAffinityHash: f50425c0999a8e18c2d104cdb8cb063762763f232f538b5a7c8bdb61
  agentDeployedGeneration: 7
  agentMigrated: true
  agentNamespaceMigrated: true
  agentTLSMode: strict
  apiServerCAHash: 302980e70d0e2817c3f94bfafa6e419be2249fc11d128da50054b390
  apiServerURL: https://10.53.0.1:443
  cattleNamespaceMigrated: true
  conditions:
  - lastUpdateTime: "2024-10-22T09:11:23Z"
    status: "True"
    type: Processed
  - lastUpdateTime: "2024-11-08T11:23:29Z"
    status: "True"
    type: Imported
  - lastUpdateTime: "2024-10-22T09:11:23Z"
    status: "True"
    type: Reconciled
  - lastUpdateTime: "2024-11-08T10:52:30Z"
    status: "True"
    type: Ready
  desiredReadyGitRepos: 0
  display:
    readyBundles: 7/7
  garbageCollectionInterval: 0s
  namespace: cluster-fleet-local-local-1a3d67d0a899
  readyGitRepos: 0
  resourceCounts:
    desiredReady: 0
    missing: 0
    modified: 0
    notReady: 0
    orphaned: 0
    ready: 0
    unknown: 0
    waitApplied: 0
  summary:
    desiredReady: 7
    ready: 7

If we kill the fleet-controller pod, the new pod always re-deploys the fleet-agent, with the debug output below:

kubectl logs -n cattle-fleet-system fleet-controller-78f8b6677c-hvftb -c fleet-agentmanagement
I1108 11:23:28.926600       1 leaderelection.go:250] attempting to acquire leader lease cattle-fleet-system/fleet-agentmanagement-lock...
I1108 11:23:28.934031       1 leaderelection.go:260] successfully acquired lease cattle-fleet-system/fleet-agentmanagement-lock

// debug via https://github.com/rancher/fleet/blob/1cddbbff1c2cd71c9b9011c3738754e5b4c8fa89/internal/cmd/controller/agentmanagement/controllers/config/controller.go#L24

time="2024-11-08T11:23:28Z" level=info msg="When Register, cattle-fleet-system/fleet-controller the Lookup result: APIServerURL:https://10.53.47.173 AgentTLSMode:strict"

time="2024-11-08T11:23:29Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=Bundle controller"
time="2024-11-08T11:23:29Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=ClusterRegistrationToken controller"
time="2024-11-08T11:23:29Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=ClusterRegistration controller"
time="2024-11-08T11:23:29Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=Role controller"
time="2024-11-08T11:23:29Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=GitRepo controller"
time="2024-11-08T11:23:29Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
time="2024-11-08T11:23:29Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller"
time="2024-11-08T11:23:29Z" level=info msg="Starting /v1, Kind=Namespace controller"
time="2024-11-08T11:23:29Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
time="2024-11-08T11:23:29Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"

// debug via https://github.com/rancher/fleet/blob/1cddbbff1c2cd71c9b9011c3738754e5b4c8fa89/internal/cmd/controller/agentmanagement/controllers/config/controller.go#L42

time="2024-11-08T11:23:29Z" level=info msg="When onchange, reloadConfig, cattle-fleet-system/fleet-controller the ReadConfig result: APIServerURL:https://10.53.47.173 AgentTLSMode:strict"

time="2024-11-08T11:23:29Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=ClusterGroup controller"
time="2024-11-08T11:23:29Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=BundleDeployment controller"
time="2024-11-08T11:23:29Z" level=info msg="Update agent bundle for cluster fleet-local/local"
time="2024-11-08T11:23:29Z" level=info msg="Starting fleet.cattle.io/v1alpha1, Kind=Cluster controller"
time="2024-11-08T11:23:29Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=RoleBinding controller"
time="2024-11-08T11:23:29Z" level=info msg="Starting /v1, Kind=Secret controller"

time="2024-11-08T11:23:29Z" level=info msg="API server config changed, trigger cluster import for cluster fleet-local/local"

// debug via hasConfigChanged

time="2024-11-08T11:23:29Z" level=info msg="detected change: APIServerURL: https://10.53.47.173 https://10.53.0.1:443 equal:false, CAHash: d372882253c64e2862a02eb9022b9ed346b9ca21c13d06969e887c14 302980e70d0e2817c3f94bfafa6e419be2249fc11d128da50054b390 equal:false, AgentTLSMode:strict strict equal:true, garbagechanged: false"

time="2024-11-08T11:23:29Z" level=info msg="Deleted old agent for cluster (fleet-local/local) in namespace cattle-fleet-local-system"
time="2024-11-08T11:23:29Z" level=info msg="Cluster import for 'fleet-local/local'. Deployed new agent"

time="2024-11-08T11:23:31Z" level=info msg="Waiting for service account token key to be populated for secret cluster-fleet-local-local-1a3d67d0a899/request-66zfj-3f1e0ebd-af49-4c31-9f9a-382208c6777a-token"
time="2024-11-08T11:23:31Z" level=info msg="Namespace assigned to cluster 'fleet-local/local' enqueues cluster registration 'fleet-local/request-66zfj'"
time="2024-11-08T11:23:33Z" level=info msg="Cluster registration request 'fleet-local/request-66zfj' granted, creating cluster, request service account, registration secret"
time="2024-11-08T11:23:33Z" level=info msg="Cluster registration request 'fleet-local/request-66zfj' granted, creating cluster, request service account, registration secret"

Expected Behavior

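Because the fleet-agent may be deploying or updating a ManagedChart at any time, it should be re-deployed only when necessary.

The change check needs to handle the non-fallback case, where the status values came from the secret rather than from the ConfigMap. The relevant code: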

hasConfigChanged := config.APIServerURL != cluster.Status.APIServerURL ||

func (i *importHandler) onConfig(config *config.Config) error {
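
One possible shape for such a check is sketched below, under assumptions: `effective` and `urlChanged` are hypothetical helpers, not existing Fleet functions. The idea is to compare the status with the value the import would actually write (secret first, ConfigMap as fallback) instead of with the raw ConfigMap value.

```go
package main

import "fmt"

// effective returns the secret value when present, otherwise the config fallback.
// It is a hypothetical helper that mirrors the fallback order used during import.
func effective(secretValue, configValue string) string {
	if secretValue != "" {
		return secretValue
	}
	return configValue
}

// urlChanged compares the status with the value the import would write,
// rather than with the raw config value.
func urlChanged(secretURL, configURL, statusURL string) bool {
	return effective(secretURL, configURL) != statusURL
}

func main() {
	// The secret provides the URL: a differing ConfigMap URL is not a change.
	fmt.Println(urlChanged("https://10.53.0.1:443", "https://10.53.47.173", "https://10.53.0.1:443")) // false

	// The secret is empty: the ConfigMap URL is in effect, so a new value is a change.
	fmt.Println(urlChanged("", "https://10.53.244.156", "https://10.53.47.173")) // true
}
```

The same pattern would apply to the CA hash. The trade-off is that the config handler would then need access to the cluster's kubeconfig secret, which the quoted check does not currently use.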

Steps To Reproduce

This is observed in the Harvester upgrade test

harvester/harvester#6851

After the embedded Rancher is upgraded and a number of conditions are checked, Harvester starts to upgrade the ManagedCharts. At a random point during this, the fleet-agent is re-deployed, which can leave some ManagedCharts in an intermediate state. The new fleet-agent then rolls them back, which causes further issues. For more details, see harvester/harvester#6851 (comment).

Environment

- Architecture: 
- Fleet Version: Rancher v2.9.2 + fleet v0.10.2;  Harvester v1.4.0;  The `local` cluster is managed by `Rancher` and `Fleet`.
- Cluster:
  - Provider:
  - Options:
  - Kubernetes Version:

Logs

No response

Anything else?

No response

@rancherbot rancherbot added this to Fleet Nov 8, 2024
@github-project-automation github-project-automation bot moved this to 🆕 New in Fleet Nov 8, 2024

w13915984028 commented Nov 8, 2024

Note:
In the ConfigMap, the apiServerURL is https://10.53.47.173, which is the Rancher service IP in this cluster. We also observed that during the upgrade this value first becomes empty and then reverts to https://10.53.47.173.

configmap:

    name: fleet-controller
    namespace: cattle-fleet-system
    apiServerURL: https://10.53.244.156

In the secret, the apiServerURL is https://10.53.0.1:443, the ClusterIP of the default kubernetes service.

secret:
  name: local-kubeconfig
  namespace: fleet-local
apiServerURL: aHR0cHM6Ly8xMC41My4wLjE6NDQz


echo aHR0cHM6Ly8xMC41My4wLjE6NDQz | base64 -d
https://10.53.0.1:443

default                           kubernetes                                    ClusterIP      10.53.0.1
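
The comparison above can also be done programmatically. This is a small sketch using only the Go standard library; `decodeSecretValue` is a hypothetical helper name, and the two values are the ones quoted in this issue.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeSecretValue decodes a base64-encoded Secret data value as shown in YAML output.
func decodeSecretValue(encoded string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

func main() {
	secretURL, err := decodeSecretValue("aHR0cHM6Ly8xMC41My4wLjE6NDQz")
	if err != nil {
		panic(err)
	}
	configURL := "https://10.53.47.173" // apiServerURL from the fleet-controller ConfigMap

	fmt.Println(secretURL)              // https://10.53.0.1:443
	fmt.Println(secretURL == configURL) // false
}
```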

Also, according to our debug log, the apiServerCA always differs.

This means that any change to the fleet-controller ConfigMap causes fleet-controller to re-deploy the fleet-agent, because hasConfigChanged is always true.

cc @manno
