
Nodes fail to register when using external-aws cloud provider in k8s >= 1.29 #3721

ddl-kfrench opened this issue Oct 23, 2024 · 1 comment

RKE version: v1.6.2

Docker version: (docker version, docker info preferred) 26.0.1

Operating system and kernel: (cat /etc/os-release, uname -r preferred) Ubuntu 22.04 (Jammy)

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) AWS, k8s 1.29, k8s 1.30

cluster.yml file:

apiVersion: management.cattle.io/v3
kind: Cluster
metadata:
  annotations:
    authz.management.cattle.io/creator-role-bindings: '{"created":["cluster-owner"],"required":["cluster-owner"]}'
    authz.management.cattle.io/initial-sync: 'true'
    field.cattle.io/creatorId: user-7vpvp
    lifecycle.cattle.io/create.cluster-agent-controller-cleanup: 'true'
    lifecycle.cattle.io/create.cluster-scoped-gc: 'true'
    lifecycle.cattle.io/create.mgmt-cluster-rbac-remove: 'true'
  creationTimestamp: '2024-10-18T19:26:05Z'
  finalizers:
    - wrangler.cattle.io/mgmt-cluster-remove
    - controller.cattle.io/cluster-agent-controller-cleanup
    - controller.cattle.io/cluster-scoped-gc
    - controller.cattle.io/cluster-provisioner-controller
    - controller.cattle.io/mgmt-cluster-rbac-remove
  generateName: c-
  labels:
    cattle.io/creator: norman
  name: c-xsvb9
spec:
  agentImageOverride: ''
  answers: {}
  clusterSecrets: {}
  description: ''
  desiredAgentImage: ''
  desiredAuthImage: ''
  displayName: example-cluster
  dockerRootDir: /var/lib/docker
  enableNetworkPolicy: false
  fleetWorkspaceName: fleet-default
  internal: false
  localClusterAuthEndpoint:
    enabled: false
  rancherKubernetesEngineConfig:
    addonJobTimeout: 45
    authentication:
      strategy: x509
    authorization: {}
    bastionHost: {}
    cloudProvider:
      name: external-aws
      useInstanceMetadataHostname: true
    enableCriDockerd: true
    ignoreDockerVersion: true
    ingress:
      defaultBackend: true
      defaultIngressClass: true
      provider: none
    kubernetesVersion: v1.30.4-rancher1-1
    monitoring:
      provider: metrics-server
      replicas: 1
    network:
      options:
        flannelBackendType: vxlan
      plugin: canal
    restore: {}
    rotateEncryptionKey: false
    services:
      etcd:
        backupConfig:
          enabled: true
          intervalHours: 12
          retention: 6
          s3BackupConfig: null
        creation: 12h
        extraArgs:
          election-timeout: '5000'
          heartbeat-interval: '500'
        retention: 72h
        snapshot: false
      kubeApi:
        extraArgs:
          service-account-issuer: kubernetes.default.svc
          service-account-signing-key-file: /etc/kubernetes/ssl/kube-service-account-token-key.pem
        serviceClusterIpRange: x.x.x.x/16
        serviceNodePortRange: 30000-32767
      kubeController:
        clusterCidr: x.x.x.x/16
        extraArgs:
          node-cidr-mask-size: '25'
        serviceClusterIpRange: x.x.x.x/16
      kubelet:
        clusterDnsServer: x.x.x.x
        clusterDomain: cluster.local
        failSwapOn: true
      kubeproxy: {}
      scheduler: {}
    sshAgentAuth: false
    systemImages: {}
    upgradeStrategy:
      drain: false
      maxUnavailableControlplane: '1'
      maxUnavailableWorker: 10%
  windowsPreferedCluster: false

Steps to Reproduce: Install a cluster with RKE1 on Kubernetes 1.29 or greater with the external-aws cloud provider and useInstanceMetadataHostname: true, as in the cluster.yml above.
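
For reference, the reproduction hinges on just the cloud provider stanza from the cluster.yml above together with a v1.29+ kubernetesVersion; the rest of the config is included for completeness:

cloudProvider:
  name: external-aws
  useInstanceMetadataHostname: true
kubernetesVersion: v1.30.4-rancher1-1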

Results: nodes fail to register; provisioning hangs on:

3:31:14 pm | [INFO ] [sync] Syncing nodes Labels and Taints
3:33:56 pm | [ERROR] [ "ip-10-0-1-9" not found]

This appears to be due to a change in k8s 1.29, chronicled in kubernetes/kubernetes#124453 but ultimately originating in kubernetes/kubernetes#121028.

In RKE, GetNode (reached by way of setNodeAnnotationsLabelsTaints and SyncLabelsAndTaints) only returns the node if the configured node address matches one of the addresses written to the node status:

			if cloudProviderName == ExternalAWSCloudProviderName {
				if nodeAddress == "" {
					return nil, fmt.Errorf("failed to find node [%v] with empty nodeAddress, cloud provider: %v", nodeName, cloudProviderName)
				}
				logrus.Debugf("Checking internal address for node [%v], cloud provider: %v", nodeName, cloudProviderName)
				for _, addr := range node.Status.Addresses {
					if addr.Type == v1.NodeInternalIP && nodeAddress == addr.Address {
						logrus.Debugf("Found node [%s]: %v", nodeName, nodeAddress)
						return &node, nil
					}
				}
			}
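
To make it easy to confirm which addresses a node actually reports (and therefore why the loop above never finds a match on 1.29+), here is a minimal standalone client-go sketch. It is hypothetical, not part of RKE or a proposed fix: the file name, the KUBECONFIG lookup, and the command-line arguments are illustrative assumptions; it simply lists .status.addresses and repeats the same InternalIP comparison as GetNode.

// node_addr_check.go: hypothetical standalone diagnostic, not part of RKE.
package main

import (
	"context"
	"fmt"
	"os"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: node_addr_check <node-name> <expected-internal-ip>")
		os.Exit(1)
	}
	nodeName, nodeAddress := os.Args[1], os.Args[2]

	// Build a client from $KUBECONFIG (e.g. the kube_config_cluster.yml that RKE writes).
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Print every address the node reports, then apply the same check GetNode uses.
	for _, addr := range node.Status.Addresses {
		fmt.Printf("%s: %s\n", addr.Type, addr.Address)
	}
	for _, addr := range node.Status.Addresses {
		if addr.Type == v1.NodeInternalIP && addr.Address == nodeAddress {
			fmt.Println("InternalIP match found; GetNode would return this node")
			return
		}
	}
	fmt.Println("no InternalIP match; GetNode would fail with \"not found\"")
}

Run against a 1.28 node, this should print an InternalIP entry and report a match; run against a node provisioned as above on 1.29/1.30, the InternalIP entry is absent, which lines up with the "not found" error in the provisioning log.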

In a k8s 1.28 cluster, the node's .status.addresses entries look something like this:

status:
  addresses:
    - address: 10.0.1.10
      type: InternalIP
    - address: ip-10-0-1-10.us-west-2.compute.internal
      type: InternalDNS
    - address: ip-10-0-1-10.us-west-2.compute.internal
      type: Hostname

Since these addresses are no longer written to the node status in >= 1.29, the comparison never succeeds and the nodes fail to register.


This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.
