
Nodes fail to register when using external-aws cloud provider in k8s >= 1.29 #3721

ddl-kfrench opened this issue Oct 23, 2024 · 1 comment

RKE version: v1.6.2

Docker version: (docker version, docker info preferred) 26.0.1

Operating system and kernel: (cat /etc/os-release, uname -r preferred) Ubuntu 22.04 (Jammy)

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) AWS, k8s 1.29, k8s 1.30

cluster.yml file:

apiVersion: management.cattle.io/v3
kind: Cluster
metadata:
  annotations:
    authz.management.cattle.io/creator-role-bindings: '{"created":["cluster-owner"],"required":["cluster-owner"]}'
    authz.management.cattle.io/initial-sync: 'true'
    field.cattle.io/creatorId: user-7vpvp
    lifecycle.cattle.io/create.cluster-agent-controller-cleanup: 'true'
    lifecycle.cattle.io/create.cluster-scoped-gc: 'true'
    lifecycle.cattle.io/create.mgmt-cluster-rbac-remove: 'true'
  creationTimestamp: '2024-10-18T19:26:05Z'
  finalizers:
    - wrangler.cattle.io/mgmt-cluster-remove
    - controller.cattle.io/cluster-agent-controller-cleanup
    - controller.cattle.io/cluster-scoped-gc
    - controller.cattle.io/cluster-provisioner-controller
    - controller.cattle.io/mgmt-cluster-rbac-remove
  generateName: c-
  labels:
    cattle.io/creator: norman
  name: c-xsvb9
spec:
  agentImageOverride: ''
  answers: {}
  clusterSecrets: {}
  description: ''
  desiredAgentImage: ''
  desiredAuthImage: ''
  displayName: example-cluster
  dockerRootDir: /var/lib/docker
  enableNetworkPolicy: false
  fleetWorkspaceName: fleet-default
  internal: false
  localClusterAuthEndpoint:
    enabled: false
  rancherKubernetesEngineConfig:
    addonJobTimeout: 45
    authentication:
      strategy: x509
    authorization: {}
    bastionHost: {}
    cloudProvider:
      name: external-aws
      useInstanceMetadataHostname: true
    enableCriDockerd: true
    ignoreDockerVersion: true
    ingress:
      defaultBackend: true
      defaultIngressClass: true
      provider: none
    kubernetesVersion: v1.30.4-rancher1-1
    monitoring:
      provider: metrics-server
      replicas: 1
    network:
      options:
        flannelBackendType: vxlan
      plugin: canal
    restore: {}
    rotateEncryptionKey: false
    services:
      etcd:
        backupConfig:
          enabled: true
          intervalHours: 12
          retention: 6
          s3BackupConfig: null
        creation: 12h
        extraArgs:
          election-timeout: '5000'
          heartbeat-interval: '500'
        retention: 72h
        snapshot: false
      kubeApi:
        extraArgs:
          service-account-issuer: kubernetes.default.svc
          service-account-signing-key-file: /etc/kubernetes/ssl/kube-service-account-token-key.pem
        serviceClusterIpRange: x.x.x.x/16
        serviceNodePortRange: 30000-32767
      kubeController:
        clusterCidr: x.x.x.x/16
        extraArgs:
          node-cidr-mask-size: '25'
        serviceClusterIpRange: x.x.x.x/16
      kubelet:
        clusterDnsServer: x.x.x.x
        clusterDomain: cluster.local
        failSwapOn: true
      kubeproxy: {}
      scheduler: {}
    sshAgentAuth: false
    systemImages: {}
    upgradeStrategy:
      drain: false
      maxUnavailableControlplane: '1'
      maxUnavailableWorker: 10%
  windowsPreferedCluster: false

Steps to Reproduce: Install a cluster with RKE1 on Kubernetes 1.29 or greater with the external-aws cloud provider and useInstanceMetadataHostname: true, as in the cluster.yml above.
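
For reference, the reproduction hinges on just the cloud provider stanza from the cluster.yml above together with a v1.29+ kubernetesVersion; the rest of the config is included for completeness:

cloudProvider:
  name: external-aws
  useInstanceMetadataHostname: true
kubernetesVersion: v1.30.4-rancher1-1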

Results: nodes fail to register; provisioning hangs on:

3:31:14 pm | [INFO ] [sync] Syncing nodes Labels and Taints
3:33:56 pm | [ERROR] [ "ip-10-0-1-9" not found]

This appears to be due to a change in k8s 1.29, chronicled in kubernetes/kubernetes#124453 but ultimately originating in kubernetes/kubernetes#121028.

In RKE, GetNode (reached by way of setNodeAnnotationsLabelsTaints and SyncLabelsAndTaints) only returns the node if the configured node address matches one of the addresses written to the node status:

			if cloudProviderName == ExternalAWSCloudProviderName {
				if nodeAddress == "" {
					return nil, fmt.Errorf("failed to find node [%v] with empty nodeAddress, cloud provider: %v", nodeName, cloudProviderName)
				}
				logrus.Debugf("Checking internal address for node [%v], cloud provider: %v", nodeName, cloudProviderName)
				for _, addr := range node.Status.Addresses {
					if addr.Type == v1.NodeInternalIP && nodeAddress == addr.Address {
						logrus.Debugf("Found node [%s]: %v", nodeName, nodeAddress)
						return &node, nil
					}
				}
			}
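
To make it easy to confirm which addresses a node actually reports (and therefore why the loop above never finds a match on 1.29+), here is a minimal standalone client-go sketch. It is hypothetical, not part of RKE or a proposed fix: the file name, the KUBECONFIG lookup, and the command-line arguments are illustrative assumptions; it simply lists .status.addresses and repeats the same InternalIP comparison as GetNode.

// node_addr_check.go: hypothetical standalone diagnostic, not part of RKE.
package main

import (
	"context"
	"fmt"
	"os"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: node_addr_check <node-name> <expected-internal-ip>")
		os.Exit(1)
	}
	nodeName, nodeAddress := os.Args[1], os.Args[2]

	// Build a client from $KUBECONFIG (e.g. the kube_config_cluster.yml that RKE writes).
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Print every address the node reports, then apply the same check GetNode uses.
	for _, addr := range node.Status.Addresses {
		fmt.Printf("%s: %s\n", addr.Type, addr.Address)
	}
	for _, addr := range node.Status.Addresses {
		if addr.Type == v1.NodeInternalIP && addr.Address == nodeAddress {
			fmt.Println("InternalIP match found; GetNode would return this node")
			return
		}
	}
	fmt.Println("no InternalIP match; GetNode would fail with \"not found\"")
}

Run against a 1.28 node, this should print an InternalIP entry and report a match; run against a node provisioned as above on 1.29/1.30, the InternalIP entry is absent, which lines up with the "not found" error in the provisioning log.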

In a k8s 1.28 cluster, the node's .status.addresses entries look something like this:

status:
  addresses:
    - address: 10.0.1.10
      type: InternalIP
    - address: ip-10-0-1-10.us-west-2.compute.internal
      type: InternalDNS
    - address: ip-10-0-1-10.us-west-2.compute.internal
      type: Hostname

Since these addresses are no longer written to the node status in >= 1.29, the comparison never succeeds and the nodes fail to register.


This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.
