Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
CCM Version: 1.25+
Environment:
Kubernetes version (use kubectl version): OpenShift 4.14
OS (e.g. from /etc/os-release):
$ cat /etc/redhat-release
Red Hat Enterprise Linux CoreOS release 4.14
Kernel (e.g. uname -a):
$ uname -a
Linux localhost.localdomain 5.14.0-284.13.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 27 13:35:10 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
What happened?
We deployed the CCM (with useInstancePrincipals: true) into our cluster without setting up dynamic groups and policies in our compartment; as a consequence, the cluster nodes were deleted (kubectl get nodes returned no nodes).
This behavior of the CCM complicated the investigation and access to the logs, as the CCM pods were evicted along with the nodes.
We suspect this behavior is not limited to OpenShift.
What you expected to happen?
Nodes should be left uninitialized, and the CCM should log a meaningful message and retry until the user creates the required policies in OCI.
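For context, a minimal sketch of the kind of dynamic group and policies the CCM needs when running with instance principals. The group name, compartment references, and the exact set of statements are hypothetical placeholders; consult the OCI CCM documentation for what your deployment actually requires:

```
# Hypothetical dynamic group matching the cluster's instances by compartment
ANY {instance.compartment.id = 'ocid1.compartment.oc1..<compartment_ocid>'}

# Hypothetical policy statements for that dynamic group
Allow dynamic-group ccm-dynamic-group to read instance-family in compartment <compartment_name>
Allow dynamic-group ccm-dynamic-group to use virtual-network-family in compartment <compartment_name>
Allow dynamic-group ccm-dynamic-group to manage load-balancers in compartment <compartment_name>
```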
How to reproduce it (as minimally and precisely as possible)?
Provision a cluster and ensure:
the dynamic groups and policies are not set up, or are set up incorrectly
the nodes are tainted with node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Then deploy the CCM with the useInstancePrincipals: true config flag. At this point, the CCM deletes the nodes; a sketch of these steps follows below.
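A minimal sketch of the reproduction, assuming the standard uninitialized taint and a cloud-provider config that enables instance principals (the config layout and all OCIDs here are illustrative placeholders, not our exact setup):

```
# Taint the nodes so the CCM is expected to initialize them
kubectl taint nodes --all node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule

# Illustrative cloud-provider config excerpt with instance principals
# enabled and no matching IAM policies in place:
cat <<'EOF' > cloud-provider.yaml
auth:
  useInstancePrincipals: true
compartment: ocid1.compartment.oc1..<compartment_ocid>
vcn: ocid1.vcn.oc1..<vcn_ocid>
EOF
```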
Anything else we need to know?
Here are the logs of the CCM pod before it deletes a node:
I0717 15:13:56.437876 1 node_controller.go:415] Initializing node test-infra-cluster-4107b8b3-master-2 with cloud provider
E0717 15:13:56.437954 1 node_controller.go:229] error syncing 'test-infra-cluster-4107b8b3-master-2': failed to get instance metadata for node test-infra-cluster-4107b8b3-master-2: error fetching node by provider ID: compartmentID annotation missing in the node. Would retry, and error by node name: error getting CompartmentID from Node Name: compartmentID annotation missing in the node. Would retry, requeuing
2023-07-17T15:13:56.969Z ERROR oci/node_info_controller.go:244 Failed to get instance from instance ID {"component": "cloud-controller-manager", "node": "test-infra-cluster-4107b8b3-master-2", "error": "Error returned by Compute Service. Http Status Code: 404. Error Code: NotAuthorizedOrNotFound. Opc request id: ed0509ddc78d5d902a7b8257aadea741/F14BBE797B83333448017788F7DE2651/E08C35B4C4FA433DD1DE55198F6F99AD. Message: instance ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq not found\nOperation Name: GetInstance\nTimestamp: 2023-07-17 15:13:54 +0000 GMT\nClient Version: Oracle-GoSDK/65.2.0\nRequest Endpoint: GET https://iaas.us-sanjose-1.oraclecloud.com/20160918/instances/ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq\nTroubleshooting Tips: See https://docs.oracle.com/iaas/Content/API/References/apierrors.htm#apierrors_404__404_notauthorizedornotfound for more information about resolving this error.\nAlso see https://docs.oracle.com/iaas/api/#/en/iaas/20160918/Instance/GetInstance for details on this operation's requirements.\nTo get more info on the failing request, you can set OCI_GO_SDK_DEBUG env var to info or higher level to log the request/response details.\nIf you are unable to resolve this Compute issue, please contact Oracle support and provide them this full error message.", "errorVerbose": "Error returned by Compute Service. Http Status Code: 404. Error Code: NotAuthorizedOrNotFound. Opc request id: ed0509ddc78d5d902a7b8257aadea741/F14BBE797B83333448017788F7DE2651/E08C35B4C4FA433DD1DE55198F6F99AD. Message: instance ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq not found\nOperation Name: GetInstance\nTimestamp: 2023-07-17 15:13:54 +0000 GMT\nClient Version: Oracle-GoSDK/65.2.0\nRequest Endpoint: GET https://iaas.us-sanjose-1.oraclecloud.com/20160918/instances/ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq\nTroubleshooting Tips: See https://docs.oracle.com/iaas/Content/API/References/apierrors.htm#apierrors_404__404_notauthorizedornotfound for more information about resolving this error.\nAlso see https://docs.oracle.com/iaas/api/#/en/iaas/20160918/Instance/GetInstance for details on this operation's requirements.\nTo get more info on the failing request, you can set OCI_GO_SDK_DEBUG env var to info or higher level to log the request/response details.\nIf you are unable to resolve this Compute issue, please contact Oracle support and provide them this full error 
message.\ngithub.com/oracle/oci-cloud-controller-manager/pkg/oci/client.(*client).GetInstance\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/oci/client/compute.go:50\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.getInstanceByNode\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:242\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).processItem\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:168\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).processNextItem\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:139\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).runWorker\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:124\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).Run\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:119\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1571"}
2023-07-17T15:13:56.969Z ERROR oci/node_info_controller.go:142 Error processing node test-infra-cluster-4107b8b3-master-2 (will retry): Error returned by Compute Service. Http Status Code: 404. Error Code: NotAuthorizedOrNotFound. Opc request id: ed0509ddc78d5d902a7b8257aadea741/F14BBE797B83333448017788F7DE2651/E08C35B4C4FA433DD1DE55198F6F99AD. Message: instance ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq not found
Operation Name: GetInstance
Timestamp: 2023-07-17 15:13:54 +0000 GMT
Client Version: Oracle-GoSDK/65.2.0
Request Endpoint: GET https://iaas.us-sanjose-1.oraclecloud.com/20160918/instances/ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq
Troubleshooting Tips: See https://docs.oracle.com/iaas/Content/API/References/apierrors.htm#apierrors_404__404_notauthorizedornotfound for more information about resolving this error.
Also see https://docs.oracle.com/iaas/api/#/en/iaas/20160918/Instance/GetInstance for details on this operation's requirements.
To get more info on the failing request, you can set OCI_GO_SDK_DEBUG env var to info or higher level to log the request/response details.
If you are unable to resolve this Compute issue, please contact Oracle support and provide them this full error message. {"component": "cloud-controller-manager"}
I0717 15:13:58.998504 1 node_controller.go:415] Initializing node test-infra-cluster-4107b8b3-master-2 with cloud provider
E0717 15:13:58.998590 1 node_controller.go:229] error syncing 'test-infra-cluster-4107b8b3-master-2': failed to get instance metadata for node test-infra-cluster-4107b8b3-master-2: error fetching node by provider ID: compartmentID annotation missing in the node. Would retry, and error by node name: error getting CompartmentID from Node Name: compartmentID annotation missing in the node. Would retry, requeuing
I0717 15:13:59.329292 1 node_lifecycle_controller.go:164] deleting node since it is no longer present in cloud provider: test-infra-cluster-4107b8b3-master-2
I0717 15:13:59.329476 1 event.go:294] "Event occurred" object="test-infra-cluster-4107b8b3-master-2" fieldPath="" kind="Node" apiVersion="" type="Normal" reason="DeletingNode" message="Deleting node test-infra-cluster-4107b8b3-master-2 because it does not exist in the cloud provider"
2023-07-17T15:14:03.394Z ERROR oci/node_info_controller.go:142 Error processing node test-infra-cluster-4107b8b3-master-0 (will retry): node "test-infra-cluster-4107b8b3-master-0" not found {"component": "cloud-controller-manager"}
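Our reading of the logs (a guess, not a confirmed diagnosis): OCI returns 404 NotAuthorizedOrNotFound both when an instance does not exist and when the caller lacks permission, so if the CCM's existence check treats any 404 as "instance gone", the node lifecycle controller deletes the node. A simplified, hypothetical Go sketch of that failure mode; the names and signatures are illustrative, not the actual oci-cloud-controller-manager code:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotAuthorizedOrNotFound stands in for the OCI SDK's 404
// NotAuthorizedOrNotFound service error, which is returned both for
// missing instances and for callers without the required IAM policies.
var errNotAuthorizedOrNotFound = errors.New("NotAuthorizedOrNotFound")

// instanceExists is a simplified existence check. Mapping every 404 to
// (false, nil) tells the node lifecycle controller the instance is gone,
// which triggers node deletion even when the real problem is missing
// policies.
func instanceExists(getInstance func() error) (bool, error) {
	err := getInstance()
	if errors.Is(err, errNotAuthorizedOrNotFound) {
		return false, nil // ambiguous: not found OR not authorized
	}
	if err != nil {
		return false, err // other errors propagate, so the controller retries
	}
	return true, nil
}

func main() {
	// With no dynamic group/policies set up, every GetInstance call fails
	// with NotAuthorizedOrNotFound, so the check reports "does not exist".
	exists, err := instanceExists(func() error { return errNotAuthorizedOrNotFound })
	fmt.Println(exists, err) // false <nil> -> the node gets deleted
}
```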