Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
CCM Version: 1.25+
Environment:
Kubernetes version (use kubectl version): OpenShift 4.14
OS (e.g. from /etc/os-release):
$ cat /etc/redhat-release
Red Hat Enterprise Linux CoreOS release 4.14
Kernel (e.g. uname -a):
$ uname -a
Linux localhost.localdomain 5.14.0-284.13.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 27 13:35:10 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
What happened?
We deployed the CCM (with useInstancePrincipals: true) into our cluster without setting up dynamic groups and policies in our compartment; as a consequence, the cluster nodes were deleted (kubectl get nodes returned no nodes).
This behavior of the CCM complicated the investigation and access to the logs, as the CCM pods were evicted along with the nodes.
We suspect this behavior is not limited to OpenShift.
What you expected to happen?
Nodes should be left uninitialized, and the CCM should log a meaningful message and retry until the user creates the required policies in OCI.
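For context, a minimal sketch of the kind of dynamic group and policies the CCM needs when running with instance principals. The group name, compartment references, and the exact set of statements are hypothetical placeholders; consult the OCI CCM documentation for what your deployment actually requires:

```
# Hypothetical dynamic group matching the cluster's instances by compartment
ANY {instance.compartment.id = 'ocid1.compartment.oc1..<compartment_ocid>'}

# Hypothetical policy statements for that dynamic group
Allow dynamic-group ccm-dynamic-group to read instance-family in compartment <compartment_name>
Allow dynamic-group ccm-dynamic-group to use virtual-network-family in compartment <compartment_name>
Allow dynamic-group ccm-dynamic-group to manage load-balancers in compartment <compartment_name>
```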
How to reproduce it (as minimally and precisely as possible)?
Provision a cluster and ensure:
the dynamic groups and policies are not set up, or are set up incorrectly
the nodes are tainted with node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Then deploy the CCM with the useInstancePrincipals: true config flag. At this point, the CCM deletes the nodes; a sketch of these steps follows below.
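A minimal sketch of the reproduction, assuming the standard uninitialized taint and a cloud-provider config that enables instance principals (the config layout and all OCIDs here are illustrative placeholders, not our exact setup):

```
# Taint the nodes so the CCM is expected to initialize them
kubectl taint nodes --all node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule

# Illustrative cloud-provider config excerpt with instance principals
# enabled and no matching IAM policies in place:
cat <<'EOF' > cloud-provider.yaml
auth:
  useInstancePrincipals: true
compartment: ocid1.compartment.oc1..<compartment_ocid>
vcn: ocid1.vcn.oc1..<vcn_ocid>
EOF
```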
Anything else we need to know?
Here are the logs of the CCM pod before it deletes a node:
I0717 15:13:56.437876 1 node_controller.go:415] Initializing node test-infra-cluster-4107b8b3-master-2 with cloud provider
E0717 15:13:56.437954 1 node_controller.go:229] error syncing 'test-infra-cluster-4107b8b3-master-2': failed to get instance metadata for node test-infra-cluster-4107b8b3-master-2: error fetching node by provider ID: compartmentID annotation missing in the node. Would retry, and error by node name: error getting CompartmentID from Node Name: compartmentID annotation missing in the node. Would retry, requeuing
2023-07-17T15:13:56.969Z ERROR oci/node_info_controller.go:244 Failed to get instance from instance ID {"component": "cloud-controller-manager", "node": "test-infra-cluster-4107b8b3-master-2", "error": "Error returned by Compute Service. Http Status Code: 404. Error Code: NotAuthorizedOrNotFound. Opc request id: ed0509ddc78d5d902a7b8257aadea741/F14BBE797B83333448017788F7DE2651/E08C35B4C4FA433DD1DE55198F6F99AD. Message: instance ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq not found\nOperation Name: GetInstance\nTimestamp: 2023-07-17 15:13:54 +0000 GMT\nClient Version: Oracle-GoSDK/65.2.0\nRequest Endpoint: GET https://iaas.us-sanjose-1.oraclecloud.com/20160918/instances/ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq\nTroubleshooting Tips: See https://docs.oracle.com/iaas/Content/API/References/apierrors.htm#apierrors_404__404_notauthorizedornotfound for more information about resolving this error.\nAlso see https://docs.oracle.com/iaas/api/#/en/iaas/20160918/Instance/GetInstance for details on this operation's requirements.\nTo get more info on the failing request, you can set OCI_GO_SDK_DEBUG env var to info or higher level to log the request/response details.\nIf you are unable to resolve this Compute issue, please contact Oracle support and provide them this full error message.", "errorVerbose": "Error returned by Compute Service. Http Status Code: 404. Error Code: NotAuthorizedOrNotFound. Opc request id: ed0509ddc78d5d902a7b8257aadea741/F14BBE797B83333448017788F7DE2651/E08C35B4C4FA433DD1DE55198F6F99AD. Message: instance ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq not found\nOperation Name: GetInstance\nTimestamp: 2023-07-17 15:13:54 +0000 GMT\nClient Version: Oracle-GoSDK/65.2.0\nRequest Endpoint: GET https://iaas.us-sanjose-1.oraclecloud.com/20160918/instances/ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq\nTroubleshooting Tips: See https://docs.oracle.com/iaas/Content/API/References/apierrors.htm#apierrors_404__404_notauthorizedornotfound for more information about resolving this error.\nAlso see https://docs.oracle.com/iaas/api/#/en/iaas/20160918/Instance/GetInstance for details on this operation's requirements.\nTo get more info on the failing request, you can set OCI_GO_SDK_DEBUG env var to info or higher level to log the request/response details.\nIf you are unable to resolve this Compute issue, please contact Oracle support and provide them this full error 
message.\ngithub.com/oracle/oci-cloud-controller-manager/pkg/oci/client.(*client).GetInstance\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/oci/client/compute.go:50\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.getInstanceByNode\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:242\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).processItem\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:168\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).processNextItem\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:139\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).runWorker\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:124\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92\ngithub.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci.(*NodeInfoController).Run\n\t/go/src/github.com/oracle/oci-cloud-controller-manager/pkg/cloudprovider/providers/oci/node_info_controller.go:119\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1571"}
2023-07-17T15:13:56.969Z ERROR oci/node_info_controller.go:142 Error processing node test-infra-cluster-4107b8b3-master-2 (will retry): Error returned by Compute Service. Http Status Code: 404. Error Code: NotAuthorizedOrNotFound. Opc request id: ed0509ddc78d5d902a7b8257aadea741/F14BBE797B83333448017788F7DE2651/E08C35B4C4FA433DD1DE55198F6F99AD. Message: instance ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq not found
Operation Name: GetInstance
Timestamp: 2023-07-17 15:13:54 +0000 GMT
Client Version: Oracle-GoSDK/65.2.0
Request Endpoint: GET https://iaas.us-sanjose-1.oraclecloud.com/20160918/instances/ocid1.instance.oc1.us-sanjose-1.anzwuljr2bh44rycj2smgvblx5zqeryvqbjzleearhsrv6imiqytslrkoxuq
Troubleshooting Tips: See https://docs.oracle.com/iaas/Content/API/References/apierrors.htm#apierrors_404__404_notauthorizedornotfound for more information about resolving this error.
Also see https://docs.oracle.com/iaas/api/#/en/iaas/20160918/Instance/GetInstance for details on this operation's requirements.
To get more info on the failing request, you can set OCI_GO_SDK_DEBUG env var to info or higher level to log the request/response details.
If you are unable to resolve this Compute issue, please contact Oracle support and provide them this full error message. {"component": "cloud-controller-manager"}
I0717 15:13:58.998504 1 node_controller.go:415] Initializing node test-infra-cluster-4107b8b3-master-2 with cloud provider
E0717 15:13:58.998590 1 node_controller.go:229] error syncing 'test-infra-cluster-4107b8b3-master-2': failed to get instance metadata for node test-infra-cluster-4107b8b3-master-2: error fetching node by provider ID: compartmentID annotation missing in the node. Would retry, and error by node name: error getting CompartmentID from Node Name: compartmentID annotation missing in the node. Would retry, requeuing
I0717 15:13:59.329292 1 node_lifecycle_controller.go:164] deleting node since it is no longer present in cloud provider: test-infra-cluster-4107b8b3-master-2
I0717 15:13:59.329476 1 event.go:294] "Event occurred" object="test-infra-cluster-4107b8b3-master-2" fieldPath="" kind="Node" apiVersion="" type="Normal" reason="DeletingNode" message="Deleting node test-infra-cluster-4107b8b3-master-2 because it does not exist in the cloud provider"
2023-07-17T15:14:03.394Z ERROR oci/node_info_controller.go:142 Error processing node test-infra-cluster-4107b8b3-master-0 (will retry): node "test-infra-cluster-4107b8b3-master-0" not found {"component": "cloud-controller-manager"}
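Our reading of the logs (a guess, not a confirmed diagnosis): OCI returns 404 NotAuthorizedOrNotFound both when an instance does not exist and when the caller lacks permission, so if the CCM's existence check treats any 404 as "instance gone", the node lifecycle controller deletes the node. A simplified, hypothetical Go sketch of that failure mode; the names and signatures are illustrative, not the actual oci-cloud-controller-manager code:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotAuthorizedOrNotFound stands in for the OCI SDK's 404
// NotAuthorizedOrNotFound service error, which is returned both for
// missing instances and for callers without the required IAM policies.
var errNotAuthorizedOrNotFound = errors.New("NotAuthorizedOrNotFound")

// instanceExists is a simplified existence check. Mapping every 404 to
// (false, nil) tells the node lifecycle controller the instance is gone,
// which triggers node deletion even when the real problem is missing
// policies.
func instanceExists(getInstance func() error) (bool, error) {
	err := getInstance()
	if errors.Is(err, errNotAuthorizedOrNotFound) {
		return false, nil // ambiguous: not found OR not authorized
	}
	if err != nil {
		return false, err // other errors propagate, so the controller retries
	}
	return true, nil
}

func main() {
	// With no dynamic group/policies set up, every GetInstance call fails
	// with NotAuthorizedOrNotFound, so the check reports "does not exist".
	exists, err := instanceExists(func() error { return errNotAuthorizedOrNotFound })
	fmt.Println(exists, err) // false <nil> -> the node gets deleted
}
```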