too many cloud api calls in node-update-controller #442

Open
yussufsh opened this issue Aug 24, 2023 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

@yussufsh
Contributor

/kind bug
/kind enhancement

What happened?
The node-update-controller makes a large number of cloud API calls: it repeatedly creates the PowerVS cloud object, and some of the calls fail.

Within a single minute there are ~13 calls that create a cloud object and then GET the PVM instance to check and set the storage affinity policy.

# oc logs ibm-powervs-block-csi-driver-controller-86f4c6459-gxn8f -c node-update-controller --previous | grep 'I0821 05:21' | wc -l
27

See the examples below: a few of the errors occur while fetching the PVM instance, and the last one occurs while getting the PowerVS client object, which is fatal and causes the container to restart (see #441).

Examples:

# oc logs ibm-powervs-block-csi-driver-controller-86f4c6459-gxn8f -c node-update-controller --previous | grep -v 'StoragePoolAffinity' | grep -v 'PROVIDER-ID'
2023-08-19T02:42:36Z    INFO    controller-runtime.metrics      Metrics server is starting to listen    {"addr": ":8081"}
2023-08-19T02:42:36Z    INFO    setup   starting manager
2023-08-19T02:42:36Z    INFO    Starting server {"kind": "health probe", "addr": "[::]:8082"}
2023-08-19T02:42:36Z    INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8081"}
2023-08-19T02:42:36Z    INFO    Starting EventSource    {"controller": "node", "controllerGroup": "", "controllerKind": "Node", "source": "kind source: *v1.Node"}
2023-08-19T02:42:36Z    INFO    Starting Controller     {"controller": "node", "controllerGroup": "", "controllerKind": "Node"}
2023-08-19T02:42:36Z    INFO    Starting workers        {"controller": "node", "controllerGroup": "", "controllerKind": "Node", "worker count": 1}
I0819 05:54:42.543016       1 nodeupdate_controller.go:81] Unable to fetch Instance Details failed to Get PVM Instance 36776ce2-ef10-400b-be7d-c9511d00f01b :[GET /pcloud/v1/cloud-instances/{cloud_instance_id}/pvm-instances/{pvm_instance_id}][500] pcloudPvminstancesGetInternalServerError  &{Code:0 Description:pvm-instance 36776ce2-ef10-400b-be7d-c9511d00f01b in cloud-instance f4d71e5f9bea49f9a6fdae6f38c4b2cb error: failed to get server and update cache: timed out of retrieving resource for pvmInstanceServer:lon06:f4d71e5f9bea49f9a6fdae6f38c4b2cb:36776ce2-ef10-400b-be7d-c9511d00f01b Error:internal server error Message:}
I0820 06:54:24.914454       1 nodeupdate_controller.go:81] Unable to fetch Instance Details failed to Get PVM Instance 36776ce2-ef10-400b-be7d-c9511d00f01b :[GET /pcloud/v1/cloud-instances/{cloud_instance_id}/pvm-instances/{pvm_instance_id}][500] pcloudPvminstancesGetInternalServerError  &{Code:0 Description:pvm-instance 36776ce2-ef10-400b-be7d-c9511d00f01b in cloud-instance f4d71e5f9bea49f9a6fdae6f38c4b2cb error: failed to get server and update cache: timed out of retrieving resource for pvmInstanceServer:lon06:f4d71e5f9bea49f9a6fdae6f38c4b2cb:36776ce2-ef10-400b-be7d-c9511d00f01b Error:internal server error Message:}
I0820 17:30:31.360402       1 nodeupdate_controller.go:81] Unable to fetch Instance Details failed to Get PVM Instance 36776ce2-ef10-400b-be7d-c9511d00f01b :[GET /pcloud/v1/cloud-instances/{cloud_instance_id}/pvm-instances/{pvm_instance_id}][403] pcloudPvminstancesGetForbidden  &{Code:403 Description: Error: Message:user iam-ServiceId-c27c3ef5-8405-4dc1-9590-4440adaad19f does not have correct permissions to access crn:v1:bluemix:public:power-iaas:lon06:a/bf9f1f230466481b95a99f18739fede9:dbc67d5e-9579-49da-b1d9-fc2ec7ddc680:: with {role:user-unauthorized permissions (read:false write:false manage:false)}}
F0821 05:22:32.216618       1 powervs_node.go:69] Failed to get powervs cloud: errored while getting the Power VS service instance with ID: dbc67d5e-9579-49da-b1d9-fc2ec7ddc680, err: Get "https://resource-controller.cloud.ibm.com/v2/resource_instances/dbc67d5e-9579-49da-b1d9-fc2ec7ddc680": read tcp 192.168.81.10:46226->104.102.54.251:443: read: connection reset by peer

What you expected to happen?
The node-update-controller should not make so many cloud API calls.

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version):
  • Driver version: latest
@k8s-ci-robot
Contributor

@yussufsh: The label(s) kind/enhancement cannot be applied, because the repository doesn't have them.

In response to this:

/kind bug
/kind enhancement

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 24, 2023
@yussufsh
Contributor Author

/assign @yussufsh
One solution could be to add a node label as soon as we set the storage affinity policy to false on the PVM instance. Subsequent reconcile calls should then check whether the node already has that label; if the label is present, there is no need to call the cloud APIs. A rough sketch is shown below.
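
A minimal sketch of that flow, assuming a hypothetical label key powervs.csi.ibm.com/storage-affinity-updated and a generic controller-runtime reconciler; the actual label key, reconciler fields, and cloud call in the driver may differ:

package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// affinityUpdatedLabel is a hypothetical marker label recording that the
// storage affinity policy has already been disabled for this node's PVM instance.
const affinityUpdatedLabel = "powervs.csi.ibm.com/storage-affinity-updated"

type NodeUpdateReconciler struct {
	client.Client
	// disableAffinity stands in for the existing code path that creates the
	// PowerVS client and sets StoragePoolAffinity to false on the PVM instance.
	disableAffinity func(ctx context.Context, node *corev1.Node) error
}

func (r *NodeUpdateReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	node := &corev1.Node{}
	if err := r.Get(ctx, req.NamespacedName, node); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// If the node is already marked, skip all cloud API calls for this reconcile.
	if _, done := node.Labels[affinityUpdatedLabel]; done {
		return ctrl.Result{}, nil
	}

	// Only now create the PowerVS client and update the PVM instance.
	if err := r.disableAffinity(ctx, node); err != nil {
		return ctrl.Result{}, err
	}

	// Record completion on the node so subsequent reconciles short-circuit.
	patch := client.MergeFrom(node.DeepCopy())
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	node.Labels[affinityUpdatedLabel] = "true"
	return ctrl.Result{}, r.Patch(ctx, node, patch)
}

If the label is ever removed (or on a fresh node), the controller simply redoes the cloud check once and re-applies the label, so it acts as a best-effort cache rather than a source of truth.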

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 26, 2024
@yussufsh
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 27, 2024
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2024
@yussufsh
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 27, 2024
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 26, 2024
@yussufsh
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 25, 2024