Disks don't attach after a rateLimitExceeded #1129

Closed

guilhem opened this issue Feb 10, 2023 · 8 comments

Comments

@guilhem

guilhem commented Feb 10, 2023

When our nodes are recycled (spot preemption or upgrade), a lot of disks aren't properly detached and get stuck:

  Warning  FailedAttachVolume  -      attachdetach-controller  AttachVolume.Attach failed for volume "XXXXXXX" : rpc error: code = Internal desc = unknown Attach error: failed cloud service attach disk call: googleapi: Error 403: Too many pending operations on a resource., rateLimitExceeded
  Warning  FailedAttachVolume  -  attachdetach-controller  AttachVolume.Attach failed for volume "XXXXXXX" : rpc error: code = Unavailable desc = ControllerPublish not permitted on node "projects/themecloud-prod/zones/europe-west1-c/instances/XXXXXXXXXXXXXX-newNode" due to backoff condition
  Warning  FailedAttachVolume  -     attachdetach-controller  AttachVolume.Attach failed for volume "XXXXXXX" : rpc error: code = Unavailable desc = ControllerPublish not permitted on node "projects/XXXXXXXXXXXXXX/zones/europe-west1-c/instances/XXXXXXXXXXXXXX-oldnode" due to backoff condition

I have to manually run: gcloud compute instances detach-disk
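For reference, a sketch of that manual workaround (DISK_NAME and OLD_NODE_NAME below are placeholders, not values from the events above; the zone is the one shown in the events):

  # Manual workaround sketch: detach the stuck disk from the old node so the
  # attach to the new node can proceed. DISK_NAME and OLD_NODE_NAME are placeholders.
  gcloud compute instances detach-disk OLD_NODE_NAME \
    --disk=DISK_NAME \
    --zone=europe-west1-c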

Versions:

PD-CSI: gke.gcr.io/gcp-compute-persistent-disk-csi-driver:v1.8.1-gke.0
GKE: v1.25.5-gke.2000

@mattcary
Contributor

This rateLimitExceeded looks like an error from GCP, i.e., you've exceeded your API quota. That link has instructions for checking your quota.

The PD CSI driver then enters a backoff condition so that it doesn't make your quota problem worse by continuing to issue API calls.

It looks like you have access to the PD CSI controller logs? In that case you can search them to see whether the backoff is expiring as expected.
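For example, a sketch of how those logs could be inspected on a self-managed install of the driver (the namespace, deployment, and container names below are assumptions based on the driver's own deploy manifests and may differ per install; on GKE's managed driver the controller runs on the control plane, so its logs would be in Cloud Logging instead):

  # Sketch: search the PD CSI controller logs for backoff-related messages.
  # Resource names are assumptions; adjust for your install.
  kubectl logs -n gce-pd-csi-driver deploy/csi-gce-pd-controller -c gce-pd-driver | grep -i backoff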

@guilhem
Author

guilhem commented Feb 10, 2023

I understand perfectly what backoff is and what it does.
The problem here is that it never converges on a solution: I have to detach every disk by hand, even 10h after the node failure.
It makes my applications fail and my clients quite… angry.

@mattcary
Contributor

You haven't described the problem completely. Is the backoff always caused by GCP API quota exhaustion? The quota must recover at some point, because you're able to detach by hand. Are the controller logs visible to you? They should show why the backoff condition is not expiring.

The backoff will not stay enabled for 10h on its own; something is causing subsequent API calls by the driver to reset the backoff. If you discover what that is, it might give a clue to your problem.

@msau42
Contributor

msau42 commented Feb 13, 2023

We have seen an issue where an attach call fails with an operation-get API quota error, we then detach with "success", but the queued attach call executes some time later, resulting in a "pd is already attached to another node" error when we try to attach the disk to a 2nd node.

Not sure if that's the same scenario being encountered here.

@msau42
Contributor

msau42 commented Feb 13, 2023

Some more details on the sequence here: kubernetes-csi/external-attacher#400 (comment)

@guilhem
Author

guilhem commented Feb 13, 2023

It could definitely be related.
It's just that in my case, the detach never appeared.

@guilhem
Author

guilhem commented Feb 15, 2023

@mattcary the current behavior is just unworkable. With spot instances, I have hundreds of pods stuck every day.
I had to add the descheduler to delete all pods stuck in creating (the full policy is sketched after the snippet below):

  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 600
        states:
          - "Pending"
          - "ContainerCreating"

@mattcary
Contributor

Can you give us more information?

Since we made the backoff per-node and per-volume, we haven't seen instances of stuck backoff. I can't help based on the limited information here.

If your API quota is getting exhausted, then the backoff is probably helping prevent even more stuck pods (as otherwise the controller manager retries detaching too aggressively). If it's a different problem, then that's something to dig into.

Maybe the first thing to do is rule out API quota problems. You should be able to look in the Cloud Console at what your quota usage rates have been and whether they spike during upgrades.
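One way to sanity-check this from the CLI as well (a sketch only; the console quota pages remain the authoritative view) is to list operations that are still pending, since the 403 above is specifically about too many pending operations on a resource:

  # Sketch: list Compute Engine operations that are not yet DONE, to see whether
  # attach/detach operations pile up during node recycling.
  gcloud compute operations list --filter="status!=DONE"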
