Disks don't attach after a rateLimitExceeded #1129

Closed

guilhem opened this issue Feb 10, 2023 · 8 comments

Comments

@guilhem

guilhem commented Feb 10, 2023

When our nodes are recycled (spot preemption or upgrade), a lot of disks aren't properly detached and get stuck:

  Warning  FailedAttachVolume  -      attachdetach-controller  AttachVolume.Attach failed for volume "XXXXXXX" : rpc error: code = Internal desc = unknown Attach error: failed cloud service attach disk call: googleapi: Error 403: Too many pending operations on a resource., rateLimitExceeded
  Warning  FailedAttachVolume  -  attachdetach-controller  AttachVolume.Attach failed for volume "XXXXXXX" : rpc error: code = Unavailable desc = ControllerPublish not permitted on node "projects/themecloud-prod/zones/europe-west1-c/instances/XXXXXXXXXXXXXX-newNode" due to backoff condition
  Warning  FailedAttachVolume  -     attachdetach-controller  AttachVolume.Attach failed for volume "XXXXXXX" : rpc error: code = Unavailable desc = ControllerPublish not permitted on node "projects/XXXXXXXXXXXXXX/zones/europe-west1-c/instances/XXXXXXXXXXXXXX-oldnode" due to backoff condition

I have to manually run: gcloud compute instances detach-disk
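For reference, a sketch of that manual workaround (DISK_NAME and OLD_NODE_NAME below are placeholders, not values from the events above; the zone is the one shown in the events):

  # Manual workaround sketch: detach the stuck disk from the old node so the
  # attach to the new node can proceed. DISK_NAME and OLD_NODE_NAME are placeholders.
  gcloud compute instances detach-disk OLD_NODE_NAME \
    --disk=DISK_NAME \
    --zone=europe-west1-c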

Versions:

PD-CSI: gke.gcr.io/gcp-compute-persistent-disk-csi-driver:v1.8.1-gke.0
GKE: v1.25.5-gke.2000

@mattcary
Contributor

This rateLimitExceeded looks like an error from GCP, i.e., you've exceeded your API quota. That link has instructions for checking your quota.

The PD CSI driver then enters a backoff condition so that it doesn't make your quota problem worse by continuing to issue API calls.

It looks like you have access to the PD CSI controller logs? In that case you can search them to see whether the backoff is expiring as expected.
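For example, a sketch of how those logs could be inspected on a self-managed install of the driver (the namespace, deployment, and container names below are assumptions based on the driver's own deploy manifests and may differ per install; on GKE's managed driver the controller runs on the control plane, so its logs would be in Cloud Logging instead):

  # Sketch: search the PD CSI controller logs for backoff-related messages.
  # Resource names are assumptions; adjust for your install.
  kubectl logs -n gce-pd-csi-driver deploy/csi-gce-pd-controller -c gce-pd-driver | grep -i backoff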

@guilhem
Author

guilhem commented Feb 10, 2023

I understand perfectly what backoff is and what it does.
The problem here is that it never converges on a solution: I have to detach every disk by hand, even 10h after the node failure.
It makes my applications fail and my clients quite… angry.

@mattcary
Contributor

You haven't described the problem completely. Is the backoff always caused by GCP API quota exhaustion? The quota must recover at some point, because you're able to detach by hand. Are the controller logs visible to you? They should show why the backoff condition is not expiring.

The backoff will not stay enabled for 10h on its own; something is causing subsequent API calls by the driver to reset the backoff. If you discover what that is, it might give a clue to your problem.

@msau42
Contributor

msau42 commented Feb 13, 2023

We have seen an issue where an attach call fails with an operation-get API quota error, we then detach with "success", but the queued attach call executes some time later, resulting in a "pd is already attached to another node" error when we try to attach the disk to a 2nd node.

Not sure if that's the same scenario being encountered here.

@msau42
Contributor

msau42 commented Feb 13, 2023

Some more details on the sequence here: kubernetes-csi/external-attacher#400 (comment)

@guilhem
Author

guilhem commented Feb 13, 2023

It could definitely be related.
It's just that in my case, the detach never appeared.

@guilhem
Author

guilhem commented Feb 15, 2023

@mattcary the current behavior is just unworkable. With spot instances, I have hundreds of pods stuck every day.
I had to add the descheduler to delete all pods stuck in creating (the full policy is sketched after the snippet below):

  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 600
        states:
          - "Pending"
          - "ContainerCreating"

@mattcary
Contributor

Can you give us more information?

Since we made the backoff per-node and per-volume, we haven't seen instances of stuck backoff. I can't help based on the limited information here.

If your API quota is getting exhausted, then the backoff is probably helping prevent even more stuck pods (as otherwise the controller manager retries detaching too aggressively). If it's a different problem, then that's something to dig into.

Maybe the first thing to do is rule out API quota problems. You should be able to look in the Cloud Console at what your quota usage rates have been and whether they spike during upgrades.
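One way to sanity-check this from the CLI as well (a sketch only; the console quota pages remain the authoritative view) is to list operations that are still pending, since the 403 above is specifically about too many pending operations on a resource:

  # Sketch: list Compute Engine operations that are not yet DONE, to see whether
  # attach/detach operations pile up during node recycling.
  gcloud compute operations list --filter="status!=DONE"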
