Disk doesn't attach after a rateLimitExceeded #1129
This rateLimitExceeded looks like an error from GCP, i.e. you've exceeded your API quota. That link has instructions for how to check your quota. The PD CSI driver then enters a backoff condition in order to not make your quota problem worse by continuing to query. It looks like you have access to the PD CSI controller logs? In that case you can look for these logs to see whether the backoff is expiring as expected.
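If it helps, here is a rough sketch of checking Compute Engine quotas from the command line; project and region names below are placeholders, and pure API rate quotas are easier to inspect on the Cloud Console quotas page:

```sh
# Project-wide Compute Engine resource quotas and current usage.
gcloud compute project-info describe --project my-project \
  --format="yaml(quotas)"

# Regional quotas (operations, disks, etc.) for an example region.
gcloud compute regions describe us-central1 --project my-project \
  --format="yaml(quotas)"
```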
I perfectly understand what backoff is and what it does.
You haven't described the problem completely. Is the backoff always caused by GCP API quota exhaustion? The quota must recover at some point, because you're able to detach by hand. Are the controller logs visible to you? You should be able to see why the backoff condition is not expiring. The backoff will not stay enabled for 10h on its own; something is causing subsequent API calls by the driver to reset the backoff. If you discover what that is, it might give a clue to your problem.
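As a sketch, assuming a self-managed driver installed from the upstream manifests (on GKE the managed controller's logs live in Cloud Logging instead, and the namespace/resource names below may differ in your deployment), the backoff-related messages could be pulled out with something like:

```sh
# Grep the controller's driver container for backoff-related messages
# over the window where attaches were stuck.
kubectl logs -n gce-pd-csi-driver deployment/csi-gce-pd-controller \
  -c gce-pd-driver --since=12h | grep -i backoff
```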
We have seen an issue where an attach call fails with an operation-get API quota error; we then detach with "success", but the queued attach call executes some time later, resulting in a "pd is already attached to another node" error when we try to attach the disk to a second node. Not sure if that's the same scenario being encountered here.
Some more details on the sequence here: kubernetes-csi/external-attacher#400 (comment)
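If that is the sequence being hit, one way to confirm where GCE currently thinks the disk is attached might be the following; the disk and zone names are placeholders:

```sh
# Shows the instances the disk is attached to plus the last
# attach/detach timestamps recorded by GCE.
gcloud compute disks describe my-pd-disk --zone us-central1-a \
  --format="yaml(users, lastAttachTimestamp, lastDetachTimestamp)"
```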
It can definitely be related.
@mattcary current behavior is just impossible. With spot instances, I have hundreds of pods stuck every day.

```yaml
"PodLifeTime":
  enabled: true
  params:
    podLifeTime:
      maxPodLifeTimeSeconds: 600
      states:
      - "Pending"
      - "ContainerCreating"
```
Can you give us more information? Since we made the backoff per-node and per-volume we haven't seen instances of stuck backoff, and I can't help based on the limited information here. If your API quota is getting exhausted, then the backoff is probably helping reduce even more stuck pods (otherwise the controller manager retries detaching too aggressively). If it's a different problem, then that's something to dig into. Maybe the first thing to do is to rule out API quota problems: you should be able to see in the cloud console what your quota usage rates have been and whether they spike during upgrades.
When our nodes are recycled (spot or upgrade), a lot of disks aren't properly unmounted and get stuck.
I have to manually run:
`gcloud compute instances detach-disk`
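The general form, with placeholder instance, disk, and zone names, is roughly:

```sh
# Detach the stuck disk from the old node so it can be attached elsewhere.
gcloud compute instances detach-disk my-old-node \
  --disk my-pd-disk --zone us-central1-a
```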
Versions:
- PD-CSI: `gke.gcr.io/gcp-compute-persistent-disk-csi-driver:v1.8.1-gke.0`
- GKE: `v1.25.5-gke.2000`