
Single PVC going into Pending state after continuous creation and deletion #83

Open
ksandha opened this issue Oct 18, 2018 · 8 comments
Labels
bug Something isn't working

Comments

@ksandha

ksandha commented Oct 18, 2018

The PVC goes into a Pending state after creating and deleting a single PVC in a loop.
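For reference, a minimal reproduction sketch (the manifest name pvc.yaml and the loop count are assumptions; the exact loop used for testing is not shown here, but the claim name gcs-pvc1 and StorageClass glusterfs-csi match the describe output below):

# pvc.yaml is assumed to define claim gcs-pvc1 with storageClassName: glusterfs-csi
for i in $(seq 1 50); do
  kubectl create -f pvc.yaml       # create the claim
  sleep 10                         # give the external provisioner time to bind it
  kubectl delete pvc gcs-pvc1      # delete it and repeat
done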

[vagrant@kube1 ~]$ kubectl describe pvc 
Name:          gcs-pvc1
Namespace:     default
StorageClass:  glusterfs-csi
Status:        Pending
Volume:        
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: org.gluster.glusterfs
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
Events:
  Type       Reason                Age               From                         Message
  ----       ------                ----              ----                         -------
  Normal     ExternalProvisioning  8s (x3 over 20s)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "org.gluster.glusterfs" or manually created by system administrator
Mounted By:  <none>
[vagrant@kube1 ~]$ 
@ksandha changed the title from "PVC's going in pending state after continuous creation and deletion" to "Single PVC's going in pending state after continuous creation and deletion" on Oct 18, 2018
@ksandha
Author

ksandha commented Oct 18, 2018

[vagrant@kube1 ~]$ kubectl describe pvc 
Name:          gcs-pvc1
Namespace:     default
StorageClass:  glusterfs-csi
Status:        Pending
Volume:        
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: org.gluster.glusterfs
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
Events:
  Type       Reason                Age               From                         Message
  ----       ------                ----              ----                         -------
  Normal     ExternalProvisioning  8s (x3 over 20s)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "org.gluster.glusterfs" or manually created by system administrator
Mounted By:  <none>
[vagrant@kube1 ~]$ kubectl describe pvc 
Name:          gcs-pvc1
Namespace:     default
StorageClass:  glusterfs-csi
Status:        Pending
Volume:        
Labels:        <none>
Annotations:   control-plane.alpha.kubernetes.io/leader:
                 {"holderIdentity":"4c17a0c4-d2a6-11e8-86f7-0a580ae94209","leaseDurationSeconds":15,"acquireTime":"2018-10-18T08:23:09Z","renewTime":"2018-...
               volume.beta.kubernetes.io/storage-provisioner: org.gluster.glusterfs
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
Events:
  Type       Reason                Age                   From                                                                                          Message
  ----       ------                ----                  ----                                                                                          -------
  Normal     ExternalProvisioning  19m (x25 over 22m)    persistentvolume-controller                                                                   waiting for a volume to be created, either by external provisioner "org.gluster.glusterfs" or manually created by system administrator
  Normal     ExternalProvisioning  16m (x19 over 17m)    persistentvolume-controller                                                                   waiting for a volume to be created, either by external provisioner "org.gluster.glusterfs" or manually created by system administrator
  Normal     ExternalProvisioning  4m5s (x221 over 14m)  persistentvolume-controller                                                                   waiting for a volume to be created, either by external provisioner "org.gluster.glusterfs" or manually created by system administrator
  Warning    ProvisioningFailed    3m42s (x10 over 15m)  org.gluster.glusterfs csi-provisioner-glusterfsplugin-0 4c17a0c4-d2a6-11e8-86f7-0a580ae94209  Failed to provision volume with StorageClass "glusterfs-csi": rpc error: code = Internal desc = error in fetching volume details context deadline exceeded
  Normal     Provisioning          92s (x13 over 15m)    org.gluster.glusterfs csi-provisioner-glusterfsplugin-0 4c17a0c4-d2a6-11e8-86f7-0a580ae94209  External provisioner is provisioning volume for claim "default/gcs-pvc1"
  Warning    ProvisioningFailed    92s (x3 over 8m17s)   org.gluster.glusterfs csi-provisioner-glusterfsplugin-0 4c17a0c4-d2a6-11e8-86f7-0a580ae94209  Failed to provision volume with StorageClass "glusterfs-csi": rpc error: code = Internal desc = error in fetching volume details Get http://glusterd2-client.gcs:24007/v1/volumes/pvc-36fc6b59d2b011e8/status: dial tcp 10.233.60.197:24007: connect: connection refused
Mounted By:  <none>
[vagrant@kube1 ~]$ 

@JohnStrunk
Member

Debugging info needed (see the command sketch after this list):

  • Status of GCS pods: kubectl get -n gcs all
  • Logs from the CSI driver & gluster pods
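A rough sketch of commands that would gather this (pod names are taken from the kubectl get -n gcs all output posted later in this thread; --all-containers is used because the exact container names inside the pods are an assumption):

kubectl get -n gcs all
kubectl logs -n gcs csi-provisioner-glusterfsplugin-0 --all-containers
kubectl logs -n gcs gluster-kube1-0
kubectl logs -n gcs gluster-kube2-0
kubectl logs -n gcs gluster-kube3-0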

@JohnStrunk added the "bug (Something isn't working)" label on Oct 18, 2018
@Madhu-1
Member

Madhu-1 commented Oct 22, 2018

From the PVC describe output, this looks like an issue with the glusterd2 and etcd cluster.

Failed to provision volume with StorageClass "glusterfs-csi": rpc error: code = Internal desc = error in fetching volume details context deadline exceeded

This is the error message from glusterd2 when it is not able to connect to etcd.

fetching volume details Get http://glusterd2-client.gcs:24007/v1/volumes/pvc-36fc6b59d2b011e8/status: dial tcp 10.233.60.197:24007: connect: connection refused

This is the error message seen when the CSI driver is not able to connect to the glusterd2 pod.

As per my understanding, this is not an issue with the CSI driver; rather, it is an issue in the glusterd2 cluster.
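To confirm which hop is failing, one could probe each connection directly from inside the pods. A hedged sketch (the glusterd2-client.gcs:24007 endpoint comes from the error message above; the etcd-client service and port come from the kubectl get -n gcs all output later in this thread; the container name after -c and the availability of curl inside the images are assumptions):

# CSI provisioner -> glusterd2: hit the same service the driver uses
kubectl exec -n gcs csi-provisioner-glusterfsplugin-0 -c glusterfs -- \
  curl -s http://glusterd2-client.gcs:24007/v1/volumes
# glusterd2 -> etcd: check etcd health from a gluster pod
kubectl exec -n gcs gluster-kube1-0 -- \
  curl -s http://etcd-client.gcs:2379/health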

@vpandey-RH

@aravindavk #80 could possibly be related to the same issue. I have a question: why are we even querying etcd every time after we have successfully created the volume? We already create a volinfo in one of the transaction steps during volume-create, so the additional etcd queries could be avoided, I guess.

@ksandha
Author

ksandha commented Oct 22, 2018

[vagrant@kube1 ~]$ kubectl get -n gcs all
NAME                                       READY   STATUS    RESTARTS   AGE
pod/csi-attacher-glusterfsplugin-0         2/2     Running   0          12h
pod/csi-nodeplugin-glusterfsplugin-4dkdj   2/2     Running   0          12h
pod/csi-nodeplugin-glusterfsplugin-998k2   2/2     Running   0          12h
pod/csi-nodeplugin-glusterfsplugin-gsv64   2/2     Running   0          12h
pod/csi-provisioner-glusterfsplugin-0      2/2     Running   0          12h
pod/etcd-7xhlbpk8t5                        1/1     Running   0          12h
pod/etcd-d6tgccsrkv                        1/1     Running   0          12h
pod/etcd-operator-7cb5bd459b-zm4xn         1/1     Running   0          12h
pod/etcd-vpnmtxxpmz                        1/1     Running   0          12h
pod/gluster-kube1-0                        1/1     Running   1          12h
pod/gluster-kube2-0                        1/1     Running   0          12h
pod/gluster-kube3-0                        1/1     Running   0          12h

NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)               AGE
service/etcd                        ClusterIP   None            <none>        2379/TCP,2380/TCP     12h
service/etcd-client                 ClusterIP   10.233.45.50    <none>        2379/TCP              12h
service/glusterd2                   ClusterIP   None            <none>        24007/TCP,24008/TCP   12h
service/glusterd2-client            ClusterIP   10.233.44.221   <none>        24007/TCP             12h
service/glusterd2-client-nodeport   NodePort    10.233.41.38    <none>        24007:31007/TCP       12h

NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/csi-nodeplugin-glusterfsplugin   3         3         3       3            3           <none>          12h

NAME                            DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/etcd-operator   1         1         1            1           12h

NAME                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/etcd-operator-7cb5bd459b   1         1         1       12h

NAME                                               DESIRED   CURRENT   AGE
statefulset.apps/csi-attacher-glusterfsplugin      1         1         12h
statefulset.apps/csi-provisioner-glusterfsplugin   1         1         12h
statefulset.apps/gluster-kube1                     1         1         12h
statefulset.apps/gluster-kube2                     1         1         12h
statefulset.apps/gluster-kube3                     1         1         12h
[vagrant@kube1 ~]$ 

@ksandha
Author

ksandha commented Oct 22, 2018

CSI_porvisioner.log

@Madhu-1
Member

Madhu-1 commented Oct 24, 2018

From the logs provided

E1022 08:48:31.420202       1 controller.go:1138] Deletion of volume "pvc-032ca506d5d211e8" failed: rpc error: code = Internal desc = failed to stop volume Post http://glusterd2-client.gcs:24007/v1/volumes/pvc-032ca506d5d211e8/stop: dial tcp 10.233.44.221:24007: connect: connection refused
E1022 08:48:31.420399       1 goroutinemap.go:150] Operation for "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]" failed. No retries permitted until 2018-10-22 08:50:33.420330939 +0000 UTC m=+45493.134694769 (durationBeforeRetry 2m2s). Error: "rpc error: code = Internal desc = failed to stop volume Post http://glusterd2-client.gcs:24007/v1/volumes/pvc-032ca506d5d211e8/stop: dial tcp 10.233.44.221:24007: connect: connection refused"
I1022 08:48:46.388560       1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:48:46.388682       1 controller.go:1174] Error scheduling operaion "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]": Failed to create operation with name "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]". An operation with that name failed at 2018-10-22 08:48:31.420330939 +0000 UTC m=+45371.134694769. No retries permitted until 2018-10-22 08:50:33.420330939 +0000 UTC m=+45493.134694769 (2m2s). Last error: "rpc error: code = Internal desc = failed to stop volume Post http://glusterd2-client.gcs:24007/v1/volumes/pvc-032ca506d5d211e8/stop: dial tcp 10.233.44.221:24007: connect: connection refused".
I1022 08:49:01.397761       1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:49:01.397838       1 controller.go:1174] Error scheduling operaion "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]": Failed to create operation with name "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]". An operation with that name failed at 2018-10-22 08:48:31.420330939 +0000 UTC m=+45371.134694769. No retries permitted until 2018-10-22 08:50:33.420330939 +0000 UTC m=+45493.134694769 (2m2s). Last error: "rpc error: code = Internal desc = failed to stop volume Post http://glusterd2-client.gcs:24007/v1/volumes/pvc-032ca506d5d211e8/stop: dial tcp 10.233.44.221:24007: connect: connection refused".
I1022 08:49:16.415974       1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:49:16.416053       1 controller.go:1174] Error scheduling operaion "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]": Failed to create operation with name "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]". An operation with that name failed at 2018-10-22 08:48:31.420330939 +0000 UTC m=+45371.134694769. No retries permitted until 2018-10-22 08:50:33.420330939 +0000 UTC m=+45493.134694769 (2m2s). Last error: "rpc error: code = Internal desc = failed to stop volume Post http://glusterd2-client.gcs:24007/v1/volumes/pvc-032ca506d5d211e8/stop: dial tcp 10.233.44.221:24007: connect: connection refused".
I1022 08:49:31.505858       1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:49:31.505957       1 controller.go:1174] Error scheduling operaion "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]": Failed to create operation with name "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]". An operation with that name failed at 2018-10-22 08:48:31.420330939 +0000 UTC m=+45371.134694769. No retries permitted until 2018-10-22 08:50:33.420330939 +0000 UTC m=+45493.134694769 (2m2s). Last error: "rpc error: code = Internal desc = failed to stop volume Post http://glusterd2-client.gcs:24007/v1/volumes/pvc-032ca506d5d211e8/stop: dial tcp 10.233.44.221:24007: connect: connection refused".

This is a connection issue between glusterd2 and the CSI provisioner.

I feel this is not an issue with the CSI driver.
@JohnStrunk, any thoughts on this one?
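If that is the case, the glusterd2 pods should show restarts or errors around the failure window. A hedged way to check (pod names are from the kubectl get -n gcs all output above; note that gluster-kube1-0 already shows 1 restart there):

kubectl get pods -n gcs
kubectl logs -n gcs gluster-kube1-0 --previous        # only works if the pod actually restarted
kubectl logs -n gcs gluster-kube1-0 | grep -iE 'etcd|error'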

@JohnStrunk
Member

First thing I see is:

I1022 07:47:58.161467       1 controller.go:900] volume "pvc-c9630f0cd5ce11e8" for claim "default/gcs-pvc1" created
I1022 07:47:58.185548       1 controller.go:917] volume "pvc-c9630f0cd5ce11e8" for claim "default/gcs-pvc1" saved
I1022 07:47:58.185606       1 controller.go:953] volume "pvc-c9630f0cd5ce11e8" provisioned for claim "default/gcs-pvc1"
I1022 07:47:58.295398       1 leaderelection.go:198] stopped trying to renew lease to provision for pvc default/gcs-pvc1, task succeeded
I1022 07:48:16.056529       1 controller.go:1167] scheduleOperation[delete-pvc-c9630f0cd5ce11e8[cb109aab-d5ce-11e8-ab51-5254008c51c3]]
I1022 07:48:16.255186       1 controller.go:1167] scheduleOperation[delete-pvc-c9630f0cd5ce11e8[cb109aab-d5ce-11e8-ab51-5254008c51c3]]
I1022 07:48:17.227000       1 controller.go:1143] volume "pvc-c9630f0cd5ce11e8" deleted
I1022 07:48:17.257527       1 controller.go:1154] volume "pvc-c9630f0cd5ce11e8" deleted from database
I1022 07:48:17.258480       1 controller.go:1167] scheduleOperation[delete-pvc-c9630f0cd5ce11e8[cb109aab-d5ce-11e8-ab51-5254008c51c3]]
I1022 07:48:17.296047       1 controller.go:1167] scheduleOperation[lock-provision-default/gcs-pvc1[d60b807c-d5ce-11e8-a89d-5254003f14bf]]
I1022 07:48:17.328763       1 controller.go:1143] volume "pvc-c9630f0cd5ce11e8" deleted
I1022 07:48:17.366335       1 controller.go:1150] failed to delete volume "pvc-c9630f0cd5ce11e8" from database: persistentvolumes "pvc-c9630f0cd5ce11e8" not found

I'm not sure why it failed to delete it from the "database" or even what database this is...

Seems like the real problems start here:

I1022 08:31:24.292024       1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
I1022 08:31:31.306224       1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
I1022 08:31:31.306300       1 controller.go:1167] scheduleOperation[delete-pvc-f683aab6d5d111e8[bcb42c53-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:31:42.653042       1 leaderelection.go:273] Failed to update lock: Operation cannot be fulfilled on persistentvolumeclaims "gcs-pvc1": the object has been modified; please apply your changes to the latest version and try again
E1022 08:31:42.679760       1 leaderelection.go:273] Failed to update lock: Operation cannot be fulfilled on persistentvolumeclaims "gcs-pvc1": the object has been modified; please apply your changes to the latest version and try again
I1022 08:31:42.722991       1 controller.go:1154] volume "pvc-f683aab6d5d111e8" deleted from database
I1022 08:31:42.727233       1 controller.go:1167] scheduleOperation[delete-pvc-f683aab6d5d111e8[bcb42c53-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:31:42.745796       1 leaderelection.go:273] Failed to update lock: Operation cannot be fulfilled on persistentvolumeclaims "gcs-pvc1": the object has been modified; please apply your changes to the latest version and try again
I1022 08:31:42.773592       1 leaderelection.go:156] attempting to acquire leader lease...
I1022 08:31:42.774356       1 controller.go:999] claim default/gcs-pvc1 was deleted
I1022 08:31:43.527429       1 controller.go:1143] volume "pvc-f683aab6d5d111e8" deleted
I1022 08:31:45.249844       1 leaderelection.go:163] stopped trying to acquire lease to provision for pvc default/gcs-pvc1, task succeeded
I1022 08:31:45.249974       1 controller.go:994] claim default/gcs-pvc1 was modified to not ask for this provisioner
I1022 08:31:45.762912       1 leaderelection.go:163] stopped trying to acquire lease to provision for pvc default/gcs-pvc1, task succeeded
I1022 08:31:45.763047       1 controller.go:994] claim default/gcs-pvc1 was modified to not ask for this provisioner
I1022 08:31:45.885358       1 leaderelection.go:163] stopped trying to acquire lease to provision for pvc default/gcs-pvc1, task succeeded
I1022 08:31:45.885477       1 controller.go:994] claim default/gcs-pvc1 was modified to not ask for this provisioner
I1022 08:31:46.306409       1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:31:52.624897       1 controller.go:1138] Deletion of volume "pvc-032ca506d5d211e8" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
E1022 08:31:52.626702       1 goroutinemap.go:150] Operation for "delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]" failed. No retries permitted until 2018-10-22 08:31:53.125386042 +0000 UTC m=+44372.839749882 (durationBeforeRetry 500ms). Error: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
W1022 08:31:52.734114       1 controller.go:354] CreateVolume timeout: 10s has expired, operation will be retried
I1022 08:32:01.306706       1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
W1022 08:32:07.734855       1 controller.go:354] CreateVolume timeout: 10s has expired, operation will be retried
I1022 08:32:16.306854       1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:32:22.135344       1 controller.go:895] Failed to provision volume for claim "default/gcs-pvc1" with StorageClass "glusterfs-csi": rpc error: code = Internal desc = error in fetching volume details context deadline exceeded
E1022 08:32:22.137797       1 goroutinemap.go:150] Operation for "provision-default/gcs-pvc1[b4d70d4e-d5d2-11e8-a89d-5254003f14bf]" failed. No retries permitted until 2018-10-22 08:32:22.637746324 +0000 UTC m=+44402.352110180 (durationBeforeRetry 500ms). Error: "rpc error: code = Internal desc = error in fetching volume details context deadline exceeded"
I1022 08:32:23.251369       1 leaderelection.go:156] attempting to acquire leader lease...
I1022 08:32:23.252056       1 controller.go:999] claim default/gcs-pvc1 was deleted
I1022 08:32:31.307239       1 controller.go:1167] scheduleOperation[delete-pvc-032ca506d5d211e8[dc61727e-d5d4-11e8-ab51-5254008c51c3]]
E1022 08:32:33.337181       1 controller.go:895] Failed to provision volume for claim "default/gcs-pvc1" with StorageClass "glusterfs-csi": rpc error: code = Internal desc = error in fetching volume details context deadline exceeded

From my reading, the operation times out (10s) and gets scheduled for a retry (which is ok); then, later, we start getting the connection refused errors. The thing to remember is that the CSI driver is talking to the GD2 service IP, so as long as any GD2 instance is up, it should be able to connect... connection refused would indicate they are all down.

I suspect it's a GD2 or etcd problem causing the failure.
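One hedged way to verify that is to check whether the glusterd2-client service had any endpoints behind it when the failures occurred (the service name comes from the error messages; the label selector on the gluster pods is not shown in this issue, so a plain pod listing is used instead):

kubectl get endpoints -n gcs glusterd2-client
kubectl get pods -n gcs -o wide | grep gluster-kube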
