
noobaa-db-pg pod doesn't migrate when the node has Kubelet service stopped, says PVC can't be moved #6853

Closed
rkomandu opened this issue Jan 12, 2022 · 19 comments

@rkomandu
Collaborator

Environment info

  • NooBaa Version: 5.9.0 (ODF 4.9.0 RC build)
  • Platform: OpenShift 4.9.5 (Kubernetes v1.22.0-rc.0+a44d0f0)

The NooBaa version is the RC build of ODF 4.9.0.

noobaa status
INFO[0000] CLI version: 5.9.0
INFO[0000] noobaa-image: quay.io/rhceph-dev/mcg-core@sha256:6ce2ddee7aff6a0e768fce523a77c998e1e48e25d227f93843d195d65ebb81b9
INFO[0000] operator-image: quay.io/rhceph-dev/mcg-operator@sha256:cc293c7fe0fdfe3812f9d1af30b6f9c59e97d00c4727c4463a5b9d3429f4278e
INFO[0000] noobaa-db-image: registry.redhat.io/rhel8/postgresql-12@sha256:b3e5b7bc6acd6422f928242d026171bcbed40ab644a2524c84e8ccb4b1ac48ff
INFO[0000] Namespace: openshift-storage

oc version
Client Version: 4.9.5
Server Version: 4.9.5
Kubernetes Version: v1.22.0-rc.0+a44d0f0

Actual behavior

After the kubelet service is stopped on the node hosting noobaa-db-pg-0, the pod is rescheduled to another worker but remains stuck in Init:0/2 because its PVC stays attached to the old node (Multi-Attach error).

Expected behavior

noobaa-db-pg-0 should migrate to a healthy worker node, have its PVC re-attached there, and return to the Running state.

Steps to reproduce

Step 1: MetalLB is configured on the cluster (which should not matter for this problem). The NooBaa core/db pods and the 3 endpoint pods are running on the respective worker nodes as shown below:

NAME                                               READY   STATUS    RESTARTS      AGE    IP              NODE                                  NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running   0             20d    10.254.14.77    worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-db-pg-0                                     1/1     Running   3 (17d ago)   20d    10.254.18.0     worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-default-backing-store-noobaa-pod-a1bf952a   1/1     Running   0             20d    10.254.18.4     worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-7jzdf                    1/1     Running   0             3d1h   10.254.20.43    worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-gxz5h                    1/1     Running   0             3d4h   10.254.15.112   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-mbfrj                    1/1     Running   0             3d4h   10.254.17.208   worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-operator-5c46775cdd-vplhr                   1/1     Running   0             31d    10.254.16.22    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>

Step 2: Stopped the kubelet service on the node where the noobaa-db-pg pod is running:

[core@worker0 ~]$ sudo systemctl stop kubelet
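A minimal sketch of the same step, plus a quick check that the cluster actually sees the node going down (the node name is specific to this cluster):

# On the node hosting noobaa-db-pg-0 (worker0 here), stop the kubelet
sudo systemctl stop kubelet

# From a machine with cluster access, watch the node transition to NotReady
oc get nodes -w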

Step 3: The noobaa-db-pg pod tries to migrate from worker0 to worker2, the noobaa operator has restarted, and the noobaa endpoint that was on worker0 has gone into the Pending state, as expected:

NAME                                               READY   STATUS              RESTARTS   AGE    IP              NODE                                  NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running             0          20d    10.254.14.77    worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-db-pg-0                                     0/1     Init:0/2            0          6s     <none>          worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-7jzdf                    1/1     Running             0          3d1h   10.254.20.43    worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-gxz5h                    1/1     Running             0          3d4h   10.254.15.112   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-wlktz                    0/1     Pending             0          6s     <none>          <none>                                <none>           <none>
noobaa-operator-5c46775cdd-9mgxt                   0/1     ContainerCreating   0          6s     <none>          worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>

Step 4: The noobaa-db-pg pod continues to sit in the Init state on worker2:

NAME                                               READY   STATUS     RESTARTS   AGE     IP              NODE                                  NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running    0          20d     10.254.14.77    worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-db-pg-0                                     0/1     Init:0/2   0          7m52s   <none>          worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-7jzdf                    1/1     Running    0          3d1h    10.254.20.43    worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-gxz5h                    1/1     Running    0          3d4h    10.254.15.112   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-bfffdd599-wlktz                    0/1     Pending    0          7m52s   <none>          <none>                                <none>           <none>
noobaa-operator-5c46775cdd-9mgxt                   1/1     Running    0          7m52s   10.254.20.72    worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
           
Step 5: Describing noobaa-db-pg-0 shows that the PVC is still attached to worker0 and cannot be attached to worker2:
  
  Events:
  Type     Reason              Age                    From                     Message
  ----     ------              ----                   ----                     -------
  Normal   Scheduled           11m                    default-scheduler        Successfully assigned openshift-storage/noobaa-db-pg-0 to worker2.rkomandu-ta.cp.fyre.ibm.com
  Warning  FailedAttachVolume  11m                    attachdetach-controller  Multi-Attach error for volume "pvc-3e03cdb0-a374-4aed-bc3f-6e6f9ba74bca" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedMount         2m42s (x4 over 9m31s)  kubelet                  Unable to attach or mount volumes: unmounted volumes=[db], unattached volumes=[db kube-api-access-89bwb noobaa-postgres-initdb-sh-volume noobaa-postgres-config-volume]: timed out waiting for the condition
  Warning  FailedMount         25s                    kubelet                  Unable to attach or mount volumes: unmounted volumes=[db], unattached volumes=[noobaa-postgres-initdb-sh-volume noobaa-postgres-config-volume db kube-api-access-89bwb]: timed out waiting for the condition
  Warning  FailedAttachVolume  16s (x10 over 4m31s)   attachdetach-controller  AttachVolume.Attach failed for volume "pvc-3e03cdb0-a374-4aed-bc3f-6e6f9ba74bca" : rpc error: code = Internal desc = ControllerPublishVolume : Error in getting filesystem Name for filesystem ID of 0D790B0A:61B0F1B9. Error [Get "https://ibm-spectrum-scale-gui.ibm-spectrum-scale:443/scalemgmt/v2/filesystems?filter=uuid=0D790B0A:61B0F1B9": context deadline exceeded (Client.Timeout exceeded while awaiting headers)]
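For anyone triaging a similar Multi-Attach situation, a minimal sketch of how this state can be inspected (assuming the openshift-storage namespace shown in the status output above; the grep pattern is simply the PV name from the events):

# Show the scheduling/attach events for the DB pod (as captured above)
oc describe pod noobaa-db-pg-0 -n openshift-storage

# List VolumeAttachment objects to see which node the PV is still attached to;
# after the kubelet stop, the attachment typically still points at the old node (worker0 here)
oc get volumeattachments | grep pvc-3e03cdb0-a374-4aed-bc3f-6e6f9ba74bca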


This is a problem as I see it. Is there a way to get this resolved?

The knock-on effect for the HPO team is that while the database pod is stuck in the Init state, the HPO admin cannot create any new accounts, exports, etc.

Temporary workaround: on worker0, restarted the kubelet service that had been stopped earlier, and the noobaa-db-pg pod then moved to worker2 without any problem (see the sketch below). I understand that the pod movement is tied to the kubelet service.
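A rough outline of that workaround, assuming the same SSH/console access to the worker as in Step 2:

# On worker0, bring the kubelet back up (reverses the stop from Step 2)
sudo systemctl start kubelet

# Watch the DB pod finish detaching from worker0 and come up Running on worker2
oc get pods -n openshift-storage -o wide -w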

Could you take a look at this defect and provide your thoughts/comments?

  



@rkomandu rkomandu added the NS-FS label Jan 12, 2022
@jeniawhite jeniawhite changed the title noobaa-db-pg pod doesn't migrate when the node is Tainted with NoExecute , says PVC can't be moved noobaa-db-pg pod doesn't migrate when the node has Kubelet service stopped, says PVC can't be moved Jan 13, 2022
@dannyzaken
Contributor

@baum can you take a look?

@baum
Contributor

baum commented Jan 18, 2022

@rkomandu thank you for this report!

Looks like a CSI issue, let me explain why.

According to the provided error: Multi-Attach error for volume "pvc-3e03cdb0-a374-4aed-bc3f-6e6f9ba74bca" Volume is already exclusively attached to one node and can't be attached to another

Usually, a Multi-Attach error upon k8s node failure indicates an issue with the volume provisioning, i.e. the CSI driver. The storage provisioner is expected to react to the node failure and detach the volume from the failing node, since the pod using the volume was deleted.

Another clue is the following error: AttachVolume.Attach failed for volume "pvc-3e03cdb0-a374-4aed-bc3f-6e6f9ba74bca" : rpc error: code = Internal desc = ControllerPublishVolume : Error in getting filesystem Name for filesystem ID of 0D790B0A:61B0F1B9. Error [Get "https://ibm-spectrum-scale-gui.ibm-spectrum-scale:443/scalemgmt/v2/filesystems?filter=uuid=0D790B0A:61B0F1B9": context deadline exceeded (Client.Timeout exceeded while awaiting headers)]

This is an indication that the kube-controller-manager (attachdetach-controller) fails to talk to the ibm-spectrum-scale CSI driver.
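As a rough way to confirm that theory (a sketch only; the GUI service name and namespace are taken from the error URL above, while the CSI driver namespace is an assumption that may differ per install):

# The error above is a timeout against the Spectrum Scale GUI REST endpoint,
# so first check that the GUI service exists and is reachable inside the cluster
oc get svc -n ibm-spectrum-scale ibm-spectrum-scale-gui

# Check the health of the Spectrum Scale CSI driver pods (namespace is an assumption)
oc get pods -n ibm-spectrum-scale-csi-driver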

Could you get input/feedback about this issue from the ibm-spectrum-scale CSI driver team?

Hope this helps, let me know if you need any additional info.

Best regards.

@rkomandu
Collaborator Author

Thank you @dannyzaken. For now I have opened the following bug against the CSI driver: IBM/ibm-spectrum-scale-csi#563

Let's keep this one open until we reach a conclusion.

@rkomandu
Collaborator Author

rkomandu commented Feb 1, 2022

@dannyzaken @baum

Basic question here: when the node currently running noobaa-db-pg-0 is brought down, noobaa-db-pg-0 gets moved to another worker node and reaches the Running state after a delay of about 6 minutes. However, after that we can't create:

-- new users
-- new buckets (using s3 mb)

Step 1: 

NAME                                               READY   STATUS    RESTARTS      AGE     IP              NODE                                  NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running   0             20h     10.254.23.179   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-db-pg-0                                     1/1     Running   0             31m     10.254.12.12    worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
    
Step 2: Brought down worker1, where noobaa-db-pg-0 is running; oc get nodes now shows it as NotReady:

worker0.rkomandu-ta.cp.fyre.ibm.com   Ready      worker   53d   v1.22.0-rc.0+a44d0f0
worker1.rkomandu-ta.cp.fyre.ibm.com   NotReady   worker   53d   v1.22.0-rc.0+a44d0f0
worker2.rkomandu-ta.cp.fyre.ibm.com   Ready      worker   53d   v1.22.0-rc.0+a44d0f0

Step 3: noobaa-db-pg-0 moved from worker1 to worker2
noobaa-db-pg-0                                     0/1     Init:0/2   0             15s     <none>          worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>

Step 4: It is still trying to initialize
noobaa-db-pg-0                                     0/1     Init:0/2   0             3m56s   <none>          worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>

Step 5: After about 6 minutes, it got into the Running state
noobaa-db-pg-0                                     1/1     Running    0             91m     10.254.23.217   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>

Step 6: The noobaa api call to list_accounts just hangs

noobaa api account_api list_accounts {}
INFO[0000] ✅ Exists: NooBaa "noobaa"
INFO[0000] ✅ Exists: Service "noobaa-mgmt"
INFO[0000] ✅ Exists: Secret "noobaa-operator"
INFO[0000] ✅ Exists: Secret "noobaa-admin"
INFO[0000] ✈️  RPC: account.list_accounts() Request: map[]
WARN[0000] RPC: GetConnection creating connection to wss://localhost:42325/rpc/ 0xc000a996d0
INFO[0000] RPC: Connecting websocket (0xc000a996d0) &{RPC:0xc0004bd130 Address:wss://localhost:42325/rpc/ State:init WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
INFO[0000] RPC: Connected websocket (0xc000a996d0) &{RPC:0xc0004bd130 Address:wss://localhost:42325/rpc/ State:init WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
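When the CLI hangs like this, a minimal first check (a sketch only; nothing here is NooBaa-specific beyond the pod names and namespace already shown above) is to look at the core pod's recent logs and overall pod health:

# Recent core logs - look for repeated DB reconnect errors after the db pod moved
oc logs noobaa-core-0 -n openshift-storage --tail=200

# Confirm all NooBaa pods report Running/Ready
oc get pods -n openshift-storage -o wide | grep noobaa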


This is a bigger problem: during any failover testing, new users can't be created and no new buckets can be created either.

Could you comment on this?

@rkomandu
Collaborator Author

rkomandu commented Feb 1, 2022

@baum FYA

@baum
Contributor

baum commented Feb 2, 2022

@rkomandu noobaa-db-pg-0 Init:0/2 status indicates:

  • The PV was attached to the new node and the pod was scheduled
  • Init container number 0 did not complete; stuck?

Init container number 0 changes the ownership of the files on the volume and usually does real work only on the initial run, since on subsequent runs the file ownership should already be set.

Could you please provide the output of kubectl describe pod noobaa-db-pg-0 and the logs from kubectl logs noobaa-db-pg-0 -c init? (A copy-paste-ready form is sketched below.)
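The same commands with the namespace from the status output above filled in (a sketch; run as-is only if your namespace matches):

# Pod description, including init container state and events
kubectl describe pod noobaa-db-pg-0 -n openshift-storage

# Logs of the first init container (named "init" per the request above)
kubectl logs noobaa-db-pg-0 -c init -n openshift-storage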

@rkomandu
Collaborator Author

rkomandu commented Feb 2, 2022

Had a live debug session with @baum and collected the oc adm must-gather logs and a NooBaa db dump from the system. Attaching for review:

noobaa_db_dump_02feb.gz
must-gather.local-noobaa-db-pg-0.tar.gz

@rkomandu
Collaborator Author

rkomandu commented Feb 2, 2022

Will keep the system as-is until Alex reviews the logs and decides whether to look at the system again. This is high priority, since no new users or new accounts can be created.

@baum
Contributor

baum commented Feb 2, 2022

hey Ravi, thank you for the logs!
Let's discuss on chat tomorrow.

@rkomandu rkomandu changed the title noobaa-db-pg pod doesn't migrate when the node has Kubelet service stopped, says PVC can't be moved noobaa-db-pg pod when migrated, doesn't allow the new users or new buckets to be created Feb 3, 2022
@rkomandu
Collaborator Author

rkomandu commented Feb 3, 2022

Changed the subject now, since the earlier defect has been moved to the CSI team and is tracked in IBM/ibm-spectrum-scale-csi#563.

Now this issue is about noobaa-db-pg-0 after it migrated to another node (because the node it was on went down): once it moved, no new users or new buckets can be created. Even the noobaa api calls hang, as mentioned in the update above, #6853 (comment).

@rkomandu
Collaborator Author

rkomandu commented Feb 3, 2022

We have restarted the noobaa-core pod:

oc delete pod noobaa-core-0
pod "noobaa-core-0" deleted

NAME                                               READY   STATUS    RESTARTS         AGE     IP              NODE                                  NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running   0                10m     10.254.14.158   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none> --> **This has been restarted**
noobaa-db-pg-0                                     1/1     Running   0                2d      10.254.23.217   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-default-backing-store-noobaa-pod-77176233   1/1     Running   0                2d6h    10.254.18.15    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-5cdc86865c-9j654                   1/1     Running   0                9m34s   10.254.23.237   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-5cdc86865c-cfqls                   1/1     Running   0                9m44s   10.254.14.159   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-5cdc86865c-mnd4m                   1/1     Running   0                9m23s   10.254.14.161   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-operator-54877b7dc9-zjsvl                   1/1     Running   0                2d2h    10.254.18.86    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
ocs-metrics-exporter-7955bfc785-cn2zl              1/1     Running   0                2d2h    10.254.18.84    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
ocs-operator-57d785c8c7-bqpfl                      1/1     Running   12 (4h32m ago)   2d2h    10.254.18.90    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
odf-console-756c9c8bc7-4jsfl                       1/1     Running   0                2d2h    10.254.18.88    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
odf-operator-controller-manager-89746b599-z64f6    2/2     Running   12 (4h32m ago)   2d2h    10.254.18.87    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
rook-ceph-operator-74864f7c6f-rlf6c                1/1     Running   0                2d2h    10.254.18.82    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>


-- noobaa api calls started working


-- Able to create new buckets 

 s3u5302 mb s3://newbucket-alex-03feb2022
urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 's3-openshift-storage.apps.rkomandu-ta.cp.fyre.ibm.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
make_bucket: newbucket-alex-03feb2022
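So the sequence that unblocked things in this run boils down to the following (a recap of the steps above, not a recommended procedure; the s3 alias and bucket name are the ones used earlier in this comment):

# Delete (restart) the core pod; the StatefulSet recreates it
oc delete pod noobaa-core-0 -n openshift-storage

# Verify the CLI responds again
noobaa api account_api list_accounts {}

# Verify bucket creation works again
s3u5302 mb s3://newbucket-alex-03feb2022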

@rkomandu
Collaborator Author

rkomandu commented Feb 3, 2022

Investigation needs to be done with the provided must-gather (MG) logs into what happened to noobaa-core when the node running the noobaa-db pod was shut down and then restarted after 30 minutes.

@nimrod-becker @jeniawhite

@dannyzaken
Contributor

@rkomandu can you please open a new issue rather than changing the original one? I think the original issue can be closed, since there is a CSI bug causing it.

@nimrod-becker
Contributor

+1, please do

@rkomandu
Collaborator Author

rkomandu commented Feb 3, 2022

@nimrod-becker @dannyzaken
Opened #6869 for the new issue.

@rkomandu rkomandu changed the title noobaa-db-pg pod when migrated, doesn't allow the new users or new buckets to be created noobaa-db-pg pod doesn't migrate when the node has Kubelet service stopped, says PVC can't be moved Feb 3, 2022
@rkomandu
Collaborator Author

rkomandu commented Feb 3, 2022

Reset the defect title back to the original for now.

@rkomandu
Collaborator Author

rkomandu commented Feb 3, 2022

The CSI team is investigating, so I don't want to close this for now; instead you can mark it as low priority until the CSI team fixes it, rather than losing the entire defect in case there is a comeback to the NooBaa team from them (which shouldn't be the case, though).

@nimrod-becker
Contributor

We can reference this bug in the CSI team bug.

History is not lost on a closed bug. Since it's not a NooBaa bug, I would rather close it.

@rkomandu
Collaborator Author

rkomandu commented Feb 7, 2022

OK @nimrod-becker. Based on the discussion with the CSI team and your post above, closing this defect for now.

Closed, as it is now in the CSI team's court.

@rkomandu rkomandu closed this as completed Feb 7, 2022