noobaa-db-pg pod doesn't migrate when the node has Kubelet service stopped, says PVC can't be moved #6853
@baum can you take a look?
@rkomandu thank you for this report! Looks like a CSI issue, let me explain why. According to the provided error:

Multi-Attach error for volume "pvc-3e03cdb0-a374-4aed-bc3f-6e6f9ba74bca" Volume is already exclusively attached to one node and can't be attached to another

Usually, a Multi-Attach error upon k8s node failure indicates an issue with volume provisioning, i.e. the CSI driver. The storage provisioner is expected to react to the node failure and detach the volume from the failing node, since the pod using the volume was deleted. Another clue is the following error:

AttachVolume.Attach failed for volume "pvc-3e03cdb0-a374-4aed-bc3f-6e6f9ba74bca" : rpc error: code = Internal desc = ControllerPublishVolume : Error in getting filesystem Name for filesystem ID of 0D790B0A:61B0F1B9. Error [Get "https://ibm-spectrum-scale-gui.ibm-spectrum-scale:443/scalemgmt/v2/filesystems?filter=uuid=0D790B0A:61B0F1B9": context deadline exceeded (Client.Timeout exceeded while awaiting headers)]

This is an indication that the kube-controller-manager (attachdetach-controller) fails to talk to the ibm-spectrum-scale CSI driver. Could you get input/feedback about this issue from the ibm-spectrum-scale CSI driver team? Hope this helps, let me know if you need any additional info. Best regards.
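(For reference, a minimal diagnostic sketch, not from the original thread, that could be used to confirm the stale attachment described above; the CSI attacher pod name and namespace below are placeholders, not values from this cluster.)

```
# List VolumeAttachment objects to see which node the PV is still attached to
oc get volumeattachment -o wide | grep pvc-3e03cdb0-a374-4aed-bc3f-6e6f9ba74bca

# Check events on the rescheduled pod for the Multi-Attach error
oc describe pod noobaa-db-pg-0 -n openshift-storage

# Inspect the CSI external-attacher sidecar logs (pod name and namespace are placeholders)
oc logs <ibm-spectrum-scale-csi-attacher-pod> -n <ibm-spectrum-scale-csi-namespace>
```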
Thank you @dannyzaken. For now I have opened the following bug against CSI: IBM/ibm-spectrum-scale-csi#563. Let me keep this one open until we reach a conclusion.
A basic question here: when the node currently running noobaa-db-pg-0 is brought down, noobaa-db-pg-0 is moved to another worker node and reaches Running state after a delay of around 6 minutes. However, after that we can't create new users.
This is a bigger problem for any failover testing: no new users can be created, and no new buckets can be created either. Could you comment on this? (See the sketch below for the kind of commands used to check this.)
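(A minimal sketch of how the failure could be reproduced, assuming the noobaa CLI account/bucket subcommands available in this 5.9 release; the account and bucket names are examples only.)

```
# Try to create a new account after the db pod migration (name is an example)
noobaa account create test-account-1 -n openshift-storage

# Try to create a new bucket (name is an example)
noobaa bucket create test-bucket-1 -n openshift-storage
```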
@baum FYA
@rkomandu the noobaa-db-pg-0 Init:0/2 status indicates:
Init container number 0 is changing ownership of the files on the volume. It usually runs only on the initial run, since ownership of the files should already be set on subsequent runs. Could you please provide:
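(For context, a hedged sketch of how the init container status and logs could be inspected; the init container name is a placeholder, check the pod spec for the real one.)

```
# Show the init container names and statuses for the db pod
oc get pod noobaa-db-pg-0 -n openshift-storage -o jsonpath='{.status.initContainerStatuses[*].name}{"\n"}'

# Fetch the logs of the stuck init container (replace <init-container-name> with the name from above)
oc logs noobaa-db-pg-0 -n openshift-storage -c <init-container-name>
```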
Had a live debug session with @baum and collected the oc adm logs and a noobaa db dump from the system. Attaching for review: noobaa_db_dump_02feb.gz
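(A rough sketch of the kind of collection commands involved, assuming the standard oc adm must-gather flow and a pg_dump taken inside the db pod; the exact flags, image, and database name "nbcore" are assumptions and may differ from what was actually run.)

```
# Cluster-wide diagnostics (assumed flow)
oc adm must-gather --dest-dir=./mg-logs

# Postgres dump taken inside the noobaa db pod ("nbcore" database name is an assumption)
oc exec -n openshift-storage noobaa-db-pg-0 -- pg_dump nbcore | gzip > noobaa_db_dump_02feb.gz
```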
Will keep the system as-is until Alex reviews the logs and wants to take another look at it. This is high priority, since no new users or new accounts can be created.
hey Ravi, thank you for the logs!
Changed the subject now, since the earlier defect has been moved to the CSI team and is tracked in IBM/ibm-spectrum-scale-csi#563. This issue is now about noobaa-db-pg-0: after it is migrated to another node (because the current node was down), no new users or new buckets can be created. Even the noobaa API calls hang, as mentioned in the update above: #6853 (comment)
We have restarted noobaa-core with oc delete pod noobaa-core-0.
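(A minimal sketch of that restart, assuming the openshift-storage namespace used elsewhere in this issue.)

```
# Delete the core pod; its controller recreates it
oc delete pod noobaa-core-0 -n openshift-storage

# Watch it come back up
oc get pods -n openshift-storage -w | grep noobaa-core
```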
Investigation needs to be done with the provided MG (must-gather) logs to determine what happened to noobaa-core when the node running the noobaa-db pod was shut down and then restarted after 30 minutes.
@rkomandu can you please open a new issue and not change the original issue? I think that the original issue can be closed, as there is a CSI bug causing it.
+1, please do
@nimrod-becker @dannyzaken
Reset the defect back as-is for now.
The CSI team is investigating, so I don't want to close this yet. Rather, you can mark it as low priority until the CSI team fixes it, instead of losing the entire defect if there is a comeback to the NooBaa team from them (which shouldn't be the case, though).
We can reference this bug in the CSI team's bug. History is not lost on a closed bug. Since it's not a NooBaa bug, I would rather close it.
OK @nimrod-becker. Based on the discussion with the CSI team and your post above, closing this defect for now. Closed, as it is now in the CSI team's court.
Environment info
NooBaa version is the RC code of ODF 4.9.0
noobaa status
INFO[0000] CLI version: 5.9.0
INFO[0000] noobaa-image: quay.io/rhceph-dev/mcg-core@sha256:6ce2ddee7aff6a0e768fce523a77c998e1e48e25d227f93843d195d65ebb81b9
INFO[0000] operator-image: quay.io/rhceph-dev/mcg-operator@sha256:cc293c7fe0fdfe3812f9d1af30b6f9c59e97d00c4727c4463a5b9d3429f4278e
INFO[0000] noobaa-db-image: registry.redhat.io/rhel8/postgresql-12@sha256:b3e5b7bc6acd6422f928242d026171bcbed40ab644a2524c84e8ccb4b1ac48ff
INFO[0000] Namespace: openshift-storage
oc version
Client Version: 4.9.5
Server Version: 4.9.5
Kubernetes Version: v1.22.0-rc.0+a44d0f0
Actual behavior
Expected behavior
Steps to reproduce
Configured MetalLB on the cluster (which shouldn't matter for this problem description). The NooBaa core/db pods and 3 endpoint pods are running on the respective worker nodes as shown below.
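(The original pod placement output is not included here. As a hedged illustration of the reproduction step named in the issue title, stopping the kubelet on the node hosting noobaa-db-pg-0, assuming an OpenShift worker node; <worker-node> is a placeholder.)

```
# Find the node currently hosting the db pod
oc get pod noobaa-db-pg-0 -n openshift-storage -o wide

# Stop the kubelet on that node (node name is a placeholder)
oc debug node/<worker-node> -- chroot /host systemctl stop kubelet

# Watch the pod get rescheduled onto another worker
oc get pods -n openshift-storage -o wide -w
```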