
Failover of noobaa-db-pg0 works but fall back leaves pod in CrashLoopback #6953

Closed

MonicaLemay opened this issue May 4, 2022 · 36 comments

@MonicaLemay

Environment info

INFO[0000] CLI version: 5.9.0
INFO[0000] noobaa-image: noobaa/noobaa-core:nsfs_backport_5.9-20220331
INFO[0000] operator-image: quay.io/rhceph-dev/odf4-mcg-rhel8-operator@sha256:eb56f52778b71a04b7df62db4ea241c65f84e2b80ddcc9b588414d173d7e086c
INFO[0000] noobaa-db-image: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:be7212e938d1ef314a75aca070c28b6433cd0346704d0d3523c8ef403ff0c69e
OCP 4.9.29
ODF 4.9.6-1 w/ patch 3/31

Actual behavior

The node c83f1-dan3.ocp4.pokprv.stglabs.ibm.com was hosting the pods noobaa-db-pg-0 and noobaa-default-backing-store-noobaa-pod-6b3cc266.

I ssh'ed into node c83f1-dan3 and ran:

[core@c83f1-dan3 ~]$ date; sudo systemctl stop kubelet ; date
Wed May  4 17:33:25 UTC 2022
Wed May  4 17:33:25 UTC 2022
[core@c83f1-dan3 ~]$

All pods failed over to either dan2 or dan1 after about 7 minutes. The pod noobaa-db-pg-0 failed over to dan2 and noobaa-default-backing-store to dan1. They reach Running state, and S3, mmdas, and I/O continue to run.

The fall back failed for noobaa-db-pg-0.

[core@c83f1-dan3 ~]$ date; sudo systemctl start kubelet ; date
Wed May  4 17:47:23 UTC 2022
Wed May  4 17:47:31 UTC 2022

The noobaa-default-backing-store stays on dan1. The noobaa-db-pg-0 pod goes to CrashLoopBackOff.

NAME                                               READY   STATUS             RESTARTS        AGE     IP             NODE                                     NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running            0               4h33m   10.128.2.178   c83f1-dan2.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-db-pg-0                                     0/1     CrashLoopBackOff   18 (4m8s ago)   85m     10.128.2.185   c83f1-dan2.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-default-backing-store-noobaa-pod-6b3cc266   1/1     Running            4 (72m ago)     85m     10.128.5.242   c83f1-dan1.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-endpoint-6546dfb677-4r6gq                   1/1     Running            0               4h33m   10.128.2.177   c83f1-dan2.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-endpoint-6546dfb677-8n7qp                   1/1     Running            0               4h33m   10.128.5.208   c83f1-dan1.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-endpoint-6546dfb677-9np5x                   1/1     Running            0               71m     10.128.1.47    c83f1-dan3.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-endpoint-6546dfb677-hh4wm                   1/1     Running            1 (71m ago)     71m     10.128.1.46    c83f1-dan3.ocp4.pokprv.stglabs.ibm.com   <none>           <none>
noobaa-endpoint-6546dfb677-jxt6q                   1/1     Running            0               71m     10.128.1.48    c83f1-dan3.ocp4.pokprv.stglabs.ibm.com   <none>           <none>

Describe of the noobaa-db-pg-0 pod shows:

Type     Reason                  Age                   From                     Message
----     ------                  ----                  ----                     -------
Normal   Scheduled               24m                   default-scheduler        Successfully assigned openshift-storage/noobaa-db-pg-0 to c83f1-dan2.ocp4.pokprv.stglabs.ibm.com
Warning  FailedAttachVolume      24m                   attachdetach-controller  Multi-Attach error for volume "pvc-985f5aa7-967b-4bcc-9269-05eb17240b80" Volume is already exclusively attached to one node and can't be attached to another
Warning  FailedMount             22m                   kubelet                  Unable to attach or mount volumes: unmounted volumes=[db], unattached volumes=[noobaa-postgres-initdb-sh-volume noobaa-postgres-config-volume db kube-api-access-k6mpn]: timed out waiting for the condition
Warning  FailedMount             20m                   kubelet                  Unable to attach or mount volumes: unmounted volumes=[db], unattached volumes=[db kube-api-access-k6mpn noobaa-postgres-initdb-sh-volume noobaa-postgres-config-volume]: timed out waiting for the condition
Normal   SuccessfulAttachVolume  18m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-985f5aa7-967b-4bcc-9269-05eb17240b80"
Normal   Pulled                  18m                   kubelet                  Container image "noobaa/noobaa-core:nsfs_backport_5.9-20220331" already present on machine
Normal   AddedInterface          18m                   multus                   Add eth0 [10.128.2.185/23] from openshift-sdn
Normal   Created                 18m                   kubelet                  Created container init
Normal   Pulled                  18m                   kubelet                  Container image "quay.io/rhceph-dev/rhel8-postgresql-12@sha256:be7212e938d1ef314a75aca070c28b6433cd0346704d0d3523c8ef403ff0c69e" already present on machine
Normal   Started                 18m                   kubelet                  Started container init
Normal   Created                 18m                   kubelet                  Created container initialize-database
Normal   Started                 18m                   kubelet                  Started container initialize-database
Normal   Created                 9m31s (x4 over 18m)   kubelet                  Created container db
Normal   Started                 9m31s (x4 over 18m)   kubelet                  Started container db
Normal   Pulled                  7m23s (x6 over 18m)   kubelet                  Container image "quay.io/rhceph-dev/rhel8-postgresql-12@sha256:be7212e938d1ef314a75aca070c28b6433cd0346704d0d3523c8ef403ff0c69e" already present on machine
Warning  BackOff                 2m27s (x36 over 10m)  kubelet                  Back-off restarting failed container
[root@c83f1-infa ~]#

Logs on the failing pod:

oc logs noobaa-db-pg-0
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2022-05-04 18:04:22.856 UTC [23] LOG:  starting PostgreSQL 12.9 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4), 64-bit
2022-05-04 18:04:22.857 UTC [23] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-05-04 18:04:22.859 UTC [23] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-05-04 18:04:23.043 UTC [23] LOG:  redirecting log output to logging collector process
2022-05-04 18:04:23.043 UTC [23] HINT:  Future log output will appear in directory "log".
 done
server started
/var/run/postgresql:5432 - accepting connections
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
ERROR:  tuple concurrently updated
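
The pg_ctl warning above suggests a stale postmaster lock left behind in the data directory by the previous instance of the pod. A minimal diagnostic sketch, assuming the data directory path /var/lib/pgsql/data/userdata (taken from the preStop patch later in this thread) and a PVC named db-noobaa-db-pg-0, both of which may differ on a given deployment:

# Check for a stale postmaster.pid in the DB data directory (container is named "db";
# this only works while the db container is up between restarts).
oc -n openshift-storage exec noobaa-db-pg-0 -c db -- cat /var/lib/pgsql/data/userdata/postmaster.pid

# Check which PV backs the db volume claim (PVC name here is an assumption).
oc -n openshift-storage get pvc db-noobaa-db-pg-0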

It does not recover on its own. The only way to recover is to delete the s3 service. All accounts and exports are lost unless a backup was done prior to this event.

The NooBaa diagnose and must-gather outputs are in a Box note labeled with this defect number:
https://ibm.ent.box.com/folder/145794528783?s=uueh7fp424vxs2bt4ndrnvh7uusgu6tocd

Not sure if the defect that I am reporting is related in any way to sclorg/postgresql-container#166.

Expected behavior

  1. Fall back should occur.

Steps to reproduce

Listed above

More information - Screenshots / Logs / Other output

@MonicaLemay
Author

This will be a blocker to our testing because I am assigned failover/fallback testing.

@cxhong

cxhong commented May 4, 2022

Romy debugged this issue on another cluster and found a link related to this: sclorg/postgresql-container#166

@rkomandu
Collaborator

rkomandu commented May 5, 2022

@MonicaLemay, we shouldn't do kubelet stop-and-start types of testing. There are known defects in CSI which are not fixed for our MVP.

Please read the E2E discussion on this for reference --> IBM/ibm-spectrum-scale-csi#563
Also look at the last 10-15 comments of NooBaa defect 6870.

I will share with you a list of defects on this Failover/Failback which you can go through.

The Failover/Failback scenario has to be done via the node power on/off method only.
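
For reference, the node power off/on method is typically driven through the server BMC with ipmitool; a minimal sketch, with placeholder BMC address and credentials:

# Power the node off, confirm, then power it back on (placeholder BMC host/credentials).
ipmitool -I lanplus -H <bmc-host> -U <bmc-user> -P <bmc-password> chassis power off
ipmitool -I lanplus -H <bmc-host> -U <bmc-user> -P <bmc-password> chassis power status
ipmitool -I lanplus -H <bmc-host> -U <bmc-user> -P <bmc-password> chassis power on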

@MonicaLemay
Author

I was NOT able to recreate it with ipmitool power off. However, the same conditions did not exist, though that might not matter. In my test listed above, both noobaa-db-pg-0 and the noobaa-default-backing-store-noobaa-pod were hosted by the same node. I don't have that condition any more.

@rkomandu
Collaborator

rkomandu commented May 6, 2022

@MonicaLemay, I don't think the backing store will matter if the test is done via the ipmitool. Yes, it does matter if it is done with the kubelet stop method, as mentioned, because the CSI driver volume comes into play for the PVC to become Active. However, there is no real use for us w/r/t the backing store, as Nimrod said it is planned to be dropped in future ODF releases, per a discussion a few months back.

Also, if you want to get the pods on the same node, you could try deleting one of them a few times and they might land on the same node. This is what was tried to get the noobaa-core and noobaa-db pods on the same node for Failover/Failback testing.

@MonicaLemay
Author

If this is not recreatable when we do ipmitool power off then power on, AND is only recreatable with systemctl stop kubelet, then this might not be a sev1. Can someone weigh in on this so that we can better understand our priorities?

@baum
Contributor

baum commented May 8, 2022

@MonicaLemay, how are you?

Regarding the noobaa-db-pg-0 db container producing ERROR: tuple concurrently updated

According to noobaa-db-pg-0-db.log and noobaa-db-pg-0-db-previous.log, the start-up script warns about pg_ctl: another server might be running; trying to start server anyway.

To troubleshoot, could you please try the manual DB volume recovery procedure described here?

Could you let me know if you see the Postgres server starting normally, after scaling the DB stateful set replicas back to 1 at the end of that procedure?

noobaa-db-pg-0                                     0/1     Pending             0          0s
noobaa-db-pg-0                                     0/1     Init:0/2            0          0s
noobaa-db-pg-0                                     0/1     Init:1/2            0          1s
noobaa-db-pg-0                                     0/1     PodInitializing     0          4s
noobaa-db-pg-0                                     1/1     Running             0          5s
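
For reference, a minimal sketch of the scale-down/scale-up part of that recovery procedure (the linked procedure includes additional volume cleanup steps in between; the namespace and stateful set name are the ones used elsewhere in this issue):

# Scale the DB stateful set down so no pod holds the DB volume.
oc -n openshift-storage scale statefulset noobaa-db-pg --replicas=0
oc -n openshift-storage get pod noobaa-db-pg-0 -w    # wait for the pod to terminate

# ... perform the volume recovery steps from the linked procedure, then scale back up.
oc -n openshift-storage scale statefulset noobaa-db-pg --replicas=1
oc -n openshift-storage get pod noobaa-db-pg-0 -w    # expect Init -> PodInitializing -> Running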

@akmithal

Hi @MonicaLemay, please confirm whether this is a non-GA blocker. Thanks.

@MonicaLemay
Author

@akmithal This is not a GA blocker.

@rkomandu
Collaborator

@NeeshaPan,
Please provide the steps that were tried. If this is still a Postgres-related issue, the NooBaa team needs to analyze it.

@NeeshaPan

We followed the link below to get out of the CrashLoopBackOff state:
https://serverfault.com/questions/942743/postgres-crash-loop-caused-by-a-tuple-concurrently-updated-error

@NeeshaPan

After bringing the noobaa-db pod up using the above link, mmdas commands / REST APIs are not working as expected. We are hitting the error "something went wrong" when trying to run any of the mmdas commands.

@rkomandu
Collaborator

@NeeshaPan, when you try next time, document the steps relevant to the noobaa pods: on Fyre, bring the kubelet service down on the node running the noobaa-db pod, and update whether the same behavior is observed.

@NeeshaPan

I tried failover/failback of the noobaa-core and noobaa-db pods on the same node, and also failover/failback of only the noobaa-db pod, on a Fyre cluster. However, I am not able to reproduce this issue on the Fyre cluster.

@rkomandu
Collaborator

rkomandu commented Jul 26, 2022

@baum,
@NeeshaPan will try on the physical system and update with results, as per our discussion last week. The system has not been functioning as expected for a week or so.

@NeeshaPan

Will try the scenario on BM (bare metal) once it comes up, as it is a compact cluster.
FYI @rkomandu @romayalon

@NeeshaPan

NeeshaPan commented Aug 5, 2022

@romayalon I am able to reproduce this issue on BM. It is easily reproducible on BM systems.
Attached is a file that has the output of pods & logs and the procedure followed for FOFB (failover/failback).
OutputsCapturedDuringFOFB.txt

FYI @rkomandu

@romayalon
Contributor

@baum, please see the last comment.

@rkomandu
Collaborator

rkomandu commented Sep 6, 2022

This was discussed with @baum, and the pointer is towards access to the Postgres DB from two different nodes. We will need to discuss with the CSI team.

@NeeshaPan

NeeshaPan commented Sep 6, 2022

@MonicaLemay Could you please update the summary of the defect? We just need to replace "fall back" with "failback":
Replace
"Failover of noobaa-db-pg0 works but fall back leaves pod in CrashLoopback"
with
"Failover of noobaa-db-pg0 works but failback leaves pod in CrashLoopback"

FYI @rkomandu

@rkomandu
Collaborator

rkomandu commented Sep 7, 2022

@MonicaLemay, we just thought that "failback" is more appropriate than "fall back". Do change it only if you agree.

@rkomandu
Collaborator

rkomandu commented Sep 7, 2022

@baum, we had a discussion with Deepak (IBM CSI team); there are a few questions that we need clarity on in order to make progress on this behavior:

  1. Why was the noobaa db pod restarted when the kubelet on the hpo-app7 node was started back up?

  2. Why is the noobaa db pod in CrashLoopBackOff? You mentioned that it is because of two DB writers on different nodes. Other than the noobaa db pod, no other entity such as the CSI driver or Spectrum Scale writes anything to any volume. So the question is: when the kubelet service is down, is the old noobaa db pod on the hpo-app7 node, which should have been cleaned up by k8s, causing that write once the kubelet on hpo-app7 is started back up? I think we need a Kubernetes/kubelet expert to help here.

  3. I also noticed in one of the comments that this is reproducible only on a compact cluster, so we should check with an expert whether shutting down one master node causes any interference in the cleanup process.


@nimrod-becker, @baum
We need the NooBaa team to check with the RH team (K8s experts), if required, on the above questions about the kubelet service being down on a compact cluster.

Note: I earlier discussed the same with Shane Heroux; it is posted below for reference. Could Alex discuss with Shane and get the ball rolling?

Email reply from Shane, around 16 June 2022:
Hi Ravi,

How much impact and what side-effects are there if the "kubelet service" is stopped on a compact cluster?

At a high level, stopping the kubelet service on a compact cluster as you described has at least the same impact as taking a master out of the cluster with the addition of whatever workloads are scheduled.

Did any customer stop the kubelet service on a master node (traditional 3M+3W)? If yes, under what circumstances could the service be stopped?

I'm not sure how to answer the first part of this question, but the service is stopped and masters are rolled out and back into clusters (by way of a reboot) every time they update to a new minor version. Done properly, a cluster can tolerate a single master being removed at any one time, but you would only want to do this when strictly necessary. 

There are many resources describing the exact process, but you'd likely want a backup of etcd and cordon / drain the node in question first.

If you need a more specific / deeper answer, I will need more context around what you're trying to accomplish so I can direct the question better.

@baum
Contributor

baum commented Sep 8, 2022

Hello @rkomandu 🖖

  1. The DB pod is scheduled by the corresponding stateful set. Once a node becomes unavailable according to the K8S API server, the NooBaa operator HA controller explicitly deletes the pod on the failing node, causing the DB stateful set to reschedule the DB pod on a healthy node. The NooBaa operator is not involved in scheduling "when kubelet on hpo-app7 node was started back"; this is all normal stateful set behavior. Just a theory: restarting the kubelet would allow the API server to communicate again with the node previously considered unavailable, and a stateful set reconciliation takes place. It is worth considering that stopping the kubelet or explicitly deleting a pod does not actually stop the pod's container process, so once kubelet communication is restored, it is possible the stateful set discovers too many pods running ( > replicas ), causing reconciliation.

  2. The Postgres process crashes with "ERROR: tuple concurrently updated". This would happen if the DB persistent volume is accessed by several Postgres processes concurrently.

Considering the test performed: after the kubelet is stopped, the Postgres on that node continues to run, potentially writing to the persistent volume. Then a new Postgres is scheduled on a healthy node, also potentially writing to the same persistent volume.

Question: what happens to the persistent volume changes on the node where the kubelet was stopped, while the kubelet is not running? Are those changes "merged" with the changes on the healthy node? Or are those changes discarded once communication with the node is restored?
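
One way to look at this from the cluster side is to check where Kubernetes believes the volume is attached and whether the old Postgres container is still alive on the node whose kubelet was stopped; a hedged sketch (exact resource names will differ per cluster):

# Which node does Kubernetes think the PV is attached to?
oc get volumeattachment | grep pvc-985f5aa7-967b-4bcc-9269-05eb17240b80

# On the node whose kubelet was stopped (via SSH or a debug shell), check whether
# the old Postgres container process is still running.
sudo crictl ps | grep -i postgres
ps -ef | grep [p]ostgres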

@rkomandu
Collaborator

rkomandu commented Sep 9, 2022

@baum, thank you for your explanation. As mentioned by the CSI team, the volume has an RWO (ReadWriteOnce) policy, so it can't be attached from multiple nodes at a time.

Will ask Deepak from the CSI team to have a look at your post above and respond.

@deeghuge

Hi @rkomandu, as I mentioned earlier, k8s enforces RWO, which means there is a chance in such a scenario that the orphan pod on the node where the kubelet is being restarted can access data, since kubelet communication was down and cleanup did not happen. I think we need to understand the impact of shutting down/starting the kubelet on a master, the reconciliation process, and the possibility of the orphan pod causing data corruption. I think @baum is also hinting at the same.
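
A hedged mitigation sketch for the orphan-pod case: force-deleting the stale pod object lets the stateful set recreate it cleanly, but note that this only removes the API object; the container on the unreachable node keeps running until its kubelet comes back, which is exactly the data-corruption exposure described above.

# Remove the stale pod object so the stateful set can recreate it.
# This does NOT stop the container on the node whose kubelet is down.
oc -n openshift-storage delete pod noobaa-db-pg-0 --grace-period=0 --force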

@rkomandu
Collaborator

Hi @baum, as per the email communication with Shane from RH, has any progress been made to further investigate this behavior?

@rkomandu
Collaborator

@romayalon / @nimrod-becker
Please add the NS-FS label here.

@rkomandu
Collaborator

@baum

Please look at https://bugzilla.redhat.com/show_bug.cgi?id=2010702 (this is similar behavior to what we are discussing here on the noobaa db / PVC).

@NeeshaPan @romayalon FYI

@nimrod-becker
Contributor

The BZ cited was closed as not a bug (not the tracker, but the actual BZ it tracked...)

@rkomandu
Collaborator

Status: CLOSED → NEW
Resolution: NOTABUG → ---
CC: [email protected]
Keywords: Reopened

@guymguym
Member

guymguym commented Jun 15, 2023

I just handled a similar case.

Did we try to add a proper shutdown to Postgres?

https://stackoverflow.com/questions/75828221/how-to-safely-restart-postgresql-container-in-kubernetes

I tried to manually add the following patch to noobaa-db-pg, but I don't know yet if it helps.

$ oc patch sts/noobaa-db-pg --patch '
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90
      containers:
        - name: db
          lifecycle:
            preStop:
              exec:
                command: [ "/usr/bin/pg_ctl", "stop", "-D", "/var/lib/pgsql/data/userdata", "-w", "-t", "60", "-m", "fast" ]
'

statefulset.apps/noobaa-db-pg patched
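
A quick hedged verification sketch for the patch above (the jsonpath filter assumes the container is named "db", as in the patch):

# Confirm the grace period and preStop hook landed in the stateful set spec.
oc -n openshift-storage get sts noobaa-db-pg -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}{"\n"}'
oc -n openshift-storage get sts noobaa-db-pg -o jsonpath='{.spec.template.spec.containers[?(@.name=="db")].lifecycle.preStop.exec.command}{"\n"}'

# Restart the pod and watch that it terminates gracefully and returns to Running
# without entering CrashLoopBackOff.
oc -n openshift-storage delete pod noobaa-db-pg-0
oc -n openshift-storage get pod noobaa-db-pg-0 -w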

@rkomandu
Collaborator

Hi @guymguym, this Postgres behavior does seem to be the problem, if you have come across the above.

There were a few discussions with an RH manager (Shane/Shaun, I don't recall exactly) and Nimrod on this defect to understand whether we should allow the customer to perform a "kubelet service stop" or not. We couldn't reach consensus over email about the usability of the use case. I was pushing this since the DAS solution is a compact cluster environment. After multiple discussions/tryouts and explanations with Romy/Alex, it looked mostly like a Postgres-related issue.

If the customer performs this, then the defect has to be fixed IMO. Please advise.

@rkomandu
Collaborator

@romayalon @nimrod-becker
Any plans based on Guy's comments above? It got unassigned by Guy.

@guymguym
Member

Hi @rkomandu, apologies, I don't know the exact status. @dannyzaken - any plans?

I do hope the options to the statefulset that I showed above (the preStop hook and terminationGracePeriodSeconds) can be tested, and if they work better we can add them to the operator's statefulset-postgres-db.yaml.


github-actions bot commented Jul 3, 2024

This issue had no activity for too long - it will now be labeled stale. Update it to prevent it from getting closed.

@github-actions github-actions bot added the Stale label Jul 3, 2024
@romayalon
Contributor

Closing, HPO is discontinued.
