Redeployment unable to startup again #166
Comments
Hello, I'm facing the same problem here: the container is unable to start, with the same output detailed above. EDIT: I double-checked, and in my case the output is: …
Thanks for the report. Interesting. @pedro-dlfa, so you manually dropped the pid file, and immediately after that the container again refused to start (because of the pid file)? Smells like …
Hello, I am facing the same issue as @pedro-dlfa. I tried to delete the file manually and redeploy the pod, but with no success. My workaround is to recreate the pod. I am using PostgreSQL 9.5.
If you are affected by this, can you confirm that the deployment strategy is …?
The problem with the … Please use the …
I've also just run into this issue. Is there a way to make this work with the "Rolling" strategy, to have zero-downtime upgrades?
Not with this trivial layout. This problem is equivalent to the non-container scenario where you do …

Btw., PostgreSQL server has a guard against the "multiple servers writing to the same data directory" situation, but unfortunately in the container scenario the server has a deterministic PID number (PID=1). So a concurrent PostgreSQL server (in a different container) checks the pid/lock file, compares the PID with its own PID, and assumes "I'm PID=1, so the PID file is some leftover from a previous run". It then removes the PID file and continues with data directory modifications. This has disaster potential.

Our templates only support the Recreate strategy. The fact that Rolling "mostly" works is a matter of luck, in that the old server is not under heavy load. That said, the zero-downtime problem needs to be solved at a higher logical layer.
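The PID comparison described above can be sketched as a small shell emulation. This is a hypothetical simplification (the function name `is_stale_pidfile` is illustrative, not PostgreSQL code; the real postmaster also checks shared memory and other details), but it shows why the guard is defeated when every container runs its server as PID 1:

```shell
#!/bin/sh
# Simplified emulation of a "is the pid file stale?" check.
# Models only the PID comparison discussed above.
is_stale_pidfile() {
    pidfile=$1   # path to a postmaster.pid-like file
    my_pid=$2    # PID of the server performing the check
    old_pid=$(head -n1 "$pidfile")
    if [ "$old_pid" = "$my_pid" ]; then
        # In a container every postmaster is PID 1, so a lock file
        # written by a *different* container looks like our own
        # leftover and is wrongly considered stale.
        echo stale
    elif kill -0 "$old_pid" 2>/dev/null; then
        echo in-use
    else
        echo stale
    fi
}

# Two containers, both running their server as PID 1:
echo 1 > /tmp/postmaster.pid              # written by container A
is_stale_pidfile /tmp/postmaster.pid 1    # container B's check prints "stale"
```

With a shared host PID namespace the two servers would have distinct PIDs and the `in-use` branch would protect the data directory; inside separate containers the first branch always wins.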
Ok, that makes sense, thanks. If I wanted to solve this at a higher logical layer, how would I go about it? Do you have any good pointers?
At this point, you'd have to start thinking about pgpool or a similar tool (I'd prefer to have a separate issue for such RFEs, to avoid going off-topic in this bug report).
This issue seems to be caused by a concurrent run of multiple postgresql …

I've heard an idea that it could also happen if OpenShift happens to be …

Anyways, I'd like to have opinions on how to handle this situation properly. We might delegate this to OpenShift operators, but I suspect that …
Hi, I'm facing the same issue, using the Recreate strategy. Deleting postmaster.pid also did not help; I got the same error at the next pod startup.
Had this problem after there was an issue with the underlying node that caused it to terminate very ungracefully. A new pod got spun up (as it is supposed to) on a new node, but the container got stuck in a CrashLoopBackOff with this exact error message. Surely there needs to be an automated way to get around this problem? Especially because only a single replica is supported, there's not a lot of wiggle room for high availability if the container can't start.
This is an old issue, but I just faced the same with the Recreate strategy. The following article explains how to reanimate the failing pod, and it helped me. We use only one database pod, so I believe it may not solve the high-availability issue, but at least the database will work with one pod. Maybe it will be useful for somebody.
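As a hedged sketch of the manual recovery several commenters describe, the pid-file removal can at least be guarded so it only happens when the recorded PID is demonstrably dead. This wrapper is hypothetical (it is not part of the sclorg image; the `PGDATA` default just mirrors the path mentioned in this thread), and note that inside a container the PID=1 ambiguity discussed above still applies:

```shell
#!/bin/sh
# Hypothetical pre-start cleanup: drop a leftover postmaster.pid
# only when the PID it records no longer belongs to a live process.
PGDATA=${PGDATA:-/var/lib/pgsql/data/userdata}

cleanup_stale_pidfile() {
    pidfile="$1/postmaster.pid"
    [ -f "$pidfile" ] || return 0
    old_pid=$(head -n1 "$pidfile")
    if kill -0 "$old_pid" 2>/dev/null; then
        # A live process owns this PID; removing the file here
        # is exactly the disaster scenario described above.
        echo "refusing to remove $pidfile: PID $old_pid is alive" >&2
        return 1
    fi
    rm -f "$pidfile"
    echo "removed stale $pidfile (PID $old_pid is gone)"
}

# Intended usage before starting the server, e.g.:
#   cleanup_stale_pidfile "$PGDATA" && exec postgres ...
```

This is only a mitigation for the single-pod case; it does not make concurrent pods safe, which is why the Recreate strategy remains the supported setup.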
We've also had an offline discussion with Daniel Messer from RH, who has hit this problem in his team as well. After changing the strategy to Recreate, the problem seems to disappear, but a good point was raised: we should start testing the crash scenario in the CI tests (run the OpenShift template, then kill the pod or the postgres daemon directly). This seems like a good addition to our test coverage.
@drobus We changed the DeploymentConfig to a Deployment here: https://github.com/sclorg/postgresql-container/blob/master/examples/postgresql-persistent-template.json, and the strategy there is 'Recreate'. So I am closing this issue. In case it is not yet fixed, feel free to re-open it.
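For reference, the strategy stanza in that template boils down to the following (a minimal sketch showing only the relevant fields, not the full template):

```json
{
  "spec": {
    "strategy": {
      "type": "Recreate"
    }
  }
}
```

With `Recreate`, the old pod is fully terminated before the new one starts, so two postgres containers never write to the same data volume concurrently.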
Updated the resource limits for a postgresql-persistent 9.5 deployment
It seems the first pod did not shut down cleanly and left the PID in /var/lib/pgsql/data/userdata/postmaster.pid on the volume, thus preventing the container from starting up automatically without manual intervention.
Perhaps an edge case, as this is the first time seeing this across many other postgresql deployments.