PostgreSQL server startup should not time out just because things are slow #1510

Open · jclulow opened this issue Aug 14, 2024 · 0 comments
jclulow commented Aug 14, 2024

I have a relatively busy host system that has a few OmniOS VMs. When the host is booting, sometimes the guests (which are a little oversubscribed) compete for CPU and disk I/O resources as they're getting out of bed. The PostgreSQL server service frequently fails to start under these conditions, because:

  • the start method invokes pg_ctl with the -w (--wait) flag, so it waits for the server to be ready to accept SQL connections; that can take a long time if the server has to replay the WAL and there was a lot of activity since the last checkpoint
  • the SMF start method has a timeout of only 60 seconds, which is far too short (see the sketch of the relevant exec_method after this list)
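
For reference, the delivered start method presumably boils down to an exec_method along these lines (a sketch reconstructed from the method log below and the 60 second timeout; the actual manifest may differ in detail):

<exec_method
    type='method'
    name='start'
    exec='/opt/ooce/pgsql-15/bin/pg_ctl -D /var/opt/ooce/pgsql/pgsql-15 -w start'
    timeout_seconds='60' />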

Sadly, once we hit the timeout we kill the pg_ctl process, but that apparently doesn't necessarily result in the postgres process being killed immediately. This is probably compounded by a long-standing, as-yet-unfixed SMF bug (13091: process contract escaped SMF), in which the restarter does not always completely clean up the entire contract of a failed method before proceeding with further actions, such as kicking off another instance of the start method. The upshot is that a single timeout failure like this can land the service in the maintenance state fairly reliably, e.g.,

[ Aug 14 01:21:34 Executing start method ("/opt/ooce/pgsql-15/bin/pg_ctl -D /var/opt/ooce/pgsql/pgsql-15 -w start"). ]
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start...........................2024-08-14 01:21:59.831 UTC [406] LOG:  starting PostgreSQL 15.7 on x86_64-pc-solaris2.11, compiled by gcc (OmniOS 151046/12.2.0-il-0) 12.2.0, 64-bit
...............................2024-08-14 01:22:30.844 UTC [406] LOG:  listening on IPv6 address "::1", port 5432
2024-08-14 01:22:30.844 UTC [406] LOG:  listening on IPv4 address "127.0.0.1", port 5432
....[ Aug 14 01:22:34 Method or service exit timed out.  Killing contract 59. ]
[ Aug 14 01:22:34 Method "start" failed due to signal KILL. ]
[ Aug 14 01:22:34 Executing start method ("/opt/ooce/pgsql-15/bin/pg_ctl -D /var/opt/ooce/pgsql/pgsql-15 -w start"). ]
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2024-08-14 01:22:34.625 UTC [413] FATAL:  lock file "postmaster.pid" already exists
2024-08-14 01:22:34.625 UTC [413] HINT:  Is another postmaster (PID 406) running in data directory "/var/opt/ooce/pgsql/pgsql-15"?
 stopped waiting
pg_ctl: could not start server
Examine the log output.
[ Aug 14 01:22:34 Method "start" exited with status 1. ]
[ Aug 14 01:22:34 Executing start method ("/opt/ooce/pgsql-15/bin/pg_ctl -D /var/opt/ooce/pgsql/pgsql-15 -w start"). ]
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2024-08-14 01:22:34.836 UTC [418] FATAL:  lock file "postmaster.pid" already exists
2024-08-14 01:22:34.836 UTC [418] HINT:  Is another postmaster (PID 406) running in data directory "/var/opt/ooce/pgsql/pgsql-15"?
 stopped waiting
pg_ctl: could not start server
Examine the log output.
[ Aug 14 01:22:34 Method "start" exited with status 1. ]

I think it's important to note that PostgreSQL can, by design, take a totally arbitrary and often surprisingly long time between starting the database and being ready to accept SQL connections due to WAL recovery. I don't believe it is appropriate for the SMF method to wait for this; we should be using pg_ctl --no-wait to start the database. We should also increase the timeout to something more like 300 or even 600 seconds. If the database gets going and then something catastrophic occurs, it will almost certainly exit and the empty contract will cause SMF to note a fault and try to restart it anyway.
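
As a concrete sketch of the proposed change (the FMRI used below, svc:/ooce/pgsql-15:default, is a guess at the service name and may need adjusting to match the actual manifest), it could be trialled on a running system with svccfg before fixing the delivered manifest:

# drop -w/--wait so the method returns as soon as the postmaster is launched
svccfg -s svc:/ooce/pgsql-15:default setprop start/exec = astring: \
    '"/opt/ooce/pgsql-15/bin/pg_ctl -D /var/opt/ooce/pgsql/pgsql-15 --no-wait start"'
# give the method far more headroom, e.g. 600 seconds
svccfg -s svc:/ooce/pgsql-15:default setprop start/timeout_seconds = count: 600
svcadm refresh svc:/ooce/pgsql-15:default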

Monitoring the active/healthy state of a PostgreSQL instance (which may, for instance, be part of a cluster of instances anyway) has to be something that higher level site-specific software does outside of the context of the lower-level process supervision that SMF provides.
