Standby resyncs with Primary Node at every Restart #858

Open

aviralsingh21 opened this issue Sep 5, 2024 · 4 comments

@aviralsingh21

I have a Docker Swarm HA architecture with 3 PostgreSQL nodes, 1 pgpool-II service, and various other services.
PostgreSQL is set up as an HA cluster using the Replication Manager (repmgr) tool: 1 primary node + 1 standby node + 1 witness node.

Docker Image Used: bitnami/postgresql-repmgr:16.3.0

Issue: The standby re-syncs with the primary node at every restart of the Docker services.
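
For reference, this is roughly how I reproduce and observe the behaviour (the service name persistence_pg-1 is from my stack, so adjust it for yours):

# Force a restart of the standby task and watch its logs for a fresh clone.
docker service update --force persistence_pg-1
docker service logs --follow persistence_pg-1 2>&1 | grep -E "Cloning data|Rejoining node"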

What I was planning to do was to perform a graceful shutdown of the PostgreSQL database and then stop the container. While shutting down the database on the primary node (node-1), the container exited as soon as the database was down; a new container then started with a standby role and began re-syncing with the new primary (node-2). I assumed this was normal. Since the container restarted after every shutdown attempt, I thought it would be better to first stop the repmgr daemon so the database would stay stopped, but that didn't help either.

I never found a reliable way to perform a graceful database shutdown before stopping the PostgreSQL Docker service. While looking for one, I discovered another issue: whenever I restart the PostgreSQL Docker service, the standby node (node-1) re-syncs (performs a full clone) from the primary node every single time.
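
For reference, this is roughly the shutdown sequence I attempted (the container name is a placeholder, and the repmgr.conf path and data directory are the Bitnami defaults as I understand them, so treat them as assumptions):

# Stop the repmgr daemon first, then shut PostgreSQL down cleanly, then stop the service.
docker exec -it <pg-0_container> repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf daemon stop
docker exec -it <pg-0_container> pg_ctl stop -m fast -D /bitnami/postgresql/data
docker service scale persistence_pg-0=0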

PostgreSQL Logs from Standby Node:

postgresql-repmgr 06:13:00.65 INFO  ==>
postgresql-repmgr 06:13:00.67 INFO  ==> Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 06:13:00.67 INFO  ==> Subscribe to project updates by watching https://github.com/bitnami/containers
postgresql-repmgr 06:13:00.67 INFO  ==> Submit issues and feature requests at https://github.com/bitnami/containers/issues
postgresql-repmgr 06:13:00.67 INFO  ==> Upgrade to Tanzu Application Catalog for production environments to access custom-configured and pre-packaged software components. Gain enhanced features, including Software Bill of Materials (SBOM), CVE scan result reports, and VEX documents. To learn more, visit https://bitnami.com/enterprise
postgresql-repmgr 06:13:00.67 INFO  ==>
postgresql-repmgr 06:13:00.69 INFO  ==> ** Starting PostgreSQL with Replication Manager setup **
postgresql-repmgr 06:13:00.72 INFO  ==> Validating settings in REPMGR_* env vars...
postgresql-repmgr 06:13:00.72 INFO  ==> Validating settings in POSTGRESQL_* env vars..
postgresql-repmgr 06:13:00.72 INFO  ==> Querying all partner nodes for common upstream node...
postgresql-repmgr 06:13:00.84 INFO  ==> Auto-detected primary node: 'pg-0:5432'
postgresql-repmgr 06:13:00.84 INFO  ==> Node configured as standby
postgresql-repmgr 06:13:00.85 INFO  ==> Preparing PostgreSQL configuration...
postgresql-repmgr 06:13:00.85 INFO  ==> postgresql.conf file not detected. Generating it...
postgresql-repmgr 06:13:00.90 INFO  ==> Preparing repmgr configuration...
postgresql-repmgr 06:13:00.91 INFO  ==> Initializing Repmgr...
postgresql-repmgr 06:13:00.91 INFO  ==> Waiting for primary node...
postgresql-repmgr 06:13:00.93 INFO  ==> Rejoining node...
postgresql-repmgr 06:13:00.93 INFO  ==> Cloning data from primary node...
postgresql-repmgr 06:20:14.33 INFO  ==> Initializing PostgreSQL database...
postgresql-repmgr 06:20:14.35 INFO  ==> Custom configuration /opt/bitnami/postgresql/conf/postgresql.conf detected
postgresql-repmgr 06:20:14.35 INFO  ==> Custom configuration /opt/bitnami/postgresql/conf/pg_hba.conf detected
postgresql-repmgr 06:20:14.39 INFO  ==> Deploying PostgreSQL with persisted data...
postgresql-repmgr 06:20:14.45 INFO  ==> Configuring replication parameters
postgresql-repmgr 06:20:14.48 INFO  ==> Configuring fsync
postgresql-repmgr 06:20:14.49 INFO  ==> Setting up streaming replication slave...
postgresql-repmgr 06:20:14.51 INFO  ==> Starting PostgreSQL in background...
postgresql-repmgr 06:20:17.49 INFO  ==> Unregistering standby node...
postgresql-repmgr 06:20:17.62 INFO  ==> Registering Standby node...
postgresql-repmgr 06:20:17.72 INFO  ==> Stopping PostgreSQL...
waiting for server to shut down.... done
server stopped
postgresql-repmgr 06:20:17.84 INFO  ==> ** PostgreSQL with Replication Manager setup finished! **

postgresql-repmgr 06:20:17.89 INFO  ==> Starting PostgreSQL in background...

I also compared these logs with the standby node of another, identical environment that is not facing this issue. The logs there are the same as above, except that the 'Rejoining node...' line does not appear.

Additional information:
I have already reviewed other relevant issues, like #52213 and #34986. I configured pg_rewind and enabled wal_log_hints, but the situation is still the same (the checks I ran are just below).
I also tested with the bitnami/postgresql-repmgr:12.4.0 Docker image; the same thing happens there as well.
I also deleted the volume, deployed the PostgreSQL service with a fresh volume, and restored the database again. This time I stopped the Docker service directly instead of stopping the database first, but I am still facing the same issue.
Database size used for testing: around 60 GB.
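
These are the checks I ran on each node to confirm the pg_rewind prerequisites (assuming superuser access via psql; the host is a placeholder):

# pg_rewind needs wal_log_hints=on (or data checksums) plus full_page_writes=on.
psql -h <node_host> -U postgres -c "SHOW wal_log_hints;"
psql -h <node_host> -U postgres -c "SHOW full_page_writes;"
psql -h <node_host> -U postgres -c "SHOW data_checksums;"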

[Image: REPMGR cluster status]

How can I tackle this situation? Can anyone please help me with it?

@JP95Git

JP95Git commented Nov 22, 2024

I had a similar problem. When I reboot my primary, e.g. to update the Linux kernel, the secondary is promoted to primary. To "fix" this, I pause the service before the reboot and unpause it after the reboot.

Pause service (execute on ONE node of the cluster):
/path/to/binary/repmgr --config-file=/path/to/config/repmgr.conf service pause

Unpause/continue service (execute on ONE node of the cluster):
/path/to/binary/repmgr --config-file=/path/to/config/repmgr.conf service unpause
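
To verify that the pause actually took effect on every node, the status command should help (same placeholder paths as above; its output includes a "Paused?" column per node):

/path/to/binary/repmgr --config-file=/path/to/config/repmgr.conf service status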

@aviralsingh21
Author

Hello @JP95Git,

I tried the steps you suggested, but this is not working for me.

  • I paused the repmgr service and confirmed the status: all 3 nodes' services were paused.
  • Removed the PostgreSQL Docker stack: docker stack rm <postgresql_stack_name>
  • Deployed the Docker stack again: docker stack deploy -c postgresql-ha.yml <postgresql_stack_name>
  • As soon as the service started, the standby node was again in an unreachable state (it was re-syncing with the primary). A few minutes later, the standby got synced.
  • Before unpausing the service, I confirmed the status again. Only the primary node's service was still paused; the standby node's service had been un-paused automatically (the commands I used to check and re-pause are sketched below).
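
The commands I used to check and re-pause, roughly (run via docker exec inside one of the containers; the repmgr.conf path is the Bitnami default as I understand it, so treat it as an assumption):

# "service status" prints a Paused? column per node; "service pause" re-pauses the whole cluster.
docker exec -it <pg-0_container> repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf service status
docker exec -it <pg-0_container> repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf service pause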

Actually, all this testing effort is aimed at finding the reason behind this issue:
[Image: cluster status]

One of the standby nodes' status is "running as primary", although all the connections are routing to the original primary.
To fix this I tried running repmgr standby register --force on the pg-1 node, but got the error "this node should be a standby".
Since this was a live setup and I couldn't afford to leave the cluster health in this state for long, I rebuilt the complete cluster from a pg_dump backup and restored it into fresh volumes. But this solution required about 20 minutes of downtime, as the database size was around 20 GB.

  • Performed a backup from the original cluster.
  • Stopped the Docker service and deleted the volumes.
  • Deployed the stack again, which created new volumes.
  • Restored the backup into the new volumes (rough commands below).
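
Rough backup/restore commands I used (database name, hosts, and backup path are placeholders):

# Dump from the old primary, then restore into the freshly deployed cluster.
pg_dump -Fc -h <old_primary_host> -U postgres -d <db_name> -f /backup/<db_name>.dump
createdb -h <new_primary_host> -U postgres <db_name>
pg_restore -h <new_primary_host> -U postgres -d <db_name> --no-owner /backup/<db_name>.dump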

I am looking for a way to solve this issue without requiring any downtime.


I tried another method to fix this:

  • I scaled the pg-1 service down to 0 using docker service scale persistence_pg-1=0.
  • Deleted the volume of pg-1.
  • Scaled the pg-1 service back to 1 using docker service scale persistence_pg-1=1.
  • pg-1 started to re-sync with the primary node automatically as soon as its container was up (consolidated commands below).
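
Consolidated, roughly what I ran (the volume name is a placeholder; note that docker volume rm has to be run on the swarm node hosting the pg-1 task, since named volumes are local to each node):

docker service scale persistence_pg-1=0
docker volume rm <pg-1_volume_name>
docker service scale persistence_pg-1=1
docker service logs --follow persistence_pg-1 2>&1 | grep "Cloning data"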

But again, the same issue occurred as before: every time the PostgreSQL service is started, both replica nodes start re-syncing with the primary node, even though I scaled only one replica node.

Is it some configuration setting that causes this re-syncing every time the services are deployed, or am I performing a wrong step somewhere?

@JP95Git

JP95Git commented Jan 16, 2025

@aviralsingh21
The "pause" command just stops the automatic failover, as seen here: https://www.repmgr.org/docs/current/repmgr-service-pause.html
The synchronisation of the nodes is not affected by the pause. All paused nodes should stay paused, even after restarting them.

I have never used Docker, so I can't help you with that part. But something went wrong during "stack rm" and "stack deploy", because some of your nodes were paused and some were not.

The message "running as primary" indicates a "split brain", which means that you have 2 primaries, each of them is holding some parts of your database. I also faced this situation, had to restore from a backup. I use repmgr in pause mode until I find a solution for the "split brain".

I think it is normal that the standby syncs again with the primary at restart. I set up WAL archiving using Barman, which allows the standby to catch up from the WAL archive very quickly, without doing a full sync.
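
Roughly the relevant pieces of my Barman setup (the server name "pg" and the Barman host are placeholders; the postgresql.conf/recovery lines are shown as comments):

# postgresql.conf on the PostgreSQL nodes: ship WAL segments to the Barman host.
#   archive_mode = on
#   archive_command = 'barman-wal-archive <barman_host> pg %p'
# recovery settings on the standby: fetch missing WAL from the archive.
#   restore_command = 'barman-wal-restore <barman_host> pg %f %p'
# on the Barman host: verify that archiving works end to end.
barman check pg
barman switch-wal --archive pg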

@aviralsingh21
Author

@JP95Git
I forgot to mention one point related to the "running as primary" issue.
Although it shows a split-brain situation, all the connections were routing to the original primary only. I confirmed this in the pgpool-II POOL_NODES status (load_balancing is also disabled); the command I use to check is below.
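
For reference, this is roughly how I check it (pgpool host and port are placeholders):

psql -h <pgpool_host> -p <pgpool_port> -U postgres -c "SHOW POOL_NODES;"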

Regarding the concern you raised about the docker stack rm and docker stack deploy commands: I now also feel there might be something to look into there.
I later tried deploying the service with fresh volumes and restored fresh data into the database, then restarted the service again. But even after re-deploying the stack, the standby nodes still re-sync with the primary node at every restart.
Thank you for providing a new angle for the investigation; I will explore that part too.

But I did confirm the service status of the nodes: they were all paused before stopping the service.
[Image]

Status after re-syncing was complete:
[Image]

Logs from the Standby Node at the time of re-syncing:

postgresql-repmgr 14:39:12.15
postgresql-repmgr 14:39:12.15 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 14:39:12.15 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql-repmgr
postgresql-repmgr 14:39:12.15 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql-repmgr/issues
postgresql-repmgr 14:39:12.15
postgresql-repmgr 14:39:12.16 INFO  ==> ** Starting PostgreSQL with Replication Manager setup **
postgresql-repmgr 14:39:12.18 INFO  ==> Validating settings in REPMGR_* env vars...
postgresql-repmgr 14:39:12.18 INFO  ==> Validating settings in POSTGRESQL_* env vars..
postgresql-repmgr 14:39:12.18 INFO  ==> Querying all partner nodes for common upstream node...
postgresql-repmgr 14:39:12.26 INFO  ==> Preparing PostgreSQL configuration...
postgresql-repmgr 14:39:12.27 INFO  ==> postgresql.conf file not detected. Generating it...
postgresql-repmgr 14:39:12.32 INFO  ==> Preparing repmgr configuration...
postgresql-repmgr 14:39:12.32 INFO  ==> Initializing Repmgr...
postgresql-repmgr 14:39:12.33 INFO  ==> Waiting for primary node...

Logs after re-syncing was complete:

postgresql-repmgr 14:40:40.66 INFO  ==> Initializing PostgreSQL database...
postgresql-repmgr 14:40:40.66 INFO  ==> Cleaning stale /bitnami/postgresql/data/standby.signal file
postgresql-repmgr 14:40:40.66 INFO  ==> Custom configuration /opt/bitnami/postgresql/conf/postgresql.conf detected
postgresql-repmgr 14:40:40.66 INFO  ==> Custom configuration /opt/bitnami/postgresql/conf/pg_hba.conf detected
postgresql-repmgr 14:40:40.68 INFO  ==> Deploying PostgreSQL with persisted data...
postgresql-repmgr 14:40:40.69 INFO  ==> Configuring replication parameters
postgresql-repmgr 14:40:40.71 INFO  ==> Configuring fsync
postgresql-repmgr 14:40:40.72 INFO  ==> Setting up streaming replication slave...
postgresql-repmgr 14:40:40.74 INFO  ==> Starting PostgreSQL in background...
postgresql-repmgr 14:40:41.16 INFO  ==> Unregistering standby node...
postgresql-repmgr 14:40:41.21 INFO  ==> Registering Standby node...
postgresql-repmgr 14:40:41.25 INFO  ==> Stopping PostgreSQL...
waiting for server to shut down.... done
server stopped
postgresql-repmgr 14:40:41.45 INFO  ==> ** PostgreSQL with Replication Manager setup finished! **

postgresql-repmgr 14:40:41.47 INFO  ==> Starting PostgreSQL in background...
waiting for server to start....2025-01-16 14:40:41 GMT [201]: [67891a69.c9-1] @,app= [00000] LOG: pgaudit extension initialized
2025-01-16 14:40:41 GMT [201]: [67891a69.c9-2] @,app= [00000] LOG: starting PostgreSQL 12.4 on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2025-01-16 14:40:41 GMT [201]: [67891a69.c9-3] @,app= [00000] LOG: listening on IPv4 address "0.0.0.0", port 5432
2025-01-16 14:40:41 GMT [201]: [67891a69.c9-4] @,app= [00000] LOG: listening on IPv6 address "::", port 5432
2025-01-16 14:40:41 GMT [201]: [67891a69.c9-5] @,app= [00000] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2025-01-16 14:40:41 GMT [201]: [67891a69.c9-6] @,app= [00000] LOG: redirecting log output to logging collector process
2025-01-16 14:40:41 GMT [201]: [67891a69.c9-7] @,app= [00000] HINT: Future log output will appear in directory "/opt/bitnami/postgresql/logs".
2025-01-16 14:40:41 GMT [203]: [67891a69.cb-1] @,app= [00000] LOG: database system was shut down in recovery at 2025-01-16 14:40:41 GMT
2025-01-16 14:40:41 GMT [203]: [67891a69.cb-2] @,app= [00000] LOG: entering standby mode
2025-01-16 14:40:41 GMT [203]: [67891a69.cb-3] @,app= [00000] LOG: redo starts at 2/2000028
2025-01-16 14:40:41 GMT [203]: [67891a69.cb-4] @,app= [00000] LOG: consistent recovery state reached at 2/2000138
2025-01-16 14:40:41 GMT [203]: [67891a69.cb-5] @,app= [00000] LOG: invalid record length at 2/4001530: wanted 24, got 0
2025-01-16 14:40:41 GMT [201]: [67891a69.c9-8] @,app= [00000] LOG: database system is ready to accept read only connections
done
server started
2025-01-16 14:40:41 GMT [207]: [67891a69.cf-1] @,app= [00000] LOG: started streaming WAL from primary at 2/4000000 on timeline 1
