
Add GUC controlling whether to pause recovery if some critical GUCs at replica have smaller value than on primary #9057

Merged
merged 20 commits into main from pause_recovery_at_misconfig on Dec 1, 2024

Conversation

knizhnik (Contributor)

Problem

See #9023

Summary of changes

Add GUC `recovery_pause_on_misconfig`, allowing recovery not to pause in case of a replica/primary configuration mismatch (a usage sketch follows the links below).

See neondatabase/postgres#501
See neondatabase/postgres#502
See neondatabase/postgres#503
See neondatabase/postgres#504
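
A minimal usage sketch, in the style of the regression tests discussed later in this thread. The `neon.` namespace, the boolean spelling, and the test body are assumptions of this sketch (the GUC is ultimately registered from Neon's code); only the name `recovery_pause_on_misconfig` comes from this PR.

import time

from fixtures.neon_fixtures import NeonEnv


def test_recovery_pause_on_misconfig_sketch(neon_simple_env: NeonEnv):
    # Sketch only: start a replica with a deliberately smaller critical GUC and
    # the new setting disabled, so recovery proceeds instead of pausing.
    env = neon_simple_env
    with env.endpoints.create_start(branch_name="main", endpoint_id="primary") as primary:
        with env.endpoints.new_replica_start(
            origin=primary,
            endpoint_id="secondary",
            config_lines=[
                "max_connections=5",
                # assumed spelling; the extension may expose it under another namespace
                "neon.recovery_pause_on_misconfig=false",
            ],
        ) as secondary:
            time.sleep(1)
            with secondary.connect() as s_con:
                with s_con.cursor() as s_cur:
                    # PG 14+ function; 'paused' would indicate the vanilla behaviour
                    s_cur.execute("select pg_get_wal_replay_pause_state()")
                    assert s_cur.fetchall()[0][0] == "not paused"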

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@knizhnik knizhnik requested a review from a team as a code owner September 19, 2024 06:58
@knizhnik knizhnik requested a review from ololobus September 19, 2024 06:58

github-actions bot commented Sep 19, 2024

6996 tests run: 6688 passed, 0 failed, 308 skipped (full report)


Flaky tests (2)

Postgres 17

Code coverage* (full report)

  • functions: 30.4% (8273 of 27226 functions)
  • lines: 47.8% (65234 of 136507 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
Last update: 1c57694 at 2024-12-01T07:11:38.492Z

@ololobus ololobus requested review from hlinnaka and MMeent September 19, 2024 12:08
@MMeent (Contributor) left a comment


See: neondatabase/postgres#501

I don't think we should be exposing this GUC in the global namespace, but rather from the extension's namespace (so we can only enable this behavior when the extension is loaded)

As for !recoveryPauseOnMisconfig, this can cause us to see more active transactions on a standby than that standby expects.

What's the behaviour when we see >max_connections concurrently active transactions on a hot standby?

@knizhnik (Contributor, Author)

See: neondatabase/postgres#501

I don't think we should be exposing this GUC in the global namespace, but rather from the extension's namespace (so we can only enable this behavior when the extension is loaded)

Sorry, I do not understand. We are using this GUC in the Postgres core code. Certainly we can register the variable in the Postgres code but define the GUC which changes it in the neon extension, but this is IMHO a strange solution. We could also add yet another hook (what to do on misconfig) and define it in the Neon extension.

But I think that such a "less strict" policy for checking the primary/replica configuration will be useful not only to Neon,
so maybe at some point it will be possible to propose this patch to the community.

As for !recoveryPauseOnMisconfig, this can cause us to see more active transactions on a standby than that standby expects.

What's the behaviour when we see >max_connections concurrently active transactions on a hot standby?

I have added a test for this case. Transactions are applied normally. This is the expected behaviour: LR serialises transactions and only one transaction is applied at each moment of time.

In some other cases of misconfiguration (for example, when max_prepared_transactions is smaller on the replica than on the primary), the replica will crash with a fatal error:

PG:2024-09-23 06:29:50.661 GMT [38233] FATAL:  maximum number of prepared transactions reached
PG:2024-09-23 06:29:50.661 GMT [38233] HINT:  Increase max_prepared_transactions (currently 5).
PG:2024-09-23 06:29:50.661 GMT [38233] CONTEXT:  WAL redo at 0/154CAA8 for Transaction/PREPARE: gid t5: 2024-09-23 06:29:50.661593+00
PG:2024-09-23 06:29:50.662 GMT [38230] LOG:  startup process (PID 38233) exited with exit code 1
PG:2024-09-23 06:29:50.662 GMT [38230] LOG:  terminating any other active server processes
PG:2024-09-23 06:29:50.663 GMT [38230] LOG:  shutting down due to startup process failure
PG:2024-09-23 06:29:50.663 GMT [38230] LOG:  database system is shut down

after which the control plane should restart the replica with synced config parameters, so the next recovery attempt should succeed.

@MMeent (Contributor) commented Sep 23, 2024

What's the behaviour when we see >max_connections concurrently active transactions on a hot standby?

I have added a test for this case. Transactions are applied normally. This is the expected behaviour: LR serialises transactions and only one transaction is applied at each moment of time.

I don't care about LR, I care about normal replication, which does not serialize transactions. And in that case, we're probably writing visibility information into data structs sized to MaxBackends, while we're writing > MaxBackends values into those, which is probably very unsafe.

Did you check that visibility information is correctly applied even at large concurrency?

Note that a replica's transaction state tracking is sized to 64 * (MaxBackends + max_prepared_xacts), so disarming this check is bound to create errors elsewhere during replay of transaction state.

E.g. spin up a replica with max_connections=5, a primary with max_connections=1000, and start a transaction in all those max_connections of the primary. If my math is correct, the WAL redo process on the replica will throw an error before even half of the connections' transaction IDs have been received, because it ran out of space to put those transaction IDs. I had to check that we didn't silently write into shared structs, and am happy we don't. But it's really not good to remove protections and assume everything still works just fine because you did some light testing; checks are usually in place to protect us against exactly those extreme cases.

@knizhnik knizhnik force-pushed the pause_recovery_at_misconfig branch from ab155a9 to e133d7d Compare September 23, 2024 15:57
@knizhnik (Contributor, Author)

I don't care about LR, I care about normal replication, which does not serialize transactions. And in that case, we're probably writing visibility information into data structs sized to MaxBackends, while we're writing > MaxBackends values into those, which is probably very unsafe.

Did you check that visibility information is correctly applied even at large concurrency? [...]

Sorry, too many different tickets are mixed up in my head :(
Certainly I was not speaking about physical replication,
and my reply that LR serialises transactions is completely irrelevant here.

But I failed to reproduce the recovery failure with max_connections equal to 100 on the primary and just 5 on the replica. I ran 90 parallel transactions and they were replicated normally:

import time

from fixtures.neon_fixtures import NeonEnv  # imports as in the other test_runner tests (assumed)


def test_physical_replication_config_mismatch(neon_simple_env: NeonEnv):
    env = neon_simple_env
    with env.endpoints.create_start(
        branch_name="main",
        endpoint_id="primary",
    ) as primary:
        with primary.connect() as p_con:
            with p_con.cursor() as p_cur:
                p_cur.execute(
                    "CREATE TABLE t(pk bigint primary key, payload text default repeat('?',200))"
                )
        time.sleep(1)

        with env.endpoints.new_replica_start(
            origin=primary,
            endpoint_id="secondary",
            config_lines=["max_connections=5"],
        ) as secondary:
            with secondary.connect() as s_con:
                with s_con.cursor() as s_cur:
                    cursors = []
                    # open 90 concurrent transactions on the primary, far more than
                    # the replica's max_connections=5
                    for i in range(90):
                        p_con = primary.connect()
                        p_cur = p_con.cursor()
                        p_cur.execute("begin")
                        p_cur.execute("insert into t (pk) values (%s)", (i,))
                        cursors.append(p_cur)

                    for p_cur in cursors:
                        p_cur.execute("commit")

                    time.sleep(5)
                    s_cur.execute("select count(*) from t")
                    assert s_cur.fetchall()[0][0] == 90
                    s_cur.execute("show max_connections")
                    assert s_cur.fetchall()[0][0] == '5'

Any idea why it works?

@MMeent (Contributor) commented Sep 23, 2024

But I failed to reproduce the recovery failure with max_connections equal to 100 on the primary and just 5 on the replica. I ran 90 parallel transactions and they were replicated normally:

I said 1000, not 100.

The issue occurs at n_entries >= 64 * max_connections, so you'll have to consume 320 (multi)xids for that array to fill up at max_connections=5. I'd test with 1, but I'm not sure we can start with max_connections=1. If we can, you can use that as replica node setting instead.
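
A quick back-of-the-envelope check of that threshold, using the factor quoted above (the 64-entries-per-connection figure is taken from this comment; the exact constant in Postgres' KnownAssignedXids sizing may differ slightly):

# Illustration only: rough capacity of the standby's known-assigned-XIDs tracking,
# per the formula n_entries >= 64 * max_connections quoted above.
ENTRIES_PER_CONNECTION = 64

def known_assigned_capacity(replica_max_connections: int) -> int:
    return ENTRIES_PER_CONNECTION * replica_max_connections

print(known_assigned_capacity(5))  # 320 -> matches the "320 (multi)xids" figure above
print(known_assigned_capacity(2))  # 128 -> with max_connections=2 (as in the reproduction
                                   # further down), a few hundred concurrent xids overflow this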

@knizhnik (Contributor, Author)

I said 1000, not 100.

The issue occurs at n_entries >= 64 * max_connections, so you'll have to consume 320 (multi)xids for that array to fill up at max_connections=5. [...]

Sorry, can you explain the source of this formula: n_entries >= 64 * max_connections?
I changed the test to 400 connections on the primary and 4 on the replica, and it still passed:

def test_physical_replication_config_mismatch(neon_simple_env: NeonEnv):
    env = neon_simple_env
    with env.endpoints.create_start(
        branch_name="main",
        endpoint_id="primary",
        config_lines=["max_connections=500"],
    ) as primary:
        with primary.connect() as p_con:
            with p_con.cursor() as p_cur:
                p_cur.execute(
                    "CREATE TABLE t(pk bigint primary key, payload text default repeat('?',200))"
                )
        time.sleep(1)

        with env.endpoints.new_replica_start(
            origin=primary,
            endpoint_id="secondary",
            config_lines=["max_connections=4"],
        ) as secondary:
            with secondary.connect() as s_con:
                with s_con.cursor() as s_cur:
                    cursors = []
                    for i in range(400):
                        p_con = primary.connect()
                        p_cur = p_con.cursor()
                        p_cur.execute("begin")
                        p_cur.execute("insert into t (pk) values (%s)", (i,))
                        cursors.append(p_cur)

                    time.sleep(5)

                    for p_cur in cursors:
                        p_cur.execute("commit")
                    time.sleep(2)

                    s_cur.execute("select count(*) from t")
                    assert s_cur.fetchall()[0][0] == 400

@knizhnik (Contributor, Author)

With 900 connections on the primary the test also passed.

@knizhnik knizhnik force-pushed the pause_recovery_at_misconfig branch from e133d7d to 6a688cf Compare September 25, 2024 14:42
@MMeent (Contributor) commented Oct 16, 2024

OK, I've found a case where we hit the elog(ERROR) in KnownAssignedXidsAdd on the secondary.

Configuration:

Primary:

max_connections=1000

Secondary:

max_connections=2
autovacuum_max_workers=1
max_worker_processes=5
max_wal_senders=1
superuser_reserved_connections=0

Execute 650+ of the following transactions concurrently on the primary, e.g. with pgbench -c 990 -f script.sql:

BEGIN;
INSERT INTO test SELECT 1;
SELECT pg_sleep(10);
COMMIT;

You can adjust the secondary's max_connections upward if you increase the number of subxacts consumed by the benchmark transaction before the pg_sleep() operation.
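
For illustration, a sketch of what consuming extra subxacts per transaction could look like, written in the style of the Python snippets above rather than as a pgbench script. The helper name and the subxact count are hypothetical; primary.connect() follows the earlier test snippets, and the test table is the one from the script above.

# Hypothetical helper (not part of the PR): open a transaction on the primary and
# burn extra subtransaction XIDs with savepoints before parking in pg_sleep(),
# so each backend contributes many XIDs for the standby to track.
def start_subxact_heavy_txn(primary, n_subxacts: int = 70):
    con = primary.connect()
    cur = con.cursor()
    cur.execute("begin")
    cur.execute("insert into test select 1")
    for i in range(n_subxacts):
        cur.execute(f"savepoint sp{i}")
        cur.execute("insert into test select 1")  # DML assigns the subtransaction an XID
        cur.execute(f"release savepoint sp{i}")
    cur.execute("select pg_sleep(10)")  # hold the transaction (and its XIDs) open
    return con, cur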

@MMeent (Contributor) commented Oct 17, 2024

@knizhnik can you verify my findings?

@knizhnik (Contributor, Author)

@knizhnik can you verify my findings?

I have created a test based on your scenario and reproduced FATAL: too many KnownAssignedXids.
But this is actually expected behaviour!
Yes, we know that smaller values of some critical parameters can cause recovery failure.
The probability of it is expected to be very small, but it can happen and cause a replica restart.
As the replica is always restarted from the most recent LSN, it skips the transactions which caused this failure and most likely will be able to proceed.

So, I do not treat this error as a reason for rejecting this approach, do you?

@MMeent (Contributor) commented Oct 17, 2024

As the replica is always restarted from the most recent LSN, it skips the transactions which caused this failure and most likely will be able to proceed.

I don't see it that way. If a user has a workload that causes this crash, then it's likely they will hit this again. And I don't like the idea of a primary that can consistently cause a secondary to crash.

@knizhnik (Contributor, Author)

I don't see it that way. If a user has a workload that causes this crash, then it's likely they will hit this again. And I don't like the idea of a primary that can consistently cause a secondary to crash.

Well, somebody needs to make a decision.
I see very good reasons to support different configurations of the primary and the replica (different workloads, OLTP vs. OLAP, ...).

The probability of this kind of problem is very, very low. In your case we had to specify max_connections=1000 for the primary and just 2 for the replica; in real life nobody will ever set up such a configuration. Moreover, we do not allow users to alter the GUCs which are critical for replication (like max_connections, max_prepared_transactions, ...). The values of some of these GUCs are currently fixed and some of them depend on the number of CUs, and the possible range of values, for example for max_connections, excludes such situations and crashes of the replica.

Also, as far as I understand, @hlinnaka is going to use his patch with CSN at the replica, which will completely eliminate this problem with known XIDs. Yes, that may not happen soon (still, I hope that we will do it before the patch is committed to vanilla).

And a last point: if some customer manages to spawn 1000 active transactions, then most likely they will be faced with many other problems (OOM, local disk space exhaustion, ...) much more critical than problems with replication.

@hlinnaka (Contributor) left a comment

Let's add tests for overrunning max_prepared_transactions and max_locks_per_transaction too, i.e. a test that creates many prepared transactions on the primary, and a test that acquires a lot of AccessExclusiveLocks on the primary.
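
For reference, a rough sketch of what the prepared-transactions test could look like, following the pattern of the earlier snippets in this thread. The endpoint names, GUC values, and table are illustrative only, not the test that was actually added; the expectation (per the log quoted earlier) is that replaying the PREPARE records overruns the replica's smaller max_prepared_transactions.

import time

from fixtures.neon_fixtures import NeonEnv


def test_replica_prepared_transactions_overrun(neon_simple_env: NeonEnv):
    # Sketch only: the primary allows more prepared transactions than the replica,
    # so WAL replay on the replica should hit "maximum number of prepared
    # transactions reached", as in the log quoted earlier in this thread.
    env = neon_simple_env
    with env.endpoints.create_start(
        branch_name="main",
        endpoint_id="primary",
        config_lines=["max_prepared_transactions=20"],
    ) as primary:
        with primary.connect() as p_con:
            with p_con.cursor() as p_cur:
                p_cur.execute("create table t(x int)")
        with env.endpoints.new_replica_start(
            origin=primary,
            endpoint_id="secondary",
            config_lines=["max_prepared_transactions=5"],
        ):
            cons = []
            for i in range(10):
                con = primary.connect()
                cur = con.cursor()
                cur.execute("begin")
                cur.execute("insert into t values (%s)", (i,))
                cur.execute(f"prepare transaction 'gid{i}'")
                cons.append(con)
            time.sleep(5)  # give the replica time to replay the PREPARE records (and fail)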

@knizhnik knizhnik force-pushed the pause_recovery_at_misconfig branch from a6d1cd3 to 15ed135 Compare October 31, 2024 15:26
@knizhnik knizhnik force-pushed the pause_recovery_at_misconfig branch 2 times, most recently from 1e86ffa to 1457f7f Compare November 13, 2024 17:16
@knizhnik knizhnik requested a review from hlinnaka November 18, 2024 11:47
@knizhnik knizhnik requested a review from hlinnaka November 19, 2024 10:16
@MMeent (Contributor) previously requested changes Nov 28, 2024 and left a comment


I think this is still missing the changes that register the new GUC from Neon's code.

Apart from that (which thus requires changes), within the scope of needing this, LGTM

@knizhnik knizhnik force-pushed the pause_recovery_at_misconfig branch from eefba7b to 48b46af Compare November 29, 2024 15:02
@knizhnik knizhnik force-pushed the pause_recovery_at_misconfig branch from a139e87 to a5e60f5 Compare November 30, 2024 06:46
@knizhnik knizhnik requested a review from a team as a code owner November 30, 2024 19:50
@knizhnik knizhnik requested a review from VladLazar November 30, 2024 19:50
@knizhnik knizhnik requested a review from MMeent December 1, 2024 11:54
@knizhnik (Contributor, Author) commented Dec 1, 2024

I think this is still missing the changes that register the new GUC from Neon's code.

Apart from that (which thus requires changes), within the scope of needing this, LGTM

Thank you for noticing it.
Somehow the definition of the GUC disappeared from my commit.
Fixed.

@knizhnik knizhnik dismissed MMeent’s stale review December 1, 2024 12:23

Requested change is done

@knizhnik knizhnik added this pull request to the merge queue Dec 1, 2024
Merged via the queue into main with commit 97a9abd Dec 1, 2024
80 checks passed
@knizhnik knizhnik deleted the pause_recovery_at_misconfig branch December 1, 2024 12:25
awarus pushed a commit that referenced this pull request Dec 5, 2024