Add GUC controlling whether to pause recovery if some critical GUCs at replica have smaller value than on primary #9057
6996 tests run: 6688 passed, 0 failed, 308 skipped (full report)
Flaky tests (2)
Postgres 17
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results
1c57694 at 2024-12-01T07:11:38.492Z
See: neondatabase/postgres#501
I don't think we should be exposing this GUC in the global namespace, but rather from the extension's namespace (so we can only enable this behavior when the extension is loaded)
As for `!recoveryPauseOnMisconfig`, this can cause us to see more active transactions on a standby than that standby expects.
What's the behaviour when we see >max_connections concurrently active transactions on a hot standby?
Sorry, I do not understand. We are using this GUC in Postgres core code. Certainly we can register the variable in Postgres code but define the GUC which changes it in the neon extension. But this is IMHO a strange solution. We can also add yet another hook (what to do on misconfig) and define it in the Neon extension. But I think that such a "less strict" policy for checking primary/replica configuration will be useful not only to Neon.
I have added a test for this case. Transactions are applied normally. This is expected behaviour: LR serialises transactions, so only one transaction is applied at any moment. For some other kinds of misconfiguration (for example, max_prepared_transactions smaller than at the primary), the replica will crash with the fatal error:
after which the control plane should restart the replica with synced config parameters, so the next attempt should succeed.
I don't care about LR, I care about normal replication, which does not serialize transactions. And in that case, we're probably writing visibility information into data structs sized to MaxBackends, while we're writing > MaxBackends values into those, which is probably very unsafe. Did you check that visibility information is correctly applied even at large concurrency? Note that a replica's transaction state handling mechanism is managed by E.g. spin up a replica with |
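To make the sizing concern above concrete, here is a rough sketch of how a standby's known-assigned-XIDs capacity could be estimated from its own settings. The constant and both formulas are assumptions based on a reading of upstream PostgreSQL (version 17 and earlier) sources, and the default values are upstream defaults; none of this comes from the PR itself, and Neon's fork or its configuration may differ.

```python
# Sketch only: the constant and formulas below are assumptions taken from
# upstream PostgreSQL (<= 17) sources, not from this PR.
PGPROC_MAX_CACHED_SUBXIDS = 64  # cached subtransaction XIDs per backend


def max_backends(max_connections, autovacuum_max_workers=3,
                 max_worker_processes=8, max_wal_senders=10):
    """Approximate MaxBackends; the +1 accounts for the autovacuum launcher."""
    return (max_connections + autovacuum_max_workers + 1
            + max_worker_processes + max_wal_senders)


def known_assigned_xids_capacity(max_connections,
                                 max_prepared_transactions=0, **other_gucs):
    """Estimate how many XIDs the standby's tracking array has room for."""
    procarray_maxprocs = (max_backends(max_connections, **other_gucs)
                          + max_prepared_transactions)
    return (PGPROC_MAX_CACHED_SUBXIDS + 1) * procarray_maxprocs


for replica_max_connections in (2, 5, 100, 1000):
    print(replica_max_connections,
          known_assigned_xids_capacity(replica_max_connections))
```

Under these assumed defaults, the array is sized from the replica's settings only, so a primary with much larger limits can in principle generate more concurrently running XIDs than the replica has slots for. The estimate is an upper bound on what fits, not a guarantee, since the array may also hold entries that have not been cleaned up yet.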
ab155a9 to e133d7d
Sorry, many different tickets are mixed up in my head :( But I failed to reproduce the recovery failure with max_connections equal to 100 at the primary and just 5 at the replica. I ran 90 parallel transactions and they were replicated normally:
Any idea why it works?
I said 1000, not 100. The issue occurs at |
Sorry, can you explain the source of this formula:
With 900 connections at the primary, the test also passed.
e133d7d to 6a688cf
OK, I've found a case where we hit the Configuration: Primary:
Secondary:
Execute 650+ concurrently on the primary, e.g. with
You can adjust the secondary's |
@knizhnik can you verify my findings? |
I have created a test based on your scenario and reproduced So, I do not treat this error as a reason for rejecting this approach, do you?
I don't see it that way. If a user has a workload that causes this crash, then it's likely they will hit this again. And I don't like the idea of a primary that can consistently cause a secondary to crash. |
Well, somebody needs to make a decision. The probability of this kind of problem is very, very low. In your case we need to specify max_connections=1000 for the primary and just 2 for the replica. In real life nobody will ever set up such a configuration. Moreover, we do not allow users to alter GUCs which are critical for replication (like max_connections, max_prepared_transactions, ...). The values of some of these GUCs are now fixed, and some of them depend on the number of CUs. And possible range of values for example for
Also, as far as I understand, @hlinnaka is going to use his patch with CSN at the replica, which will completely eliminate this problem with known XIDs. Yes, that may not happen soon (still, I hope that we will do it before the patch is committed in vanilla).
And one last point: if some customer manages to spawn 1000 active transactions, then most likely they will face many other problems (OOM, local disk space exhaustion, ...) much more critical than problems with replication.
Let's add tests for overrunning `max_prepared_transactions` and `max_locks_per_transaction` too. I.e. a test that creates many prepared transactions in the primary, and a test that acquires a lot of AccessExclusiveLocks in the primary.
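As a rough illustration of the two workloads suggested above, the sketch below only generates the SQL a test could send to the primary. The helper names, the table names (`t`, `lock_t_N`) and the transaction identifiers (`ptx_N`) are hypothetical, and the sketch deliberately does not use the project's test fixtures; it is not the test that was added in this PR.

```python
def prepared_xact_statements(n, table="t"):
    """Return n statement groups, each leaving one prepared transaction behind.

    One session can run all groups in sequence, because PREPARE TRANSACTION
    detaches the transaction from the session. The table name is hypothetical
    and must already exist with a single integer column.
    """
    groups = []
    for i in range(n):
        groups.append([
            "BEGIN",
            f"INSERT INTO {table} VALUES ({i})",
            f"PREPARE TRANSACTION 'ptx_{i}'",
        ])
    return groups


def access_exclusive_lock_statements(n, prefix="lock_t"):
    """Return (setup, body): body locks n tables inside one open transaction.

    The body has no COMMIT on purpose, so the locks stay held until the
    caller ends the transaction.
    """
    setup = [f"CREATE TABLE IF NOT EXISTS {prefix}_{i} (x int)"
             for i in range(n)]
    body = ["BEGIN"] + [
        f"LOCK TABLE {prefix}_{i} IN ACCESS EXCLUSIVE MODE" for i in range(n)
    ]
    return setup, body
```

A test would pick `n` larger than what the replica's `max_prepared_transactions` or lock table can hold while staying within the primary's own limits, then check how the replica reacts.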
a6d1cd3 to 15ed135
1e86ffa to 1457f7f
I think this is still missing the changes that register the new GUC from Neon's code.
Apart from that (which thus requires changes), within the scope of needing this, LGTM
eefba7b to 48b46af
…smatch_max_locks_per_transaction to be compatible with older Postgres versions, which do not place the GUC name in quotes
Co-authored-by: Heikki Linnakangas <[email protected]>
a139e87 to a5e60f5
Thank you for noticing it. |
…t replica have smaller value than on primary (#9057)
## Problem
See #9023
## Summary of changes
Add GUC `recovery_pause_on_misconfig` allowing not to pause in case of replica and primary configuration mismatch
See neondatabase/postgres#501
See neondatabase/postgres#502
See neondatabase/postgres#503
See neondatabase/postgres#504
## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.
## Checklist before merging
- [ ] Do not forget to reformat commit message to not include the above checklist
---------
Co-authored-by: Konstantin Knizhnik <[email protected]>
Co-authored-by: Heikki Linnakangas <[email protected]>
Problem
See #9023
Summary of changes
Add GUC `recovery_pause_on_misconfig`, allowing recovery not to pause when the replica and primary configurations mismatch.
See neondatabase/postgres#501
See neondatabase/postgres#502
See neondatabase/postgres#503
See neondatabase/postgres#504
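The intended effect of the GUC can be pictured with a toy model. This is a simplified sketch, not the patch's code: the function and enum names are hypothetical, and it assumes the baseline behaviour of upstream PostgreSQL, where a hot standby pauses recovery when one of its critical settings is lower than the primary's.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    PAUSE = "pause"


def on_parameter_mismatch(replica, primary, recovery_pause_on_misconfig=True):
    """Toy model: decide what replay does given both sides' critical GUCs.

    `replica` and `primary` map GUC names to integer values. Returns the
    action together with the names of the settings that are too low.
    """
    too_low = sorted(name for name, needed in primary.items()
                     if replica.get(name, 0) < needed)
    if too_low and recovery_pause_on_misconfig:
        return Action.PAUSE, too_low
    # Either nothing is too low, or the check is relaxed: keep replaying.
    return Action.CONTINUE, too_low


primary = {"max_connections": 100, "max_prepared_transactions": 0}
replica = {"max_connections": 5, "max_prepared_transactions": 0}
print(on_parameter_mismatch(replica, primary))
print(on_parameter_mismatch(replica, primary,
                            recovery_pause_on_misconfig=False))
```

With the setting disabled, the model keeps replaying despite the mismatch; as discussed in the thread, the replica can still fail later if the primary's workload actually exceeds what the replica was sized for.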
Checklist before requesting a review
Checklist before merging