Possible FAILURE STATE in State Machine #1025
Description
While unlikely, it is possible to trigger a failure state that cannot be resolved with the provided tools; the assigned state has to be set manually.
In a cluster with two nodes and one quorum, the following sequence of events can occur.
Beginning STATE
NODE1 PRIMARY
NODE2 SECONDARY
Sequence of events
NODE1 has an error in .pgpass that prevents it from communicating with the rest of the cluster, but the error by itself does not cause an automatic switchover.
NODE2 tries to enable maintenance.
STATE SECONDARY > wait_maintenance
NODE1 is assigned state
PRIMARY > wait_primary, but the error in .pgpass causes it to fall into demote_timeout, which causes an impasse.
NODE1 cannot reach its target state:
FATAL pg_autoctl does not know how to reach state "wait_primary" from "demote_timeout"
NODE2 cannot leave maintenance because it is stuck in wait_maintenance. Neither node will start, and the whole cluster gets stuck.
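The impasse can be illustrated with a toy transition table. This is only a minimal sketch, not pg_autoctl's actual state machine: the `TRANSITIONS` set and the `next_step` and `try_step` helpers are hypothetical, and the table includes only the states named in this issue.

```python
# Toy model of the impasse. NOT pg_autoctl's real FSM: the transition
# set below is an assumption covering only the states named in this issue.
TRANSITIONS = {
    ("primary", "wait_primary"),
    ("primary", "demote_timeout"),
    ("secondary", "wait_maintenance"),
    ("wait_maintenance", "maintenance"),
}


def next_step(current: str, goal: str) -> str:
    """Return the goal if a direct transition exists, else raise."""
    if (current, goal) in TRANSITIONS:
        return goal
    raise LookupError(
        f'does not know how to reach state "{goal}" from "{current}"'
    )


def try_step(current: str, goal: str) -> str:
    """Attempt one transition; stay in the current state on failure."""
    try:
        return next_step(current, goal)
    except LookupError:
        return current
```

In this model, `try_step("demote_timeout", "wait_primary")` leaves the node in `demote_timeout` no matter how often it is retried, which mirrors the situation reported for NODE1.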
Workaround:
update node set goalstate='primary' where nodeid=1;
This allows NODE1 to start and transition to wait_primary, which in turn allows NODE2 to reach maintenance.
Expected solution:
A transition from demote_timeout to wait_primary should be implemented.
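The effect of the requested transition can be sketched as a reachability check over a small state graph. The `EDGES` graph and the `reachable` and `with_edge` helpers below are hypothetical and limited to states mentioned in this issue; they are not taken from the pg_auto_failover source.

```python
from collections import deque

# Hypothetical transition graph, limited to states named in this issue.
EDGES = {
    "primary": {"wait_primary", "demote_timeout"},
    "demote_timeout": set(),
    "wait_primary": set(),
}


def reachable(edges, start, goal):
    """Breadth-first search: can `goal` be reached from `start`?"""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return True
        for nxt in edges.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def with_edge(edges, src, dst):
    """Return a copy of the graph with one extra transition added."""
    patched = {state: set(targets) for state, targets in edges.items()}
    patched.setdefault(src, set()).add(dst)
    return patched
```

With the graph as given, `reachable(EDGES, "demote_timeout", "wait_primary")` is false. After `with_edge(EDGES, "demote_timeout", "wait_primary")`, the same check succeeds, which is the behavior this issue asks for.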