
Delay the next campaign if the node lost the vote #169

Open
wants to merge 5 commits into
base: main

Conversation

@ahrtr (Member) commented Feb 22, 2024

Delay the next campaign if the node lost the vote. It's highly likely it will also lose the next campaign, so it makes more sense to prioritize campaigns by other nodes within the current term.

An alternative fix for etcd-io/etcd#17455. It has better readability than #167.

@boringhello

I think a better approach would be to verify whether the rejection is due to outdated logs before delaying the next campaign. This approach may potentially increase the downtime of the cluster.

@ahrtr (Member, Author) commented Feb 23, 2024

I think a better approach would be to verify if the rejection is due to outdated logs before delaying the next request.

Yes, I was thinking the same. But unfortunately, we don't have such info in the MsgVoteResp.

This approach may potentially increase the downtime of the cluster.

It should be fine. It just delays the current node's next campaign; it doesn't change other nodes' campaign timings. Also, the randomizedElectionTimeout will be reset to the normal value in the next term.

@ahrtr (Member, Author) commented Feb 23, 2024

@Cjen1 I think this PR can also resolve what your PR #86 is trying to fix, and with a minimal change. Please take a look and let me know if you have any comments. Thanks.

@erikgrinaker @pav-kv PTAL

@erikgrinaker (Contributor)

@ahrtr Hey, sorry for the review lag here, it's been a crazy week. I'll be on PTO for the next week, but maybe @pav-kv can take a look.

@pav-kv (Contributor) commented Feb 26, 2024

@ahrtr Are you able to reproduce the problem in etcd-io/etcd#17455 in a datadriven test (the tests in testdata, for TestInteraction)? What kind of race are we trying to fix?

@pav-kv (Contributor) commented Feb 26, 2024

This change seems to be concerned with the situation where a candidate loses because of having an outdated log. However, the fix may affect the split-vote situation: when two candidates duel and both lose the campaign because of a split vote, both will bump the election timeout, and the election may stall for a bit. It appears that there is no silver bullet against all these situations.

Enabling PreVote would address the outdated log problem more directly.

@ahrtr (Member, Author) commented Feb 26, 2024

Enabling PreVote would address the outdated log problem more directly.

Yes. This PR is fixing the use cases where PreVote isn't enabled at all.

The idea is to skip the next electionTimeout to give other nodes a chance to win in the term+1 rather than current term?

The idea is to increase the current node's randomizedElectionTimeout when it loses the vote. Normally it's in the range [electionTimeout, 2*electionTimeout). If a node lost the vote, we intentionally set the range to [2*electionTimeout, 3*electionTimeout), so that, within the current term, other nodes will always campaign before the node that already lost the vote.
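
A minimal sketch of that range shift (the constant and function names here are hypothetical, not the actual raft fields):

```go
package main

import (
	"fmt"
	"math/rand"
)

// electionTimeout is the base election timeout in ticks.
const electionTimeout = 10

// randomizedTimeout draws the next randomized election timeout.
// Normally it falls in [electionTimeout, 2*electionTimeout). If the node
// just lost a vote, the whole range is shifted up by one electionTimeout
// to [2*electionTimeout, 3*electionTimeout), so peers drawing from the
// normal range always fire first within the current term.
func randomizedTimeout(lostVote bool) int {
	t := electionTimeout + rand.Intn(electionTimeout)
	if lostVote {
		t += electionTimeout
	}
	return t
}

func main() {
	fmt.Println(randomizedTimeout(false)) // somewhere in [10, 20)
	fmt.Println(randomizedTimeout(true))  // somewhere in [20, 30)
}
```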

Let's work through an example. Assume a cluster has 3 nodes, and one of them (say node A) lags behind the others. When they start to campaign, node A has a 33% chance of starting the campaign before the other nodes. Its vote request will definitely be rejected by the other nodes. Then all 3 nodes will conduct the next round of election fairly, and node A again has a 33% chance of campaigning first. Eventually it may take a long time for a real leader to get elected. That is exactly what this PR tries to fix.

Note that randomizedElectionTimeout will automatically be reset in the next term.

When two candidates duel and both lose campaign because of a split vote, both will bump the election timeout, and the election may stall for a bit.

Yes, it's true. But I imagine the possibility of running into such a split vote should be far less than the 33% mentioned above. Usually we recommend an odd number of members in a cluster. But this isn't verified yet. Ideally we should have a tool to measure the average recovery time, something like what #86 does.

Are you able to reproduce the problem in etcd-io/etcd#17455 in a datadriven test (the tests in testdata, for TestInteraction)?

The datadriven test might not be a good choice to reproduce & verify the fix. The PR doesn't change the logic, so it has no impact on correctness.

@ahrtr (Member, Author) commented Feb 26, 2024

When two candidates duel and both lose campaign because of a split vote, both will bump the election timeout, and the election may stall for a bit.

Even in a split vote, we do not set all nodes' randomizedElectionTimeout to [2*electionTimeout, 3*electionTimeout). Instead, we only update the randomizedElectionTimeout for the nodes that have lost the vote.

@pav-kv (Contributor) commented Feb 26, 2024

The datadriven test might not be a good choice to reproduce & verify the fix. The PR doesn't change the logic, so it has no impact on correctness.

The datadriven test is a good way, IMO. It clearly shows the behaviour. Yes, we're not changing correctness, but we're changing the behaviour. The datadriven harness gives ways to emulate lagging and duelling campaigns; you can cause specific message delivery orderings that reproduce the problem. There is also a command for setting specific election timeouts which might be useful here.

To demonstrate the effect of this PR, it would be good to 1) add, in the first commit, a test with the election rejection scenario you're targeting, and 2) fix it in the second commit; the diff in the testdata file will demonstrate the effect.

The idea is to increase the current node's randomizedElectionTimeout when it loses the vote. Normally it's in the range [electionTimeout, 2*electionTimeout). If a node lost the vote, we intentionally set the range to [2*electionTimeout, 3*electionTimeout), so that, within the current term, other nodes will always campaign before the node that already lost the vote.

I'm still not getting the problem. Say all the nodes campaign in the [electionTimeout, 2*electionTimeout) interval: at t1, t2, t3. Yes, the first node A can lose the election at t1. If the reason for the rejected votes was a short log, all the nodes will vote in this same term for another, longer log, at t2 or t3. Upon losing the election, node A already moves its election time by at least electionTimeout, so the next time it campaigns, t4, is > t2 and > t3.

@ahrtr (Member, Author) commented Feb 26, 2024

If the reason for the rejected votes was a short log, all the nodes will vote in this same term for another, longer log, at t2 or t3. Upon losing the election, node A already moves its election time by at least electionTimeout, so the next time it campaigns, t4, is > t2 and > t3.

Note this PR is fixing the cases where preVote isn't enabled at all. Each time any node campaigns, it increases its term by 1, and other nodes are converted to followers when they receive the MsgVote. This means that even if a node lost its vote, all nodes' randomizedElectionTimeout and electionElapsed will be reset.

raft/raft.go

Line 1094 in d475d7e

r.becomeFollower(m.Term, None)

raft/raft.go

Lines 774 to 777 in d475d7e

r.electionElapsed = 0
r.heartbeatElapsed = 0
r.resetRandomizedElectionTimeout()

@Cjen1 commented Feb 26, 2024

@ahrtr Just checking my intuition here. Given a lagging node that calls an election and fails to be elected, you want to bias the retried election towards the other replicas?

I think some issues arise when there is asynchrony, or when the election timeout is roughly the same order of magnitude as the round-trip time. Specifically, this means that before receiving the next MsgVote the nodes have already sent their own (this causes duelling leaders), and hence each retry takes (2→3) * electionTimeout rather than (1→2) * electionTimeout.

For context I've been running some tests in cloudlab in a 3 node cluster with ~50ms latency between nodes.
With 500ms electionTimeout, I'm seeing 2-5 retries before a leader is elected.
(I'm currently calibrating the reckon test setup, so I am not 100% satisfied with the reliability of these numbers, or that they weren't statistical flukes.)

@pav-kv (Contributor) commented Feb 26, 2024

@ahrtr I see. You're saying that the nodes that rejected the vote will become followers and reset the timeout. So my assumption that t4 > t2 and t4 > t3 is incorrect? Then it's possible that the short-log node A campaigns first again.

I wonder though if, alternatively to the fix in this PR, we should just not reset the election timeout in the nodes who reject a vote and becomeFollower(leader=None). Then they will campaign earlier than the rejected node A tries again.

@ahrtr (Member, Author) commented Feb 26, 2024

I think some issues arise when there is asynchrony or the election timeout is roughly the same order of magnitude as the round-trip-time.

Yes, it's true. It's a real issue even without the change in this PR, but it's a misconfiguration of the election timeout. Usually we recommend the heartbeat interval to be roughly the round-trip time between members, and the election timeout to be about 10 times the heartbeat interval.

Please refer to https://etcd.io/docs/v3.5/tuning/#time-parameters

So my assumption that t4 > t2 and > t3, is incorrect? Then it's possible that the short log A campaigns first again.

Correct.

I wonder though if, alternatively to the fix in this PR, we should just not reset the election timeout in the nodes who reject a vote and becomeFollower(leader=None). Then they will campaign earlier than the rejected node A tries again.

It's exactly my original thought. Please see #167.

@ahrtr (Member, Author) commented Feb 26, 2024

Just checking my intuition here. Given there is a lagging node that calls and fails to be elected, you want to bias the retried election to the other replicas?

Yes. Normally the randomized election timeout is in range [electionTimeout, 2*electionTimeout). If a node lost the vote, we intentionally set the range to [2*electionTimeout, 3*electionTimeout).

But it's only within one term. All the time parameters will be reset in the next term.

@boringhello commented Feb 28, 2024

Yes, I was thinking the same. But unfortunately, we don't have such info in the MsgVoteResp.

Is it possible to add whether it was rejected because of the log? It feels like very little would need to be added to MsgVoteResp. The node could then stop campaigning in this raft activity, because it cannot become a leader until it receives the append messages and catches up on the log. This may also be an optimization point. I feel that the original raft's expression of rejection is not perfect, because being rejected due to a higher term and being rejected due to the log are subtly different cases.

@ahrtr (Member, Author) commented Feb 28, 2024

Added a data-driven test case. @pav-kv PTAL; I suggest reviewing it commit by commit. @Cjen1 Please also take a look if you have time. Thanks both.

Also responding to your previous comment on split vote, actually the vote result in split vote is VotePending. This PR only updates the randomized election timeout when the vote result is VoteLost.

Is it possible to add whether it was rejected because of the log?

No such plan in this PR for now. Also note that one peer's rejection doesn't mean the current node will necessarily lose the campaign.

@boringhello commented Feb 28, 2024

Also note that one peer's rejection doesn't mean current node will have to necessarily lose

at the majority of nodes

@ahrtr (Member, Author) commented Feb 29, 2024

Also responding to your previous comment on split vote, actually the vote result in split vote is VotePending. This PR only updates the randomized election timeout when the vote result is VoteLost.

This isn't correct. For a split vote, the vote result is VoteLost instead of VotePending. But my previous comment in #169 (comment) still stands.

@ahrtr force-pushed the prevote_20240222 branch 3 times, most recently from 861120d to 1e3047d on March 4, 2024
@ahrtr (Member, Author) commented Mar 7, 2024

@erikgrinaker @pav-kv PTAL when you have time.

@Cjen1 do you have any immediate concern?

Thanks.

Delay the next campaign if the node lost the vote. It's
highly likely it will also lose next campaign, so it makes
more sense to prioritize campaigns by other nodes within
the current term.

Signed-off-by: Benjamin Wang <[email protected]>
…campaign

When a node lost the vote, it will be de-prioritized on the next campaign.
Its randomized election timeout is in range [2*electionTime, 3*electionTime).

Signed-off-by: Benjamin Wang <[email protected]>
@pav-kv (Contributor) commented Mar 7, 2024

I'm concerned with a duelling leaders situation. Can we have a second datadriven test (similar to the one you added) to test this scenario? Something like:

  1. Both nodes 1 and 2 campaign.
  2. Both lose (because the peer has already voted for itself).
    • As a result, both bump the election time by an extra electionTimeout?
  3. So now there is a temporary election stall? If we tick both nodes electionTimeout times, nobody will campaign. Someone will only campaign if we tick another electionTimeout times.

Maybe there is a 2* multiplier in this reasoning. But the idea is: wouldn't this fix introduce an extra electionTimeout delay in this case?
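
The worry can be made concrete with a toy model (my own sketch, not the raft code): if both duelling candidates are pushed into the delayed range, the earliest tick at which either can campaign again moves from electionTimeout up to 2*electionTimeout.

```go
package main

import (
	"fmt"
	"math/rand"
)

const electionTimeout = 10 // base election timeout in ticks

// nextCampaignTick draws a randomized election timeout; delayed models a
// candidate that just lost a duel and so draws from
// [2*electionTimeout, 3*electionTimeout) instead of
// [electionTimeout, 2*electionTimeout).
func nextCampaignTick(delayed bool) int {
	t := electionTimeout + rand.Intn(electionTimeout)
	if delayed {
		t += electionTimeout
	}
	return t
}

func main() {
	// Two duelling candidates both lose; with the fix, both are delayed,
	// so ticking electionTimeout times triggers nobody.
	a, b := nextCampaignTick(true), nextCampaignTick(true)
	earliest := a
	if b < a {
		earliest = b
	}
	fmt.Println(earliest >= 2*electionTimeout) // always true: an extra electionTimeout of stall
}
```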

@Cjen1 commented Mar 7, 2024

@erikgrinaker @pav-kv PTAL when you have time.

@Cjen1 do you have any immediate concern?

Thanks.

I am still concerned about this impacting the minimal viable recovery time, since if leaders duel it will double the length of each iteration (afaict).

I fully understand that this isn't a massive concern since the recommendation is to minimize any erroneous elections.

Is there an easy way for me to build a binary to test it in reckon? Otherwise if you swap out the etcd binary in reckon that will also work.

(Additionally I'm on holiday until Monday so have limited time to test this until then)

@ahrtr (Member, Author) commented Mar 7, 2024

I'm concerned with a duelling leaders situation.

3. So now there is a temporary election stall? If we tick both nodes electionTimeout times, nobody will campaign.

It's true if there are only two nodes in the cluster and the two nodes are campaigning at exactly the same time.

  • Firstly, the possibility is low due to the randomized election timeout. But as I mentioned in #169 (comment), a node that failed its campaign still has a 33% chance in a 3-node cluster of starting the next campaign before the other nodes without this PR.
  • Secondly, usually we recommend an odd number of cluster members, e.g. 3, 5, 7. As mentioned in #169 (comment), we don't delay all nodes' (or members') election timeouts; we only delay the nodes that failed the campaign. Among the other nodes, there are usually always some whose election timeout isn't updated/delayed.
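
A rough Monte Carlo sketch of this argument (my own toy model, not the raft code): draw each node's randomized timeout and count how often the lagging node fires strictly first. Ties go to the peers here, so the no-delay figure comes out slightly below the 33% quoted above.

```go
package main

import (
	"fmt"
	"math/rand"
)

const electionTimeout = 10 // base election timeout in ticks

// laggerFirstRate estimates how often a lagging node campaigns strictly
// before both healthy peers in a 3-node cluster. delayed=true models this
// PR: the lagger draws from [2*electionTimeout, 3*electionTimeout)
// instead of [electionTimeout, 2*electionTimeout).
func laggerFirstRate(delayed bool, trials int) float64 {
	wins := 0
	for i := 0; i < trials; i++ {
		lag := electionTimeout + rand.Intn(electionTimeout)
		if delayed {
			lag += electionTimeout
		}
		p1 := electionTimeout + rand.Intn(electionTimeout)
		p2 := electionTimeout + rand.Intn(electionTimeout)
		if lag < p1 && lag < p2 {
			wins++
		}
	}
	return float64(wins) / float64(trials)
}

func main() {
	// Without the delay the lagger fires first roughly 28-29% of the time
	// in this discrete model; with the delay it can never fire first.
	fmt.Printf("lagger first without delay: %.2f\n", laggerFirstRate(false, 100000))
	fmt.Printf("lagger first with delay:    %.2f\n", laggerFirstRate(true, 100000))
}
```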

Is there an easy way for me to build a binary to test it in reckon? Otherwise if you swap out the etcd binary in reckon that will also work.

Would it be enough to provide an etcd binary built on top of this PR? I am happy to help with the test.

@Cjen1 commented Mar 13, 2024

I've run this PR through reckon, using 100 leader loss events for each config, varying the number of nodes (3 and 5) and the election timeout.

(chart: fault-aggregate)

I think the result is that this PR has no significant effect on the average time to recovery of etcd.
There are some subtle changes in the tail: for example, I think there is evidence that it reduces the number of repeated elections but also prolongs each of those rounds, though I think it would require some analytical techniques to properly pull that apart.

@erikgrinaker (Contributor)

I share @pav-kv's concerns in #169 (comment), and also the concerns voiced in #167. Spurious election disruption is a known weakness when not using pre-vote and checkquorum, and it sounds like enabling those would adequately address etcd-io/etcd#17455. I think we should recommend employing those as a first measure, and only tweak election timeout policies when we see significant problems with those enabled.

@ahrtr (Member, Author) commented Mar 15, 2024

I think the result is that this PR has no significant effect on the average time to recovery of etcd.
There are some subtle changes in the tail for example I think there is evidence that it reduces the number of repeated elections, but also prolongs each of these rounds, but I think it would require some analytical techniques to properly pull that apart.

Thanks for the test result. My dev branch is based on etcd release-3.5, while your baseline is based on release-3.4; that might be another reason for the subtle changes.

Spurious election disruption is a known weakness when not using pre-vote and checkquorum, and it sounds like enabling those would adequately address etcd-io/etcd#17455.

The PR is trying to mitigate exactly this kind of known weakness. Not all applications enable pre-vote. Note that etcd release-3.4 is a stable release and has never enabled pre-vote.

As mentioned above in #169 (comment), a lagging node always has a 33% chance (in a 3-node cluster) of starting the campaign before the other nodes without this PR. The chance is 50% if there are only two available nodes.

But the possibility of a split vote in a 3-node cluster, based on §9.2 of the raft paper, is:

  • about 3% when l = 10% (l: the ratio of network latency to election timeout);
  • about 10% when l = 20%;
  • about 18% when l = 10% with only 2 available nodes;
  • about 35% when l = 20% with only 2 available nodes.

Obviously, 33% vs 3-10%, or 50% vs 18-35%: this PR improves the situation.


Successfully merging this pull request may close these issues.

[Robustness] Etcd v3.4 required more than expected leader elections
5 participants