Figure out why raft quorum bug wasn't detected #28

Open
cole-miller opened this issue Sep 15, 2022 · 4 comments
Labels
bug Something isn't working enhancement New feature or request question Further information is requested

Comments

@cole-miller
Contributor

Apparently our Jepsen tests weren't able to detect the bug in our implementation of the raft quorum logic that's addressed by canonical/raft#302. We should figure out why not, and strengthen the tests so that they successfully detect the bug.

@cole-miller
Contributor Author

cole-miller commented Sep 16, 2022

It's possible that this comes down to timing. For the bug to manifest, a leader needs to replicate a log entry from its term to a majority of nodes, but then crash/go offline before communicating the new commit index to the node that will become the next leader. Could it be that that window of time is just too short for Jepsen to have a good chance of hitting it?
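As a rough way to reason about that window, here is a toy model. It is not taken from the raft code or the Jepsen tests. It assumes crashes land at uniformly random times, and that each vulnerable window (entry replicated to a majority, commit index not yet sent to the future leader) has a fixed length and does not overlap with the others. `hit_probability` and all of its parameters are hypothetical names.

```python
def hit_probability(window_ms, windows_per_run, run_ms, crashes):
    """Chance that at least one crash lands inside a vulnerable window.

    Toy assumptions: crash times are uniform over the run, and the
    vulnerable windows are fixed-length and non-overlapping.
    """
    # Fraction of the run during which a crash would expose the bug.
    p_single = min(1.0, window_ms * windows_per_run / run_ms)
    # Probability that none of the independent crashes hit a window,
    # subtracted from 1.
    return 1.0 - (1.0 - p_single) ** crashes


# Illustrative numbers only: 100 windows in a 100 s run with 10 crashes.
narrow = hit_probability(window_ms=1, windows_per_run=100, run_ms=100_000, crashes=10)
wide = hit_probability(window_ms=10, windows_per_run=100, run_ms=100_000, crashes=10)
```

Under these assumptions, and while the probabilities stay small, a window ten times wider gives roughly ten times the chance of a hit.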

@freeekanayaka
Contributor

That sounds plausible to me.

@cole-miller cole-miller self-assigned this Sep 21, 2022
@MathieuBordere
Contributor

> It's possible that this comes down to timing. For the bug to manifest, a leader needs to replicate a log entry from its term to a majority of nodes, but then crash/go offline before communicating the new commit index to the node that will become the next leader. Could it be that that window of time is just too short for Jepsen to have a good chance of hitting it?

We should have a better chance of hitting it if we increase the heartbeat interval. If I'm not mistaken, the heartbeat interval is determined by the network latency setting. We could randomize the network latency in the tests to try to hit more timing-sensitive bugs.
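A minimal sketch of what randomizing the latency per run could look like, assuming the test harness accepts a latency value in milliseconds. The function name, the range bounds, and the log-uniform distribution are all assumptions chosen for illustration, not the actual test configuration. Seeding the generator keeps a failing run reproducible.

```python
import math
import random


def random_latency_ms(rng, low_ms=5.0, high_ms=500.0):
    """Draw a latency log-uniformly from [low_ms, high_ms].

    Log-uniform sampling spreads runs evenly across orders of magnitude,
    so both short and long latencies get exercised. The bounds are
    placeholders, not values from the real tests.
    """
    value = math.exp(rng.uniform(math.log(low_ms), math.log(high_ms)))
    # Clamp to guard against floating-point rounding at the edges.
    return min(high_ms, max(low_ms, value))


# Record the seed alongside the test results so the run can be replayed.
rng = random.Random(12345)
latency_for_this_run = random_latency_ms(rng)
```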

@freeekanayaka
Contributor

> > It's possible that this comes down to timing. For the bug to manifest, a leader needs to replicate a log entry from its term to a majority of nodes, but then crash/go offline before communicating the new commit index to the node that will become the next leader. Could it be that that window of time is just too short for Jepsen to have a good chance of hitting it?
>
> We should have a better chance of hitting it if we increase the heartbeat interval. If I'm not mistaken, the heartbeat interval is determined by the network latency setting. We could randomize the network latency in the tests to try to hit more timing-sensitive bugs.

That sounds like a good idea, regardless of whether it helps trigger this specific bug. Rather than randomizing it, perhaps we could just set it very high (e.g. 10x the current value or more) and run all the tests with that high setting, as well as with the normal default setting of course.
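One way to picture that is a small test matrix that pairs every workload with both the default latency and a 10x value. This is only a sketch: `latency_matrix`, the workload names, and the default of 5 ms are hypothetical placeholders rather than the real test setup.

```python
from itertools import product


def latency_matrix(workloads, default_latency_ms, multipliers=(1, 10)):
    """Build one test configuration per (workload, latency) combination."""
    return [
        {"workload": workload, "latency_ms": default_latency_ms * multiplier}
        for workload, multiplier in product(workloads, multipliers)
    ]


# Placeholder workload names and default latency, for illustration only.
configs = latency_matrix(["workload-a", "workload-b"], default_latency_ms=5)
```

Each workload then runs twice, once at the default and once at the high setting, so a regression at either end shows up in the same test pass.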

@MathieuBordere MathieuBordere added bug Something isn't working enhancement New feature or request question Further information is requested labels Jun 12, 2023
@cole-miller cole-miller removed their assignment Sep 2, 2024