-
If you don't have a cluster majority, you cannot use consistent queries. Arguably if this happens
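To put numbers on the majority requirement, here is a minimal sketch. The module and function names are hypothetical and not part of Ra. It only illustrates the arithmetic: a cluster of N members needs N div 2 + 1 reachable members before a leader can confirm its leadership and answer a consistent query.

```erlang
-module(quorum_sketch).
-export([quorum/1, has_majority/2]).

%% Smallest number of members that forms a majority of the cluster.
quorum(ClusterSize) when is_integer(ClusterSize), ClusterSize > 0 ->
    ClusterSize div 2 + 1.

%% True when the reachable members (the local one included) form a majority.
has_majority(Reachable, ClusterSize) ->
    Reachable >= quorum(ClusterSize).
```

With the 3-node cluster from the reproduction, two reachable members are enough and a single isolated member is not.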
-
I looked at the code and it seems like it should work fine. However, this is what I see when running the reproduction test:
I'll investigate.
-
Out of interest, I tried the reproduction steps and got a result similar to @orre's.
The output of the ra1 shell is here: https://gist.github.com/shino/d8091d0a3ece8156974c99efd021c599
-
@lukebakken The bug you're hitting occurs when a new cluster is re-declared with the same name while persisted data from a previous session still exists. This PR should address it: #339 (basically clobbering the old cluster when a new one is declared). I can also see a bug with consistent_query where the query indexes get out of sync after certain election events. I'm not sure how to fix it yet, but I'm working on it.
-
Ok, with the latest commits to #339 I left the test program running while I went to grab some lunch, and it was still running when I came back. @orre, give it a try. I will need to spend some more time reviewing the consistent query code to make sure it works as expected before we can merge this change, but I think I know what the problem was and should at least have solved the liveness issue.
-
Ok @kjnilsson, I've been trying to reproduce the problem all morning, but it seems to be gone with your PR in place! Thanks to everyone who helped out.
-
Ok, the fix is now in #340
-
[I'm using RA v2.4.0]
We have a lot of "loss of majority" situations in our RA clusters due to "natural causes" that we cannot affect.
I've observed a problem: ra:consistent_query seems to time out or hang in situations where it should not hang at all, according to my (possibly limited) knowledge of RAFT/RA. It goes so far that it hangs forever (or at least for a very long time) even when the cluster has a perfectly good majority and an elected leader.
When this occurs, ra:leader_query always works perfectly fine AND it is possible to commit to the log. It is only ra:consistent_query that mysteriously hangs.
I have created a "reproducer" repo here
NB: hostname (and possibly also your domain) must be adapted in src/ra_kv_store.erl (function: servers()) before running.
Steps to reproduce:
Start 3 erlang nodes
Start cluster from ra2
Run test cycle
Repeating the test cycle will eventually block for a long time.
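One way to observe the difference between the two query types during a test cycle is sketched below. It is not taken from the reproducer repo. The module name query_probe and the identity query fun are assumptions for illustration, and the return shapes handled in classify/1 reflect my understanding of the Ra 2.x API, so check them against the version you run.

```erlang
-module(query_probe).
-export([probe/2, classify/1]).

%% ServerId is a Ra server id, e.g. {ra_kv, 'ra1@myhost'} (hypothetical name).
%% Runs both query types with the same bounded timeout and reports
%% how each one ended, so a hang shows up as 'timeout' instead of blocking.
probe(ServerId, TimeoutMs) ->
    QueryFun = fun(State) -> State end,
    Consistent = safe(fun() -> ra:consistent_query(ServerId, QueryFun, TimeoutMs) end),
    Leader = safe(fun() -> ra:leader_query(ServerId, QueryFun, TimeoutMs) end),
    [{consistent_query, Consistent}, {leader_query, Leader}].

%% Guard against the call exiting instead of returning a timeout tuple.
safe(Call) ->
    try Call() of
        Result -> classify(Result)
    catch
        exit:Reason -> {exit, Reason}
    end.

%% Reduce the assumed return shapes to a short status.
classify({ok, _Reply, _Leader}) -> ok;
classify({timeout, _ServerId}) -> timeout;
classify({error, Reason}) -> {error, Reason};
classify(Other) -> {unexpected, Other}.
```

Calling query_probe:probe(ServerId, 5000) after each cycle would show the situation described above as a timeout for consistent_query next to ok for leader_query.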
Example session
So what's going on here? Why is ra:consistent_query blocking?
Thanks
Örjan