Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leader election issues? #6

Open
prologic opened this issue Jun 7, 2021 · 6 comments
Open

Leader election issues? #6

prologic opened this issue Jun 7, 2021 · 6 comments

Comments

@prologic
Copy link

prologic commented Jun 7, 2021

HI @tidwall 👋

Getting back to playing with your library again and making some updates/fixes to my bitraft I noticed some strange behaviour I'm wondering if you can help figure out.

I form a 3-node cluster:

Node 1:

james@Jamess-MacBook-Pro
Mon Jun 07 11:22:17
~/tmp/bitraft
 (upgrade_finn_uhaha) 0
$ ./bitraft -i 1 -p ./data1 -b 0.0.0.0:4920
INFO[0000] 63366:S 07 Jun 2021 11:22:21.566 # starting bitraft version 0.0.1@HEAD
INFO[0000] 63366:S 07 Jun 2021 11:22:21.861 * synchronized time
INFO[0000] 63366:S 07 Jun 2021 11:22:21.861 * server listening at [::]:4920
INFO[0000] 63366:S 07 Jun 2021 11:22:21.861 * server advertising as 0.0.0.0:4920
INFO[0000] 63366:S 07 Jun 2021 11:22:21.935 * initial configuration: index=0 servers=[]
INFO[0000] 63366:S 07 Jun 2021 11:22:21.935 * bootstrapping new cluster
INFO[0000] 63366:F 07 Jun 2021 11:22:21.935 * entering follower state: follower="Node at [::]:4920 [Follower]" leader=
INFO[0001] 63366:F 07 Jun 2021 11:22:23.366 # heartbeat timeout reached, starting election: last-leader=
INFO[0001] 63366:C 07 Jun 2021 11:22:23.366 * entering candidate state: node="Node at [::]:4920 [Candidate]" term=2
INFO[0001] 63366:C 07 Jun 2021 11:22:23.366 * election won: tally=1
INFO[0001] 63366:L 07 Jun 2021 11:22:23.366 * entering leader state: leader="Node at [::]:4920 [Leader]"
INFO[0002] 63366:L 07 Jun 2021 11:22:23.937 * logs loaded: ready for commands
INFO[0044] 63366:L 07 Jun 2021 11:23:06.027 * updating configuration: command=AddStaging server-id=2 server-addr=0.0.0.0:4921 servers="[{Suffrage:Voter ID:1 Address:0.0.0.0:4920} {Suffrage:Voter ID:2 Address:0.0.0.0:4921}]"
INFO[0044] 63366:L 07 Jun 2021 11:23:06.027 * added peer, starting replication: peer=2
INFO[0044] 63366:L 07 Jun 2021 11:23:06.029 # appendEntries rejected, sending older logs: peer="{Voter 2 0.0.0.0:4921}" next=1
INFO[0044] 63366:L 07 Jun 2021 11:23:06.030 * pipelining replication: peer="{Voter 2 0.0.0.0:4921}"
INFO[0106] 63366:L 07 Jun 2021 11:24:07.976 * updating configuration: command=AddStaging server-id=3 server-addr=0.0.0.0:4922 servers="[{Suffrage:Voter ID:1 Address:0.0.0.0:4920} {Suffrage:Voter ID:2 Address:0.0.0.0:4921} {Suffrage:Voter ID:3 Address:0.0.0.0:4922}]"
INFO[0106] 63366:L 07 Jun 2021 11:24:07.976 * added peer, starting replication: peer=3
INFO[0106] 63366:L 07 Jun 2021 11:24:07.978 # appendEntries rejected, sending older logs: peer="{Voter 3 0.0.0.0:4922}" next=1
INFO[0106] 63366:L 07 Jun 2021 11:24:07.982 * pipelining replication: peer="{Voter 3 0.0.0.0:4922}"

INFO[0300] 63366:L 07 Jun 2021 11:27:22.554 * aborting pipeline replication: peer="{Voter 3 0.0.0.0:4922}"
INFO[0300] 63366:L 07 Jun 2021 11:27:22.565 # failed to heartbeat to: peer=0.0.0.0:4922 error=EOF
INFO[0301] 63366:L 07 Jun 2021 11:27:23.055 # failed to contact: server-id=3 time=501.441291ms
INFO[0301] 63366:L 07 Jun 2021 11:27:23.531 # failed to contact: server-id=3 time=977.02433ms
INFO[0302] 63366:L 07 Jun 2021 11:27:23.622 # failed to appendEntries to: peer="{Voter 3 0.0.0.0:4922}" error="dial tcp 0.0.0.0:4922: connect: connection refused"
INFO[0302] 63366:L 07 Jun 2021 11:27:23.682 # failed to heartbeat to: peer=0.0.0.0:4922 error="dial tcp 0.0.0.0:4922: connect: connection refused"
INFO[0302] 63366:L 07 Jun 2021 11:27:24.007 # failed to contact: server-id=3 time=1.453318097s
INFO[0303] 63366:L 07 Jun 2021 11:27:24.633 # failed to appendEntries to: peer="{Voter 3 0.0.0.0:4922}" error="dial tcp 0.0.0.0:4922: connect: connection refused"
INFO[0303] 63366:L 07 Jun 2021 11:27:24.867 # failed to heartbeat to: peer=0.0.0.0:4922 error="dial tcp 0.0.0.0:4922: connect: connection refused"
INFO[0304] 63366:L 07 Jun 2021 11:27:25.644 # failed to appendEntries to: peer="{Voter 3 0.0.0.0:4922}" error="dial tcp 0.0.0.0:4922: connect: connection refused"
INFO[0304] 63366:L 07 Jun 2021 11:27:25.672 * pipelining replication: peer="{Voter 3 0.0.0.0:4922}"
INFO[0304] 63366:L 07 Jun 2021 11:27:26.007 # failed to heartbeat to: peer=0.0.0.0:4922 error="dial tcp 0.0.0.0:4922: connect: connection refused"

Node 2:

james@Jamess-MacBook-Pro
Mon Jun 07 11:22:46
~/tmp/bitraft
 (upgrade_finn_uhaha) 0
$ ./bitraft -i 2 -p ./data2 -b 0.0.0.0:4921 -j 127.0.0.1:4920
INFO[0000] 63379:S 07 Jun 2021 11:23:05.679 # starting bitraft version 0.0.1@HEAD
INFO[0000] 63379:S 07 Jun 2021 11:23:05.956 * synchronized time
INFO[0000] 63379:S 07 Jun 2021 11:23:05.956 * server listening at [::]:4921
INFO[0000] 63379:S 07 Jun 2021 11:23:05.956 * server advertising as 0.0.0.0:4921
INFO[0000] 63379:S 07 Jun 2021 11:23:06.027 * initial configuration: index=0 servers=[]
INFO[0000] 63379:S 07 Jun 2021 11:23:06.027 * joining existing cluster at 127.0.0.1:4920
INFO[0000] 63379:F 07 Jun 2021 11:23:06.027 * entering follower state: follower="Node at [::]:4921 [Follower]" leader=
INFO[0000] 63379:F 07 Jun 2021 11:23:06.029 # failed to get previous log: previous-index=214 last-index=0 error="log not found"

Node 3:

james@Jamess-MacBook-Pro
Mon Jun 07 11:27:22
~/tmp/bitraft
 (upgrade_finn_uhaha) 130
$ ./bitraft -i 3 -p ./data3 -b 0.0.0.0:4922 -j 127.0.0.1:4920
INFO[0000] 63413:S 07 Jun 2021 11:27:25.137 # starting bitraft version 0.0.1@HEAD
INFO[0000] 63413:S 07 Jun 2021 11:27:25.427 * synchronized time
INFO[0000] 63413:S 07 Jun 2021 11:27:25.427 * server listening at [::]:4922
INFO[0000] 63413:S 07 Jun 2021 11:27:25.427 * server advertising as 0.0.0.0:4922
INFO[0000] 63413:S 07 Jun 2021 11:27:25.534 * initial configuration: index=522 servers="[{Suffrage:Voter ID:1 Address:0.0.0.0:4920} {Suffrage:Voter ID:2 Address:0.0.0.0:4921} {Suffrage:Voter ID:3 Address:0.0.0.0:4922}]"
INFO[0000] 63413:S 07 Jun 2021 11:27:25.534 # ignoring join request because server already belongs to a cluster
INFO[0000] 63413:F 07 Jun 2021 11:27:25.534 * entering follower state: follower="Node at [::]:4922 [Follower]" leader=

Then I query the cluster with RAFT LEADER and RAFT SERVER LIST:

$ telnet localhost 4920
Trying ::1...
Connected to localhost.
Escape character is '^]'.
RAFT LEADER
$0

RAFT SERVER LIST
*3
*6
$2
id
$1
1
$7
address
$12
0.0.0.0:4920
$6
leader
$5
false
*6
$2
id
$1
2
$7
address
$12
0.0.0.0:4921
$6
leader
$5
false
*6
$2
id
$1
3
$7
address
$12
0.0.0.0:4922
$6
leader
$5
false

It would appear nobody knows who the leader is? However writes do work as well as reads and there no obvious errors on Node 1 (besides what you see above).

SET foo bar
+OK
GET foo
$3
bar
@tidwall
Copy link
Owner

tidwall commented Jun 7, 2021

Hi @prologic

I just tested on my side and I wasn't able to get a blank result.

I think that this issue was addresses in v0.8.0.

@tidwall
Copy link
Owner

tidwall commented Jun 7, 2021

Though unlikely, perhaps there's an inconsistency in the library due to the platform.
I would like to reproduce what you are seeing.
Is the bitraft branch that you are seeing the error with publicly available?

@tidwall
Copy link
Owner

tidwall commented Jun 7, 2021

I noticed that your '0.0.0.0:4920' is advertising both as '[::]:4920' and '0.0.0.0:4920' in your logs.
I don't know if this is the cause or not, but I just pushed a commit that disables the ipv6 listener.
For my use cases, I don't need ipv6. And it should clear up the mismatch.

@prologic
Copy link
Author

prologic commented Jun 7, 2021

Is the bitraft branch that you are seeing the error with publicly available?

Yes. https://github.com/prologic/bitraft/tree/upgrade_finn_uhaha

@prologic
Copy link
Author

prologic commented Jun 7, 2021

Okay I'll try upgrading the version of uhaha I'm using and report back. Thanks!

@pims
Copy link

pims commented Aug 5, 2021

I noticed that your '0.0.0.0:4920' is advertising both as '[::]:4920' and '0.0.0.0:4920' in your logs.
I don't know if this is the cause or not, but I just pushed a commit that disables the ipv6 listener.
For my use cases, I don't need ipv6. And it should clear up the mismatch.

That was the issue for me. Upgrading fixed it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants