Deadlock on simultaneous nodeup #91

Open
kzemek opened this issue Jul 12, 2018 · 6 comments

kzemek commented Jul 12, 2018

I'm having an issue similar to #60, reproducible very often when I bring up containers with the app at roughly the same time. It looks like each node is waiting for another one, and they're perpetually stuck in the :syncing state. Here are the :sys.get_status(Swarm.Tracker) results from my 5 nodes: https://pastebin.com/EYLg6YNE . No custom options are set, everything is default; clustering uses the libcluster gossip strategy.

kzemek added a commit to kzemek/swarm-deadlock-repro that referenced this issue Jul 12, 2018

kzemek commented Jul 12, 2018

Please see https://github.com/kzemek/swarm-deadlock-repro for reliable reproduction of the issue.

kzemek commented Jul 12, 2018

These are the logs produced with debug: true: https://gist.github.com/kzemek/f5067111dcc5b1c40803b12966cd1a1f#file-gistfile1-txt
There are no more debug logs after that point.

kzemek commented Jul 12, 2018

I've also tried manipulating the choice of sync node in hopes that it would solve the lock: kzemek@28516d9

But instead, the states of the Swarm.Tracker processes got stranger: https://gist.github.com/kzemek/f5067111dcc5b1c40803b12966cd1a1f#file-nodes_sync_to_smallest-txt

All nodes tried to sync to repro_2 (the "smallest" node), except repro_2 itself, which synced to repro_3. repro_3 synced successfully and was put into the :tracking state, while at the same time repro_2 was put into :awaiting_sync_ack and sent the cast {sync_recv,<16250.182.0>,{{0,1},0},[]} to repro_3. But the sync_recv cast is not handled in the :tracking state, so repro_2 got stuck, and so did the other nodes that tried to sync to it.
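
To illustrate why an unhandled cast leaves the sender stuck, here is a minimal sketch in Python (not Swarm's actual code). The handler table and the `sync_ack` reply name are assumptions made for illustration; the only thing taken from the observation above is that `sync_recv` has no handler in the tracking state.

```python
# Hypothetical handler table: (state, event) -> (new_state, reply).
# Only the syncing state knows what to do with sync_recv in this model.
HANDLERS = {
    ("syncing", "sync_recv"): ("tracking", "sync_ack"),
}

def deliver(state, event):
    """Return (new_state, reply); reply is None when the event is dropped."""
    if (state, event) in HANDLERS:
        return HANDLERS[(state, event)]
    # No handler for this state: the event is silently dropped, so the
    # sender never receives the reply it is waiting for.
    return state, None

print(deliver("syncing", "sync_recv"))
print(deliver("tracking", "sync_recv"))
```

In this model the second call returns no reply, which corresponds to repro_2 waiting in :awaiting_sync_ack indefinitely.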

kzemek commented Jul 12, 2018

This particular issue does not occur after reverting to commit c305633 (before 412bad9). The nodes all go into the :tracking state almost instantly.

@joxford531

Seeing this issue as well. When I revert to version 3.1 I don't see any problems with deadlocking on startup.

@malmovich

We've been having this issue as well, and I'm pretty sure we also had it in 3.3.1.

In our case we observed the following scenario. Let's say we have nodes A, B and C, and the following happens:
A - :sync -> B
B - :sync -> C
C - :sync -> A

All nodes are now in the :syncing state, each waiting for a :sync_recv message that never arrives.
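
The scenario above is a cycle in a wait-for graph. As a rough sketch in Python (a model of the situation, not Swarm code; the function name and the dictionary layout are made up for illustration), the cycle can be detected like this:

```python
def find_sync_cycle(sync_target):
    """Return the nodes forming a wait cycle, or [] if there is none.

    sync_target maps each syncing node to the node it sent :sync to.
    """
    for start in sync_target:
        path = []
        node = start
        # Follow the chain of sync requests until it ends or repeats.
        while node in sync_target and node not in path:
            path.append(node)
            node = sync_target[node]
        if node in path:
            # The chain came back to a node already visited: a cycle.
            return path[path.index(node):]
    return []

waiting = {"A": "B", "B": "C", "C": "A"}
print(find_sync_cycle(waiting))  # ['A', 'B', 'C']
```

Any non-empty result means every node in the list is waiting on another node in the same list, so none of them can leave the syncing state on its own.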

So far we have resolved this with a state timeout in syncing, which stops the sync and tries another node. It seems to work fine; however, this approach introduced a few complications and made the code a bit more complex. So a simpler approach could be to drop the pending_sync_request strategy and just decline the sync request while syncing.
