
Verify nodemanager behavior on connectivity changes #86

Open
choksi81 opened this issue Jun 2, 2014 · 1 comment

choksi81 commented Jun 2, 2014

Thanks to the recently added Affix support, the Seattle nodemanager is now contactable even when the node is behind a NAT. We must now ensure that transitions between NAT and connectivity states always leave the nodemanager contactable.

Previously (without NAT traversal), the nodemanager would detect either that its public node IP had changed to a different public IP, or that connectivity had been interrupted altogether. In the latter case, the nodemanager would retry repeatedly until connectivity was restored. When, on the other hand, the IP address changed, it would stop its current advertise thread and start a new one advertising the node's new address and port.
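For reference, here is a minimal sketch of that older monitoring loop in plain Python. The helper names (get_public_ip, AdvertiseThread, monitor) are hypothetical, not the actual nodemanager code; the sketch only illustrates the two cases described above.

```python
import socket
import threading
import time

def get_public_ip():
    """Hypothetical helper: approximate the node's current IP via the
    UDP-connect trick; returns None when there is no connectivity."""
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.connect(("8.8.8.8", 80))   # no packets are actually sent
        ip = s.getsockname()[0]
        s.close()
        return ip
    except OSError:
        return None

class AdvertiseThread(threading.Thread):
    """Hypothetical stand-in for the nodemanager's advertise thread."""
    def __init__(self, ip, port):
        super().__init__(daemon=True)
        self.ip, self.port = ip, port
        self._stopped = threading.Event()

    def run(self):
        while not self._stopped.is_set():
            # ... advertise (self.ip, self.port) to the lookup service ...
            self._stopped.wait(60)

    def stop(self):
        self._stopped.set()

def monitor(port, poll_interval=10):
    current_ip = get_public_ip()
    advertiser = AdvertiseThread(current_ip, port)
    advertiser.start()
    while True:
        time.sleep(poll_interval)
        new_ip = get_public_ip()
        if new_ip is None:
            continue          # connectivity lost: keep retrying
        if new_ip != current_ip:
            # IP changed: stop the old advertise thread, start a new one.
            advertiser.stop()
            current_ip = new_ip
            advertiser = AdvertiseThread(current_ip, port)
            advertiser.start()
```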

With support for NAT traversal in the nodemanager, there are many additional states and transitions between states to consider: private-to-new-private, public-to-private, private-to-public, and also "flapping" (on, off, on again) connectivity with no IP address change. Furthermore, parts of the Affix stack might notice the lack of connectivity or the change of address before the main nodemanager logic triggers, which makes the problem a little more difficult.
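To make the transition space concrete, here is a hedged sketch that labels an old/new address pair with one of the cases above, using Python's standard ipaddress module (classify_transition is an illustrative name, not existing Seattle code):

```python
import ipaddress

def classify_transition(old_ip, new_ip):
    """Label an old/new node-IP pair with one of the transition cases
    discussed above. Purely illustrative."""
    if new_ip is None:
        return "connectivity-lost"
    if old_ip == new_ip:
        return "flap"  # connectivity went away and came back, same IP
    old_private = ipaddress.ip_address(old_ip).is_private
    new_private = ipaddress.ip_address(new_ip).is_private
    if old_private and new_private:
        return "private-to-new-private"
    if not old_private and new_private:
        return "public-to-private"
    if old_private and not new_private:
        return "private-to-public"
    return "public-to-new-public"

# Example: classify_transition("192.168.1.5", "10.0.0.7")
#          -> "private-to-new-private"
```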

Task: For all of these scenarios (those involving Affixes and those that don't), ensure that the nodemanager is contactable again within a few (tens of) seconds of reconfiguration. Also make sure that the old advertised values are no longer advertised, and that appropriate new values start (and then continue) to be advertised.
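One way to script this verification is sketched below. Since the exact advertise-lookup and nodemanager-contact APIs are not specified here, they are passed in as caller-supplied callables; everything about this harness is an assumption for illustration.

```python
import time

def verify_contactable(lookup, contact, old_values, timeout=60, poll=5):
    """Hypothetical test harness. `lookup` returns the list of (ip, port)
    pairs currently advertised for the node; `contact` returns True if
    the nodemanager answers at the given address. The check passes once
    the old values are gone, new values are advertised, and the
    nodemanager is reachable at one of them."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        advertised = set(lookup())
        stale = advertised & set(old_values)
        fresh = advertised - set(old_values)
        if not stale and fresh:
            ip, port = next(iter(fresh))
            if contact(ip, port):
                return True
        time.sleep(poll)
    return False
```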

choksi81 self-assigned this Jun 2, 2014

choksi81 commented Jun 3, 2014

Testing with "flapping private IP": My node is connected to a NAT forwarder when the interface goes down for 7 seconds. 127.0.0.1 is the new perceived node IP; looking up Afffix-enable doesn't succeed (obviously). When the interface comes back up, the old private IP is restored. Retrying to contact the NAT forwarder results in "Error: There is a duplicate connection which conflicts with the request!" -- seemingly, the old socket to the forwarder was never closed when the node IP changed, and we retry with the same source port number (cf. #1397). This error turns up multiple times as nmmain keeps retrying until it gives up.
At the same time, since the old socket is still there, the connection to the NAT forwarder appears never to have gone down! The NAT forwarder didn't notice either: it received no FINs or RSTs, and the node was disconnected from the network only briefly, so the forwarder's TCP stack didn't time out. Consequently, I can still seash into the node once it is back up, despite the fact that no new (post-flap) connection to the NAT forwarder could be established.
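The pitfall can be illustrated with plain socket code (a sketch, not the Affix stack; reconnect_to_forwarder is a hypothetical name): close the stale socket before redialing, and let the OS pick a fresh ephemeral source port rather than reusing the old one.

```python
import socket

def reconnect_to_forwarder(old_sock, host, port):
    """Sketch: tear down the stale connection before dialing again.
    Leaving old_sock open while redialing from the same source port is
    what produces the duplicate-connection conflict described above."""
    if old_sock is not None:
        try:
            # Frees the local port; also sends a FIN if the old
            # connection is still routable.
            old_sock.close()
        except OSError:
            pass
    # No explicit bind(): the OS assigns a fresh ephemeral source port.
    return socket.create_connection((host, port), timeout=10)
```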
Unless we code up a mobility Affix that takes care of such situations, my proposal is to treat any change in IP address as a fatal, unrecoverable discontinuity for all current connections, which must consequently be torn down and set up anew. This causes some chatter in the case of very frequent flapping (which I consider unlikely, even given the connectivity patterns my nodes exhibit), but it makes the additional required logic stateless and thus rather simple.
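A sketch of that stateless policy (ConnectionManager and dial are illustrative names, not existing code):

```python
class ConnectionManager:
    """Illustrative sketch of the proposed policy: any IP change is a
    fatal discontinuity, so every open connection is torn down and
    re-established, with no per-connection state to consult."""

    def __init__(self, dial):
        self.dial = dial        # callable: endpoint -> open connection
        self.connections = {}   # endpoint -> open connection
        self.current_ip = None

    def on_ip_change(self, new_ip):
        if new_ip == self.current_ip:
            return  # no address change: keep existing connections
        # Unconditionally tear down every connection ...
        for conn in self.connections.values():
            try:
                conn.close()
            except OSError:
                pass
        # ... and redial each endpoint from scratch on the new address.
        self.connections = {ep: self.dial(ep) for ep in self.connections}
        self.current_ip = new_ip
```

Frequent flapping then costs a full teardown and redial each time, but the only bookkeeping required is the list of endpoints.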
