
MTU discovery improvements #61

Merged
merged 19 commits into from
Jan 10, 2015

Conversation


@dechamps dechamps commented Jan 2, 2015

As promised, here is the result of my work on tinc's MTU discovery code.

This massive pull request has three major goals, which are described below.

Code simplification

The current MTU discovery code is quite difficult to understand. The crux of the problem is the send_mtu_probe_handler() function which implements a complicated and hard to read state machine.

I believe the main reason for the complexity is that the MTU discovery code is trying to do too many things at once: it's trying to estimate the MTU for the link, while making sure UDP connectivity is still up and running. These are two separate concerns intermingled in the same code, which increases state machine complexity.

In this pull request, I separate these two concerns by introducing a new concept, "UDP discovery". UDP discovery concentrates on keeping the UDP tunnel alive and checking its status, while MTU discovery concentrates on improving and fixing tunnel MTU as soon as UDP discovery establishes a working tunnel. This massively simplifies the MTU discovery code, which as a result only has two states to care about: initial discovery and "steady-state" (i.e. maxmtu checking). Separating these two concerns also makes it easier to implement the next improvements described below.

Less network chatter

Something that's been grinding my gears for quite some time is tinc's eagerness to send probes when the original reason for sending them is gone. Consider a scenario in which a node makes a DNS request to another node over the VPN, and then the link between these two goes silent. From the high-level VPN point of view, that's one single packet going over the wire in each direction. In response to this, tinc will start MTU discovery, which in some cases can result in 90 packets (~126KB) going over the wire. Then, it will continue maintaining the tunnel basically forever, sending 3 packets (~4KB) every few seconds. This is in response to sending one single small VPN packet. The resulting write amplification is somewhat absurd.

While this is not a critical problem per se, it results in lots of network pollution, and makes tinc look "chatty" and "verbose". The inefficiency is detrimental to mobile devices (battery time) and constrained network links. More importantly, it does not scale: if a tinc node sends one packet to 1000 other nodes and then goes silent (e.g. P2P application), the resulting chatter will look truly awful.

The solution implemented in this pull request is simple: if you're not sending any real VPN packets, then you don't get to send any probes. Probes can only be sent from the send_packet() TX path. No packets, no probes. This results in the following behavior:

  • If a node sends one packet and then goes silent, one single probe might be sent (well, technically, maybe two, if both UDP discovery and MTU discovery want one). That's it. After that, the physical link goes completely silent.
  • If a node sends a slow trickle of packets, UDP/MTU discovery won't go faster than the packet rate. This way, we ensure we're not sending inappropriate amounts of control traffic compared to real traffic.
  • If a node sends packets at a high rate (e.g. downloading), then UDP/MTU discovery behaves like it currently does - it will go at full speed. Indeed, if the tunnel is heavily used, it makes sense to optimize it as much as we can.
  • If a link goes silent for a while (UDPDiscoveryTimeout), tinc can't assume the UDP link is usable anymore. Therefore it will reset UDP discovery and MTU discovery information. If the link gets used again, the first packet will go over TCP (which is fine), while tinc re-establishes a UDP tunnel and rediscovers the MTU. The latter operation is not really that expensive thanks to the next improvement below.

The keyword here is adaptive. tinc adapts to the real usage of the tunnel between two nodes, and will send probes more or less aggressively depending on how much the tunnel is used.
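That adaptive behavior boils down to a rate check that only ever runs from the TX path. A minimal sketch, with illustrative names and fields (not tinc's actual code):

```c
#include <stdbool.h>

/* Illustrative tunnel state; the names and the interval are assumptions,
 * not tinc's actual defaults. */
struct tunnel {
    double last_probe;      /* when the last probe went out */
    double probe_interval;  /* steady-state spacing, e.g. a few seconds */
    int probes_sent;
};

/* Called at most once per real outgoing packet, so probe traffic can
 * never exceed the real packet rate: no packets, no probes. */
static bool maybe_probe(struct tunnel *t, double now) {
    if (now - t->last_probe < t->probe_interval)
        return false;
    t->last_probe = now;
    t->probes_sent++;
    return true;
}
```

A node sending a single packet triggers at most one probe; a slow trickle yields probes no faster than the trickle; a busy tunnel gets probed at the full configured rate.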

Improved MTU discovery algorithm

This pull request also contains two changes to the core of the MTU discovery algorithm itself, yielding massive improvements with regard to convergence time and accuracy.

The first change makes the MTU probing code send one packet at a time instead of three. The frequency is adjusted from 1 second to 0.3 seconds to compensate, so that the number of packets per second stays the same. The rationale is that it leads to more efficient feedback, as the size of each probe is based on the previous replies; therefore, allowing replies to come in between probes decreases the total number of probe packets required to converge.

The second change moves from a naive random probe size model to a more sophisticated mathematical model with the goal of more efficiently covering the remaining MTU interval. This leads to massive improvements in convergence time and accuracy.

With these two improvements, tinc is typically able to converge on the precise (byte-level accuracy) MTU in less than 10 probe packets (3 seconds!). In contrast, the current random algorithm gives highly variable results, sometimes using 6 times more probe packets to even get remotely close to the actual MTU.

This pull request supersedes #45.

This cleans up the PMTU probing function a little bit. It moves the
low-level sending of packets to a separate function, so that the code
reads naturally instead of using a weird for loop with "special
indexes". In addition, comments are moved inside the body of the
function for additional context.

This shouldn't introduce any change of behavior, except for local
discovery which has some minor logic fixes and which now always uses
small packets (16 bytes) because there's no need for a full-length
probe just to try the local network.
Currently, the TX path (starting from send_packet()) in tinc has three
responsibilities:

 - Making sure packets can be sent (e.g. fetching SPTPS keys);
 - Making sure they can be sent optimally (e.g. fetching non-SPTPS keys
   so that UDP can be used);
 - Sending the actual packet, if feasible.

The first two are closely related; the third one, however, can be
cleanly separated from the other two - meaning, we can loosen code
coupling between sending packets and "optimizing" the way packets are
sent. This will become increasingly important as future commits will
move more tunnel establishment and maintenance code into the TX path,
so we will benefit from a cleaner separation of concerns.

This is especially relevant because of the dual nature of the TX path
(SPTPS versus non-SPTPS), which can make things really complicated when
trying to share low-level code between both.

In this commit, code related to establishing or improving tunnels is
moved away from the core TX path by introducing the "try_*()" family of
functions, of which try_sptps() already existed before this commit.

This is a pure refactoring; this commit shouldn't introduce any change
in behavior.
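A minimal sketch of the resulting structure (the try_*() names follow the text, but these stub bodies just count calls; tinc's real node_t and functions are far richer):

```c
#include <stdbool.h>

/* Illustrative node state. */
struct node {
    bool udp_confirmed;
    int udp_probes, mtu_probes, tcp_fallbacks;
};

static void try_udp(struct node *n) { n->udp_probes++; }
static void try_mtu(struct node *n) { n->mtu_probes++; }

/* Tunnel establishment/maintenance, cleanly separated from sending. */
static void try_tx(struct node *n) {
    try_udp(n);                 /* may emit a lightweight UDP probe */
    if (n->udp_confirmed)
        try_mtu(n);             /* refine the MTU only on a working tunnel */
}

/* The core TX path: probes only ever ride along with real packets. */
static void send_packet(struct node *n) {
    try_tx(n);
    if (!n->udp_confirmed)
        n->tcp_fallbacks++;     /* deliver via TCP until UDP is confirmed */
}
```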
This moves related functions together. try_tx() is at the right place
since its only caller is send_packet().

This is a pure cut-and-paste change. The reason it was not done in the
previous commit is because it would have made the diff harder to review.
This adds a new mechanism by which tinc can determine if a node is
reachable via UDP. The new mechanism is currently redundant with the
PMTU discovery mechanism - that will be fixed in a future commit.

Conceptually, the UDP discovery mechanism works similarly to PMTU
discovery: it sends UDP probes (of minmtu size, to make sure the tunnel
is fully usable), and assumes UDP is usable if it gets replies. It
assumes UDP is broken if too much time has passed since the last reply.

The big difference with the current PMTU discovery mechanism, however,
is that UDP discovery probes are only triggered as part of the
packet TX path (through try_tx()). This is quite interesting, because
it means tinc will never send UDP pings more often than normal packets,
and most importantly, it will automatically stop sending pings as soon
as packets stop flowing, thereby nicely reducing network chatter.

Of course, there are small drawbacks in some edge cases: for example,
if a node only sends one packet every minute to another node, these
packets will only be sent over TCP, because the interval between packets
is too long for tinc to maintain the UDP tunnel. I consider this a
feature, not a bug: I believe it is appropriate to use TCP in scenarios
where traffic is negligible, so that we don't pollute the network with
pings just to maintain a UDP tunnel that's seeing negligible usage.
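A sketch of the liveness logic described above, with illustrative field names (tinc keeps equivalent state per node):

```c
#include <stdbool.h>

struct udp_state {
    bool confirmed;
    double last_reply;          /* time of the last probe reply */
    double discovery_timeout;   /* the UDPDiscoveryTimeout from the text */
};

static void udp_reply_received(struct udp_state *s, double now) {
    s->confirmed = true;
    s->last_reply = now;
}

/* Checked from the TX path: once too much time has passed since the
 * last reply, we can no longer assume the UDP tunnel works, so the
 * state is reset (the real code resets MTU information as well). */
static bool udp_usable(struct udp_state *s, double now) {
    if (s->confirmed && now - s->last_reply > s->discovery_timeout)
        s->confirmed = false;
    return s->confirmed;
}
```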
Since UDP discovery is the place where UDP feasibility is checked, it
makes sense to test for local connectivity as well. This was previously
done as part of PMTU discovery.
This is a rewrite of the send_mtu_probe_handler() function to make it
focus on the actual discovery of PMTU. In particular, the PMTU
discovery code doesn't care about tunnel state anymore - it only cares
about doing the initial PMTU discovery, and once that's done, making
sure PMTU did not increase by checking it from time to time. All other
duties have already been rewritten in the UDP discovery code.

As a result, the send_mtu_probe_handler(), which previously implemented
a nightmarish state machine which was very difficult to follow and
understand, has been massively simplified. We moved from four persistent
states to only two - initial discovery and steady state.

Furthermore, a side effect is that network chatter is reduced: instead
of sending bursts of three minmtu-sized packets in the steady state,
there is only one such packet that's sent from the UDP discovery code.
However, that introduces a slight regression in the bandwidth estimation
code, which relies on three-packet bursts in order to function.
Considering that this estimation is extremely unreliable (in my
experience) and isn't relied on by anything, this seems like an
acceptable regression.
Currently, the PMTU discovery code is run by a timeout callback,
independently of tunnel activity. This commit moves it into the TX
path, meaning that send_mtu_probe_handler() is only called if a
packet is about to be sent. Consequently, it has been renamed to
try_mtu() for consistency with try_tx(), try_udp() and try_sptps().

Running PMTU discovery code only as part of the TX path prevents
PMTU discovery from generating unreasonable amounts of traffic when
the "real" traffic is negligible. One extreme example is sending one
real packet and then going silent: in the current code this one little
packet will result in the entire PMTU discovery algorithm being run
from start to finish, resulting in absurd write traffic amplification.
With this patch, PMTU discovery stops as soon as "real" packets stop
flowing, and will be no more aggressive than the underlying traffic.

Furthermore, try_mtu() only runs if there is confirmed UDP
connectivity as per the UDP discovery mechanism. This prevents
unnecessary network chatter - previously, the PMTU discovery code
would send bursts of (potentially large) probe packets every second
even if there was nothing on the other side. With this patch, the
PMTU code only does that if something replied to the lightweight UDP
discovery pings.

These inefficiencies were made even worse when the node is not a
direct neighbour, as tinc will use PMTU discovery both on the
destination node *and* the relay. UDP discovery is more lightweight for
this purpose.

As a bonus, this code simplifies overall code somewhat - state is
easier to manage when code is run in predictable contexts as opposed
to "surprise callbacks". In addition, there is no need to call PMTU
discovery code outside of net_packet.c anymore, thereby simplifying
module boundaries.
This moves related functions together, and is a pure cut-and-paste
change. The reason it was not done in the previous commit is because it
would have made the diff harder to review.
If a probe reply is received that makes minmtu equal to maxmtu, we
have to wait until try_mtu() runs to realize that. Since try_mtu()
runs after a packet is sent, this means there is at least one packet
(possibly more, depending on timing) that won't benefit from the
fixed MTU. This also happens when maxmtu is updated from the send()
path.

This commit fixes that by making sure we check whether the MTU can be
fixed every time minmtu or maxmtu is touched.
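A sketch of that idea with hypothetical names: a small helper runs after every update of either bound, so the MTU is fixed the moment the bounds meet rather than on the next try_mtu() pass:

```c
/* Illustrative state; tinc tracks minmtu/maxmtu per node. */
struct mtu_state {
    int minmtu;                 /* largest size proven to work */
    int maxmtu;                 /* smallest known upper bound */
    int fixed;                  /* nonzero once the MTU is final */
};

/* Run from every code path that touches either bound. */
static void try_fix_mtu(struct mtu_state *m) {
    if (!m->fixed && m->minmtu >= m->maxmtu)
        m->fixed = m->minmtu;
}

static void probe_reply_received(struct mtu_state *m, int len) {
    if (len > m->minmtu)
        m->minmtu = len;        /* a reply proves this size gets through */
    try_fix_mtu(m);
}

static void send_rejected_too_big(struct mtu_state *m, int len) {
    if (len <= m->maxmtu)
        m->maxmtu = len - 1;    /* the OS refused to send at this size */
    try_fix_mtu(m);
}
```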
This is a minor cosmetic nit to emphasise the distinction between the
initial MTU discovery phase, and the post-initial phase (i.e. maxmtu
checking).

Furthermore, this is an improvement with regard to the DRY (Don't
Repeat Yourself) principle, as the maximum mtuprobes value is only
written once.
Currently, tinc sends MTU probes in batches of three every second. This
commit changes that to send one packet every 333 milliseconds instead.

This change brings two benefits:

 - It makes MTU probing faster, because MTU probe lengths are calculated
   based on minmtu, and minmtu is adjusted based on the replies. When
   sending batches of three packets, all three packets are based on the
   same minmtu estimation; in contrast, by sending one packet more
   frequently, each subsequent packet can benefit from the replies that
   have been received since the last packet was sent. As a result, MTU
   discovery converges much faster (2-3 times as fast, typically).

 - It reduces network spikiness - it's more network-friendly to send
   one packet from time to time as opposed to sending bursts.
tinc bandwidth estimation has always been quite unreliable (at least in
my experience), but there's no chance of it working anymore since the
last changes to MTU discovery code, because packets are not sent in
batches of three anymore.

This commit removes the dead code - fortunately, nothing depends on this
estimation (it's not even shown in node info). We probably need to be
smarter about this if we do want this estimation back.
Currently, tinc uses a naive algorithm for choosing MTU discovery probe
sizes, picking a size at random between minmtu and maxmtu.

This is of course suboptimal - since the behavior of probes is
deterministic (assuming no packet loss), it seems likely that using a
non-deterministic discovery algorithm will not yield the best results.
Furthermore, the randomness introduces a lot of variation in convergence
times.

The random solution also suffers from pathological cases - since it's
using a uniform distribution, it doesn't take into account the fact that
it's often more interesting to send small probes rather than large ones,
because getting replies is the only way we can make progress (assuming
the worst case scenario in which the OS doesn't know anything, therefore
keeping maxmtu constant). This can lead to absurd situations where the
discovery algorithm is close to the real MTU, but can't get to it
because the random number generator keeps generating numbers that are
past it.

The algorithm implemented in this patch aims to improve on the naive
random algorithm. It is organized around "cycles" of 8 probes; the sizes
of the probes decrease as we go through the cycle, thus making sure the
algorithm can cover lots of ground quickly (in case we're far from
actual MTU), but also examining the local area (in case we're close to
actual MTU). Using cycles ensures that the algorithm will "go back" to
large probes to better cover the new interval and to protect against
packet loss.

For the probe size itself, various mathematical models were simulated in
an attempt to find the one that converges the fastest; it has been
determined that using an exponential based on the size of the remaining
interval was the most effective option. The exponential is adjusted with
a magic multiplier fine-tuned to make tinc jump to the "most
interesting" (i.e. 1400+) section as soon as discovery starts.

Simulations indicate that assuming no packet loss and no help from the
OS (i.e. maxmtu stays constant), this algorithm will typically converge
to the *exact* MTU value in less than 10 probes, and will get within 8
bytes in less than 5 probes, for actual MTUs between 1417 and ~1450
(which is the range the algorithm is fine-tuned for). In contrast, the
previous algorithm gives results all over the place, sometimes taking
30+ probes to get in the ballpark. Because of the issues with the
distribution, the previous algorithm sometimes never gets to the precise
MTU value within any reasonable amount of time - in contrast, the new
algorithm will always get to the precise value in less than 30 probes,
even if the actual MTU is completely outside the optimized range.
The recently introduced new MTU discovery algorithm converges much
faster than the previous one, which allows us to reduce the number
of probes required before we can confidently fix the MTU. This commit
reduces the number of initial discovery probes from 90 to 20. With the
new algorithm this is more than enough to get to the precise (byte-level
accuracy) MTU value; in cases of packet loss or weird MTU values for
which the algorithm is not optimized, we should get close to the actual
value, and then we rely on MTU increase detection (steady state probes)
to fine-tune it later if the need arises.

This patch also triggers MTU increase detection even if the MTU we have
is off by only one byte. Previously we only did that if it was off by at
least 8 bytes. Considering that (1) this should happen less often,
(2) restarting MTU discovery is cheaper than before and (3) having MTUs
that are subtly off from their intended values by just a few bytes
sounds like trouble, this sounds like a good idea.
If MTU discovery comes up with an MTU smaller than 512 bytes (e.g. due
to massive packet loss), it's pretty much guaranteed to be wrong. Even
if it's not, most Internet applications assume the MTU will be at least
512, so fixing the MTU to a small value is likely to cause trouble
anyway.

This also makes the discovery algorithm converge even faster, since the
interval it has to consider is smaller.
Linux provides a getsockopt() option, IP_MTU, to get the kernel's best
guess at a connection MTU. In practice, it seems to return the MTU of
the physical interface the socket is using.

This patch uses this option to initialize maxmtu to a better value when
MTU discovery starts.

Unfortunately, this is not supported on Windows. Winsock has options
such as SO_MAX_MSG_SIZE, SO_MAXDG and SO_MAXPATHDG but they seem useless
as they always return absurdly large values (typically, 65507), as
confirmed by http://support.microsoft.com/kb/822061/
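For reference, a minimal Linux example of the IP_MTU lookup. The kernel only answers on connected sockets (connect() on a UDP socket just records the peer; no packets are sent), and a caller such as choose_initial_maxmtu() would still need to subtract tinc's own per-packet overhead from the result:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Ask the Linux kernel for its current path MTU estimate toward a peer.
 * Returns -1 if the socket, connect, or getsockopt call fails. */
static int kernel_mtu_guess(const struct sockaddr_in *peer) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    int mtu = -1;
    /* IP_MTU is only valid on connected sockets. */
    if (connect(fd, (const struct sockaddr *)peer, sizeof *peer) == 0) {
        socklen_t len = sizeof mtu;
        if (getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len) != 0)
            mtu = -1;
    }
    close(fd);
    return mtu;
}
```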
The original multiplier constant for the MTU discovery algorithm, 0.97,
assumes a somewhat pessimistic scenario where we don't get any help from
the OS - i.e. maxmtu never changes. This can happen if IP_MTU is not
usable and the OS doesn't reject overly large packets.

However, in most systems the OS will, in fact, contribute to the MTU
discovery process. In these situations, an actual MTU equal to maxmtu
is quite likely (as opposed to the maxmtu = 1518 case where that is
highly unlikely, unless the physical network supports jumbo frames).
It therefore makes sense to use a multiplier of 1 - that will make the
first probe length equal to maxmtu.

The best results are obtained if the OS supports the getsockopt(IP_MTU)
call, and its result is accurate. In that case, tinc will typically fix
the MTU after one single probe(!), like so:

    Using system-provided maximum tinc MTU for foobar (1.2.3.4 port 655): 1442
    Sending UDP probe length 1442 to foobar (1.2.3.4 port 655)
    Got type 2 UDP probe reply 1442 from foobar (1.2.3.4 port 655)
    Fixing MTU of foobar (1.2.3.4 port 655) to 1442 after 1 probes
Currently, if an MTU probe is sent and gets rejected by the system
because it is too large (i.e. send() returns EMSGSIZE), the MTU
discovery algorithm is not aware of it and still behaves as if the probe
was actually sent.

This patch makes the MTU discovery algorithm recalculate and send a new
probe when this happens, so that the probe "slot" does not go to waste.
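A sketch of that retry loop, with a simulated interface standing in for the real send path (the bisection in choose_len() is just a placeholder for the actual sizing algorithm):

```c
#include <errno.h>

struct mtu_ctx { int minmtu, maxmtu; };

typedef int (*send_fn)(int len);

/* Simulated interface for illustration: sizes above 1400 are rejected
 * the way a kernel rejects an oversized datagram. */
static int fake_send(int len) {
    if (len > 1400) { errno = EMSGSIZE; return -1; }
    return 0;
}

/* Placeholder sizing policy; a plain bisection keeps the sketch short. */
static int choose_len(const struct mtu_ctx *m) {
    return m->minmtu + (m->maxmtu - m->minmtu) / 2;
}

/* On EMSGSIZE, shrink maxmtu and immediately recalculate, so the probe
 * "slot" isn't wasted. Returns the length actually sent, or -1. */
static int send_probe_with_retry(struct mtu_ctx *m, send_fn send_probe) {
    for (;;) {
        int len = choose_len(m);
        if (send_probe(len) == 0)
            return len;
        if (errno != EMSGSIZE)
            return -1;
        m->maxmtu = len - 1;
        if (m->maxmtu < m->minmtu)
            return -1;
    }
}
```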

gsliepen commented Jan 2, 2015

On Fri, Jan 02, 2015 at 02:46:24AM -0800, Etienne Dechamps wrote:

As promised, here is the result of my work on tinc's MTU discovery code.

This massive pull request has three major goals, which are described below.

Code simplification

The current MTU discovery code is quite difficult to understand. The crux of the problem is the send_mtu_probe_handler() function which implements a complicated and hard to read state machine.

I completely agree.

I believe the main reason for the complexity is that the MTU discovery code is trying to do too many things at once: it's trying to estimate the MTU for the link, while making sure UDP connectivity is still up and running. These are two separate concerns intermingled in the same code, which increases state machine complexity.

Well, in my mind it's one and the same thing; it's just trying to
determine a minimum value for the MTU, so that everything up to that
size can be sent via UDP, the rest goes via TCP. But yes, the state
machine is ugly.

In this pull request, I separate these two concerns by introducing a new concept, "UDP discovery". UDP discovery concentrates on keeping the UDP tunnel alive and checking its status, while MTU discovery concentrates on improving and fixing tunnel MTU as soon as UDP discovery establishes a working tunnel. This massively simplifies the MTU discovery code, which as a result only has two states to care about: initial discovery and "steady-state" (i.e. maxmtu checking). Separating these two concerns also makes it easier to implement the next improvements described below.

I'm not against this :)

Less network chatter

Something that's been grinding my gears for quite some time is tinc's eagerness to send probes when the original reason for sending them is gone. Consider a scenario in which a node makes a DNS request to another node over the VPN, and then the link between these two goes silent. From the high-level VPN point of view, that's one single packet going over the wire in each direction. In response to this, tinc will start MTU discovery, which in some cases can result in 90 packets (~126KB) going over the wire. Then, it will continue maintaining the tunnel basically forever, sending 3 packets (~4KB) every few seconds. This is in response to sending one single small VPN packet. The resulting write amplification is somewhat absurd.

That's indeed a waste of bandwidth.

While this is not a critical problem per se, it results in lots of network pollution, and makes tinc look "chatty" and "verbose". The inefficiency is detrimental to mobile devices (battery time) and constrained network links. More importantly, it does not scale: if a tinc node sends one packet to 1000 other nodes and then goes silent (e.g. P2P application), the resulting chatter will look truly awful.

The solution implemented in this pull request is simple: if you're not sending any real VPN packets, then you don't get to send any probes. Probes can only be sent from the send_packet() TX path. No packets, no probes.

That's an excellent solution to the problem!

This results in the following behavior:

  • If a node sends one packet and then goes silent, one single probe might be sent (well, technically, maybe two, if both UDP discovery and MTU discovery want one). That's it. After that, the physical link goes completely silent.

Great.

  • If a node sends a slow trickle of packets, UDP/MTU discovery won't go faster than the packet rate. This way, we ensure we're not sending inappropriate amounts of control traffic compared to real traffic.
  • If a node sends packets at a high rate (e.g. downloading), then UDP/MTU discovery behaves like it currently does - it will go at full speed. Indeed, if the tunnel is heavily used, it makes sense to optimize it as much as we can.
  • If a link goes silent for a while (UDPDiscoveryTimeout), tinc can't assume the UDP link is usable anymore. Therefore it will reset UDP discovery and MTU discovery information. If the link gets used again, the first packet will go over TCP (which is fine), while tinc re-establishes a UDP tunnel and rediscovers the MTU. The latter operation is not really that expensive thanks to the next improvement below.

There is one issue here, and that is UDP hole punching. If you have
three nodes, A B and C, and A and B are behind a NAT and C not, then
they need C to help them punch holes. This can only happen if C has
active UDP connections with A and B. This is actually the reason that
tinc tried to keep UDP alive for every meta-connection. There's some
limitations in the 1.0.x protocol that prevent this from being done on
demand only.

The keyword here is adaptive. tinc adapts to the real usage of the tunnel between two nodes, and will send probes more or less aggressively depending on how much the tunnel is used.

Improved MTU discovery algorithm

This pull request also contains two changes to the core of the MTU discovery algorithm itself, yielding massive improvements with regard to convergence time and accuracy.

The first change makes the MTU probing code send one packet at a time instead of three. The frequency is adjusted from 1 second to 0.3 seconds to compensate, so that the number of packets per second stays the same. The rationale is that it leads to more efficient feedback, as the size of each probe is based on the previous replies; therefore, allowing replies to come in between probes decreases the total number of probe packets required to converge.

Yes, however the burst bandwidth estimation code then indeed does not
work anymore. Although it might be inaccurate, it does at least get the
order of magnitude correct. On the other hand, it's not used for
anything at the moment.

The second change moves from a naive random probe size model to a more sophisticated mathematical model with the goal of more efficiently covering the remaining MTU interval. This leads to massive improvements in convergence time and accuracy.

I'm very sceptical about this. It's very easy to optimize the algorithm
for YOUR network, it's something else to make an algorithm that works
well on ANY network. I did some simulations myself in the past with
different algorithms, and calculated the mean/median/sdev of them for
ALL possible path MTUs, with and without ICMP Packet Too Big responses.
The median number of packets the random-size algorithm needs to fix the
MTU shows far fewer peaks as you vary the actual PMTU than other
algorithms do, and averaged over all possible PMTUs it performs more
or less just as well as deterministic algorithms. I'll try to run the
simulations again with your algorithm.

With these two improvements, tinc is typically able to converge on the precise (byte-level accuracy) MTU in less than 10 probe packets (3 seconds!). In contrast, the current random algorithm gives highly variable results, sometimes using 6 times more probe packets to even get remotely close to the actual MTU.

Well, before tinc usually fixed the MTU in 7 probes if ICMP Packet Too
Big responses are not filtered, and just switching to 1 packet every 0.3
seconds would improve that. But 3 seconds is indeed a very nice result.

You can merge this Pull Request by running:

git pull https://github.com/dechamps/tinc pmtu

I tried it out, but there are some issues. Running it between two nodes
on a LAN in switch mode, I see that the broadcast packets caused by IPv6
auto IP assignment and Avahi when the interface goes up don't trigger UDP
probes at all. Only when there are no broadcast packets for > 10
seconds, and I then try to ping the other side, will UDP probes be sent.

After the first UDP probe is successfully received, MTU probing starts,
and I see this:

    Using system-provided maximum tinc MTU for xar (10.1.1.1 port 28540): 1472
    Sending UDP probe length 1472 to xar (10.1.1.1 port 28540)
    Sending UDP probe length 1471 to xar (10.1.1.1 port 28540)
    Sending UDP probe length 1470 to xar (10.1.1.1 port 28540)
    [...]
    Sending UDP probe length 1460 to xar (10.1.1.1 port 28540)
    Sending UDP probe length 1459 to xar (10.1.1.1 port 28540)
    Got type 2 UDP probe reply 1459 from xar (10.1.1.1 port 28540)
    Fixing MTU of xar (10.1.1.1 port 28540) to 1459 after 1 probes

This is because choose_initial_maxmtu() doesn't apply the overhead of
legacy UDP packet encryption, which can vary depending on Cipher, Digest
and MACLength. In this very case, it's sort of nice it counts down until
it reaches a size that works, but I think that it should only retry if
n->mtuprobes >= 0 and should increase the counter inside the for loop.

Apart from the algorithm for choosing packet sizes, I think the MTU
probing code is very nice. As for UDP probing, I would make a few
changes. First of all, while status.udp_confirmed is false, instead of
using udp_discovery_value, a smaller value like 0.3 seconds could be used.
Also, for nodes that we also have a meta-connection with, try_udp()
should be called every udp_discovery_interval to keep the UDP mappings alive.
Maybe this can be done just by looping over the connection_list and
calling try_udp(c->node) in periodic_handler() in net.c.

Do you want to change anything before I pull your changes? I don't mind
pulling now and working on the details later.

Met vriendelijke groet / with kind regards,
Guus Sliepen [email protected]


dechamps commented Jan 2, 2015

There is one issue here, and that is UDP hole punching. If you have three nodes, A B and C, and A and B are behind a NAT and C not, then they need C to help them punch holes. This can only happen if C has active UDP connections with A and B. This is actually the reason that tinc tried to keep UDP alive for every meta-connection. There's some limitations in the 1.0.x protocol that prevent this from being done on demand only.

Oh, right, I always wondered why tinc viewed direct neighbors as "special snowflakes" in that regard, that answers it. As you probably already noticed, this is not an issue with SPTPS because it will spontaneously relay in that case, so indirect UDP tunnels will always get established if necessary. I understand that the non-SPTPS path doesn't do that (it doesn't know how to fall back to indirect UDP), hence the problem.

To be honest, I'm not sure how "well" you want 1.0.x-1.1.x cohabitation to work. I always thought this was simply to ease the transition path, that no-one would keep a heterogeneous graph in the long term, and that it was therefore acceptable to have less-than-optimal performance characteristics between 1.0.x and 1.1.x nodes, as long as communication is still possible (e.g. via TCP). That said, this looks like an easy fix, so we might as well do that.

Yes, however the burst bandwidth estimation code then indeed does not work anymore. Although it might be inaccurate, it does at least get the order of magnitude correct. On the other hand, it's not used for anything at the moment.

That's collateral damage I'm afraid. I strongly believe it's worth it, but it's your call of course :)

That said, you are quite lucky to at least get the correct order of magnitude. In my experience, if the link is just a little unpredictable (jitter, congestion), that thing returns completely absurd numbers.

I'm very sceptical about this. It's very easy to optimize the algorithm for YOUR network, it's something else to make an algorithm that works well on ANY network.

Well, the algorithm I wrote is optimized for an MTU in the 1400-1450 range (which is typical of tinc running over typical physical networks), with no help from the system (i.e. no EMSGSIZE). It's not optimized for "my" network (which does have reliable EMSGSIZE notifications). It's optimized in a spreadsheet :)

I did some simulations myself in the past with different algorithms, and calculated the mean/median/sdev of them for ALL possible path MTUs, with and without ICMP Packet Too Big responses. The median number of packets the random-sized-packets algorithm needs to fix the MTU shows far fewer peaks as you vary the actual PMTU than other algorithms do, and averaged over all possible PMTUs it performs more or less just as well as deterministic algorithms.

Oh yeah, I spent a lot of time doing tons of simulations as well. Yes, I agree, it's surprisingly hard to beat the random algorithm. None of the formulas I tried managed to beat it, except the one I implemented in this pull request. It was the only one to show significantly better results compared to the random one. Here's the spreadsheet that I used to determine that. I believe you should be able to copy it to your own Drive and then put in new numbers from there.

My main problem with the random algorithm is, well, it's random. In my opinion, that alone is a disadvantage, as by definition it leads to somewhat unpredictable results. Having an MTU that depends on the phase of the moon is not great. Of course there is always some nondeterminism in the form of packet loss, but that's not as bad.

In addition, my algorithm pretty much always guarantees it will get to byte-level accuracy in a reasonable amount of time (typically < 20 probes). With the random algorithm there are pathological cases where that will take a very long time (consider minmtu 1418, maxmtu 1518 and actual MTU 1419: each probe has a 1% chance of landing on 1419). That cannot happen with the one I implemented.
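To put a number on that pathological case, here is a toy model (illustrative Python, not tinc code): each probe picks a uniformly random size in the remaining range, and we count probes until one lands exactly on the actual MTU.

```python
import random

def probes_until_exact_hit(minmtu, maxmtu, actual_mtu, rng):
    """Count probes until a uniformly random size in (minmtu, maxmtu]
    lands exactly on the actual MTU -- the hit the random scheme needs
    to reach byte-level accuracy in this pathological case."""
    probes = 0
    while True:
        probes += 1
        if rng.randint(minmtu + 1, maxmtu) == actual_mtu:
            return probes

rng = random.Random(42)
# minmtu 1418, maxmtu 1518, actual MTU 1419: each probe has a 1/100
# chance of hitting 1419, so ~100 probes are needed on average.
trials = [probes_until_exact_hit(1418, 1518, 1419, rng) for _ in range(2000)]
print(sum(trials) / len(trials))
```

This is a geometric distribution with p = 0.01, so the average over many trials comes out near 100 probes just for the final byte of accuracy.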

Well, before, tinc usually fixed the MTU in 7 probes if ICMP Packet Too Big responses are not filtered.

All the numbers I mentioned assume a worst case scenario where these responses are filtered.

Running it between two nodes on a LAN in switch mode, I see that the broadcast packets caused by IPv6 auto IP assignment and Avahi when the interface goes up don't trigger UDP probes at all. Only when there are no broadcast packets for > 10 seconds, and I then try to ping the other side, will UDP probes be sent.

That's weird. I think I've tested broadcast packets and they seemed to work correctly. I'll look into it.

This is because choose_initial_maxmtu() doesn't apply the overhead of legacy UDP packet encryption, which can vary depending on Cipher, Digest and MACLength.

Yes. I had no idea how to calculate overhead for 1.0.x, so I cheated and applied my "we don't care that much about 1.0" philosophy instead :) For SPTPS packets the overhead calculation is accurate. If you know how to calculate that for 1.0, then by all means, be my guest.

In this very case, it's sort of nice it counts down until it reaches a size that works, but I think that it should only retry if n->mtuprobes >= 0 and should increase the counter inside the for loop.

Maybe. I don't really see that as a problem, but I'm fine either way.

As for UDP probing, I would make a few changes. First of all, while status.udp_confirmed is false, instead of using udp_discovery_value, a smaller value like 0.3 seconds could be used.

I guess that makes sense, probes sent while udp_confirmed is false are very lightweight anyway (16 bytes payload). I'll change the pull request.

Also, for nodes that we also have a meta-connection with, try_udp() should be called every udp_discovery_interval to keep the UDP mappings alive. Maybe this can be done just by looping over the connection_list and calling try_udp(c->node) in periodic_handler() in net.c.

That's related to the hole punching issue with 1.0.x that you mentioned above, right? In that case I'm fine with you making the change, you seem to care more about 1.0.x than I do :)

Do you want to change anything before I pull your changes?

I'll try to investigate the problem with broadcasts and fix it. I'll also make the change to fine-tune the UDP discovery interval. After that, feel free to merge and do anything you want with regard to the legacy stuff and the handling of maxmtu changes in try_mtu().


gsliepen commented Jan 2, 2015

On Fri, Jan 02, 2015 at 10:07:23AM -0800, Etienne Dechamps wrote:

There is one issue here, and that is UDP hole punching. If you have three nodes, A B and C, and A and B are behind a NAT and C not, then they need C to help them punch holes. This can only happen if C has active UDP connections with A and B. This is actually the reason that tinc tried to keep UDP alive for every meta-connection. There are some limitations in the 1.0.x protocol that prevent this from being done on demand only.

Oh, right, I always wondered why tinc viewed direct neighbors as "special snowflakes" in that regard, that answers it. As you probably already noticed, this is not an issue with SPTPS because it will spontaneously relay in that case, so indirect UDP tunnels will always get established if necessary. I understand that the non-SPTPS path doesn't do that (it doesn't know how to fall back to indirect UDP), hence the problem.

I don't really see how this is different for SPTPS... the issue is that
node C needs to inform A and B of their "reflexive" UDP address+port.
This is done only when A sends an ANS_KEY to B via C, and vice versa.
And that only happens during the initial SPTPS handshake between A and
B. I know, this is stupid. But that's because all this is shoe-horned
into a protocol that even tinc 1.0.0 nodes should still be able to work
with. I'll think a bit about how to improve this, at least between 1.1
nodes this can be fixed.

To be honest, I'm not sure how "well" you want 1.0.x-1.1.x cohabitation to work. I always thought this was simply to ease the transition path, and that no-one would keep a heterogeneous graph in the long term,

I hope things are better at your work, but at my last job as a UNIX
sysadmin we had a translation table that had examples like this:

"On paper: legacy system will be phased out in three months -> In
reality: system has to be supported until the heat death of the
universe."

Any network with >10 devices that are not under the control of a single
individual is bound to be very heterogeneous.

and that it was therefore acceptable to have less-than-optimal performance characteristics between 1.0.x and 1.1.x nodes, as long as communication is still possible (e.g. via TCP). That said, this looks like an easy fix, so we might as well do that.

I would like to keep performance of mixed networks as optimal as
possible, only sacrificing it if it would be too burdensome.

Yes, however the burst bandwidth estimation code [...], you are quite lucky to at least get the correct order of magnitude. In my experience, if the link is just a little unpredictable (jitter, congestion), that thing returns completely absurd numbers.

Ok. Well it just gave the raw values from a single burst, if it were
used for anything it should have been smoothed in some way. Did you get
numbers substantially higher than the actual bandwidth?

I did some simulations myself in the past with different algorithms, and calculated the mean/median/sdev of them for ALL possible path MTUs, with and without ICMP Packet Too Big responses. The median number of packets the random-sized-packets algorithm needs to fix the MTU shows far fewer peaks as you vary the actual PMTU than other algorithms do, and averaged over all possible PMTUs it performs more or less just as well as deterministic algorithms.

Oh yeah, I spent a lot of time doing tons of simulations as well. Yes, I agree, it's surprisingly hard to beat the random algorithm. None of the formulas I tried managed to beat it, except the one I implemented in this pull request. It was the only one to show significantly better results compared to the random one. Here's the spreadsheet that I used to determine that. I believe you should be able to copy it to your own Drive and then put in new numbers from there.

I have several drives, but they are all in my own computers ;)

In any case, I have put your algorithm into my own simulation, and there
is a surprising result. The random algorithm, modified to not do bursts
of 3 packets anymore but just 1 packet every 0.3 seconds, is on average
better than your algorithm by a factor of 1.5, but only if sendto() can
fail with EMSGSIZE. If not, then your algorithm finds the actual value
of the MTU in only a quarter of the number of probes (I'm not counting
any extra probes because it doesn't know maxmtu yet here).

Your algorithm has some weird behavior. It is quite sensitive to the
multiplier. Indeed, 0.982 optimizes it for the range 1410-1460, and if
you set it to 1 it actually performs quite a bit worse at MTUs >1400.
Also, you would expect that the behavior improves with EMSGSIZE, but
for your algorithm it just gives different results, sometimes better,
sometimes worse.

Your algorithm is slightly more sensitive to packet loss, but this only
becomes an issue when it's more than 20%. That's actually a very good
result for a completely deterministic algorithm.

I'll try to see if I can combine the best of both algorithms :)

In addition, my algorithm pretty much always guarantees it will get to byte-level accuracy in a reasonable amount of time (typically < 20 probes). With the random algorithm there are pathological cases where that will take a very long time (consider minmtu 1418, maxmtu 1518 and actual MTU 1419: each probe has a 1% chance of landing on 1419). That cannot happen with the one I implemented.

That's indeed true, and probably that is why your algorithm converges so
much faster than the random one. That got me thinking, it might be nice to
keep some extra state, namely what size packet was sent last time. If no
response is received, then randomly/deterministically choose to either
send a packet that is halfway between that size and minmtu, or send one
between maxmtu and minmtu. In the latter case, don't update the "last sent
packet size". I modified my algorithm in the simulation, and with some
experimentation found that the best results are when 1 in 8 times I
always try a random packet size between minmtu and maxmtu, the rest of
the times I choose minmtu + 1 + (lastsize - minmtu) / 3. The 1 in 8
times matches your probes_per_cycle, and the reduction in size by 3
matches your powf(). The resulting algorithm has similar performance as
yours without EMSGSIZE, and somewhat better performance with EMSGSIZE.

It's too late now, but tomorrow I'll send you the simulation code.
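Read literally, the scheme described above could look something like this toy simulation (an illustrative reconstruction, not the actual simulation code; in particular, restarting the descent from maxmtu after a successful probe is an assumption the description leaves open). Packet loss and EMSGSIZE are ignored for clarity.

```python
import random

def hybrid_converge(minmtu, maxmtu, actual_mtu, rng, max_probes=500):
    """Toy run of the hybrid scheme: 1 in 8 probes picks a uniformly
    random size over the whole (minmtu, maxmtu] range; the rest step a
    third of the way down from the last shrinking probe ('lastsize')
    toward minmtu. Successful probes raise minmtu; the descent is
    assumed to restart from maxmtu after any success."""
    lastsize = maxmtu
    probes = 0
    while minmtu < actual_mtu and probes < max_probes:
        probes += 1
        if rng.randrange(8) == 0:
            size = rng.randint(minmtu + 1, maxmtu)
            shrinking = False
        else:
            size = minmtu + 1 + (lastsize - minmtu) // 3
            shrinking = True
        if size <= actual_mtu:
            minmtu = size            # probe got through: raise the floor
            lastsize = maxmtu        # assumed: restart the descent
        elif shrinking:
            lastsize = size          # no response: next shrink goes lower
    return probes, minmtu

probes, minmtu = hybrid_converge(512, 1518, 1451, random.Random(7))
print(probes, minmtu)
```

Because each failed shrinking probe cuts the distance to minmtu by a factor of three, and a success jumps the floor up and restarts, this converges to the exact PMTU in a few tens of probes at most.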

This is because choose_initial_maxmtu() doesn't apply the overhead of legacy UDP packet encryption, which can vary depending on Cipher, Digest and MACLength.

Yes. I had no idea how to calculate overhead for 1.0.x, so I cheated and applied my "we don't care that much about 1.0" philosophy instead :) For SPTPS packets the overhead calculation is accurate. If you know how to calculate that for 1.0, then by all means, be my guest.

Sure, I'll add it :)

Also, for nodes that we also have a meta-connection with, try_udp() should be called every udp_discovery_interval to keep the UDP mappings alive. Maybe this can be done just by looping over the connection_list and calling try_udp(c->node) in periodic_handler() in net.c.

That's related to the hole punching issue with 1.0.x that you mentioned above, right? In that case I'm fine with you making the change, you seem to care more about 1.0.x than I do :)

Ok :)

Do you want to change anything before I pull your changes?

I'll try to investigate the problem with broadcasts and fix it. I'll also make the change to fine-tune the UDP discovery interval. After that, feel free to merge and do anything you want with regard to the legacy stuff and the handling of maxmtu changes in try_mtu().

Ok.

Met vriendelijke groet / with kind regards,
Guus Sliepen [email protected]


dechamps commented Jan 3, 2015

I don't really see how this is different for SPTPS... the issue is that node C needs to inform A and B of their "reflexive" UDP address+port. This is done only when A sends an ANS_KEY to B via C, and vice versa. And that only happens during the initial SPTPS handshake between A and B. I know, this is stupid. But that's because all this is shoe-horned into a protocol that even tinc 1.0.0 nodes should still be able to work with. I'll think a bit about how to improve this, at least between 1.1 nodes this can be fixed.

Ah, yes, of course. That makes sense. Actually I knew about this at some point in the past, because I remember noticing the potential race condition issues related to piggybacking on this message: if ANS_KEY is sent before the intermediate nodes have a full picture about their UDP situation, then it's ineffective, and it won't be resent afterwards.

Well, the next thing I was planning to do after this was to implement UDP info messages, which are designed specifically to address this problem. They will make timing irrelevant, because these messages will be sent periodically so that UDP address information can be updated in real time as intermediate nodes establish tunnels (which they always do because of the SPTPS relaying fallback). Of course that will only work if every node in the path is able to understand these messages, so that excludes 1.0.x. Is that okay with you? From your message I get the feeling you had something similar in mind.

In any case, I have put your algorithm into my own simulation, and there is a surprising result. The random algorithm, modified to not do burst of 3 packets anymore but just 1 packet every 0.3 seconds, is on average better than your algorithm by a factor 1.5, but only if sendto() can fail with EMSGSIZE. If not, then your algorithm finds the actual value of the MTU in only a quarter of the number of probes (I'm not counting any extra probes because it doesn't know maxmtu yet here).
Also, you would expect that the behavior improves with EMSGSIZE, but for your algorithm it just gives different results, sometimes better, sometimes worse.

Well, I fine-tuned it for the worst case scenario (no EMSGSIZE). Maybe by tweaking it a little bit more we can get to a compromise between the best case scenario and the worst case scenario.

Your algorithm has some weird behavior. It is quite sensitive to the multiplier. Indeed, 0.982 optimizes it for the range 1410-1460, and if you set it to 1 it actually performs quite a bit worse at MTUs >1400.

Setting it to 1 is a bad idea, because it means the first probe is always set to maxmtu (offset = pow(interval, 1)). If EMSGSIZE is not returned, it means the algorithm will send one 1518-byte probe at the beginning of each cycle, and that probe is unlikely to get back unless the physical network supports jumbo frames.

The reason why the multiplier is very sensitive is because it is directly multiplying an exponent. Exponentials are known to be quite sensitive :)
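To make that concrete: the formula below is a simplified stand-in for the pull request's probe-size computation (illustrative, not the actual tinc code), where the largest probe of a cycle is minmtu + interval**multiplier. A 1.8% change in the multiplier moves that probe by more than 100 bytes.

```python
# Why the multiplier is so sensitive: it scales an exponent.
minmtu, maxmtu = 512, 1518
interval = maxmtu - minmtu  # 1006

def largest_probe(multiplier):
    # biggest probe of a cycle: offset = interval ** (multiplier * 1)
    return minmtu + int(interval ** multiplier)

print(largest_probe(1.0))    # 1518: the probe is always maxmtu-sized
print(largest_probe(0.982))  # ~1400: stays below jumbo-frame territory
```

With the multiplier at exactly 1, the first probe of every cycle is a full-maxmtu packet, which is lost on any network without jumbo frames; at 0.982 it lands just under the typical Ethernet-derived PMTU range.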

I'll try to see if I can combine the best of both algorithms :)

You might want to try using your random algorithm but with a non-uniform distribution biased towards small probes. The problem is, you're unlikely to get the nice behavior of trying large probes first (which is what my algorithm does) if you do it that way.

That's indeed true, and probably that is why your algorithm converges so much faster than the random one. That got me thinking, it might be nice to keep some extra state, namely what size packet was sent last time.

I thought about it, but I was trying to refrain from adding more complexity and state to the algorithm, so I did not investigate that option. But if you're fine with it, by all means, be my guest.

This introduces a new configuration option,
UDPDiscoveryKeepaliveInterval, which is used as the UDP discovery
interval once the UDP tunnel is established. The pre-existing option,
UDPDiscoveryInterval, is therefore only used before UDP connectivity
is established.

The defaults are set so that tinc sends UDP pings more aggressively
if the tunnel is not established yet. This is appropriate since the
size of probes in that scenario is very small (16 bytes).

dechamps commented Jan 3, 2015

I've added a commit for the discovery timeout change. I've set it to 2 seconds, which seems more appropriate than 0.3 seconds, which sounds very aggressive to me. Tunnel establishment getting delayed by two or four seconds because of lost packets doesn't seem like a big deal to me. Feel free to readjust if you disagree.

I tried to reproduce your issue with broadcasts, but was unable to. I've set up a testbed where I only send broadcasts and nothing else, and everything works normally. The alternative would have been very surprising because broadcasts use the same send_packet() TX path as everything else - there's no reason why they would act any different from normal packets as far as UDP discovery is concerned. Since I can't reproduce it, I'm afraid you're gonna have to try to debug it yourself. Is there anything non-standard about your test configuration?

I'm okay with merging now.


gsliepen commented Jan 3, 2015

On Sat, Jan 03, 2015 at 01:26:40AM -0800, Etienne Dechamps wrote:

I don't really see how this is different for SPTPS... the issue is that node C needs to inform A and B of their "reflexive" UDP address+port. This is done only when A sends an ANS_KEY to B via C, and vice versa. And that only happens during the initial SPTPS handshake between A and B. I know, this is stupid. But that's because all this is shoe-horned into a protocol that even tinc 1.0.0 nodes should still be able to work with. I'll think a bit about how to improve this, at least between 1.1 nodes this can be fixed.

Ah, yes, of course. That makes sense. Actually I knew about this at some point in the past, because I remember noticing the potential race condition issues related to piggybacking on this message: if ANS_KEY is sent before the intermediate nodes have a full picture about their UDP situation, then it's ineffective, and it won't be resent afterwards.

Well, the next thing I was planning to do after this was to implement UDP info messages, which are designed specifically to address this problem. They will make timing irrelevant, because these messages will be sent periodically so that UDP address information can be updated in real time as intermediate nodes establish tunnels (which they always do because of the SPTPS relaying fallback). Of course that will only work if every node in the path is able to understand these messages, so that excludes 1.0.x. Is that okay with you? From your message I get the feeling you had something similar in mind.

Yes. But between 1.1 and 1.0 nodes it should still be possible to just
keep the UDP mappings alive with small UDP probes. In fact, the 1.0
nodes will do this anyway... so maybe this doesn't require any effort.

I'll try to see if I can combine the best of both algorithms :)

You might want to try using your random algorithm but with a non-uniform distribution biased towards small probes. The problem is, you're unlikely to get the nice behavior of trying large probes first (which is what my algorithm does) if you do it that way.

That might be interesting. Anyway, I put a tarball with the simulation
code and the results online:

http://tinc-vpn.org/temp/simpmtu.tar.gz

I ran the simulations with both 1% and 10% packet loss. With 1% packet
loss, the result is virtually the same as 0% packet loss for all
algorithms. I also ran it with support for jumbograms (maximum MTU
9018).



gsliepen commented Jan 3, 2015

On Sat, Jan 03, 2015 at 02:24:57AM -0800, Etienne Dechamps wrote:

I tried to reproduce your issue with broadcasts, but was unable to. I've set up a testbed where I only send broadcasts and nothing else, and everything works normally. The alternative would have been very surprising because broadcasts use the same send_packet() TX path as everything else - there's no reason why they would act any different from normal packets as far as UDP discovery is concerned. Since I can't reproduce it, I'm afraid you're gonna have to try to debug it yourself. Is there anything non-standard about your test configuration?

I didn't think so... but maybe I made a mistake in my setup, I'll try it
again.

I've added a commit for the discovery timeout change. I've set it to 2 seconds, which seems more appropriate than 0.3 seconds, which sounds very aggressive to me. Tunnel establishment getting delayed by two or four seconds because of lost packets doesn't seem like a big deal to me. Feel free to readjust if you disagree.

Well, since those probes are only sent when there is other traffic
anyway, I don't see why a shorter interval would hurt. If packet loss is
completely stochastic anyway, then the time between the probes should
not matter much: if you say two or four seconds, that just means 1 or 2
packets get lost. If you decrease the interval, it means tunnel
establishment only gets delayed 0.3 or 0.6 seconds. Especially when you
stop UDP probes much more aggressively than tinc does now, you want it to
come back up faster as well IMHO.

I'm okay with merging now.

Great!



dechamps commented Jan 3, 2015

Well, since those probes are only sent when there is other traffic anyway, I don't see why a shorter interval would hurt.

Well, it hurts because it increases network chatter, though how much chatter is acceptable is a pure judgement call, so I'm fine with anything you think is best.

That said, I would urge you to consider the worst case scenario of sustained traffic from a node behind a TCP-only firewall: in that case 4 probes will be sent per interval (1 for the destination node, 1 for the relay, and both are duplicated because of local discovery). If you do that every 0.3 seconds, you end up with 13 pings per second, which seems a little extreme to me.
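Spelling that worst case out:

```python
# Worst case described above: sustained traffic from a node behind a
# TCP-only firewall. One probe for the destination node and one for the
# relay, each duplicated by local discovery.
probes_per_interval = 2 * 2
interval_seconds = 0.3
print(probes_per_interval / interval_seconds)  # ~13 pings per second
```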


gsliepen commented Jan 3, 2015

On Sat, Jan 03, 2015 at 07:25:49AM -0800, Etienne Dechamps wrote:

Well, since those probes are only sent when there is other traffic anyway, I don't see why a shorter interval would hurt.

Well, it hurts because it increases network chatter, though how much chatter is acceptable is a pure judgement call, so I'm fine with anything you think is best.

That said, I would urge you to consider the worst case scenario of sustained traffic from a node behind a TCP-only firewall: in that case 4 probes will be sent per interval (1 for the destination node, 1 for the relay, and both are duplicated because of local discovery). If you do that every 0.3 seconds, you end up with 13 pings per second, which seems a little extreme to me.

Ah, maybe I forgot to mention, but I don't intend to do probes every 0.3
seconds all the time, but just for the first 10 probes or so when trying
to start up UDP. And if it fails just do it once every minute or so. I
have three intervals in my mind:

  1. Short (0.3 s) during initial UDP probing, only for 10 packets max
  2. Medium (10 s) while UDP traffic is flowing
  3. Long (60 s) while we have traffic but status.udp_confirmed == false.

The rationale is to have quick (re-)establishment of UDP tunnels, then
keep it alive as long as it's in use, and if UDP doesn't work then only
try it again once in a while. Maybe 3) can be omitted and just use 10 s
as well, since we only send the packets when there is actual traffic
anyway.
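The three-tier scheme above could be expressed as a small selection function (names and values are illustrative, not tinc's actual configuration):

```python
def udp_probe_interval(udp_confirmed, probes_sent):
    """Pick the delay (seconds) before the next UDP probe, following
    the three tiers sketched above. Names and values are illustrative."""
    if udp_confirmed:
        return 10.0      # medium: keepalive while UDP traffic is flowing
    if probes_sent < 10:
        return 0.3       # short: quick (re-)establishment of the tunnel
    return 60.0          # long: UDP looks blocked, retry only occasionally

print(udp_probe_interval(False, 0))   # 0.3
print(udp_probe_interval(False, 15))  # 60.0
print(udp_probe_interval(True, 99))   # 10.0
```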



dechamps commented Jan 3, 2015

Ah, maybe I forgot to mention, but I don't intend to do probes every 0.3 seconds all the time, but just for the first 10 probes or so when trying to start up UDP. And if it fails just do it once every minute or so.

I don't think this is worth the extra complexity, but that's just my opinion.

Furthermore, if you're going to do something like this you might as well go all the way and implement exponential backoff, since that's best practice for this sort of thing as far as I know.
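A capped exponential backoff of the kind suggested here is only a couple of lines (illustrative values, not from tinc):

```python
from itertools import islice

def backoff_intervals(base=0.3, cap=60.0, factor=2.0):
    """Yield probe intervals that grow by `factor` after each failed
    attempt, capped at `cap`. Values are illustrative, not from tinc."""
    interval = base
    while True:
        yield interval
        interval = min(interval * factor, cap)

print(list(islice(backoff_intervals(), 10)))
```

This replaces the two hard-coded tiers (0.3 s, then 60 s after 10 probes) with a smooth ramp between the same endpoints, which avoids the extra "probe count" state.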

@gsliepen gsliepen merged commit 0710811 into gsliepen:1.1 Jan 10, 2015
@dechamps dechamps deleted the pmtu branch March 8, 2015 11:56