Prefer routes with lowest rtt #235

Closed (wants to merge 2 commits)

Conversation

@iczero (Contributor) commented Feb 26, 2020

In situations where two identical routes exist, prefer the node with the lowest rtt.

An implementation of "poor man's anycast" as originally proposed in #194.
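
In essence: among the nodes announcing the same subnet, prefer the reachable one with the lowest measured UDP RTT. A minimal sketch of that selection (illustrative only, not the actual diff; the struct and field names are simplified stand-ins for tinc's internal types):

#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for tinc's node_t. */
typedef struct node {
    int udp_ping_rtt;    /* measured RTT, -1 if unknown */
    bool reachable;
} node_t;

/* Given all owners announcing one identical subnet, pick the reachable
   owner with the lowest known RTT; unknown RTT sorts last. */
node_t *prefer_lowest_rtt(node_t **owners, size_t n) {
    node_t *best = NULL;
    int best_rtt = INT_MAX;

    for(size_t i = 0; i < n; i++) {
        if(!owners[i]->reachable) {
            continue;
        }

        int rtt = owners[i]->udp_ping_rtt >= 0 ? owners[i]->udp_ping_rtt : INT_MAX - 1;

        if(rtt < best_rtt) {
            best_rtt = rtt;
            best = owners[i];
        }
    }

    return best;
}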

@leptonyu (Contributor) commented Apr 8, 2020

I tested it with WireGuard. It seems that this strategy affects how tinc chooses the right endpoint for a node.

I have 3 nodes: A, B, C.

A has the public IP 1.2.3.4.
B and C have no public IP.

I set the tinc subnet to 10.0.0.0/24
and the WireGuard subnet to 10.0.1.0/24.

A has IPs 10.0.0.1, 10.0.1.1, 1.2.3.4
B has IPs 10.0.0.2, 10.0.1.2
C has IPs 10.0.0.3, 10.0.1.3

WireGuard only supports forwarding packets from B to C via A, so B can ping C using 10.0.1.3, and this path is stable.

Tinc can do NAT hole punching via A, so C may end up with a public endpoint such as 2.3.4.5 with port 34562.

C's endpoint 2.3.4.5:34562 is not stable because of NAT timeouts.

With this strategy, I find that tinc node C's endpoint becomes 10.0.1.3:655.

I want this:
C = 2.3.4.5 => B

But the result is:
C = 10.0.1.3 => A => 10.0.0.3 => B

I think this strategy should be controlled by configuration, not applied automatically. Some situations may not fit it.

@wolfy1339 commented Apr 8, 2020

This shouldn't affect the current behaviour; it only applies to situations where two or more nodes advertise the same subnet (what you're referring to as CIDR).

Example:

A has 0.0.0.0/0 and 10.0.0.1/24
B has 0.0.0.0/0 and 10.0.0.2/24
C has 10.0.0.3/24

If C has a lower RTT to A than to B, it will choose to forward those packets to A. If B has a better RTT than A, C will forward the packets to B.

@iczero (Contributor, Author) commented Apr 8, 2020

This PR does not affect the behavior you describe. Tinc already does this automatically. This PR only affects which node packets are sent to, not how.

@leptonyu (Contributor) commented Apr 9, 2020

Yeah, you are right, @iczero @wolfy1339.

I should disable forwarding between 10.0.0.0/24 and 10.0.1.0/24. This will make C = 10.0.1.3 => A => 10.0.0.3 => B impossible.

@fangfufu (Collaborator) commented Jun 25, 2021

@gsliepen, I think this pull request is an excellent feature.

@gsliepen (Owner) left a review comment

There is an issue with this patch: the first time an address is looked up, the result of the lookup is cached. That means that if the RTT changes, packets will still go to the node found in the first attempt.

Another issue is that tinc only tries to establish UDP communication when there is traffic to a given node. This means there is a chicken-and-egg problem. In the beginning, all candidates for a given address might have a udp_ping_rtt of -1, so it will just choose whichever node happened to come last, and then it will try to establish UDP with that node, but not with any of the other candidates, so the other candidates will never be preferred.

Finally, the original loop quit when the first matching subnet was found, and subnets are sorted by their prefixlength. However, with this change, you will keep going through all the subnets looking for a matching one with the lowest RTT. But that means that if you have two subnets, say 192.168.1.1/32 and 192.168.1.0/24, a lookup for 192.168.1.1 might choose the latter subnet, which is wrong.
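
A sketch of one way to handle that third point, assuming the subnet list is sorted with more specific prefixes first (illustrative types, not tinc's real subnet_t): only compare RTTs among matches sharing the longest matching prefix, and stop before considering a less specific match.

#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct node {
    int udp_ping_rtt;    /* -1 if unknown */
    bool reachable;
} node_t;

typedef struct subnet {
    uint32_t address;
    int prefixlength;
    node_t *owner;
    struct subnet *next; /* assumed sorted: longer prefixes first */
} subnet_t;

static bool matches(uint32_t addr, const subnet_t *s) {
    uint32_t mask = s->prefixlength ? ~0u << (32 - s->prefixlength) : 0;
    return (addr & mask) == (s->address & mask);
}

/* Never fall through to a less specific subnet just because it has a lower RTT. */
subnet_t *lookup_lowest_rtt(subnet_t *list, uint32_t addr) {
    subnet_t *best = NULL;
    int best_rtt = INT_MAX;

    for(subnet_t *s = list; s; s = s->next) {
        if(!matches(addr, s)) {
            continue;
        }

        if(best && s->prefixlength < best->prefixlength) {
            break;    /* a match exists at a longer prefix; ignore shorter ones */
        }

        int rtt = (s->owner && s->owner->reachable && s->owner->udp_ping_rtt >= 0)
                  ? s->owner->udp_ping_rtt : INT_MAX - 1;

        if(rtt < best_rtt) {
            best_rtt = rtt;
            best = s;
        }
    }

    return best;
}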

@wolfy1339 force-pushed the 1.1-anycast branch 2 times, most recently from 103d75f to 279c281, on June 28, 2021

@splitice (Contributor) commented:

Regarding the cache, I think it's perfectly acceptable to employ a heavy level of route caching. Not only is the selected route something that should not change frequently; route switches in response to small changes in network conditions should be avoided, as many protocols (even those designed to handle such situations) will become unstable.

@fangfufu added the 1.1, enhancement, and changes_requested labels on Jun 30, 2021

@gsliepen (Owner) commented:

Frequent route switches should indeed be averted, but there's no telling how long the route will stay cached, and thus it's rather unpredictable when, if ever, it will switch to a route with a lower RTT. A possible solution is to just flush the cache at regular intervals. Also, the second and third points I mentioned are still issues that should be solved.
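
A minimal sketch of that interval-based flush, with hypothetical names (the real cache lives in src/subnet.c and its invalidation hooks differ):

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the real subnet cache invalidation. */
static void flush_route_cache(void) {
    puts("route cache flushed");
}

#define CACHE_MAX_AGE 60 /* seconds; would be a configuration option */

static time_t cache_stamp;

/* Called before every cached lookup: if the cache is older than CACHE_MAX_AGE,
   drop it so a cached node with a now-worse RTT can eventually be replaced. */
static void maybe_flush_cache(time_t now) {
    if(now - cache_stamp >= CACHE_MAX_AGE) {
        flush_route_cache();
        cache_stamp = now;
    }
}

int main(void) {
    maybe_flush_cache(time(NULL));
    return 0;
}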

@iczero marked this pull request as a draft on July 3, 2021

@splitice (Contributor) commented Jul 7, 2021

How does this PR affect relayed connections?

i.e.

A has 10.0.0.1/24
B has 10.0.0.2/24
C has 10.0.0.3/24
D has 10.0.0.4/24

i.e. where A is not connected to D, but B and C are, and A is connected to both B and C. Will this PR choose an improved route, e.g. A->B->D, if the route via B has a lower RTT than the route via C?

@gsliepen (Owner) left a review comment

There still needs to be some way to refresh the cache when RTTs change. It doesn't have to happen frequently.

src/subnet.c (review thread, resolved)

@iczero (Contributor, Author) commented Jul 11, 2021

Apparently GitHub ate my previous comment... I currently do not have time to work on this, but I will update the PR when I do.

Regarding the current code, it seems it could work better if there were a separate table containing all routes announced by more than one node (anycast routes), with the related work (UDP probes, lookups for routes flagged as anycast, etc.) then done from there.

@splitice this PR will not affect that case; it only picks the lowest-RTT node among nodes advertising the same subnet.

Please ignore the rebases for now. They do not introduce changes.

@splitice (Contributor) commented Jul 11, 2021

@iczero Have you thought about approaching this in a way that could be used for general-purpose routing? You might be able to kill a few birds with one PR too.

Perhaps each node advertises the subnets (routes) it knows, along with its best RTT to each, to its known nodes (the route cache can be used, and as of #270 it will not be cleared as frequently). On receipt, simply update the splay entry and (if there is no conflict, or the subnet already exists in the cache) the route cache on the receiving node. The routing algorithm could then simply select the lowest RTT stored in the splay tree on a cache miss.

This would be predictable, in that each node would be expected to supply regular updates and to invalidate the cache accordingly. There would still be some room for improvement by further reducing over-clearing of the subnet cache, and perhaps by linking the cache more closely with the splay tree to avoid regular splay searches on these updates.

To rehash my example from before:

NODE A knows: NODE B (100ms), NODE C (100ms)
NODE B knows: NODE A (100ms), NODE C (10ms), NODE D (100ms)
NODE C knows: NODE A (100ms), NODE B (10ms), NODE D (200ms)
NODE D knows: NODE B (100ms), NODE C (200ms)

In this case, node A, having received updates from both B and C, would know that its best path to D is via B. This would also work for your anycast case.

This would also resolve #45 (not of interest to me, but I do understand why it's interesting for other use cases), and the regular updates could be used to provide accurate RTT values.

@gsliepen (Owner) commented:

[...] This would be predictable in that each node would be expected to supply regular updates [...]

This is going to cause a lot of traffic in large meshes. Also, finding the path with the lowest total RTT if direct communication with a node is not possible would require running the graph algorithms more frequently as well, and the sum of measured RTTs along a path might be a poor indication of the actual RTT since it doesn't include the overhead of forwarding packets by all the intermediate nodes.

We have to balance some things here. Perfect path selection requires continuous measurement of RTT along all possible paths, which is intractable on meshes with many nodes. So we should forget about this. We currently have some knowledge of RTTs, but for paths not taken the RTT is currently not updated. So I think the approach should be that if we have traffic to a Subnet with multiple possible nodes, we keep the one we are currently using, and then we regularly try to improve the selection by probing the RTT of one of the other possible nodes (using try_tx(candidate, false)), until we get an RTT that is lower than the current RTT of the currently used node. Then we switch.
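
A sketch of that probe-and-switch idea under the assumptions above; probe_candidate() stands in for try_tx(candidate, false), and the candidate bookkeeping is invented for illustration:

#include <limits.h>
#include <stddef.h>

typedef struct node {
    int udp_ping_rtt;    /* -1 until a UDP probe has completed */
} node_t;

/* In tinc this would be try_tx(n, false); here it is just a placeholder. */
static void probe_candidate(node_t *n) {
    (void)n;
}

/* Called whenever we are about to send a probe for this Subnet: probe one
   alternative candidate (round-robin) and switch if it turned out faster.
   candidates[] holds all nodes advertising the Subnet, including current. */
node_t *improve_selection(node_t *current, node_t **candidates, size_t n, size_t *next) {
    if(n == 0) {
        return current;
    }

    node_t *cand = candidates[*next % n];
    *next += 1;

    if(cand != current) {
        probe_candidate(cand);   /* the RTT result arrives asynchronously */
    }

    int cur_rtt = (current && current->udp_ping_rtt >= 0) ? current->udp_ping_rtt : INT_MAX;

    if(cand != current && cand->udp_ping_rtt >= 0 && cand->udp_ping_rtt < cur_rtt) {
        return cand;             /* a previously probed candidate beat the current node */
    }

    return current;
}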

@splitice (Contributor) commented:

@gsliepen what would be too much traffic in your opinion?

Perhaps:

Each node sends a packet to each of its directly connected nodes with the [node, rtt] of each subnet that has at least one route cached (an active check). I was thinking subnets originally, but a consistent (32-bit?) node ID might be better. With the packet just being a list of these, and the node being an efficiently stored ID (is there an ID that could be used, or would it need to be the public address of the node?), you could fit at least 200 updated routes per packet. The update cycle could be limited to one full packet if needed, or configuration options could exist to limit the frequency of these updates.

Assuming that each node has 200 active peers at a given time (that's a lot), that would be one full 1,500-byte packet every 30 seconds sent to each of them and received from each of those active peers. On a 30-second interval, and assuming all these numbers, we would be looking at a ballpark of 10 kB/s rx and tx at that large scale. That's on the same scale as the TCP keepalives (if TCP connections are in use) and likely just as negligible.
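
For reference, the arithmetic behind that estimate, together with a hypothetical 6-byte [node, rtt] record (neither the record layout nor the figures are anything tinc defines today):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical wire format: a 32-bit node ID plus an RTT in milliseconds,
   6 bytes per record when packed. */
typedef struct rtt_update {
    uint32_t node_id;
    uint16_t rtt_ms;
} rtt_update_t;

int main(void) {
    const double record_bytes = 6;   /* packed on-the-wire size of rtt_update_t */
    const double mtu = 1500;
    const double peers = 200;        /* the pessimistic assumption above */
    const double interval_s = 30;

    printf("records per packet: ~%.0f\n", mtu / record_bytes);   /* ~250 */
    printf("per direction: %.0f bytes/s = %.0f kB/s = %.0f kbit/s\n",
           peers * mtu / interval_s,                              /* 10000 */
           peers * mtu / interval_s / 1000,                       /* 10 */
           peers * mtu / interval_s * 8 / 1000);                  /* 80 */
    return 0;
}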

For smaller networks, the bandwidth cost of something like this would be minuscule and comparable to a regular tunnel check operation.

Sending these from each node to all of its active peers every, say, 30 seconds with jitter (best made configurable) wouldn't be substantial. Sending these updates to all peers (not just active ones) would be a nice feature that would also resolve #45 (either as a configuration option or as an option to do this at a reduced rate) and could be looked at.

As someone who operates a larger network, I wouldn't have an issue with 10 kB/s of usage in tunnel upkeep (or even much more). I'm much more irritated when my traffic is hauled around the world twice.

(using try_tx(candidate, false)), until we get an RTT that is lower than the current RTT of the currently used node. Then we switch.

That sounds good but on larger networks it could take a VERY long time to find the best candidate if 200 connections have to be tested randomly.

I don't dislike the idea entirely (I do like its seeming simplicity).

Some problems I see & thoughts:

  • The test of the new route couldn't use actual packet data, or reordering could occur?
  • How do we know to do these tests (as opposed to using the current "best" route)?
  • Can the route search be ordered to improve the likelihood of finding the best route in the first, say, 10 routes, or are we stuck with random order?
  • By being passive, it's not able to react cleanly to short interruptions between nodes, which is a common use case for forwarding
  • It would be a really good idea to improve the caching so that both the current active route and the secondary fallback route are maintained. This way, if a route goes back up or is flapping, the best relayed route is still known and the search for a good route does not have to begin again.
  • Do we need to search one node at a time? Could the RTT search cover N nodes at a time?

Possibly difficult, but perhaps on each route add (including relayed routes) each tinc node pings the node through that relay and generates an RTT to use. This RTT can be cached until it's next updated.

@gsliepen (Owner) commented:

@gsliepen what would be too much traffic in your opinion?

We want to limit background traffic as much as possible. We need some keepalive packets for the meta-connections, and we have on-demand UDP probe packets (i.e., probe packets are only sent while actual VPN packets are being exchanged between two nodes). Probing for better routes should also be on-demand.

Each node sends a packet to each of its directly connected nodes with the [node, rtt] of each subnet that has at least one route cached (an active check). I was thinking subnets originally, but a consistent (32-bit?) node ID might be better. With the packet just being a list of these, and the node being an efficiently stored ID (is there an ID that could be used, or would it need to be the public address of the node?), you could fit at least 200 updated routes per packet. The update cycle could be limited to one full packet if needed, or configuration options could exist to limit the frequency of these updates.

Sure, but that's just exchanging already known information. The problem is that unless nodes are exchanging data with each other, there is no information to share, because RTT times are only measured via the UDP probes, which are only sent on-demand. Consider nodes A, B, C and D. A, B, C all announce Subnet = 192.168.1.0/24, and then D wants to send a packet to 192.168.1.1. If it picks a random node, say A, it will learn the RTT of A, but then that is the best known RTT. The only way to learn more is if D starts actively probing the RTT to B and C. RTT updates from other nodes don't help at all here.

Note that tinc tries to avoid hop-by-hop routing; it always prefers to send packets directly to their destination. So if you have nodes A <-> B <-> C, then A tries to send directly to C via UDP, and knowing the RTT between A and B or B and C is not going to help, unless it couldn't send packets directly to C. But that would be the exceptional case.

Assuming that each node has 200 active peers at a given time (that's a lot), that would be one full 1,500-byte packet every 30 seconds sent to each of them and received from each of those active peers. On a 30-second interval, and assuming all these numbers, we would be looking at a ballpark of 10 kB/s rx and tx at that large scale. That's on the same scale as the TCP keepalives (if TCP connections are in use) and likely just as negligible.
[...]
As someone who operates a larger network, I wouldn't have an issue with 10 kB/s of usage in tunnel upkeep (or even much more). I'm much more irritated when my traffic is hauled around the world twice.

Uhm no, the PING/PONG requests are only sent once a minute by default and are more like 50 bytes in size. If you mean TCP's own keepalive packets, then those are disabled by default, and if enabled, the default time before it sends a packet on an idle connection is 2 hours. 10 kB/s is already quite a lot for some users, especially those that pay for bandwidth.

(using try_tx(candidate, false)), until we get an RTT that is lower than the current RTT of the currently used node. Then we switch.

That sounds good but on larger networks it could take a VERY long time to find the best candidate if 200 connections have to be tested randomly.

But we don't have to do that. If we send a packet for a Subnet that is advertized by multiple nodes, only then do we need to ensure we also probe the other nodes that advertize that Subnet; we don't need to test all nodes in the mesh.

Some problems I see & thoughts:

* The test of the new route couldn't use actual packet data, or reordering could occur?

The new routes should use the UDP probe packets first. These will also help punch holes through NAT. Actual data packets should go to the known working node unless we are sure the new route is viable and better.

* How do we know to do these tests (as opposed to using the current "best" route)?

We should detect when a Subnet is advertized by multiple nodes and set some flag, so when we send a packet to that Subnet and it's time to send a probe, we also probe one of the other candidates. That could be just a random pick, or round-robin.
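
A sketch of that detection step, with invented structures (tinc's real subnet_t differs): walk the sorted subnet list once and flag entries whose exact subnet is announced by more than one node, so the probe path knows when alternatives exist.

#include <stdbool.h>
#include <stdint.h>

typedef struct subnet {
    uint32_t address;
    int prefixlength;
    struct node *owner;
    bool multiple_owners;    /* set when another node announces the same subnet */
    struct subnet *next;     /* assumed sorted, so identical subnets are adjacent */
} subnet_t;

static bool same_subnet(const subnet_t *a, const subnet_t *b) {
    return a->prefixlength == b->prefixlength && a->address == b->address;
}

/* Re-run whenever the subnet list changes (node joins/leaves, subnet add/del). */
void mark_anycast_subnets(subnet_t *list) {
    for(subnet_t *s = list; s; s = s->next) {
        s->multiple_owners = false;
    }

    for(subnet_t *s = list; s && s->next; s = s->next) {
        if(same_subnet(s, s->next) && s->owner != s->next->owner) {
            s->multiple_owners = true;
            s->next->multiple_owners = true;
        }
    }
}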

* Can the route search be ordered to improve the likelihood of finding the best route in the first, say, 10 routes, or are we stuck with random order?

You mean finding the best route in the first 10 seconds?

* By being passive, it's not able to react cleanly to short interruptions between nodes, which is a common use case for forwarding

It shouldn't be passive, it should react to actual demand for VPN traffic.

* It would be a really good idea to improve the caching so that both the current active route and the secondary fallback route are maintained. This way, if a route goes back up or is flapping, the best relayed route is still known and the search for a good route does not have to begin again.

Well if we probe other candidates regularly anyway, then we should have reasonably up-to-date RTT information for all candidates. So if one goes down or suddenly gets a bad RTT, then we just flush the bad entry from the cache, and then the cache will be repopulated with the remaining best candidate.

* Do we need to search one node at a time? Could the RTT search cover N nodes at a time?

We could indeed do that.

@splitice (Contributor) commented:

We want to limit background traffic as much as possible. We need some keepalive packets for the meta-connections, and we have on-demand UDP probe packets (i.e., probe packets are only sent while actual VPN packets are being exchanged between two nodes). Probing for better routes should also be on-demand.

Wouldn't the liveness be covered simply by the route being in the cache?

Uhm no, the PING/PONG requests are only sent once a minute by default and are more like 50 bytes in size. If you mean TCP's own keepalive packets, then those are disabled by default, and if enabled, the default time before it sends a packet on an idle connection is 2 hours. 10 kB/s is already quite a lot for some users, especially those that pay for bandwidth.

I'm not suggesting one data point per packet; as many records as possible should be packed into a single packet.

As you have already alluded to, there are already existing interactions that use significantly more traffic than this feature would.

But we don't have to do that. If we send a packet for a Subnet that is advertized by multiple nodes, only then do we need to ensure we also probe the other nodes that advertize that Subnet; we don't need to test all nodes in the mesh.

You do if you want to use the same implementation to also handle smarter routing for relaying (which is where my interest lies). Anycast improvement/support is just a byproduct for me.


What about this as an alternative:

  • The route cache gets two slots per physical route, an active one and a backup one (the key gains an additional member), or an extra member in the struct. This extra slot would hold an "optimal" learnt relayed route.
    • It may be best to store this in the same hash table slot, as a member of the primary entry, instead of as a totally separate slot. TBD.
    • If stored as a separate slot, a backup route has a lower storage priority; an active route should never be cleared to make room for a backup route.
  • Periodically issue source probe requests for the nodes behind $N subnets, where the subnets probed are taken from the route cache (therefore recently active); this could be performed on a periodic interval $T defined in the config. We send these ping requests to randomly selected peer nodes. If the response comes back with a lower RTT than the stored backup route (or no backup route exists), then update the backup route if relayed (or the active route if the node itself provides the subnet, i.e. anycast).
    • When requesting the RTT to a relayer for a child node, the relayer will use its cached RTT if available (and not stale). If the RTT is stale or missing, it will perform a new ping and indicate this with a flag to the requester. In this way there are three main RTT measurements:
    1. ping request to the subnet (where the node provides the subnet): the RTT is determined by the measured time between request and response
    2. ping request to a relayer for the subnet (where the relayer had a non-stale RTT value): the pong reply includes a data field for the cached RTT value, which is added onto the measured time
    3. ping request to a relayer for the subnet (where the relayer had a stale or missing RTT value): the pong reply includes a zero RTT data field; the measured RTT time includes the ping to the child node

This way the backup route is also ready when it's needed (provided that the chosen relay node isn't also offline). That's something I would also like to cover, but with the limitations expressed (no active information transfer) I'm not certain how at this stage. The reason I would like more information transfer is that quite often, when a partial interruption occurs between two points, it's the result of a single transit provider or link, yet it may take out 50%+ of upstreams. Finding a working path (quickly) is currently something tinc does not do well, and I would love it if this were to improve that.

$N can be set to 0 in the config if active "route optimization" (name?) is not desired. Set to non-zero with a regular enough interval, this would also serve to keep hole-punched ports open. Not important for my use cases, but for others it is.

If someone objected to the <1 MB/day (small network) to potentially tens of MB/day (big network) that something like this would use at a reasonable $T interval and $N nodes, they could disable the feature by setting the config option(s) to zero. Assuming that such nodes would also be non-relayers, there would be no difference for them, aside from the odd additional ping message (for their own subnet).
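
A sketch of the three RTT measurement cases listed above, with invented message fields (nothing like this exists in tinc today):

#include <stdbool.h>
#include <stdint.h>

/* Invented pong payload for this proposal: the relayer reports the RTT it has
   cached for the child node, or 0 if it was stale/missing and it pinged the
   child itself before replying. */
typedef struct pong_info {
    uint32_t relayer_cached_rtt_us;
    bool relayed;                    /* false if the probed node owns the subnet */
} pong_info_t;

/* Combine the locally measured request->response time with the pong payload
   to estimate the RTT to the node behind the subnet. */
uint32_t effective_rtt(uint32_t measured_us, const pong_info_t *pong) {
    if(!pong->relayed) {
        return measured_us;                                 /* case 1: direct owner */
    }

    if(pong->relayer_cached_rtt_us > 0) {
        return measured_us + pong->relayer_cached_rtt_us;   /* case 2: cached hop RTT added */
    }

    return measured_us;   /* case 3: the relayer's fresh ping is already included */
}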

@fangfufu (Collaborator) commented:

I think having a solution is better than having no solution. Personally, I wouldn't mind using more bandwidth overall. I would much prefer it if tinc could default to the route with the lowest RTT, even if that means using more bandwidth overall. We could have this feature implemented but turned off by default, so tinc behaves the same as before by default: users don't get an increase in bandwidth usage unless they manually turn the feature on.

If you mean TCP's own keepalive packets, then those are disabled by default, and if enabled, the default time before it sends a packet on an idle connection is 2 hours. 10 kB/s is already quite a lot for some users, especially those that pay for bandwidth.

If the identities of these users are not sensitive information, I would love to know who they are. I think we could do with a testimonial section on tinc's website. I feel people don't realise how useful and versatile tinc is, compared to things like WireGuard and OpenVPN.

The amount of bandwidth people have access to has changed a lot since tinc was originally written (20 years ago), and since tinc 1.1 was first released (10 years ago). Just to give some examples from my personal experience: 10 years ago, video chatting with my parents in China was a struggle probably around half of the time, and the resolution was probably 240p. These days I pretty much expect to see my parents' faces in 720p, or at least 480p. 20 years ago I was on dial-up; "broadband" hadn't reached my country yet. Even 5 years ago I wouldn't have been watching YouTube non-stop on my mobile phone's data plan.

@iczero (Contributor, Author) commented Sep 16, 2021

Fixed the subnet selection issue. The cache seems to be invalidated with graph changes, which may be enough.

        break;
    }

    if(p->owner->status.reachable) {
        int rtt = INT_MAX - 1; // use INT_MAX - 1 as rtt for nodes not directly reachable

@gsliepen (Owner) commented:

A node can be directly reachable via TCP only. That might still be faster than going via UDP through another node. But tinc doesn't measure TCP RTT, unfortunately.

    for splay_each(subnet_t, p, &subnet_tree) {
        if(!p || p->type != SUBNET_IPV4) {
            continue;
        }

        if(!maskcmp(address, &p->net.ipv4.address, p->net.ipv4.prefixlength)) {
            r = p;

@gsliepen (Owner) commented:

I didn't notice this before, but by removing this line you've subtly changed how tinc handles packets routed to unreachable nodes. Before, lookup_subnet_*() would return a non-NULL result even if the subnet found wasn't reachable. Then tinc would send ICMP_NET_UNREACH packets back. But now you'll return NULL, so then tinc will send back ICMP_NET_UNKNOWN.
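
One way to keep the old behaviour would be to record a fallback match regardless of reachability and apply the RTT preference only among reachable owners; a simplified, self-contained sketch (not the actual patch):

#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct owner {
    bool reachable;
    int rtt;    /* -1 if unknown */
} owner_t;

/* Given the candidates matching an address at the most specific prefix, return
   the reachable one with the lowest RTT, but fall back to an unreachable match
   rather than NULL, so the caller can still answer with ICMP_NET_UNREACH
   instead of ICMP_NET_UNKNOWN. */
owner_t *select_owner(owner_t **candidates, size_t n) {
    owner_t *fallback = n ? candidates[0] : NULL;
    owner_t *best = NULL;
    int best_rtt = INT_MAX;

    for(size_t i = 0; i < n; i++) {
        if(!candidates[i]->reachable) {
            continue;
        }

        int rtt = candidates[i]->rtt >= 0 ? candidates[i]->rtt : INT_MAX - 1;

        if(rtt < best_rtt) {
            best_rtt = rtt;
            best = candidates[i];
        }
    }

    return best ? best : fallback;
}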

@iczero (Contributor, Author) commented Mar 3, 2024

Currently this patch is no longer being updated. It is, and likely will continue to be, a pile of hacks. I will close this PR for now.

I apologize for the noise from the rebase.

@iczero closed this on Mar 3, 2024
Labels: 1.1, enhancement
6 participants