Prefer routes with lowest rtt #235

Closed (wants to merge 2 commits)

Conversation

@iczero (Contributor) commented Feb 26, 2020

In situations where two identical routes exist, prefer the node with the lowest rtt.

An implementation of "poor man's anycast" as originally proposed in #194.
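
In essence: among the nodes announcing the same subnet, prefer the reachable one with the lowest measured UDP RTT. A minimal sketch of that selection (illustrative only, not the actual diff; the struct and field names are simplified stand-ins for tinc's internal types):

#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for tinc's node_t. */
typedef struct node {
    int udp_ping_rtt;    /* measured RTT, -1 if unknown */
    bool reachable;
} node_t;

/* Given all owners announcing one identical subnet, pick the reachable
   owner with the lowest known RTT; unknown RTT sorts last. */
node_t *prefer_lowest_rtt(node_t **owners, size_t n) {
    node_t *best = NULL;
    int best_rtt = INT_MAX;

    for(size_t i = 0; i < n; i++) {
        if(!owners[i]->reachable) {
            continue;
        }

        int rtt = owners[i]->udp_ping_rtt >= 0 ? owners[i]->udp_ping_rtt : INT_MAX - 1;

        if(rtt < best_rtt) {
            best_rtt = rtt;
            best = owners[i];
        }
    }

    return best;
}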

@leptonyu (Contributor) commented Apr 8, 2020

I tested it with WireGuard. It seems that this strategy affects how tinc chooses the right endpoint for a node.

I have 3 nodes: A, B, C.

A has the public IP 1.2.3.4.
B and C have no public IP.

I set the tinc subnet to 10.0.0.0/24
and the WireGuard subnet to 10.0.1.0/24.

A has IPs 10.0.0.1, 10.0.1.1, 1.2.3.4
B has IPs 10.0.0.2, 10.0.1.2
C has IPs 10.0.0.3, 10.0.1.3

WireGuard only supports forwarding packets from B to C via A, so B can ping C using 10.0.1.3, and this path is stable.

Tinc can do NAT hole punching via A, so C may end up with a public endpoint such as 2.3.4.5 with port 34562.

C's endpoint 2.3.4.5:34562 is not stable because of NAT timeouts.

With this strategy, I find that tinc node C's endpoint becomes 10.0.1.3:655.

I want this:
C = 2.3.4.5 => B

But the result is:
C = 10.0.1.3 => A => 10.0.0.3 => B

I think this strategy should be controlled by configuration, not applied automatically. Some situations may not fit it.

@wolfy1339 commented Apr 8, 2020

This shouldn't affect the current behaviour; it only applies to situations where two or more nodes advertise the same subnet (what you're referring to as CIDR).

Example:

A has 0.0.0.0/0 and 10.0.0.1/24
B has 0.0.0.0/0 and 10.0.0.2/24
C has 10.0.0.3/24

If C has a lower RTT to A than to B, it will choose to forward those packets to A. If B has a better RTT than A, C will forward the packets to B.

@iczero (Contributor, Author) commented Apr 8, 2020

This PR does not affect the behavior you describe. Tinc already does this automatically. This PR only affects which node packets are sent to, not how.

@leptonyu (Contributor) commented Apr 9, 2020

Yeah, you are right, @iczero @wolfy1339.

I should disable forwarding between 10.0.0.0/24 and 10.0.1.0/24. This will make C = 10.0.1.3 => A => 10.0.0.3 => B impossible.

@fangfufu (Collaborator) commented Jun 25, 2021

@gsliepen, I think this pull request is an excellent feature.

@gsliepen (Owner) left a review comment

There is an issue with this patch: the first time an address is looked up, the result of the lookup is cached. That means that if the RTT changes, packets will still go to the node found in the first attempt.

Another issue is that tinc only tries to establish UDP communication when there is traffic to a given node. This means there is a chicken-and-egg problem. In the beginning, all candidates for a given address might have a udp_ping_rtt of -1, so it will just choose whichever node happened to come last, and then it will try to establish UDP with that node, but not with any of the other candidates, so the other candidates will never be preferred.

Finally, the original loop quit when the first matching subnet was found, and subnets are sorted by their prefixlength. However, with this change, you will keep going through all the subnets looking for a matching one with the lowest RTT. But that means that if you have two subnets, say 192.168.1.1/32 and 192.168.1.0/24, a lookup for 192.168.1.1 might choose the latter subnet, which is wrong.
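
A sketch of one way to handle that third point, assuming the subnet list is sorted with more specific prefixes first (illustrative types, not tinc's real subnet_t): only compare RTTs among matches sharing the longest matching prefix, and stop before considering a less specific match.

#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct node {
    int udp_ping_rtt;    /* -1 if unknown */
    bool reachable;
} node_t;

typedef struct subnet {
    uint32_t address;
    int prefixlength;
    node_t *owner;
    struct subnet *next; /* assumed sorted: longer prefixes first */
} subnet_t;

static bool matches(uint32_t addr, const subnet_t *s) {
    uint32_t mask = s->prefixlength ? ~0u << (32 - s->prefixlength) : 0;
    return (addr & mask) == (s->address & mask);
}

/* Never fall through to a less specific subnet just because it has a lower RTT. */
subnet_t *lookup_lowest_rtt(subnet_t *list, uint32_t addr) {
    subnet_t *best = NULL;
    int best_rtt = INT_MAX;

    for(subnet_t *s = list; s; s = s->next) {
        if(!matches(addr, s)) {
            continue;
        }

        if(best && s->prefixlength < best->prefixlength) {
            break;    /* a match exists at a longer prefix; ignore shorter ones */
        }

        int rtt = (s->owner && s->owner->reachable && s->owner->udp_ping_rtt >= 0)
                  ? s->owner->udp_ping_rtt : INT_MAX - 1;

        if(rtt < best_rtt) {
            best_rtt = rtt;
            best = s;
        }
    }

    return best;
}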

@wolfy1339 force-pushed the 1.1-anycast branch 2 times, most recently from 103d75f to 279c281, on June 28, 2021

@splitice (Contributor) commented:

Regarding the cache, I think it's perfectly acceptable to employ a heavy level of route caching. Not only is the selected route something that should not change frequently; route switches in response to small changes in network conditions should be avoided, as many protocols (even those designed to handle such situations) will become unstable.

@fangfufu added the 1.1, enhancement, and changes_requested labels on Jun 30, 2021

@gsliepen (Owner) commented:

Frequent route switches should indeed be averted, but there's no telling how long the route will stay cached, and thus it's rather unpredictable when, if ever, it will switch to a route with a lower RTT. A possible solution is to just flush the cache at regular intervals. Also, the second and third points I mentioned are still issues that should be solved.
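
A minimal sketch of that interval-based flush, with hypothetical names (the real cache lives in src/subnet.c and its invalidation hooks differ):

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the real subnet cache invalidation. */
static void flush_route_cache(void) {
    puts("route cache flushed");
}

#define CACHE_MAX_AGE 60 /* seconds; would be a configuration option */

static time_t cache_stamp;

/* Called before every cached lookup: if the cache is older than CACHE_MAX_AGE,
   drop it so a cached node with a now-worse RTT can eventually be replaced. */
static void maybe_flush_cache(time_t now) {
    if(now - cache_stamp >= CACHE_MAX_AGE) {
        flush_route_cache();
        cache_stamp = now;
    }
}

int main(void) {
    maybe_flush_cache(time(NULL));
    return 0;
}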

@iczero marked this pull request as a draft on July 3, 2021

@splitice (Contributor) commented Jul 7, 2021

How does this PR affect relayed connections?

i.e.

A has 10.0.0.1/24
B has 10.0.0.2/24
C has 10.0.0.3/24
D has 10.0.0.4/24

i.e. where A is not connected to D, but B and C are, and A is connected to both B and C. Will this PR choose an improved route, e.g. A->B->D, if the route via B has a lower RTT than the route via C?

@gsliepen (Owner) left a review comment

There still needs to be some way to refresh the cache when RTTs change. It doesn't have to happen frequently.

src/subnet.c (review thread, resolved)

@iczero (Contributor, Author) commented Jul 11, 2021

Apparently GitHub ate my previous comment... I currently do not have time to work on this, but I will update the PR when I do.

Regarding the current code, it seems it could work better if there were a separate table containing all routes announced by more than one node (anycast routes), with the related work (UDP probes, lookups for routes flagged as anycast, etc.) then done from there.

@splitice this PR will not affect that case; it only picks the lowest-RTT node among nodes advertising the same subnet.

Please ignore the rebases for now. They do not introduce changes.

@splitice (Contributor) commented Jul 11, 2021

@iczero Have you thought about approaching this in a way that could be used for general-purpose routing? You might be able to kill a few birds with one PR too.

Perhaps each node advertises the subnets (routes) it knows, along with its best RTT to each, to its known nodes (the route cache can be used, and as of #270 it will not be cleared as frequently). On receipt, simply update the splay entry and (if there is no conflict, or the subnet already exists in the cache) the route cache on the receiving node. The routing algorithm could then simply select the lowest RTT stored in the splay tree on a cache miss.

This would be predictable, in that each node would be expected to supply regular updates and to invalidate the cache accordingly. There would still be some room for improvement by further reducing over-clearing of the subnet cache, and perhaps by linking the cache more closely with the splay tree to avoid regular splay searches on these updates.

To rehash my example from before:

NODE A knows: NODE B (100ms), NODE C (100ms)
NODE B knows: NODE A (100ms), NODE C (10ms), NODE D (100ms)
NODE C knows: NODE A (100ms), NODE B (10ms), NODE D (200ms)
NODE D knows: NODE B (100ms), NODE C (200ms)

In this case, node A, having received updates from both B and C, would know that its best path to D is via B. This would also work for your anycast case.

This would also resolve #45 (not of interest to me, but I do understand why it's interesting for other use cases), and the regular updates could be used to provide accurate RTT values.

@gsliepen (Owner) commented:

[...] This would be predictable in that each node would be expected to supply regular updates [...]

This is going to cause a lot of traffic in large meshes. Also, finding the path with the lowest total RTT if direct communication with a node is not possible would require running the graph algorithms more frequently as well, and the sum of measured RTTs along a path might be a poor indication of the actual RTT since it doesn't include the overhead of forwarding packets by all the intermediate nodes.

We have to balance some things here. Perfect path selection requires continuous measurement of RTT along all possible paths, which is intractable on meshes with many nodes. So we should forget about this. We currently have some knowledge of RTTs, but for paths not taken the RTT is currently not updated. So I think the approach should be that if we have traffic to a Subnet with multiple possible nodes, we keep the one we are currently using, and then we regularly try to improve the selection by probing the RTT of one of the other possible nodes (using try_tx(candidate, false)), until we get an RTT that is lower than the current RTT of the currently used node. Then we switch.
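
A sketch of that probe-and-switch idea under the assumptions above; probe_candidate() stands in for try_tx(candidate, false), and the candidate bookkeeping is invented for illustration:

#include <limits.h>
#include <stddef.h>

typedef struct node {
    int udp_ping_rtt;    /* -1 until a UDP probe has completed */
} node_t;

/* In tinc this would be try_tx(n, false); here it is just a placeholder. */
static void probe_candidate(node_t *n) {
    (void)n;
}

/* Called whenever we are about to send a probe for this Subnet: probe one
   alternative candidate (round-robin) and switch if it turned out faster.
   candidates[] holds all nodes advertising the Subnet, including current. */
node_t *improve_selection(node_t *current, node_t **candidates, size_t n, size_t *next) {
    if(n == 0) {
        return current;
    }

    node_t *cand = candidates[*next % n];
    *next += 1;

    if(cand != current) {
        probe_candidate(cand);   /* the RTT result arrives asynchronously */
    }

    int cur_rtt = (current && current->udp_ping_rtt >= 0) ? current->udp_ping_rtt : INT_MAX;

    if(cand != current && cand->udp_ping_rtt >= 0 && cand->udp_ping_rtt < cur_rtt) {
        return cand;             /* a previously probed candidate beat the current node */
    }

    return current;
}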

@splitice (Contributor) commented:

@gsliepen what would be too much traffic in your opinion?

Perhaps:

Each node sends a packet to each of its directly connected nodes with the [node, rtt] of each subnet that has at least one route cached (an active check). I was thinking subnets originally, but a consistent (32-bit?) node ID might be better. With the packet just being a list of these, and the node being an efficiently stored ID (is there an ID that could be used, or would it need to be the public address of the node?), you could fit at least 200 updated routes per packet. The update cycle could be limited to one full packet if needed, or configuration options could exist to limit the frequency of these updates.

Assuming that each node has 200 active peers at a given time (that's a lot), that would be one full 1,500-byte packet every 30 seconds sent to each of them and received from each of those active peers. On a 30-second interval, and assuming all these numbers, we would be looking at a ballpark of 10 kB/s rx and tx at that large scale. That's on the same scale as the TCP keepalives (if TCP connections are in use) and likely just as negligible.
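
For reference, the arithmetic behind that estimate, together with a hypothetical 6-byte [node, rtt] record (neither the record layout nor the figures are anything tinc defines today):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical wire format: a 32-bit node ID plus an RTT in milliseconds,
   6 bytes per record when packed. */
typedef struct rtt_update {
    uint32_t node_id;
    uint16_t rtt_ms;
} rtt_update_t;

int main(void) {
    const double record_bytes = 6;   /* packed on-the-wire size of rtt_update_t */
    const double mtu = 1500;
    const double peers = 200;        /* the pessimistic assumption above */
    const double interval_s = 30;

    printf("records per packet: ~%.0f\n", mtu / record_bytes);   /* ~250 */
    printf("per direction: %.0f bytes/s = %.0f kB/s = %.0f kbit/s\n",
           peers * mtu / interval_s,                              /* 10000 */
           peers * mtu / interval_s / 1000,                       /* 10 */
           peers * mtu / interval_s * 8 / 1000);                  /* 80 */
    return 0;
}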

For smaller networks, the bandwidth cost of something like this would be minuscule and comparable to a regular tunnel check operation.

Sending these from each node to all of its active peers every, say, 30 seconds with jitter (best made configurable) wouldn't be substantial. Sending these updates to all peers (not just active ones) would be a nice feature that would also resolve #45 (either as a configuration option or as an option to do this at a reduced rate) and could be looked at.

As someone who operates a larger network, I wouldn't have an issue with 10 kB/s of usage in tunnel upkeep (or even much more). I'm much more irritated when my traffic is hauled around the world twice.

(using try_tx(candidate, false)), until we get an RTT that is lower than the current RTT of the currently used node. Then we switch.

That sounds good but on larger networks it could take a VERY long time to find the best candidate if 200 connections have to be tested randomly.

I don't dislike the idea entirely (I do like its seeming simplicity).

Some problems I see & thoughts:

  • The test of the new route couldn't use actual packet data, or reordering could occur?
  • How do we know to do these tests (as opposed to using the current "best" route)?
  • Can the route search be ordered to improve the likelihood of finding the best route in the first, say, 10 routes, or are we stuck with random order?
  • By being passive, it's not able to react cleanly to short interruptions between nodes, which is a common use case for forwarding
  • It would be a really good idea to improve the caching so that both the current active route and the secondary fallback route are maintained. This way, if a route goes back up or is flapping, the best relayed route is still known and the search for a good route does not have to begin again.
  • Do we need to search one node at a time? Could the RTT search cover N nodes at a time?

Possibly difficult, but perhaps on each route add (including relayed routes) each tinc node pings the node through that relay and generates an RTT to use. This RTT can be cached until it's next updated.

@gsliepen (Owner) commented:

@gsliepen what would be too much traffic in your opinion?

We want to limit background traffic as much as possible. We need some keepalive packets for the meta-connections, and we have on-demand UDP probe packets (i.e., probe packets are only sent while actual VPN packets are being exchanged between two nodes). Probing for better routes should also be on-demand.

Each node sends a packet to each of its directly connected nodes with the [node, rtt] of each subnet that has at least one route cached (an active check). I was thinking subnets originally, but a consistent (32-bit?) node ID might be better. With the packet just being a list of these, and the node being an efficiently stored ID (is there an ID that could be used, or would it need to be the public address of the node?), you could fit at least 200 updated routes per packet. The update cycle could be limited to one full packet if needed, or configuration options could exist to limit the frequency of these updates.

Sure, but that's just exchanging already known information. The problem is that unless nodes are exchanging data with each other, there is no information to share, because RTT times are only measured via the UDP probes, which are only sent on-demand. Consider nodes A, B, C and D. A, B, C all announce Subnet = 192.168.1.0/24, and then D wants to send a packet to 192.168.1.1. If it picks a random node, say A, it will learn the RTT of A, but then that is the best known RTT. The only way to learn more is if D starts actively probing the RTT to B and C. RTT updates from other nodes don't help at all here.

Note that tinc tries to avoid hop-by-hop routing; it always prefers to send packets directly to their destination. So if you have nodes A <-> B <-> C, then A tries to send directly to C via UDP, and knowing the RTT between A and B or B and C is not going to help, unless it couldn't send packets directly to C. But that would be the exceptional case.

Assuming that each node has 200 active peers at a given time (that's a lot), that would be one full 1,500-byte packet every 30 seconds sent to each of them and received from each of those active peers. On a 30-second interval, and assuming all these numbers, we would be looking at a ballpark of 10 kB/s rx and tx at that large scale. That's on the same scale as the TCP keepalives (if TCP connections are in use) and likely just as negligible.
[...]
As someone who operates a larger network, I wouldn't have an issue with 10 kB/s of usage in tunnel upkeep (or even much more). I'm much more irritated when my traffic is hauled around the world twice.

Uhm no, the PING/PONG requests are only sent once a minute by default and are more like 50 bytes in size. If you mean TCP's own keepalive packets, then those are disabled by default, and if enabled, the default time before it sends a packet on an idle connection is 2 hours. 10 kB/s is already quite a lot for some users, especially those that pay for bandwidth.

(using try_tx(candidate, false)), until we get an RTT that is lower than the current RTT of the currently used node. Then we switch.

That sounds good but on larger networks it could take a VERY long time to find the best candidate if 200 connections have to be tested randomly.

But we don't have to do that. If we send a packet for a Subnet that is advertized by multiple nodes, only then do we need to ensure we also probe the other nodes that advertize that Subnet; we don't need to test all nodes in the mesh.

Some problems I see & thoughts:

* The test of the new route couldn't use actual packet data, or reordering could occur?

The new routes should use the UDP probe packets first. These will also help punch holes through NAT. Actual data packets should go to the known working node unless we are sure the new route is viable and better.

* How do we know to do these tests (as opposed to using the current "best" route)?

We should detect when a Subnet is advertized by multiple nodes and set some flag, so when we send a packet to that Subnet and it's time to send a probe, we also probe one of the other candidates. That could be just a random pick, or round-robin.
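
A sketch of that detection step, with invented structures (tinc's real subnet_t differs): walk the sorted subnet list once and flag entries whose exact subnet is announced by more than one node, so the probe path knows when alternatives exist.

#include <stdbool.h>
#include <stdint.h>

typedef struct subnet {
    uint32_t address;
    int prefixlength;
    struct node *owner;
    bool multiple_owners;    /* set when another node announces the same subnet */
    struct subnet *next;     /* assumed sorted, so identical subnets are adjacent */
} subnet_t;

static bool same_subnet(const subnet_t *a, const subnet_t *b) {
    return a->prefixlength == b->prefixlength && a->address == b->address;
}

/* Re-run whenever the subnet list changes (node joins/leaves, subnet add/del). */
void mark_anycast_subnets(subnet_t *list) {
    for(subnet_t *s = list; s; s = s->next) {
        s->multiple_owners = false;
    }

    for(subnet_t *s = list; s && s->next; s = s->next) {
        if(same_subnet(s, s->next) && s->owner != s->next->owner) {
            s->multiple_owners = true;
            s->next->multiple_owners = true;
        }
    }
}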

* Can the route search be ordered to improve the likelihood of finding the best route in the first, say, 10 routes, or are we stuck with random order?

You mean finding the best route in the first 10 seconds?

* By being passive, it's not able to react cleanly to short interruptions between nodes, which is a common use case for forwarding

It shouldn't be passive, it should react to actual demand for VPN traffic.

* It would be a really good idea to improve the caching so that both the current active route and the secondary fallback route are maintained. This way, if a route goes back up or is flapping, the best relayed route is still known and the search for a good route does not have to begin again.

Well if we probe other candidates regularly anyway, then we should have reasonably up-to-date RTT information for all candidates. So if one goes down or suddenly gets a bad RTT, then we just flush the bad entry from the cache, and then the cache will be repopulated with the remaining best candidate.

* Do we need to search one node at a time? Could the RTT search cover N nodes at a time?

We could indeed do that.

@splitice (Contributor) commented:

We want to limit background traffic as much as possible. We need some keepalive packets for the meta-connections, and we have on-demand UDP probe packets (i.e., probe packets are only sent while actual VPN packets are being exchanged between two nodes). Probing for better routes should also be on-demand.

Wouldn't the liveness be covered simply by the route being in the cache?

Uhm no, the PING/PONG requests are only sent once a minute by default and are more like 50 bytes in size. If you mean TCP's own keepalive packets, then those are disabled by default, and if enabled, the default time before it sends a packet on an idle connection is 2 hours. 10 kB/s is already quite a lot for some users, especially those that pay for bandwidth.

I'm not suggesting one data point per packet; as many records as possible should be packed into a single packet.

As you have already alluded to, there are already existing interactions that use significantly more traffic than this feature would.

But we don't have to do that. If we send a packet for a Subnet that is advertized by multiple nodes, only then do we need to ensure we also probe the other nodes that advertize that Subnet; we don't need to test all nodes in the mesh.

You do if you want to use the same implementation to also handle smarter routing for relaying (which is where my interest lies). Anycast improvement/support is just a byproduct for me.


What about this as an alternative:

  • The route cache gets two slots per physical route, an active one and a backup one (the key gains an additional member), or an extra member in the struct. This extra slot would hold an "optimal" learnt relayed route.
    • It may be best to store this in the same hash table slot, as a member of the primary entry, instead of as a totally separate slot. TBD.
    • If stored as a separate slot, a backup route has a lower storage priority; an active route should never be cleared to make room for a backup route.
  • Periodically issue source probe requests for the nodes behind $N subnets, where the subnets probed are taken from the route cache (therefore recently active); this could be performed on a periodic interval $T defined in the config. We send these ping requests to randomly selected peer nodes. If the response comes back with a lower RTT than the stored backup route (or no backup route exists), then update the backup route if relayed (or the active route if the node itself provides the subnet, i.e. anycast).
    • When requesting the RTT to a relayer for a child node, the relayer will use its cached RTT if available (and not stale). If the RTT is stale or missing, it will perform a new ping and indicate this with a flag to the requester. In this way there are three main RTT measurements:
    1. ping request to the subnet (where the node provides the subnet): the RTT is determined by the measured time between request and response
    2. ping request to a relayer for the subnet (where the relayer had a non-stale RTT value): the pong reply includes a data field for the cached RTT value, which is added onto the measured time
    3. ping request to a relayer for the subnet (where the relayer had a stale or missing RTT value): the pong reply includes a zero RTT data field; the measured RTT time includes the ping to the child node

This way the backup route is also ready when it's needed (provided that the chosen relay node isn't also offline). That's something I would also like to cover, but with the limitations expressed (no active information transfer) I'm not certain how at this stage. The reason I would like more information transfer is that quite often, when a partial interruption occurs between two points, it's the result of a single transit provider or link, yet it may take out 50%+ of upstreams. Finding a working path (quickly) is currently something tinc does not do well, and I would love it if this were to improve that.

$N can be set to 0 in the config if active "route optimization" (name?) is not desired. Set to non-zero with a regular enough interval, this would also serve to keep hole-punched ports open. Not important for my use cases, but for others it is.

If someone objected to the <1 MB/day (small network) to potentially tens of MB/day (big network) that something like this would use at a reasonable $T interval and $N nodes, they could disable the feature by setting the config option(s) to zero. Assuming that such nodes would also be non-relayers, there would be no difference for them, aside from the odd additional ping message (for their own subnet).
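
A sketch of the three RTT measurement cases listed above, with invented message fields (nothing like this exists in tinc today):

#include <stdbool.h>
#include <stdint.h>

/* Invented pong payload for this proposal: the relayer reports the RTT it has
   cached for the child node, or 0 if it was stale/missing and it pinged the
   child itself before replying. */
typedef struct pong_info {
    uint32_t relayer_cached_rtt_us;
    bool relayed;                    /* false if the probed node owns the subnet */
} pong_info_t;

/* Combine the locally measured request->response time with the pong payload
   to estimate the RTT to the node behind the subnet. */
uint32_t effective_rtt(uint32_t measured_us, const pong_info_t *pong) {
    if(!pong->relayed) {
        return measured_us;                                 /* case 1: direct owner */
    }

    if(pong->relayer_cached_rtt_us > 0) {
        return measured_us + pong->relayer_cached_rtt_us;   /* case 2: cached hop RTT added */
    }

    return measured_us;   /* case 3: the relayer's fresh ping is already included */
}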

@fangfufu (Collaborator) commented:

I think having a solution is better than having no solution. Personally, I wouldn't mind using more bandwidth overall. I would much prefer it if tinc could default to the route with the lowest RTT, even if that means using more bandwidth overall. We could have this feature implemented but turned off by default, so tinc behaves the same as before by default: users don't get an increase in bandwidth usage unless they manually turn the feature on.

If you mean TCP's own keepalive packets, then those are disabled by default, and if enabled, the default time before it sends a packet on an idle connection is 2 hours. 10 kB/s is already quite a lot for some users, especially those that pay for bandwidth.

If the identities of these users are not sensitive information, I would love to know who they are. I think we could do with a testimonial section on tinc's website. I feel people don't realise how useful and versatile tinc is, compared to things like WireGuard and OpenVPN.

The amount of bandwidth people have access to has changed a lot since tinc was originally written (20 years ago), and since tinc 1.1 was first released (10 years ago). Just to give some examples from my personal experience: 10 years ago, video chatting with my parents in China was a struggle probably around half of the time, and the resolution was probably 240p. These days I pretty much expect to see my parents' faces in 720p, or at least 480p. 20 years ago I was on dial-up; "broadband" hadn't reached my country yet. Even 5 years ago I wouldn't have been watching YouTube non-stop on my mobile phone's data plan.

@iczero (Contributor, Author) commented Sep 16, 2021

Fixed the subnet selection issue. The cache seems to be invalidated with graph changes, which may be enough.

        break;
    }

    if(p->owner->status.reachable) {
        int rtt = INT_MAX - 1; // use INT_MAX - 1 as rtt for nodes not directly reachable

@gsliepen (Owner) commented:

A node can be directly reachable via TCP only. That might still be faster than going via UDP through another node. But tinc doesn't measure TCP RTT, unfortunately.

    for splay_each(subnet_t, p, &subnet_tree) {
        if(!p || p->type != SUBNET_IPV4) {
            continue;
        }

        if(!maskcmp(address, &p->net.ipv4.address, p->net.ipv4.prefixlength)) {
            r = p;

@gsliepen (Owner) commented:

I didn't notice this before, but by removing this line you've subtly changed how tinc handles packets routed to unreachable nodes. Before, lookup_subnet_*() would return a non-NULL result even if the subnet found wasn't reachable. Then tinc would send ICMP_NET_UNREACH packets back. But now you'll return NULL, so then tinc will send back ICMP_NET_UNKNOWN.
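
One way to keep the old behaviour would be to record a fallback match regardless of reachability and apply the RTT preference only among reachable owners; a simplified, self-contained sketch (not the actual patch):

#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct owner {
    bool reachable;
    int rtt;    /* -1 if unknown */
} owner_t;

/* Given the candidates matching an address at the most specific prefix, return
   the reachable one with the lowest RTT, but fall back to an unreachable match
   rather than NULL, so the caller can still answer with ICMP_NET_UNREACH
   instead of ICMP_NET_UNKNOWN. */
owner_t *select_owner(owner_t **candidates, size_t n) {
    owner_t *fallback = n ? candidates[0] : NULL;
    owner_t *best = NULL;
    int best_rtt = INT_MAX;

    for(size_t i = 0; i < n; i++) {
        if(!candidates[i]->reachable) {
            continue;
        }

        int rtt = candidates[i]->rtt >= 0 ? candidates[i]->rtt : INT_MAX - 1;

        if(rtt < best_rtt) {
            best_rtt = rtt;
            best = candidates[i];
        }
    }

    return best ? best : fallback;
}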

@iczero (Contributor, Author) commented Mar 3, 2024

Currently this patch is no longer being updated. It is, and likely will continue to be, a pile of hacks. I will close this PR for now.

I apologize for the noise from the rebase.

@iczero closed this on Mar 3, 2024
Labels: 1.1, enhancement
6 participants