Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some work participant discovery #2086

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

eboasson
Copy link
Contributor

@eboasson eboasson commented Sep 6, 2024

This PR is somewhat massive, but it is primarily driven by a simple thing: make it possible to drop locators from SPDP when no participant seems to exist at that address. I've always refused to do that because of some edge cases in asymmetrical discovery and reconnecting after network interrupts, but the problem is that it also meant that MaxAutoParticipantIndex had to remain small to prevent massive bursts of periodic traffic, but that low limit caused enough trouble to make people prefer those bursts. The ROS 2 navigation stack suffers from this in particular: ros2/rmw_cyclonedds#458

Thus, this makes it prune locators after some time, distinguishing between peers specified in the configuration and locators discovered at run-time, and handling locators of participants of which the lease expired differently from those that terminated cleanly.

It also flips the scheduling of SPDP messages around: it used to have a timed event for each local participant and send to all known locators in one burst; now it scans a table of locators and sends SPDP messages for all participants at the same time. The advantage should be packing; most people will never notice because most applications only have a single participant ...

It's been working just fine for months on my laptop, but there are many configurations out there and I have not been using all of them. I'm sure there are still issues with it, but I think the time has come to just merge it and see what breaks. (Perhaps nothing, after all!)

I've made a PR because I want to be done with it 🙂 but have marked it as a draft because I still need to turn the handful of test scenarios in a bash script into tests that run as part of the regular CI.

This replaces the per-participant timed-event for periodically
broadcasting its SPDP message by a central device that schedules and
sends the SPDP messages for all participants and maintains the set of
addresses to which to send them.

There are a number of benefits to this model:

* We can make it more likely (certain, if we want) that SPDP messages to
  a single destination go out in a single packet, reducing the packet rate
  if there are many participants.

* We are no longer dependent on an "addrset" for tracking the set of
  addresses to publish to (and we don't need to special-case SPDP
  writers). That means we now have the option of "aging" locators and
  lowering the frequency or giving up on them altogether if there just
  doesn't seem to be anything at that address.
This eliminates looking up the SPDP sample in the WHC, which became rather painful with
the introduction of serdata_plist, and eliminates the problem of the dispose+unregister
not getting stored in the WHC (removed from the index, best-effort so not retained until
ACKs come in).

It also solves having to construct an address set based on the addresses in the SPDP
administration and patching that into the SPDP writer, which was a horrible hack to begin
with.
The old default of 9 (so 10 participants) was hit rather often.  This raises the default
to 99 (so 100 participants).  The only real downside is that increases the number of SPDP
packets 10x if unicast discovery is used.
If one adds a host to the peer list, we add MaxAutoParticipantIndex+1 locators for it, but
it generally doesn't make much sense to keep pinging all of them if there is no response.
Especially if MaxAutoParticipantIndex is large, it causes a lot of periodic traffic.
SPDP locators can be added by the configuration but also by discovering a peer that
we (presumably) cannot reach via multicast.  The question is what to do with such a
discovered address when the peer is no longer there.

If we learnt a locator through discovery and we know for certain that the peer is no
longer there (i.e., received a dispose/unregister) and that there are also no other peers
at the locator, it makes sense to remove the address: a new peer that shows up at that
locator would presumably try to contact us.  If instead the peer's lease expired, we need
to keep pinging it for a some time in case it was just the disconnect.

So for each SPDP locator we need to know why we have it in the table of addreses.

Signed-off-by: Erik Boasson <[email protected]>
This replaces the somewhat silly (and ancient) tracing of the queue length on every
insert.
Replaces src/core/xtests/spdp_scenarios.bash
@eboasson eboasson marked this pull request as ready for review September 25, 2024 18:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant