Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi-Threaded Executor Starvation #2701

Closed
wants to merge 24 commits into from
Closed

Conversation

HarunTeper
Copy link

Pull request addressing the issues in #2360 #2645.

So far, I have added a test that detects starvation in the multi-threaded executor.
This test includes a mutually-exclusive callback group with two timers.
The executor should be alternating between these two tasks, never executing one task twice before the other.

To fix starvation, I have identified the following steps (which I was not able to completely implement yet):

  1. Introduce a new mutex (I refer to this as 'notify_mutex_') to the executor.hpp file that is used to guard callback group flags.
  2. Introduce a function that locks the 'notify_mutex_', triggers the 'interrupt_guard_condition_' and unlocks the callback group flag (and the 'notify_mutex_' afterwards). Currently, the 'MultiThreadedExecutor::run' function includes the guard condition trigger method, while the 'Executor::execute_any_executable' includes the call that unlocks the callback group. These need to be combined and guarded by the 'notify_mutex_'.
  3. The 'Executor::wait_for_work' function needs to be split into two funtions. One may be called 'Executor::prepare_wait_set', which does everything up to (but excluding) the 'wait' function which is currently in the 'wait_for_work' function. The rest of the wait for work function can be kept in the current function. This change is necessary to lock the 'notify_mutex' while a callback is being extracted from the wait set by 'get_next_executable', ensuring that no other thread can change a callback group flag at the same time.
  4. The most complex change: The function that collects and adds the entities to the 'wait_set_' needs to be updated. First, the 'wait_result_' should not be reset. Then a function needs to be added, which adds all callback instances from the previous 'wait_result_' to the current 'wait_result_', if they are blocked and not invalid. This new function should be executed after the 'wait_set_.wait' function. Otherwise, the previously blocked jobs would immediately unblock the wait function, as they are already ready when it is called, which would then lead to busy waiting. Furthermore, for this change, the position at which the previous job instances are added also needs to be decided, which would then decide the priority of the jobs after inserting. For this change, it may also be necessary to introduce a variable called 'previous_wait_result_'.

These steps are based on the work I published here:

https://ieeexplore.ieee.org/document/9622336

https://daes.cs.tu-dortmund.de/storages/daes-cs/r/publications/teper2024emsoft_preprint.pdf

I have already tried to implement some of these steps, and I will also commit some of the changes to this fork this week. However, for step 4, I may require some help. I also noticed that my changes break some of the tests are currently part of rclcpp, as I move the functions that set callback group flags and trigger guard conditions.

fujitatomoya and others added 24 commits December 9, 2024 14:23
* rephrase doc section of LoanedMessage Class.

Signed-off-by: Tomoya Fujita <[email protected]>
Co-authored-by: Chris Lalancette <[email protected]>
Co-authored-by: Christophe Bedard <[email protected]>
Signed-off-by: HarunTeper <[email protected]>
Signed-off-by: Chris Lalancette <[email protected]>
Signed-off-by: HarunTeper <[email protected]>
Signed-off-by: HarunTeper <[email protected]>
* accept custom allocator for LoanedMessage.

Signed-off-by: Tomoya Fujita <[email protected]>

* move message allocator using directives to public.

Signed-off-by: Tomoya Fujita <[email protected]>

---------

Signed-off-by: Tomoya Fujita <[email protected]>
Signed-off-by: HarunTeper <[email protected]>
Signed-off-by: Chris Lalancette <[email protected]>
Signed-off-by: HarunTeper <[email protected]>
Signed-off-by: HarunTeper <[email protected]>
They are both legacy from times past, and are both
no longer serving their intended purposes.  Just remove them.

Signed-off-by: Chris Lalancette <[email protected]>
Signed-off-by: HarunTeper <[email protected]>
Signed-off-by: HarunTeper <[email protected]>
* Omnibus fixes for running tests with Connext.

When running the tests with RTI Connext as the default
RMW, some of the tests are failing.  There are three
different failures fixed here:

1.  Setting the liveliness duration to a value smaller than
a microsecond causes Connext to throw an error.  Set it to
a millisecond.

2.  Using the SystemDefaultsQoS sets the QoS to KEEP_LAST 1.
Connext is somewhat slow in this regard, so it can be the case
that we are overwriting a previous service introspection event
with the next one.  Switch to the ServicesDefaultQoS in the test,
which ensures we will not lose events.

3.  Connext is slow to match publishers and subscriptions.  Thus,
when creating a subscription "on-the-fly", we should wait for the
publisher to match it before expecting the subscription to actually
receive data from it.

With these fixes in place, the test_client_common, test_generic_service,
test_service_introspection, and test_executors tests all pass for
me with rmw_connextdds.

Signed-off-by: Chris Lalancette <[email protected]>

* Fixes for executors.

Signed-off-by: Chris Lalancette <[email protected]>

* One more fix for services.

Signed-off-by: Chris Lalancette <[email protected]>

* More fixes for service_introspection.

Signed-off-by: Chris Lalancette <[email protected]>

* More fixes for introspection.

Signed-off-by: Chris Lalancette <[email protected]>

---------

Signed-off-by: Chris Lalancette <[email protected]>
Signed-off-by: HarunTeper <[email protected]>
* Make ament_cmake a buildtool dependency

The `ament_cmake` package isn't needed at runtime and so should only be
listed as a `buildtool_dependency`, as is done in most other packages.

* Remove the ament_cmake dependency entirely

Since there's already a buildtool dependency on ament_cmake_ros, having
ament_cmake as well is redundant.

Signed-off-by: Nathan Wiebe Neufeldt <[email protected]>
Signed-off-by: HarunTeper <[email protected]>
For reasons I admit I do not understand, the deprecation
warnings for StaticSingleThreadedExecutor on Windows
happen when we construct a shared_ptr for it in the tests.
If we construct a regular object, then it is fine.  Luckily
this test does not require a shared_ptr, so just make it
a regular object here, which rixes the warning.

While we are in here, make all of the tests camel case to
be consistent.

Signed-off-by: Chris Lalancette <[email protected]>
Signed-off-by: HarunTeper <[email protected]>
It supports the events executor now, so re-enable the test.

Signed-off-by: Chris Lalancette <[email protected]>
Signed-off-by: HarunTeper <[email protected]>
Signed-off-by: Patrick Roncagliolo <[email protected]>
Signed-off-by: HarunTeper <[email protected]>
For reasons I admit I do not understand, the deprecation
warnings for StaticSingleThreadedExecutor on Windows
happen when we construct a shared_ptr for it in the tests.
If we construct a regular object, then it is fine.  Luckily
this test does not require a shared_ptr, so just make it
a regular object here, which rixes the warning.

While we are in here, make all of the tests camel case to
be consistent.

Signed-off-by: Chris Lalancette <[email protected]>
It supports the events executor now, so re-enable the test.

Signed-off-by: Chris Lalancette <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants