Failing test in CI: test_rclcpp gtest_subscription__rmw_connextdds #136
Comments
I realized that, due to a recent refactoring, we were no longer marking the test as xfail. ros2/system_tests#544 should fix that.
I did not get to the root cause, but https://github.com/ros2/rosbag2/blob/4882d30fc2d1e5b9305e5d46b8460466f9280d27/rosbag2_transport/test/rosbag2_transport/test_burst.cpp#L404-L408 looks like the same issue. Under a transient workload, the subscription sometimes misses data, which leads to the failure of this test.
I think this issue is happening consistently on Windows repeated, and it's also flaky on Windows debug and RHEL repeated. The failure is a timeout of this test. Reference builds:
Test regression: Log output:
Flaky ratio:
WDYT @clalancette?
Reference build: https://ci.ros2.org/job/nightly_win_rep/3504/ Failing test: Log output
I see this error happening mostly on Windows. These tests started failing a lot more once test parallelization was introduced on Feb 25th. Check:
Flaky ratio in the last 15 days:
Flaky ratio since test parallelization (277 days, as of Oct 09):
System Info
Bug Description
If you run the gtest_subscription__rmw_connextdds test from the test_rclcpp package with no load, it "always" passes. It also "always" seems to pass in CI. However, if you put load on the machine, it fails very often, maybe like 75% of the time.
Expected Behavior
The test always passes, even with a lot of load on the machine.
How to Reproduce
In terminal 1, put a lot of stress on the machine. In my case:
In terminal 2, run the test:
You may have to adjust the amount of stress on the machine, and you may have to run the test a few times, but it should fail fairly quickly.
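Any sustained CPU-bound load should serve for terminal 1. As a hypothetical stand-in (not the command used in this report), the following is a minimal C++ sketch of a load generator. The name burn_cpu and its parameters are assumptions made for illustration only.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper (not part of any ROS 2 package): spins `num_threads`
// busy loops for `duration` and returns the total number of loop iterations.
std::uint64_t burn_cpu(unsigned num_threads, std::chrono::milliseconds duration)
{
  assert(num_threads > 0);
  std::atomic<std::uint64_t> total{0};
  const auto deadline = std::chrono::steady_clock::now() + duration;

  std::vector<std::thread> workers;
  workers.reserve(num_threads);
  for (unsigned i = 0; i < num_threads; ++i) {
    workers.emplace_back([&total, deadline] {
      std::uint64_t local = 0;
      // Busy-wait until the deadline; each thread runs at least one iteration.
      do {
        ++local;
      } while (std::chrono::steady_clock::now() < deadline);
      total.fetch_add(local, std::memory_order_relaxed);
    });
  }
  for (auto & worker : workers) {
    worker.join();
  }
  return total.load();
}
```

Calling it with more threads than std::thread::hardware_concurrency() reports is one way to oversubscribe the machine while the test runs in the other terminal.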
Workarounds
Mark this test as xfail.
Additional context
This same test works fine on Fast-DDS (gtest_subscription__rmw_fastrtps_cpp) and Cyclone DDS (gtest_subscription__rmw_cyclonedds_cpp), even with load on the machine.
This has come up now because we are about to merge in ros2/rclcpp#2142, which seems to exacerbate this problem. However, I can reproduce this completely with rolling packages as of today, so it is not the fault of that PR.
I did some additional debugging to try to track this down. When it fails, the executor is waiting in rmw_wait, on a condition variable, for new data to come in:
rmw_connextdds/rmw_connextdds_common/src/common/rmw_impl_waitset_std.cpp
Line 579 in a6053be
That condition variable should be triggered when new data comes in via the on_data_available callback in the subscriber:
rmw_connextdds/rmw_connextdds_common/src/common/rmw_impl_waitset_std.cpp
Line 131 in a6053be
rmw_connextdds/rmw_connextdds_common/src/common/rmw_impl_waitset_std.cpp
Line 630 in a6053be
Finally, I've verified that the publisher side is indeed writing the data out via DDS_DataWriter_write_untypedI in message (rmw_connextdds/rmw_connextdds_common/src/ndds/dds_api_ndds.cpp, line 771 in a6053be).
So as far as I can tell, those are all of the pieces necessary to get this working, and it does work sometimes. But under load, it seems to fail. I could use some advice on how to debug this further.
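One thing worth ruling out in a wait/notify chain like this is a notification that fires before the waiter has blocked, which becomes more likely when threads are descheduled under load. The sketch below is a generic illustration of the pattern that tolerates that ordering. It is not the rmw_connextdds code, and it does not establish that such a race exists there; ToyWaitSet and all of its members are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a waitset; NOT the rmw_connextdds implementation.
class ToyWaitSet
{
public:
  // Called from the "middleware" thread when data arrives.
  void on_data_available()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      data_ready_ = true;  // record the event while holding the mutex
    }
    condition_.notify_all();
  }

  // Blocks until data is flagged or the timeout expires.
  // Returns true if data was flagged.
  bool wait(std::chrono::milliseconds timeout)
  {
    assert(timeout.count() >= 0);
    std::unique_lock<std::mutex> lock(mutex_);
    // The predicate is checked under the lock before blocking, so a
    // notification that fired earlier is still observed via data_ready_.
    const bool ready = condition_.wait_for(lock, timeout, [this] {return data_ready_;});
    data_ready_ = false;  // consume the event
    return ready;
  }

private:
  std::mutex mutex_;
  std::condition_variable condition_;
  bool data_ready_ = false;
};

// Notifies first and waits afterwards: the unfavourable ordering.
bool notify_before_wait_is_seen()
{
  ToyWaitSet waitset;
  std::thread publisher([&waitset] {waitset.on_data_available();});
  publisher.join();  // the notification has already fired at this point
  return waitset.wait(std::chrono::milliseconds(100));
}
```

If the flag were set without holding the mutex, or the wait had no predicate, the early notification in notify_before_wait_is_seen could be missed and the waiter would only return on timeout. Comparing the real wait and notify paths against this shape may be a useful next debugging step.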
In the meantime, I'm going to propose a PR to mark that particular test as xfail so we can make progress on the other PRs.