Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rcl_wait() returns early when a timer awakes #687

Open
hidmic opened this issue Jun 16, 2020 · 22 comments
Open

rcl_wait() returns early when a timer awakes #687

hidmic opened this issue Jun 16, 2020 · 22 comments
Assignees
Labels
bug Something isn't working

Comments

@hidmic
Copy link
Contributor

hidmic commented Jun 16, 2020

Bug report

Required Info:

  • Operating System:
    • Windows 10
  • Installation type:
    • From source
  • Version or commit hash:
  • DDS implementation:
    • RTI Connext

Steps to reproduce issue

Run timer tests using rmw_connext_cpp.

Expected behavior

Tests pass.

Actual behavior

Often, tests fail when rcl_wait() returns early when a timer awakes.

Additional information

For one, even though a clock is specified for timers, how rmw_wait() abides to the given timeout is not specified. This issue persists for both steady time and system time timers.

This issue does not occur when using rmw_fastrtps*_cpp or rmw_cyclonedds_cpp, as these return at the right time or (most often) past it.

Suspecting a bug in rmw_connext_cpp, I could confirm that the expected timeout is passed to RTI Connext's WaitSet::wait() API which nonetheless returns early for both steady time and system time clocks. However, I believe this is less of a bug in RTI Connext than a misuse of rmw_wait() for rcl timers' implementation, considering time resolution. The closest achievable timeout on a given OS is not necessarily greater than or equal than any one specified.

@hidmic hidmic added the bug Something isn't working label Jun 16, 2020
@hidmic
Copy link
Contributor Author

hidmic commented Jun 16, 2020

One potential workaround would be to relax the definition of a ready rcl timer using a platform specific time tolerance, and even make that configurable for a user to tune it.

@hidmic
Copy link
Contributor Author

hidmic commented Jun 16, 2020

Alternatively, since it's quite likely not every possible timer period can be achieved due to time resolution constrains, the expectation on rcl timers' period could be relaxed to be just an average quantity over some time window. For instance, if some OS time resolution is 200us and a timer is configured with 500us period, by sleeping 400us and 600us in an alternate fashion it can average the intended 2kHz.

@hidmic
Copy link
Contributor Author

hidmic commented Jun 16, 2020

CC @ivanpauno @wjwwood @peterpena

@clalancette
Copy link
Contributor

I guess what I don't understand about this is that if you ask for 1 nanosecond sleep, then yeah, it might take the OS considerably longer than that to return (maybe on the order of 10ms for time slicing). But if you ask for 10ms sleep, the OS should never return in 9 ms; that's just not doing what the user application asked for.

Extending this to rmw_wait(), I would expect the same thing. It seems like the contract for rmw_wait() should be "wait for at least wait_timeout, unless there was a ready entity, then return early". Do you agree, or am I misunderstanding the problem entirely?

@hidmic
Copy link
Contributor Author

hidmic commented Jun 16, 2020

That is true for a plain sleep, but arguable for a time deadline (and even more so if you want to sustain some periodicity, but that's unrelated to rmw_wait()). And I have the impression that may be the case in Windows under specific circumstances (see second paragraph in this MSVC documentation article). We do not know how RTI Connext implements rmw_wait() timeouts though.

@wjwwood
Copy link
Member

wjwwood commented Jun 16, 2020

I agree with @clalancette about the expectation that it should never return before the time period the user specified. This is the case on Linux and macOS, and I would have extended that to rmw_wait or dds_wait, but maybe that's not guaranteed.

If a timer interval, or relative expiration time, is instead specified, the expiration can occur up to a period earlier than or a period later than the specified time, depending on where exactly the start and end times of this interval fall between system clock ticks.

That block from the Microsoft docs makes me sad, but basically I agree it could end early based on that, assuming that's the same kind of timer that is being used by RTI.

This is also a condition we could check for and at least log something about it. Or even we could go back to sleep if needed.

On the other hand, we could adjust the tests (and our documentation) to reflect reality on Windows (if I understand the core issue correctly).

@hidmic
Copy link
Contributor Author

hidmic commented Jun 16, 2020

This is also a condition we could check for and at least log something about it. Or even we could go back to sleep if needed.

That's fair for wait() APIs, but I'm not entirely convinced we should ignore how this affects rcl timers performance if we consistently overshoot timer timeouts. Or the documentation should be explicit about timer periods being a rough lower bound.

On the other hand, we could adjust the tests (and our documentation) to reflect reality on Windows (if I understand the core issue correctly).

I agree with updating wait() APIs documentation accordingly. Specifying what clock that timeout is defined against would be good too.

@ivanpauno
Copy link
Member

I guess what I don't understand about this is that if you ask for 1 nanosecond sleep, then yeah, it might take the OS considerably longer than that to return (maybe on the order of 10ms for time slicing)

👍

But if you ask for 10ms sleep, the OS should never return in 9 ms; that's just not doing what the user application asked for.

👍 I think this one is the problem we have to solve.

And I have the impression that may be the case in Windows under specific circumstances (see second paragraph in this MSVC documentation article).

It's not only that.
rmw_wait uses always the system clock (at least for connext), and timers can use any clock.
Thus, I think the logic in rcl to decide the rmw_wait duration isn't valid, and can be problematic in all platforms.
We might add a tolerance or not, but we do need some sort of "retry wait" logic in case rmw_wait waked up early.

@hidmic
Copy link
Contributor Author

hidmic commented Jun 16, 2020

Thus, I think the logic in rcl to decide the rmw_wait duration isn't valid, and can be problematic in all platforms.

There's that too, yeah. And waking up earlier won't do if the timer clock goes faster than the system's (typical in CI settings using simulation with an RTF > 1). This is a separate issue though.

But if you ask for 10ms sleep, the OS should never return in 9 ms; that's just not doing what the user application asked for.

I will say that a delay (or sleep) and a timeout are two conceptually different things. If you ask for a 10ms delay, then yes, you'd expect it to last 10ms at least. But if you set a 10ms timeout, I'd expect it to last 10ms at most. And thus RTI Connext's WaitSet::wait() behavior makes more sense to me than Fast-RTPS' or CycloneDDS' if and only if it's not possible to get any closer to the requested time due to clock resolution.

@clalancette
Copy link
Contributor

I will point out that that MSVC article says that starting on Windows 8, KeDelayExecutionThread is a lot more precise. It doesn't say anything about KeWaitForSingleObject or KeSetTimer, so I assume they still have the old behavior and can wake up early. Unfortunately, we don't really know which of the APIs Connext is using.

That's fair for wait() APIs, but I'm not entirely convinced we should ignore how this affects rcl timers performance if we consistently overshoot timer timeouts. Or the documentation should be explicit about timer periods being a rough lower bound.

To be fair, on a non-realtime OS, all timer periods are a rough lower bound. So I think that expectation is pretty well set, and I think we could go back to sleep in that case (note that in the above discussion, I've been ignoring signals on Unix anyway, which is another case that might cause a sleep to wake up early and have to go back to sleep). But I definitely agree with you that the documentation should be clear about this.

I will say that a delay (or sleep) and a timeout are two conceptually different things. If you ask for a 10ms delay, then yes, you'd expect it to last 10ms at least. But if you set a 10ms timeout, I'd expect it to last 10ms at most.

I don't think so. At least, that's not how I understand e.g. the timeout on select, the man page for which says:

       The timeout argument specifies the interval that select() should block waiting for a file descriptor to become ready.  The call will block  until
       either:

       *  a file descriptor becomes ready;

       *  the call is interrupted by a signal handler; or

       *  the timeout expires.

       Note  that  the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval
       may overrun by a small amount.  If both fields of the timeval structure are zero,  then  select()  returns  immediately.   (This  is  useful  for
       polling.)  If timeout is NULL (no timeout), select() can block indefinitely.

@hidmic
Copy link
Contributor Author

hidmic commented Jun 16, 2020

To be fair, on a non-realtime OS, all timer periods are a rough lower bound.

That is true. Not sure about embedded ROS though.

At least, that's not how I understand e.g. the timeout on select

IMHO the choices that select() (or any other POSIX API) makes about how to deal with clock resolution is of no consequence. It does make it way easier to implement if the specified time is greater than 0 but less than one clock resolution unit. We can make the same choice, I just think we can do better.

@brawner
Copy link
Contributor

brawner commented Jun 16, 2020

I think resolving this issue could also resolve some rclpy related failures:
ros2/rclpy#267
ros2/rclpy#513

Or at least they are related, the rclpy issues happen across multiple rmw implementations.

@wjwwood
Copy link
Member

wjwwood commented Jun 16, 2020

But if you set a 10ms timeout, I'd expect it to last 10ms at most.

Yeah, I can't agree with this either. For me a timeout is always "we'll start trying to exit after this period of time". In fact I'd never assume it comes back before or "on time".

@ivanpauno
Copy link
Member

But if you set a 10ms timeout, I'd expect it to last 10ms at most.

👍 too

@hidmic
Copy link
Contributor Author

hidmic commented Jun 16, 2020

Oh well, I guess we agree to disagree.

In any case, we have an rmw_wait() that may return too early and an rcl_timer_t that depends on it not doing that. I'll try to come up with a patch to fix the latter. For the former, besides documenting expectations, maybe we can ask RTI for clarification on timeout behavior.

@ivanpauno
Copy link
Member

I think resolving this issue could also resolve some rclpy related failures:
ros2/rclpy#267
ros2/rclpy#513

Or at least they are related, the rclpy issues happen across multiple rmw implementations.

AFAIR, those failures are due to the timer waking up too late (windows server scheduler time slice is much bigger than macOS and linux). They might also be related with this though.

@hidmic
Copy link
Contributor Author

hidmic commented Jun 17, 2020

@neil-rti maybe you can weigh in here.

@hidmic
Copy link
Contributor Author

hidmic commented Oct 21, 2020

With ros2/rmw#275, rmw_wait() API documentation now mentions this little timing detail. I still have to benchmark modified timers in #707 though.

@hidmic
Copy link
Contributor Author

hidmic commented Apr 6, 2021

@clalancette
Copy link
Contributor

@asorbini had some thoughts about this. I don't remember the exact details, but there was a mismatch on Windows between the clock source that Connext is using internally and the one that we are using in rcl. Maybe he can comment.

@asorbini
Copy link
Contributor

asorbini commented Apr 6, 2021

My guess was that these errors might have had something to do with the fact that rcutils uses GetSystemTimePreciseAsFileTime() while Connext uses _ftime(). I posited that the problem might be stemming from the different resolutions supported by these functions (1ns and 1ms, respectively). I'll paste here my comments from an email thread. I doubt there is actually any "rounding up" happening, but it might instead have something to do with a "timers might wake up up to one period early" issue on Windows, that I could swear I saw mentioned in some ros2 issue, but I can't find at the moment...

Anyway, I'm afraid I don't yet have a good answer to why this is actually happening.

my investigation on the failure in TestTimerFixture.test_two_timers_ready_before_timeout on Windows leads me to confirm that the issue stems from the different clocks used by Connext and by rcl.

The rcl_wait() call in that test seems to sometimes wakeup at most 1ms earlier than expected, and this would be consistent (I think) with the fact that Connext uses a 1ms-resolution clock (provided by _ftime()), while rcl relies on a finer-grain clock (provided by GetSystemTimePreciseAsFileTime()). I haven't checked this assumption yet, but I think that rcl implements timers by making the timeout passed to rmw_wait() smaller than the next timer to trigger, which would be consistent with my theory: Connext is relying on its internal clock to detect the timeout, and ends up waking up too early, e.g. the clock tick used by Connext is "rounded up" to 10ms while the clock used by rcl still reports, say, 9500000ns.

This would explain why I've seen the test fail in two different points:

While this investigation leads me to exclude problems in the RMW, it still brings up a concerning situation for Windows users of rmw_connextdds, which, as of now, risk seeing a wait on a timer return at most 1ms too early (this upper bound is still another guess of mine, and should be confirmed...). I think the bug could be avoided by coding the application "defensively" upon return from rcl_wait(), to check that the timer is actually ready, and then calling rcl_wait() again if it isn't. This would of course be an unfortunate burden on user, and I'm not even sure if it's possible from other language bindings (e.g. rclcpp, and rclpy).

Note that once ros2/rmw_connextdds#22 is merged, the default implementation of WaitSets for rmw_connextdds will not rely on DDS objects anymore, so we might see changes (hopefully positive) with these test failures.

EDIT: maybe I was thinking of this Microsoft page:

If a timer interval, or relative expiration time, is instead specified, the expiration can occur up to a period earlier than or a period later than the specified time, depending on where exactly the start and end times of this interval fall between system clock ticks.

@clalancette
Copy link
Contributor

Thanks for the follow-up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants