-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: revert #1627 and add congestion shortcut #1737
fix: revert #1627 and add congestion shortcut #1737
Conversation
eclipse-zenoh#1627 introduced a congestion shortcut mechanism, but this one was not correctly synchronized. There was indeed situations experienced by users in which congested flag was set and never reset, which implies a drop of all successive messages (the publisher becomes kind of dead). The congestion flag is in fact set after the deadline of a message is reached, while it is unset when batches were refilled, only with relaxed atomic operations. Also, after the deadline is reached, there is no further check of the queue. The most obvious synchronization flow here is that between the instant where the thread is waken up because the deadline has been reached, and the one where the congested flag is set, it is possible that the tx task is unblocked and all the batches are sent and refilled. The congested flags would been set after, so there would be no batch to refill to unset it back. This flow seems hard to achieve when there are many batches in the queue, but it is still theoretically possible. And when fragmentation chain is dropped in the middle, pushing the ephemeral batch takes more time before setting the congested flag; this is precisely the case where the bug was observed by the way, so it may indicate the described flow is the reason behind, but it's not sure. The proposed fix reverts eclipse-zenoh#1627, and use a simpler and correctly synchronized shortcut: a congested flag is added behind the mutex of `StageIn`, it is set when the pipeline is congested and unset if the message is written (for both network and transport); if the congested flag is set, the deadline is not awaited. This shortcut is not "as short" as the previous, but it is again a lot simpler.
e7fb039
to
1286e72
Compare
@@ -795,22 +764,14 @@ impl TransmissionPipelineProducer { | |||
|
|||
// If message is droppable, compute a deadline after which the sample could be dropped | |||
let (wait_time, max_wait_time) = if msg.is_droppable() { | |||
// Checked if we are blocked on the priority queue and we drop directly the message | |||
if self.status.is_congested(priority) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By removing this check, the code will always block in the zlock!(self.stage_in[idx])
at line 773
when congestion happens. I think the general idea is to avoid getting the lock if we are already in the congested scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will not block for long, because if the pipeline is congested, no batch will be available and push_network_message
will return directly.
Also, having the shortcut before the lock implies that each other messages waiting for the lock will have to wait the deadline while we know that the pipeline is congested.
And last, this is easier to synchronize properly, and touch the least possible amount of code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will block for long because if the previous message is reliable, it will wait for a new batch. So it won't be possible to do an early drop in case of best effort messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a valid argument.
Fixes #1738 |
Superseded by #1737 |
#1627 introduced a congestion shortcut mechanism, but this one was not correctly synchronized. There was indeed situations experienced by users in which congested flag was set and never reset, which implies a drop of all successive messages (the publisher becomes kind of dead).
The congestion flag is in fact set after the deadline of a message is reached, while it is unset when batches were refilled, only with relaxed atomic operations. Also, after the deadline is reached, there is no further check of the queue.
The most obvious synchronization flow here is that between the instant where the thread is waken up because the deadline has been reached, and the one where the congested flag is set, it is possible that the tx task is unblocked and all the batches are sent and refilled. The congested flags would been set after, so there would be no batch to refill to unset it back. This flow seems hard to achieve when there are many batches in the queue, but it is still theoretically possible. And when fragmentation chain is dropped in the middle, pushing the ephemeral batch takes more time before setting the congested flag; this is precisely the case where the bug was observed by the way, so it may indicate the described flow is the reason behind, but it's not sure.
The proposed fix reverts #1627, and use a simpler and correctly synchronized shortcut: a congested flag is added behind the mutex of
StageIn
, it is set when the pipeline is congested and unset if the message is written (for both network and transport); if the congested flag is set, the deadline is not awaited. This shortcut is not "as short" as the previous, but it is again a lot simpler.