fix: revert #1627 and add congestion shortcut #1737

Closed

Conversation

Contributor

@wyfo wyfo commented Jan 23, 2025

#1627 introduced a congestion shortcut mechanism, but it was not correctly synchronized. Users indeed experienced situations in which the congested flag was set and never reset, which causes all subsequent messages to be dropped (the publisher effectively becomes dead).

The congested flag is in fact set after the deadline of a message is reached, and unset when batches are refilled, using only relaxed atomic operations. Moreover, once the deadline is reached, the queue is not checked again.

The most obvious problematic flow is the following: between the instant the thread is woken up because the deadline has been reached and the instant the congested flag is set, the tx task may be unblocked and all the batches sent and refilled. The congested flag would then be set afterwards, so there would be no batch left to refill and unset it. This flow seems hard to hit when there are many batches in the queue, but it is still theoretically possible. Moreover, when a fragmentation chain is dropped in the middle, pushing the ephemeral batch takes more time before the congested flag is set; this is precisely the case in which the bug was observed, which may indicate that the described flow is the actual cause, though it is not certain.
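
A minimal Rust sketch of the racy pattern described above (type and method names are illustrative, not the actual zenoh code):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative only: the congested flag is touched with relaxed atomics and
// the queue is never re-checked after the deadline.
struct Status {
    congested: AtomicBool,
}

impl Status {
    // Called by the application thread once the message deadline is reached.
    fn mark_congested(&self) {
        self.congested.store(true, Ordering::Relaxed);
    }

    // Called by the tx task when a batch is refilled.
    fn on_batch_refill(&self) {
        self.congested.store(false, Ordering::Relaxed);
    }
}

// Problematic interleaving: if the tx task sends and refills *all* batches
// between the deadline wake-up and `mark_congested`, the flag is set last and
// no further refill will ever clear it, so every subsequent droppable message
// is dropped.
```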

The proposed fix reverts #1627 and uses a simpler, correctly synchronized shortcut: a congested flag is added behind the mutex of `StageIn`; it is set when the pipeline is congested and unset when a message is written (for both network and transport messages); when the congested flag is set, the deadline is not awaited. This shortcut is not "as short" as the previous one, but it is a lot simpler.
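
A hedged sketch of the proposed scheme (illustrative names and simplified signatures; the real `StageIn` holds more state):

```rust
use std::sync::{Mutex, MutexGuard};
use std::time::Instant;

struct StageInInner {
    congested: bool,
    // ... batches, fragmentation buffer, etc.
}

struct StageIn {
    inner: Mutex<StageInInner>,
}

impl StageIn {
    // Push a droppable message, waiting at most until `deadline` for a batch.
    fn push_droppable(&self, deadline: Instant) -> bool {
        let mut inner = self.inner.lock().unwrap();
        // Shortcut: if a previous push hit congestion, do not await the
        // deadline; only a non-blocking attempt is made.
        let deadline = if inner.congested { Instant::now() } else { deadline };
        let written = Self::try_write(&mut inner, deadline);
        // The flag is set when the pipeline is congested (the write failed)
        // and unset as soon as a message is written again.
        inner.congested = !written;
        written
    }

    // Placeholder for the real logic: wait for a batch until `deadline` and
    // serialize the message if one becomes available.
    fn try_write(_inner: &mut MutexGuard<'_, StageInInner>, _deadline: Instant) -> bool {
        true
    }
}
```

Because the flag is set and cleared under the same mutex that protects the push path, a concurrent refill cannot race with setting the flag, which is the property the previous relaxed-atomic shortcut lacked.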

@wyfo wyfo added the bug (Something isn't working) and internal (Changes not included in the changelog) labels on Jan 23, 2025
@wyfo wyfo force-pushed the fix/fix_congestion_shortcut branch from e7fb039 to 1286e72 on January 23, 2025 09:10
@@ -795,22 +764,14 @@ impl TransmissionPipelineProducer {

// If message is droppable, compute a deadline after which the sample could be dropped
let (wait_time, max_wait_time) = if msg.is_droppable() {
// Check if we are blocked on the priority queue and drop the message directly
if self.status.is_congested(priority) {
Member

By removing this check, the code will always block on the `zlock!(self.stage_in[idx])` at line 773 when congestion happens. I think the general idea is to avoid taking the lock if we are already in the congested scenario.

Contributor Author

It will not block for long: if the pipeline is congested, no batch will be available and push_network_message will return directly.
Also, having the shortcut before the lock implies that every other message waiting for the lock still has to wait for its deadline even though we know the pipeline is congested.
And lastly, this approach is easier to synchronize properly and touches the least possible amount of code.

Member

@Mallets Mallets commented Jan 23, 2025

You will block for long, because if the previous message is reliable, it will wait for a new batch. So it won't be possible to do an early drop in the case of best-effort messages.
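
A toy illustration (not zenoh code) of the contention being pointed out: a reliable message holding the stage-in lock while waiting for a batch prevents a best-effort message queued behind it from being dropped early.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    // Stand-in for the per-priority stage-in mutex.
    let stage_in = Arc::new(Mutex::new(()));

    let reliable = {
        let stage_in = Arc::clone(&stage_in);
        thread::spawn(move || {
            let _guard = stage_in.lock().unwrap();
            // A reliable message blocking on batch refill while holding the lock.
            thread::sleep(Duration::from_millis(500));
        })
    };

    thread::sleep(Duration::from_millis(50));
    // The best-effort message can only read the (lock-protected) congested
    // flag after the reliable one releases the lock, so its "early drop" is
    // delayed; a check before the lock would have dropped it immediately.
    let _guard = stage_in.lock().unwrap();
    println!("best-effort message finally acquired the lock");

    reliable.join().unwrap();
}
```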

Contributor Author

That's a valid argument.

@wyfo wyfo requested a review from Mallets January 23, 2025 09:21
@wyfo
Contributor Author

wyfo commented Jan 23, 2025

Fixes #1738

@oteffahi
Contributor

Superseded by #1737

@oteffahi oteffahi closed this Jan 24, 2025