fix: Handle messages republished by MQTT bridge following a disconnection event #3018
Conversation
Signed-off-by: James Rhodes <[email protected]>
A given MQTT message might be published more than once, notably after a reconnect. For each attempt, the rumqttc crate notifies an `Outgoing::Publish(pkid)` event. The first time such an event is received for a given `pkid`, the built-in bridge has to map this `pkid` to the forwarded message (so it will be able to properly acknowledge it later). However, when an acknowledgement is already expected for that `pkid`, the `Outgoing::Publish(pkid)` event must be ignored. Failing to do so introduces a shift in the mapping of incoming and outgoing pkids and, in the worst case, blocks the built-in bridge as there is no pending message to associate with. Signed-off-by: Didier Wenzek <[email protected]>
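As a rough sketch of that bookkeeping (not the actual tedge code; `PkidTracker`, `ForwardedMessage`, and the field names are hypothetical), the logic can be expressed as:

```rust
use std::collections::{HashMap, VecDeque};

use rumqttc::Outgoing;

/// Illustrative message type; not the actual tedge bridge types.
struct ForwardedMessage {
    topic: String,
    payload: Vec<u8>,
}

/// Hypothetical sketch of the pkid bookkeeping described above.
struct PkidTracker {
    /// Messages forwarded from the other half of the bridge, waiting to be
    /// associated with an outgoing pkid.
    awaiting_pkid: VecDeque<ForwardedMessage>,
    /// pkid -> forwarded message, kept until the matching PubAck arrives so
    /// the acknowledgement can be relayed back.
    pending_ack: HashMap<u16, ForwardedMessage>,
}

impl PkidTracker {
    fn on_outgoing(&mut self, event: &Outgoing) {
        if let Outgoing::Publish(pkid) = event {
            // An ack is already expected for this pkid: rumqttc is republishing
            // the same message after a reconnection, so the event is ignored.
            if self.pending_ack.contains_key(pkid) {
                return;
            }
            // First publish attempt for this pkid: associate it with the next
            // forwarded message so its ack can be relayed later.
            if let Some(message) = self.awaiting_pkid.pop_front() {
                self.pending_ack.insert(*pkid, message);
            }
        }
    }

    fn on_puback(&mut self, pkid: u16) -> Option<ForwardedMessage> {
        // Release the mapping once the broker acknowledges the publish.
        self.pending_ack.remove(&pkid)
    }
}
```

The key point is that a republished pkid must not consume another pending message; otherwise every subsequent acknowledgement would be associated with the wrong message.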
These messages were not very informative: the first gives us the same information as the `debug`-level notification log, and the other spits out a large map that is very difficult to mentally piece together into anything useful and makes it very hard to identify the specific messages in play, as the payload isn't included in the output. Signed-off-by: James Rhodes <[email protected]>
Only with minor comments
Signed-off-by: James Rhodes <[email protected]>
Signed-off-by: James Rhodes <[email protected]>
I've now removed the commented-out code and tidied up the import. @reubenmiller suggested that I should also modify the JWT retrieval logic in the auth proxy so that it doesn't panic when we don't retrieve a JWT, although that isn't trivial, and I think trying to rush that change in is more likely to do harm than good. I'm confident that the panicking JWT handler isn't the root cause: if the auth proxy panicking were causing the bridge to go down, I would expect it to crash the entire mapper, and the same error appeared multiple times, so the panic is limited to just the single request to the auth proxy.
LGTM
Signed-off-by: Reuben Miller <[email protected]>
Signed-off-by: Reuben Miller <[email protected]>
There are more changes after my approval, therefore, I dismiss this approval
Re-approved
Proposed changes
The OSADL device running the built-in bridge was occasionally disconnecting from the cloud while sending messages and then failing to recover upon reconnection. The root cause was that unacknowledged messages were being republished by rumqttc upon reconnection, but the code was always waiting for a corresponding event from the other half of the bridge (as it implicitly assumed each message was only published once). Since there was no such corresponding event, this caused the bridge to block. This PR fixes the bug by ignoring the duplicate publishes in the section of code that was blocking.
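For illustration only, here is a minimal rumqttc event-loop sketch (the client id, broker address, topic, and bookkeeping are hypothetical and not taken from this PR) showing where the duplicate `Outgoing::Publish(pkid)` events surface after a reconnection and how they can be skipped:

```rust
use std::collections::HashSet;

use rumqttc::{AsyncClient, Event, MqttOptions, Outgoing, Packet, QoS};

#[tokio::main]
async fn main() {
    // Hypothetical client id and broker address, for illustration only.
    let options = MqttOptions::new("bridge-demo", "localhost", 1883);
    let (client, mut event_loop) = AsyncClient::new(options, 10);

    tokio::spawn(async move {
        // A single QoS 1 publish; rumqttc may republish it if the connection
        // drops before the PubAck is received.
        client
            .publish("demo/topic", QoS::AtLeastOnce, false, "hello")
            .await
            .unwrap();
    });

    let mut pending: HashSet<u16> = HashSet::new();
    loop {
        match event_loop.poll().await {
            Ok(Event::Outgoing(Outgoing::Publish(pkid))) => {
                if !pending.insert(pkid) {
                    // The pkid is already awaiting an ack: this is a republish
                    // after a reconnection, so it is ignored.
                    continue;
                }
                // First attempt: this is where the bridge would map the pkid
                // to the forwarded message.
            }
            Ok(Event::Incoming(Packet::PubAck(ack))) => {
                // The publish is acknowledged; release the pkid.
                pending.remove(&ack.pkid);
            }
            Ok(_) => {}
            Err(err) => {
                // poll() reconnects on the next call; the duplicate
                // Outgoing::Publish events appear after this point.
                eprintln!("connection error: {err}");
            }
        }
    }
}
```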
Note: The PR includes a workaround for the detected flaky test
Types of changes
Paste Link to the issue
Checklist
cargo fmt as mentioned in CODING_GUIDELINES
cargo clippy as mentioned in CODING_GUIDELINES
Further comments