Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Handle messages republished by MQTT bridge following a disconnection event #3018

Merged
merged 9 commits into from
Jul 29, 2024

Conversation

jarhodes314
Copy link
Contributor

@jarhodes314 jarhodes314 commented Jul 26, 2024

Proposed changes

The OSADL device running the built in bridge was occasionally disconnecting from cloud while sending messages and then failing to recover upon reconnection. The root cause was that unacknowledged messages were being republished by rumqttc upon reconnection, but the code was always waiting for a corresponding event from the other half of the bridge (as it implicitly assumed each message was only published once). Since there was no such corresponding event, this caused the bridge to block.

This PR ignores the duplicate publishes in the section of code that was blocking to fix this bug.

Note: The PR includes a workaround for the detected flaky test

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

jarhodes314 and others added 4 commits July 26, 2024 10:21
A given MQTT message might be published more than once, notably after a reconnect.
For each attempt the rumqttc crate notifies an `Outgoing::Publish(pkid)` event.
The first time such an event is received for a given `pkid`, the built-in bridge
has to map this `pkid` to the forwarded message (so it will be able to properly acknowledge it later).
However, when an acknoledgement is already expected for that `pkid` it
means that the `Outgoing::Publish(pkid)` event must be ignored.
Failing to do so introduces a shift in the mapping of in and out pkids,
and, in the worse case, this blocks the built-in bridge as there is no
pending message to associate to.

Signed-off-by: Didier Wenzek <[email protected]>
These messages were not very informative, the first gives us
the same information as the `debug` level notification log
and the other spits out a large map, but it's very difficult
to mentally piece together into anything useful, and makes
it very hard to identify the specific messages in play as
the payload isn't included in the output.

Signed-off-by: James Rhodes <[email protected]>
@jarhodes314 jarhodes314 added the theme:mqtt Theme: mqtt and mosquitto related topics label Jul 26, 2024
@jarhodes314 jarhodes314 temporarily deployed to Test Pull Request July 26, 2024 09:23 — with GitHub Actions Inactive
Copy link

codecov bot commented Jul 26, 2024

Codecov Report

Attention: Patch coverage is 81.81818% with 2 lines in your changes missing coverage. Please review.

Project coverage is 78.1%. Comparing base (e637b25) to head (1cab1a7).
Report is 5 commits behind head on main.

Additional details and impacted files
Files Coverage Δ
crates/extensions/tedge_mqtt_bridge/src/lib.rs 89.2% <81.8%> (-0.2%) ⬇️

... and 3 files with indirect coverage changes

Copy link
Contributor

github-actions bot commented Jul 26, 2024

Robot Results

✅ Passed ❌ Failed ⏭️ Skipped Total Pass % ⏱️ Duration
486 0 2 486 100 1h19m55.609488s

@jarhodes314 jarhodes314 marked this pull request as ready for review July 26, 2024 09:43
Copy link
Member

@rina23q rina23q left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only with minor comments

crates/extensions/tedge_mqtt_bridge/src/lib.rs Outdated Show resolved Hide resolved
crates/extensions/tedge_mqtt_bridge/tests/bridge.rs Outdated Show resolved Hide resolved
Signed-off-by: James Rhodes <[email protected]>
@jarhodes314 jarhodes314 temporarily deployed to Test Pull Request July 26, 2024 16:07 — with GitHub Actions Inactive
@jarhodes314
Copy link
Contributor Author

I've now removed the commented out code and tidied up the import.

@reubenmiller suggested that I should also modify the JWT retrieving logic in the auth proxy so that it doesn't panic when we don't retrieve a JWT, although that isn't trivial and I think trying to get that change rushed in is more likely to do harm than good. I'm confident that the panicking JWT handler isn't the root cause (if the auth proxy panicking was causing the bridge to go down, I would expect it to crash the entire mapper, and the same error appeared multiple times, so the panic is limited to just the single request to the auth proxy).

rina23q
rina23q previously approved these changes Jul 29, 2024
Copy link
Member

@rina23q rina23q left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rina23q rina23q added this pull request to the merge queue Jul 29, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 29, 2024
@reubenmiller reubenmiller temporarily deployed to Test Pull Request July 29, 2024 15:36 — with GitHub Actions Inactive
@rina23q rina23q dismissed their stale review July 29, 2024 16:02

There are more changes after my approval, therefore, I dismiss this approval

Copy link
Member

@rina23q rina23q left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approved

@reubenmiller reubenmiller added this pull request to the merge queue Jul 29, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 29, 2024
@reubenmiller reubenmiller added this pull request to the merge queue Jul 29, 2024
Merged via the queue into thin-edge:main with commit 19ba2cd Jul 29, 2024
33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme:mqtt Theme: mqtt and mosquitto related topics
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants