fix: Handle messages republished by MQTT bridge following a disconnection event #3018
Conversation
Signed-off-by: James Rhodes <[email protected]>
A given MQTT message might be published more than once, notably after a reconnect. For each attempt, the rumqttc crate notifies an `Outgoing::Publish(pkid)` event. The first time such an event is received for a given `pkid`, the built-in bridge has to map this `pkid` to the forwarded message (so it will be able to properly acknowledge it later). However, when an acknowledgement is already expected for that `pkid`, the `Outgoing::Publish(pkid)` event must be ignored. Failing to do so introduces a shift in the mapping of incoming and outgoing pkids and, in the worst case, blocks the built-in bridge as there is no pending message to associate with. Signed-off-by: Didier Wenzek <[email protected]>
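As a rough sketch of that bookkeeping (not the actual tedge code; `PkidTracker`, `ForwardedMessage`, and the field names are hypothetical), the logic can be expressed as:

```rust
use std::collections::{HashMap, VecDeque};

use rumqttc::Outgoing;

/// Illustrative message type; not the actual tedge bridge types.
struct ForwardedMessage {
    topic: String,
    payload: Vec<u8>,
}

/// Hypothetical sketch of the pkid bookkeeping described above.
struct PkidTracker {
    /// Messages forwarded from the other half of the bridge, waiting to be
    /// associated with an outgoing pkid.
    awaiting_pkid: VecDeque<ForwardedMessage>,
    /// pkid -> forwarded message, kept until the matching PubAck arrives so
    /// the acknowledgement can be relayed back.
    pending_ack: HashMap<u16, ForwardedMessage>,
}

impl PkidTracker {
    fn on_outgoing(&mut self, event: &Outgoing) {
        if let Outgoing::Publish(pkid) = event {
            // An ack is already expected for this pkid: rumqttc is republishing
            // the same message after a reconnection, so the event is ignored.
            if self.pending_ack.contains_key(pkid) {
                return;
            }
            // First publish attempt for this pkid: associate it with the next
            // forwarded message so its ack can be relayed later.
            if let Some(message) = self.awaiting_pkid.pop_front() {
                self.pending_ack.insert(*pkid, message);
            }
        }
    }

    fn on_puback(&mut self, pkid: u16) -> Option<ForwardedMessage> {
        // Release the mapping once the broker acknowledges the publish.
        self.pending_ack.remove(&pkid)
    }
}
```

The key point is that a republished pkid must not consume another pending message; otherwise every subsequent acknowledgement would be associated with the wrong message.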
These messages were not very informative: the first gives us the same information as the `debug`-level notification log, and the other spits out a large map that is very difficult to mentally piece together into anything useful and makes it very hard to identify the specific messages in play, as the payload isn't included in the output. Signed-off-by: James Rhodes <[email protected]>
Only with minor comments
Signed-off-by: James Rhodes <[email protected]>
Signed-off-by: James Rhodes <[email protected]>
I've now removed the commented-out code and tidied up the import. @reubenmiller suggested that I should also modify the JWT retrieval logic in the auth proxy so that it doesn't panic when we don't retrieve a JWT, although that isn't trivial, and I think trying to rush that change in is more likely to do harm than good. I'm confident that the panicking JWT handler isn't the root cause: if the auth proxy panicking were causing the bridge to go down, I would expect it to crash the entire mapper, and the same error appeared multiple times, so the panic is limited to just the single request to the auth proxy.
LGTM
Signed-off-by: Reuben Miller <[email protected]>
Signed-off-by: Reuben Miller <[email protected]>
There are more changes after my approval, therefore, I dismiss this approval
Re-approved
Proposed changes
The OSADL device running the built-in bridge was occasionally disconnecting from the cloud while sending messages and then failing to recover upon reconnection. The root cause was that unacknowledged messages were being republished by rumqttc upon reconnection, but the code was always waiting for a corresponding event from the other half of the bridge (as it implicitly assumed each message was only published once). Since there was no such corresponding event, this caused the bridge to block. This PR fixes the bug by ignoring the duplicate publishes in the section of code that was blocking.
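For illustration only, here is a minimal rumqttc event-loop sketch (the client id, broker address, topic, and bookkeeping are hypothetical and not taken from this PR) showing where the duplicate `Outgoing::Publish(pkid)` events surface after a reconnection and how they can be skipped:

```rust
use std::collections::HashSet;

use rumqttc::{AsyncClient, Event, MqttOptions, Outgoing, Packet, QoS};

#[tokio::main]
async fn main() {
    // Hypothetical client id and broker address, for illustration only.
    let options = MqttOptions::new("bridge-demo", "localhost", 1883);
    let (client, mut event_loop) = AsyncClient::new(options, 10);

    tokio::spawn(async move {
        // A single QoS 1 publish; rumqttc may republish it if the connection
        // drops before the PubAck is received.
        client
            .publish("demo/topic", QoS::AtLeastOnce, false, "hello")
            .await
            .unwrap();
    });

    let mut pending: HashSet<u16> = HashSet::new();
    loop {
        match event_loop.poll().await {
            Ok(Event::Outgoing(Outgoing::Publish(pkid))) => {
                if !pending.insert(pkid) {
                    // The pkid is already awaiting an ack: this is a republish
                    // after a reconnection, so it is ignored.
                    continue;
                }
                // First attempt: this is where the bridge would map the pkid
                // to the forwarded message.
            }
            Ok(Event::Incoming(Packet::PubAck(ack))) => {
                // The publish is acknowledged; release the pkid.
                pending.remove(&ack.pkid);
            }
            Ok(_) => {}
            Err(err) => {
                // poll() reconnects on the next call; the duplicate
                // Outgoing::Publish events appear after this point.
                eprintln!("connection error: {err}");
            }
        }
    }
}
```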
Note: The PR includes a workaround for the detected flaky test
Types of changes
Paste Link to the issue
Checklist
cargo fmt as mentioned in CODING_GUIDELINES
cargo clippy as mentioned in CODING_GUIDELINES
Further comments