Fix tedge agent restart on self update #2631

didier-wenzek

Proposed changes

As highlighted by flaky tests, there is a race condition when the agent detects a self-update. While one thread is publishing the operation status over MQTT, a concurrent thread triggers a shutdown of the agent, possibly before the status has actually been published.

This fix introduces two mitigations:

  • The agent shutdown is delayed.
  • On shutdown, all the MQTT messages already queued are published.
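
The second mitigation can be illustrated with a minimal sketch. It is not the code of this PR: `Outbox`, `push`, and `shutdown` are hypothetical names, and a plain in-memory queue stands in for the agent's MQTT publish channel. Only the ordering is taken from the change itself: close the input first, then publish whatever is already queued.

```rust
use std::collections::VecDeque;

/// Hypothetical outgoing queue standing in for the agent's MQTT publish channel.
pub struct Outbox {
    queue: VecDeque<String>,
    closed: bool,
}

impl Outbox {
    pub fn new() -> Self {
        Outbox { queue: VecDeque::new(), closed: false }
    }

    /// Queue a message for publication; rejected once the input is closed.
    pub fn push(&mut self, msg: &str) -> Result<(), String> {
        if self.closed {
            return Err(format!("outbox closed, dropping: {}", msg));
        }
        self.queue.push_back(msg.to_string());
        Ok(())
    }

    /// Stop accepting new messages; already queued ones stay in the queue.
    pub fn close_input(&mut self) {
        self.closed = true;
    }

    /// On shutdown: close the input first, then publish everything still queued.
    /// Returns the number of messages handed to `publish`.
    pub fn shutdown<F: FnMut(&str)>(&mut self, mut publish: F) -> usize {
        self.close_input();
        let mut published = 0;
        while let Some(msg) = self.queue.pop_front() {
            publish(msg.as_str());
            published += 1;
        }
        published
    }
}
```

With this ordering, a status message queued just before the shutdown request is still handed to the publisher, while anything pushed afterwards is rejected instead of being silently lost.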

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

#2623

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

codecov bot commented Jan 30, 2024

Codecov Report

Attention: 9 lines in your changes are missing coverage. Please review.

Comparison is base (148eff0) 75.9% compared to head (69cce99) 75.9%.
Report is 5 commits behind head on main.

Additional details and impacted files
Files Coverage Δ
...tes/core/tedge_agent/src/software_manager/actor.rs 55.6% <100.0%> (+0.4%) ⬆️
crates/core/tedge_actors/src/message_boxes.rs 79.0% <0.0%> (-2.3%) ⬇️
crates/extensions/tedge_mqtt_ext/src/lib.rs 66.0% <16.6%> (-1.8%) ⬇️

... and 1 file with indirect coverage changes


github-actions bot commented Jan 30, 2024

Robot Results

✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % | ⏱️ Duration
384 | 0 | 3 | 384 | 100 | 53m50s

}

// On shutdown, first close input so no new messages can be pushed
self.input_receiver.close_input();

@albinsuresh albinsuresh Jan 30, 2024

Query: Shouldn't the incoming_mqtt channel also be closed to avoid reading any more messages from the broker as well? For that, these relay_xxx methods would need to take the mutable Connection itself, instead of the published or received channels from it.

Contributor Author

Unfortunately, it would be pointless to close the incoming_mqtt channel, as the underlying TCP connection to the broker is bi-directional: since we want to publish a final batch of messages, we have to keep the connection open.

Contributor Author

Looking closely, things could be improved for the case where a message is received but there is no longer any recipient for it.

For that, we need to acknowledge only messages that have been properly queued or, even better, processed.
This seems to be feasible now with the latest version of rumqttc.
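
A rough sketch of the "acknowledge only what has been queued" idea follows. It deliberately uses only standard-library channels, not rumqttc; the `Incoming` type and `relay_incoming` function are hypothetical, and the returned packet id merely marks the point where a real client would send its acknowledgement to the broker.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical incoming message, carrying the packet id needed to acknowledge it.
#[derive(Debug, Clone, PartialEq)]
pub struct Incoming {
    pub packet_id: u16,
    pub payload: String,
}

/// Forward a received message to its recipient.
/// Returns the packet id to acknowledge only if the message was actually queued;
/// returns `None` when the recipient is gone, so the message stays unacknowledged.
pub fn relay_incoming(msg: Incoming, recipient: &Sender<Incoming>) -> Option<u16> {
    let packet_id = msg.packet_id;
    match recipient.send(msg) {
        Ok(()) => Some(packet_id),
        Err(_) => None,
    }
}
```

Under MQTT QoS 1, a message left unacknowledged can be redelivered by the broker later, for instance once the agent has restarted, which is the behaviour this ordering is meant to preserve.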


@albinsuresh albinsuresh left a comment

LGTM

@didier-wenzek didier-wenzek added this pull request to the merge queue Jan 30, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 30, 2024
@didier-wenzek didier-wenzek requested a review from a team as a code owner January 30, 2024 18:19
@didier-wenzek didier-wenzek added this pull request to the merge queue Jan 30, 2024
Merged via the queue into thin-edge:main with commit 00c14eb Jan 30, 2024
20 checks passed
@gligorisaev

QA has thoroughly checked the bug and here are the results:

  • Test for ticket exists in the test suite.
  • QA has tested the bug and it is no longer reproducible.

@didier-wenzek didier-wenzek deleted the fix/tedge-agent-restart-on-self-update branch February 7, 2024 09:12