Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: fix flaky unit tests #3338

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

jarhodes314
Copy link
Contributor

@jarhodes314 jarhodes314 commented Jan 17, 2025

Proposed changes

This fixes some flaky unit tests I observed locally. One was the mqtt_channel tests, which were failing due to a genuine bug where the connection was closed before all messages were published. The other was a mapper test, which implicitly assumed messages would be delivered in a fixed order.

A few other things that have changed in this PR:

  • I've added --status-level fail to cargo nextest run in just test. This stops cargo nextest listing every passing test name, which was previously obscuring the error output in my terminal.
  • I've removed serial-test from mqtt_channel in favour of using distinct topic/session names against a single broker in each test
  • Replaced some std::env::var calls with env! and some relative file paths with absolute paths so the test processes can be called without running them under cargo test. This is important for using tools like cargo stress

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

I don't believe either of these had associated issues.

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

let mut servers = HashMap::new();
servers.insert("1".to_string(), server_config);

rumqttd::Config {
id: 0,
router: router_config,
cluster: None,
console: Some(console_settings),
console: None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why this was ever enabled, I haven't observed us using it anywhere

Copy link
Contributor

@albinsuresh albinsuresh Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears to be a defensive diagnostic mechanism kept in place to debug the server when some rare flaky failure occurs, as running the test again by turning it ON may not reproduce the failure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can safely ignore these console settings.


// The messages might get processed out of order, we don't care about the ordering of the messages
requests.sort();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably isn't technically necessary as I imagine the requests are always sent in the order from the original c8y message, but I think the point still stands that the ordering is irrelevant.

Copy link

codecov bot commented Jan 17, 2025

Codecov Report

Attention: Patch coverage is 93.41317% with 11 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
crates/common/mqtt_channel/src/tests.rs 90.00% 0 Missing and 9 partials ⚠️
crates/common/mqtt_channel/src/connection.rs 94.59% 1 Missing and 1 partial ⚠️
Additional details and impacted files

📢 Thoughts on this report? Let us know!

Copy link
Contributor

github-actions bot commented Jan 17, 2025

Robot Results

✅ Passed ❌ Failed ⏭️ Skipped Total Pass % ⏱️ Duration
556 0 2 556 100 1h31m39.853377s

Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main fix that ensures that all the messages to be published are exhausted properly with acks looks fine. But, the test failures look concerning. Just some queries/comments on the other bits as well.

Ok(Ok(())) => {
// I don't know why it happened, but I have observed this once while testing
// So just log the error and retry starting the broker on a new port
eprintln!("MQTT-TEST ERROR: `broker.start()` should not terminate until after `spawn_broker` returns")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But when this happens and when we retry the broker start on the next iteration, the client thread that's started (line. 137) in the last iteration also must be aborted somehow, right? Because once that client thread enters the loop, I don't see how it can break out even on a connection error because of the very specific if let check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The client thread is not part of the loop, it's only started once we have a healthy broker.

let mut servers = HashMap::new();
servers.insert("1".to_string(), server_config);

rumqttd::Config {
id: 0,
router: router_config,
cluster: None,
console: Some(console_settings),
console: None,
Copy link
Contributor

@albinsuresh albinsuresh Jan 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears to be a defensive diagnostic mechanism kept in place to debug the server when some rare flaky failure occurs, as running the test again by turning it ON may not reproduce the failure.

Copy link
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for finding and fixing this bug on closing the connection a bit too soon. However, one has to do the same for QoS 2.

crates/common/mqtt_channel/src/connection.rs Outdated Show resolved Hide resolved
if let Poll::Ready(None) = futures::poll!(awaiting_acks.as_mut().peek()) {
// If the channel is dropped, the sender loop has stopped
// and we've received an ack for every message published
mqtt_client.disconnect().await.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather ignore errors here.

Suggested change
mqtt_client.disconnect().await.unwrap();
let _ = mqtt_client.disconnect().await;

@jarhodes314 jarhodes314 requested review from a team as code owners January 24, 2025 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants