Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart operation on child-devices #2245

Conversation

didier-wenzek
Copy link
Contributor

@didier-wenzek didier-wenzek commented Sep 12, 2023

Updating tedge-agent and c8y-mapper to support restart operation on child devices

Proposed changes

  • use mqtt.topic_root and mqtt.device_topic_id
  • use te/device/main///cmd/restart/+ instead of tedge/commands/+/control/restart
  • move the command id from the payload to the topic
  • publish restart commands as retained messages
  • The c8y-mapper has to register child device restart capability
  • Add system tests: restarting a child device from c8y

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

#2018

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

@didier-wenzek didier-wenzek temporarily deployed to Test Pull Request September 12, 2023 09:06 — with GitHub Actions Inactive
@github-actions
Copy link
Contributor

github-actions bot commented Sep 12, 2023

Robot Results

✅ Passed ❌ Failed ⏭️ Skipped Total Pass % ⏱️ Duration
288 0 3 288 100 1h1m19.047999999s

@didier-wenzek didier-wenzek force-pushed the feat/tedge-agent-restart-child-devices branch from 26ce617 to e37f87a Compare September 12, 2023 12:23
@codecov
Copy link

codecov bot commented Sep 12, 2023

Codecov Report

Merging #2245 (97a07a0) into main (09b7d30) will decrease coverage by 0.1%.
Report is 1 commits behind head on main.
The diff coverage is 69.6%.

❗ Current head 97a07a0 differs from pull request most recent head b717228. Consider uploading reports for the commit b717228 to get more accurate results

Additional details and impacted files
Files Changed Coverage Δ
crates/core/c8y_api/src/smartrest/topic.rs 86.2% <0.0%> (-5.6%) ⬇️
crates/core/tedge_agent/src/agent.rs 0.0% <0.0%> (ø)
...ates/core/tedge_agent/src/restart_manager/error.rs 0.0% <ø> (ø)
crates/core/tedge_api/src/topic.rs 97.1% <ø> (+2.6%) ⬆️
crates/extensions/c8y_mapper_ext/src/config.rs 53.5% <0.0%> (-1.4%) ⬇️
crates/extensions/c8y_mapper_ext/src/converter.rs 77.8% <35.2%> (-2.1%) ⬇️
...ates/core/tedge_agent/src/restart_manager/actor.rs 43.1% <38.1%> (-4.4%) ⬇️
...tes/core/tedge_agent/src/restart_manager/config.rs 45.8% <75.0%> (+12.5%) ⬆️
crates/extensions/tedge_log_manager/src/actor.rs 69.2% <75.0%> (+0.2%) ⬆️
...tedge_agent/src/tedge_operation_converter/actor.rs 78.7% <78.5%> (+4.2%) ⬆️
... and 12 more

... and 7 files with indirect coverage changes

@didier-wenzek didier-wenzek temporarily deployed to Test Pull Request September 12, 2023 19:14 — with GitHub Actions Inactive
@didier-wenzek didier-wenzek temporarily deployed to Test Pull Request September 13, 2023 15:45 — with GitHub Actions Inactive
@didier-wenzek didier-wenzek temporarily deployed to Test Pull Request September 14, 2023 09:04 — with GitHub Actions Inactive
@didier-wenzek didier-wenzek marked this pull request as ready for review September 14, 2023 17:06
@didier-wenzek didier-wenzek temporarily deployed to Test Pull Request September 14, 2023 17:23 — with GitHub Actions Inactive
@reubenmiller
Copy link
Contributor

@didier-wenzek is the PR in a state where we can add an integration test for the child device as well? Or is there still other work required for the tedge-agent to function on a child device (in the scope of the restart command only)

Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Was a bit surprised to see the sheer amount of files and crates that were touched. It just highlights the amount of fragmentation that we currently have.

I was also surprised to see that no integration tests were updated. But then found out that none exists atm. I'm hoping that it is feasible now, since we have containers in place of real devices.

crates/core/tedge_api/src/messages.rs Outdated Show resolved Hide resolved
crates/core/tedge_api/src/messages.rs Show resolved Hide resolved
.to_smartrest()?;

Ok(vec![Message::new(&topic, smartrest_set_operation)])
impl CumulocityConverter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realised that you can have multiple plain impl blocks for the same struct, which are not for any trait implementations and this module already had 3 different impl blocks. Looks fine, but I just don't understand the benefits though. I'm assuming it is for better code organization to isolate different responsibilities into different impl blocks instead of one big fat impl block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, something that needs to be fixed. But in a PR doing only that.

Why are we here? Previously, there was an implementations of a trait used only here. There were also functions which have been promoted to methods. When I introduced these changes, I decided to go that way in order to help the code reviewers by minimizing code shuffling on an already huge refactoring.

Now, this can be cleaned up.

crates/extensions/c8y_mapper_ext/src/converter.rs Outdated Show resolved Hide resolved
@reubenmiller
Copy link
Contributor

reubenmiller commented Sep 15, 2023

This might be a bit premature, but I did a quick manual test to see if I could trigger a restart operation from Cumulocity on the child device, but

  • c8y_Restart operation is not registered on the c8y device managed object
  • the operation stays in PENDING status in Cumulocity, but the tedge-agent goes into a restart cycle, where the container continuously keeps restarting

So if this is not the expected behaviour (due to that part not being implemented yet), then I would suggest we wait to add some new integration tests for this new functionality before we merge it.

@didier-wenzek
Copy link
Contributor Author

So if this is not the expected behaviour (due to that part not being implemented yet), then I would suggest we wait to add some new integration tests for this new functionality before we merge it.

Sure, we must add system tests with the 3 systems (c8y, c8y-mapper and tedge-agent). The integration tests are only checking partial exchanges between c8y-mapper and tedge-agent.

@reubenmiller
Copy link
Contributor

So if this is not the expected behaviour (due to that part not being implemented yet), then I would suggest we wait to add some new integration tests for this new functionality before we merge it.

Sure, we must add system tests with the 3 systems (c8y, c8y-mapper and tedge-agent). The integration tests are only checking partial exchanges between c8y-mapper and tedge-agent.

Then I'll add the system tests on Monday, though we definitely have issues here (as previously posted).

@didier-wenzek
Copy link
Contributor Author

@didier-wenzek is the PR in a state where we can add an integration test for the child device as well? Or is there still other work required for the tedge-agent to function on a child device (in the scope of the restart command only)

The PR should be in a state where a restart of the child-device should work.

I was also surprised to see that no integration tests were updated. But then found out that none exists atm. I'm hoping that it is feasible now, since we have containers in place of real devices.

There is at least this one for the main device:

https://github.com/thin-edge/thin-edge.io/blob/main/tests/RobotFramework/tests/cumulocity/restart/restart_device.robot

@didier-wenzek didier-wenzek temporarily deployed to Test Pull Request September 15, 2023 08:58 — with GitHub Actions Inactive
@albinsuresh
Copy link
Contributor

I was also surprised to see that no integration tests were updated. But then found out that none exists atm. I'm hoping that it is feasible now, since we have containers in place of real devices.

There is at least this one for the main device:

https://github.com/thin-edge/thin-edge.io/blob/main/tests/RobotFramework/tests/cumulocity/restart/restart_device.robot

Aah okay, using the Cumulocity APIs instead of triggering commands via te/.../cmd/restart topics. I was searching for a direct test using the tedge/ or te/ topics and that's why I didn't get any results.

@didier-wenzek didier-wenzek force-pushed the feat/tedge-agent-restart-child-devices branch from b8672bd to 8f662dc Compare September 15, 2023 12:50
@didier-wenzek
Copy link
Contributor Author

This PR has been rebased on top of #2265

@didier-wenzek didier-wenzek temporarily deployed to Test Pull Request September 15, 2023 13:05 — with GitHub Actions Inactive
@didier-wenzek
Copy link
Contributor Author

didier-wenzek commented Sep 15, 2023

I confirm something is going wrong :-(

Restart operations are pushed and executed but then the agent start looping while the mapper say to nothing to c8y.

$ tedge mqtt sub te/+/+/+/+/cmd/restart/+
INFO: Connected
[te/device/main///cmd/restart/VKF2G1m2xslA-oLkvoy3t] {"status":"failed"}
[te/device/main///cmd/restart/vtigzeuw8fXhDPUGFnvXc] {"status":"failed"}
[te/device/main///cmd/restart/WR38g0Q4FsL4I-KdTj3y-] {"status":"failed"}
[te/device/main///cmd/restart/AG0CQgUz7FedAzGS-NLFq] {"status":"init"}

[te/device/main///cmd/restart/VKF2G1m2xslA-oLkvoy3t] {"status":"failed"}
[te/device/main///cmd/restart/VKF2G1m2xslA-oLkvoy3t] {"status":"failed"}
[te/device/main///cmd/restart/VKF2G1m2xslA-oLkvoy3t] {"status":"failed"}
[te/device/main///cmd/restart/vtigzeuw8fXhDPUGFnvXc] {"status":"failed"}
[te/device/main///cmd/restart/vtigzeuw8fXhDPUGFnvXc] {"status":"failed"}
[te/device/main///cmd/restart/WR38g0Q4FsL4I-KdTj3y-] {"status":"failed"}
[te/device/main///cmd/restart/WR38g0Q4FsL4I-KdTj3y-] {"status":"failed"}
[te/device/main///cmd/restart/AG0CQgUz7FedAzGS-NLFq] {"status":"failed"}
[te/device/main///cmd/restart/AG0CQgUz7FedAzGS-NLFq] {"status":"failed"}
[te/device/main///cmd/restart/VKF2G1m2xslA-oLkvoy3t] {"status":"failed"}
[te/device/main///cmd/restart/VKF2G1m2xslA-oLkvoy3t] {"status":"failed"}
[te/device/main///cmd/restart/VKF2G1m2xslA-oLkvoy3t] {"status":"failed"}
[te/device/main///cmd/restart/VKF2G1m2xslA-oLkvoy3t] {"status":"failed"}
[te/device/main///cmd/restart/vtigzeuw8fXhDPUGFnvXc] {"status":"failed"}

@didier-wenzek
Copy link
Contributor Author

didier-wenzek commented Sep 15, 2023

To be fixed:

  • the agent must ignore commands which are not in the init state
  • the mapper should use better cmd_id e.g. c8y-{timestamp}
  • the mapper correctly send [c8y/s/us] 502,c8y_Restart,"Restart Failed" but this is ignored by the cloud. Why?
  • the mapper must clear successful and failed requests
  • the reason for a restart failure is not given
  • on restart, the mapper should not create a new restart request (when pending in the cloud and locally)
  • it would be helpful to add to each message the name of the service which changed the state

@reubenmiller
Copy link
Contributor

The reason behind the status updates and failure reason not working is because they're being published on the incorrect c8y topic. For child devices it should be c8y/s/us/{child_id}

@didier-wenzek didier-wenzek temporarily deployed to Test Pull Request September 21, 2023 12:59 — with GitHub Actions Inactive
Copy link
Contributor

@reubenmiller reubenmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PRs looking very close now, I pushed some minor fixes and added some more system tests.

Though only remaining minor change is the following:

  • Log entry is misleading when the restart command is killed by a signal. The entry with Restart command has been interrupted by a signal is good, but it is proceeded by an entry with the opposite meaning The restart command has been successfully executed.

    Sep 21 21:30:08 d2dcf7b3095f tedge-agent[189]: 2023-09-21T21:30:08.072623262Z  INFO tedge_agent::restart_manager::actor: Restart command has been interrupted by a signal: Command { std: "/usr/bin/sudo" "/usr/bin/on_shutdown_signal_interrupt.sh", kill_on_drop: false }
    Sep 21 21:30:08 d2dcf7b3095f tedge-agent[189]: 2023-09-21T21:30:08.072724928Z  INFO tedge_agent::restart_manager::actor: The restart command has been successfully executed
    

@reubenmiller reubenmiller temporarily deployed to Test Pull Request September 22, 2023 01:23 — with GitHub Actions Inactive
Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks fine except for a small issue that I've highlighted around failed state repository updates, that's not handled cleanly(even in the old code). Happy to approve once that's fixed.

}

if message.topic.name.as_str().starts_with("tedge") {
return Ok(());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's that case? Getting a message on some tedge/ topic that we've explicitly subscribed to, but still ignoring?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this is temporary code. The first part of the method is related to tedge topics, the second part to te topics.

I could have write this differently, but came with this not so nice return to avoid confusing diffs in the PR.

crates/core/tedge_api/src/messages.rs Outdated Show resolved Hide resolved
crates/core/tedge_api/src/messages.rs Outdated Show resolved Hide resolved
Comment on lines +1021 to +1013
let ops_file = ops_dir.join("c8y_Restart");
create_directory_with_defaults(&ops_dir)?;
create_file_with_defaults(ops_file, None)?;
let device_operations = create_supported_operations(&ops_dir)?;
Ok(vec![device_operations])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I was wondering why we still need to rely on these files when we have all the info via retained messages. But then realised that we can't get rid of it until software management is also migrated to the topic based capability model and we expose the same topic API mechanism for custom operations as well.

continue;
}
info!("Triggering a restart");
let executing_response = self.update_state_repository(request.clone()).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an error that you introduced but something that existed in the previous version as well: If update_state_repository returned a Failed status, we shouldn't be proceeding beyond the next line, right? It should have failed fast and return from here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by 97a07a0

Comment on lines +79 to 99
match timeout(restart_timeout, self.message_box.recv_signal()).await {
Ok(Some(RuntimeRequest::Shutdown)) => {
info!("As requested, a shutdown has been triggered");
return Ok(());
}
Ok(None) | Err(_ /* timeout */) => {
// Something went wrong. The process should have been shutdown by the restart.
let error = "No shutdown has been triggered".to_string();
error!(error);
self.handle_error(request, error).await?;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we doing this here? To make sure that the restart command that was executed genuinely triggered a device restart which would have translated into a RuntimeRequest::Shutdown request? What if the user intentionally replaced the restart script with something that doesn't really restart the device but does something else? In that case, do we want to mark that operation as failed? From his perspective, the successful execution of the script should have translated to a Success, right?

Copy link
Contributor

@albinsuresh albinsuresh Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From his perspective, the successful execution of the script should have translated to a Success, right?

I guess this is the root of the problem. When it's a user-overridden script, we can't always assume that the device would have been restarted, and wait for the agent to reboot and really validate if the restart succeeded and then only send the Successful status. But, purely relying on the exit status is not sufficient for our default restart script either. So, it's like a chicken and egg problem.

If we can always assume that the user-provided scripts must restart the device, then this is fine. That's a very genuine assumption to make, IMO. Else we should expect their custom script to send the successful status update MQTT message as well, at the end of their "non-restarting" logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the user intentionally replaced the restart script with something that doesn't really restart the device but does something else?

The user should not be surprised that the restart command fails ;-)

If we can always assume that the user-provided scripts must restart the device, then this is fine.

We can and we do. Even more, we check that this is the case: if the agent doesn't shutdown than restart after a reboot of the device, then command execution is reported as failed.

@didier-wenzek
Copy link
Contributor Author

  • Log entry is misleading when the restart command is killed by a signal.
    The entry with Restart command has been interrupted by a signal is good,
    but it is proceeded by an entry with the opposite meaning The restart command has been successfully executed.

Addressed by 6cb2de6

@didier-wenzek didier-wenzek temporarily deployed to Test Pull Request September 22, 2023 11:13 — with GitHub Actions Inactive
@didier-wenzek didier-wenzek temporarily deployed to Test Pull Request September 22, 2023 12:29 — with GitHub Actions Inactive
didier-wenzek and others added 14 commits September 22, 2023 14:32
This is an intermediate state toward the new te/+/+/+/+/cmd/restart/+ topics.

Done:
- replace tedge/commands/+/control/restart by te/device/main///cmd/restart/+

Still to be done:
- The command id has to moved from the payload to the topic
- The requests has to be published with a retain message
- tedge-agent must use the mqtt_schema and entity store
- tedge-agent has to handle child devices
- The c8y-mapper has to register child device restart capability

Signed-off-by: Didier Wenzek <[email protected]>
RestartCommand replaces RestartOperation, RestartOperationRequest and
RestartOperationResponse. It encapsulates data extracted from the topic
(target device, command id) with data extracted from the payload (status).

Signed-off-by: Didier Wenzek <[email protected]>
- On start tedge-agent, tedge-agent publishes a message on
  `te/{tedge-agent-device}/cmd/restart
- On reception of this message, c8y-mapper updates on c8y the list of
  operations supported by the device running the agent

Signed-off-by: Didier Wenzek <[email protected]>
Using unwrap is safe as this can fail only if thin-edge message structs
were using maps with non-string keys.

Signed-off-by: Didier Wenzek <[email protected]>
The previous port was 3000 i.e. the port used by default by docusaurus.
This was really inconvenient as all the file_transfer tests were failing
while locally serving thin-edge documentation.

Signed-off-by: Didier Wenzek <[email protected]>
Using a device identifier such as a/b/c/d (i.e not
device/{device-name}//) makes c8y mapper panic.

Signed-off-by: Didier Wenzek <[email protected]>
Signed-off-by: Didier Wenzek <[email protected]>
Signed-off-by: Didier Wenzek <[email protected]>
@didier-wenzek didier-wenzek force-pushed the feat/tedge-agent-restart-child-devices branch from 0a8b84c to b717228 Compare September 22, 2023 13:06
@didier-wenzek didier-wenzek temporarily deployed to Test Pull Request September 22, 2023 13:19 — with GitHub Actions Inactive
Copy link
Contributor

@reubenmiller reubenmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The requested log message changes were resolved.

Approved

@didier-wenzek didier-wenzek merged commit 5977440 into thin-edge:main Sep 22, 2023
@didier-wenzek didier-wenzek deleted the feat/tedge-agent-restart-child-devices branch September 22, 2023 14:00
didier-wenzek added a commit to didier-wenzek/thin-edge.io that referenced this pull request Sep 25, 2023
As suggested by thin-edge#2245 (comment),
the reason field makes sense only for a Failed command.

Signed-off-by: Didier Wenzek <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants