-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restart operation on child-devices #2245
Restart operation on child-devices #2245
Conversation
Robot Results
|
26ce617
to
e37f87a
Compare
Codecov Report
Additional details and impacted files
|
@didier-wenzek is the PR in a state where we can add an integration test for the child device as well? Or is there still other work required for the tedge-agent to function on a child device (in the scope of the restart command only) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Was a bit surprised to see the sheer amount of files and crates that were touched. It just highlights the amount of fragmentation that we currently have.
I was also surprised to see that no integration tests were updated. But then found out that none exists atm. I'm hoping that it is feasible now, since we have containers in place of real devices.
.to_smartrest()?; | ||
|
||
Ok(vec![Message::new(&topic, smartrest_set_operation)]) | ||
impl CumulocityConverter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realised that you can have multiple plain impl blocks for the same struct, which are not for any trait implementations and this module already had 3 different impl blocks. Looks fine, but I just don't understand the benefits though. I'm assuming it is for better code organization to isolate different responsibilities into different impl blocks instead of one big fat impl block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, something that needs to be fixed. But in a PR doing only that.
Why are we here? Previously, there was an implementations of a trait used only here. There were also functions which have been promoted to methods. When I introduced these changes, I decided to go that way in order to help the code reviewers by minimizing code shuffling on an already huge refactoring.
Now, this can be cleaned up.
This might be a bit premature, but I did a quick manual test to see if I could trigger a restart operation from Cumulocity on the child device, but
So if this is not the expected behaviour (due to that part not being implemented yet), then I would suggest we wait to add some new integration tests for this new functionality before we merge it. |
Sure, we must add system tests with the 3 systems (c8y, c8y-mapper and tedge-agent). The integration tests are only checking partial exchanges between c8y-mapper and tedge-agent. |
Then I'll add the system tests on Monday, though we definitely have issues here (as previously posted). |
The PR should be in a state where a restart of the child-device should work.
There is at least this one for the main device: |
Aah okay, using the Cumulocity APIs instead of triggering commands via |
b8672bd
to
8f662dc
Compare
This PR has been rebased on top of #2265 |
I confirm something is going wrong :-( Restart operations are pushed and executed but then the agent start looping while the mapper say to nothing to c8y.
|
To be fixed:
|
The reason behind the status updates and failure reason not working is because they're being published on the incorrect c8y topic. For child devices it should be c8y/s/us/{child_id} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PRs looking very close now, I pushed some minor fixes and added some more system tests.
Though only remaining minor change is the following:
-
Log entry is misleading when the restart command is killed by a signal. The entry with
Restart command has been interrupted by a signal
is good, but it is proceeded by an entry with the opposite meaningThe restart command has been successfully executed
.Sep 21 21:30:08 d2dcf7b3095f tedge-agent[189]: 2023-09-21T21:30:08.072623262Z INFO tedge_agent::restart_manager::actor: Restart command has been interrupted by a signal: Command { std: "/usr/bin/sudo" "/usr/bin/on_shutdown_signal_interrupt.sh", kill_on_drop: false } Sep 21 21:30:08 d2dcf7b3095f tedge-agent[189]: 2023-09-21T21:30:08.072724928Z INFO tedge_agent::restart_manager::actor: The restart command has been successfully executed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks fine except for a small issue that I've highlighted around failed state repository updates, that's not handled cleanly(even in the old code). Happy to approve once that's fixed.
} | ||
|
||
if message.topic.name.as_str().starts_with("tedge") { | ||
return Ok(()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's that case? Getting a message on some tedge/
topic that we've explicitly subscribed to, but still ignoring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this is temporary code. The first part of the method is related to tedge
topics, the second part to te
topics.
I could have write this differently, but came with this not so nice return to avoid confusing diffs in the PR.
tests/RobotFramework/tests/cumulocity/restart/restart_device_child.robot
Outdated
Show resolved
Hide resolved
let ops_file = ops_dir.join("c8y_Restart"); | ||
create_directory_with_defaults(&ops_dir)?; | ||
create_file_with_defaults(ops_file, None)?; | ||
let device_operations = create_supported_operations(&ops_dir)?; | ||
Ok(vec![device_operations]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially I was wondering why we still need to rely on these files when we have all the info via retained messages. But then realised that we can't get rid of it until software management is also migrated to the topic based capability model and we expose the same topic API mechanism for custom operations as well.
continue; | ||
} | ||
info!("Triggering a restart"); | ||
let executing_response = self.update_state_repository(request.clone()).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not an error that you introduced but something that existed in the previous version as well: If update_state_repository
returned a Failed
status, we shouldn't be proceeding beyond the next line, right? It should have failed fast and return from here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed by 97a07a0
match timeout(restart_timeout, self.message_box.recv_signal()).await { | ||
Ok(Some(RuntimeRequest::Shutdown)) => { | ||
info!("As requested, a shutdown has been triggered"); | ||
return Ok(()); | ||
} | ||
Ok(None) | Err(_ /* timeout */) => { | ||
// Something went wrong. The process should have been shutdown by the restart. | ||
let error = "No shutdown has been triggered".to_string(); | ||
error!(error); | ||
self.handle_error(request, error).await?; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we doing this here? To make sure that the restart command that was executed genuinely triggered a device restart which would have translated into a RuntimeRequest::Shutdown
request? What if the user intentionally replaced the restart script with something that doesn't really restart the device but does something else? In that case, do we want to mark that operation as failed? From his perspective, the successful execution of the script should have translated to a Success
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From his perspective, the successful execution of the script should have translated to a Success, right?
I guess this is the root of the problem. When it's a user-overridden script, we can't always assume that the device would have been restarted, and wait for the agent to reboot and really validate if the restart succeeded and then only send the Successful
status. But, purely relying on the exit status is not sufficient for our default restart script either. So, it's like a chicken and egg problem.
If we can always assume that the user-provided scripts must restart the device, then this is fine. That's a very genuine assumption to make, IMO. Else we should expect their custom script to send the successful
status update MQTT message as well, at the end of their "non-restarting" logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the user intentionally replaced the restart script with something that doesn't really restart the device but does something else?
The user should not be surprised that the restart command fails ;-)
If we can always assume that the user-provided scripts must restart the device, then this is fine.
We can and we do. Even more, we check that this is the case: if the agent doesn't shutdown than restart after a reboot of the device, then command execution is reported as failed.
Addressed by 6cb2de6 |
This is an intermediate state toward the new te/+/+/+/+/cmd/restart/+ topics. Done: - replace tedge/commands/+/control/restart by te/device/main///cmd/restart/+ Still to be done: - The command id has to moved from the payload to the topic - The requests has to be published with a retain message - tedge-agent must use the mqtt_schema and entity store - tedge-agent has to handle child devices - The c8y-mapper has to register child device restart capability Signed-off-by: Didier Wenzek <[email protected]>
RestartCommand replaces RestartOperation, RestartOperationRequest and RestartOperationResponse. It encapsulates data extracted from the topic (target device, command id) with data extracted from the payload (status). Signed-off-by: Didier Wenzek <[email protected]>
Signed-off-by: Didier Wenzek <[email protected]>
- On start tedge-agent, tedge-agent publishes a message on `te/{tedge-agent-device}/cmd/restart - On reception of this message, c8y-mapper updates on c8y the list of operations supported by the device running the agent Signed-off-by: Didier Wenzek <[email protected]>
Using unwrap is safe as this can fail only if thin-edge message structs were using maps with non-string keys. Signed-off-by: Didier Wenzek <[email protected]>
The previous port was 3000 i.e. the port used by default by docusaurus. This was really inconvenient as all the file_transfer tests were failing while locally serving thin-edge documentation. Signed-off-by: Didier Wenzek <[email protected]>
Signed-off-by: Didier Wenzek <[email protected]>
Signed-off-by: Reuben Miller <[email protected]>
Using a device identifier such as a/b/c/d (i.e not device/{device-name}//) makes c8y mapper panic. Signed-off-by: Didier Wenzek <[email protected]>
Signed-off-by: Reuben Miller <[email protected]>
Signed-off-by: Didier Wenzek <[email protected]>
Signed-off-by: Reuben Miller <[email protected]>
Signed-off-by: Reuben Miller <[email protected]>
Signed-off-by: Didier Wenzek <[email protected]>
0a8b84c
to
b717228
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The requested log message changes were resolved.
Approved
As suggested by thin-edge#2245 (comment), the reason field makes sense only for a Failed command. Signed-off-by: Didier Wenzek <[email protected]>
Updating
tedge-agent
andc8y-mapper
to support restart operation on child devicesProposed changes
mqtt.topic_root
andmqtt.device_topic_id
te/device/main///cmd/restart/+
instead oftedge/commands/+/control/restart
Types of changes
Paste Link to the issue
#2018
Checklist
cargo fmt
as mentioned in CODING_GUIDELINEScargo clippy
as mentioned in CODING_GUIDELINESFurther comments