Operation Workflow Specification & Implementation #2071

didier-wenzek · 2023-07-13T13:49:03Z

Proposed changes

Specification for MQTT-driven workflows that let users extend the built-in operation support provided by thin-edge.

See the POC for more details on the idea.

Types of changes

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
Documentation Update (if none of the other choices apply)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist

I have read the CONTRIBUTING doc
I have signed the CLA (in all commits with git commit -s)
I ran cargo fmt as mentioned in CODING_GUIDELINES
I used cargo clippy as mentioned in CODING_GUIDELINES
I have added tests that prove my fix is effective or that my feature works
I have added necessary documentation (if appropriate)

Further comments

docs/src/references/operation-workflow.md

reubenmiller · 2023-07-18T08:08:18Z

@didier-wenzek. I've added only a few minor comments. I really like the direction we're going 👍

docs/src/references/operation-workflow.md

albinsuresh · 2023-07-19T12:46:57Z

docs/src/references/operation-workflow.md

+  next = ["Download", "Failed"]
+  script = "/bin/schedule-configuration-update.sh"


So the contract is: if the script execution succeeds, the operation moves to the first state in the `next` array, and on failure it moves to the second one?

I still have to describe the contract.

The idea is that the script is given the current state as an argument and that it has to output the new state on stdout. If the output of the script cannot be decoded as a state, the command is moved to "failed".
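The contract described above could be sketched roughly as follows. This is only an illustration, not the thin-edge implementation: it assumes that a state is exchanged as a JSON object with a `status` field, and the helper names `decode_state` and `run_state_script` are hypothetical.

```python
import json
import subprocess


def decode_state(output: str) -> dict:
    """Decode a script's stdout as a state.

    Assumption: a state is a JSON object with a string "status" field.
    Anything that cannot be decoded that way moves the command to "failed".
    """
    failed = {"status": "failed", "reason": "script output is not a valid state"}
    try:
        state = json.loads(output)
    except json.JSONDecodeError:
        return failed
    if not isinstance(state, dict) or not isinstance(state.get("status"), str):
        return failed
    return state


def run_state_script(script: list, current_state: dict) -> dict:
    """Run the script with the current state as its last argument
    and decode whatever it prints on stdout as the new state."""
    result = subprocess.run(
        script + [json.dumps(current_state)],
        capture_output=True,
        text=True,
        check=False,
    )
    return decode_state(result.stdout)
```

With this sketch, a script that prints something other than a state object leaves the command in the `failed` state, whatever its exit code.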

albinsuresh · 2023-07-19T12:48:09Z

docs/src/references/operation-workflow.md

+  script = "/bin/schedule-configuration-update.sh"
+
+[Download]
+  owner = tedge-configuration-plugin


Does the tedge-agent do anything with this owner info, other than just determining if it is "tedge" or not? I mean, it doesn't care about the value as long as it is not tedge, right?

Indeed in the POC nothing is done if the owner is not tedge.

However, if the MQTT identifier of this owner is given (i.e. device/main/service/tedge-configuration-plugin), a more sophisticated implementation could use it to check whether the workflow has a chance to work.

=> This needs to be clarified

docs/src/references/operation-workflow.md

albinsuresh · 2023-07-19T13:01:52Z

docs/src/references/operation-workflow.md

+The priority rules give a higher priority to the workflow that are user-defined than to those pre-defined by thin-edge.
+If several user-defined workflows are matching a command state,
+then the alphabetic order of the workflow definition file names is used: 
+`001-configuration-update.toml` being of higher priority than `002-configuration-update.toml`.
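A minimal sketch of the priority rules quoted above, assuming each workflow is known by its definition file name, a flag telling whether it is user-defined, and the set of command states it declares. The `Workflow` type and the `select_workflow` function are hypothetical and only illustrate the ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    file_name: str      # e.g. "001-configuration-update.toml"
    user_defined: bool  # False for workflows pre-defined by thin-edge
    states: frozenset   # command states this workflow declares (assumption)


def select_workflow(workflows, state):
    """Pick the workflow that handles `state`, or None if none matches.

    User-defined workflows win over pre-defined ones; ties are broken
    by the alphabetic order of the definition file names.
    """
    candidates = [w for w in workflows if state in w.states]
    if not candidates:
        return None
    # False sorts before True, so user-defined workflows come first.
    return min(candidates, key=lambda w: (not w.user_defined, w.file_name))
```

The sort key encodes both rules at once, so adding a further tie-breaker later would only mean extending the tuple.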


Even with this priority mechanism, it's not fully clear to me as to how we can use different workflows for different targets of the same operation type. For example, with the config-update operation, there'd be different kinds of custom scripts to perform updates of different types of configs. So, how would the workflow be handed over to the respective script for that given type?

For example, with the config-update operation, there'd be different kinds of custom scripts to perform updates of different types of configs.

The proposal is to have a single set of actions for a (command, status) pair. If you want to behave differently depending on some property of the state payload (here the config type), then this has to be handled inside the script. If you want to handle this at the workflow level, you can introduce new states (e.g. update_mosquitto_config and update_other_config), but you then need to run a script on the state update_config that moves to one or the other state depending on the config type.
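As a rough illustration of the second option, the script attached to the update_config state could route the command like this. The sketch assumes the state payload is a JSON object carrying the config type in a `type` field and the state name in a `status` field; both field names and the function name are assumptions, not part of the proposal.

```python
def route_update_config(state: dict) -> dict:
    """Move a command from update_config to a type-specific state.

    Assumption: the config type is stored under the "type" key.
    """
    if state.get("type") == "mosquitto":
        status = "update_mosquitto_config"
    else:
        status = "update_other_config"
    # Keep the rest of the payload untouched; only the status changes.
    return {**state, "status": status}
```

Each of the two target states can then declare its own script in the workflow definition, so the per-type logic stays out of the routing step.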

rina23q

Few questions 👍

docs/src/references/operation-workflow.md

rina23q · 2023-07-19T15:00:33Z

docs/src/references/operation-workflow.md

+  next = ["Install", "Failed"]
+
+[Install]
+  owner = "tedge-configuration-plugin"


Provided tedge-configuration-plugin is a daemon, is this owner field just indicating the daemon name for logging purposes or something like that? I assume that the state machine doesn't really care about the process owner?

Correct. See #2071 (comment)

Something still needs to be clarified.

Indeed, the tedge-configuration-plugin can be implemented:

as a daemon - as of today

as a script - as you suggested here

as an actor running inside the agent - as implemented by the POC

These 3 alternatives have different pros/cons as well as implementation impacts. I will dive into these.

albinsuresh

LGTM. The comments are mostly rewording suggestions and a few queries for clarification. Happy to approve once those queries are clarified.

docs/src/references/agent/device-management-api.md

docs/src/references/agent/operation-workflow.md

albinsuresh · 2023-11-22T07:24:45Z

docs/src/references/agent/operation-workflow.md

+  - possible extra instructions on how to process the command at this stage, e.g.
+    - run a script
+
+```toml title="file: firmware_update_example.toml"


The states mentioned in this file are not the same that were described in the example above. Keeping them consistent would be nice.

Though this is in a section called "workflow overriding", so it is meant to show different states.

Okay, in that case it might be a good idea to add a workflow file example conforming to that same simple example somewhere earlier, maybe in the Operation API section itself, right after the intro of the APIs, so that we have something basic before getting into advanced overriding here.

Yeah we can add this in later.

docs/src/references/agent/operation-workflow.md

albinsuresh · 2023-11-22T07:38:07Z

docs/src/references/agent/operation-workflow.md

+
+[init]
+  script = "/usr/bin/firmware_handler.sh plan"
+  next = ["scheduled", "failed"]


It may be obvious from these examples, but it might be better to explicitly document that the first value is the target state on a successful state action and the second one is for failure.

The order of values does not mean anything at this point in time.

I'm missing something here then. My understanding was that, if the script exits with a zero exit code, without any JSON output with an explicit status, then the workflow moves to the first state mentioned in the array. In the case of a script execution failure, it moves to the second. Isn't that the case?

That isn't implemented yet, and we are still deciding on what to do there. This will be done in a follow-up PR.

For now, this list is mostly documentation.

albinsuresh

LGTM

reubenmiller

Approved. Any further improvements can be done in followup PRs

Signed-off-by: Didier Wenzek <[email protected]>

The motivation is to be able to register a custom version of a built-in workflow. The mappers still create commands in the Init state, but the built-in operations only react to the Scheduled state. If no custom version of the workflow has been provided, the agent moves operation requests from the Init state to the Scheduled state. If a custom version is provided, this user-defined workflow can add extra checks and steps before triggering the built-in operation steps (by putting the operation in the Scheduled state). As of now, the log and config operations are unchanged (starting on Init and ignoring the Scheduled state).

Signed-off-by: Didier Wenzek <[email protected]>
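The behaviour described in this commit message could be pictured as below. This is a hedged sketch only: the function name, the lowercase status strings and the idea of passing the custom workflow as a callable are all assumptions made for illustration.

```python
def handle_init(status: str, custom_init_handler=None) -> str:
    """Return the status a command reaches after the Init step.

    Without a user-defined workflow, the agent itself moves the command
    from "init" to "scheduled". With one, the user-defined handler decides,
    and may run extra checks before returning "scheduled".
    """
    if status != "init":
        return status  # only commands in the Init state are concerned
    if custom_init_handler is None:
        return "scheduled"
    return custom_init_handler()
```

For example, a custom handler that refuses the operation outside a maintenance window would return `"failed"` instead of `"scheduled"`, and the built-in steps would never be triggered.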

- Basic concepts are moved under the operation API introduction
- The workflow specification is focused on user-specific workflow definitions and starts with an example.

Signed-off-by: Didier Wenzek <[email protected]>

Signed-off-by: Reuben Miller <[email protected]>

didier-wenzek · 2023-11-22T14:38:23Z

Here are the follow-up tasks for all the unresolved comments: #2478

gligorisaev · 2023-12-22T06:21:09Z

QA has thoroughly checked the feature and here are the results:

Test for ticket exists in the test suite.
QA has tested the function and it works according to the description.

reubenmiller mentioned this pull request Jul 17, 2023

Guidelines for MQTT topics #2030

Merged

didier-wenzek marked this pull request as ready for review July 17, 2023 21:45

didier-wenzek requested review from albinsuresh, reubenmiller and rina23q July 17, 2023 21:46

reubenmiller reviewed Jul 18, 2023
