Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Operation Workflow Specification & Implementation #2071

Merged
merged 14 commits into from
Nov 22, 2023

Conversation

didier-wenzek
Copy link
Contributor

@didier-wenzek didier-wenzek commented Jul 13, 2023

Proposed changes

Specification for MQTT-driven workflows that let the user extend the builtin operation support provided by thin-edge.

See the POC for more details on the idea.

  • Operation overview
  • MQTT topics
  • Workflow Definition
  • Workflow Implementation
  • Workflow Overriding
  • Workflow Actions
  • Workflow Composition
  • Security
  • Backward Compatibility

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue


Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

@reubenmiller reubenmiller mentioned this pull request Jul 17, 2023
11 tasks
@didier-wenzek didier-wenzek marked this pull request as ready for review July 17, 2023 21:45
@reubenmiller
Copy link
Contributor

@didier-wenzek. I've added only a few minor comments. I really like the direction we're going 👍

docs/src/references/operation-workflow.md Outdated Show resolved Hide resolved
docs/src/references/operation-workflow.md Outdated Show resolved Hide resolved
docs/src/references/operation-workflow.md Outdated Show resolved Hide resolved
docs/src/references/operation-workflow.md Outdated Show resolved Hide resolved
docs/src/references/operation-workflow.md Outdated Show resolved Hide resolved
Comment on lines 333 to 323
next = ["Download", "Failed"]
script = "/bin/schedule-configuration-update.sh"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the contract is like, if the script execution succeeds, the operation is moved to the first state in the next array and on failure it is moved to the second one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have to describe the contract.

The idea is that the script is given the current state as an argument and that it has to output the new state on stdout. If the output of the script cannot be decoded as a state, the command is moved to "failed".

script = "/bin/schedule-configuration-update.sh"

[Download]
owner = tedge-configuration-plugin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the tedge-agent do anything with this owner info, other than just determining if it is "tedge" or not? I mean, it doesn't care about the value as long as it is not tedge, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed in the POC nothing is done if the owner is not tedge.

However, if the MQTT identifier of this owner is given (i.e. device/main/service/tedge-configuration-plugin) a sophisticated implementation could use this to check if thing has a chance to work.

=> This needs to be clarified

docs/src/references/operation-workflow.md Outdated Show resolved Hide resolved
docs/src/references/operation-workflow.md Outdated Show resolved Hide resolved
The priority rules give a higher priority to the workflow that are user-defined than to those pre-defined by thin-edge.
If several user-defined workflows are matching a command state,
then the alphabetic order of the workflow definition file names is used:
`001-configuration-update.toml` being of higher priority than `002-configuration-update.toml`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with this priority mechanism, it's not fully clear to me as to how we can use different workflows for different targets of the same operation type. For example, with the config-update operation, there'd be different kinds of custom scripts to perform updates of different types of configs. So, how would the workflow be handed over to the respective script for that given type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, with the config-update operation, there'd be different kinds of custom scripts to perform updates of different types of configs.

The proposal is to have a single set of actions for a (command, status) pair. If you want to behave differently depending on some property of the state payload (here the config type) then this has to be handle inside the script. If you want to handle this at the workflow level, you can introduce new state (e.g. update_mosquitto_config and update_other_config) but you need then to run a script on the state update_config that move to one or the other state depending on the config type.

Copy link
Member

@rina23q rina23q left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few questions 👍

docs/src/references/operation-workflow.md Outdated Show resolved Hide resolved
next = ["Install", "Failed"]

[Install]
owner = "tedge-configuration-plugin"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provided tedge-configuration-plguin is a daemon, this owner field is just indicating the daemon name for logging purpose or like that? I assume that the state machine doesn't really care about the process owner?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. See #2071 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something still needs to be clarified.

Indeed, the tedge-configuration-plugin can be implemented as a:

  • as a daemon - as of today
  • as a script - as you suggested here
  • as an actor running inside the agent - as implemented by the POC

These 3 alternatives have different pros/cons as well as implementation impacts. I will dive into these.

@didier-wenzek didier-wenzek self-assigned this Jul 21, 2023
@didier-wenzek didier-wenzek added the improvement User value label Jul 21, 2023
Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The comments are mostly rewording suggestions and a few queries for clarification. Happy to approve once those queries are clarified.

docs/src/references/agent/device-management-api.md Outdated Show resolved Hide resolved
docs/src/references/agent/device-management-api.md Outdated Show resolved Hide resolved
docs/src/references/agent/device-management-api.md Outdated Show resolved Hide resolved
docs/src/references/agent/device-management-api.md Outdated Show resolved Hide resolved
docs/src/references/agent/device-management-api.md Outdated Show resolved Hide resolved
docs/src/references/agent/operation-workflow.md Outdated Show resolved Hide resolved
- possible extra instructions on how to process the command at this stage, e.g.
- run a script

```toml title="file: firmware_update_example.toml"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The states mentioned in this file are not the same that were described in the example above. Keeping them consistent would be nice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though this is in a section called "workflow overriding", so it is meant to show different states.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, in that case it might be a good idea to add a workflow file example conforming to that same simple example somewhere earlier: may be, in the Operation API section itself, right after the intro of the APIs, so that we have something basic before getting into advanced overriding here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we can add this in later.

docs/src/references/agent/operation-workflow.md Outdated Show resolved Hide resolved

[init]
script = "/usr/bin/firmware_handler.sh plan"
next = ["scheduled", "failed"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be obvious from these examples, but it might be better to explicitly document that the first value is the target state on a successful state action and the second one is for failure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of values does not mean anything at this point in time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm missing something here then. My understanding was that, if the script exits with non-zero exit code, without any JSON output with explicit status, then the workflow moves to the first state mentioned in the array. In the case of a script execution failure, it moves to the second. Isn't that the case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That isn't implemented yet, and we are still deciding on what to do there. This will be done in a follow up PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, this list is mostly documentation.

Copy link
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@reubenmiller reubenmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Any further improvements can be done in followup PRs

didier-wenzek and others added 14 commits November 22, 2023 14:59
Signed-off-by: Didier Wenzek <[email protected]>
Signed-off-by: Didier Wenzek <[email protected]>
Signed-off-by: Didier Wenzek <[email protected]>
The motivation is to be able to register a custom version of a built-in
workflow. The mappers still create commands in the Init state, but the
built-in operations only react on the Schedule state. If no custom
version of the workflow has been provided, the agent moves operation
requests from the Init to the Scheduled state. If a custom version is
provided, then this user-defined workflow can add extra checks and steps
before triggering the built-in operation steps (by putting the operation
in the Schedule state).

As of now, the log and config operation are unchanged (starting on Init
and ignoring the Schedule state).

Signed-off-by: Didier Wenzek <[email protected]>
- Basic concepts are moved under the operation API introduction
- Workflow specifications is focused on user-specific workflow
  definition and start with an example.

Signed-off-by: Didier Wenzek <[email protected]>
Signed-off-by: Didier Wenzek <[email protected]>
Signed-off-by: Didier Wenzek <[email protected]>
@didier-wenzek didier-wenzek force-pushed the specs/operation-workflow branch from 83e365d to 17e3d66 Compare November 22, 2023 14:01
@didier-wenzek didier-wenzek merged commit 5259a45 into thin-edge:main Nov 22, 2023
18 checks passed
@didier-wenzek didier-wenzek deleted the specs/operation-workflow branch November 22, 2023 14:33
@didier-wenzek didier-wenzek mentioned this pull request Nov 22, 2023
15 tasks
@didier-wenzek
Copy link
Contributor Author

Here are the follow-up tasks for all the unresolved comments: #2478

@gligorisaev
Copy link
Contributor

QA has thoroughly checked the feature and here are the results:

  • Test for ticket exists in the test suite.
  • QA has tested the function and it's functioning according description.

@gligorisaev gligorisaev self-assigned this Dec 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Change request improvement User value
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants