Reorganise operation workflow documentation

- Basic concepts are moved under the operation API introduction
- The workflow specification is focused on user-specific workflow definitions and starts with an example.

Signed-off-by: Didier Wenzek <[email protected]>
thin-edge · Nov 22, 2023 · 1920190
1 parent b6d00c9
commit 1920190

Showing 3 changed files with 117 additions and 167 deletions.
diff --git a/docs/src/references/agent/device-management-api.md b/docs/src/references/agent/device-management-api.md
@@ -22,6 +22,52 @@ However, despite their diversity, all these APIs are designed along the same lin
 - create new command requests of a specific type for some target device
 - monitor the progression of a specific command request up to completion.
 
+## Concepts
+
+### Operations, Capabilities, and Commands
+
+From a user perspective an *operation* is a predefined sequence of actions
+that an operator can trigger on a device to reach some desirable state.
+It can be to restart the device or to install some new software.
+From an implementation perspective, an operation is an API identified by a well-known name such as `restart` or `software_update`.
+This API rules the coordination among the software components that need to interact to advance the operation.
+
+Not all entities and components of a thin-edge device support all the operations,
+and, even if they do, the implementations might be specific.
+Installing a software package on top of a service makes no sense.
+Restarting the device is not the same as restarting one of its services.
+Each entity or component has to declare its *capabilities* i.e. the operations made available on this target.
+
+Strictly speaking, capabilities are neither implemented nor declared by the devices and the services themselves.
+They are implemented by thin-edge services and plugins.
+These are the components which actually implement the operations interacting with the operating system and other software.
+For instance, device restart and software updates are implemented by the `tedge-agent`.
+
+Once an operation has been registered as a capability of some target entity or component,
+an operator can trigger operation requests, a.k.a. *commands*,
+for this kind of operation on this target,
+say to request a software update, then a restart of the device.
+
+### MQTT-Driven Workflows
+
+The core idea is to expose over MQTT the different states a specific operation request might go through;
+so independent sub-systems can observe the progress of the request
+and participate as per their role, when it is their turn.
+
+- A specific topic is attached to each command under execution.
+  - This topic is specific to the target of the command, the requested operation and the request instance.
+  - e.g. `te/device/child-xyz///cmd/configuration-update/req-123`
+- The messages published over this topic represent the current state of the command.
+  - Each message indicates at which step of its progression the command is and gives all the required information to proceed.
+  - e.g. `{ "status": "init", "target": "mosquitto", "url": "https://..." }`
+- The state messages are published as retained.
+  - They capture the latest state of the operation request.
+  - Till some change occurs, this latest state is dispatched to any participant on reconnect.
+- Several participants act in concert to move the command execution forward.
+  - The participants observe the progress of all the operations they are interested in.
+  - They watch for the specific states they are responsible for moving forward.
+  - When a step is performed, successfully or not, the new state is published accordingly by the performer.
+
 ## Topics
 
 Following [thin-edge MQTT topic conventions](../mqtt-api.md#commands),
@@ -31,8 +77,10 @@ and specific sub-topics for the requests.
 
 ### Command metadata topics
 
-The __command metadata topics__ are used to declare which commands are available for a device,
-and, if so, to which extent.
+The command metadata topics are used to declare the *capabilities* of a device.
+
+The ability for an entity *a*/*b*/*c*/*d* to handle a given *operation* is published as a retained message
+on the topic __te__/*a*/*b*/*c*/*d*/__cmd__/*operation*.
 
 ```mermaid
 graph LR
@@ -60,8 +108,13 @@ Where the groups are described as follows:
 | `cmd`        | The [command channel](../mqtt-api.md/#channel-identifier) grouping all of the commands for this target.                                                                               |
 | command_type | The type name of the operation.                                                                                                                                                       |
 
-A service that implements an operation for a device publishes on start a message notifying on the MQTT Bus
+A service that implements an operation for a device publishes, on start, a capability message notifying
 that this device can be sent commands of this type.
+As an example, the `tedge-agent` which implements the `restart` operation emits on start a capability message for that operation:
+
+```sh te2mqtt
+tedge mqtt pub -r 'te/device/main///cmd/restart' '{}' 
+```
 
 These messages are published with the retained flag set. So, a client process, such as a mapper, can discover on start
 what are __all the capabilities of all the devices__:
@@ -72,9 +125,11 @@ tedge mqtt sub 'te/+/+/+/+/cmd/+'
 
 ### Command status topics
 
-The actual command requests are published on the __command status topics__.
-For each request, a specific command topic is created to monitor the progress of the command from its initial state to its completion.
-These topics are named using a unique command identifier forged by the requester.
+The actual command requests are published on the command status topics.
+
+Each request is given a unique *command identifier*
+and the topic __te__/*a*/*b*/*c*/*d*/__cmd__/*operation*/*command-identifier*
+is used to trigger and monitor this request for a given *operation* on a target entity *a*/*b*/*c*/*d*.
 
 ```mermaid
 graph LR
@@ -99,7 +154,9 @@ graph LR
 
 :::note
 The `command_id` is an arbitrary string however it should be unique.
-It is recommended to either use a unique id generator, or add a unix timestamp as a suffix, e.g. date +%s
+It is recommended to either use a unique id generator, or add a unix timestamp as a suffix, e.g. `date +%s`.
+This unique id is assigned by the requester, who is also responsible for creating the topic
+with an initial state and for finally removing it.
 :::
 
 The messages published on these topics represent each the current status of a running command.
@@ -109,6 +166,36 @@ So, one can list __all the in-progress commands of any type across all the devic
 tedge mqtt sub 'te/+/+/+/+/cmd/+/+'
 ```
 
+As an example, software update is an operation that requires coordination between a mapper and `tedge-agent`.
+On reception of a software update request from the cloud operator,
+the `tedge-mapper` creates a fresh new topic for this command,
+say `te/device/main///cmd/software_update/c8y-mapper-123` with a unique command id: `c8y-mapper-123`.
+On this topic, a first retained message is published to describe the operator's expectations for the software update.
+
+```sh te2mqtt
+tedge mqtt pub -r 'te/device/main///cmd/software_update/c8y-mapper-123' '{
+    "status": "init",
+    "modules": [
+        {
+            "type": "apt",
+            "name": "collectd",
+            "version": "5.7",
+            "action": "install"
+        }
+    ]
+}' 
+```
+
+Then, the `tedge-agent` and possibly other software components take charge of the command,
+making it advance to some final state,
+publishing all the successive states as retained messages on the command topic.
+
+Eventually, the `tedge-mapper` will have to clean the command topic with an empty retained message:
+
+```sh te2mqtt
+tedge mqtt pub -r 'te/device/main///cmd/software_update/c8y-mapper-123' ''
+```
+
 ## Message payloads
 
 The message payloads are all specific to each operation type.
@@ -142,6 +229,19 @@ triggering a health message response published on the `status/health` channel of
 
 ## Operation workflow
 
+An operation workflow defines the possible sequences of actions for an operation request
+from its initialization up to its success or failure. It specifies the actions to perform
+as well as any prerequisite checks, outcome validations and possible rollbacks.
+However, a workflow doesn't define how to perform these actions.
+These are delegated to thin-edge services, scripts, application-specific services or other devices.
+More precisely, an operation workflow defines:
+- the *observable states* of an ongoing operation instance
+  from initialization up to a final success or failure
+- the *participants* and their interactions, passing the baton to the software component
+  whose responsibility is to advance the operation in a given state
+  and to notify the other participants of the new resulting state
+- the *possible state sequences* so that the system can detect any stale or misbehaving operation request.
+
 A specific workflow rules each operation type, with specific:
 - states
 - message payloads

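Note on the topic scheme documented in the file above: capability topics have the shape `te/a/b/c/d/cmd/operation`, and command status topics append a command identifier. The following is a minimal Python sketch of that scheme, not part of the commit and not thin-edge code. The names `CommandTopic`, `command_topic` and `parse_command_topic` are hypothetical helpers introduced here for illustration only.

```python
from typing import NamedTuple, Optional


class CommandTopic(NamedTuple):
    """Hypothetical container for the parts of a command-related topic."""
    target: str                # entity identifier a/b/c/d, e.g. "device/main//"
    operation: str             # operation name, e.g. "restart"
    command_id: Optional[str]  # None for a capability (metadata) topic


def command_topic(target: str, operation: str,
                  command_id: Optional[str] = None) -> str:
    """Build a capability topic, or a command status topic if command_id is given."""
    topic = f"te/{target}/cmd/{operation}"
    return topic if command_id is None else f"{topic}/{command_id}"


def parse_command_topic(topic: str) -> Optional[CommandTopic]:
    """Split a topic into its parts; return None if it is not a command topic."""
    parts = topic.split("/")
    # Expected layout: te / a / b / c / d / cmd / operation [/ command-id]
    if len(parts) not in (7, 8) or parts[0] != "te" or parts[5] != "cmd":
        return None
    if not parts[6]:
        return None  # an operation name is required
    target = "/".join(parts[1:5])
    command_id = parts[7] if len(parts) == 8 else None
    return CommandTopic(target, parts[6], command_id)
```

With these helpers, `command_topic("device/main//", "restart")` yields the capability topic used in the `restart` example, and parsing `te/device/main///cmd/software_update/c8y-mapper-123` recovers the target, the operation and the command id. The empty path segments of the main device identifier are preserved because the topic is split on every `/`.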
diff --git a/docs/src/references/agent/index.md b/docs/src/references/agent/index.md
@@ -23,6 +23,8 @@ __Any service securely connected to the local MQTT bus can trigger commands as w
   It can run on the main device as well as child devices.
   It can be replaced with any other user-developed components that implement these device management APIs 
   addressing specific requirements or hardware.
+- Thin-edge also provides the tools to define, extend and combine *user-defined operation workflows*
+  that rule the sequence of steps applied when an *operation* is triggered by an operator or a software component.
 
 ```mermaid
 ---

diff --git a/docs/src/references/agent/operation-workflow.md b/docs/src/references/agent/operation-workflow.md
@@ -6,167 +6,15 @@ sidebar_position: 7
 
 # User-defined Operation Workflows
 
-Thin-edge provides the tools to define, extend and combine *operation workflows*
-that rule the sequence of steps applied when a maintenance *operation* is triggered by an operator or some software component,
-whether it is a *command* to restart the device, to update a configuration file or to install a new software.
-
-An operation workflow defines the possible sequences of actions for an operation request
-from its initialization up to its success or failure. It specifies the actions to perform
-as well as any prerequisite checks, outcome validations and possible rollbacks.
-However, a workflow doesn't define how to perform these actions.
-These are delegated to software components participating in the operation progress.
-More precisely, an operation workflow defines:
-- the *observable states* of an ongoing operation instance
-  from initialization up to a final success or failure
-- the *participants* and their interactions, passing the baton to the software component
-  whose responsibility is to advance the operation in a given state
-  and to notify the other participants what is the new resulting state
-- the *possible state sequences* so the system can detect any stale or misbehaving operation request.
-
-These workflows are extensible. An agent developer can:
-- override existing workflows by replacing the components responsible for certain steps with new ones
-- implement new components to handle the specificities of some action such as domain-specific checks
-- define new states and tell the system which software component will handle them: a script, a unix daemon, an external device
-- introduce new transitions such as rollbacks or conditional executions
-- create new workflows, combining other workflows and steps
-
-## Operations, Capabilities, and Commands
-
-From a user perspective an *operation* is a predefined sequence of actions
-that an operator can trigger on a device to reach some desirable state.
-It can be to restart the device or to install some new software.
-From an implementation perspective, an operation is an API identified by a well-known name such as `restart` or `software_update`.
-This API rules the coordination among the software components that need to interact to advance the operation.
-
-Not all entities and components of a thin-edge device support all the operations,
-and, even if they do, the implementations might be specific.
-Installing a software package on top of service makes no sense.
-Restarting the device is not the same as restarting one of its services.
-Each entity or component has to declare its *capabilities* i.e. the operations made available on this target.
-
-Strictly speaking, capabilities are not implemented nor declared by the devices and the services themselves.
-They are implemented by thin-edge services and plugins.
-These are the components which actually implement the operations interacting with the operating system and other software.
-For instance, device restart and software updates are implemented by the `tedge-agent`.
-
-Once an operation has been registered as a capability of some target entity or component,
-an operator can trigger operation requests a.k.a *commands*,
-for this kind of operation on this target,
-say to request a software update than a restart of the device.
-
-## MQTT Topics
-
-Operations, capabilities and commands are declared, triggered and managed using MQTT topics,
-all built along the same schema, matching the topic filter `te/+/+/+/+/cmd/+/+`,
-with a target prefix `te/+/+/+/+` and a command specific suffix `/cmd/+/+`:
-
-| root   | target           | command keyword | operation name | command instance id |
-|--------|------------------|-----------------|----------------|---------------------|
-| __te__ | /*a*/*b*/*c*/*d* | /__cmd__        | /*operation*   | /*command-id*       |
-
-The prefix __te__/*a*/*b*/*c*/*d* uniquely identifies the entity or component that is the target of commands.
-It can be:
-- the main device: `te/device/main//`
-- a child device: `te/device/child-xyz//`
-- a service: `te/device/main/service/tedge-agent`
-- or any application specific entity identifier such as `te/raspberry-pi/123/process/collectd`.
-
-The longer prefix __te__/*a*/*b*/*c*/*d*/__cmd__ groups all the capabilities and commands
-related to the entity identified by __te__/*a*/*b*/*c*/*d*.
-
-### Capabilities
-
-A capability, the ability for an entity __te__/*a*/*b*/*c*/*d* to handle a given *operation*, is published as a retained message
-on the topic __te__/*a*/*b*/*c*/*d*/__cmd__/*operation*, in which the suffix is the well-known name of the operation.
-
-One can subscribe to the following topic to get all the capabilities of a thin-edge device and its child-devices and services.
-
-```sh te2mqtt
-tedge mqtt sub 'te/+/+/+/+/cmd/+' 
-```
-
-The retained messages published on these topics are operation specific and defined by the operation APIs.
-They provide operation specific parameters such as the list of software package types that can be installed,
-or the list of file types that configured.
+An agent developer can define application specific *operation workflows*.
+Thin-edge `tedge-agent` provides the tools to:
 
-As an example, the `tedge-agent` which implements the `restart` and `software_update` capabilities for the main device,
-will emit two retained messages.
-
-A first message to tell that the main device can be restarted:
-
-```sh te2mqtt
-tedge mqtt pub -r 'te/device/main///cmd/restart' '{}' 
-```
-
-A second one to tell that debian packages can be installed on the main device: 
-
-```sh te2mqtt
-tedge mqtt pub -r 'te/device/main///cmd/software_update' '{ "type": ["apt"] }' 
-```
-
-### Commands
-
-The topics matching __te__/*a*/*b*/*c*/*d*/__cmd__/*operation*/*command-id* are used to trigger and manage commands,
-i.e. operation requests on a specific target for a specific *operation*.
-
-Each request is given a unique command identifier.
-Combined with the target identifier and the operation name this defines a request specific topic
-where the current state of the command workflow is published as a retained message.
-This unique id assigned by the requester, who is also responsible for creating the topic
-with an initial state and for finally removing it.
-
-As an example, software update is an operation that requires coordination between a mapper and `tedge-agent`.
-On reception of a software update request from the cloud operator,
-the `tedge-mapper` creates a fresh new topic for this command,
-say `te/device/main///cmd/software_update/c8y-mapper-123` for the 123<sup>rd</sup> request.
-On this topic, a first retained messages is published to describe the operator expectations for the software updates.
-
-```sh te2mqtt
-tedge mqtt pub -r 'te/device/main///cmd/software_update/c8y-mapper-123' '{
-    "status": "init",
-    "modules": [
-        {
-            "type": "apt",
-            "name": "collectd",
-            "version": "5.7",
-            "action": "install"
-        }
-    ]
-}' 
-```
-
-Then, the `tedge-agent` and possibly other software components take in charge the command,
-making it advance to some final state,
-publishing all the successive states as retained messages on the command topic.
-
-Eventually, the `tedge-mapper` will have to clean the command topic with an empty retained message: 
-
-```sh te2mqtt
-tedge mqtt pub -r 'te/device/main///cmd/software_update/c8y-mapper-123' ''
-```
-
-## MQTT-Driven Workflows
-
-Operations that require coordination among several software components are managed using *MQTT-driven workflows*.
-
-The core idea is to expose over MQTT the different states a specific operation request might go through;
-so independent sub-systems can observe the progress of the request and act accordingly to their role.
-
-- A specific topic is attached to each command under-execution.
-  - This topic is specific to the target of the command, the requested operation and the request instance.
-  - e.g. `te/device/child-xyz///cmd/configuration-update/req-123`
-- The messages published over this topic represent the current state of the command.
-  - Each message indicates at which step of its progression the command is and gives all the required information to proceed.
-  - e.g. `{ "status": "Requested", "target": "mosquitto", "url": "https://..." }`
-- The state messages are published as retained.
-  - They capture the latest state of the operation request.
-  - Till some change occurs, this latest state is dispatched to any participant on reconnect.
-- Several participants act in concert to move forward the command execution.
-  - The participants observe the progress of all the operations they are interested in.
-  - They watch for the specific states they are responsible in moving forward.
-  - When a step is performed, successfully or not, the new state is published accordingly by the performer.
+- override existing workflows
+- define new states and actions such as pre-requisite or post-execution checks 
+- introduce new transitions such as rollbacks or conditional executions
+- create new workflows, combining workflows and steps
 
-### Example
+## Example
 
 Here is an example where three software components participate in a `configuration-update` command.
 - The `tedge-mapper` creates the initial state of the command
@@ -241,7 +89,7 @@ are published on an MQTT topic which prefix is the entity identifier.
 - A workflow can be extended differently for each target.
   As an example, an agent developer can define an extra rollback state on the main device but not on the child devices.
 
-### Operation API
+## Operation API
 
 As several software components have to collaborate when executing a command, each operation must define a specific API.
 This API should be based on the principles of MQTT-driven workflow and defines: