Reorganise operation workflow documentation
- Basic concepts are moved under the operation API introduction
- The workflow specification is focused on user-specific workflow
  definitions and starts with an example.

Signed-off-by: Didier Wenzek <[email protected]>
didier-wenzek committed Nov 22, 2023
1 parent b6d00c9 commit 1920190
Showing 3 changed files with 117 additions and 167 deletions.
114 changes: 107 additions & 7 deletions docs/src/references/agent/device-management-api.md
@@ -22,6 +22,52 @@ However, despite their diversity, all these APIs are designed along the same lin
- create new command requests of a specific type for some target device
- monitor the progression of a specific command request up to completion.

## Concepts

### Operations, Capabilities, and Commands

From a user perspective, an *operation* is a predefined sequence of actions
that an operator can trigger on a device to reach some desirable state.
It can be to restart the device or to install some new software.
From an implementation perspective, an operation is an API identified by a well-known name such as `restart` or `software_update`.
This API rules the coordination among the software components that need to interact to advance the operation.

Not all entities and components of a thin-edge device support all the operations,
and, even if they do, the implementations might be specific.
Installing a software package on top of a service makes no sense.
Restarting the device is not the same as restarting one of its services.
Each entity or component has to declare its *capabilities*, i.e. the operations made available on this target.

Strictly speaking, capabilities are neither implemented nor declared by the devices and the services themselves.
They are implemented by thin-edge services and plugins.
These are the components which actually implement the operations, interacting with the operating system and other software.
For instance, device restart and software updates are implemented by the `tedge-agent`.

Once an operation has been registered as a capability of some target entity or component,
an operator can trigger operation requests, a.k.a. *commands*,
for this kind of operation on this target,
say to request a software update, then a restart of the device.

### MQTT-Driven Workflows

The core idea is to expose over MQTT the different states a specific operation request might go through;
so independent sub-systems can observe the progress of the request
and participate as per their role, when it is their turn.

- A specific topic is attached to each command under execution.
  - This topic is specific to the target of the command, the requested operation and the request instance.
  - e.g. `te/device/child-xyz///cmd/configuration-update/req-123`
- The messages published over this topic represent the current state of the command.
  - Each message indicates at which step of its progression the command is and gives all the required information to proceed.
  - e.g. `{ "status": "init", "target": "mosquitto", "url": "https://..." }`
- The state messages are published as retained.
  - They capture the latest state of the operation request.
  - Till some change occurs, this latest state is dispatched to any participant on reconnect.
- Several participants act in concert to move the command execution forward.
  - The participants observe the progress of all the operations they are interested in.
  - They watch for the specific states they are responsible for moving forward.
  - When a step is performed, successfully or not, the new state is published accordingly by the performer.
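
As an illustration of how a participant could tell which target, operation and request a state message belongs to, here is a minimal sketch that splits a command topic into its parts. The function name `parse_command_topic` is hypothetical and not part of thin-edge; it only assumes the topic layout described above.

```sh
# Hypothetical helper (not part of thin-edge): split a command topic
# te/<a>/<b>/<c>/<d>/cmd/<operation>/<command-id> into its three parts.
parse_command_topic() {
    topic=$1
    cmd_id=${topic##*/}        # last segment: the request instance
    rest=${topic%/*}
    operation=${rest##*/}      # segment before it: the operation name
    rest=${rest%/*}            # what remains must be te/<entity>/cmd
    case $rest in
        te/*/cmd) ;;
        *) echo "not a command topic: $topic" >&2; return 1 ;;
    esac
    entity=${rest%/cmd}
    entity=${entity#te/}       # keep only the <a>/<b>/<c>/<d> identifier
    printf '%s %s %s\n' "$entity" "$operation" "$cmd_id"
}
```

Called with the example topic `te/device/child-xyz///cmd/configuration-update/req-123`, the helper prints the entity identifier `device/child-xyz//`, the operation `configuration-update` and the request id `req-123` on one line. A capability topic, which has no request id, is rejected.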

## Topics

Following [thin-edge MQTT topic conventions](../mqtt-api.md#commands),
@@ -31,8 +77,10 @@ and specific sub-topics for the requests.

### Command metadata topics

The __command metadata topics__ are used to declare which commands are available for a device,
and, if so, to which extent.
The command metadata topics are used to declare the *capabilities* of a device.

The ability for an entity *a*/*b*/*c*/*d* to handle a given *operation* is published as a retained message
on the topic __te__/*a*/*b*/*c*/*d*/__cmd__/*operation*.
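
The topic scheme can be sketched with two small shell helpers. The names `capability_topic` and `declare_capability`, as well as the `TEDGE` override variable, are assumptions made for this example; only the `tedge mqtt pub -r` invocation comes from the documentation itself.

```sh
# Hypothetical helpers (not part of thin-edge).
# Build the capability topic for an entity id such as "device/main//".
capability_topic() {
    printf 'te/%s/cmd/%s\n' "$1" "$2"
}

# Publish a retained capability message.
# $1: entity id, $2: operation name, $3: optional JSON payload (defaults to {}).
# Setting TEDGE=echo prints the command instead of contacting the broker.
declare_capability() {
    payload=${3:-}
    [ -n "$payload" ] || payload='{}'
    ${TEDGE:-tedge} mqtt pub -r "$(capability_topic "$1" "$2")" "$payload"
}
```

For instance, `declare_capability device/main// restart` would publish `{}` as a retained message on `te/device/main///cmd/restart`.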

```mermaid
graph LR
@@ -60,8 +108,13 @@ Where the groups are described as follows:
| `cmd` | The [command channel](../mqtt-api.md/#channel-identifier) grouping all of the commands for this target. |
| command_type | The type name of the operation. |

A service that implements an operation for a device publishes on start a message notifying on the MQTT Bus
A service that implements an operation for a device publishes, on start, a capability message notifying
that this device can be sent commands of this type.
As an example, the `tedge-agent` which implements the `restart` operation emits on start a capability message for that operation:

```sh te2mqtt
tedge mqtt pub -r 'te/device/main///cmd/restart' '{}'
```

These messages are published with the retained flag set. So, a client process, such as a mapper, can discover on start
__all the capabilities of all the devices__:
@@ -72,9 +125,11 @@ tedge mqtt sub 'te/+/+/+/+/cmd/+'

### Command status topics

The actual command requests are published on the __command status topics__.
For each request, a specific command topic is created to monitor the progress of the command from its initial state to its completion.
These topics are named using a unique command identifier forged by the requester.
The actual command requests are published on the command status topics.

Each request is given a unique *command identifier*
and the topic __te__/*a*/*b*/*c*/*d*/__cmd__/*operation*/*command-identifier*
is used to trigger and monitor this request for a given *operation* on a target entity *a*/*b*/*c*/*d*.
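
A minimal sketch of how a requester might build such a topic is given below. The function name `command_topic` is hypothetical. The check on the identifier reflects a general MQTT constraint rather than anything stated here: a `/` would add a topic level, and `+` and `#` are wildcards that are not allowed in topic names.

```sh
# Hypothetical helper (not part of thin-edge).
# $1: entity id (e.g. "device/main//"), $2: operation, $3: command identifier.
command_topic() {
    case $3 in
        ''|*[/+#]*)
            # Reject empty ids and ids that would change the topic structure.
            echo "invalid command identifier: '$3'" >&2
            return 1 ;;
    esac
    printf 'te/%s/cmd/%s/%s\n' "$1" "$2" "$3"
}
```

With this helper, `command_topic device/main// software_update c8y-mapper-123` yields `te/device/main///cmd/software_update/c8y-mapper-123`.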

```mermaid
graph LR
@@ -99,7 +154,9 @@ graph LR

:::note
The `command_id` is an arbitrary string; however, it should be unique.
It is recommended to either use a unique id generator, or add a unix timestamp as a suffix, e.g. date +%s
It is recommended to either use a unique id generator, or add a unix timestamp as a suffix, e.g. `date +%s`.
This unique id is assigned by the requester, who is also responsible for creating the topic
with an initial state and for finally removing it.
:::

Each message published on these topics represents the current status of a running command.
@@ -109,6 +166,36 @@ So, one can list __all the in-progress commands of any type across all the devic
tedge mqtt sub 'te/+/+/+/+/cmd/+/+'
```
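
The recommendation to suffix command identifiers with a unix timestamp could look like the following sketch. The helper name `new_command_id` and the `c8y-mapper` prefix are only illustrative.

```sh
# Hypothetical helper (not part of thin-edge): build a command identifier
# from a requester-specific prefix and the current unix timestamp.
new_command_id() {
    printf '%s-%s\n' "$1" "$(date +%s)"
}
```

For example, `new_command_id c8y-mapper` returns the prefix followed by a dash and the current timestamp. Since `date +%s` has a resolution of one second, a requester that may issue several commands within the same second would need an additional counter or a proper unique id generator.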

As an example, software update is an operation that requires coordination between a mapper and `tedge-agent`.
On reception of a software update request from the cloud operator,
the `tedge-mapper` creates a fresh new topic for this command,
say `te/device/main///cmd/software_update/c8y-mapper-123` with a unique command id: `c8y-mapper-123`.
On this topic, a first retained message is published to describe the operator's expectations for the software updates.

```sh te2mqtt
tedge mqtt pub -r 'te/device/main///cmd/software_update/c8y-mapper-123' '{
  "status": "init",
  "modules": [
    {
      "type": "apt",
      "name": "collectd",
      "version": "5.7",
      "action": "install"
    }
  ]
}'
```

Then, the `tedge-agent` and possibly other software components take charge of the command,
making it advance to some final state,
publishing all the successive states as retained messages on the command topic.

Eventually, the `tedge-mapper` will have to clean the command topic with an empty retained message:

```sh te2mqtt
tedge mqtt pub -r 'te/device/main///cmd/software_update/c8y-mapper-123' ''
```
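
The clean-up step can be sketched as a requester-side helper that clears the topic only once the command has reached a final state. This is a sketch under stated assumptions: the function names and the `TEDGE` override are hypothetical, `successful` and `failed` are assumed names for the final states, and the `sed` extraction is a naive stand-in for a real JSON parser.

```sh
# Hypothetical helpers (not part of thin-edge).
# Extract the "status" field of a JSON state message (naive text matching).
command_status() {
    printf '%s\n' "$1" |
        sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
        head -n 1
}

# $1: command topic, $2: latest state message received on that topic.
# Clears the retained message when the state is final; returns 1 otherwise.
# "successful" and "failed" are assumed status names.
# Setting TEDGE=echo prints the command instead of contacting the broker.
clear_if_final() {
    case $(command_status "$2") in
        successful|failed) ${TEDGE:-tedge} mqtt pub -r "$1" '' ;;
        *) return 1 ;;
    esac
}
```

A requester would call `clear_if_final` for each message received on the command topic, leaving the retained state in place while the command is still in progress.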

## Message payloads

The message payloads are all specific to each operation type.
@@ -142,6 +229,19 @@ triggering a health message response published on the `status/health` channel of

## Operation workflow

An operation workflow defines the possible sequences of actions for an operation request
from its initialization up to its success or failure. It specifies the actions to perform
as well as any prerequisite checks, outcome validations and possible rollbacks.
However, a workflow doesn't define how to perform these actions.
These are delegated to thin-edge services, scripts, application-specific services or other devices.
More precisely, an operation workflow defines:
- the *observable states* of an ongoing operation instance
from initialization up to a final success or failure
- the *participants* and their interactions, passing the baton to the software component
whose responsibility is to advance the operation in a given state
and to notify the other participants of the new resulting state
- the *possible state sequences* so that the system can detect any stale or misbehaving operation request.
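
To make the idea of *possible state sequences* concrete, here is a minimal sketch of a transition check. The state names and the allowed transitions are assumptions chosen for illustration, as is the function name; an actual workflow defines its own.

```sh
# Hypothetical helper (not part of thin-edge).
# Succeeds when moving from state $1 to state $2 is allowed.
# The states and transitions below are illustrative assumptions.
is_valid_transition() {
    case "$1>$2" in
        'init>scheduled'|'scheduled>executing') return 0 ;;
        'executing>successful'|'executing>failed') return 0 ;;
        *) return 1 ;;
    esac
}
```

An observer applying such a check to successive state messages can flag a request that jumps to an unexpected state, for example from `successful` back to `executing`.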

A specific workflow rules each operation type, with specific:
- states
- message payloads
2 changes: 2 additions & 0 deletions docs/src/references/agent/index.md
@@ -23,6 +23,8 @@ __Any service securely connected to the local MQTT bus can trigger commands as w
It can run on the main device as well as child devices.
It can be replaced with any other user-developed components that implement these device management APIs
addressing specific requirements or hardware.
- Thin-edge also provides the tools to define, extend and combine *user-defined operation workflows*
that rule the sequence of steps applied when an *operation* is triggered by an operator or a software component.

```mermaid
---
168 changes: 8 additions & 160 deletions docs/src/references/agent/operation-workflow.md
@@ -6,167 +6,15 @@ sidebar_position: 7

# User-defined Operation Workflows

Thin-edge provides the tools to define, extend and combine *operation workflows*
that rule the sequence of steps applied when a maintenance *operation* is triggered by an operator or some software component,
whether it is a *command* to restart the device, to update a configuration file or to install new software.

An operation workflow defines the possible sequences of actions for an operation request
from its initialization up to its success or failure. It specifies the actions to perform
as well as any prerequisite checks, outcome validations and possible rollbacks.
However, a workflow doesn't define how to perform these actions.
These are delegated to software components participating in the operation progress.
More precisely, an operation workflow defines:
- the *observable states* of an ongoing operation instance
from initialization up to a final success or failure
- the *participants* and their interactions, passing the baton to the software component
whose responsibility is to advance the operation in a given state
and to notify the other participants what is the new resulting state
- the *possible state sequences* so the system can detect any stale or misbehaving operation request.

These workflows are extensible. An agent developer can:
- override existing workflows by replacing the components responsible for certain steps with new ones
- implement new components to handle the specificities of some action such as domain-specific checks
- define new states and tell the system which software component will handle them: a script, a unix daemon, an external device
- introduce new transitions such as rollbacks or conditional executions
- create new workflows, combining other workflows and steps

## Operations, Capabilities, and Commands

From a user perspective an *operation* is a predefined sequence of actions
that an operator can trigger on a device to reach some desirable state.
It can be to restart the device or to install some new software.
From an implementation perspective, an operation is an API identified by a well-known name such as `restart` or `software_update`.
This API rules the coordination among the software components that need to interact to advance the operation.

Not all entities and components of a thin-edge device support all the operations,
and, even if they do, the implementations might be specific.
Installing a software package on top of a service makes no sense.
Restarting the device is not the same as restarting one of its services.
Each entity or component has to declare its *capabilities* i.e. the operations made available on this target.

Strictly speaking, capabilities are not implemented nor declared by the devices and the services themselves.
They are implemented by thin-edge services and plugins.
These are the components which actually implement the operations interacting with the operating system and other software.
For instance, device restart and software updates are implemented by the `tedge-agent`.

Once an operation has been registered as a capability of some target entity or component,
an operator can trigger operation requests a.k.a *commands*,
for this kind of operation on this target,
say to request a software update, then a restart of the device.

## MQTT Topics

Operations, capabilities and commands are declared, triggered and managed using MQTT topics,
all built along the same schema, matching the topic filter `te/+/+/+/+/cmd/+/+`,
with a target prefix `te/+/+/+/+` and a command specific suffix `/cmd/+/+`:

| root | target | command keyword | operation name | command instance id |
|--------|------------------|-----------------|----------------|---------------------|
| __te__ | /*a*/*b*/*c*/*d* | /__cmd__ | /*operation* | /*command-id* |

The prefix __te__/*a*/*b*/*c*/*d* uniquely identifies the entity or component that is the target of commands.
It can be:
- the main device: `te/device/main//`
- a child device: `te/device/child-xyz//`
- a service: `te/device/main/service/tedge-agent`
- or any application specific entity identifier such as `te/raspberry-pi/123/process/collectd`.

The longer prefix __te__/*a*/*b*/*c*/*d*/__cmd__ groups all the capabilities and commands
related to the entity identified by __te__/*a*/*b*/*c*/*d*.

### Capabilities

A capability, the ability for an entity __te__/*a*/*b*/*c*/*d* to handle a given *operation*, is published as a retained message
on the topic __te__/*a*/*b*/*c*/*d*/__cmd__/*operation*, in which the suffix is the well-known name of the operation.

One can subscribe to the following topic to get all the capabilities of a thin-edge device and its child-devices and services.

```sh te2mqtt
tedge mqtt sub 'te/+/+/+/+/cmd/+'
```

The retained messages published on these topics are operation specific and defined by the operation APIs.
They provide operation specific parameters such as the list of software package types that can be installed,
or the list of file types that can be configured.
An agent developer can define application specific *operation workflows*.
Thin-edge `tedge-agent` provides the tools to:

As an example, the `tedge-agent` which implements the `restart` and `software_update` capabilities for the main device,
will emit two retained messages.

A first message to tell that the main device can be restarted:

```sh te2mqtt
tedge mqtt pub -r 'te/device/main///cmd/restart' '{}'
```

A second one to tell that debian packages can be installed on the main device:

```sh te2mqtt
tedge mqtt pub -r 'te/device/main///cmd/software_update' '{ "type": ["apt"] }'
```

### Commands

The topics matching __te__/*a*/*b*/*c*/*d*/__cmd__/*operation*/*command-id* are used to trigger and manage commands,
i.e. operation requests on a specific target for a specific *operation*.

Each request is given a unique command identifier.
Combined with the target identifier and the operation name this defines a request specific topic
where the current state of the command workflow is published as a retained message.
This unique id is assigned by the requester, who is also responsible for creating the topic
with an initial state and for finally removing it.

As an example, software update is an operation that requires coordination between a mapper and `tedge-agent`.
On reception of a software update request from the cloud operator,
the `tedge-mapper` creates a fresh new topic for this command,
say `te/device/main///cmd/software_update/c8y-mapper-123` for the 123<sup>rd</sup> request.
On this topic, a first retained message is published to describe the operator's expectations for the software updates.

```sh te2mqtt
tedge mqtt pub -r 'te/device/main///cmd/software_update/c8y-mapper-123' '{
"status": "init",
  "modules": [
    {
      "type": "apt",
      "name": "collectd",
      "version": "5.7",
      "action": "install"
    }
  ]
}'
```

Then, the `tedge-agent` and possibly other software components take charge of the command,
making it advance to some final state,
publishing all the successive states as retained messages on the command topic.

Eventually, the `tedge-mapper` will have to clean the command topic with an empty retained message:

```sh te2mqtt
tedge mqtt pub -r 'te/device/main///cmd/software_update/c8y-mapper-123' ''
```

## MQTT-Driven Workflows

Operations that require coordination among several software components are managed using *MQTT-driven workflows*.

The core idea is to expose over MQTT the different states a specific operation request might go through;
so independent sub-systems can observe the progress of the request and act according to their role.

- A specific topic is attached to each command under execution.
  - This topic is specific to the target of the command, the requested operation and the request instance.
  - e.g. `te/device/child-xyz///cmd/configuration-update/req-123`
- The messages published over this topic represent the current state of the command.
  - Each message indicates at which step of its progression the command is and gives all the required information to proceed.
  - e.g. `{ "status": "Requested", "target": "mosquitto", "url": "https://..." }`
- The state messages are published as retained.
  - They capture the latest state of the operation request.
  - Till some change occurs, this latest state is dispatched to any participant on reconnect.
- Several participants act in concert to move forward the command execution.
  - The participants observe the progress of all the operations they are interested in.
  - They watch for the specific states they are responsible for moving forward.
  - When a step is performed, successfully or not, the new state is published accordingly by the performer.
- override existing workflows
- define new states and actions such as pre-requisite or post-execution checks
- introduce new transitions such as rollbacks or conditional executions
- create new workflows, combining workflows and steps

### Example
## Example

Here is an example where three software components participate in a `configuration-update` command.
- The `tedge-mapper` creates the initial state of the command
@@ -241,7 +89,7 @@ are published on an MQTT topic which prefix is the entity identifier.
- A workflow can be extended differently for each target.
As an example, an agent developer can define an extra rollback state on the main device but not on the child devices.

### Operation API
## Operation API

As several software components have to collaborate when executing a command, each operation must define a specific API.
This API should be based on the principles of MQTT-driven workflow and defines: