Skip to content
This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

Add state and description codes: Part 2 #305

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

0xB10C
Copy link
Contributor

@0xB10C 0xB10C commented Dec 6, 2019

This PR adds the handling of the state and description codes by the Middleware and Supervisor. This is based on the groundwork done in #300.


Requirements:

  • The BitBoxBase should provide a global state code (IDLE, WORKING, WARNING, ERROR) to the user interfaces (BitBoxApp and HSM status LED)
  • The BitBoxBase should provide a more detailed description code for each subsystem (Base, Middleware, Bitcoin Core, c-lightning, ...), if something happens that the user should know.
  • In a first iteration only the description code with the highest priority should be displayed by the App and the HSM. The Supervisor handles the prioritization
  • Both codes should be stored in Redis.
  • The Supervisor should be the only one writing the codes to Redis when a subsystem state changes
  • The Middleware will provide the codes to the HSM and the App

Definition

table with state codes

table with description codes

(taken from this private Google Doc)

Approach

The description codes are implemented using a Redis Sorted set. A Redis Sorted set has members in the form of [score] [value]. The set is sorted by the [score]. In the implementation the [score] is the priority of a subsystem state and the [value] is the code of a subsystem state. A subsystem state is active if it's a member of the sorted set. The state is inactive when it's not in the sorted set. The highest priority is at the top of the sorted set, the lowest priority at the bottom of the sorted set. The Middleware queries the highest priority subsystem state and relays it to the HSM and App.

Special cases:

  • The description code with a code of 0 and a priority of 0 is always active. If it's the only member in the set, then there is no subsystem state that requires the users attention.
  • If Redis is not reachable the Middleware sends description code 4 to the HSM and the App. This is meant as a fallback, but should never appear as an actual member in Redis.

Implementation

The Supervisor activates and deactivates description codes / subsystem states as soon as he learns that the state changes (currently by watching logs and Prometheus values). When a new state is set the Supervisor queries the highest priority description code (top element from the sorted set) and sets the state code accordingly. The Supervisor then sends notification via IPC to the Middleware that the subsystem state might have changed. The Middleware lets the HSM know about the new state.

For the subsystem states DOWNLOADING_UPDATE, UPDATE_FAILED, REBOOT and SHUTDOWN the Middleware logs distinct log-tags that let the Supervisors log-watcher know
that the subsystem state has changed. The INITIAL_BLOCK_SYNC subsystem state is set based on the Prometheus value bitcoin_ibd which is watched by the Prometheus-watcher of the Supervisor.

TODOs out of scope for this PR

This readds the Redis keys for the Base state which where mistakenly
removed from 71c173f in a rebase.

The descriptionCode key references a Redis sorted-set containing
the currently active subsystem states. The stateCode references a
Redis string storing the (integer) value of the currently active
system state (one of IDLE;WORKING;WARNING;ERROR).
This adds the handling of the following system states to the
Supervisor:
- DOWNLOAD_UPDATE
- UPDATE_FAILED
- REBOOT
- SHUTDOWN

The Middleware logs a destinct logtag once one of these states become
active. The Supervisor watches the Middleware log and handles the
state changes by setting the respective `descriptionCodes` in the
sorted set in Redis and the respective stateCode. The Middleware is
notified that there was a state change in Redis via the named pipe
IPC, but doesn't handle the notification yet.
This adds the handling of the `INITIAL_BLOCK_SYNC` (Bitcoin Core IDB)
state to the Supervisor.

The Supervisor activates the subsystem state when Bitcoin Core switches
to IBD and deactivates the state once Bitcoin Core finishes IBD.
This adds an extra HSM heartbeat that is send when a IPC notification
with the topic "systemstate-changed" arrives over the named pipe.
@0xB10C
Copy link
Contributor Author

0xB10C commented Dec 6, 2019

Added some needed logging to the Supervisor triggers. Reviewing should be easier now.

@Stadicus
Copy link
Collaborator

Stadicus commented Dec 7, 2019

After updating a running BitBoxBase, the Supervisor logs the following over and over:

Dec 07 15:08:39 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:08:39 Setting bitcoin_ibd since no prior state exists.
Dec 07 15:08:39 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:08:39 setBBBConfigValue: command '/usr/local/sbin/bbb-config.sh set bitcoin_ibd false' executed.
Dec 07 15:08:39 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:08:39 Deactivating the sub-system state "INITIAL_BLOCK_SYNC"
Dec 07 15:08:39 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:08:39 Recovered: could not trigger prometheusBitcoindIBD: Handling trigger "prometheusBitcoindIBD": Initial set. Could not deactivate subsystem state "INITIAL_BLOCK_SYNC": could not get the top element from "base:descriptionCode": expected exactly one element, but got 0

Both the Grafana dashboard and the Bitcoin scraper show bitcoin_ibd as 0

When bitcoind enters Initial Block Download mode, the state change seems to work fine:

Dec 07 15:13:09 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:13:09 Setting bitcoin_ibd since no prior state exists.
Dec 07 15:13:16 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:13:16 setBBBConfigValue: command '/usr/local/sbin/bbb-config.sh set bitcoin_ibd true' executed.
Dec 07 15:13:16 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:13:16 Activating the sub-system state "INITIAL_BLOCK_SYNC"
Dec 07 15:13:16 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:13:16 Notifying the Middleware that a sub-system state changed

Switching back to IBD == false, the error logging continues, with a new entry every 10 seconds:

Dec 07 15:18:09 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:09 IBD finished: unsetting bitcoind_idb.
Dec 07 15:18:24 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:24 setBBBConfigValue: command '/usr/local/sbin/bbb-config.sh set bitcoin_ibd false' executed.
Dec 07 15:18:24 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:24 Deactivating the sub-system state "INITIAL_BLOCK_SYNC"
Dec 07 15:18:24 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:24 Recovered: could not trigger prometheusBitcoindIBD: Handling trigger "prometheusBitcoindIBD": Could not deactivate subsystem state "INITIAL_BLOCK_SYNC": could not get the top element from "base:descriptionCode": expected exactly one element, but got 0
Dec 07 15:18:24 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:24 IBD finished: unsetting bitcoind_idb.
Dec 07 15:18:24 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:24 setBBBConfigValue: command '/usr/local/sbin/bbb-config.sh set bitcoin_ibd false' executed.
Dec 07 15:18:24 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:24 Deactivating the sub-system state "INITIAL_BLOCK_SYNC"
Dec 07 15:18:24 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:24 Recovered: could not trigger prometheusBitcoindIBD: Handling trigger "prometheusBitcoindIBD": Could not deactivate subsystem state "INITIAL_BLOCK_SYNC": could not get the top element from "base:descriptionCode": expected exactly one element, but got 0
Dec 07 15:18:34 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:34 IBD finished: unsetting bitcoind_idb.
Dec 07 15:18:34 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:34 setBBBConfigValue: command '/usr/local/sbin/bbb-config.sh set bitcoin_ibd false' executed.
Dec 07 15:18:34 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:34 Deactivating the sub-system state "INITIAL_BLOCK_SYNC"
Dec 07 15:18:34 bitbox-base bbbsupervisor[15882]: 2019/12/07 15:18:34 Recovered: could not trigger prometheusBitcoindIBD: Handling trigger "prometheusBitcoindIBD": Could not deactivate subsystem state "INITIAL_BLOCK_SYNC": could not get the top element from "base:descriptionCode": expected exactly one element, but got 0

@Stadicus
Copy link
Collaborator

Stadicus commented Dec 7, 2019

The following notes are out of scope for this first iteration, but @0xB10C's thoughts on how to implement would be interesting.

The subsystem states DOWNLOADING_UPDATE and UPDATE_FAILED are coming from the Middleware, but I think Mender should be a separate subsystem. Currently, these state changes are only recognized if triggered from the Middleware.

An update can also be triggered from the command line from a USB drive or the maintenance menu (e.g. for a factory reset). It would be better if the bbb-cmd.sh and the systemd-update-checks.sh would notify the Superivsor that an update has been started, or that it failed, e.g. through the standard systemd logs.

The same goes for REBOOT and SHUTDOWN. When powering down the device with the physical power button, or from the command line, the state should be set correctly, e.g. using a system unit that is executed on shutdown (and the HSM then showing "safe to unplug" or similar).

@Stadicus
Copy link
Collaborator

Stadicus commented Dec 9, 2019

FYI, when I merge this PR on top of all other outstanding PRs, I get the following error message, so I guess it needs some minor adjustments after rebasing it:

src/middleware.go:271:21: middleware.hsmHeartbeat undefined (type *Middleware has no field or method hsmHeartbeat)

@0xB10C
Copy link
Contributor Author

0xB10C commented Dec 9, 2019

Could not deactivate subsystem state "INITIAL_BLOCK_SYNC": could not get the top element from "base:descriptionCode": expected exactly one element, but got 0

This error usually means that there is no "idle" state in the sorted-set. Did you test on an recent (including the changes from #300) image? Otherwise you'd have to add the the Redis keys manually. See 71c173f.

@0xB10C
Copy link
Contributor Author

0xB10C commented Dec 9, 2019

An update can also be triggered from the command line from a USB drive or the maintenance menu (e.g. for a factory reset). It would be better if the bbb-cmd.sh and the systemd-update-checks.sh would notify the Superivsor that an update has been started, or that it failed, e.g. through the standard systemd logs.

The same goes for REBOOT and SHUTDOWN. When powering down the device with the physical power button, or from the command line, the state should be set correctly, e.g. using a system unit that is executed on shutdown (and the HSM then showing "safe to unplug" or similar)

Having the Supervisor watch a systemd log with journald doesnt seem to be a good fit for this. Using a named pipe to notify the Supervisor on triggers should work. The ipcnotificaton package in the Middleware could be used for a Supervisor specific pipe. The shell scripts could emit a notification, which would trigger the Supervisor.

@Stadicus
Copy link
Collaborator

Could not deactivate subsystem state "INITIAL_BLOCK_SYNC": could not get the top element from "base:descriptionCode": expected exactly one element, but got 0

This error usually means that there is no "idle" state in the sorted-set. Did you test on an recent (including the changes from #300) image? Otherwise you'd have to add the the Redis keys manually. See 71c173f.

Seems the key base:descriptionCode is not added during the build process, despite being present in the factorysettings.txt.

$ redis-cli ZREVRANGE base:descriptionCode 0 0
(empty list or set)

After adding the key/value manually with ZADD base:descriptionCode 0 0, this error is no longer logged. I will take a look at the build process to figure this out.

Copy link
Contributor

@TheCharlatan TheCharlatan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed middleware part.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants