Lars Kuhtz edited this page Aug 13, 2019 · 2 revisions

Failure handling during initialization

Components of a chainweb-node should check during initialization for failure conditions that would prevent the node from performing its task. If a node detects such a condition, it should:

  1. try to fix the issue, and if that isn't possible,
  2. emit an error log that describes the problem and, where possible, provides hints on how to resolve the issue, and
  3. throw an exception, which will cause the node to terminate.
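
The three steps above can be sketched as a small helper. This is only an illustration under stated assumptions: InitFailure and checkOrFail are hypothetical names that are not taken from the chainweb-node code base, and writing to stderr stands in for the node's logging system.

```haskell
import Control.Exception (Exception, throwIO)
import System.IO (hPutStrLn, stderr)

-- | Hypothetical exception type for fatal initialization failures.
-- (An assumption for this sketch, not a type from chainweb-node.)
newtype InitFailure = InitFailure String
    deriving (Show)

instance Exception InitFailure

-- | Run an initialization check for a component. If the check fails,
-- attempt a repair and check once more before giving up.
checkOrFail
    :: String                 -- ^ component name, used in messages
    -> IO (Either String a)   -- ^ check; 'Left' describes the problem
    -> IO ()                  -- ^ best-effort repair action
    -> IO a
checkOrFail name check repair = do
    firstTry <- check
    case firstTry of
        Right a -> pure a
        Left _ -> do
            -- 1. Try to fix the issue, then run the check again.
            repair
            secondTry <- check
            case secondTry of
                Right a -> pure a
                Left problem -> do
                    let msg = name ++ ": " ++ problem
                    -- 2. Emit an error log. A real node would use its
                    --    logging system; stderr is a stand-in here.
                    hPutStrLn stderr ("[Error] " ++ msg)
                    -- 3. Throw. If nothing catches the exception, the
                    --    process terminates.
                    throwIO (InitFailure msg)
```

A component would call this once per failure condition during startup, passing its own check and repair actions.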

The exception message will show up on stderr of the node. On systems with systemd, the exception message is recorded in the journal. On the testnet nodes, the journal can be viewed with journalctl -b -u chainweb.

In addition, any failure in the logging system itself, and any log message emitted before the logging system is initialized, is written to stderr and shows up in the journal.

Failure handling during operations

Once the node is initialized and the API servers and P2P clients are started, components should try hard to avoid failing. Components should:

  1. catch all synchronous exceptions,
  2. emit an error log or warning log that describes the problem, including possible actions that must be taken to address the issue,
  3. restart the component, subject to backoff or throttling logic as needed.

Most components do this by being wrapped in runForever or runForeverThrottled from Chainweb.Utils.
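
The following is a minimal sketch of such a restart loop, not the actual implementation of runForever or runForeverThrottled. The name superviseForever, its parameters, and the exponential backoff policy are assumptions made for illustration. Asynchronous exceptions are detected by type (SomeAsyncException) and rethrown, so that killing the thread still stops the loop.

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception
    (SomeAsyncException, fromException, throwIO, try)
import Data.Maybe (isJust)

-- | Hypothetical supervisor: run a component, and restart it whenever
-- it fails with a synchronous exception or returns.
superviseForever
    :: (String -> IO ())  -- ^ logging callback (assumed interface)
    -> String             -- ^ component name, used in log messages
    -> Int                -- ^ initial restart delay in microseconds
    -> IO ()              -- ^ the component
    -> IO a
superviseForever logg name initialDelay component = go initialDelay
  where
    -- Upper bound for the restart delay (an arbitrary choice).
    maxDelay = 60 * 1000000

    go delay = do
        r <- try component
        case r of
            Left e
                -- Do not swallow asynchronous exceptions such as
                -- ThreadKilled or a timeout: rethrow them.
                | isJust (fromException e :: Maybe SomeAsyncException) ->
                    throwIO e
                | otherwise ->
                    logg (name ++ " failed: " ++ show e)
            Right () ->
                logg (name ++ " exited unexpectedly")
        -- Back off before restarting. The delay doubles on every
        -- restart, up to maxDelay. This sketch never resets it.
        logg ("restarting " ++ name ++ " in " ++ show delay ++ "us")
        threadDelay delay
        go (min maxDelay (2 * delay))
```

A production version would probably reset the delay after the component has run successfully for some time; that is omitted here to keep the sketch short.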

Components must not catch asynchronous exceptions that don't originate from the component itself. The functions catchSynchronous and catchAllSynchronous (and their variants) from Chainweb.Utils can be used to catch synchronous exceptions while letting asynchronous exceptions propagate.
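
To illustrate the distinction, here is a sketch using only the base library. The helpers isAsync, trySync, and catchSync are hypothetical and are not the functions from Chainweb.Utils. The sketch also assumes that asynchronous exceptions can be recognized by type, that is, by being wrapped in SomeAsyncException. That is a common heuristic, but it cannot tell whether an exception originated inside or outside the component.

```haskell
import Control.Exception
    (SomeAsyncException, SomeException, fromException, throwIO, try)
import Data.Maybe (isJust)

-- | True if the exception is wrapped in 'SomeAsyncException', as is the
-- case for ThreadKilled, UserInterrupt, and timeouts.
isAsync :: SomeException -> Bool
isAsync e = isJust (fromException e :: Maybe SomeAsyncException)

-- | Like 'try' for all exceptions, except that asynchronous exceptions
-- are rethrown instead of being returned.
trySync :: IO a -> IO (Either SomeException a)
trySync action = do
    r <- try action
    case r of
        Left e | isAsync e -> throwIO e
        _ -> pure r

-- | Run a handler for synchronous exceptions only.
catchSync :: IO a -> (SomeException -> IO a) -> IO a
catchSync action handler = trySync action >>= either handler pure
```

With these helpers, catchSync component logAndRecover would handle an IOError thrown by the component, while a killThread aimed at the component's thread would still terminate it. Both component and logAndRecover are placeholder names.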

List of Initialization Failures

  • Configuration:

    • parsing of configuration fails
    • validation of configuration fails
  • Logging system:

    • Elasticsearch index can't be created
    • Log files can't be opened
  • Databases:

    • RocksDb database can't be opened
    • sqlite database can't be opened
    • not enough disk space available
  • Networking:

    • Certificate generation fails
    • Certificate or Key can't be read
    • Certificate is invalid (e.g. expired)
  • Chain Resources:

    • pruning of block header database files (detects inconsistencies)
  • BlockHeaderDb / Consensus:

    • Hashes of genesis headers don't match expected hashes for the given chainweb version
    • Missing dependencies in BlockHeaderDb (in part checked by db pruning)
  • Pact Service:

    • Hashes of genesis payloads don't match the expected hashes for the given chainweb version
  • Mempool:

  • P2P Networking:

    • No bootstrap nodes configured (should this be a failure?)
    • Synchronization with all bootstrap nodes fails
    • No network link available
    • DNS lookup not available (is this a failure? most peers are known by IP)
    • All HTTP connections fail with 502
  • HTTP Server:

    • port can't be allocated
  • Miner:

List of Unrecoverable Operation Failures

TODO