Commit 4b587b7 (1 parent: 44d2098)
Showing 71 changed files with 7,082 additions and 0 deletions.
```json
{
  "label": "Advanced Guides",
  "position": 3,
  "collapsed": true
}
```
38 changes: 38 additions & 0 deletions
versioned_docs/version-v1.1.0/advanced/adv-docker-configs.md
---
sidebar_position: 12
description: Use advanced docker-compose features to have more flexibility and power to change the default configuration.
---

# Advanced Docker Configs

:::info
This section is intended for *docker power users*, i.e. those who are familiar with `docker compose` and want more flexibility and power to change the default configuration.
:::

We use the "Multiple Compose File" feature, which provides a very powerful way to override any configuration in `docker-compose.yml` without modifying git-checked-in files, since editing those results in conflicts when upgrading this repo. See [this](https://docs.docker.com/compose/extends/#multiple-compose-files) for more details.

There are two additional compose files in [this repository](https://github.com/ObolNetwork/charon-distributed-validator-node/), `compose-debug.yml` and `docker-compose.override.yml.sample`, alongside the default `docker-compose.yml` file, that you can use for this purpose.
- `compose-debug.yml` contains additional containers that developers can use for debugging, like `jaeger`. To use it, run:

```shell
docker compose -f docker-compose.yml -f compose-debug.yml up
```
- `docker-compose.override.yml.sample` is intended to override the default configuration provided in `docker-compose.yml`. This is useful when, for example, you wish to add port mappings or want to disable a container. A sketch of a possible override file is shown at the end of this section.

- To use it, copy the sample file to `docker-compose.override.yml` and customise it to your liking. Create this file ONLY when you want to tweak something, because the default override file is empty and docker errors if you provide an empty compose file.

```shell
cp docker-compose.override.yml.sample docker-compose.override.yml

# Tweak docker-compose.override.yml and then run docker compose up
docker compose up
```
- You can also run all these compose files together. This is desirable when you want to use both features, for example when you want some debugging containers AND also want to override some defaults. To achieve this, run:

```shell
docker compose -f docker-compose.yml -f docker-compose.override.yml -f compose-debug.yml up
```
95 changes: 95 additions & 0 deletions
versioned_docs/version-v1.1.0/advanced/deployment-best-practices.md
---
sidebar_position: 11
description: DV deployment best practices, for running an optimal Distributed Validator setup at scale.
---

# Deployment Best Practices

The following is a selection of best practices for deploying Distributed Validator Clusters at scale on mainnet.
## Hardware Specifications

The following specifications are recommended for bare metal machines for clusters intending to run a significant number of mainnet validators:

### Minimum Specs

- A CPU with 4+ cores, favouring high clock speed over more cores (>3.0GHz, or a cpubenchmark [single thread](https://www.cpubenchmark.net/singleThread.html) score of >2,500)
- 16GB of RAM
- 2TB+ free SSD disk space (for mainnet)
- 10Mb/s internet bandwidth

### Recommended Specs for Extremely Large Clusters

- A CPU with 8+ physical cores, with clock speeds >3.5GHz
- 32GB+ RAM (depending on the EL+CL clients)
- 4TB+ NVMe storage
- 25Mb/s internet bandwidth
An NVMe storage device is **highly recommended for optimal performance**, offering nearly 10x more random reads/writes per second than a standard SSD.

Inadequate hardware (low-performance virtualized servers and/or slow HDD storage) has been observed to hinder performance, indicating the necessity of provisioning adequate resources. **CPU clock speed and disk throughput+latency are the most important factors for running a performant validator.**

Note that the Charon client itself takes less than 1GB of RAM and minimal CPU load. To optimize both performance and cost-effectiveness, it is recommended to prioritize physical over virtualized setups. Such configurations typically offer greater performance and minimize the overhead associated with virtualization, contributing to improved efficiency and reliability.

When constructing a DV cluster, be conscious of whether the cluster runs across cloud providers or stays within a single provider's private network, as this can impact the bandwidth and latency of the connections between nodes, as well as the egress costs of the cluster (Charon communicates relatively little with its peers, averaging tens of kB/s in large mainnet clusters). Ideally, bare metal machines in different locations within the same continent, spread across at least two providers, balance redundancy and performance.
## Intra-cluster Latency

It is recommended to **keep peer ping latency below 235 milliseconds for all peers in a cluster**. Charon should report a consensus duration averaging under 1 second through its Prometheus metric `core_consensus_duration_seconds_bucket` and the associated Grafana panel titled "Consensus Duration".
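As a quick check, the following is a sketch of how to compute a 95th-percentile consensus duration from that metric via the Prometheus HTTP API. It assumes your Prometheus server is reachable at `localhost:9090`:

```shell
# Query the 95th percentile of consensus duration over the last 5 minutes.
# Adjust the host/port to wherever your Prometheus instance listens.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(core_consensus_duration_seconds_bucket[5m])))'
```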
In cases where latencies exceed these thresholds, efforts should be made to reduce the physical distance between nodes or to optimize Internet Service Provider (ISP) settings accordingly. Ensure all nodes connect to one another directly rather than through a relay.

For high-scale, performance-critical deployments, an inter-peer latency of <25ms is optimal, along with an average consensus duration under 100ms.
## Node Locations

For optimal performance and high availability, it is recommended to provision machines or virtual machines (VMs) within the same continent. This practice helps minimize potential latency issues, ensuring efficient communication and responsiveness. Consult maps of [undersea internet cables](https://www.submarinecablemap.com/) when selecting low-latency locations across oceans.
## Peer Connections

Charon clients can establish connections with one another in two ways: either through a third publicly accessible server known as [a relay](../charon/charon-cli-reference.md#host-a-relay), or directly with one another if they can establish a connection. The former is known as a relay connection and the latter as a direct connection.

It is important that all nodes in a cluster be directly connected to one another: this can halve the latency between them and significantly reduces bandwidth constraints. Opening Charon's p2p port (the default is `3610`) to the Internet, or configuring your router's NAT gateway to permit connections to your Charon client, is what is required to facilitate a direct connection between clients.
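As a sketch, on a Linux host using `ufw` (an assumption; use the equivalent commands for your firewall or cloud security group):

```shell
# Allow inbound traffic on charon's default p2p port, then verify the rule.
sudo ufw allow 3610/tcp
sudo ufw status | grep 3610
```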
## Instance Independence

Each node in the cluster should have its own independent beacon node (EL+CL) and validator client, as well as its own Charon client. Sharing beacon nodes between the different nodes would potentially impact the fault tolerance of the cluster and should be avoided.
## Placement of Charon Clients

If you wish to divide a Distributed Validator node across multiple physical or virtual machines, locate the Charon client on the EL/CL machine instead of the VC machine. This setup reduces latency from Charon to the consensus layer, and keeps the public-internet-connected clients separate from the clients that hold the validator private keys. Be sure to use encrypted communication between your VC and the Charon client, for example through a cloud-provided network, a self-managed network tunnel, a VPN, a Kubernetes [CNI](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/), or similar.
## Node Configuration

Cluster sizes that allow for Byzantine Fault Tolerance are recommended, as they are safer than clusters with only Crash Fault Tolerance (see [Cluster Size and Resilience](../charon/cluster-configuration#cluster-size-and-resilience) for reference).
## MEV-Boost Relays

MEV relays are configured at the consensus layer or MEV-boost client level. Refer to our [guide](./quickstart-builder-api.mdx) to ensure all necessary configuration has been applied to your clients. As with all validators, low latency during proposal opportunities is extremely important. By default, MEV-Boost waits for all configured relays to return a bid, and times out if any have not returned a bid within 950ms. This default timeout is generally too slow for a distributed cluster (think of this time as additive to the time it takes the cluster to come to consensus, both of which need to happen within a 2-second window for optimal proposal broadcasting). It is likely better to list only relays that are located geographically near your node, so that once all relays respond (e.g. in <50ms) your cluster can move forward with the proposal.
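As a sketch, an mev-boost invocation along these lines restricts the relay list and tightens the getHeader timeout. The relay URL is a placeholder, and the flag names should be verified against your mev-boost version's `--help`:

```shell
# Run mev-boost with a single nearby relay and a tighter getHeader timeout.
# Timeouts are in milliseconds; 950 is the default.
mev-boost -mainnet \
  -relays "https://<relay-pubkey>@relay-near-you.example.org" \
  -request-timeout-getheader 750
```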
## Client Diversity

Clusters should consist of a combination of your preferred consensus, execution, and validator clients. It is recommended to include multiple client implementations for healthy client diversity within the cluster. Ideally, no single client type should exceed the fault tolerance of the cluster; that way, if any one client fails, the validators stay online and do nothing slashable.

Remote signers can be included as well, such as Web3Signer or Dirk. A diversity of private key infrastructure setups further reduces the risk of total key compromise.

Tested client combinations can be found in the [release notes](https://github.com/ObolNetwork/charon/releases) for each Charon version.
## Metrics Monitoring

Node operators can push [standard monitoring](./obol-monitoring.md) (Prometheus) and logging (Loki) data to Obol Labs' core team's cloud infrastructure, as requested by Obol Labs, for in-depth analysis of performance data and to assist during potential issues that may arise. We recommend that operators independently store information on their node health over the course of the validator lifecycle, as well as any information on validator performance that they collect during the normal life cycle of a validator.
## Obol Splits

Leveraging [Obol Splits](../sc/introducing-obol-splits.mdx) smart contracts allows for non-custodial fund handling and ongoing net customer payouts. Obol Splits ensure no commingling of funds across customers and maintain full non-custodial integrity. Read more about Obol Splits [here](../faq/general.mdx#obol-splits).
## Deposit Process

The deposit process can be handled by an automated script and repeated for a DV cluster until it reaches the desired number of validators.

It is important to allow time for the validators to be activated (usually <24 hours).
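One way to watch for activation is to poll the standard beacon node API, sketched below. The port (`5052`) and the pubkey are placeholders; adjust them for your consensus client:

```shell
# Print a validator's status (e.g. pending_queued, active_ongoing)
# using the standard beacon API validators endpoint.
curl -s "http://localhost:5052/eth/v1/beacon/states/head/validators/<validator-pubkey>" \
  | jq -r '.data.status'
```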
Consider using batching smart contracts to reduce the gas cost of the script, but take caution in integrating them so as not to make an invalid deposit.
90 changes: 90 additions & 0 deletions
---
sidebar_position: 2
description: Add monitoring credentials to help the Obol Team monitor the health of your cluster
---

# Monitoring your Node

This comprehensive guide will assist you in effectively monitoring your Charon clusters and setting up alerts by running your own Prometheus and Grafana server. If you want to use Obol's [public dashboard](https://grafana.monitoring.gcp.obol.tech/d/d895e47a-3c2d-46b7-9b15-8f31202681af/clusters-aggregate-view?orgId=6) instead of running your own servers, refer to [this section](./obol-monitoring.md) of the Obol docs, which teaches you how to push Prometheus metrics to Obol.

In short, Prometheus collects the metrics and Grafana visualizes them. To learn more about Prometheus and Grafana, visit [here](https://grafana.com/docs/grafana/latest/getting-started/get-started-grafana-prometheus/). If you are using the **[CDVN repository](https://github.com/ObolNetwork/charon-distributed-validator-node)** or the **[CDVC repository](https://github.com/ObolNetwork/charon-distributed-validator-cluster)**, then Prometheus and Grafana are part of the docker compose file and will be installed when you run `docker compose up`.
The local Grafana server will have a few pre-built dashboards:

1. Charon Overview

   This is the main dashboard that provides all the relevant details about the Charon node, for example peer connectivity, duty completion, and the health of the beacon node and downstream validator. To open it, navigate to the `charon-distributed-validator-node` directory and open the following URL in the browser: `http://localhost:3000/d/d6qujIJVk/`.

2. Single Charon Node Dashboard (deprecated)

   This is an older dashboard for Charon node monitoring which is now deprecated. If you are still using it, we highly recommend moving to Charon Overview for the most up-to-date panels.

3. Charon Log Dashboard

   This dashboard can be used to query the logs emitted while running your Charon node. It utilises [Grafana Loki](https://grafana.com/oss/loki/). This dashboard is not active by default and should only be used in debug mode. Refer to the [advanced docker config](./adv-docker-configs) section for how to set up debug mode.
The monitoring stack also ships with a set of pre-configured alerts:

| Alert Name | Description | Troubleshoot |
| --- | --- | --- |
| ClusterBeaconNodeDown | This alert is activated when the beacon node in the cluster is offline. The beacon node is crucial for validating transactions and producing new blocks; its unavailability could disrupt the overall functionality of the cluster. | Most likely the data is corrupted. Wipe data from the point you know it was corrupted and restart the beacon node so it can sync again. |
| ClusterMissedAttestations | This alert indicates that there have been missed attestations in the cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster. | This alert is triggered when 3 attestations are missed in 2 minutes. Check if the minimum threshold of peers is online. If so, check for beacon node API errors and downstream validator errors using Loki. Lastly, debug from Docker using `docker compose debug`. |
| ClusterInUnknownStatus | This alert is designed to activate when a node within the cluster is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the `app_monitoring_readyz` metric is 0. | This is most likely a bug in Charon. Report it to us via [Discord](https://discord.com/channels/849256203614945310/970759460693901362). |
| ClusterInsufficientPeers | This alert is set to activate when the number of peers for a node in the cluster is insufficient. The condition is evaluated by checking whether the maximum of `app_monitoring_readyz` equals 4. | If you are running a group cluster, check with the other peers to troubleshoot the issue. If you are running a solo cluster, look into the other machines running the DVs to find the problem. |
| ClusterFailureRate | This alert is activated when the failure rate of the cluster exceeds a certain threshold, specifically more than 5% of duties failed in the last 6 hours. | Check the upstream and downstream dependencies, latency, and hardware issues. |
| ClusterVCMissingValidators | This alert is activated if any validators in the cluster are missing. This happens when the validator client has not been able to load validator keys in the past 10 minutes. | Find out whether validator keys are missing and load them. |
| ClusterHighPctFailedSyncMsgDuty | This alert is activated if a high percentage of sync message duties failed in the cluster. It fires if the increase in failed duties tagged "sync_message" over the last hour, divided by the increase in total duties with that tag over the last hour, is greater than 10%. | This may be due to limitations in beacon node performance on nodes within the cluster. In Charon, this duty is the most demanding; however, an increased failure rate does not impact rewards. |
| ClusterNumConnectedRelays | This alert is activated if the number of connected relays in the cluster falls to 0. | Make sure the correct relay is configured. If you still get the error, report it to us via [Discord](https://discord.com/channels/849256203614945310/970759460693901362). |
| PeerPingLatency | This alert is activated if the 90th percentile of ping latency to the peers in a cluster exceeds 400ms within 2 minutes. | Make sure you have a stable and high-speed internet connection. If you have geographically distributed nodes, make sure latency does not exceed 250ms. |
| ClusterBeaconNodeZeroPeers | This alert is activated when the beacon node cannot find peers. | Consult your beacon node client's docs to troubleshoot. Make sure there is no port overlap and that p2p discovery is open. |
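For reference, a Prometheus alerting rule matching `ClusterInUnknownStatus` above could look like the sketch below. Only the expression comes from the table; the file name, `for` duration, and labels are assumptions to adapt to your setup:

```shell
# Write a hypothetical alerting-rule file for Prometheus.
cat > charon-alerts.yml <<'EOF'
groups:
  - name: charon
    rules:
      - alert: ClusterInUnknownStatus
        expr: max(app_monitoring_readyz) == 0  # node is in an unknown state
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: A node in the cluster is in an unknown state
EOF
```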
## Setting Up a Contact Point

When alerts are triggered, they are routed to contact points according to notification policies. For this, contact points must be added. Grafana supports several kinds of contact points, like email, PagerDuty, Discord, Slack, and Telegram. This section shows how to add a Discord channel as a contact point.

1. On the left nav bar in the Grafana console, under the `Alerts` section, click on `Contact points`.
2. Click on `+ Add contact point`. It will show the following page. Choose Discord in the `Integration` drop-down.

   ![AlertsContactPoint](/img/AlertsContactPoint.png)

3. Give the contact point a descriptive name. Create a channel in Discord and copy its webhook URL. Once done, click `Save contact point` to finish. (A quick way to test the webhook is sketched after this list.)
4. When alerts fire, they are sent without the cluster-detail variables filled in; for example, the `cluster_hash` variable is missing in `cluster_hash = {{.cluster_hash}}`. This is done to save disk space. To find the details, use `docker compose -f docker-compose.yml -f compose-debug.yml up`. More details [here](https://docs.obol.tech/docs/advanced/adv-docker-configs).
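To verify the webhook itself works before relying on Grafana to deliver alerts, you can post a test message directly; the URL is a placeholder for your channel's webhook:

```shell
# Discord webhooks accept a JSON body with a "content" field.
curl -H "Content-Type: application/json" \
  -d '{"content": "Test alert from Grafana contact point"}' \
  "https://discord.com/api/webhooks/<id>/<token>"
```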
## Best Practices for Monitoring Charon Nodes & Cluster

- **Establish Baselines**: Familiarize yourself with the normal operating metrics like CPU, memory, and network usage. This will help you detect anomalies.
- **Define Key Metrics**: Set up alerts for essential metrics, encompassing both system-level and Charon-specific ones.
- **Configure Alerts**: Based on these metrics, set up actionable alerts.
- **Monitor Network**: Regularly assess the connectivity between nodes and the network.
- **Perform Regular Health Checks**: Consistently evaluate the status of your nodes and clusters.
- **Monitor System Logs**: Keep an eye on logs for error messages or unusual activities.
- **Assess Resource Usage**: Ensure your nodes are neither over- nor under-utilized.
- **Automate Monitoring**: Use automation to ensure no issues go undetected.
- **Conduct Drills**: Regularly simulate failure scenarios to fine-tune your setup.
- **Update Regularly**: Keep your nodes and clusters updated with the latest software versions.
## Third-Party Services for Uptime Testing

- [updown.io](https://updown.io/)
- [Grafana Synthetic Monitoring](https://grafana.com/grafana/plugins/grafana-synthetic-monitoring-app/)
## Key Metrics to Watch to Verify Node Health, Based on Jobs

**CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should.

**Memory Usage**: If a node is consistently running out of memory, it could be due to a memory leak or simply under-provisioning.

**Disk I/O**: Slow disk operations can cause applications to hang or delay responses. High disk I/O can indicate storage performance issues or be a sign of high load on the system.

**Network Usage**: High network traffic or packet loss can signal network configuration issues, or that a service is being overwhelmed by requests.

**Disk Space**: Running out of disk space can lead to application errors and data loss.

**Uptime**: The amount of time a system has been up without any restarts. Frequent restarts can indicate instability in the system.

**Error Rates**: The number of errors encountered by your application. These could be 4xx/5xx HTTP errors, exceptions, or any other kind of error your application may log.

**Latency**: The delay before a transfer of data begins following an instruction for its transfer.
It is also important to check:

- NTP clock skew;
- process restarts and failures (e.g. via `node_systemd`);
- high error and panic log counts (alert on these).