Add utility scripts #323

Closed · wants to merge 5 commits
58 changes: 58 additions & 0 deletions scripts/README.md
@@ -0,0 +1,58 @@
# Grafana Dashboard and Alert Management Script

This Python script facilitates the export and import of Grafana dashboards and the import of Alertmanager configurations into a Grafana instance. It's designed to work with Grafana v8 and above.

## Prerequisites

- Python 3.x installed on your machine.
- The `requests` and `python-dotenv` libraries installed (the script loads its Grafana credentials via `dotenv`). You can install them using pip:
```sh
pip install requests python-dotenv
```
- Access to a Grafana instance (v8 and above) with API access enabled.
- A Grafana API key with permissions to view and create dashboards and manage Alertmanager configurations.

## Configuration

Before running the script, you need to set up a few configurations:

1. **Grafana URL and API Key**: The script reads these from the `GRAFANA_URL` and `GRAFANA_API_KEY` environment variables (loaded from a local `.env` file, if present, via `python-dotenv`). Set them to your Grafana instance URL and an API key with the required permissions.

2. **Export and Import Paths**: Set the `export_path` and `import_path` variables to the desired locations on your filesystem where the dashboards should be exported to or imported from.

3. **Alertmanager Configuration File**: Ensure you have an Alertmanager configuration file in YAML format ready for import. Update the `alert_config_path` variable with the path to this file.
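
For reference, the configuration block at the top of `scripts/grafana.py` as committed in this PR looks like this; the paths are defaults you will likely want to adjust for your environment:

```python
import os
from dotenv import load_dotenv

# Grafana credentials come from the environment (or a local .env file).
load_dotenv()
grafana_url = os.getenv("GRAFANA_URL")
grafana_api_key = os.getenv("GRAFANA_API_KEY")

# Directories for exported/imported dashboards and the Alertmanager YAML.
export_path = "./dashboards"
import_path = "./dashboards"
alert_config_path = "./dashboard/alertmanager_config.yml"  # note: committed default uses "dashboard", not "dashboards"
```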

## Running the Script

To run the script, follow these steps:

1. Open your terminal or command prompt.

2. Navigate to the directory where the script is located:
```sh
cd path/to/script_directory
```

3. Run the script using Python:
```sh
python grafana.py
```

### What the Script Does

- **Export Dashboards**: Fetches all dashboards from the specified Grafana instance and saves them as JSON files in the directory specified by `export_path`.

- **Import Dashboards**: Reads dashboard JSON files from the directory specified by `import_path` and imports them into the specified Grafana instance.

- **Import Alertmanager Configuration**: Imports an Alertmanager configuration from a YAML file into the specified Grafana instance.
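
Note that, as committed, the calls at the bottom of `grafana.py` only invoke the Alertmanager import; the dashboard export and import calls are commented out. Before running, enable whichever steps you need:

```python
# Bottom of scripts/grafana.py — uncomment the steps you want to run.
export_dashboards_from_grafana(grafana_url, grafana_api_key, export_path)
import_dashboards_to_grafana(grafana_url, grafana_api_key, import_path)
import_exported_alerts(grafana_url, grafana_api_key, alert_config_path)
```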

## Notes

- Ensure the Grafana API key provided has the necessary permissions to perform the actions required by the script.
- The script assumes Grafana v8 and above for compatibility with the Alertmanager configuration import feature.
- Adjust file paths and configurations according to your operating system and environment.

## Troubleshooting

- **API Key Permissions**: If you encounter permissions errors, ensure your Grafana API key has the correct roles assigned.
- **File Paths**: Ensure file paths are correct and accessible by the script. This is a common issue when running the script across different operating systems.
140 changes: 140 additions & 0 deletions scripts/grafana.py
@@ -0,0 +1,140 @@
import os
import json
import logging
import requests
import urllib.parse
from dotenv import load_dotenv

load_dotenv()
grafana_url = os.getenv("GRAFANA_URL")
grafana_api_key = os.getenv("GRAFANA_API_KEY")

export_path = "./dashboards"
import_path = "./dashboards"
alert_config_path = "./dashboard/alertmanager_config.yml"

def export_dashboards_from_grafana(grafana_url, grafana_api_key, export_path):
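    """Export all dashboards from every Grafana folder as JSON files under export_path, one sub-directory per folder."""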
headers = {"Authorization": "Bearer " + grafana_api_key}
folder_api_endpoint = "/api/folders"

try:
response = requests.get(grafana_url + folder_api_endpoint, headers=headers)
response.raise_for_status()
except requests.exceptions.RequestException as e:
logging.error(f"Error getting Grafana folders: {str(e)}")
exit()

folders = json.loads(response.text)

for folder in folders:
folder_name = folder['title'].replace("/", "_")
folder_id = folder['id']
folder_path = "./" + folder_name

dashboard_api_endpoint = "/api/search?folderIds=" + str(folder_id)
response = requests.get(grafana_url + dashboard_api_endpoint, headers=headers)
dashboards = json.loads(response.text)

if not os.path.exists(folder_path):
os.makedirs(folder_path)

for dashboard in dashboards:
dashboard_id = dashboard['id']
dashboard_name = dashboard['title'].replace("/", "_")
dashboard_api_endpoint = "/api/dashboards/uid/" + str(dashboard['uid'])

response = requests.get(grafana_url + dashboard_api_endpoint, headers=headers)
dashboard_data = json.loads(response.text)
dashboard_data.pop('meta', None)

dashboard_file_path = os.path.join(folder_path, dashboard_name + ".json")

try:
with open(dashboard_file_path, "w") as f:
f.write(json.dumps(dashboard_data, indent=4))
except Exception as e:
print(f"Error exporting dashboard {dashboard_name} in folder {folder_name}: {e}")

print("Dashboard export complete!")


def import_dashboards_to_grafana(grafana_url, grafana_api_key, import_path):
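    """Recreate each sub-directory of import_path as a Grafana folder and import the dashboard JSON files it contains."""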
headers = {"Authorization": "Bearer " + grafana_api_key}

for folder_name in os.listdir(import_path):
folder_path = os.path.join(import_path, folder_name)

if not os.path.isdir(folder_path):
continue

folder_api_endpoint = "/api/folders"
folder_payload = {"title": folder_name}
response = requests.post(grafana_url + folder_api_endpoint, headers=headers, json=folder_payload)

        # Assumes the folder does not already exist; parse the created folder's id and uid.
        folder = json.loads(response.text)
        folder_id = folder['id']
        folder_uid = folder['uid']

print(f"Created folder {folder_name} with ID {folder_id} and UID {folder_uid}")

for filename in os.listdir(folder_path):
dashboard_path = os.path.join(folder_path, filename)

if not dashboard_path.endswith(".json"):
continue

with open(dashboard_path, "r") as f:
dashboard_data = json.load(f)

dashboard_data = {
"dashboard": dashboard_data['dashboard'],
"folderId": 0,
"folderUid": folder_uid,
"message": "Imported from JSON",
"overwrite": True
}

dashboard_data['dashboard']['folderId'] = folder_id
dashboard_data['dashboard']['folderTitle'] = folder_name

encoded_dashboard_name = urllib.parse.quote(filename[:-5])

dashboard_api_endpoint = "{grafana_url}/api/dashboards/db"
response = requests.post(dashboard_api_endpoint, headers=headers, json=dashboard_data)

if response.status_code == 200:
print(f"Dashboard {filename} imported successfully!")
else:
print(f"Error importing dashboard {filename}: {response.text}")

print("Dashboard import complete!")


def import_exported_alerts(grafana_url, grafana_api_key, alert_config_path):
"""
Args:
- grafana_url: URL of the Grafana instance.
- grafana_api_key: API key for authentication.
- alert_config_path: Path to the Alertmanager configuration file (YAML format).
"""

headers = {"Authorization": "Bearer " + grafana_api_key, "Content-Type": "application/yaml"}
alertmanager_api_endpoint = "/api/alertmanager/grafana/config/api/v1/alerts"
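    # The "grafana" segment in this path targets Grafana's built-in (Grafana-managed) Alertmanager configuration.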

try:
with open(alert_config_path, 'r') as file:
alert_config = file.read()

response = requests.post(grafana_url + alertmanager_api_endpoint, headers=headers, data=alert_config)

if response.status_code == 200:
print("Alertmanager configuration imported successfully!")
else:
print(f"Failed to import Alertmanager configuration: {response.text}")

except Exception as e:
print(f"Error importing Alertmanager configuration: {e}")


# export_dashboards_from_grafana(grafana_url, grafana_api_key, export_path)
# import_dashboards_to_grafana(grafana_url, grafana_api_key, import_path)
import_exported_alerts(grafana_url, grafana_api_key, alert_config_path)
94 changes: 94 additions & 0 deletions versioned_docs/version-v0.17.0/int/quickstart/monitoring.md
@@ -0,0 +1,94 @@
# Getting Started Monitoring your Node
> **Contributor review comment:** Was this an accidental commit of something stashed, or was it meant to come in with this PR? If it was, why is it in the v0.17.0 folder? Generally, content goes in the docs folder, and when a version is cut, it gets copied to the versioned_docs folder.


Welcome to this comprehensive guide, designed to assist you in effectively monitoring your Charon cluster and nodes, and setting up alerts based on specified parameters.

## Prerequisites

Ensure the following software is installed:

- Docker: Find the installation guide for Ubuntu **[here](https://docs.docker.com/engine/install/ubuntu/)**
- Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)**
- Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana

## Import Pre-Configured Charon Dashboards

- Navigate to the **[repository](https://github.com/ObolNetwork/terraform-modules/tree/main/grafana-dashboards/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will use the Charon Dashboard JSON.
- In your Grafana interface, create a new dashboard and select the import option.

![Screenshot 2023-06-26 at 1.00.05 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/2bba3f52-ff32-452e-811b-f2ac7a4905fb/Screenshot_2023-06-26_at_1.00.05_PM.png)

- Copy the contents of the Charon Dashboard JSON from the repository and paste it into the import box in Grafana. Click "Load" to proceed.

![Screenshot 2023-06-26 at 1.03.08 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/6790e67a-eb51-4bfb-b7b1-df14f214b72d/Screenshot_2023-06-26_at_1.03.08_PM.png)

- Finalize the import by clicking on the "Import" button. At this point, your dashboard should begin displaying metrics. Ensure your Charon client and Prometheus are operational for this to occur.

![Screenshot 2023-06-26 at 1.16.27 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/cc0b4a9e-c21c-4ce4-b613-9c3f84e696ed/Screenshot_2023-06-26_at_1.16.27_PM.png)

## Example alerting rules

- Alerts for Node Exporter can be created using the sample rules provided in [Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/rules.html#host-and-hardware).
- For Charon/Alpha alerts, refer to the alerting rules in [monitoring/alerting-rules at main · ObolNetwork/monitoring](https://github.com/ObolNetwork/monitoring/tree/main/alerting-rules).

## Understanding Alert rules

1. `AlphaClusterBeaconNodeDown`: This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster.
2. `AlphaClusterBeaconNodeSyncing`: This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks on the chain.
3. `AlphaClusterNodeDown`: This alert is activated when a node in a specified Alpha cluster is offline.
4. `AlphaClusterMissedAttestations`: This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster.
5. `AlphaClusterInUnknownStatus`: This alert is designed to activate when a node within the "Alpha M1 Cluster #1" is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the `app_monitoring_readyz` metric is 0.
6. `AlphaClusterInsufficientPeers`: This alert is set to activate when the number of peers for a node in the Alpha M1 Cluster #1 is insufficient. The condition is evaluated by checking whether the maximum of `app_monitoring_readyz` equals 4.
7. `AlphaClusterFailureRate`: This alert is activated when the failure rate of the Alpha M1 Cluster #1 exceeds a certain threshold.
8. `AlphaClusterVCMissingValidators`: This alert is activated if any validators in the Alpha M1 Cluster #1 are missing.
9. `AlphaClusterHighPctFailedSyncMsgDuty`: This alert is activated if a high percentage of sync message duties failed in the "Alpha M1 Cluster #1". It fires when the increase in failed duties tagged "sync_message" over the last hour, divided by the increase in total duties tagged "sync_message" over the last hour, is greater than 0.1.
10. `AlphaClusterNumConnectedRelays`: This alert is activated if the number of connected relays in the "Alpha M1 Cluster #1" falls to 0.
11. `PeerPingLatency`: This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes.

## Best Practices for Monitoring Charon Nodes & Clusters

- **Establish Baselines**: Familiarize yourself with the normal operation metrics like CPU, memory, and network usage. This will help you detect anomalies.
- **Define Key Metrics**: Set up alerts for essential metrics, encompassing both system-level and Charon-specific ones.
- **Configure Alerts**: Based on these metrics, set up actionable alerts.
- **Monitor Network**: Regularly assess the connectivity between nodes and the network.
- **Perform Regular Health Checks**: Consistently evaluate the status of your nodes and clusters.
- **Monitor System Logs**: Keep an eye on logs for error messages or unusual activities.
- **Assess Resource Usage**: Ensure your nodes are neither over- nor under-utilized.
- **Automate Monitoring**: Use automation to ensure no issues go undetected.
- **Conduct Drills**: Regularly simulate failure scenarios to fine-tune your setup.
- **Update Regularly**: Keep your nodes and clusters updated with the latest software versions.

## Third-Party Services for Uptime Testing

- [updown.io](https://updown.io/)
- [Grafana Synthetic Monitoring](https://grafana.com/blog/2022/03/10/best-practices-for-alerting-on-synthetic-monitoring-metrics-in-grafana-cloud/)

## Key metrics to watch to verify node health based on jobs

**node_exporter:**

**CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should.

**Memory Usage**: If a node is consistently running out of memory, it could be due to a memory leak or simply under-provisioning.

**Disk I/O**: Slow disk operations can cause applications to hang or delay responses. High disk I/O can indicate storage performance issues or a sign of high load on the system.

**Network Usage**: High network traffic or packet loss can signal network configuration issues, or that a service is being overwhelmed by requests.

**Disk Space**: Running out of disk space can lead to application errors and data loss.

**Uptime**: The amount of time a system has been up without any restarts. Frequent restarts can indicate instability in the system.

**Error Rates**: The number of errors encountered by your application. This could be 4xx/5xx HTTP errors, exceptions, or any other kind of error your application may log.

**Latency**: The delay before a transfer of data begins following an instruction for its transfer.

It is also important to check:

- NTP clock skew
- Process restarts and failures (e.g. through `node_systemd`)
- High error and panic log counts (these are worth alerting on)
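
For a quick spot check outside of Grafana, the same node_exporter metrics can also be queried directly against the Prometheus HTTP API. Below is a minimal sketch, not part of this PR's scripts; the Prometheus URL and the CPU-usage query are illustrative assumptions, so adjust them to your own setup:

```python
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumption: a locally reachable Prometheus

# Percentage of CPU time spent in non-idle modes, averaged over 5 minutes.
CPU_USAGE_QUERY = '100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'

def query_prometheus(query):
    """Run an instant query against Prometheus and return the result vector."""
    response = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
    response.raise_for_status()
    return response.json()["data"]["result"]

if __name__ == "__main__":
    for sample in query_prometheus(CPU_USAGE_QUERY):
        # Each sample is {"metric": {...}, "value": [<timestamp>, "<value>"]}.
        print(f"CPU usage: {float(sample['value'][1]):.1f}%")
```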