Add utility scripts #323

Closed · wants to merge 5 commits
58 changes: 58 additions & 0 deletions scripts/README.md
@@ -0,0 +1,58 @@
# Grafana Dashboard and Alert Management Script

This Python script facilitates the export and import of Grafana dashboards and the import of Alertmanager configurations into a Grafana instance. It's designed to work with Grafana v8 and above.

## Prerequisites

- Python 3.x installed on your machine.
- The `requests` and `python-dotenv` libraries installed (the script loads its Grafana credentials via `dotenv`). You can install them using pip:
```sh
pip install requests python-dotenv
```
- Access to a Grafana instance (v8 and above) with API access enabled.
- A Grafana API key with permissions to view and create dashboards and manage Alertmanager configurations.

## Configuration

Before running the script, you need to set up a few configurations:

1. **Grafana URL and API Key**: The script reads these from the `GRAFANA_URL` and `GRAFANA_API_KEY` environment variables (loaded from a local `.env` file, if present, via `python-dotenv`). Set them to your Grafana instance URL and an API key with the required permissions.

2. **Export and Import Paths**: Set the `export_path` and `import_path` variables to the desired locations on your filesystem where the dashboards should be exported to or imported from.

3. **Alertmanager Configuration File**: Ensure you have an Alertmanager configuration file in YAML format ready for import. Update the `alert_config_path` variable with the path to this file.
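
For reference, the configuration block at the top of `scripts/grafana.py` as committed in this PR looks like this; the paths are defaults you will likely want to adjust for your environment:

```python
import os
from dotenv import load_dotenv

# Grafana credentials come from the environment (or a local .env file).
load_dotenv()
grafana_url = os.getenv("GRAFANA_URL")
grafana_api_key = os.getenv("GRAFANA_API_KEY")

# Directories for exported/imported dashboards and the Alertmanager YAML.
export_path = "./dashboards"
import_path = "./dashboards"
alert_config_path = "./dashboard/alertmanager_config.yml"  # note: committed default uses "dashboard", not "dashboards"
```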

## Running the Script

To run the script, follow these steps:

1. Open your terminal or command prompt.

2. Navigate to the directory where the script is located:
```sh
cd path/to/script_directory
```

3. Run the script using Python:
```sh
python grafana.py
```

### What the Script Does

- **Export Dashboards**: Fetches all dashboards from the specified Grafana instance and saves them as JSON files in the directory specified by `export_path`.

- **Import Dashboards**: Reads dashboard JSON files from the directory specified by `import_path` and imports them into the specified Grafana instance.

- **Import Alertmanager Configuration**: Imports an Alertmanager configuration from a YAML file into the specified Grafana instance.
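
Note that, as committed, the calls at the bottom of `grafana.py` only invoke the Alertmanager import; the dashboard export and import calls are commented out. Before running, enable whichever steps you need:

```python
# Bottom of scripts/grafana.py — uncomment the steps you want to run.
export_dashboards_from_grafana(grafana_url, grafana_api_key, export_path)
import_dashboards_to_grafana(grafana_url, grafana_api_key, import_path)
import_exported_alerts(grafana_url, grafana_api_key, alert_config_path)
```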

## Notes

- Ensure the Grafana API key provided has the necessary permissions to perform the actions required by the script.
- The script assumes Grafana v8 and above for compatibility with the Alertmanager configuration import feature.
- Adjust file paths and configurations according to your operating system and environment.

## Troubleshooting

- **API Key Permissions**: If you encounter permissions errors, ensure your Grafana API key has the correct roles assigned.
- **File Paths**: Ensure file paths are correct and accessible by the script. This is a common issue when running the script across different operating systems.
140 changes: 140 additions & 0 deletions scripts/grafana.py
@@ -0,0 +1,140 @@
import os
import json
import logging
import requests
import urllib.parse
from dotenv import load_dotenv

load_dotenv()
grafana_url = os.getenv("GRAFANA_URL")
grafana_api_key = os.getenv("GRAFANA_API_KEY")

export_path = "./dashboards"
import_path = "./dashboards"
alert_config_path = "./dashboard/alertmanager_config.yml"

def export_dashboards_from_grafana(grafana_url, grafana_api_key, export_path):
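    """Export all dashboards from every Grafana folder as JSON files under export_path, one sub-directory per folder."""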
headers = {"Authorization": "Bearer " + grafana_api_key}
folder_api_endpoint = "/api/folders"

try:
response = requests.get(grafana_url + folder_api_endpoint, headers=headers)
response.raise_for_status()
except requests.exceptions.RequestException as e:
logging.error(f"Error getting Grafana folders: {str(e)}")
exit()

folders = json.loads(response.text)

for folder in folders:
folder_name = folder['title'].replace("/", "_")
folder_id = folder['id']
folder_path = "./" + folder_name

dashboard_api_endpoint = "/api/search?folderIds=" + str(folder_id)
response = requests.get(grafana_url + dashboard_api_endpoint, headers=headers)
dashboards = json.loads(response.text)

if not os.path.exists(folder_path):
os.makedirs(folder_path)

for dashboard in dashboards:
dashboard_id = dashboard['id']
dashboard_name = dashboard['title'].replace("/", "_")
dashboard_api_endpoint = "/api/dashboards/uid/" + str(dashboard['uid'])

response = requests.get(grafana_url + dashboard_api_endpoint, headers=headers)
dashboard_data = json.loads(response.text)
dashboard_data.pop('meta', None)

dashboard_file_path = os.path.join(folder_path, dashboard_name + ".json")

try:
with open(dashboard_file_path, "w") as f:
f.write(json.dumps(dashboard_data, indent=4))
except Exception as e:
print(f"Error exporting dashboard {dashboard_name} in folder {folder_name}: {e}")

print("Dashboard export complete!")


def import_dashboards_to_grafana(grafana_url, grafana_api_key, import_path):
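    """Recreate each sub-directory of import_path as a Grafana folder and import the dashboard JSON files it contains."""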
headers = {"Authorization": "Bearer " + grafana_api_key}

for folder_name in os.listdir(import_path):
folder_path = os.path.join(import_path, folder_name)

if not os.path.isdir(folder_path):
continue

folder_api_endpoint = "/api/folders"
folder_payload = {"title": folder_name}
response = requests.post(grafana_url + folder_api_endpoint, headers=headers, json=folder_payload)

        # Assumes the folder does not already exist; parse the created folder's id and uid.
        folder = json.loads(response.text)
        folder_id = folder['id']
        folder_uid = folder['uid']

print(f"Created folder {folder_name} with ID {folder_id} and UID {folder_uid}")

for filename in os.listdir(folder_path):
dashboard_path = os.path.join(folder_path, filename)

if not dashboard_path.endswith(".json"):
continue

with open(dashboard_path, "r") as f:
dashboard_data = json.load(f)

dashboard_data = {
"dashboard": dashboard_data['dashboard'],
"folderId": 0,
"folderUid": folder_uid,
"message": "Imported from JSON",
"overwrite": True
}

dashboard_data['dashboard']['folderId'] = folder_id
dashboard_data['dashboard']['folderTitle'] = folder_name

encoded_dashboard_name = urllib.parse.quote(filename[:-5])

dashboard_api_endpoint = "{grafana_url}/api/dashboards/db"
response = requests.post(dashboard_api_endpoint, headers=headers, json=dashboard_data)

if response.status_code == 200:
print(f"Dashboard {filename} imported successfully!")
else:
print(f"Error importing dashboard {filename}: {response.text}")

print("Dashboard import complete!")


def import_exported_alerts(grafana_url, grafana_api_key, alert_config_path):
"""
Args:
- grafana_url: URL of the Grafana instance.
- grafana_api_key: API key for authentication.
- alert_config_path: Path to the Alertmanager configuration file (YAML format).
"""

headers = {"Authorization": "Bearer " + grafana_api_key, "Content-Type": "application/yaml"}
alertmanager_api_endpoint = "/api/alertmanager/grafana/config/api/v1/alerts"
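    # The "grafana" segment in this path targets Grafana's built-in (Grafana-managed) Alertmanager configuration.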

try:
with open(alert_config_path, 'r') as file:
alert_config = file.read()

response = requests.post(grafana_url + alertmanager_api_endpoint, headers=headers, data=alert_config)

if response.status_code == 200:
print("Alertmanager configuration imported successfully!")
else:
print(f"Failed to import Alertmanager configuration: {response.text}")

except Exception as e:
print(f"Error importing Alertmanager configuration: {e}")


# export_dashboards_from_grafana(grafana_url, grafana_api_key, export_path)
# import_dashboards_to_grafana(grafana_url, grafana_api_key, import_path)
import_exported_alerts(grafana_url, grafana_api_key, alert_config_path)
94 changes: 94 additions & 0 deletions versioned_docs/version-v0.17.0/int/quickstart/monitoring.md
@@ -0,0 +1,94 @@
# Getting Started Monitoring your Node
> **Contributor review comment:** Was this an accidental commit of something stashed, or was it meant to come in with this PR? If it was, why is it in the v0.17.0 folder? Generally, content goes in the docs folder, and when a version is cut, it gets copied to the versioned_docs folder.


Welcome to this comprehensive guide, designed to assist you in effectively monitoring your Charon cluster and nodes, and setting up alerts based on specified parameters.

## Prerequisites

Ensure the following software is installed:

- Docker: Find the installation guide for Ubuntu **[here](https://docs.docker.com/engine/install/ubuntu/)**
- Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)**
- Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana

## Import Pre-Configured Charon Dashboards

- Navigate to the **[repository](https://github.com/ObolNetwork/terraform-modules/tree/main/grafana-dashboards/dashboards)** that contains a variety of Grafana dashboards. For this demonstration, we will use the Charon Dashboard JSON.
- In your Grafana interface, create a new dashboard and select the import option.

![Screenshot 2023-06-26 at 1.00.05 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/2bba3f52-ff32-452e-811b-f2ac7a4905fb/Screenshot_2023-06-26_at_1.00.05_PM.png)

- Copy the contents of the Charon Dashboard JSON from the repository and paste it into the import box in Grafana. Click "Load" to proceed.

![Screenshot 2023-06-26 at 1.03.08 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/6790e67a-eb51-4bfb-b7b1-df14f214b72d/Screenshot_2023-06-26_at_1.03.08_PM.png)

- Finalize the import by clicking on the "Import" button. At this point, your dashboard should begin displaying metrics. Ensure your Charon client and Prometheus are operational for this to occur.

![Screenshot 2023-06-26 at 1.16.27 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/cc0b4a9e-c21c-4ce4-b613-9c3f84e696ed/Screenshot_2023-06-26_at_1.16.27_PM.png)

## Example alerting rules

- Alerts for Node Exporter can be created using the sample rules provided in [Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/rules.html#host-and-hardware).
- For Charon/Alpha alerts, refer to the alerting rules in [monitoring/alerting-rules at main · ObolNetwork/monitoring](https://github.com/ObolNetwork/monitoring/tree/main/alerting-rules).

## Understanding Alert rules

1. `AlphaClusterBeaconNodeDown`: This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster.
2. `AlphaClusterBeaconNodeSyncing`: This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks on the chain.
3. `AlphaClusterNodeDown`: This alert is activated when a node in a specified Alpha cluster is offline.
4. `AlphaClusterMissedAttestations`: This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster.
5. `AlphaClusterInUnknownStatus`: This alert is designed to activate when a node within the "Alpha M1 Cluster #1" is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the `app_monitoring_readyz` metric is 0.
6. `AlphaClusterInsufficientPeers`: This alert is set to activate when the number of peers for a node in the Alpha M1 Cluster #1 is insufficient. The condition is evaluated by checking whether the maximum of `app_monitoring_readyz` equals 4.
7. `AlphaClusterFailureRate`: This alert is activated when the failure rate of the Alpha M1 Cluster #1 exceeds a certain threshold.
8. `AlphaClusterVCMissingValidators`: This alert is activated if any validators in the Alpha M1 Cluster #1 are missing.
9. `AlphaClusterHighPctFailedSyncMsgDuty`: This alert is activated if a high percentage of sync message duties failed in the "Alpha M1 Cluster #1". It fires when the increase in failed duties tagged "sync_message" over the last hour, divided by the increase in total duties tagged "sync_message" over the last hour, is greater than 0.1.
10. `AlphaClusterNumConnectedRelays`: This alert is activated if the number of connected relays in the "Alpha M1 Cluster #1" falls to 0.
11. `PeerPingLatency`: This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes.

## Best Practices for Monitoring Charon Nodes & Clusters

- **Establish Baselines**: Familiarize yourself with the normal operation metrics like CPU, memory, and network usage. This will help you detect anomalies.
- **Define Key Metrics**: Set up alerts for essential metrics, encompassing both system-level and Charon-specific ones.
- **Configure Alerts**: Based on these metrics, set up actionable alerts.
- **Monitor Network**: Regularly assess the connectivity between nodes and the network.
- **Perform Regular Health Checks**: Consistently evaluate the status of your nodes and clusters.
- **Monitor System Logs**: Keep an eye on logs for error messages or unusual activities.
- **Assess Resource Usage**: Ensure your nodes are neither over- nor under-utilized.
- **Automate Monitoring**: Use automation to ensure no issues go undetected.
- **Conduct Drills**: Regularly simulate failure scenarios to fine-tune your setup.
- **Update Regularly**: Keep your nodes and clusters updated with the latest software versions.

## Third-Party Services for Uptime Testing

- [updown.io](https://updown.io/)
- [Grafana Synthetic Monitoring](https://grafana.com/blog/2022/03/10/best-practices-for-alerting-on-synthetic-monitoring-metrics-in-grafana-cloud/)

## Key metrics to watch to verify node health based on jobs

**node_exporter:**

**CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should.

**Memory Usage**: If a node is consistently running out of memory, it could be due to a memory leak or simply under-provisioning.

**Disk I/O**: Slow disk operations can cause applications to hang or delay responses. High disk I/O can indicate storage performance issues or a sign of high load on the system.

**Network Usage**: High network traffic or packet loss can signal network configuration issues, or that a service is being overwhelmed by requests.

**Disk Space**: Running out of disk space can lead to application errors and data loss.

**Uptime**: The amount of time a system has been up without any restarts. Frequent restarts can indicate instability in the system.

**Error Rates**: The number of errors encountered by your application. This could be 4xx/5xx HTTP errors, exceptions, or any other kind of error your application may log.

**Latency**: The delay before a transfer of data begins following an instruction for its transfer.

It is also important to check:

- NTP clock skew
- Process restarts and failures (e.g. through `node_systemd`)
- High error and panic log counts (these are worth alerting on)
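
For a quick spot check outside of Grafana, the same node_exporter metrics can also be queried directly against the Prometheus HTTP API. Below is a minimal sketch, not part of this PR's scripts; the Prometheus URL and the CPU-usage query are illustrative assumptions, so adjust them to your own setup:

```python
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumption: a locally reachable Prometheus

# Percentage of CPU time spent in non-idle modes, averaged over 5 minutes.
CPU_USAGE_QUERY = '100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'

def query_prometheus(query):
    """Run an instant query against Prometheus and return the result vector."""
    response = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
    response.raise_for_status()
    return response.json()["data"]["result"]

if __name__ == "__main__":
    for sample in query_prometheus(CPU_USAGE_QUERY):
        # Each sample is {"metric": {...}, "value": [<timestamp>, "<value>"]}.
        print(f"CPU usage: {float(sample['value'][1]):.1f}%")
```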