Add utility scripts #323
Closed
5 commits
b7fd81d
Monitoing & Alerting guide
haroldsphinx b4526ba
Merge branch 'main' of github.com:ObolNetwork/obol-docs
haroldsphinx df7f8a7
update
haroldsphinx d0ca1f4
Utility Script
haroldsphinx 381280a
Utility Script to import alerts and dashboard
haroldsphinx
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
# Grafana Dashboard and Alert Management Script | ||
|
||
This Python script facilitates the export and import of Grafana dashboards and the import of Alertmanager configurations into a Grafana instance. It's designed to work with Grafana v8 and above. | ||
|
||
## Prerequisites | ||
|
||
- Python 3.x installed on your machine. | ||
- `requests` and `python-dotenv` libraries installed (the script imports both). You can install them using pip: | ||
```sh | ||
pip install requests python-dotenv | ||
``` | ||
- Access to a Grafana instance (v8 and above) with API access enabled. | ||
- A Grafana API key with permissions to view and create dashboards and manage Alertmanager configurations. | ||
|
||
## Configuration | ||
|
||
Before running the script, you need to set up a few configurations: | ||
|
||
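1. **Grafana URL and API Key**: Set the `GRAFANA_URL` and `GRAFANA_API_KEY` environment variables, or define them in a `.env` file, to your Grafana instance URL and API key. The script calls `load_dotenv()` and reads them into its `grafana_url` and `grafana_api_key` variables. | ||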
|
||
2. **Export and Import Paths**: Set the `export_path` and `import_path` variables to the desired locations on your filesystem where the dashboards should be exported to or imported from. | ||
|
||
3. **Alertmanager Configuration File**: Ensure you have an Alertmanager configuration file in YAML format ready for import. Update the `alert_config_path` variable with the path to this file. | ||
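
The script in this pull request reads `GRAFANA_URL` and `GRAFANA_API_KEY` from the environment after calling `load_dotenv()`. The following is a minimal sketch of how those two values could be validated before any request is made. The `GrafanaSettings` class and `load_settings` function are hypothetical names and are not part of the script.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class GrafanaSettings:
    """Hypothetical container for the two values the script needs."""
    url: str
    api_key: str


def load_settings(env=os.environ) -> GrafanaSettings:
    """Read GRAFANA_URL and GRAFANA_API_KEY and fail early if either is missing."""
    # Strip a trailing slash so that url + "/api/..." never contains "//".
    url = (env.get("GRAFANA_URL") or "").strip().rstrip("/")
    api_key = (env.get("GRAFANA_API_KEY") or "").strip()

    required = (("GRAFANA_URL", url), ("GRAFANA_API_KEY", api_key))
    missing = [name for name, value in required if not value]
    if missing:
        raise ValueError("Missing required setting(s): " + ", ".join(missing))

    return GrafanaSettings(url=url, api_key=api_key)
```

Failing early with a clear message avoids the less obvious `TypeError` that `"Bearer " + None` would raise when the API key is unset.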
|
||
## Running the Script | ||
|
||
To run the script, follow these steps: | ||
|
||
1. Open your terminal or command prompt. | ||
|
||
2. Navigate to the directory where the script is located: | ||
```sh | ||
cd path/to/script_directory | ||
``` | ||
|
||
3. Run the script using Python: | ||
```sh | ||
python script.py | ||
``` | ||
|
||
### What the Script Does | ||
|
||
- **Export Dashboards**: Fetches all dashboards from the specified Grafana instance and saves them as JSON files in the directory specified by `export_path`. | ||
|
||
- **Import Dashboards**: Reads dashboard JSON files from the directory specified by `import_path` and imports them into the specified Grafana instance. | ||
|
||
- **Import Alertmanager Configuration**: Imports an Alertmanager configuration from a YAML file into the specified Grafana instance. | ||
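
The script as submitted selects an operation by commenting or uncommenting the function calls at the bottom of the file. As one possible alternative, the sketch below chooses the operation from the command line. The action names, the `--path` flag, and the `resolve` helper are assumptions made for illustration; the default paths are copied from the script.

```python
import argparse

# Defaults copied from the script's export_path, import_path and alert_config_path.
DEFAULT_PATHS = {
    "export": "./dashboards",
    "import": "./dashboards",
    "alerts": "./dashboard/alertmanager_config.yml",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one positional action and an optional path override."""
    parser = argparse.ArgumentParser(
        description="Export or import Grafana dashboards, or import alert configuration."
    )
    parser.add_argument("action", choices=sorted(DEFAULT_PATHS))
    parser.add_argument("--path", default=None, help="override the default path for the action")
    return parser


def resolve(argv):
    """Return (action, path) for the given argument list."""
    args = build_parser().parse_args(argv)
    return args.action, args.path or DEFAULT_PATHS[args.action]
```

With this in place, the returned action could be mapped to `export_dashboards_from_grafana`, `import_dashboards_to_grafana`, or `import_exported_alerts`.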
|
||
## Notes | ||
|
||
- Ensure the Grafana API key provided has the necessary permissions to perform the actions required by the script. | ||
- The script assumes Grafana v8 and above for compatibility with the Alertmanager configuration import feature. | ||
- Adjust file paths and configurations according to your operating system and environment. | ||
|
||
## Troubleshooting | ||
|
||
- **API Key Permissions**: If you encounter permissions errors, ensure your Grafana API key has the correct roles assigned. | ||
- **File Paths**: Ensure file paths are correct and accessible by the script. This is a common issue when running the script across different operating systems. |
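
A small pre-flight check can surface path problems before any API call is made. The sketch below uses only the standard library. The function name `find_path_problems` is hypothetical, and the directory layout (dashboard `.json` files somewhere under the import directory) is an assumption based on how the script walks `import_path`.

```python
from pathlib import Path
from typing import List


def find_path_problems(import_path: str, alert_config_path: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means the paths look usable."""
    problems = []

    import_dir = Path(import_path).expanduser()
    if not import_dir.is_dir():
        problems.append(f"import path is not a directory: {import_dir}")
    elif not any(p.suffix == ".json" for p in import_dir.rglob("*")):
        problems.append(f"no .json dashboards found under: {import_dir}")

    alert_file = Path(alert_config_path).expanduser()
    if not alert_file.is_file():
        problems.append(f"alert configuration file not found: {alert_file}")

    return problems
```

Using `pathlib` keeps the check independent of the path separator, which is the usual source of trouble when moving between operating systems.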
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,140 @@ | ||
import os | ||
import json | ||
import logging | ||
import requests | ||
import urllib.parse | ||
from dotenv import load_dotenv | ||
|
||
load_dotenv() | ||
grafana_url = os.getenv("GRAFANA_URL") | ||
grafana_api_key = os.getenv("GRAFANA_API_KEY") | ||
|
||
export_path = "./dashboards" | ||
import_path = "./dashboards" | ||
alert_config_path = "./dashboard/alertmanager_config.yml" | ||
|
||
def export_dashboards_from_grafana(grafana_url, grafana_api_key, export_path): | ||
    """Export every dashboard in each Grafana folder to JSON files under export_path.""" | ||
    headers = {"Authorization": "Bearer " + grafana_api_key} | ||
    folder_api_endpoint = "/api/folders" | ||
|
||
    try: | ||
        response = requests.get(grafana_url + folder_api_endpoint, headers=headers) | ||
        response.raise_for_status() | ||
    except requests.exceptions.RequestException as e: | ||
        logging.error(f"Error getting Grafana folders: {str(e)}") | ||
        raise SystemExit(1) | ||
|
||
    folders = json.loads(response.text) | ||
|
||
    for folder in folders: | ||
        folder_name = folder['title'].replace("/", "_") | ||
        folder_id = folder['id'] | ||
        # One sub-directory per Grafana folder, created under export_path. | ||
        folder_path = os.path.join(export_path, folder_name) | ||
|
||
        dashboard_api_endpoint = "/api/search?folderIds=" + str(folder_id) | ||
        response = requests.get(grafana_url + dashboard_api_endpoint, headers=headers) | ||
        dashboards = json.loads(response.text) | ||
|
||
        os.makedirs(folder_path, exist_ok=True) | ||
|
||
        for dashboard in dashboards: | ||
            dashboard_name = dashboard['title'].replace("/", "_") | ||
            dashboard_api_endpoint = "/api/dashboards/uid/" + str(dashboard['uid']) | ||
|
||
            response = requests.get(grafana_url + dashboard_api_endpoint, headers=headers) | ||
            dashboard_data = json.loads(response.text) | ||
            # Drop the 'meta' block so only the 'dashboard' model is written to disk. | ||
            dashboard_data.pop('meta', None) | ||
|
||
            dashboard_file_path = os.path.join(folder_path, dashboard_name + ".json") | ||
|
||
            try: | ||
                with open(dashboard_file_path, "w") as f: | ||
                    f.write(json.dumps(dashboard_data, indent=4)) | ||
            except Exception as e: | ||
                print(f"Error exporting dashboard {dashboard_name} in folder {folder_name}: {e}") | ||
|
||
    print("Dashboard export complete!") | ||
|
||
|
||
def import_dashboards_to_grafana(grafana_url, grafana_api_key, import_path): | ||
    """Create one Grafana folder per sub-directory of import_path and import its dashboard JSON files.""" | ||
    headers = {"Authorization": "Bearer " + grafana_api_key} | ||
|
||
    for folder_name in os.listdir(import_path): | ||
        folder_path = os.path.join(import_path, folder_name) | ||
|
||
        if not os.path.isdir(folder_path): | ||
            continue | ||
|
||
        folder_api_endpoint = "/api/folders" | ||
        folder_payload = {"title": folder_name} | ||
        response = requests.post(grafana_url + folder_api_endpoint, headers=headers, json=folder_payload) | ||
|
||
        # Skip this folder if Grafana rejected the request (for example, the folder already exists). | ||
        if not response.ok: | ||
            print(f"Error creating folder {folder_name}: {response.text}") | ||
            continue | ||
|
||
        folder = json.loads(response.text) | ||
        folder_id = folder['id'] | ||
        folder_uid = folder['uid'] | ||
|
||
        print(f"Created folder {folder_name} with ID {folder_id} and UID {folder_uid}") | ||
|
||
        for filename in os.listdir(folder_path): | ||
            dashboard_path = os.path.join(folder_path, filename) | ||
|
||
            if not dashboard_path.endswith(".json"): | ||
                continue | ||
|
||
            with open(dashboard_path, "r") as f: | ||
                dashboard_data = json.load(f) | ||
|
||
            dashboard_data = { | ||
                "dashboard": dashboard_data['dashboard'], | ||
                "folderId": 0, | ||
                "folderUid": folder_uid, | ||
                "message": "Imported from JSON", | ||
                "overwrite": True | ||
            } | ||
|
||
            dashboard_data['dashboard']['folderId'] = folder_id | ||
            dashboard_data['dashboard']['folderTitle'] = folder_name | ||
|
||
            # f-string so that grafana_url is actually interpolated into the endpoint. | ||
            dashboard_api_endpoint = f"{grafana_url}/api/dashboards/db" | ||
            response = requests.post(dashboard_api_endpoint, headers=headers, json=dashboard_data) | ||
|
||
            if response.status_code == 200: | ||
                print(f"Dashboard {filename} imported successfully!") | ||
            else: | ||
                print(f"Error importing dashboard {filename}: {response.text}") | ||
|
||
    print("Dashboard import complete!") | ||
|
||
|
||
def import_exported_alerts(grafana_url, grafana_api_key, alert_config_path): | ||
    """ | ||
    Args: | ||
    - grafana_url: URL of the Grafana instance. | ||
    - grafana_api_key: API key for authentication. | ||
    - alert_config_path: Path to the Alertmanager configuration file (YAML format). | ||
    """ | ||
|
||
    headers = {"Authorization": "Bearer " + grafana_api_key, "Content-Type": "application/yaml"} | ||
    alertmanager_api_endpoint = "/api/alertmanager/grafana/config/api/v1/alerts" | ||
|
||
    try: | ||
        with open(alert_config_path, 'r') as file: | ||
            alert_config = file.read() | ||
|
||
        response = requests.post(grafana_url + alertmanager_api_endpoint, headers=headers, data=alert_config) | ||
|
||
        if response.status_code == 200: | ||
            print("Alertmanager configuration imported successfully!") | ||
        else: | ||
            print(f"Failed to import Alertmanager configuration: {response.text}") | ||
|
||
    except Exception as e: | ||
        print(f"Error importing Alertmanager configuration: {e}") | ||
|
||
|
||
# export_dashboards_from_grafana(grafana_url, grafana_api_key, export_path) | ||
# import_dashboards_to_grafana(grafana_url, grafana_api_key, import_path) | ||
import_exported_alerts(grafana_url, grafana_api_key, alert_config_path) |
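
When dashboards exported from one Grafana instance are imported into another, the numeric `id` stored in the exported JSON usually does not exist on the target. The sketch below shows one way to build the request body for `/api/dashboards/db` with that in mind. The helper name `build_import_payload` is hypothetical, and clearing `id` so that Grafana matches on `uid` instead is an assumption about the desired behaviour, not something the script above does.

```python
import copy


def build_import_payload(exported: dict, folder_uid: str,
                         message: str = "Imported from JSON") -> dict:
    """Build a dashboard import body from an exported JSON document.

    Accepts both the wrapped form written by the export function
    ({"dashboard": {...}}) and a bare dashboard model.
    """
    # Copy so the caller's data is not modified.
    dashboard = copy.deepcopy(exported.get("dashboard", exported))
    # Assumption: a null id lets the target instance assign its own id.
    dashboard["id"] = None
    return {
        "dashboard": dashboard,
        "folderUid": folder_uid,
        "message": message,
        "overwrite": True,
    }
```

The returned dictionary could be passed as the `json=` argument of `requests.post`, in place of the inline payload built inside `import_dashboards_to_grafana`.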
94 changes: 94 additions & 0 deletions
versioned_docs/version-v0.17.0/int/quickstart/monitoring.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,94 @@ | ||
# Getting Started Monitoring your Node | ||
|
||
This guide explains how to monitor your Charon cluster and nodes, and how to set up alerts based on specified parameters. | ||
|
||
## Pre-requisites | ||
|
||
Ensure the following software is installed: | ||
|
||
- Docker: Find the installation guide for Ubuntu **[here](https://docs.docker.com/engine/install/ubuntu/)** | ||
- Prometheus: You can install it using the guide available **[here](https://prometheus.io/docs/prometheus/latest/installation/)** | ||
- Grafana: Follow this **[link](https://grafana.com/docs/grafana/latest/setup-grafana/installation/)** to install Grafana | ||
|
||
## Import Pre-Configured Charon Dashboards | ||
|
||
- Navigate to the **[repository](https://github.com/ObolNetwork/terraform-modules/tree/main/grafana-dashboards/dashboards)** that contains a variety of Grafana dashboards. This guide uses the Charon Dashboard JSON. | ||
- In your Grafana interface, create a new dashboard and select the import option. | ||
|
||
![Screenshot 2023-06-26 at 1.00.05 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/2bba3f52-ff32-452e-811b-f2ac7a4905fb/Screenshot_2023-06-26_at_1.00.05_PM.png) | ||
|
||
- Copy the content of the Charon Dashboard JSON from the repository and paste it into the import box in Grafana. Click "Load" to proceed. | ||
|
||
![Screenshot 2023-06-26 at 1.03.08 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/6790e67a-eb51-4bfb-b7b1-df14f214b72d/Screenshot_2023-06-26_at_1.03.08_PM.png) | ||
|
||
- Finalize the import by clicking on the "Import" button. At this point, your dashboard should begin displaying metrics. Ensure your Charon client and Prometheus are operational for this to occur. | ||
|
||
![Screenshot 2023-06-26 at 1.16.27 PM.png](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/cc0b4a9e-c21c-4ce4-b613-9c3f84e696ed/Screenshot_2023-06-26_at_1.16.27_PM.png) | ||
|
||
## Example alerting rules | ||
|
||
- Alerts for Node-Exporter can be created using the sample rules provided here | ||
|
||
[Awesome Prometheus alerts](https://samber.github.io/awesome-prometheus-alerts/rules.html#host-and-hardware) | ||
|
||
- For Charon/Alpha alerts, refer to the alerting rules available | ||
|
||
[monitoring/alerting-rules at main · ObolNetwork/monitoring](https://github.com/ObolNetwork/monitoring/tree/main/alerting-rules) | ||
|
||
## Understanding Alert rules | ||
|
||
1. `AlphaClusterBeaconNodeDown`: This alert is activated when the beacon node in a specified Alpha cluster is offline. The beacon node is crucial for validating transactions and producing new blocks. Its unavailability could disrupt the overall functionality of the cluster. | ||
2. `AlphaClusterBeaconNodeSyncing`: This alert indicates that the beacon node in a specified Alpha cluster is synchronizing, i.e., catching up with the latest blocks in the cluster. | ||
3. `AlphaClusterNodeDown`: This alert is activated when a node in a specified Alpha cluster is offline. | ||
4. `AlphaClusterMissedAttestations`: This alert indicates that there have been missed attestations in a specified Alpha cluster. Missed attestations may suggest that validators are not operating correctly, compromising the security and efficiency of the cluster. | ||
5. `AlphaClusterInUnknownStatus`: This alert is designed to activate when a node within the "Alpha M1 Cluster #1" is detected to be in an unknown state. The condition is evaluated by checking whether the maximum of the `app_monitoring_readyz` metric is 0. | ||
6. `AlphaClusterInsufficientPeers`: This alert is set to activate when the number of peers for a node in the "Alpha M1 Cluster #1" is insufficient. The condition is evaluated by checking whether the maximum of the `app_monitoring_readyz` metric equals 4. | ||
7. `AlphaClusterFailureRate`: This alert is activated when the failure rate of the "Alpha M1 Cluster #1" exceeds a certain threshold. | ||
8. `AlphaClusterVCMissingValidators`: This alert is activated if any validators in the "Alpha M1 Cluster #1" are missing. | ||
9. `AlphaClusterHighPctFailedSyncMsgDuty`: This alert is activated if a high percentage of sync message duties failed in the "Alpha M1 Cluster #1". The alert is activated if the sum of the increase in failed duties tagged with "sync_message" in the last hour divided by the sum of the increase in total duties tagged with "sync_message" in the last hour is greater than 0.1. | ||
10. `AlphaClusterNumConnectedRelays`: This alert is activated if the number of connected relays in the "Alpha M1 Cluster #1" falls to 0. | ||
11. `PeerPingLatency`: This alert is activated if the 90th percentile of the ping latency to the peers in a cluster exceeds 500ms within 2 minutes. | ||
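
To make the `AlphaClusterHighPctFailedSyncMsgDuty` condition concrete, the sketch below reproduces its arithmetic in plain Python: failed sync message duties over the last hour divided by total sync message duties over the same hour, compared against 0.1. The function names and the sample numbers are illustrative assumptions; the real rule is evaluated by Prometheus, not by this code.

```python
def sync_msg_failure_ratio(failed_increase: float, total_increase: float) -> float:
    """Fraction of sync message duties that failed over the window."""
    if total_increase <= 0:
        # No duties were scheduled in the window, so there is nothing to alert on.
        return 0.0
    return failed_increase / total_increase


def should_fire(failed_increase: float, total_increase: float,
                threshold: float = 0.1) -> bool:
    """True when the failure ratio is strictly greater than the threshold."""
    return sync_msg_failure_ratio(failed_increase, total_increase) > threshold
```

For example, 12 failed duties out of 100 gives a ratio of 0.12, which is above the threshold, while 10 out of 100 gives exactly 0.1 and does not fire because the comparison is strict.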
|
||
## Best Practices for Monitoring Charon Nodes & Cluster | ||
|
||
- **Establish Baselines**: Familiarize yourself with the normal operation metrics like CPU, memory, and network usage. This will help you detect anomalies. | ||
- **Define Key Metrics**: Set up alerts for essential metrics, encompassing both system-level and Charon-specific ones. | ||
- **Configure Alerts**: Based on these metrics, set up actionable alerts. | ||
- **Monitor Network**: Regularly assess the connectivity between nodes and the network. | ||
- **Perform Regular Health Checks**: Consistently evaluate the status of your nodes and clusters. | ||
- **Monitor System Logs**: Keep an eye on logs for error messages or unusual activities. | ||
- **Assess Resource Usage**: Ensure your nodes are neither over- nor under-utilized. | ||
- **Automate Monitoring**: Use automation to ensure no issues go undetected. | ||
- **Conduct Drills**: Regularly simulate failure scenarios to fine-tune your setup. | ||
- **Update Regularly**: Keep your nodes and clusters updated with the latest software versions. | ||
|
||
## Third-Party Services for Uptime Testing | ||
|
||
- [updown.io](https://updown.io/) | ||
- [Grafana Synthetic Monitoring](https://grafana.com/blog/2022/03/10/best-practices-for-alerting-on-synthetic-monitoring-metrics-in-grafana-cloud/) | ||
|
||
## Key metrics to watch to verify node health based on jobs | ||
|
||
**node_exporter:** | ||
|
||
**CPU Usage**: High or spiking CPU usage can be a sign of a process demanding more resources than it should. | ||
|
||
**Memory Usage**: If a node is consistently running out of memory, it could be due to a memory leak or simply under-provisioning. | ||
|
||
**Disk I/O**: Slow disk operations can cause applications to hang or delay responses. High disk I/O can indicate storage performance issues or a sign of high load on the system. | ||
|
||
**Network Usage**: High network traffic or packet loss can signal network configuration issues, or that a service is being overwhelmed by requests. | ||
|
||
**Disk Space**: Running out of disk space can lead to application errors and data loss. | ||
|
||
**Uptime**: The amount of time a system has been up without any restarts. Frequent restarts can indicate instability in the system. | ||
|
||
**Error Rates**: The number of errors encountered by your application. This could be 4xx/5xx HTTP errors, exceptions, or any other kind of error your application may log. | ||
|
||
**Latency**: The delay before a transfer of data begins following an instruction for its transfer. | ||
|
||
It is also important to check: | ||
|
||
- NTP clock skew | ||
- Process restarts and failures (e.g. through `node_systemd`) | ||
- Error and panic log counts, alerting when they are high |
Was this an accidental commit of something stashed, or was it meant to come in with this PR? If it was, why is it in the v0.17.0 folder? Generally stuff goes in the `docs` folder, and when a version is cut, it gets copied to the `versioned_docs` folder.