✨ Add: Prometheus Federation 🚧 #439
looking good. good luck!
Looks good, thanks for the notification!
Can you notify me when the rule name change takes effect, so that I can update the code in the new metrics repo?
First attempt at adding federated scraping of metrics to Prometheus.
This PR introduces a federated Prometheus setup designed to optimize metrics storage and retention. It features two Prometheus instances: one holding a small number of metrics with a long retention period for long-term data analysis, and the other maintaining a large volume of metrics with a short retention period for real-time monitoring and troubleshooting. This setup aims to balance resource usage and data availability, improving overall system performance.
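As a rough illustration of the setup described above, the long-retention instance could scrape a filtered subset of series from the short-retention instance through Prometheus's `/federate` endpoint. This is a minimal sketch and not the configuration from this PR: the job name, the target host `prometheus-short-term:9090`, and the scrape interval are assumptions, and only the `osparc_` prefix is matched here.

```yaml
# Sketch of a scrape config for the long-retention (federated) Prometheus.
# Hostname, job name, and interval are hypothetical placeholders.
scrape_configs:
  - job_name: "federate"
    scrape_interval: 60s
    # Keep the labels exposed by the source instance instead of overwriting them.
    honor_labels: true
    metrics_path: "/federate"
    params:
      # Pull only series whose metric name starts with the long-term prefix.
      "match[]":
        - '{__name__=~"osparc_.*"}'
    static_configs:
      - targets:
          - "prometheus-short-term:9090"  # assumed address of the short-retention instance
```

Retention itself is not part of this file. It is set per instance with the `--storage.tsdb.retention.time` command-line flag, so each of the two instances would be started with its own value.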
See #422
Further Reading:
-> I tested that this works on my local machine
This section CC @elisabettai
Common Prometheus naming & retention scheme:
Only metrics starting with
osparc_
ors4l_
will be retained in the federated Prometheus for long-term storage.
Renamed metrics/rules:
node_cpu_seconds_total-nonidle-increase-over-nodes-12weeks-v2 -> osparc_node_cpu_seconds_total-nonidle-increase-over-nodes-12weeks
osparc_metrics:cpu_usage_per_node_percentage -> osparc_cpu_usage_per_node_percentage
simcore_simcore_service_director_services_started_total-sum_by_key_tag -> osparc_director_services_started_total_sum_by_key_tag
simcore_simcore_service_webserver_services_started_total-sum_by_key_tag -> osparc_webserver_services_started_total_sum_by_key_tag
Added rules:
Removed rules:
http_requests_total-rate-5min
container_tasks_state-count_by_image
node_cpu_seconds_total:nonidle_increase_over_nodes_12weeks
node_cpu_seconds_total-nonidle-sum_over_nodes
We will have to backfill the rules:
https://jessicagreben.medium.com/prometheus-fill-in-data-for-new-recording-rules-30a14ccb8467