✨ Add: Prometheus Federation 🚧 #439

Merged
merged 10 commits into from
Nov 17, 2023

@mrnicegyu11 mrnicegyu11 commented Nov 15, 2023

First attempt at adding federated scraping of metrics to Prometheus.

This PR introduces a federated Prometheus setup designed to optimize metrics storage and retention. It features two Prometheus instances: one holding a small number of metrics with a long retention period for long-term data analysis, and the other maintaining a large volume of metrics with a short retention period for real-time monitoring and troubleshooting. This setup aims to balance resource usage and data availability, improving overall system performance.

See #422

Further Reading:

-> I tested that this works on my local machine


This section CC @elisabettai

Common Prometheus naming & retention scheme:

Only metrics starting with osparc_ or s4l_ will be retained in the federated Prometheus for long-term storage.
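To make the retention rule concrete, here is a minimal sketch of the prefix check and of a federation request that would select only those metrics. The `/federate` endpoint and its `match[]` parameter are standard Prometheus, but the selector string, the function names, and the example host are assumptions for illustration and are not taken from this PR's configuration.

```python
from urllib.parse import urlencode

# Prefixes named in the PR description; metrics without them are not kept long-term.
RETAINED_PREFIXES = ("osparc_", "s4l_")

# Assumed selector for illustration; the actual match[] expression in the PR may differ.
FEDERATE_SELECTOR = '{__name__=~"^(osparc_|s4l_).*"}'


def is_retained(metric_name: str) -> bool:
    """Return True if the metric name carries one of the long-term prefixes."""
    return metric_name.startswith(RETAINED_PREFIXES)


def federate_url(base_url: str) -> str:
    """Build the URL a long-retention Prometheus would scrape on the short-retention one."""
    query = urlencode({"match[]": FEDERATE_SELECTOR})
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    print(is_retained("osparc_autoscaling_machines_active"))
    print(is_retained("node_cpu_seconds_total"))
    print(federate_url("http://prometheus:9090"))  # hypothetical host
```

Checking names against the prefix list before adding a new rule is a quick way to tell whether it will survive into long-term storage.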

Renamed metrics/rules:

  • node_cpu_seconds_total-nonidle-increase-over-nodes-12weeks-v2 -> osparc_node_cpu_seconds_total-nonidle-increase-over-nodes-12weeks
  • osparc_metrics:cpu_usage_per_node_percentage -> osparc_cpu_usage_per_node_percentage
  • simcore_simcore_service_director_services_started_total-sum_by_key_tag -> osparc_director_services_started_total_sum_by_key_tag
  • simcore_simcore_service_webserver_services_started_total-sum_by_key_tag -> osparc_webserver_services_started_total_sum_by_key_tag
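For downstream code that still queries the old names, the renames above could be captured as a lookup table along these lines. This is only a sketch: the mapping is copied from the list above, while `RENAMED_RULES` and `translate_rule_name` are hypothetical names and not part of this PR.

```python
# Old rule name -> new rule name, copied from the list of renamed rules above.
RENAMED_RULES = {
    "node_cpu_seconds_total-nonidle-increase-over-nodes-12weeks-v2":
        "osparc_node_cpu_seconds_total-nonidle-increase-over-nodes-12weeks",
    "osparc_metrics:cpu_usage_per_node_percentage":
        "osparc_cpu_usage_per_node_percentage",
    "simcore_simcore_service_director_services_started_total-sum_by_key_tag":
        "osparc_director_services_started_total_sum_by_key_tag",
    "simcore_simcore_service_webserver_services_started_total-sum_by_key_tag":
        "osparc_webserver_services_started_total_sum_by_key_tag",
}


def translate_rule_name(name: str) -> str:
    """Return the new name for a renamed rule; names that were not renamed pass through."""
    return RENAMED_RULES.get(name, name)


if __name__ == "__main__":
    print(translate_rule_name("osparc_metrics:cpu_usage_per_node_percentage"))
```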

Added rules:

  • osparc_autoscaling_machines_active
  • osparc_autoscaling_machines_buffer
  • osparc_container_instances_s4lcorelite

Removed rules:

  • http_requests_total-rate-5min
  • container_tasks_state-count_by_image
  • node_cpu_seconds_total:nonidle_increase_over_nodes_12weeks
  • node_cpu_seconds_total-nonidle-sum_over_nodes

We will have to backfill the rules: https://jessicagreben.medium.com/prometheus-fill-in-data-for-new-recording-rules-30a14ccb8467
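As a rough sketch of what the backfill could look like, the helper below assembles a `promtool tsdb create-blocks-from rules` invocation for a given time window. The subcommand and the `--start`, `--end`, and `--url` flags are from promtool itself; the function name, rule file name, URL, and dates are placeholders and not values from this deployment.

```python
from datetime import datetime, timezone


def backfill_command(rule_files, prometheus_url, start, end):
    """Build the promtool argv that backfills recording rules between start and end."""
    return [
        "promtool", "tsdb", "create-blocks-from", "rules",
        "--start", str(int(start.timestamp())),  # Unix timestamp in seconds
        "--end", str(int(end.timestamp())),
        "--url", prometheus_url,                 # Prometheus API to evaluate the rules against
        *rule_files,
    ]


if __name__ == "__main__":
    cmd = backfill_command(
        ["rules.yml"],                           # placeholder rule file
        "http://localhost:9090",                 # placeholder Prometheus URL
        datetime(2023, 11, 1, tzinfo=timezone.utc),
        datetime(2023, 11, 17, tzinfo=timezone.utc),
    )
    print(" ".join(cmd))
```

The generated blocks still have to be moved into the data directory of the Prometheus instance that should serve them, which this sketch does not cover.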

@mrnicegyu11 mrnicegyu11 added t:enhancement New feature or request observability alerting/monitoring t:infra-ops Adjustments to the way or resources with that microservices are run labels Nov 15, 2023
@mrnicegyu11 mrnicegyu11 added this to the 7peaks milestone Nov 15, 2023
@mrnicegyu11 mrnicegyu11 self-assigned this Nov 15, 2023
@mrnicegyu11 mrnicegyu11 marked this pull request as ready for review November 15, 2023 16:06

@sanderegg sanderegg left a comment

looking good. good luck!

@elisabettai elisabettai left a comment

Looks good, thanks for the notification!

Can you notify me when the rule name change takes effect, so I can update the code in the new metrics repo?

@mrnicegyu11 mrnicegyu11 changed the title Add: Prometheus Federation ✨ Add: Prometheus Federation 🚧 Nov 17, 2023
@mrnicegyu11 mrnicegyu11 enabled auto-merge (squash) November 17, 2023 13:04
@mrnicegyu11 mrnicegyu11 disabled auto-merge November 17, 2023 13:04
@mrnicegyu11 mrnicegyu11 merged commit dae66c4 into ITISFoundation:main Nov 17, 2023
2 checks passed
4 participants