Improve monitoring of WMAgents #12169

Open
vkuznet opened this issue Nov 18, 2024 · 0 comments

vkuznet commented Nov 18, 2024

Impact of the new feature
With more accessible monitoring of remote services, including those running within containers, we will get a better understanding of their performance issues and resource requirements. Monitoring can also be used within the k8s infrastructure to autoscale services based on custom metrics.

Is your feature request related to a problem? Please describe.
At the moment we have some monitoring for central services, which run on the k8s infrastructure. But we lack monitoring of the services running within containers on WMAgent nodes.

Describe the solution you'd like
Add Prometheus and AlertManager static executables directly to the WMA containers and configure them to write metrics to the remote CMS Monitoring end-point. This will require the following actions to be completed:

  • add a /metrics end-point to every WM service that does not already expose one
  • set up Prometheus within the docker image and start it together with all other services
  • configure Prometheus to write metrics to the remote CMS Monitoring end-point, e.g.
remote_write:
  - url: http://xxx.cern.ch:port/api/v1/write
    remote_timeout: 30s
    queue_config:
      capacity: 100000
      max_shards: 30
      max_samples_per_send: 10000
      batch_send_deadline: 5s
      # max_retries: 10  (removed from queue_config in Prometheus 2.11; omit on newer releases)
      min_backoff: 30ms
      max_backoff: 100ms
  • define Prometheus rules for per-service and aggregated metrics
  • configure AlertManager to act upon (aggregated) metrics
  • adjust the service start-up scripts to also start Prometheus and AlertManager
  • once metrics are received by the CMS Monitoring infrastructure, set up the necessary Grafana dashboards for services running within distributed WMAgents.
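
As an illustration of the first step above, here is a minimal sketch of what a /metrics end-point could look like for a Python service. It uses only the standard library and renders gauges in the Prometheus text exposition format. The metric name `wmagent_component_up`, the `Metrics` class and the `start_metrics_server` helper are hypothetical names chosen for this example; they are not part of WMCore. In practice the official `prometheus_client` package would likely be a better fit.

```python
"""Minimal /metrics end-point sketch using only the standard library."""
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Metrics:
    """Thread-safe registry of gauges rendered in Prometheus text format."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}  # metric name -> (help text, value)

    def set_gauge(self, name, value, help_text=""):
        """Record the current value of a gauge (hypothetical helper)."""
        with self._lock:
            self._values[name] = (help_text, float(value))

    def render(self):
        """Return all gauges as Prometheus text exposition lines."""
        lines = []
        with self._lock:
            for name, (help_text, value) in sorted(self._values.items()):
                if help_text:
                    lines.append(f"# HELP {name} {help_text}")
                lines.append(f"# TYPE {name} gauge")
                lines.append(f"{name} {value}")
        return "".join(line + "\n" for line in lines)


METRICS = Metrics()


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the registry on GET /metrics and answer 404 elsewhere."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = METRICS.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the service log free of scrape noise


def start_metrics_server(port=0, host="127.0.0.1"):
    """Start the end-point in a daemon thread; port=0 picks a free port."""
    server = ThreadingHTTPServer((host, port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A service would call `METRICS.set_gauge("wmagent_component_up", 1, "...")` from its main loop and `start_metrics_server(port=...)` once at start-up; the local Prometheus instance in the container would then scrape that port and forward the samples via remote_write.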

Describe alternatives you've considered
None, other than continuing to use WMAgent as is.

Additional context
This can be a meta-issue, since the independent steps can be split into separate issues.

Please also see USCMS WM Monitoring and Alerts talk for more details.
