RHINENG-14720: Add metrics for counting host publication checks and creations #2127

Open
wants to merge 1 commit into
base: master

Conversation

thearifismail
Contributor

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

Overview

This PR addresses RHINENG-14720.

PR Checklist

  • Keep PR title short, ideally under 72 characters
  • Descriptive comments provided in complex code blocks
  • Include raw query examples in the PR description, if adding/modifying SQL query
  • Tests: validate optimal/expected output
  • Tests: validate exceptions and failure scenarios
  • Tests: edge cases
  • Recovers or fails gracefully during potential resource outages (e.g. DB, Kafka)
  • Uses type hinting, if convenient
  • Documentation, if this PR changes the way other services interact with host inventory
  • Links to related PRs

Secure Coding Practices Documentation Reference

You can find documentation on this checklist here.

Secure Coding Checklist

  • Input Validation
  • Output Encoding
  • Authentication and Password Management
  • Session Management
  • Access Control
  • Cryptographic Practices
  • Error Handling and Logging
  • Data Protection
  • Communication Security
  • System Configuration
  • Database Security
  • File Management
  • Memory Management
  • General Coding Practices

@thearifismail thearifismail requested a review from a team as a code owner December 10, 2024 21:50
@computercamplove
Contributor

/retest

@chambridge
Contributor

I'm not sure that adding Prometheus metrics to this job will accomplish the goal here. This job doesn't have an API layer with a metrics endpoint that Prometheus could be configured to scrape, so these metrics would not be consumed. The only way to introduce the metrics would be to push them to a Prometheus Pushgateway. While that is possible, I think it's simpler to use the already available Kubernetes metrics that report job status (specifically failure) via kube_job_status_failed. When a traceback occurs or exit(1) is called because of an inactive replication slot, the failed count for the job increases. So an alert can be driven off of this with something like the following:

sum(kube_job_status_failed{namespace="<your-namespace>"}) - sum(kube_job_status_failed{namespace="<your-namespace>"} offset 10m) > 0
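To make the intent of that expression concrete, here is a minimal Python sketch of the comparison it performs: total failures now, minus total failures ten minutes ago, alerting when the difference is positive. The function names and the dictionaries of per-job sample values are hypothetical, used only for illustration; Prometheus evaluates the real expression itself.

```python
from typing import Mapping


def failed_jobs_delta(now: Mapping[str, int], ten_min_ago: Mapping[str, int]) -> int:
    """Mimic sum(metric) - sum(metric offset 10m).

    Both arguments map a (hypothetical) job name to its
    kube_job_status_failed sample value at that point in time.
    """
    return sum(now.values()) - sum(ten_min_ago.values())


def should_alert(now: Mapping[str, int], ten_min_ago: Mapping[str, int]) -> bool:
    """Mirror the trailing `> 0` comparison in the alert expression."""
    return failed_jobs_delta(now, ten_min_ago) > 0
```

Note that the difference can also be negative, for example when older failed jobs have been cleaned up and no longer report samples; in that case the `> 0` condition simply does not fire.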

@thearifismail
Contributor Author

thearifismail commented Dec 11, 2024

@chambridge I forgot to add the http_server to serve the metrics. Yes, I agree that using the Kubernetes-provided metrics is easier, since it takes less work. I also plan to include job_name to narrow down the source of the error:

sum(kube_job_status_failed{namespace="host-inventory-prod", job_name=~"syndicator-.*"}) - sum(kube_job_status_failed{namespace="host-inventory-prod", job_name=~"syndicator-.*"} offset 10m) > 0
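If the same alert is needed for several namespaces or job-name patterns, the expression could be generated rather than copied by hand. The helper below is a hypothetical sketch (the function name and its parameters are not part of this PR); it only assembles the string shown above and does not talk to Prometheus.

```python
from typing import Optional


def failed_job_alert_expr(
    namespace: str,
    job_name_regex: Optional[str] = None,
    window: str = "10m",
) -> str:
    """Build the kube_job_status_failed alert expression as a string."""
    # Label matchers shared by both sides of the subtraction.
    matchers = [f'namespace="{namespace}"']
    if job_name_regex is not None:
        matchers.append(f'job_name=~"{job_name_regex}"')
    selector = "kube_job_status_failed{" + ", ".join(matchers) + "}"
    # Current total minus the total one window ago; positive means new failures.
    return f"sum({selector}) - sum({selector} offset {window}) > 0"


if __name__ == "__main__":
    print(failed_job_alert_expr("host-inventory-prod", "syndicator-.*"))
```

Called with `"host-inventory-prod"` and `"syndicator-.*"`, it reproduces the expression in this comment; omitting the regex yields the namespace-only form suggested earlier in the thread.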

3 participants