Enhance Rollout controller metrics to better track analysis states and outcomes over time #4008

Open
jahvon opened this issue Dec 17, 2024 · 0 comments
Labels
enhancement New feature or request

jahvon commented Dec 17, 2024

Summary

Currently, tracking the state of Rollout analyses and their outcomes over time requires complex PromQL queries and recording rules, because the existing metrics (rollout_phase and analysis_run_phase) are gauges that only report the current phase. This makes it difficult to:

  1. Track the latest analysis state for a Rollout
  2. Count how many times a specific Rollout has been rolled back or progressed after an analysis run
  3. Understand the historical progression of analysis runs
  4. Calculate key metrics like Change Failure Rate (CFR) for services using Rollouts

I can think of two options:

  1. Add "latest" and "rollout" labels to the existing analysis_run_* gauge metrics:
analysis_run_phase{phase="Successful", rollout="my-svc", latest="true"} 1

This makes the current state easy to identify without adding a new metric, although it would increase cardinality somewhat.

  2. Add counter metrics for state transitions and outcomes:
rollout_analysis_transitions_total{rollout="my-svc", from_phase="Completed", to_phase="Aborted"} 4
rollout_analysis_outcomes_total{rollout="my-svc", outcome="rollback"} 2
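
To make the counter option concrete, here is a minimal sketch of the bookkeeping it implies, written in plain Python with no Prometheus client library. The class name `AnalysisMetrics` and its methods are hypothetical and are not part of Argo Rollouts; only the metric and label names come from the samples above. A real implementation would live in the controller and use its existing metrics registry.

```python
from collections import Counter


class AnalysisMetrics:
    """Hypothetical in-memory stand-in for the two proposed counters."""

    def __init__(self):
        # (rollout, from_phase, to_phase) -> number of observed transitions
        self.transitions = Counter()
        # (rollout, outcome) -> number of observed outcomes
        self.outcomes = Counter()

    def record_transition(self, rollout, from_phase, to_phase):
        # Only count real phase changes; repeated reconciles in the same
        # phase must not inflate the counter.
        if from_phase != to_phase:
            self.transitions[(rollout, from_phase, to_phase)] += 1

    def record_outcome(self, rollout, outcome):
        self.outcomes[(rollout, outcome)] += 1

    def render(self):
        """Return sample lines in the same shape as the examples above."""
        lines = []
        for (rollout, src, dst), n in sorted(self.transitions.items()):
            lines.append(
                f'rollout_analysis_transitions_total{{rollout="{rollout}", '
                f'from_phase="{src}", to_phase="{dst}"}} {n}'
            )
        for (rollout, outcome), n in sorted(self.outcomes.items()):
            lines.append(
                f'rollout_analysis_outcomes_total{{rollout="{rollout}", '
                f'outcome="{outcome}"}} {n}'
            )
        return lines
```

Because the counters only ever increase, the history of transitions is preserved between scrapes, which is what the gauge metrics cannot provide on their own.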

Are there other options that folks can think of here?

Use Cases

My organization manages numerous Kubernetes services and is evaluating Argo Rollouts for progressive delivery. These enhanced metrics would:

  1. Enable tracking of deployment success rates per service
  2. Allow calculation of key reliability metrics (e.g., Change Failure Rate)
  3. Provide historical insights into rollout patterns and failure modes
  4. Simplify integration with existing monitoring and alerting systems

The current metrics require complex PromQL manipulations that are both fragile and potentially unreliable for these use cases. These enhancements would make it significantly easier to monitor and analyze rollout behavior at scale.
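
As an illustration of the Change Failure Rate use case, the sketch below shows how CFR could be derived from two snapshots of an outcomes counter such as the proposed rollout_analysis_outcomes_total. The function names and the "promoted" outcome value are assumptions made for this example; only outcome="rollback" appears in the proposal. In practice this arithmetic would be done in PromQL over a time window, and Prometheus's increase() also handles counter resets, which this sketch does not.

```python
def window_increase(start, end):
    """Per-outcome increase between two counter snapshots.

    Both arguments map an outcome label value to a counter reading.
    Counter resets are not handled in this sketch.
    """
    return {k: end[k] - start.get(k, 0) for k in end}


def change_failure_rate(outcomes, failure_outcome="rollback"):
    """Fraction of outcomes in the window that were failures.

    Returns None when no outcomes were recorded, to avoid dividing by zero.
    """
    total = sum(outcomes.values())
    if total == 0:
        return None
    return outcomes.get(failure_outcome, 0) / total


# Hypothetical counter readings at the start and end of a reporting window.
start = {"rollback": 2, "promoted": 10}
end = {"rollback": 4, "promoted": 16}

cfr = change_failure_rate(window_increase(start, end))
print(cfr)  # 2 rollbacks out of 8 outcomes -> 0.25
```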


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@jahvon jahvon added the enhancement New feature or request label Dec 17, 2024
@jahvon jahvon changed the title Add time series metrics for tracking Rollout analysis states Enhance Rollout controller metrics to better track analysis states and outcomes over time Dec 17, 2024