AnalysisRun Metrics Differ from Prometheus Causing False Failures #4148
Replies: 1 comment
Duplicate of #4149, closing.
TL;DR
The metric readings observed by Argo Rollouts during an `AnalysisRun` do not always match the values reported by Google Managed Prometheus (GMP). This discrepancy sometimes leads to false negatives, where an analysis run fails because a metric appears to be below the target threshold when, in reality, it is not.

Argo Rollouts Version
1.6.6

Context
We use Argo Rollouts to canary our Deployments in GCP. Our setup includes a `ClusterAnalysisTemplate` like this:

[ClusterAnalysisTemplate manifest omitted]

The `provider` address points to the Prometheus global load balancer, as described in the GMP documentation.

To ensure we only evaluate the health of the canary when valid data is available, we set an `initialDelay` of 5 minutes. Our pods take around 3 minutes to start, and Prometheus scrapes every 30 seconds, so this buffer should be sufficient.

Our queries typically measure error rates or availability, expecting values in the range `0 < x < 1`. Additionally, since the same Deployment exists across multiple clusters, we check the canary’s health across all clusters (i.e., we do not group by `cluster`). This ensures that if the canary fails in any cluster, we trigger a rollback, even if it succeeds elsewhere.

Observed Issue
From our understanding, Argo Rollouts creates an instance of a `Provider` and passes the metric task directly to it (relevant code). Given this, we expect `AnalysisRun` results to match GMP metrics. However, this is not always the case.

At times, we even observe queries reporting values greater than 1 for a ratio, which shouldn't happen. Here’s an example `AnalysisRun`:

[AnalysisRun output omitted]

Discrepancy
For the same timeframe, querying GMP directly always returns `1`, but Argo Rollouts reports different values, sometimes even exceeding `1`. We are confident GMP is not lying to us because the Deployment doesn't experience any issues and appears to receive traffic normally.

This issue occurs intermittently but frequently enough that we cannot reliably trust our canaries for health evaluations. It is also difficult to debug, as we have been unable to reproduce it on demand.
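
To rule out a timing mismatch when comparing by hand, it may help to replay the query as an instant query pinned to the timestamp of a failing measurement, rather than reading a range off a graph. The sketch below builds such a request URL for the standard Prometheus HTTP API endpoint `/api/v1/query`. The base URL, the PromQL string, and the timestamp are hypothetical placeholders. The timestamp would come from the `startedAt` field of a measurement in the `AnalysisRun` status, on the assumption that the provider evaluates the query at roughly that moment.

```python
from datetime import datetime
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str, at_rfc3339: str) -> str:
    """Build a Prometheus instant-query URL pinned to one evaluation time."""
    # Measurement timestamps are RFC 3339 strings such as "2025-01-01T10:15:30Z".
    # fromisoformat() on older Pythons does not accept a trailing "Z".
    at = datetime.fromisoformat(at_rfc3339.replace("Z", "+00:00"))
    # The `time` parameter accepts a Unix timestamp in seconds.
    params = urlencode({"query": promql, "time": f"{at.timestamp():.3f}"})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


if __name__ == "__main__":
    # All three values below are hypothetical, for illustration only.
    url = instant_query_url(
        "http://prometheus.example.internal:9090",
        'sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))',
        "2025-01-01T10:15:30Z",
    )
    print(url)
```

Fetching that URL, with whatever authentication the GMP frontend requires, returns the value as GMP computes it now for that instant. If the replayed value differs from the one recorded in the `AnalysisRun`, one possible explanation is that some samples had not yet been ingested when the measurement ran, though this remains a hypothesis to test rather than a confirmed cause.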
We are looking for guidance on where to investigate further and whether others have encountered similar issues. Any insights would be greatly appreciated.
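
As one possible starting point, the timing budget from the Context section can be checked with a little arithmetic. The sketch below estimates the minimum number of scrapes of a new canary pod that are queryable when the first measurement runs, using the figures above (5 minute `initialDelay`, roughly 3 minute startup, 30 second scrape interval). The `ingestion_lag_s` parameter is an assumption added for illustration; the real ingestion delay for GMP is not known here. It also assumes every scrape succeeds and that the pod is scraped as soon as it starts.

```python
from math import floor


def min_scrapes_before_first_measurement(
    initial_delay_s: float,
    pod_startup_s: float,
    scrape_interval_s: float,
    ingestion_lag_s: float = 0.0,  # hypothetical delay before a sample is queryable
) -> int:
    """Lower bound on queryable scrapes of a new pod at the first measurement."""
    usable_s = initial_delay_s - pod_startup_s - ingestion_lag_s
    if usable_s <= 0:
        return 0
    # Any window of length L contains at least floor(L / interval) periodic scrapes.
    return floor(usable_s / scrape_interval_s)


if __name__ == "__main__":
    scrapes = min_scrapes_before_first_measurement(300, 180, 30)
    print(scrapes)  # 4
    # rate()-style functions need at least two samples inside their window.
    print(scrapes >= 2)  # True
```

With no ingestion lag the budget leaves at least four scrapes, which looks comfortable. A slower startup or a lag of a minute or more shrinks that quickly, so it may be worth confirming the actual numbers for the pods involved in a failing run.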
Thanks in advance!