AnalysisRun Metrics Differ from Prometheus Causing False Failures #4148

alfredomusumeci · 2025-02-21T13:53:25Z

TL;DR

The metric readings observed by Argo Rollouts during an AnalysisRun do not always match the values reported by Google Managed Prometheus (GMP). This discrepancy sometimes leads to false negatives, where an analysis run fails due to a metric appearing below the target threshold when, in reality, it is not.

Argo Rollouts Version

1.6.6

Context

We use Argo Rollouts to canary our Deployments in GCP. Our setup includes a ClusterAnalysisTemplate like this:

apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
   name: success-rate
spec:
   args:
       - name: target
       - name: query
       - name: interval
       - name: delay
   metrics:
       - name: success-rate
         interval: "{{args.interval}}"
         initialDelay: "{{args.delay}}"
         successCondition: "{{args.target}}"
         failureLimit: 1
         provider:
           prometheus:
             address: my-prometheus-address
             query: "{{args.query}}"

The provider address points to the Prometheus global load balancer as per this GMP documentation.

To ensure we only evaluate the health of the canary when valid data is available, we set an initialDelay of 5 minutes. Our pods take around 3 minutes to start, and Prometheus scrapes every 30 seconds, so this buffer should be sufficient.
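As a rough check of that buffer, the arithmetic can be sketched as follows. The figures come from the paragraph above and from the `[1m]` range used in our query. The helper names are hypothetical and exist only for illustration.

```python
from datetime import timedelta

# Figures taken from the description above.
initial_delay = timedelta(minutes=5)
pod_startup = timedelta(minutes=3)        # approximate startup time
scrape_interval = timedelta(seconds=30)
rate_window = timedelta(minutes=1)        # the [1m] range in the query


def scrapes_before_first_measurement(delay, startup, scrape):
    """Upper bound on scrapes completed between pod readiness and the first measurement."""
    return max(delay - startup, timedelta(0)) // scrape


def samples_per_rate_window(window, scrape):
    """Approximate number of scraped samples that fall inside one rate() window."""
    return window // scrape


warmup_scrapes = scrapes_before_first_measurement(initial_delay, pod_startup, scrape_interval)
window_samples = samples_per_rate_window(rate_window, scrape_interval)
print(warmup_scrapes, window_samples)
```

Under these assumptions the first measurement has about four scrapes behind it, so the delay itself looks adequate. The `[1m]` window, however, holds only about two samples per series, and `rate()` needs at least two samples to produce a value, so a single late or missed scrape leaves little margin. Whether that plays any role in the behavior described below is not something we have confirmed.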

Our queries typically measure error rates or availability, so we expect values between 0 and 1. Additionally, since the same Deployment exists across multiple clusters, we check the canary's health across all clusters (i.e., we do not group by cluster). This ensures that if the canary fails in any cluster, we trigger a rollback, even if it succeeds elsewhere.

Observed Issue

From our understanding, Argo Rollouts creates an instance of a Provider and passes the metric task directly to it (relevant code). Given this, we expect AnalysisRun results to match GMP metrics. However, this is not always the case.

At times, we even observe queries reporting values greater than 1 for a ratio, which shouldn't happen. Here’s an example AnalysisRun:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisRun
metadata:
  [...]
spec:
  args:
  - name: target
    value: len(result) == 0 || isNaN(result[0]) || result[0] > 0.99
  - name: query
    value: sum by (project_id) (rate(grpc_server_handled_total{grpc_code=~"(OK|PERMISSION_DENIED|OUT_OF_RANGE_INVALID_ARGUMENT)", job="my-deployment", project_id="my-project", role="canary"}[1m])) / sum by (project_id) (rate(grpc_server_handled_total{job="my-deployment", project_id="my-project", role="canary"}[1m]))
  - name: interval
    value: 1m0s
  - name: delay
    value: 5m0s
  metrics:
  - failureLimit: 1
    initialDelay: '{{args.delay}}'
    interval: '{{args.interval}}'
    name: success-rate
    provider:
      prometheus:
        address: my-prometheus-address
        authentication:
          sigv4: {}
        query: '{{args.query}}'
    successCondition: '{{args.target}}'
  terminate: true
status:
  dryRunSummary: {}
  message: Metric "success-rate" assessed Failed due to failed (3) > failureLimit
  metricResults:
  - count: 12
    failed: 3
    measurements:
      # Skipping successful ones
      [...]
      # Example of a successful result > 1
      - finishedAt: some-timestamp
        phase: Successful
        startedAt: same-as-above-timestamp
        value: '[1.02]'
      # Example of a failed result
      - finishedAt: some-timestamp
        phase: Failed
        startedAt: same-as-above-timestamp
        value: '[0.98]'
      [...]
    metadata:
      ResolvedPrometheusQuery: sum by (project_id) (rate(grpc_server_handled_total{grpc_code=~"(OK|PERMISSION_DENIED|OUT_OF_RANGE_INVALID_ARGUMENT)", job="my-deployment", project_id="my-project", role="canary"}[1m])) / sum by (project_id) (rate(grpc_server_handled_total{job="my-deployment", project_id="my-project", role="canary"}[1m]))
    name: success-rate
    phase: Failed
  [...]

Discrepancy

For the same timeframe, querying GMP directly always returns 1, but Argo Rollouts reports different values, sometimes even exceeding 1. We are confident the GMP values are correct because the Deployment shows no issues and appears to be receiving traffic normally.
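To make the pass/fail outcomes of the measurements in the AnalysisRun above concrete, here is a minimal Python restatement of the success condition as it appears to be intended (empty result, NaN, or a value above 0.99). It is not how Argo Rollouts evaluates the expression internally. `is_valid_ratio` is a hypothetical extra check, added only to show that an out-of-range value such as 1.02 still passes the threshold.

```python
import math


def success_condition(result, threshold=0.99):
    """Python restatement of: len(result) == 0 || isNaN(result[0]) || result[0] > 0.99"""
    return len(result) == 0 or math.isnan(result[0]) or result[0] > threshold


def is_valid_ratio(result):
    """Hypothetical sanity check: a success ratio should lie in [0, 1] (NaN is ignored)."""
    return all(math.isnan(v) or 0.0 <= v <= 1.0 for v in result)


# The two recorded values from the AnalysisRun, plus the empty and NaN cases.
for value in ([1.02], [0.98], [], [float("nan")]):
    print(value, success_condition(value), is_valid_ratio(value))
```

With this condition, the `[1.02]` measurement counts as Successful even though it cannot be a real ratio, while `[0.98]` counts as Failed. Both readings differ from the value of 1 that GMP returns when queried directly.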

This issue occurs intermittently but frequently enough that we cannot reliably trust our canaries for health evaluations. It is also difficult to debug, as we have been unable to reproduce it on demand.
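One way we could compare a single measurement against GMP is to replay the resolved query as an instant query pinned to the measurement's timestamp, rather than looking at a graph over a range. The sketch below only builds the URL for the standard Prometheus HTTP API endpoint `/api/v1/query`. It assumes the provider address exposes that API and that the analysis issues an instant query at measurement time. The shortened query, the timestamp, and the `instant_query_url` helper are placeholders, and authentication is left out.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def instant_query_url(address, query, at):
    """Build a Prometheus HTTP API instant-query URL evaluated at time `at`."""
    params = urlencode({"query": query, "time": f"{at.timestamp():.3f}"})
    return f"{address.rstrip('/')}/api/v1/query?{params}"


# Placeholder values: substitute the real provider address, the full
# ResolvedPrometheusQuery from the AnalysisRun metadata, and the finishedAt
# timestamp of the measurement being checked.
url = instant_query_url(
    "http://my-prometheus-address",
    'sum(rate(grpc_server_handled_total{role="canary"}[1m]))',
    datetime(2025, 2, 21, 13, 0, 0, tzinfo=timezone.utc),
)
print(url)
```

Fetching that URL with the appropriate credentials for the numerator, the denominator, and the full ratio separately would show whether the value Argo Rollouts recorded can be reproduced at that exact instant.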

We are looking for guidance on where to investigate further and whether others have encountered similar issues. Any insights would be greatly appreciated.

Thanks in advance!

alfredomusumeci · 2025-02-24T09:34:05Z

Duplicate of #4149, closing.
