Add configurable automatic retries for failed rollouts #4023

andrii-korotkov-verkada · 2024-12-24T18:36:58Z

Summary

It would be helpful to configure an ability to have some retries if rollout fails, just like if you click the Retry popup menu item.

Use Cases

Sometimes the rollout fails due to flakiness (e.g. if using Datadog's apdex metric which is known to sometimes have bad values at current time), and the solution may be to just retry and it would succeed. Even 1 automatic retry would be very useful.

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

meeech · 2024-12-28T19:01:21Z

Wouldnt you adjust your analysis template to be more tolerant? Otherwise, how will this fix things? It will just fail on the retry?

andrii-korotkov-verkada · 2024-12-28T23:48:46Z

Not necessarily. Like for Datadog apdex it occasionally has periods of values being unstable, e.g. instead of being ~1 with a normal build they may jump from 0 to 10+, so even moving rollup doesn't fully solve the problem. Adjusting analysis template might help, but it'd also miss more legitimate cases. My belief that retry can help is based on some experiences in practice where I just did manual retries and they worked.

meeech · 2024-12-29T02:45:05Z

Understood. I guess i just imagined you allow an extra fail or two if its such an intermittent thing. It should still catch legit fails, but I dont know your specific queries.

andrii-korotkov-verkada · 2024-12-29T03:04:33Z

I've noticed that such flakiness may affect most data points during the run, when it happens.
The current query for apdex is like

moving_rollup(avg:{{args.metric-prefix}}.apdex{env:{{args.env}},service:{{args.service-name}},version:{{args.version}}}.fill(last), 60, 'avg')

meeech · 2024-12-29T19:22:42Z

out of curiosity, what is your interval for this query?

andrii-korotkov-verkada · 2024-12-29T20:55:53Z

out of curiosity, what is your interval for this query?

The interval is typically 10-20 min. The query interval is 1 min.

apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: apdex-rate-fill
  labels:
    type: background
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  args:
    - name: interval
    - name: threshold
    - name: metric-prefix
    - name: env
    - name: service-name
    - name: version
  metrics:
    - name: apdex-rate-fill
      interval: 60s
      successCondition: default(result, 1.0) >= {{args.threshold}}
      failureLimit: 3
      provider:
        datadog:
          # we need to have bigger intervals for evaluation so the query doesn't return nil
          interval: "{{args.interval}}"
          query: moving_rollup(avg:{{args.metric-prefix}}.apdex{env:{{args.env}},service:{{args.service-name}},version:{{args.version}}}.fill(last), 60, 'avg')

      analysis:
        templates:
        - templateName: apdex-rate-fill
          clusterScope: true
        startingStep: 2
        args:
        - name: interval
          value: "20m"
        - name: threshold
          value: "0.97"
        - name: metric-prefix
          value: trace.http.request
        - name: service-name
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['app']
        - name: version
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['version']
        - name: env
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['env']

todaywasawesome · 2025-01-02T16:25:39Z

Recommendation from @zachaller is to use a separate controller when you need retry and send abort status. View CLI for good example.

andrii-korotkov-verkada added the enhancement New feature or request label Dec 24, 2024

andrii-korotkov-verkada changed the title ~~Add configurabel automatic retries for failed rollouts~~ Add configurable automatic retries for failed rollouts Dec 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add configurable automatic retries for failed rollouts #4023

Add configurable automatic retries for failed rollouts #4023

andrii-korotkov-verkada commented Dec 24, 2024

meeech commented Dec 28, 2024

andrii-korotkov-verkada commented Dec 28, 2024

meeech commented Dec 29, 2024

andrii-korotkov-verkada commented Dec 29, 2024

meeech commented Dec 29, 2024

andrii-korotkov-verkada commented Dec 29, 2024

todaywasawesome commented Jan 2, 2025

Add configurable automatic retries for failed rollouts #4023

Add configurable automatic retries for failed rollouts #4023

Comments

andrii-korotkov-verkada commented Dec 24, 2024

Summary

Use Cases

meeech commented Dec 28, 2024

andrii-korotkov-verkada commented Dec 28, 2024

meeech commented Dec 29, 2024

andrii-korotkov-verkada commented Dec 29, 2024

meeech commented Dec 29, 2024

andrii-korotkov-verkada commented Dec 29, 2024

todaywasawesome commented Jan 2, 2025