Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add configurable automatic retries for failed rollouts #4023

Open
andrii-korotkov-verkada opened this issue Dec 24, 2024 · 7 comments
Open
Labels
enhancement New feature or request

Comments

@andrii-korotkov-verkada

Summary

It would be helpful to configure an ability to have some retries if rollout fails, just like if you click the Retry popup menu item.

Use Cases

Sometimes the rollout fails due to flakiness (e.g. if using Datadog's apdex metric which is known to sometimes have bad values at current time), and the solution may be to just retry and it would succeed. Even 1 automatic retry would be very useful.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@andrii-korotkov-verkada andrii-korotkov-verkada added the enhancement New feature or request label Dec 24, 2024
@andrii-korotkov-verkada andrii-korotkov-verkada changed the title Add configurabel automatic retries for failed rollouts Add configurable automatic retries for failed rollouts Dec 24, 2024
@meeech
Copy link
Contributor

meeech commented Dec 28, 2024

Wouldnt you adjust your analysis template to be more tolerant? Otherwise, how will this fix things? It will just fail on the retry?

@andrii-korotkov-verkada
Copy link
Author

Not necessarily. Like for Datadog apdex it occasionally has periods of values being unstable, e.g. instead of being ~1 with a normal build they may jump from 0 to 10+, so even moving rollup doesn't fully solve the problem. Adjusting analysis template might help, but it'd also miss more legitimate cases. My belief that retry can help is based on some experiences in practice where I just did manual retries and they worked.

@meeech
Copy link
Contributor

meeech commented Dec 29, 2024

Understood. I guess i just imagined you allow an extra fail or two if its such an intermittent thing. It should still catch legit fails, but I dont know your specific queries.

@andrii-korotkov-verkada
Copy link
Author

I've noticed that such flakiness may affect most data points during the run, when it happens.
The current query for apdex is like

moving_rollup(avg:{{args.metric-prefix}}.apdex{env:{{args.env}},service:{{args.service-name}},version:{{args.version}}}.fill(last), 60, 'avg')

@meeech
Copy link
Contributor

meeech commented Dec 29, 2024

out of curiosity, what is your interval for this query?

@andrii-korotkov-verkada
Copy link
Author

out of curiosity, what is your interval for this query?

The interval is typically 10-20 min. The query interval is 1 min.

apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: apdex-rate-fill
  labels:
    type: background
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  args:
    - name: interval
    - name: threshold
    - name: metric-prefix
    - name: env
    - name: service-name
    - name: version
  metrics:
    - name: apdex-rate-fill
      interval: 60s
      successCondition: default(result, 1.0) >= {{args.threshold}}
      failureLimit: 3
      provider:
        datadog:
          # we need to have bigger intervals for evaluation so the query doesn't return nil
          interval: "{{args.interval}}"
          query: moving_rollup(avg:{{args.metric-prefix}}.apdex{env:{{args.env}},service:{{args.service-name}},version:{{args.version}}}.fill(last), 60, 'avg')
      analysis:
        templates:
        - templateName: apdex-rate-fill
          clusterScope: true
        startingStep: 2
        args:
        - name: interval
          value: "20m"
        - name: threshold
          value: "0.97"
        - name: metric-prefix
          value: trace.http.request
        - name: service-name
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['app']
        - name: version
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['version']
        - name: env
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['env']

@todaywasawesome
Copy link
Contributor

Recommendation from @zachaller is to use a separate controller when you need retry and send abort status. View CLI for good example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants