Add configurable automatic retries for failed rollouts #4023
Comments
Wouldn't you adjust your analysis template to be more tolerant? Otherwise, how will this fix things? Won't it just fail again on the retry?
Not necessarily. For example, the Datadog apdex metric occasionally goes through periods of unstable values: instead of staying at ~1 with a normal build, it may jump anywhere from 0 to 10+, so even a moving rollup doesn't fully solve the problem. Adjusting the analysis template might help, but it would also miss more legitimate failures. My belief that a retry can help is based on practical experience: when I retried manually, the retries worked.
Understood. I guess I just imagined you would allow an extra failure or two if it's such an intermittent thing. It should still catch legitimate failures, but I don't know your specific queries.
I've noticed that such flakiness may affect most data points during the run, when it happens.
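To make the trade-off in this exchange concrete, here is a minimal sketch of how a failure tolerance behaves on a flaky series. The function name, the healthy range (0.9 to 1.0), and the sample values are all hypothetical and only illustrate the idea of allowing "an extra failure or two"; this is not the controller's implementation.

```python
from typing import Sequence


def run_passes(
    values: Sequence[float],
    failure_limit: int,
    low: float = 0.9,   # hypothetical lower bound for a healthy apdex value
    high: float = 1.0,  # hypothetical upper bound; apdex should not exceed 1
) -> bool:
    """Return True if the number of unhealthy measurements is within failure_limit."""
    failures = sum(1 for v in values if not (low <= v <= high))
    return failures <= failure_limit


# One bad data point in an otherwise steady run is absorbed by a limit of 1.
mostly_steady = [1.0, 0.98, 0.0, 0.99, 1.0]
print(run_passes(mostly_steady, failure_limit=1))

# When flakiness affects most data points, a small limit no longer helps.
mostly_flaky = [0.0, 12.0, 1.0, 0.0, 10.5]
print(run_passes(mostly_flaky, failure_limit=2))
```

Under these assumptions, a tolerance large enough to absorb the second series would also let through a run that is genuinely unhealthy most of the time, which is the concern raised above.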
Out of curiosity, what is your interval for this query?
The interval is typically 10-20 min. The query interval is 1 min.
The recommendation from @zachaller is to use a separate controller when you need retries and to send the abort status. See the CLI for a good example.
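The core of such an external retry loop could look roughly like the sketch below. It is only an illustration: `run_with_retries` and the `attempt` callable are hypothetical names, and in a real controller `attempt` would have to trigger the retry (for example through the kubectl plugin's `retry rollout` command) and wait for the rollout to reach a final state.

```python
from typing import Callable, Tuple


def run_with_retries(attempt: Callable[[], bool], max_retries: int) -> Tuple[bool, int]:
    """Call attempt() once, then up to max_retries more times until it succeeds.

    Returns (succeeded, attempts_made).
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        if attempt():
            return True, attempts
    return False, attempts


# Simulated rollout that fails once due to flakiness and succeeds on the retry.
outcomes = iter([False, True])
print(run_with_retries(lambda: next(outcomes), max_retries=1))  # (True, 2)
```

Bounding the loop with `max_retries` keeps a genuinely bad build from being retried indefinitely; with `max_retries=1` this matches the "even 1 automatic retry" case requested in the issue.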
Summary
It would be helpful to be able to configure automatic retries when a rollout fails, equivalent to clicking the Retry item in the popup menu.
Use Cases
Sometimes a rollout fails due to flakiness (e.g. when using Datadog's apdex metric, which is known to sometimes report bad values at the current time), and the solution may be to simply retry, after which it would succeed. Even one automatic retry would be very useful.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.