controller: try to make step timeout somewhat more accurate #3663

Open
krancour opened this issue Mar 17, 2025 · 0 comments

Numerous discussions have taken place over the last few months about two things: making the interval between step retries configurable, and step timeouts possibly not being honored.

I'll borrow my own recent explanation for the difficulty behind this from #3515:

the timeouts are not exact because steps do not continuously retry internally. If the timeout hasn't elapsed, a step that's still running (waiting on something external) is retried on the next reconciliation attempt.

In general, those attempts are every five minutes, but the next attempt can be sooner if a related resource has a state change that forces the Promotion back onto the queue. It can also be later depending on the depth of the queue.

...

We're a little bit at the mercy of the controller runtime here since we don't have precise control over the interval before the next reconciliation...

That said, we can probably get closer to the specified timeout by shortening the requeue interval when the timeout would elapse sooner than the next reconciliation attempt would typically occur.

If someone wishes to tackle this, it should only be a few lines of code right around here.

We can keep the hard-coded five minutes, but shorten it by the appropriate amount if the running step's start time + timeout is sooner than now + five minutes.

Note: At this time, I'm still hesitant to apply a similar strategy for making retry intervals configurable because I think our inability to offer much precision there is more problematic than it is with timeouts. Let's, at least, consider it out of scope for this issue.
