Backoff retry delay in status check is increasing to hours rendering services offline #2897

Open
eugene-sadovsky opened this issue Nov 14, 2023 · 5 comments

Comments

@eugene-sadovsky

Spring Boot Admin Server information

  • Version:
    3.1.4

  • Spring Boot version:
    3.1.0

Client information

  • Used discovery mechanism:
    Consul

Description

The exponential back-off delay in de.codecentric.boot.admin.server.services.IntervalCheck grows to hours. After running SBA for 2+ weeks, I noticed that previously registered services go offline for hours and then become available again. Restarting SBA fixes it right away. This is always accompanied by the error message: Unexpected error in status-check: reactor.core.Exceptions$OverflowException: Could not emit tick NN due to lack of requests (interval doesn't support small downstream requests that replenish slower than the ticks)
After some investigation, it looks like this happens when the checkAllInstances method times out (takes longer to complete than the check interval), which triggers a retry. The back-off interval keeps increasing with each failure over the lifetime of the SBA process and eventually grows to hours. It actually takes about 12+ retries to get there. Lowering spring.boot.admin.timeout.health to 3 seconds improved the situation. By default, the health endpoint timeout is equal to spring.boot.admin.status-interval (10s).
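
To make the "about 12+ retries" figure concrete, here is a rough arithmetic sketch (not the gist mentioned in this issue, and not code from IntervalCheck). It assumes the delay starts at 1 second and doubles on every failure without any cap; both values are assumptions for illustration, and any jitter is ignored. Under those assumptions the 13th retry already waits more than an hour.

```java
import java.time.Duration;

// Illustrative only: models an uncapped, doubling back-off.
// The starting delay and the factor of 2 are assumptions, not values read from IntervalCheck.
class BackoffGrowth {

    // Delay before retry number `retry` (1-based) when the delay doubles after every failure.
    static Duration delayForRetry(int retry, Duration first) {
        if (retry < 1 || retry > 62) {
            throw new IllegalArgumentException("retry must be between 1 and 62");
        }
        // first * 2^(retry - 1); multipliedBy throws ArithmeticException on overflow
        return first.multipliedBy(1L << (retry - 1));
    }

    public static void main(String[] args) {
        Duration first = Duration.ofSeconds(1); // assumed starting delay
        for (int retry = 1; retry <= 14; retry++) {
            System.out.printf("retry %2d -> wait %s%n", retry, delayForRetry(retry, first));
        }
    }
}
```

Because the retry counter in this model never goes back down, a handful of failures spread over two weeks is enough to reach these values.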

Here's the code snippet that reproduces this behavior. It slows down with each retry.

@erikpetzold
Member

erikpetzold commented Nov 17, 2023

Hi @eugene-sadovsky,

The increasing retry time is intended behaviour, but you are right that the waiting time might get too high.
We introduced a new maxBackoff property so you can configure the upper bound yourself. The default maxBackoff for the status check is now 60 seconds.
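
As a rough illustration of what a maxBackoff cap does to the delay sequence, the sketch below clamps a doubling delay at an upper bound. The class and method names are hypothetical, and the 1-second starting delay is an assumption; only the 60-second default comes from the comment above. It is not the actual SBA implementation.

```java
import java.time.Duration;

// Illustrative only: a doubling back-off clamped at maxBackoff.
// Names and the starting delay are assumptions for this sketch.
class CappedBackoff {

    // Delay before retry number `retry` (1-based), never larger than maxBackoff.
    static Duration delayForRetry(int retry, Duration first, Duration maxBackoff) {
        Duration delay = first;
        // Stop doubling as soon as the cap is reached, so large retry counts cannot overflow.
        for (int i = 1; i < retry && delay.compareTo(maxBackoff) < 0; i++) {
            delay = delay.multipliedBy(2);
        }
        return delay.compareTo(maxBackoff) > 0 ? maxBackoff : delay;
    }

    public static void main(String[] args) {
        Duration first = Duration.ofSeconds(1);       // assumed starting delay
        Duration maxBackoff = Duration.ofSeconds(60); // default mentioned in this thread
        for (int retry = 1; retry <= 10; retry++) {
            System.out.printf("retry %2d -> wait %s%n", retry, delayForRetry(retry, first, maxBackoff));
        }
    }
}
```

With these inputs the delay stops growing once it reaches the cap, so the worst-case wait is bounded by maxBackoff rather than by the number of failures seen so far.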

@eugene-sadovsky
Author

thank you for the quick response 🙇🏼

@eugene-sadovsky
Author

I think the main issue is that the back-off time is never reset to zero after a successful retry. It will just saturate at maxBackoff and stay there for the lifetime of the process. This still solves my issue, thank you 👍🏼
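
The reset behaviour described here could look roughly like the sketch below: the delay is derived from the number of consecutive failures, and a successful check sets that counter back to zero. All names are hypothetical and this is not how IntervalCheck is written. Reactor's RetryBackoffSpec also has a transientErrors(true) mode that bases the delay on retries in a row; whether that fits IntervalCheck is not established in this thread.

```java
import java.time.Duration;

// Illustrative only: a capped doubling back-off that resets after a success.
// Class, method, and field names are hypothetical.
class ResettingBackoff {
    private final Duration first;
    private final Duration maxBackoff;
    private int consecutiveFailures = 0;

    ResettingBackoff(Duration first, Duration maxBackoff) {
        this.first = first;
        this.maxBackoff = maxBackoff;
    }

    // Call after a failed check; returns how long to wait before the next attempt.
    Duration onFailure() {
        consecutiveFailures++;
        Duration delay = first;
        // Double once per earlier consecutive failure, stopping at the cap.
        for (int i = 1; i < consecutiveFailures && delay.compareTo(maxBackoff) < 0; i++) {
            delay = delay.multipliedBy(2);
        }
        return delay.compareTo(maxBackoff) > 0 ? maxBackoff : delay;
    }

    // Call after a successful check; the next failure starts again from the first delay.
    void onSuccess() {
        consecutiveFailures = 0;
    }

    public static void main(String[] args) {
        ResettingBackoff backoff = new ResettingBackoff(Duration.ofSeconds(1), Duration.ofSeconds(60));
        System.out.println(backoff.onFailure());
        System.out.println(backoff.onFailure());
        backoff.onSuccess();
        System.out.println(backoff.onFailure());
    }
}
```

The design choice is to count failures in a row rather than failures overall, so occasional timeouts spread over weeks do not accumulate.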

erikpetzold added a commit that referenced this issue Nov 17, 2023
* #2897: WIP Fix exponential backoff

* reduce number of places where defaults can be defined

* use configured backoff in retry

* #2897: javaformat

* add Test

* add docs

* reduce number of places with defaults

---------

Co-authored-by: ulrichschulte <[email protected]>
@erikpetzold
Member

If this is really true, that would be a bug in Project Reactor, I think.

@eugene-sadovsky
Author

Yeah, this is the behavior I observed. You can reproduce it by running my gist, which closely resembles the code in IntervalCheck. It randomly simulates a timeout, then there may be a few successful checks, then a timeout again. With each retry the delay becomes longer, and it never goes back to zero.

2 participants