Switch missed tick behavior to delay instead of default burst #108
We encountered an interesting bug that we fixed and felt we should upstream, as it presents itself in the sample as well.
We observed that seemingly randomly, this loop (or rather, our similar but modified one) would get "hot" and run as fast as possible. This would consume a lot of CPU and other resources, and of course meant a lot of SYN packets being sent somewhere proximal to the peer. These "reconnect storms" could last as little as a few seconds, or up to hours, after which they would spontaneously resolve.
We could only observe this because we'd added logging to the reconnect attempts, so all we knew was that some loop was hot, but not which one. We initially tried rewriting the loop to use combinators with `take` and maximum attempt values, but this did not fix the problem. This suggested that the outer loop was the issue, but it took us some time to figure out how that could be.

The interval here uses the default missed tick behavior, which is `Burst`: https://docs.rs/tokio/latest/tokio/time/enum.MissedTickBehavior.html
We believe that with a large number of peers offline (we sometimes had >5), this loop could run for longer than the `Interval` duration of one second. When this happened, the `Interval` would enter `Burst` mode, firing the missed ticks back to back, and try to re-establish an exact cadence of running this loop on a whole multiple of the interval from the start.

We switched the interval to instead use the `Delay` behavior, which should cause this loop to wait the interval's duration (1 second) each time `tick()` is called, which is probably what the author (@TheBlueMatt) intended.
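
To illustrate the difference, here is a minimal, std-only sketch that models the two policies on a virtual clock. It is not tokio's implementation: `Policy`, `tick_times`, and the workload numbers are hypothetical names and values chosen for illustration, and the model ignores tokio's small lateness tolerance. In the real code the change is a call to `set_missed_tick_behavior(MissedTickBehavior::Delay)` on the `Interval`.

```rust
/// Simplified model of two of tokio's missed tick policies (illustrative only).
#[derive(Clone, Copy, Debug)]
enum Policy {
    Burst,
    Delay,
}

/// Returns the virtual time (in ms) at which each `tick()` would complete,
/// given how long the loop body runs after each tick (`work`, in ms).
fn tick_times(policy: Policy, period: u64, work: &[u64]) -> Vec<u64> {
    let mut now = 0u64;
    let mut deadline = 0u64; // the first tick completes immediately
    let mut ticks = Vec::new();
    for &w in work {
        // tick(): sleep until the deadline if it is still in the future.
        if now < deadline {
            now = deadline;
        }
        ticks.push(now);
        // Schedule the next deadline according to the policy.
        deadline = match policy {
            // Burst keeps the original schedule, so late ticks fire back to back.
            Policy::Burst => deadline + period,
            // Delay restarts the schedule from the moment this tick completed.
            Policy::Delay => now + period,
        };
        // Run the loop body (e.g. reconnect attempts) for `w` ms.
        now += w;
    }
    ticks
}

fn main() {
    // One slow iteration (3.5 s) followed by fast ones (0.1 s each).
    let work = [3500, 100, 100, 100, 100, 100];
    println!("{:?}", tick_times(Policy::Burst, 1000, &work));
    // prints [0, 3500, 3600, 3700, 4000, 5000]
    println!("{:?}", tick_times(Policy::Delay, 1000, &work));
    // prints [0, 3500, 4500, 5500, 6500, 7500]
}
```

Under this model, `Burst` runs the loop three times within 200 ms after the slow iteration before settling back onto the one-second grid, while `Delay` always leaves a full second between ticks. If the loop body consistently takes longer than the period, the `Burst` model never catches up and keeps firing immediately, which matches the behavior we suspect was behind the storms.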