Timeout in GossipReenterAfterSend #2130

Open
AqlaSolutions opened this issue Jul 30, 2024 · 11 comments
@AqlaSolutions
Contributor

AqlaSolutions commented Jul 30, 2024

Hi, we sometimes see a flood of this error message in the log (2-3 per second): Timeout in GossipReenterAfterSend. It can continue for hours, and no other messages precede it in the log. After about an hour of this we also start seeing the following:

TimeoutException: Request didn't receive any Response within the expected time.
   at Proto.Future.FutureProcess.GetTask(CancellationToken cancellationToken)
   at Proto.SenderContextExtensions.RequestAsync[T](ISenderContext self, PID target, Object message, CancellationToken cancellationToken)
   at Proto.Cluster.Gossip.Gossiper.GetStateEntry(String key)
   at Proto.Cluster.Gossip.Gossiper.BlockGracefullyLeft()
   at Proto.Cluster.Gossip.Gossiper.GossipLoop()
"MessageTemplate": "Gossip loop failed"

These messages continue for hours, possibly until the node is restarted.

We can't reproduce it locally, but it happens regularly in Kubernetes on our staging and production servers. Is there a way to debug this? Any help would be appreciated.

@rogeralsing
Contributor

Hi, we recently added a link from the documentation to this article: https://home.robusta.dev/blog/stop-using-cpu-limits

Kubernetes is prone to throttling the CPU in this kind of system, which results in timeouts.
(The same applies to Orleans or GetEventstore, or anything realtime-ish.)

Could you give that a try and see if this fixes the problems in your case?

@AqlaSolutions
Contributor Author

AqlaSolutions commented Jul 31, 2024

We can try that, but according to our monitoring there is no high CPU activity at the time of the issue.
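
One caveat worth checking: CFS quota throttling is accounted per scheduling period, so a container can be throttled during short bursts while its average CPU usage stays low. The sketch below parses the contents of a cgroup `cpu.stat` file and reports the share of periods that were throttled. The class and method names are made up for illustration, and the file location (commonly `/sys/fs/cgroup/cpu.stat` on cgroup v2, `/sys/fs/cgroup/cpu/cpu.stat` on v1) is an assumption that depends on how the nodes are set up.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Proto.Actor.
public static class CpuThrottleStats
{
    // Parses the "key value" lines of a cgroup cpu.stat file into a dictionary.
    public static Dictionary<string, long> Parse(string cpuStatText)
    {
        var stats = new Dictionary<string, long>();
        foreach (var line in cpuStatText.Split('\n', StringSplitOptions.RemoveEmptyEntries))
        {
            var parts = line.Split(' ',
                StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);
            if (parts.Length == 2 && long.TryParse(parts[1], out var value))
            {
                stats[parts[0]] = value;
            }
        }
        return stats;
    }

    // Fraction of scheduling periods in which the cgroup was throttled (0.0 to 1.0).
    // Returns 0 when the counters are missing or no period has elapsed yet.
    public static double ThrottledRatio(IReadOnlyDictionary<string, long> stats)
    {
        if (!stats.TryGetValue("nr_periods", out var periods) || periods <= 0)
        {
            return 0.0;
        }
        stats.TryGetValue("nr_throttled", out var throttled);
        return (double)throttled / periods;
    }
}
```

From inside the pod this could be called as `CpuThrottleStats.ThrottledRatio(CpuThrottleStats.Parse(File.ReadAllText(path)))`, sampled before and after a burst of gossip timeouts to see whether the counters moved.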

@rogeralsing
Contributor

Could you also give this a try?

actorSystemConfig = actorSystemConfig with { SharedFutures = false };

Then pass the resulting config to the actor system.
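
For readers following along, a minimal sketch of where that setting goes. It assumes the usual `ActorSystemConfig.Setup()` entry point and the `ActorSystem` constructor that accepts a config; the variable names are illustrative, and a clustered setup would add its remote and cluster configuration on top.

```csharp
using Proto;

// Start from the default configuration and disable shared futures,
// using the `with` expression quoted above.
var actorSystemConfig = ActorSystemConfig.Setup();
actorSystemConfig = actorSystemConfig with { SharedFutures = false };

// Pass the resulting config when constructing the actor system.
var system = new ActorSystem(actorSystemConfig);
```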

The exception you linked above is from the gossip loop. It seems to time out while just trying to get the gossip state, which indicates that the gossip actor is deadlocked for some reason.
Maybe there is some unknown bug in the shared futures, which are enabled by default.

That specific exception does not look like it could be Kubernetes-related, tbh.
I've started investigating this on my side as well.

@AqlaSolutions
Contributor Author

We already use SharedFutures = false. I am the one who reported the issue with SharedFutures :)

@rogeralsing
Contributor

Ah right.
Can you check whether you get any of these log messages for the gossip actor?

Actor {Self} deadline {Deadline}, exceeded on message

For example, search for "$gossip" in the logs.

It would be great to know whether the system detects that the gossip actor is timing out on messages.

@rogeralsing
Contributor

Or any of these

System {Id} - ThreadPool is running hot, ThreadPool latency {ThreadPoolLatency}

@rogeralsing
Contributor

Or this one

GossipActor Failed {MessageType}
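
The three searches above can be sketched as a small log scan. This is an illustrative helper, not part of Proto.Actor: the class name and the category keys are made up, and it assumes plain-text logs in which the rendered messages keep the fixed parts of the templates quoted above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for counting the diagnostic messages discussed above.
public static class GossipLogScan
{
    public static Dictionary<string, int> Count(IEnumerable<string> logLines)
    {
        var counts = new Dictionary<string, int>
        {
            ["gossip-deadline"] = 0,
            ["threadpool-hot"] = 0,
            ["gossip-actor-failed"] = 0,
        };

        foreach (var line in logLines)
        {
            // "Actor {Self} deadline {Deadline}, exceeded on message", gossip actor only.
            if (line.Contains("exceeded on message", StringComparison.Ordinal)
                && line.Contains("$gossip", StringComparison.Ordinal))
            {
                counts["gossip-deadline"]++;
            }

            // "System {Id} - ThreadPool is running hot, ThreadPool latency {ThreadPoolLatency}"
            if (line.Contains("ThreadPool is running hot", StringComparison.Ordinal))
            {
                counts["threadpool-hot"]++;
            }

            // "GossipActor Failed {MessageType}"
            if (line.Contains("GossipActor Failed", StringComparison.Ordinal))
            {
                counts["gossip-actor-failed"]++;
            }
        }

        return counts;
    }
}
```

It could be fed with `File.ReadLines(path)` for a log file exported from the cluster, where `path` is whatever file the logging setup produces.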

@AqlaSolutions
Contributor Author

AqlaSolutions commented Aug 2, 2024

I've found only this one and only once:

System {Id} - ThreadPool is running hot, ThreadPool latency 00:00:01.0000184
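
Since that message appeared only once, it may help to sample thread pool latency independently around the time of the gossip timeouts. Below is a minimal sketch of one way to do that: queue a trivial work item and measure how long it waits before a pool thread runs it. This is not necessarily how Proto.Actor computes the value in its log message, and the class name and threshold parameter are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical probe, not part of Proto.Actor.
public static class ThreadPoolProbe
{
    // Measures how long a trivial work item waits before a pool thread picks it up.
    public static async Task<TimeSpan> MeasureLatencyAsync()
    {
        var stopwatch = Stopwatch.StartNew();
        // The lambda runs on a thread pool thread; the elapsed time at that
        // moment approximates the queueing delay.
        return await Task.Run(() => stopwatch.Elapsed);
    }

    // True when the measured latency exceeds the caller-chosen threshold.
    public static bool IsRunningHot(TimeSpan latency, TimeSpan threshold) =>
        latency > threshold;
}
```

Calling `MeasureLatencyAsync()` on a timer and logging the result would show whether the pool is starved during the periods when the timeouts occur.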

@benbenwilde
Contributor

@AqlaSolutions I believe #2133 may fix the underlying issue that can cause a lot of Timeout in GossipReenterAfterSend and Gossip loop failed errors. It's available in the latest 1.6.1-alpha.0.25; hopefully you'll see an improvement.

@AqlaSolutions
Contributor Author

Great news!

@AqlaSolutions
Contributor Author

AqlaSolutions commented Dec 18, 2024

We still see the gossip timeout issue in 1.7.0, but there are no "Gossip loop failed" errors this time.

After these errors have been repeating for some time, one of the nodes simply blocks the problematic one:

Connection Refused from remote member c012bc35ff2847269801cfefd72e9822 address [redacted], they are blocked

Then that node learns that it has been blocked (possibly via gossip from a third node, or from the cluster provider):

I have been blocked, exiting c012bc35ff2847269801cfefd72e9822

There is also another problem: the host never terminates, even though we use WithExitOnShutdown. We see this error in the log:

Actor [redacted]/$partition-activator deadline 00:00:00.1000000, exceeded on message Proto.Stopping

ActorSystem.Shutdown token is never cancelled.
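
To confirm that observation, or as a stopgap, one option is a small watchdog that waits for the shutdown token with a grace period. The sketch below is generic .NET code rather than a Proto.Actor feature: the class name and the grace period are hypothetical, and it only assumes that the shutdown signal is exposed as a `CancellationToken`, as described above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical watchdog, not part of Proto.Actor.
public static class ShutdownWatchdog
{
    // Returns true if `shutdown` is cancelled within `gracePeriod`, false otherwise.
    public static async Task<bool> WaitForShutdownAsync(
        CancellationToken shutdown, TimeSpan gracePeriod)
    {
        if (shutdown.IsCancellationRequested)
        {
            return true;
        }

        var signalled = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);
        using var delayCts = new CancellationTokenSource();
        using var registration = shutdown.Register(() => signalled.TrySetResult(true));

        var first = await Task.WhenAny(
            signalled.Task, Task.Delay(gracePeriod, delayCts.Token));

        // Stop the timer if the shutdown signal won the race.
        delayCts.Cancel();
        return first == signalled.Task;
    }
}
```

Once the node logs that it has been blocked, the host could call this with the actor system's shutdown token and, if it returns false, log the fact and decide whether to terminate the process explicitly.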

I also noticed that even after shutdown has started, the node still tries to reconnect to other nodes (which it can't, because it is blocked).
