improve deadlock detection around slow/unresponsive replies #4372

Open
romange opened this issue Dec 26, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

romange (Collaborator) commented Dec 26, 2024

The problem:
Consider the case where MULTI/EXEC or Lua transactions fetch large amounts of data and their commands get stuck while replying (blocked on socket send). If these transactions are still in the tx queue, the whole queue cannot progress. In the most acute scenarios this can lead to client-initiated deadlocks. See #4182 for an example.

This is easy to simulate by setting pipeline_queue_limit=10 and running multiple GETs on several very large keys in pipeline mode, together with another connection running MULTI/EXEC on the same keys. Once these keys are locked and the GETs are placed into the tx queue, we may create a deadlock, because the pipelined connection won't be able to progress and it will stall Dragonfly globally.
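A rough reproduction sketch of the scenario above, using only the Python standard library. Everything except `pipeline_queue_limit=10` is an assumption made for illustration: the host and port, the key names, the value size, the number of pipelined rounds, and the idea of using a third "probe" connection to check whether the server still answers. Whether this actually stalls a given build depends on key sizes and socket buffer sizes, so treat it as a starting point rather than a guaranteed repro.

```python
import socket


def encode_command(*args: str) -> bytes:
    """Encode one command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode()
        parts.append(b"$%d\r\n%b\r\n" % (len(data), data))
    return b"".join(parts)


def pipelined_gets(keys, rounds):
    """One payload containing `rounds` passes of GET over every key."""
    return b"".join(encode_command("GET", k) for _ in range(rounds) for k in keys)


def multi_exec_gets(keys):
    """A MULTI ... EXEC block that reads the same keys."""
    body = b"".join(encode_command("GET", k) for k in keys)
    return encode_command("MULTI") + body + encode_command("EXEC")


def run_repro(host="localhost", port=6379, keys=("big:1", "big:2"),
              value_bytes=32 << 20, rounds=200, timeout_s=5.0):
    """Return True if a probe PING gets no reply within timeout_s.

    Assumes a server started with --pipeline_queue_limit=10.
    """
    with socket.create_connection((host, port)) as setup:
        for key in keys:
            setup.sendall(encode_command("SET", key, "x" * value_bytes))
            setup.recv(64)  # "+OK" expected; not validated in this sketch

    piped = socket.create_connection((host, port), timeout=timeout_s)
    txn = socket.create_connection((host, port), timeout=timeout_s)
    probe = socket.create_connection((host, port), timeout=timeout_s)
    try:
        # Neither connection reads its replies, so the server should
        # eventually block on socket send for both of them.
        txn.sendall(multi_exec_gets(keys))
        try:
            piped.sendall(pipelined_gets(keys, rounds))
        except socket.timeout:
            pass  # the server stopped reading our pipeline
        probe.sendall(encode_command("PING"))
        try:
            probe.recv(16)
            return False  # server still answers other clients
        except socket.timeout:
            return True   # no reply: the server looks stalled
    finally:
        for sock in (piped, txn, probe):
            sock.close()
```

Calling `run_repro()` against a disposable test instance returns `True` when the probe connection times out. The payload builders are kept separate from the networking code so they can be checked without a running server.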

We have tx_queue_warning_len, which helps identify these scenarios, but it is too noisy because the tx queue length can grow for valid reasons.

Solution: maintain a timer for a multi-hop transaction per shard queue. We will identify a problematic scenario based on two signals: how long the head has been at the front of the tx queue, and the queue length.
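The two-signal check could look roughly like the sketch below. It is written in Python for brevity and is not Dragonfly code: the class name, the `observe` method, the `max_head_age_s` threshold, and the injectable clock are all hypothetical. `max_queue_len` plays the role that tx_queue_warning_len plays today.

```python
import time


class ShardQueueMonitor:
    """Tracks how long the current head has sat at the front of one shard's tx queue."""

    def __init__(self, max_head_age_s, max_queue_len, clock=time.monotonic):
        self.max_head_age_s = max_head_age_s  # hypothetical threshold
        self.max_queue_len = max_queue_len    # analogous to tx_queue_warning_len
        self._clock = clock
        self._head_id = None
        self._head_since = 0.0

    def observe(self, head_txid, is_multi_hop, queue_len):
        """Record the current queue state; return True if it looks stuck."""
        now = self._clock()
        if head_txid != self._head_id:
            # The head changed (or the queue drained), so the queue is
            # making progress: restart the timer.
            self._head_id = head_txid
            self._head_since = now
        if head_txid is None or not is_multi_hop:
            return False
        head_age = now - self._head_since
        # Warn only when both signals agree: an old head and a long queue.
        return head_age >= self.max_head_age_s and queue_len >= self.max_queue_len
```

Requiring both signals is what reduces the noise: a long queue whose head keeps changing is merely busy, and an old head with a short queue is not blocking much work.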

  1. The first milestone would be just to track the problematic state and reduce the noisiness of this warning.
  2. I am sure it is possible to recognise the MULTI/EXEC transaction state where it has finished its current command but still resides in the queue because of the next commands. This will provide even more precise identification that can be added to the warning.
  3. With "chore: add ability to track connections stuck at send" (#4330) we also track the send delay, which could potentially lead to a self-healing mechanism that force-closes connections that are stuck. As a matter of fact, this can be useful for other scenarios such as pubsub. See "The pipeline queue can grow over time when a client maintains a continuous connection" (#4182) for an example.
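
As an illustration of the third milestone, here is a minimal sketch of a policy that picks connections to force-close based on send delay. It does not reflect how #4330 tracks send delay, which is not described in this issue; `SendTracker`, its methods, and the `max_delay_s` threshold are hypothetical names chosen for the example.

```python
import time


class SendTracker:
    """Remembers when each connection's in-flight socket send started."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._send_started = {}  # conn_id -> start time of the in-flight send

    def begin_send(self, conn_id):
        # Keep the earliest start time if a send is already in flight.
        self._send_started.setdefault(conn_id, self._clock())

    def end_send(self, conn_id):
        self._send_started.pop(conn_id, None)

    def stuck(self, max_delay_s):
        """Return ids of connections whose send has been blocked too long."""
        now = self._clock()
        return sorted(
            conn_id
            for conn_id, started in self._send_started.items()
            if now - started >= max_delay_s
        )
```

A periodic task could call `stuck(max_delay_s)` and close the returned connections. Combining this with the tx queue signal, so that only connections blocking a queue head are closed, would be a more conservative variant.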

romange added the enhancement label Dec 26, 2024