improve deadlock detection around slow/unresponsive replies #4372

romange · 2024-12-26T07:48:29Z

The problem:
Consider the case where multi/exec or Lua transactions fetch large bulks of data and their commands get stuck while sending the replies (blocked on socket send). If these transactions are still in the tx queue, the whole queue cannot progress. In the most acute scenarios this can lead to client-initiated deadlocks. See #4182 for an example.

It is easy to simulate with pipeline_queue_limit=10 by running multiple GETs on several huge keys in pipeline mode, together with another connection running multi/exec on the same keys. Once these keys are locked and the GETs are placed into the tx queue, we may create a deadlock: the pipelined connection cannot progress, and this stalls Dragonfly globally.

We have tx_queue_warning_len, which helps identify these scenarios, but it is too noisy because the tx queue length can grow for valid reasons.

Solution: maintain a timer for a multi-hop transaction per shard queue. We will identify a problematic scenario based on two signals: how long a transaction has been at the head of the tx queue, and the queue length.
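To make the two-signal idea concrete, here is a minimal sketch in Python (not Dragonfly's actual code). The class name, the `observe` method, and both thresholds are hypothetical and exist only for illustration. It flags a shard queue only when the same transaction has stayed at the head for too long and the queue is also long.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ShardQueueStallDetector:
    """Hypothetical per-shard tracker; an illustration, not Dragonfly's implementation."""

    max_head_age_sec: float  # assumed threshold: how long one tx may stay at the head
    max_queue_len: int       # assumed threshold: queue length considered suspicious
    clock: Callable[[], float] = time.monotonic
    _head_txid: Optional[int] = None
    _head_since: float = 0.0

    def observe(self, head_txid: Optional[int], queue_len: int) -> bool:
        """Record the current queue head; return True if the queue looks stalled."""
        now = self.clock()
        if head_txid != self._head_txid:
            # The head changed (or the queue drained), so restart the timer.
            self._head_txid = head_txid
            self._head_since = now
        if head_txid is None:
            return False  # empty queue, nothing can be stuck
        head_age = now - self._head_since
        # Both signals must agree; queue length alone is what makes the warning noisy.
        return head_age >= self.max_head_age_sec and queue_len >= self.max_queue_len
```

Requiring both conditions means that a long queue whose head keeps changing, which is the valid case, stays silent.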

The first milestone would be just to track the problematic state and reduce the noisiness of this warning.
I am sure it is possible to recognise the multi-exec transaction state where it has finished its current command but still resides in the queue because of the next commands. This would provide an even more precise identification that can be added to the warning.
With #4330 (chore: add ability to track connections stuck at send) we also track the send delay, which could potentially lead to a self-healing mechanism that force-closes connections that are stuck. As a matter of fact, this can be useful for other scenarios like pubsub. See #4182 (The pipeline queue can grow over time when a client maintains a continuous connection) for an example.
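The self-healing idea in the last point could be sketched roughly as follows. This is a hedged Python illustration, not the mechanism from #4330: the function name, the shape of the input map, and the threshold are all assumptions. Given the time at which each connection entered its current send, it picks the connections that have been blocked longer than a limit and would be candidates for a forced close.

```python
from typing import Dict, List, Optional


def stuck_connections(
    send_started_at: Dict[int, Optional[float]],  # hypothetical: conn id -> send start time, None if idle
    now: float,
    max_send_delay_sec: float,                    # assumed threshold, not an existing Dragonfly flag
) -> List[int]:
    """Return ids of connections whose current send has been blocked too long."""
    return sorted(
        conn_id
        for conn_id, started in send_started_at.items()
        if started is not None and now - started >= max_send_delay_sec
    )


# Connection 1 has been sending for 10s, connection 2 for 0.5s, connection 3 is idle.
candidates = stuck_connections({1: 0.0, 2: 9.5, 3: None}, now=10.0, max_send_delay_sec=5.0)
```

Whether such candidates should actually be closed, or only reported in the warning, is a policy decision left open by the issue.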


romange added the enhancement New feature or request label Dec 26, 2024
