The problem:
Consider the case where MULTI/EXEC or Lua transactions fetch large bulks of data and their commands get stuck while sending replies (blocked on socket send). If such a transaction still sits in the tx queue, the whole queue cannot progress. In the most acute scenarios this can lead to client-initiated deadlocks. See #4182 for an example.
It is easy to simulate with pipeline_queue_limit=10: run multiple GETs on several huge keys in pipeline mode, together with another connection running MULTI/EXEC on the same keys. Once these keys are locked and the GETs are placed into the tx queue, we may create a deadlock, because the pipelined connection cannot progress and it stalls Dragonfly globally. A rough sketch of the workload shape is shown below.
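For illustration only, here is a rough client-side sketch of that workload using hiredis. It only shows the shape of the commands; actually triggering the stall depends on value sizes, running the server with pipeline_queue_limit=10, and socket buffer sizes. The key names, counts, and the ordering of the two connections are assumptions.

```cpp
// Illustrative sketch: one connection pipelines GETs of huge values without
// reading replies, another runs MULTI/EXEC on the same keys.
#include <hiredis/hiredis.h>

int main() {
  redisContext* pipelined = redisConnect("127.0.0.1", 6379);
  redisContext* multi = redisConnect("127.0.0.1", 6379);
  if (!pipelined || pipelined->err || !multi || multi->err) return 1;

  // Connection 1: pipeline many GETs of huge keys and flush them to the
  // server, but deliberately do not read the replies yet, so the server's
  // sends toward this client can block.
  for (int i = 0; i < 64; ++i)
    redisAppendCommand(pipelined, "GET huge:%d", i % 4);
  int done = 0;
  while (!done)
    if (redisBufferWrite(pipelined, &done) != REDIS_OK) break;

  // Connection 2: MULTI/EXEC touching the same keys, so they get locked and
  // the pending GETs pile up in the tx queue behind it. The EXEC reply is
  // large and may itself get stuck on socket send.
  redisReply* r = (redisReply*)redisCommand(multi, "MULTI");
  freeReplyObject(r);
  for (int i = 0; i < 4; ++i) {
    r = (redisReply*)redisCommand(multi, "GET huge:%d", i);
    freeReplyObject(r);
  }
  r = (redisReply*)redisCommand(multi, "EXEC");
  freeReplyObject(r);

  // Eventually drain connection 1's replies.
  for (int i = 0; i < 64; ++i) {
    if (redisGetReply(pipelined, (void**)&r) != REDIS_OK) break;
    freeReplyObject(r);
  }
  redisFree(pipelined);
  redisFree(multi);
  return 0;
}
```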
We have tx_queue_warning_len that helps identify these scenarios, but it is too noisy because the tx queue length can grow for valid reasons.
Solution: maintain a timer for the multi-hop transaction at the head of each per-shard tx queue. We will identify a problematic scenario based on two signals: how long the current head has been stuck at the front of the tx queue, and the queue length.
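A minimal sketch of that idea, not Dragonfly's actual code: the class name, the hook points, and the thresholds below are assumptions for illustration only.

```cpp
// Sketch: per-shard tracking of how long the current head of the tx queue has
// been parked, combined with the queue length, before emitting a warning.
#include <chrono>
#include <cstddef>
#include <cstdio>

class TxQueueWatchdog {
 public:
  using Clock = std::chrono::steady_clock;

  // Called whenever a new transaction becomes the head of this shard's queue.
  void OnNewHead() { head_since_ = Clock::now(); }

  // Called periodically with the current queue length; warns only when both
  // signals look problematic, which is what reduces the noise compared to a
  // pure length-based check.
  void Check(size_t queue_len) const {
    auto stuck_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                        Clock::now() - head_since_).count();
    if (queue_len >= kLenThreshold && stuck_ms >= kHeadStuckMs) {
      std::fprintf(stderr, "tx queue head stuck for %lld ms, queue length %zu\n",
                   static_cast<long long>(stuck_ms), queue_len);
    }
  }

 private:
  static constexpr size_t kLenThreshold = 50;  // assumed value
  static constexpr long kHeadStuckMs = 100;    // assumed value
  Clock::time_point head_since_ = Clock::now();
};
```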
The first milestone would be just to track the problematic state and reduce the noisiness of this warning.
I am sure it is possible to recognize the state of a multi-exec transaction that has finished its current command but still resides in the queue because of its next commands. This would provide even more precise identification that can be added to the warning.