Deadlock when calling lasp:query/1 #316
Comments
Looking at the traces now, I see that the …
I looked more at the traces and found the following call flows leading to this deadlock:
The other call flow:
So it looks like the race is between two flows: an external TCP packet triggers a … I don't know what these flows are supposed to do; my guess, based on variable and function names, is that both want to update some kind of internal state, such as who is in the cluster. I looked into the …
So it looks like …
I'm working on an application that (in my test) runs in a Kubernetes stateful set on 4 pods. There's one Erlang node in each pod. The nodes (in this test) have many processes, but only one kind of process (a gen_server) reads and writes some data to/from lasp. These processes store this data in lasp because they might get migrated to a different node, and when they are started on the new node, they need the data from the old node. There's no concurrent access to these variables, even from within a single node (before a previous refactoring step, they were in the state of the gen_server). These processes are also registered using lasp_pg (I don't know if it matters). Each process has 3 variables (a gcounter, an awmap and an awset_ps). The ID of the gcounter variable on node0 is {key0, latest_committed}, on node1 it's {key1, latest_committed}, etc. These are separate variables with possibly separate values; I'm only mentioning them because they show up in the traces below.

For some reason the system gets stuck on a lasp:query call for the {key0, latest_committed} variable on node0. I took backtraces of the processes in the system, and it looks like this:

1. We're making a lasp:query call that will eventually make a gen_server:call towards the lasp_distribution_backend process.
2. The lasp_distribution_backend process makes a gen_server:call towards lasp_storage_backend.
3. The lasp_storage_backend process makes a gen_server:call towards lasp_ets_storage_backend.
4. The lasp_ets_storage_backend process makes a gen_server:call towards partisan_pluggable_peer_service_manager.
5. The partisan_pluggable_peer_service_manager process makes some kind of call towards partisan_peer_service_events. I haven't actually found this call in the code yet; the backtrace shows gen_event:rpc/2.
6. partisan_peer_service_events makes a gen_server:call towards lasp_distribution_backend, and we have a cycle back to step 2.

The backtrace from the gen_server making the lasp:query call:

Process 1482 is lasp_distribution_backend. Backtrace:

Process 1484 is lasp_storage_backend. Backtrace:
Process 1485 is lasp_ets_storage_backend. Backtrace:

Process 2806 is partisan_pluggable_peer_service_manager. Backtrace:

Process 1410 is partisan_peer_service_events. Backtrace:
And this last process calls the first in the above list. I'm going to follow the code and try to figure out exactly how this happens, so I plan to update this issue when I have more information.
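A process blocked in a synchronous call (gen_server:call, or gen_event:rpc/2 in step 5) waits on exactly one callee. The numbered steps therefore form a wait-for graph in which every process has at most one outgoing edge, and the deadlock is a cycle in that graph. The Python sketch below is a minimal illustration under that assumption; it is not part of lasp or partisan. The edge table was transcribed by hand from the six steps, app_gen_server is a hypothetical label for the process calling lasp:query, and find_cycle is a hypothetical helper.

```python
# Wait-for edges transcribed by hand from the six steps in this issue:
# each process is blocked in a synchronous call to the process it maps to.
# "app_gen_server" is a hypothetical label for the process calling lasp:query.
WAITS_FOR = {
    "app_gen_server": "lasp_distribution_backend",                              # step 1
    "lasp_distribution_backend": "lasp_storage_backend",                        # step 2
    "lasp_storage_backend": "lasp_ets_storage_backend",                         # step 3
    "lasp_ets_storage_backend": "partisan_pluggable_peer_service_manager",      # step 4
    "partisan_pluggable_peer_service_manager": "partisan_peer_service_events",  # step 5 (gen_event:rpc/2)
    "partisan_peer_service_events": "lasp_distribution_backend",                # step 6
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from `start`; return the cycle as a list, or None.

    Every blocked process waits on exactly one callee, so each node has at
    most one outgoing edge and the walk never needs to branch.
    """
    path = []
    position = {}  # process name -> index in path
    node = start
    while node is not None and node not in position:
        position[node] = len(path)
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ends at a process that is not blocked
    # The first revisited process is where the cycle starts.
    return path[position[node]:]


if __name__ == "__main__":
    cycle = find_cycle(WAITS_FOR, "app_gen_server")
    print(" -> ".join(cycle + cycle[:1]))
```

Running the script prints `lasp_distribution_backend -> lasp_storage_backend -> lasp_ets_storage_backend -> partisan_pluggable_peer_service_manager -> partisan_peer_service_events -> lasp_distribution_backend`. The caller is not part of the cycle; it is only stuck behind it. This matches step 6 looping back to step 2 rather than to step 1.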