
[WIP] fix deadlock in concurrent connection disconnection #302

Closed

Conversation

doudou (Contributor) commented Jun 17, 2019

This PR aims at fixing the issue identified in #300. The deadlock essentially happens when disconnecting the same ports from two directions. The easiest way to reproduce it is to disconnect the same connection from both directions at the same time, which reliably deadlocks (associated test in bb6fbc1).

So far, the PR fixes the few easy places. However, a few very large code paths are still executed under lock, as removing the lock broke the tests for reasons that were not obvious to me. I figured you might have more insight.

My gut feeling is that the general design should be to connect/disconnect a channel from a port under the port lock, but wait until the lock is released to actually dismantle the channel. This would ensure that the high-latency part (hitting the transports) is done without affecting the component.
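The idea above can be sketched as follows. This is a minimal illustration of the pattern, not RTT code: all names (`Channel`, `Port`, `dismantle`) are hypothetical, and modern C++ locking is used for brevity. Unlinking happens under the port lock, while the slow teardown runs only after the lock is released, so a concurrent disconnect from the other direction never has to wait on this port's lock during teardown.

```cpp
#include <list>
#include <mutex>

// Hypothetical sketch of the proposed design, not the RTT API.
struct Channel {
    bool dismantled;
    Channel() : dismantled(false) {}
    void dismantle() { dismantled = true; }  // stands in for slow transport teardown
};

struct Port {
    std::mutex lock;
    std::list<Channel*> channels;

    void disconnect(Channel* c) {
        std::list<Channel*> removed;
        {
            std::lock_guard<std::mutex> guard(lock);
            channels.remove(c);   // fast: just unlink under the lock
            removed.push_back(c);
        }
        // Lock released: dismantling no longer runs under the port lock,
        // so it cannot participate in a lock-ordering deadlock.
        for (std::list<Channel*>::iterator it = removed.begin();
             it != removed.end(); ++it)
            (*it)->dismantle();
    }
};
```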

meyerj (Member) commented Jun 24, 2019

Thanks for your patch!

The new problem introduced with #114 is that some buffers must be created only once per input port or once per output port. The first implementation of the new dataflow semantics did not have that port-wide connection lock, but this caused race conditions when connecting concurrently: the new shared buffer or data object instances were sometimes created by two connections at once, so one of them would end up in a de-facto disconnected state. This was fixed by #283, which introduced the per-port connection locks.
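The race described above is the classic unguarded lazy-initialization pattern. A minimal sketch of what the per-port connection lock guarantees (names like `PortState`, `SharedBuffer`, and `getOrCreateBuffer` are hypothetical stand-ins, not the RTT classes): without the lock, two connecting threads could each observe a null buffer and each create their own instance; with it, every connection ends up sharing the same object.

```cpp
#include <memory>
#include <mutex>

// Hypothetical stand-ins; the actual classes live in the RTT dataflow code.
struct SharedBuffer {};

struct PortState {
    std::mutex connection_lock;   // the per-port connection lock from #283
    std::shared_ptr<SharedBuffer> buffer;

    std::shared_ptr<SharedBuffer> getOrCreateBuffer() {
        std::lock_guard<std::mutex> guard(connection_lock);
        if (!buffer)
            buffer = std::make_shared<SharedBuffer>();  // created at most once
        return buffer;  // every connection shares this single instance
    }
};
```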

The ConnectionManager actually plays a subordinate role now with that patch: it still keeps track of the port's connections, but it is no longer involved in the actual dataflow at all. It exists purely for management purposes.

Same issue here: without that lock it can happen that one thread disconnects the port, resulting in the destruction of a shared buffer (ConnInputEndPoint.hpp:91-95), while another thread creates a new connection. That could also lead to race conditions and a channel pipeline that is not consistent with the port connections stored in the ConnectionManager.

My gut feeling is that the general design should be to connect/disconnect a channel from a port under the port lock, but wait until the lock is released to actually dismantle the channel. This would ensure that the high-latency part (hitting the transports) is done without affecting the component.

I agree. Your patch looks good at first sight. I will do some more tests and check why it fails to build on Travis.

const ChannelElementBase::shared_ptr &input = *it++;
input->disconnect(this, false);
removeInput(input.get()); // invalidates input
Inputs::const_iterator found = std::find(inputs.begin(), inputs.end(), channel);
Member
Apparently the iterator needs to be non-const for std::list::splice for compatibility with C++98.
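The compatibility issue the comment refers to: until C++11, `std::list::splice` took plain `iterator` parameters (C++11 relaxed them to `const_iterator`), so the result of `std::find` must be stored as a non-const iterator when it is later passed to `splice`. A generic stand-in sketch, not the RTT sources:

```cpp
#include <algorithm>
#include <list>

// Moves the first element equal to `value` from `outputs` to `removed`.
// Returns true if the value was found and moved.
bool moveOut(std::list<int>& outputs, std::list<int>& removed, int value) {
    // Non-const iterator: accepted by splice in C++98 and later.
    std::list<int>::iterator found =
        std::find(outputs.begin(), outputs.end(), value);
    if (found == outputs.end())
        return false;
    removed.splice(removed.end(), outputs, found);  // relinks the node, no copy
    return true;
}
```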

Outputs removedOutput;
{
RTT::os::MutexLock lock(outputs_lock);
Outputs::const_iterator found = std::find(outputs.begin(), outputs.end(), channel);
Member
Apparently the iterator needs to be non-const for std::list::splice for compatibility with C++98.

@@ -243,6 +243,12 @@ namespace RTT { namespace base {
MultipleInputsChannelElementBase::removeInput(input);
}

virtual void removedInputs(Inputs const& inputs)
{
if (find(inputs.begin(), inputs.end(), last) != inputs.end())
Member
Missing namespace qualifier: std::find?

RTT::os::MutexLock lock(outputs_lock);
for (Outputs::iterator it = outputs.begin(); it != outputs.end(); ++it) {
if (it->disconnected)
disconnectedOutputs.splice(disconnectedOutputs.end(), this->outputs, it);
Member
This loop is invalid: removing the element at it from outputs via splice() invalidates the iterator, at least as an iterator of outputs. The iterator would need to be incremented before the element is removed from this->outputs.
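The fix the review asks for looks like this: advance the iterator before splicing the current element out of `outputs`. After the splice, the old iterator refers into the destination list, so incrementing it would walk the wrong list. Struct and variable names mirror the snippet above but are stand-ins, not the RTT declarations:

```cpp
#include <list>

struct Output {
    bool disconnected;
    explicit Output(bool d) : disconnected(d) {}
};

// Moves all disconnected outputs into disconnectedOutputs without ever
// incrementing an iterator that has already been spliced away.
void collectDisconnected(std::list<Output>& outputs,
                         std::list<Output>& disconnectedOutputs) {
    for (std::list<Output>::iterator it = outputs.begin();
         it != outputs.end(); /* advanced in the body */) {
        std::list<Output>::iterator current = it++;  // increment first
        if (current->disconnected)
            disconnectedOutputs.splice(disconnectedOutputs.end(),
                                       outputs, current);
    }
}
```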

doudou added a commit to rock-core/package_set that referenced this pull request Nov 11, 2019
…ring

This refactoring led to massive deadlocking on a lot of dynamic
disconnection scenarios.

See orocos-toolchain/rtt#302
doudou added a commit to rock-core/package_set that referenced this pull request Nov 12, 2019
…ring

This refactoring led to massive deadlocking on a lot of dynamic
disconnection scenarios.

See orocos-toolchain/rtt#302
@doudou doudou closed this Jun 8, 2022
@doudou doudou deleted the fix_deadlock_in_concurrent_connection_disconnection branch June 8, 2022 17:57