Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread-safe access to graph cache #258

Merged
merged 14 commits into from
Sep 27, 2024
Merged

Thread-safe access to graph cache #258

merged 14 commits into from
Sep 27, 2024

Conversation

Yadunund
Copy link
Member

@Yadunund Yadunund commented Aug 6, 2024

Step 1 in addressing #249.

This PR updates the rmw_context_impl_s class to accurately manage the lifetime of its members while ensure thread-safe data access. For starters, it ensures that the Graph Cache updates and lookups are thread safe.

In subsequent PRs, I will update the rmw_context_impl_s class to store rmw_publisher_data_t, etc with member functions to manage/access their functionalities.

@Yadunund Yadunund changed the title Make rmw_context_impl_s an RAII class Thread-safe access to graph cache Aug 6, 2024
Copy link
Collaborator

@clalancette clalancette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a few things to think about.

rmw_zenoh_cpp/src/rmw_init.cpp Outdated Show resolved Hide resolved
rmw_zenoh_cpp/src/rmw_zenoh.cpp Show resolved Hide resolved
@MichaelOrlov
Copy link

@Yadunund Friendly ping to follow up on this issue.

@MichaelOrlov
Copy link

Discussion from maintenance triage: Decided to assign this issue to @Yadunund

@clalancette
Copy link
Collaborator

rmw_context_impl_s::publish() etc

I think this is what I'm struggling with most in this change. It just doesn't seem right to me to be having one giant class that encapsulates all functionality of the RMW. It does fix the locking problem, but it doesn't seem very elegant.

If we ignore the threading problem for the moment, in my ideal world we'd have a ContextImpl class at the top-level. During rmw_create_node, we'd call ContextImpl::create_node(), which would return a Node object. During rmw_create_publisher, we'd call Node::create_publisher(), which would return a Publisher object. During rmw_publish, we'd call Publisher::publish(). (There would be similar classes for Subscription, ServiceClient, and ServiceServer). And the GraphCache would be embedded inside of the ContextImpl (since that is indeed a session-wide entity). In that class hierarchy, the functionality for each of these entities is encapsulated into its own class, and further they don't have to share one giant lock. This is, incidentally, how zenoh-cpp works (though there is no Node entity there, so you ask the Session to create a Publisher object).

Now we have to think about the locking. If we just did the above, it wouldn't be much different from what we have today, and it wouldn't fix our locking problems. So what we need is for each entity to be able to query its "parent" on whether it is still alive. For instance, in rmw_publish, we'd have to ask the Node whether the Publisher is still alive. But before that, we have to actually ask the ContextImpl whether the Node is still alive. (There's an argument to be made here that we have a similar problem with ContextImpl, but I'm not sure we can solve that with the current RMW API).

Thus, the Node class would keep a list of active Publisher, Subscription, ServiceServer, and ServiceClient objects. The ContextImpl class would keep a list of active Node objects. And the opaque data (rmw_publisher->data, etc) that we return from rmw_create_publisher (and siblings) would be a pointer to the ContextImpl, some kind of identifier for the Node, and some kind of identifier for the Publisher. Then we'd have enough information to walk the entire hierarchy and determine whether the thing we want is still alive. During an rmw_destroy_publisher, we'd remove it from the Node list, and thus further calls wouldn't work. rmw_destroy_subscription is a bit harder, because a callback may race here; we'd have to make the callback ask the node whether the subscription is still alive at this point.

I admit I haven't totally thought through all of the implications here. But I think this proposal creates a class hierarchy, but also allows us to add in additional checking/locking. What do you think of the general idea?

@Yadunund
Copy link
Member Author

rmw_context_impl_s::publish() etc

I think this is what I'm struggling with most in this change. It just doesn't seem right to me to be having one giant class that encapsulates all functionality of the RMW. It does fix the locking problem, but it doesn't seem very elegant.

If we ignore the threading problem for the moment, in my ideal world we'd have a ContextImpl class at the top-level. During rmw_create_node, we'd call ContextImpl::create_node(), which would return a Node object. During rmw_create_publisher, we'd call Node::create_publisher(), which would return a Publisher object. During rmw_publish, we'd call Publisher::publish(). (There would be similar classes for Subscription, ServiceClient, and ServiceServer). And the GraphCache would be embedded inside of the ContextImpl (since that is indeed a session-wide entity). In that class hierarchy, the functionality for each of these entities is encapsulated into its own class, and further they don't have to share one giant lock. This is, incidentally, how zenoh-cpp works (though there is no Node entity there, so you ask the Session to create a Publisher object).

Now we have to think about the locking. If we just did the above, it wouldn't be much different from what we have today, and it wouldn't fix our locking problems. So what we need is for each entity to be able to query its "parent" on whether it is still alive. For instance, in rmw_publish, we'd have to ask the Node whether the Publisher is still alive. But before that, we have to actually ask the ContextImpl whether the Node is still alive. (There's an argument to be made here that we have a similar problem with ContextImpl, but I'm not sure we can solve that with the current RMW API).

Thus, the Node class would keep a list of active Publisher, Subscription, ServiceServer, and ServiceClient objects. The ContextImpl class would keep a list of active Node objects. And the opaque data (rmw_publisher->data, etc) that we return from rmw_create_publisher (and siblings) would be a pointer to the ContextImpl, some kind of identifier for the Node, and some kind of identifier for the Publisher. Then we'd have enough information to walk the entire hierarchy and determine whether the thing we want is still alive. During an rmw_destroy_publisher, we'd remove it from the Node list, and thus further calls wouldn't work. rmw_destroy_subscription is a bit harder, because a callback may race here; we'd have to make the callback ask the node whether the subscription is still alive at this point.

I admit I haven't totally thought through all of the implications here. But I think this proposal creates a class hierarchy, but also allows us to add in additional checking/locking. What do you think of the general idea?

Thanks for the additional thoughts here. I agree such a hierarchy would be ideal. What i'm not clear about is whether Node will have an API to return a Publisher::SharedPtr? My only concern here is that we need to ensure that the only thing that rmw_destroy_publisher does is remove the Publisher::SharedPtr from the container in which it is stored within Node. Then the destructor of Publisher takes care of all the cleaning up. So even if rmw_publish obtained this Publisher::SharedPtr from Node, and is thus keeping it alive while rmw_destroy_publisher is invoked in a separate thread, the Publisher's destructor will only be invoked after rmw_publish returns. This would also be the case if Node is destroyed when rmw_publish is being invoked.

In any case it seems like this is a bridge we will need to cross only after we merge this PR; perhaps when addressing #259? So should we move the discussion there since the changes here simple make rmw_context_impl_s a concrete class with private data members?

@Yadunund
Copy link
Member Author

Yadunund commented Sep 3, 2024

In 7cc52bd, I updated the rmw_context_impl_s class to return a shared_ptr<GraphCache>. The graph_sub_data_handler() cb should still be thread-safe with the new implementation as the callback and other methods that query the shared_ptr<GraphCache> should lock the same mutex within GraphCache.
Hoping this addresses the concern of rmw_context_impl_s class embedding everything within it. I will update #259 in a similar way such that rmw_context_impl_s returns a shared_ptr<NodeData>.

rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp Outdated Show resolved Hide resolved
rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp Outdated Show resolved Hide resolved
rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp Outdated Show resolved Hide resolved
rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp Outdated Show resolved Hide resolved
rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp Outdated Show resolved Hide resolved
Signed-off-by: Yadunund <[email protected]>
Copy link
Contributor

@ahcorde ahcorde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor fixes, otherwise LGTM

rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp Outdated Show resolved Hide resolved
rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp Outdated Show resolved Hide resolved
rmw_zenoh_cpp/src/detail/rmw_context_impl_s.hpp Outdated Show resolved Hide resolved
rmw_zenoh_cpp/src/detail/rmw_context_impl_s.hpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@clalancette clalancette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a rebase to fix conflicts, and we should fix @ahcorde 's comments, but this otherwise looks good to me.

Signed-off-by: Yadunund <[email protected]>
Signed-off-by: Yadunund <[email protected]>
@Yadunund Yadunund merged commit 67ed661 into rolling Sep 27, 2024
8 checks passed
@Yadunund Yadunund deleted the yadu/raii-context branch September 27, 2024 17:50
@Yadunund Yadunund restored the yadu/raii-context branch September 27, 2024 18:13
YuanYuYuan pushed a commit to ZettaScaleLabs/rmw_zenoh that referenced this pull request Sep 30, 2024
* Make rmw_context_impl_s an RAII class

Signed-off-by: Yadunund <[email protected]>

* fix regression with graph_guard_condition not triggering when entity is removed

Signed-off-by: Yadunund <[email protected]>

* Have the context create the zenoh artifacts

Signed-off-by: Yadunund <[email protected]>

* Add comment for session() api

Signed-off-by: Yadunund <[email protected]>

* Style

Signed-off-by: Yadunund <[email protected]>

* Add api to register querying_sub cb in rmw_context_impl_s

Signed-off-by: Yadunund <[email protected]>

* Have rmw_context_impl_s return a shared_ptr to GraphCache

Signed-off-by: Yadunund <[email protected]>

* Add todo on thread safety

Signed-off-by: Yadunund <[email protected]>

* Update rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp

Co-authored-by: Chris Lalancette <[email protected]>
Signed-off-by: Yadu <[email protected]>

* Address feedback

Signed-off-by: Yadunund <[email protected]>

* Do not use allocator for creating graph_guard_condition

Signed-off-by: Yadunund <[email protected]>

* Address feedback

Signed-off-by: Yadunund <[email protected]>

---------

Signed-off-by: Yadunund <[email protected]>
Signed-off-by: Yadu <[email protected]>
Co-authored-by: Chris Lalancette <[email protected]>
imstevenpmwork pushed a commit to ZettaScaleLabs/rmw_zenoh that referenced this pull request Sep 30, 2024
* Make rmw_context_impl_s an RAII class

Signed-off-by: Yadunund <[email protected]>

* fix regression with graph_guard_condition not triggering when entity is removed

Signed-off-by: Yadunund <[email protected]>

* Have the context create the zenoh artifacts

Signed-off-by: Yadunund <[email protected]>

* Add comment for session() api

Signed-off-by: Yadunund <[email protected]>

* Style

Signed-off-by: Yadunund <[email protected]>

* Add api to register querying_sub cb in rmw_context_impl_s

Signed-off-by: Yadunund <[email protected]>

* Have rmw_context_impl_s return a shared_ptr to GraphCache

Signed-off-by: Yadunund <[email protected]>

* Add todo on thread safety

Signed-off-by: Yadunund <[email protected]>

* Update rmw_zenoh_cpp/src/detail/rmw_context_impl_s.cpp

Co-authored-by: Chris Lalancette <[email protected]>
Signed-off-by: Yadu <[email protected]>

* Address feedback

Signed-off-by: Yadunund <[email protected]>

* Do not use allocator for creating graph_guard_condition

Signed-off-by: Yadunund <[email protected]>

* Address feedback

Signed-off-by: Yadunund <[email protected]>

---------

Signed-off-by: Yadunund <[email protected]>
Signed-off-by: Yadu <[email protected]>
Co-authored-by: Chris Lalancette <[email protected]>
@Yadunund Yadunund deleted the yadu/raii-context branch September 30, 2024 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants