Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dispatcher deadlock workaround #10517

Open
wants to merge 1 commit into
base: development
Choose a base branch
from

Conversation

v-lopez
Copy link

@v-lopez v-lopez commented May 16, 2022

Details on #10482

Stopping _device_watcher before unregistering callbacks.

@Nir-Az Nir-Az requested a review from maloel September 8, 2022 18:49
@Nir-Az Nir-Az requested review from maloel and removed request for maloel January 2, 2023 18:52
@@ -195,6 +195,7 @@ device::device(std::shared_ptr<context> ctx,

device::~device()
{
_context->stop_device_watcher();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it'd work, sure -- but why stop the context picking up new devices?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@v-lopez Can you suggest a better solution to this issue?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't looked into the issue further. But I've been using this since then and I haven't appreciated any side effects.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@v-lopez
First, thanks for the PR, we appreciate it.
However, after reading the origin thread I am a little confused, as it seems to mix ROS (initial_reset) and plain libRS where the fix was made.
Can you provide a way for this deadlock to be reproduced with librealsense, either in C++ or Python? Once I understand how to reproduce I can add it to our unit-tests and see if I can solve it. Otherwise we'll likely direct it to our ROS guy to try and reproduce and see if the problem is there.

@Milnaye
Copy link

Milnaye commented Jun 8, 2023

Hi,
I have the same issue using two cameras, ROS and the initial_reset, I can reproduce it consistently by launching the ROS node before plugging the camera. This fix resolve the issue but prevent recovery of unplugged cameras... Adding a delay in the ROS wrapper just after the hardware reset seems to resolve the issue, but it's not a real fix.
Does someone has a better solution ?

@Arun-Prasad-V
Copy link
Contributor

Hi @Milnaye , could you please give more details about your test scenario?

I have the same issue using two cameras, ROS and the initial_reset, I can reproduce it consistently by launching the ROS node before plugging the camera. This fix resolve the issue but prevent recovery of unplugged cameras...

By this I assume, you have one camera always connected and the another camera is connected after calling the launch command. Please correct me if I'm wrong.

For which camera are you trying to do 'initial_reset'? The one which is already connected or the one you are connecting after calling launch command or for both?

@Milnaye
Copy link

Milnaye commented Jun 13, 2023

Hi @Milnaye , could you please give more details about your test scenario?

I have the same issue using two cameras, ROS and the initial_reset, I can reproduce it consistently by launching the ROS node before plugging the camera. This fix resolve the issue but prevent recovery of unplugged cameras...

By this I assume, you have one camera always connected and the another camera is connected after calling the launch command. Please correct me if I'm wrong.

For which camera are you trying to do 'initial_reset'? The one which is already connected or the one you are connecting after calling launch command or for both?

Hi @Arun-Prasad-V,

After reading again what i wrote, i can understand that I wasn't clear at all...

On our robot, the two cameras stay connected all the time, the issue happens only on the second camera checked by the ros node.
To explain more, on the robot we use two ros nodes, we force the serial number of the two cameras so we know where they are, the ros nodes loop between the cameras and always start checking the "camera A" then "camera B". The issue happens only with the node "B", after resetting "camera B". We do an initial reset on both cameras.

I'm able to consistently reproduce the issue on a computer with only one camera, by lauching the ros node before plugging the camera. It happen only with initial_reset to true.

I hope it makes more sense now :)

@Arun-Prasad-V
Copy link
Contributor

Hi @Milnaye,

Thanks for the details. If I understand correctly, you have explained two scenarios where the issue is seen:

  1. Two cameras always connected. Enable initial_reset on both cameras. Issue happens with second camera
  2. Single camera dynamically connected after launching the ros node. Issue happens with initial_reset
    • Is the second camera still connected here?

Could you please share

  • the commands used? I assume you are launching both cameras with single command.
  • the results? maybe a log or screenshot for both scenarios

Also, please share the below details:

Required Info
Camera Model { R200 / F200 / SR300 / ZR300 / D400 }
Firmware Version (Open RealSense Viewer --> Click info)
Operating System & Version {Win (8.1/10) / Linux (Ubuntu 14/16/17) / MacOS
Kernel Version (Linux Only) (e.g. 4.14.13)
Platform PC/Raspberry Pi/ NVIDIA Jetson / etc..
SDK Version { legacy / 2.. }
ROS platform Humble/Foxy/etc..
Language {C/C#/labview/nodejs/opencv/pcl/python/unity }
Segment {Robot/Smartphone/VR/AR/others }

@Milnaye
Copy link

Milnaye commented Jun 15, 2023

Hi @Arun-Prasad-V,

Your description of both scenario are corrects. For the scenario 2, the second camera is not connected.

Required Info Scenario 1
Camera Model Two D455
Firmware Version 5.14.0.0
Operating System & Version custom Linux
Kernel Version (Linux Only) 4.9.253
Platform NVIDIA Jetson nano
SDK Version version 2.53.1
ROS platform Humble
Language C and python
Segment Robot

On scenario 1, we use a python launch file which load the cameras settings and some node parameters. Both camera node are defined separately in this launch file but except the serial number and the usb port, the way we launch them is similar.

camera_advanced_settings.json.txt
camera_node.log.txt
node_param.yaml.txt

This is a error log quite old, but it's still the same. node-26 correspond to "camera_left" and node-27 to "camera_right".
You can see that, after the "Resseting device..." line, the node 27 is stuck.

Required Info Scenario 2
Camera Model One D455
Firmware Version 5.14.0.0
Operating System & Version Ubuntu 22.04
Kernel Version (Linux Only) 5.19.0
Platform PC
SDK Version happen with both 2.53.1 and 2.54.1
ROS platform Humble
Language C and python
Segment Robot

On scenario 2, I use the command
ros2 run realsense2_camera realsense2_camera_node --ros-args -p initial_reset:=true
with a fresh build of librealsense sdk and realsense-ros.

The result log file is similar

@Arun-Prasad-V
Copy link
Contributor

@Milnaye , Thanks for the details. I'm able to reproduce scenario_1 with and without ROS2 wrapper, which also matches with @v-lopez 's observation. Will work on the solution and update.

@Arun-Prasad-V
Copy link
Contributor

@Milnaye , @v-lopez , This issue has been fixed in the 'development' branch. Please refer #10482 (comment).

Please try and let us know if you need any further help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants