Small Memory Leak with Controller Server, Planner Server, and AMCL #1889
Comments
Interesting. I think step 1 would be to figure out whether it's even Navigation2 itself that's leaking, or rclcpp or message filters (which is my bet). Given that the 3 servers you identify are the only 3 servers that get regular information even when static (in the form of sensor data), my guess is that rclcpp or message filters are leaking, not Navigation2 itself. Step 2 I would think of doing is to replace the plugins (NavFn and DWB) with dummies that don't do anything, then see if it still happens. At that point you'd know whether the algorithms are leaking or the server is leaking. After that I'd go at it with a profiler, knowing where I should be looking (algorithm, server, rclcpp). If you had said just controller and planner, I'd think it was costmap, since that would be the same in both; the fact that AMCL is affected too (and at a lower rate) makes me think it's ROS2 / message filters, since those take in sensor data and AMCL takes in about half as much sensor data. The rcutils warnings should be reported to the appropriate projects (not here) with a full traceback of where they were triggered from (to make sure it's a ROS2 error and not user error).
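For step 2, here is a sketch of what a do-nothing planner plugin could look like, written against the Foxy-era nav2_core::GlobalPlanner interface (method signatures differ between Nav2 releases, so treat this as an illustration rather than the exact API):

```cpp
// Hypothetical no-op planner used only to separate "algorithm leaks"
// from "server / middleware leaks": it plans nothing, so any remaining
// memory growth comes from the planner server, costmap, or rclcpp.
#include <memory>
#include <string>

#include "geometry_msgs/msg/pose_stamped.hpp"
#include "nav2_core/global_planner.hpp"
#include "nav2_costmap_2d/costmap_2d_ros.hpp"
#include "nav_msgs/msg/path.hpp"
#include "pluginlib/class_list_macros.hpp"
#include "rclcpp_lifecycle/lifecycle_node.hpp"
#include "tf2_ros/buffer.h"

namespace dummy_planner
{

class DummyPlanner : public nav2_core::GlobalPlanner
{
public:
  void configure(
    rclcpp_lifecycle::LifecycleNode::SharedPtr /*parent*/,
    std::string /*name*/,
    std::shared_ptr<tf2_ros::Buffer> /*tf*/,
    std::shared_ptr<nav2_costmap_2d::Costmap2DROS> /*costmap_ros*/) override {}

  void cleanup() override {}
  void activate() override {}
  void deactivate() override {}

  // Always return an empty path instead of running NavFn.
  nav_msgs::msg::Path createPlan(
    const geometry_msgs::msg::PoseStamped & /*start*/,
    const geometry_msgs::msg::PoseStamped & /*goal*/) override
  {
    return nav_msgs::msg::Path();
  }
};

}  // namespace dummy_planner

PLUGINLIB_EXPORT_CLASS(dummy_planner::DummyPlanner, nav2_core::GlobalPlanner)
```

It would still need the usual pluginlib description XML and an entry in the planner server's parameters to be loaded in place of NavFn; an analogous no-op controller could stand in for DWB.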
Have you been able to look into this more?
I have not. I was able to ignore it through most of our testing, but a solution is still in our backlog.
I ran into the same issue; I'll try to dig into this soon.
This is the GDB backtrace on my system. I'm running ROS1 on a PC that runs the robot in Gazebo and supplies the relevant information to the nav stack. None of these seem to be caused by the nav stack; they're related to rclcpp and fastrtps in my case.
Backtrace for the planner server
Backtrace for controller_server
I'm not sure what this has to do with the ticket; a memory leak doesn't result in a crash report via GDB. CC @EduPonz @JaimeMartin, see the fast-rtps crash report above.
I'm not too familiar with this, but if there is a memory leak, could that result in a process accessing memory that is being manipulated by another process (the one that caused the memory leak)?
No, that's not what this ticket is talking about. A memory leak is memory that is never properly freed, not an invalid access that would result in a buffer overflow or a segmentation fault. As far as I'm aware, we've never had issues with either of those.
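For illustration only (not from the thread), the distinction in C++ terms: a leak is an allocation without a matching free, which grows the process footprint without ever touching invalid memory, so it never produces a GDB crash report on its own.

```cpp
#include <vector>

// Each call allocates a buffer and forgets it. Resident memory grows
// steadily, but no invalid access happens and no segfault results --
// which is the kind of growth this ticket describes.
void on_sensor_message()
{
  auto * scratch = new std::vector<double>(1024);  // never deleted
  (void)scratch;
}

int main()
{
  for (int i = 0; i < 100000; ++i) {
    on_sensor_message();
  }
}
```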
We see similar behavior to @Michael-Equi's on Foxy deb pkgs. We already saw similar behavior here, and also a seg fault issue here.
@maxlein do you think that's from message filters then? Did you capture any valgrind info when running these experiments, so we can home in on where the growth is coming from?
Apologies, this went completely off my radar. The memory leak is reported for CycloneDDS. I'll look into the crash right away!
@SteveMacenski @mrunaljsarvaiya Looking a bit closer at the GDB traces from here, I cannot see a Fast DDS-related crash. From what I can see, there is an ongoing deserialization in
I don't know the details about
No, I didn't. Also, a core dump was sadly not generated...
@maxlein can you file a ticket in message filters about this? I believe you are correct: looking over the nodes that are growing, they all use message filters (the planner/controller via the costmap obstacle and voxel layers, and AMCL directly). You could also easily verify that this is the case by modifying AMCL to not use message filters and showing that it doesn't grow. If that is the case, we should close this, since the bug isn't in navigation2 but in message filters.
Edit: Here's the ticket: ros2/message_filters#49. Please @maxlein add any additional context or images like the ones you showed above there for them to debug with.
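A minimal sketch of that kind of check (node name, topic, and QoS are assumptions, not the actual AMCL code): take the laser scans through a plain rclcpp subscription with no tf2_ros::MessageFilter in the path, drop each message immediately, and watch whether resident memory still climbs.

```cpp
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "sensor_msgs/msg/laser_scan.hpp"

// If this bare subscriber's memory stays flat while the message-filter
// version grows, the leak is above rclcpp (i.e. in message_filters);
// if it grows too, the problem is lower down (rclcpp / rmw).
class ScanOnlyTest : public rclcpp::Node
{
public:
  ScanOnlyTest()
  : rclcpp::Node("scan_only_test")
  {
    sub_ = create_subscription<sensor_msgs::msg::LaserScan>(
      "scan", rclcpp::SensorDataQoS(),
      [](const sensor_msgs::msg::LaserScan::SharedPtr /*msg*/) {
        // Intentionally do nothing with the scan.
      });
  }

private:
  rclcpp::Subscription<sensor_msgs::msg::LaserScan>::SharedPtr sub_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<ScanOnlyTest>());
  rclcpp::shutdown();
  return 0;
}
```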
There are an awful lot of fast-rtps references in there for it not to be an issue in fast-rtps.
Hey, now that you mention it, I'm getting a crash very infrequently in a non-navigation node. I will file a ticket or add to one in that repo. But I am using Cyclone, so I'm thinking this is unrelated to fastrtps.
I'd file that with the rmw for Cyclone.
@maxlein ros2/message_filters#49 (comment)
Is that something you could test? All you'd need to do is clone the master branch of geometry2 and run on that.
@SteveMacenski Yes, we will try that. |
Seems like it's fixed in this PR:
Sweetness. Closing the ticket, since it's not from Nav2 and the solution has been merged for all new distros and proposed for Foxy.
Bug report
Steps to reproduce issue
Run the Nav2 stack with the TB3 simulation for a long duration while recording memory usage using the system monitor.
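As a lighter-weight alternative to the graphical system monitor, something like the following hypothetical helper (not part of the original report) can log the resident set size of controller_server, planner_server, or amcl by PID, so the MB/h figures below are reproducible:

```cpp
#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>

// Read VmRSS (resident set size, reported in kB) for a PID from
// /proc/<pid>/status; returns -1 if the field is not found.
long rss_kib(const std::string & pid)
{
  std::ifstream status("/proc/" + pid + "/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.rfind("VmRSS:", 0) == 0) {
      std::istringstream fields(line.substr(6));
      long kib = 0;
      fields >> kib;
      return kib;
    }
  }
  return -1;
}

int main(int argc, char ** argv)
{
  if (argc < 2) {
    std::cerr << "usage: rss_logger <pid>\n";
    return 1;
  }
  // Print one sample per minute; redirect to a file to plot the growth rate.
  while (true) {
    std::cout << rss_kib(argv[1]) << " KiB" << std::endl;
    std::this_thread::sleep_for(std::chrono::minutes(1));
  }
}
```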
Expected behavior
Memory should reach a stable point at which it no longer increases on the controller, planner, and AMCL nodes, especially when the robot is not actively navigating or being moved around.
Actual behavior
Controller Server shows a consistent memory increase of around 55 MB/h (measured over 3 hours without moving the robot)
Planner Server shows a consistent memory increase of around 50 MB/h (measured over 3 hours without moving the robot)
AMCL shows a consistent memory increase of around 25 MB/h (measured over 3 hours without moving the robot)
Gzserver and gzclient memory usage remains constant (perhaps this information helps in localizing the issue)
Additional information
I have been having issues with the controller and planner servers suddenly crashing after running for long periods of time, especially with limited robot motion or use. In exploring the cause of this issue, I noticed what appear to be minor memory leaks in the troublesome nodes. I tried getting the controller server to crash under gdb so I could get a more helpful segfault report, but have not been able to get it to crash under those conditions (really strange). Outside of gdb, I have experienced frequent crashes of both the controller and planner servers after a couple of hours of limited use or idling.
While monitoring the controller server with gdb I did see some errors; they did not seem to result in a fatal segmentation fault, but they may help debug the issue:
and upon killing the controller server I got the error:
While running the node, I receive lots of repeated warnings such as the one below:
[planner_server-5] [INFO] [1595818100.434728681] [global_costmap.global_costmap_rclcpp_node]: Message Filter dropping message: frame 'scanner_link' at time 15239.380 for reason 'Unknown'
I am aware that the issues I am having could be caused by a multitude of factors, many of which may not be in the scope of navigation2, but given that this seems to be a particularly "severe" issue with nav2-related software, I thought it would be best to start the issue here and see what similar experiences exist and what conditions are like on different computers/setups.