Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the overall performance #109

Closed
xmfcx opened this issue Dec 18, 2023 · 5 comments
Closed

Improve the overall performance #109

xmfcx opened this issue Dec 18, 2023 · 5 comments

Comments

@xmfcx
Copy link
Contributor

xmfcx commented Dec 18, 2023

Related Issue:

Description

Currently, the driver has a lot of latency which can be avoided by some key changes.
Especially for large point clouds.

graph TD
    subgraph nebula_velodyne_hw_interfaces
        udp_packet:::cls_06
        subgraph ReceiveCloudPacketCallback
        udp_packet -->|"each 1206 bytes"| velodyne_packet:::cls_06
        -->|"100ms, aggregate ~560 msgs"| velodyne_scan:::cls_06
        end
    end

    subgraph velodyne_decoder_ros_wrapper
        velodyne_scan -->|"publish through ReceiveScanDataCallback"| ReceiveScanMsgCallback
        subgraph ReceiveScanMsgCallback
            subgraph ConvertScanToPointcloud
                VelodyneScanDecoder_unpack:::cls_03 -->|"For each packet ~560 times"| VelodyneScanDecoder_unpack
                text_00["(Takes about 10ms)"]:::cls_02
            end
            ConvertScanToPointcloud --> nebula_pointcloud:::cls_06
            --> pcl_toROSMsg["pcl_toROSMsg \n(Takes about 15ms)"]:::cls_02
            --> cloud_ex_XYZIRADT:::cls_06
        end
    end

nebula_velodyne_hw_interfaces:::cls_00
velodyne_decoder_ros_wrapper:::cls_00
ReceiveCloudPacketCallback:::cls_01
ReceiveScanMsgCallback:::cls_01
ConvertScanToPointcloud:::cls_02
pcl_toROSMsg:::cls_02

classDef cls_00 fill:#DDDDDD,stroke:#999,stroke-width:1px;
classDef cls_01 fill:#FFCFD2,stroke:#999,stroke-width:1px;
classDef cls_02 fill:#FDE4CF,stroke:#999,stroke-width:1px;
classDef cls_03 fill:#F8F8CC,stroke:#999,stroke-width:1px;
classDef cls_04 fill:#B9FBC0,stroke:#999,stroke-width:1px;
classDef cls_05 fill:#98F5E1,stroke:#999,stroke-width:1px;
classDef cls_06 fill:#8EECF5,stroke:#999,stroke-width:1px;
classDef cls_07 fill:#90DBF4,stroke:#999,stroke-width:1px;
classDef cls_08 fill:#A3C4F3,stroke:#999,stroke-width:1px;
classDef cls_09 fill:#CFBAF0,stroke:#999,stroke-width:1px;
classDef cls_10 fill:#F1C0E8,stroke:#999,stroke-width:1px;
Loading

What can be done?

  • Unpack/parse packages as they arrive
  • Store packages directly in a point cloud message using point_cloud_msg_wrapper
    • Eliminate PCL dependency
  • This way eliminate scan message completely
  • Eliminate 2 node requirement, handle all in one node

Proposed architecture

graph TD
    subgraph nebula_velodyne_hw_interfaces
        udp_packet:::cls_06
        subgraph ReceiveCloudPacketCallback
        udp_packet -->|"each 1206 bytes"| velodyne_packet:::cls_06
        
        velodyne_packet-->VelodyneScanDecoder_unpack:::cls_03
        -->|"aggregate in a point cloud message directly"| cloud_ex_XYZIRADT:::cls_06
        end
    end
    velodyne_packet-->|"publish for storage"| rosbag_recorder:::cls_00

nebula_velodyne_hw_interfaces:::cls_00
ReceiveCloudPacketCallback:::cls_01

classDef cls_00 fill:#DDDDDD,stroke:#999,stroke-width:1px;
classDef cls_01 fill:#FFCFD2,stroke:#999,stroke-width:1px;
classDef cls_02 fill:#FDE4CF,stroke:#999,stroke-width:1px;
classDef cls_03 fill:#F8F8CC,stroke:#999,stroke-width:1px;
classDef cls_04 fill:#B9FBC0,stroke:#999,stroke-width:1px;
classDef cls_05 fill:#98F5E1,stroke:#999,stroke-width:1px;
classDef cls_06 fill:#8EECF5,stroke:#999,stroke-width:1px;
classDef cls_07 fill:#90DBF4,stroke:#999,stroke-width:1px;
classDef cls_08 fill:#A3C4F3,stroke:#999,stroke-width:1px;
classDef cls_09 fill:#CFBAF0,stroke:#999,stroke-width:1px;
classDef cls_10 fill:#F1C0E8,stroke:#999,stroke-width:1px;
Loading

This way, we eliminate complexity, increase efficiency, eliminate cpu load spikes.

Support multiple point types

And if we want to support multiple point cloud types, we can maintain multiple raw PointCloud2 types and in the last step of aggregation, we can add them to these messages before publishing.

cc. @drwnz @mojomex @miursh @yukkysaito @kminoda

@xmfcx
Copy link
Contributor Author

xmfcx commented Dec 18, 2023

@mojomex
Copy link
Collaborator

mojomex commented Dec 18, 2023

Hi @xmfcx, thank you for your proposal!

  • I largely agree with your criticisms, especially that packets should not be copied and/or aggregated if we can in any way avoid it.
  • I also agree that currently we have quite a few wrapper layers around the core functionality that add to the complexity
  • @drwnz and I talked about decoding straight into the pointcloud message buffer before and I agree that we should implement it this way!
  • As for the one-node solution: we have use-cases where we want to run the decoder without the hardware interface, e.g. replaying from rosbag. So ideally, having those as separate entities would be nice.
  • ROS2 composable nodes allow for shared memory pub/sub within a process (ROS2 can pub/sub shared_ptrs within the same process) and thus we can eliminate copying even if we have multiple nodes

We are currently implementing the improvements discussed in the working group 3 weeks ago autowarefoundation#4024, namely a performant/usable straight-to-disk recording solution and per-packet decoding (without aggregating a whole scan). PR will be out shortly.

Also, thank you for pointing out point_cloud_msg_wrapper, looks interesting!

@drwnz
Copy link
Collaborator

drwnz commented Dec 19, 2023

Thank you @xmfcx, as we briefly discussed yesterday, I think we are mostly in agreement with some extra comments as below:

  • As @mojomex has mentioned, the per packet decoding we identified in Enhancement: allow dynamic scan message size to reduce latency #94 and this is definitely going to be implemented - additionally splitting the publishing of packet messages for recording only, and not using the extra pub/sub of the scan or packet messages in the pointcloud creation pipeline
    • Scan message can be quite probably removed entirely if buffer settings allow high frequency topics to be published, recorded and read without too much overhead
  • Removing PCL, agree entirely. Writing directly into the point cloud messages is also fine, as long as it is implemented in a way that the decoder code base can be used as a library without ROS dependency on output (trivial - I imagine the implementation would be such anyway)
  • Combining into one node for an operating sensor is reasonable, and hardware configuration handled separately sounds good to me. I would like to reduce ROS dependence in the core code as much as possible - not a design requirement now as our use cases all involve ROS, but something I think we should keep in mind in any rewrite/refactoring.

Looking forward to putting together a plan 👍

@mojomex
Copy link
Collaborator

mojomex commented Mar 21, 2024

I'm currently working on step 1 of this:

Packets are processed as they arrive, and not sent over DDS or intra-process comms but handled in one wrapper node altogether. It's still in an early stage but should drastically reduce Nebula latency.

@drwnz
Copy link
Collaborator

drwnz commented Nov 25, 2024

The single node refactor is now complete and all merged into main branch, so I will close this issue. Thank you everyone, it has made a big difference in performance!

@drwnz drwnz closed this as completed Nov 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants