arXiv 2022, Haiyan Wang, Sequential Point Clouds: A Survey.
Statistics: 🔥 code is available & stars >= 100 | ⭐ citation >= 50
- 4D MinkNet [14] 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. [CVPR 2019] [page] [cite 587] ⭐
- PSTNet [22] PSTNet: Point spatio-temporal convolution on point cloud sequences. [ICLR 2021] [cite 25]
- MeteorNet [48] MeteorNet: Deep learning on dynamic 3D point cloud sequences. [ICCV 2019] [cite 102] ⭐
- PointRNN [21] PointRNN: Point recurrent neural network for moving point cloud processing. [arxiv 2019] [cite 40]
Discussion
- Detection and segmentation tasks, which require better semantic understanding, benefit more from ConvNet-based feature learning.
- Long-range sequence tasks such as action recognition and object tracking are better suited to RNNs.
- VoxelNet [128] [] [cite ]
- PointPillars [44] [] [cite ]
- PointFlowNet [5] [] [cite ]
- Scalable Scene Flow [41] [] [cite ]
- VoxFlowNet [63] [] [cite ]
- PV-RAFT [100] [] [cite ]
- FlowNet3D [47, pioneer work] [] [cite ]
- FlowNet3D++ [98] [] [cite ]
- Dense RGBD Scene Flow [79, seg] [] [cite ]
- Festa [94] [] [cite ]
- Kinet No Pain, Big Gain: Classify Dynamic Point Cloud Sequences with Static Models by Fitting Feature-level Space-time Surfaces. [CVPR 2022] [cite 1]
  To capture 3D motions without explicitly tracking correspondences, we propose a kinematics-inspired neural network (Kinet) by generalizing the kinematic concept of ST-surfaces to the feature space. By unrolling the normal solver of ST-surfaces in the feature space, Kinet implicitly encodes feature-level dynamics and gains advantages from the use of mature backbones for static point cloud processing.
Discussion
- Overall, the point-based and lattice-based methods outperform the voxel-based methods by a large margin.
- Almost all types of methods demonstrate good generalization from the synthetic domain (e.g., FlyingThings3D) to the real-world domain (e.g., KITTI).
- The unsupervised methods achieve comparable performance even without any supervision, and unsupervised learning is becoming an increasingly popular research direction in the community.
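For reference, the headline number in these scene flow comparisons is the 3D end-point error (EPE); a minimal sketch, assuming per-point predicted and ground-truth flow tensors:

```python
import torch

def epe3d(pred_flow: torch.Tensor, gt_flow: torch.Tensor) -> torch.Tensor:
    """Mean 3D end-point error: average L2 distance between predicted
    and ground-truth per-point flow vectors, both of shape (N, 3)."""
    return torch.norm(pred_flow - gt_flow, dim=-1).mean()
```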
- FaF [51] 4D detection, tracking, motion forecasting, voxel+2Dconv, early+late fusion
- SSD [117] FaF follows the detection pipeline of SSD
- SECOND [115] sparse conv on voxels
- IntentNet [11] detection + intent prediction, BEV format
- What You See [32] 2.5D data (RGBD or range image), built upon PointPillars, early+late fusion
- Yin et al. [119] AST-GRU, PMPNet
- Huang et al. [35] the first to model temporal relations among SPL with an RNN-based (LSTM) schema
- YOLO4D [18]
- McCrae et al. [53] PointPillars as the baseline
- Qi et al. [65, MVF++] followed a method similar to [32]: they aggregated temporal information by transforming other point cloud frames into the current one to remove ego-motion, and encoded time offsets as an additional feature (see the sketch below).
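A minimal sketch of that aggregation strategy, assuming a 4x4 ego-pose per frame (the function and argument names are illustrative, not from [32] or [65]):

```python
import numpy as np

def align_to_current(points_t, pose_t, pose_cur, dt):
    """Warp a past frame into the current frame's coordinates (removing
    ego-motion) and append the time offset dt as an extra feature channel."""
    T = np.linalg.inv(pose_cur) @ pose_t                  # relative 4x4 pose
    homog = np.hstack([points_t, np.ones((len(points_t), 1))])
    aligned = (homog @ T.T)[:, :3]
    return np.hstack([aligned, np.full((len(points_t), 1), dt)])  # (N, 4)
```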
- MVF [127]
Discussion
- By using SPL data and devising spatio-temporal feature extraction techniques for object detection, false bounding boxes are largely suppressed to ensure temporal consistency, which improves the overall detection accuracy of multi-frame methods.
- RNN-based networks focus on exploiting temporal relations among long-range time series, while high-level semantic understanding tasks such as detection prefer temporal consistency in both the spatial and temporal domains.
- Almost all multi-frame detection methods are restricted to fewer than 10 frames, so long-range SPL object detection remains a challenging problem.
- AB3DMOT [103] a compact baseline: pre-trained 3D object detector + 3D Kalman filter with a constant-velocity model + Hungarian algorithm (simplified sketch below)
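A simplified sketch of the baseline's two ingredients; note that AB3DMOT's actual state also includes box size and heading, and its association cost is 3D IoU rather than centroid distance:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

DT = 0.1                                    # assumed LiDAR frame interval
F = np.eye(6); F[:3, 3:] = DT * np.eye(3)   # constant-velocity transition
H = np.eye(3, 6)                            # we observe positions only

def kf_predict(x, P, Q=np.eye(6) * 1e-2):
    """Kalman prediction for a track state [x, y, z, vx, vy, vz]."""
    return F @ x, F @ P @ F.T + Q

def kf_update(x, P, z, R=np.eye(3) * 1e-1):
    """Kalman update with a detected 3D centroid z."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return x + K @ (z - H @ x), (np.eye(6) - K @ H) @ P

def associate(track_pos, det_pos):
    """Hungarian matching on pairwise centroid distance."""
    cost = np.linalg.norm(track_pos[:, None] - det_pos[None, :], axis=-1)
    return linear_sum_assignment(cost)      # (track idx, detection idx)
```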
- Chiu et al. [13] 3D Kalman filter with a constant linear and angular velocity model; Mahalanobis distance for the data association process and covariance matrices for the state prediction process (snippet below).
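The Mahalanobis gate of [13], sketched with the same Kalman notation as above:

```python
import numpy as np

H = np.eye(3, 6)   # same observation model as in the sketch above

def mahalanobis(z, x_pred, P_pred, R=np.eye(3) * 1e-1):
    """Detection-to-track distance scaled by the innovation covariance,
    used for data association in place of an IoU-based cost."""
    innov = z - H @ x_pred
    S = H @ P_pred @ H.T + R
    return float(np.sqrt(innov @ np.linalg.inv(S) @ innov))
```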
- PointTrackNet [95] conducts object detection on two consecutive point cloud frames, refined by an association model.
- P2B [69] a point-wise schema, no Kalman filter, end-to-end network; treated the tracking task as a detection task, inspired by VoteNet; better than [27].
- Leveraging shape completion [27]
- Giancola et al. [27] proposed the first 4D MOT Siamese network structure
- FaF [51] as mentioned before: 4D detection, tracking, motion forecasting, voxel+2Dconv, early+late fusion; solved the tracking problem in an associative manner. In addition to the SPL input, they fed another modality, RGB images, to the network as well.
- DSM [23] predicted object proposals using a Detection Network from the input point cloud and RGB sequence. After formulating discrete trajectories, a linear optimization process was utilized to generate the final tracking results.
- GNN3DMOT [106] unlike [103], which extracts object features independently before Hungarian data association, GNN3DMOT offered a multi-modality feature extractor and was the first to introduce a graph-based pipeline.
- ComplexerYOLO [82] generated semantic segmentation maps from input images, back-projected them to 3D space to obtain class-aware point clouds, and predicted 3D bounding boxes from the voxelized semantic point cloud.
- mmMOT [122] Robust multi-modality multi-object tracking
Discussion
- Joint 2D&3D methods are more frequently used in recent research and achieve relatively higher performance, which shows the benefit of additional modalities.
- Most high-performance methods still require an additional 2D input to ensure tracking accuracy, which is a limitation tied to the extra data: in real self-driving scenarios, processing multiple modalities at the same time usually costs considerably more.
- For almost all 3D MOT methods, tracking performance depends on detection performance. Only PointTrackNet [95] and P2B [69] are fully end-to-end pipelines that break the limit of an off-the-shelf detector. However, their performance is not yet satisfactory, which leaves room for improvement in future research on this track.
(The methods introduced under simple gathering should all be single-frame approaches, e.g., the classic PointNet; a minimal sketch follows.)
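For context, a minimal single-frame PointNet-style classifier, i.e., a shared per-point MLP followed by a symmetric max-pool (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class MiniPointNet(nn.Module):
    """PointNet in miniature: per-point shared MLP + order-invariant max-pool."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 1024, 1), nn.ReLU(),
        )
        self.head = nn.Linear(1024, num_classes)

    def forward(self, pts):                    # pts: (B, 3, N)
        feats = self.mlp(pts)                  # (B, 1024, N) per-point features
        global_feat = feats.max(dim=2).values  # symmetric aggregation over points
        return self.head(global_feat)
```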
- The input point clouds are primarily projected to the BEV (Bird’s Eye View) or the spherical space and then 2D segmentation pipelines can be easily applied to 2D projected data.
- Zhang et al. [121] and PolarNet [123] followed the BEV (Bird’s Eye View) projection track.
- Studies [59], [107], [108], [114] followed the spherical projection track, treating the range image as the input (see the projection sketch after this list).
- Some studies [34], [50], [54], [72], [89] transferred point cloud to voxel representation and adopted 3D conv.
- Papers [76], [84] splatted point cloud into the permutohedral lattice space to perform sparse conv.
- Octnet [73], PointConv [110], KPConv [90]
- Inspired by PointNet and PointNet++, a tremendous number of point-based methods such as [19], [33], [40], [124], [126] have been investigated to estimate semantic scene labels for point clouds.
- Some other methods such as [12], [92], [116], [125] introduced the attention mechanism.
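A common form of the spherical (range-image) projection mentioned above; the vertical field-of-view constants below follow a Velodyne HDL-64-style sensor and should be adjusted per dataset:

```python
import numpy as np

def range_projection(points, H=64, W=2048, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) point cloud onto an H x W range image."""
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    r = np.linalg.norm(points, axis=1)
    yaw = -np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(points[:, 2] / np.maximum(r, 1e-8))
    u = 0.5 * (yaw / np.pi + 1.0) * W                         # azimuth -> column
    v = (1.0 - (pitch - fov_down) / (fov_up - fov_down)) * H  # elevation -> row
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)
    img = np.full((H, W), -1.0, dtype=np.float32)             # -1 = no return
    img[v, u] = r
    return img
```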
- 4D MinkNet [14] the first method to apply a deep convolutional network to high-dimensional data such as SPL (usage sketch below).
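A minimal usage sketch with MinkowskiEngine (API details vary across versions; coordinates are assumed to be already quantized to integers, with the batch index in column 0 and time in the last column):

```python
import torch
import MinkowskiEngine as ME

N = 1000
coords = torch.randint(0, 50, (N, 5), dtype=torch.int32)  # (b, x, y, z, t)
coords[:, 0] = 0                        # single batch element
feats = torch.rand(N, 3)                # e.g. per-point intensity/RGB features

x = ME.SparseTensor(features=feats, coordinates=coords)
conv = ME.MinkowskiConvolution(in_channels=3, out_channels=32,
                               kernel_size=3, dimension=4)
y = conv(x)                             # sparse 4D convolution over (x, y, z, t)
```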
- SpSequenceNet [80] to better fuse global and local features: Cross-frame Global Attention (CGA) and Cross-frame Local Interpolation (CLI); followed the U-Net design of SSCN [29]
- MeteorNet [48] built MeteorNet-Seg for point-wise semantic label prediction, using a structure similar to PointNet++; MeteorNet-Seg harnessed the Meteor-ind module and the early-fusion strategy.
- PSTNet [22] PST (transposed) conv; PSTNet was more compact yet effective, while 4D MinkNet [14] required a relatively large computation cost.
- Duerr et al. [17] projected each point cloud frame to the image plane, dubbed the range image; the semantic feature is perpetually reused rather than used just once as in SpSequenceNet [80]; two recurrent strategies for feature fusion: Residual Net + Residual Net.
- Panoptic segmentation is a joint segmentation task merging semantic segmentation and instance segmentation; it was first introduced in [43] in the image space and further extended from image to video by [42].
- Aygun et al. [4] first proposed a 4D panoptic segmentation pipeline that infers the semantic class of each point while identifying instance IDs; one major contribution is a new point-centric evaluation metric, LSTQ (LiDAR Segmentation and Tracking Quality).
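(LSTQ combines a class-agnostic association score S_assoc with a semantic classification score S_cls, roughly LSTQ = sqrt(S_assoc × S_cls); see [4] for the exact definition.)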
- PanopticTrackNet [36] blended the panoptic segmentation and multi-object tracking tasks; a multi-head end-to-end network; takes continuous RGB frames or point clouds as input.
Discussion
- Additional temporal data improves the overall segmentation accuracy by a large margin compared to static point cloud methods.
- From Table 11, point-based convolution outperforms grid-based convolution in terms of both efficacy and efficiency.
- Overall segmentation performance is still limited on moving object classes, which shows the large impact of motion information.
- The panoptic segmentation methods significantly outperform other basic segmentation methods by exploring holistic semantic scene understanding. Increasing the number of scans brings consistent performance gains.
The methods below are all developed following the object detection-tracking-forecasting schema.
- FaF [51] was also the first to propose a holistic network that jointly conducts object detection, tracking, and motion forecasting from SPL input.
- IntentNet [11] extended FaF [51] by predicting intent, defined as the combination of the target behavior (e.g., moving direction) and the motion trajectory. Besides the SPL input, it took an extra rasterized map as network input; these signals (roads, traffic lights, traffic signs, etc.) provide a strong motion prior and contribute substantially to intent prediction.
- NMP [120] further extended IntentNet [11] to integrate motion planning into the end-to-end motion forecasting system. Instead of just predicting the moving angle as IntentNet [11] does, the motion planner generates one optimal trajectory with minimum cost; the multi-modality models were trained together in an end-to-end manner.
- SpAGNN [10] was also developed based on IntentNet [11], adding an interaction model at the end for motion prediction; graph-based conv. One major problem of the occupancy grid representation is that it is hard to find temporal correspondences between cells; it also excludes object class information.
- Schreiber et al. [78] converted point cloud frames to a sequence of dynamic occupancy grid maps (OGMs) and fed them to a ConvLSTM (minimal cell sketch below); added skip connections to the RNN.
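A minimal ConvLSTM cell for grid inputs such as OGM sequences (a sketch of the mechanism, not Schreiber et al.'s exact architecture):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM cell whose gates are computed with 2D convolutions,
    preserving the spatial layout of grid-shaped inputs."""
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):           # x: (B, in_ch, H, W)
        h, c = state                       # hidden/cell: (B, hid_ch, H, W)
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
```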
- MotionNet [109] combined BEV and occupancy map representations and devised a novel representation named the BEV map; exploited a novel spatio-temporal pyramid network (STPN) to extract hierarchical features; lightweight spatio-temporal convolution (STC) blocks (sketch below).
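The factorized spatio-temporal idea behind STC, sketched as a per-frame 2D spatial convolution followed by a pseudo-1D temporal convolution (channel sizes are illustrative, not MotionNet's exact block):

```python
import torch
import torch.nn as nn

class STCBlock(nn.Module):
    """Factorized spatio-temporal conv on BEV sequences of shape (B, C, T, H, W)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1))   # 2D conv applied per frame
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))  # pseudo-1D conv along time

    def forward(self, x):
        return torch.relu(self.temporal(torch.relu(self.spatial(x))))
```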
The range map comes from the spherical projection of point clouds.
- LaserFlow [57] a multi-sweep fusion architecture was proposed to solve the coordinate-system misalignment problem; transformer sub-network; uncertainty curriculum learning.
The SPF (Sequential Point Cloud Forecasting) task is defined as predicting the future M point cloud frames given the previous N frames. Instead of forecasting future point cloud information at the object level, SPF predicts the whole-scene point clouds, including foreground objects and the static background scene.
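Whole-scene forecasts are naturally scored with point-set distances; a minimal symmetric Chamfer distance, assuming predicted and ground-truth scenes as (N, 3) and (M, 3) tensors:

```python
import torch

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point sets a: (N, 3) and b: (M, 3)."""
    d = torch.cdist(a, b)                  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```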
- Sun et al. [87] devised a ConvLSTM structure to predict future point cloud frames instead of using the 1D LSTM in [104].
- Deng et al. [15] adopted the scene flow embedding [47] to model the temporal relations among four input point cloud frames; PointNet++ [68] and EdgeConv [97] were introduced to extract 3D spatial features.
- Weng et al. [104] first investigated the SPF task and proposed SPFNet; uses a forecast-then-detect schema to replace the conventional detect-then-forecast idea; proposed a new evaluation protocol.
- Mersch et al. [56] proposed to utilize 3D conv; skip connections and horizontal circular padding were introduced to capture detailed spatio-temporal information.
- S2net [102]
Discussion
- Though the BEV representation is more frequently used, methods adopting the range view representation achieve better performance because more complete information is embedded.
- Errors increase sharply as the time range is extended, which shows the limitation in handling longer-range SPL data.
- Unsupervised learning Though there are a few unsupervised methods for scene flow estimation, most existing research on sequential point clouds still relies on ground-truth labels as the supervision signal.
- Longer-range temporal dependency One possible solution is to exploit point cloud compression techniques, such as utilizing flow information to fill the temporal gaps. Meanwhile, transformers have proven to be quite good at modeling temporal attention and capturing long-range dependencies.
- Multitask learning One possible solution is to jointly learn essential features (e.g., semantic flow) across multiple tasks.