VoxelNet, proposed in [1], is a milestone because it minimizes the effort of manual preprocessing through automatic feature extraction. SECOND, the sparsely embedded convolutional detection model proposed in [2], replaces the regular convolutional layers with sparse convolutions and thereby achieves a significant speed-up.
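The speed-up from sparse convolution comes from computing outputs only at occupied voxels, which are a small fraction of the grid in a LiDAR scan. The following is a toy, dictionary-based sketch of a submanifold sparse convolution, purely for intuition; the function and variable names are illustrative assumptions, not the actual API of SECOND or any sparse-convolution library.

```python
import numpy as np

def submanifold_sparse_conv(features, weights):
    """Toy submanifold sparse convolution.

    features: dict mapping active voxel coordinate (z, y, x) -> feature
              vector of shape (C_in,)
    weights:  dict mapping kernel offset (dz, dy, dx) -> weight matrix
              of shape (C_out, C_in)

    Outputs are computed only at voxels that are already active, so the
    set of active sites (and hence the cost) never grows, unlike a dense
    convolution over the full grid.
    """
    out = {}
    for coord in features:
        acc = None
        for offset, W in weights.items():
            neighbor = tuple(c + o for c, o in zip(coord, offset))
            f = features.get(neighbor)
            if f is not None:
                contrib = W @ f
                acc = contrib if acc is None else acc + contrib
        out[coord] = acc
    return out
```

A dense 3D convolution would visit every cell of the voxel grid; here the loop visits only the occupied coordinates, which is the essence of the saving.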
The structure of the two models is shown in Figure 2.2. The voxel grouping and sampling step processes the raw point cloud to reduce its density and make its point distribution more uniform.
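The grouping and sampling step can be sketched as follows: each point is assigned to a voxel by integer division of its coordinates, and at most a fixed number of points is randomly kept per voxel (VoxelNet uses T = 35 for cars; the function signature here is a hypothetical illustration, not the authors' code).

```python
import numpy as np

def voxelize(points, voxel_size, max_points_per_voxel=35, seed=0):
    """Group points into voxels and randomly sample at most
    max_points_per_voxel points from each occupied voxel.

    points: (N, C) array whose first three columns are x, y, z.
    Returns a dict mapping voxel coordinate -> (<=T, C) point array.
    """
    coords = np.floor(points[:, :3] / voxel_size).astype(np.int32)
    voxels = {}
    for point, coord in zip(points, map(tuple, coords)):
        voxels.setdefault(coord, []).append(point)

    rng = np.random.default_rng(seed)
    out = {}
    for coord, pts in voxels.items():
        pts = np.stack(pts)
        if len(pts) > max_points_per_voxel:
            # random sampling reduces density and evens out the
            # point distribution across voxels
            idx = rng.choice(len(pts), max_points_per_voxel, replace=False)
            pts = pts[idx]
        out[coord] = pts
    return out
```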
The voxel feature learning network then takes the processed point cloud and generates a 3D feature map of voxel-wise features. The 3D convolutional layers then extract interactions between voxels along the z-axis, the axis orthogonal to the ground.
Once the z-axis is eliminated, the 3D feature map is converted to a 2D feature map and is ready for the region proposal network (RPN), as if it were a 2D object detection task.
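Concretely, after the strided 3D convolutions have shrunk the depth dimension, the remaining depth slices are folded into the channel dimension, giving a 2D bird's-eye-view feature map. The shapes below are illustrative assumptions, not the exact sizes from either paper:

```python
import numpy as np

# 3D feature map after the 3D convolutional layers: (C, D, H, W),
# where D (the z-axis) has been reduced to a small value by striding
feature_map_3d = np.random.rand(64, 2, 400, 352)

# fold the residual depth dimension into the channels so the RPN
# can treat the result as an ordinary 2D feature map
C, D, H, W = feature_map_3d.shape
feature_map_2d = feature_map_3d.reshape(C * D, H, W)   # (128, 400, 352)
```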
[1] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3d object detection,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4490–4499.
[2] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional detection,” Sensors (Basel, Switzerland), vol. 18, 2018.