Region proposal networks (RPNs) are a key component of many object detection models. The input to an RPN is typically an extracted feature map. Every pixel of the feature map serves as an anchor point, and each anchor point can carry multiple anchors, which are candidate bounding boxes. Because the learning process starts from these dummy initial guesses, the anchors are adjusted during training so that boxes containing objects match the ground truth better.
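The anchor layout above can be sketched as follows; the anchor sizes, the feature-map stride, and the (center, width, length) parameterization are illustrative assumptions, not the exact values used by any particular model:

```python
import torch

def generate_anchors(feat_h, feat_w, stride, sizes=((1.6, 3.9), (1.9, 4.5))):
    """Place one anchor per size at every feature-map pixel (anchor point)."""
    ys, xs = torch.meshgrid(torch.arange(feat_h), torch.arange(feat_w), indexing="ij")
    # Anchor centers in input coordinates: each feature-map pixel maps back
    # through the stride of the feature extractor.
    centers = torch.stack([(xs + 0.5) * stride, (ys + 0.5) * stride], dim=-1)
    anchors = []
    for w, l in sizes:  # hypothetical (width, length) priors
        wl = torch.tensor([w, l]).expand(feat_h, feat_w, 2)
        anchors.append(torch.cat([centers, wl], dim=-1))  # (H, W, 4): cx, cy, w, l
    return torch.stack(anchors, dim=2).reshape(-1, 4)     # (H * W * num_sizes, 4)

anchors = generate_anchors(4, 4, stride=8)
print(anchors.shape)  # torch.Size([32, 4]): 4 * 4 anchor points, 2 anchors each
```

During training, a regression branch would then learn offsets from these fixed anchors to the ground-truth boxes.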

In SECOND and VoxelNet, the region proposal network follows the (sparse) convolutional network; its input is the 2D, image-like extracted feature map.

The output of the RPN consists of a direction map for object direction recognition, a probability score map for detection, and a regression map for determining the bounding boxes.

The region proposal network proceeds in stages: first, downsample H and W by half with a convolutional layer of stride 2; second, pass the result through several stacked convolutional layers of stride 1; third, repeat steps one and two twice more, each time taking the previous stage's output as input. Each convolutional layer is followed by a batch normalization layer and a ReLU layer. The outputs of the three stages therefore have dimensions Nchannel × 1/2 H × 1/2 W, Nchannel × 1/4 H × 1/4 W, and Nchannel × 1/8 H × 1/8 W. The outputs of the three stages are then upsampled and concatenated along the channel axis to form a high-resolution feature map.
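A minimal PyTorch sketch of these three stages, with the channel count and the number of stacked stride-1 layers as assumed hyperparameters (the actual values differ between papers):

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride):
    # Each convolutional layer is followed by batch normalization and ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class RPNBackbone(nn.Module):
    """Three stages: a stride-2 downsampling conv plus stacked stride-1 convs,
    each stage upsampled back to H/2 x W/2 and concatenated along channels."""
    def __init__(self, in_ch=128, ch=128, num_stacked=3):
        super().__init__()
        self.stages = nn.ModuleList()
        self.ups = nn.ModuleList()
        prev = in_ch
        for i in range(3):
            layers = [conv_block(prev, ch, stride=2)]
            layers += [conv_block(ch, ch, stride=1) for _ in range(num_stacked)]
            self.stages.append(nn.Sequential(*layers))
            # Deconvolution (transposed conv) whose stride is the upsampling factor.
            self.ups.append(nn.Sequential(
                nn.ConvTranspose2d(ch, ch, 2 ** i, stride=2 ** i, bias=False),
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
            ))
            prev = ch

    def forward(self, x):
        outs = []
        for stage, up in zip(self.stages, self.ups):
            x = stage(x)          # H and W halved at each stage
            outs.append(up(x))    # all brought back to H/2 x W/2
        return torch.cat(outs, dim=1)  # (N, 3 * ch, H/2, W/2)

feat = torch.randn(1, 128, 64, 64)
out = RPNBackbone()(feat)
print(out.shape)  # torch.Size([1, 384, 32, 32])
```

The stage outputs at H/2, H/4, and H/8 resolution end up stacked into a single 384-channel map at H/2 × W/2.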

The upsampling is performed by deconvolutional layers. Despite the name, these are unrelated to deconvolution in signal processing; the operation is actually a transposed convolution, and it provides a learnable upsampling. For simplicity, consider a unit-stride, no-padding deconvolutional layer with an M×M kernel and an N×N input: it is equivalent to a convolutional layer with padding of size M − 1 applied to the flipped kernel, producing an (N + M − 1) × (N + M − 1) output. A padding of size p for a deconvolutional layer means removing the outermost p rows and columns from that output. A stride of size s corresponds to a stride of 1/s in the equivalent convolution, so the stride of a deconvolutional layer equals its upsampling factor.
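These equivalences can be checked numerically; the sketch below uses PyTorch's functional API on a single-channel example:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 5, 5)   # N = 5
k = torch.randn(1, 1, 3, 3)   # M = 3

# Unit-stride, no-padding transposed convolution: output is (N + M - 1) square.
deconv = F.conv_transpose2d(x, k)
print(deconv.shape)           # torch.Size([1, 1, 7, 7])

# Equivalent ordinary convolution: pad by M - 1 and use the 180-degree
# rotated (flipped) kernel.
conv = F.conv2d(x, k.flip(-1, -2), padding=2)
print(torch.allclose(deconv, conv, atol=1e-5))  # True

# A stride equal to the upsampling factor: stride 2 doubles H and W here.
up = F.conv_transpose2d(x, torch.randn(1, 1, 2, 2), stride=2)
print(up.shape)               # torch.Size([1, 1, 10, 10])
```

The general output-size rule for a transposed convolution is (N − 1)·s − 2p + M, which reduces to the cases above.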

Finally, the high-resolution feature map obtained after upsampling and concatenation is fed into three independent 1×1 convolutional layers that produce the probability score map, the regression map, and the direction map. The final detection results are generated by passing these three maps through a decision scheme.
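The three heads can be sketched as follows; the channel counts are assumptions for illustration (A anchors per location, 7 box regression targets such as center, size, and yaw as commonly used for 3D boxes, and 2 direction bins):

```python
import torch
import torch.nn as nn

C, A = 384, 2                     # fused channels, anchors per anchor point
feat = torch.randn(1, C, 32, 32)  # high-resolution fused feature map

cls_head = nn.Conv2d(C, A * 1, 1)  # probability score map: one score per anchor
reg_head = nn.Conv2d(C, A * 7, 1)  # regression map: 7 box offsets per anchor
dir_head = nn.Conv2d(C, A * 2, 1)  # direction map: 2-way direction classification

print(cls_head(feat).shape, reg_head(feat).shape, dir_head(feat).shape)
```

Each head shares the same spatial grid, so every anchor point yields aligned score, regression, and direction predictions that the decision scheme can combine.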