We know that the R-CNN opened a new prospect in the field of computer vision, but it is time and resource consuming and a multi-stage pipeline (three stages: ConvNet, SVM, bounding box regressor) - not end to end.

Fast R-CNN is proposed to solve the above problems :running_man:

To begin with, its architecture first.

  1. Inputs to be an entire image and a set of object proposals (via selective search)
  2. Create a conv feature map using the whole image
  3. Run through a region of interest (RoI) pooling layer and extract a fixed-length feature vector from each object proposal
  4. Through FC layers in series, we will get two outputs:
    • softmax probability for classifying into K+1 classes (1 for background)
    • real-valued numbers for estimating four points of a bounding box regression

2021-03-10-fcnn2


Focus on difference between R-CNN and Fast R-CNN.

The first thing you might be unfamilar with is RoI Pooling layer. Considering the R-CNN structure, since the input size for the last FC layer should be fixed, R-CNN warps all pixels in a tight box before computing CNN features.

However, what if adjusting the image size after all the CNN process?

RoI Pooling Layer

2021-03-10-fastrcnn1

After creating a conv feature map (step 2), the RoI pooling layer uses max pooling on region proposals (black box with h x w in the left side) to convert the image size into a fixed size of H x W (in the right side of the picture). Whatever the size of input image is, the RoI pooling layer extracts a fixed-length feature vector needed for FC layers.

RoI max pooling splits RoI with h x w into an output of H × W that is consisted with some sub-windows of approximate size h/H × w/W and then take a max-pooling to each sub-area. Assume that h=5, w=7, H=2 and W=2. Then the area for each sub-window (look at four black boxes in the middle) would be 2x2, 3x3 like that.

Remember that in R-CNN, we need to process CNN for every region proposal. 3,000 times CNN computations for 3,000 region proposals. However, Fast R-CNN only needs to compute CNN once, excelerating the whole procedure eventually.


Next thing to consider is…

End-to-end learning with multi-task loss

R-CNN has a multi-stage pipeline consisting of three modules: ConvNet, SVM and bounding box regressors. It is controversial that end-to-end structure always performs better than multi-stage pipeline, but clearly it is straight-forward and based on what deep learning is and should be. (Of course, it performs well in many cases)

Fast R-CNN defines Multi-task Loss so that it achieves end-to-end learning. The multi-task loss is a combination of the classification loss and bounding box regression to jointly train.

2021-03-11-fasterrcnn_math

Reference