“Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”
ROI pooling and ROI alignment:
 an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).
 RoI pooling is a pooling layer that performs max pooling on inputs (here, convnet feature maps) of non-uniform sizes and produces a small feature map of fixed size (say 7x7). This fixed size is a network hyperparameter and is predefined.
 The purpose of this specific design is to make computation much faster: every RoI, regardless of its size, is mapped to the same fixed-size output, so the subsequent fully connected layers can be shared across all RoIs.
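A minimal numpy sketch of the idea (my own simplified implementation, not the paper's code): each output cell of the fixed 7x7 grid max-pools over its corresponding sub-window of the RoI, so any RoI size collapses to the same output shape.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one RoI (r, c, h, w) on a conv feature map to a fixed size.

    feature_map: (C, H, W) array; roi: (r, c, h, w) in feature-map coords.
    Simplified sketch: each output cell max-pools its sub-window of the RoI.
    """
    r, c, h, w = roi
    ph, pw = output_size
    out = np.zeros((feature_map.shape[0], ph, pw), dtype=feature_map.dtype)
    for i in range(ph):
        for j in range(pw):
            # Sub-window boundaries (floor/ceil rounding, as in RoI pooling)
            r0 = r + int(np.floor(i * h / ph))
            r1 = max(r + int(np.ceil((i + 1) * h / ph)), r0 + 1)
            c0 = c + int(np.floor(j * w / pw))
            c1 = max(c + int(np.ceil((j + 1) * w / pw)), c0 + 1)
            out[:, i, j] = feature_map[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out
```

Whatever the RoI's (h, w), the output is always (C, 7, 7), which is what lets the downstream fully connected layers accept it.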
Region Proposal Networks :
 A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score
 the goal is to share computation with a Fast R-CNN object detection network
 To generate region proposals, slide a small network over the convolutional feature map output by the last shared convolutional layer
Anchors:
 At each sliding-window location, we simultaneously predict multiple region proposals, where the maximum number of possible proposals at each location is denoted as k. So the reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate the probability of object or not-object for each proposal
 The k proposals are parameterized relative to k reference boxes, called anchors
 An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio. By default we use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position
 An important property of our approach is that it is translation invariant
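The 3 scales x 3 aspect ratios scheme can be sketched as follows (an illustrative sketch; the scale/ratio defaults match the paper's, but the function is my own): each anchor keeps area roughly scale² while its height/width ratio equals the chosen aspect ratio.

```python
import numpy as np

def generate_anchors(center, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchors at one sliding position.

    Each anchor has area scale**2 and aspect ratio h/w = ratio.
    Returns a (k, 4) array of (x1, y1, x2, y2) boxes centered at `center`.
    """
    cx, cy = center
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / np.sqrt(ratio)   # so that w * h = scale**2
            h = scale * np.sqrt(ratio)   # and h / w = ratio
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)
```

Because the same k anchors are generated at every sliding position, the predictions are translation invariant: shifting an object shifts which anchors fire, not the functions computing the scores.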
MultiScale Anchors as Regression References:
 there have been two popular ways to make multi-scale predictions. The first way is based on image/feature pyramids (useful but time-consuming)
 The second way is to use sliding windows of multiple scales (and/or aspect ratios) on the feature maps.
 our anchor-based method is built on a pyramid of anchors, which is more cost-efficient. Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios. It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size
Loss Function

We assign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box.
\(L\left(\left\{p_{i}\right\},\left\{t_{i}\right\}\right)=\frac{1}{N_{cls}} \sum_{i} L_{cls}\left(p_{i}, p_{i}^{*}\right)+\lambda \frac{1}{N_{reg}} \sum_{i} p_{i}^{*} L_{reg}\left(t_{i}, t_{i}^{*}\right)\)
\(t_{i}\) is a vector representing the 4 parameterized coordinates of the predicted bounding box. The ground-truth label \(p^{*}_{i}\) is 1 if the anchor is positive, and is 0 if the anchor is negative
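The loss above can be sketched numerically (a simplified numpy sketch; \(\lambda=10\), \(N_{cls}=256\), and \(N_{reg}\approx 2400\) are the paper's defaults, and smooth-L1 is the \(L_{reg}\) the paper uses, inherited from Fast R-CNN):

```python
import numpy as np

def rpn_loss(p, p_star, t, t_star, lam=10.0, n_cls=256, n_reg=2400):
    """Multi-task RPN loss: normalized classification log loss plus a
    smooth-L1 regression loss that only fires for positive anchors
    (p_star == 1), since p_star * L_reg zeroes out negatives.

    p: (N,) predicted objectness probs; p_star: (N,) ground-truth labels;
    t, t_star: (N, 4) predicted and ground-truth box parameterizations.
    """
    p = np.clip(p, 1e-7, 1 - 1e-7)                       # numerical safety
    l_cls = -(p_star * np.log(p)
              + (1 - p_star) * np.log(1 - p)).sum() / n_cls
    diff = np.abs(t - t_star)
    smooth_l1 = np.where(diff < 1.0, 0.5 * diff**2, diff - 0.5).sum(axis=1)
    l_reg = (p_star * smooth_l1).sum() / n_reg
    return l_cls + lam * l_reg
```

Note how the \(p_i^{*}\) factor inside the regression sum means anchors labeled negative contribute nothing to the box regression, only to the classification term.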
Bounding box regression:
 from above, the parameterization of t can be described as follows: \(\begin{aligned} t_{\mathrm{x}} &=\left(x-x_{\mathrm{a}}\right) / w_{\mathrm{a}}, \quad t_{\mathrm{y}}=\left(y-y_{\mathrm{a}}\right) / h_{\mathrm{a}} \\ t_{\mathrm{w}} &=\log \left(w / w_{\mathrm{a}}\right), \quad t_{\mathrm{h}}=\log \left(h / h_{\mathrm{a}}\right) \\ t_{\mathrm{x}}^{*} &=\left(x^{*}-x_{\mathrm{a}}\right) / w_{\mathrm{a}}, \quad t_{\mathrm{y}}^{*}=\left(y^{*}-y_{\mathrm{a}}\right) / h_{\mathrm{a}} \\ t_{\mathrm{w}}^{*} &=\log \left(w^{*} / w_{\mathrm{a}}\right), \quad t_{\mathrm{h}}^{*}=\log \left(h^{*} / h_{\mathrm{a}}\right) \end{aligned}\)
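The parameterization above and its inverse can be sketched directly (a numpy sketch of these formulas; boxes and anchors are given as center-size tuples (x, y, w, h), following the paper's convention):

```python
import numpy as np

def encode(box, anchor):
    """Compute t = (tx, ty, tw, th) for a box relative to an anchor.

    box: (x, y, w, h) with (x, y) the box center; anchor: (xa, ya, wa, ha).
    Centers are normalized by the anchor size; sizes are log-ratios.
    """
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    """Invert the parameterization: recover (x, y, w, h) from t."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return np.array([tx * wa + xa, ty * ha + ya,
                     wa * np.exp(tw), ha * np.exp(th)])
```

Encoding a box against its anchor and decoding it back is lossless, which is what lets the reg layer predict these normalized offsets instead of raw coordinates.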