spatial transformer networks uses an explicit procedure to learn invariance to translation, scaling, rotation and other more general warps, making the network pay attention to the most relevant regions. STN was the first attention mechanism to explicitly predict important regions and provide a deep neural network with transformation invariance.
Taking a 2D image as an example, a 2D affine transformation can be formulated as followed, where A denotes a $ 2 \times 3 $ learneable affine matrix:
\begin{align} A = f_\text{loc}(U) \end{align} \begin{align} x_i^s = A x_i^t \end{align}
Here, $U$ is the input feature map, and $f_\text{loc}$ can be any differentiable function, such as a lightweight fullyconnected network or convolutional neural network. $x_{i}^{s}$ is coordinates in the output feature map, while $x_{i}^{t}$ is corresponding coordinates in the input feature map and the $ A $ matrix is the learnable affine matrix. After obtaining the correspondence, the network can sample relevant input regions using the correspondence. To ensure that the whole process is differentiable and can be updated in an endtoend manner, bilinear sampling is used to sample the input features.
STNs focus on discriminative regions automatically and learn invariance to some geometric transformations.
Source: Spatial Transformer NetworksPaper  Code  Results  Date  Stars 

Task  Papers  Share 

General Classification  4  22.22% 
Image Classification  2  11.11% 
Person ReIdentification  2  11.11% 
Object Detection  1  5.56% 
Tumor Segmentation  1  5.56% 
Image Reconstruction  1  5.56% 
Medical Image Segmentation  1  5.56% 
Semantic Segmentation  1  5.56% 
Decision Making  1  5.56% 
Component  Type 


🤖 No Components Found  You can add them if they exist; e.g. Mask RCNN uses RoIAlign 