The Deformable Attention Module is an attention module used in the Deformable DETR architecture, which seeks to overcome one issue of standard Transformer attention: it attends over all possible spatial locations. Inspired by deformable convolution, the deformable attention module attends only to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps. By assigning only a small, fixed number of keys to each query, the issues of slow convergence and limited feature spatial resolution can be mitigated.
Given an input feature map $x \in \mathbb{R}^{C \times H \times W}$, let $q$ index a query element with content feature $\mathbf{z}_{q}$ and 2-d reference point $\mathbf{p}_{q}$. The deformable attention feature is calculated as:
$$ \text{DeformAttn}\left(\mathbf{z}_{q}, \mathbf{p}_{q}, \mathbf{x}\right)=\sum_{m=1}^{M} \mathbf{W}_{m}\left[\sum_{k=1}^{K} A_{m q k} \cdot \mathbf{W}_{m}^{\prime} \mathbf{x}\left(\mathbf{p}_{q}+\Delta \mathbf{p}_{m q k}\right)\right] $$
where $m$ indexes the attention head, $k$ indexes the sampled keys, and $K$ is the total number of sampled keys $(K \ll HW)$. $\Delta\mathbf{p}_{mqk}$ and $A_{mqk}$ denote the sampling offset and attention weight of the $k^{\text{th}}$ sampling point in the $m^{\text{th}}$ attention head, respectively. The scalar attention weight $A_{mqk}$ lies in the range $[0,1]$ and is normalized so that $\sum_{k=1}^{K} A_{mqk}=1$. $\Delta\mathbf{p}_{mqk} \in \mathbb{R}^{2}$ is a 2-d offset of real numbers with unconstrained range. Since $\mathbf{p}_{q}+\Delta\mathbf{p}_{mqk}$ is fractional, bilinear interpolation is applied as in Dai et al. (2017) when computing $\mathbf{x}\left(\mathbf{p}_{q}+\Delta\mathbf{p}_{mqk}\right)$. Both $\Delta\mathbf{p}_{mqk}$ and $A_{mqk}$ are obtained via linear projection of the query feature $\mathbf{z}_{q}$. In implementation, the query feature $\mathbf{z}_{q}$ is fed to a linear projection operator of $3MK$ channels, where the first $2MK$ channels encode the sampling offsets $\Delta\mathbf{p}_{mqk}$, and the remaining $MK$ channels are fed to a softmax operator to obtain the attention weights $A_{mqk}$.
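The computation above can be sketched for a single query in plain NumPy. This is a simplified, illustrative implementation, not the paper's optimized CUDA kernel: it uses per-head projection matrices with made-up names (`W_m`, `W_m_prime`, `W_off`, `W_attn`) rather than the single fused $3MK$-channel projection described above, and it sums over the four bilinear-interpolation corners explicitly, treating out-of-bounds locations as zero.

```python
import numpy as np

def bilinear_sample(x, p):
    """Bilinearly sample feature map x (C, H, W) at fractional point p = (px, py)."""
    C, H, W = x.shape
    px, py = p
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    wx, wy = px - x0, py - y0
    out = np.zeros(C)
    for dy, vw in ((0, 1 - wy), (1, wy)):
        for dx, hw in ((0, 1 - wx), (1, wx)):
            yy, xx = y0 + dy, x0 + dx
            if 0 <= yy < H and 0 <= xx < W:  # zero padding outside the map
                out += vw * hw * x[:, yy, xx]
    return out

def deform_attn(z_q, p_q, x, heads):
    """DeformAttn(z_q, p_q, x) for one query.

    z_q: (C,) query content feature; p_q: (2,) reference point (px, py);
    x: (C, H, W) input feature map; heads: list of M dicts with
    (hypothetical) per-head matrices W_m (C, Cv), W_m_prime (Cv, C),
    W_off (2K, C), W_attn (K, C).
    """
    out = np.zeros(z_q.shape[0])
    for h in heads:
        # sampling offsets Delta p_{mqk}: linear projection of z_q, K x 2
        offsets = (h["W_off"] @ z_q).reshape(-1, 2)
        # attention weights A_{mqk}: softmax over the K sampled keys
        logits = h["W_attn"] @ z_q
        A = np.exp(logits - logits.max())
        A /= A.sum()  # sum_k A_{mqk} = 1
        # weighted sum of value-projected samples at p_q + Delta p_{mqk}
        agg = np.zeros(h["W_m_prime"].shape[0])
        for k in range(offsets.shape[0]):
            agg += A[k] * (h["W_m_prime"] @ bilinear_sample(x, p_q + offsets[k]))
        out += h["W_m"] @ agg  # output projection W_m, summed over heads
    return out

# Tiny demo with random weights: C=8 channels, Cv=4 per head, M=2 heads, K=3 keys.
rng = np.random.default_rng(0)
C, Cv, H, W, M, K = 8, 4, 5, 5, 2, 3
x = rng.standard_normal((C, H, W))
heads = [dict(W_m=rng.standard_normal((C, Cv)),
              W_m_prime=rng.standard_normal((Cv, C)),
              W_off=0.1 * rng.standard_normal((2 * K, C)),
              W_attn=rng.standard_normal((K, C))) for _ in range(M)]
y = deform_attn(rng.standard_normal(C), np.array([2.0, 2.0]), x, heads)
```

Note that sampling cost is $O(MK)$ per query rather than $O(HW)$, which is the point of the module: the query only ever touches $K$ interpolated locations per head.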
Source: Deformable DETR: Deformable Transformers for End-to-End Object Detection