Deformable Attention Module

Introduced by Zhu et al. in Deformable DETR: Deformable Transformers for End-to-End Object Detection

Deformable Attention Module is an attention module used in the Deformable DETR architecture. It addresses a limitation of standard Transformer attention, which attends to all possible spatial locations of a feature map. Inspired by deformable convolution, the deformable attention module attends only to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps. Assigning only a small, fixed number of keys to each query mitigates the issues of slow convergence and limited feature spatial resolution.

Given an input feature map $\mathbf{x} \in \mathbb{R}^{C \times H \times W}$, let $q$ index a query element with content feature $\mathbf{z}_{q}$ and a 2-d reference point $\mathbf{p}_{q}$. The deformable attention feature is calculated by:

$$ \text{DeformAttn}\left(\mathbf{z}_{q}, \mathbf{p}_{q}, \mathbf{x}\right)=\sum_{m=1}^{M} \mathbf{W}_{m}\left[\sum_{k=1}^{K} A_{m q k} \cdot \mathbf{W}_{m}^{\prime} \mathbf{x}\left(\mathbf{p}_{q}+\Delta \mathbf{p}_{m q k}\right)\right] $$

where $m$ indexes the attention head (of $M$ in total), $k$ indexes the sampled keys, and $K$ is the total number of sampled keys ($K \ll H W$). $\mathbf{W}_{m}$ and $\mathbf{W}_{m}^{\prime}$ are learnable per-head projection weights. $\Delta \mathbf{p}_{m q k}$ and $A_{m q k}$ denote the sampling offset and attention weight of the $k^{\text{th}}$ sampling point in the $m^{\text{th}}$ attention head, respectively. The scalar attention weight $A_{m q k}$ lies in the range $[0,1]$ and is normalized so that $\sum_{k=1}^{K} A_{m q k}=1$. The offsets $\Delta \mathbf{p}_{m q k} \in \mathbb{R}^{2}$ are 2-d real numbers with unconstrained range. As $\mathbf{p}_{q}+\Delta \mathbf{p}_{m q k}$ is fractional, bilinear interpolation is applied as in Dai et al. (2017) in computing $\mathbf{x}\left(\mathbf{p}_{q}+\Delta \mathbf{p}_{m q k}\right)$. Both $\Delta \mathbf{p}_{m q k}$ and $A_{m q k}$ are obtained via linear projection over the query feature $\mathbf{z}_{q}$. In implementation, the query feature $\mathbf{z}_{q}$ is fed to a linear projection operator of $3 M K$ channels, where the first $2 M K$ channels encode the sampling offsets $\Delta \mathbf{p}_{m q k}$, and the remaining $M K$ channels are fed to a softmax operator to obtain the attention weights $A_{m q k}$.
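
The computation above can be illustrated with a minimal NumPy sketch for a single query. It is not the authors' implementation: the function names (`bilinear_sample`, `deform_attn`), the parameter layout (`W_proj`, `b_proj`, `W_val`, `W_out`), the use of pixel coordinates $(x, y)$ for $\mathbf{p}_{q}$, the zero padding outside the feature map, and the per-head value dimension `C_v = C // M` are all assumptions made for illustration.

```python
import numpy as np

def bilinear_sample(x, p):
    """Sample feature map x of shape (C, H, W) at a fractional point p = (px, py).

    Assumption: p is in pixel coordinates and neighbours that fall
    outside the map contribute zeros (zero padding).
    """
    C, H, W = x.shape
    px, py = p
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    out = np.zeros(C)
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < W and 0 <= yi < H:
                # Bilinear weight of the integer neighbour (xi, yi).
                w = (1.0 - abs(px - xi)) * (1.0 - abs(py - yi))
                out += w * x[:, yi, xi]
    return out

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def deform_attn(z_q, p_q, x, W_proj, b_proj, W_val, W_out, M, K):
    """Deformable attention for one query (illustrative sketch).

    z_q:    (C,)         query content feature
    p_q:    (2,)         reference point (x, y), pixel coordinates (assumed)
    x:      (C, H, W)    input feature map
    W_proj: (3*M*K, C)   linear projection producing offsets and weight logits
    b_proj: (3*M*K,)     its bias
    W_val:  (M, C_v, C)  per-head value projections W'_m
    W_out:  (M, C, C_v)  per-head output projections W_m
    """
    proj = W_proj @ z_q + b_proj
    # First 2MK channels: sampling offsets; remaining MK channels: logits.
    offsets = proj[: 2 * M * K].reshape(M, K, 2)
    A = softmax(proj[2 * M * K :].reshape(M, K), axis=-1)  # sums to 1 over k

    out = np.zeros(W_out.shape[1])
    for m in range(M):
        head = np.zeros(W_val.shape[1])
        for k in range(K):
            sampled = bilinear_sample(x, p_q + offsets[m, k])
            head += A[m, k] * (W_val[m] @ sampled)
        out += W_out[m] @ head
    return out, offsets, A

# Example usage with random parameters (all values are illustrative).
rng = np.random.default_rng(0)
C, H, W, M, K = 8, 16, 16, 2, 4
C_v = C // M
x = rng.normal(size=(C, H, W))
z_q = rng.normal(size=C)
p_q = np.array([7.5, 6.25])
W_proj = 0.1 * rng.normal(size=(3 * M * K, C))
b_proj = np.zeros(3 * M * K)
W_val = rng.normal(size=(M, C_v, C))
W_out = rng.normal(size=(M, C, C_v))

out, offsets, A = deform_attn(z_q, p_q, x, W_proj, b_proj, W_val, W_out, M, K)
print(out.shape, A.sum(axis=-1))
```

Each query touches only $M K$ sampled locations (here $2 \times 4 = 8$) instead of all $H W$ positions, which is where the savings over dense attention come from. The explicit loops are kept for readability; a practical implementation would batch the sampling over queries, heads, and points.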

Source: Deformable DETR: Deformable Transformers for End-to-End Object Detection

