Deformable Attention Module

Introduced by Zhu et al. in Deformable DETR: Deformable Transformers for End-to-End Object Detection

Deformable Attention Module is the attention module used in the Deformable DETR architecture. It addresses a limitation of standard Transformer attention, which attends over all possible spatial locations. Inspired by deformable convolution, the deformable attention module attends only to a small set of key sampling points around a reference point, regardless of the spatial size of the feature maps. By assigning only a small fixed number of keys to each query, the issues of slow convergence and limited feature spatial resolution can be mitigated.

Given an input feature map $\mathbf{x} \in \mathbb{R}^{C \times H \times W}$, let $q$ index a query element with content feature $\mathbf{z}_{q}$ and a 2-d reference point $\mathbf{p}_{q}$. The deformable attention feature is calculated by:

$$ \text{DeformAttn}\left(\mathbf{z}_{q}, \mathbf{p}_{q}, \mathbf{x}\right)=\sum_{m=1}^{M} \mathbf{W}_{m}\left[\sum_{k=1}^{K} A_{m q k} \cdot \mathbf{W}_{m}^{\prime} \mathbf{x}\left(\mathbf{p}_{q}+\Delta \mathbf{p}_{m q k}\right)\right] $$

where $m$ indexes the attention head ($M$ heads in total), $k$ indexes the sampled keys, and $K$ is the total number of sampled keys ($K \ll HW$). $\mathbf{W}_{m}^{\prime}$ and $\mathbf{W}_{m}$ are the learnable value and output projection matrices of head $m$. $\Delta \mathbf{p}_{mqk}$ and $A_{mqk}$ denote the sampling offset and attention weight of the $k^{\text{th}}$ sampling point in the $m^{\text{th}}$ attention head, respectively. The scalar attention weight $A_{mqk}$ lies in the range $[0,1]$ and is normalized so that $\sum_{k=1}^{K} A_{mqk}=1$. The offsets $\Delta \mathbf{p}_{mqk} \in \mathbb{R}^{2}$ are 2-d real vectors with unconstrained range. Because $\mathbf{p}_{q}+\Delta \mathbf{p}_{mqk}$ is generally fractional, bilinear interpolation is applied, as in Dai et al. (2017), when computing $\mathbf{x}\left(\mathbf{p}_{q}+\Delta \mathbf{p}_{mqk}\right)$. Both $\Delta \mathbf{p}_{mqk}$ and $A_{mqk}$ are obtained via linear projection over the query feature $\mathbf{z}_{q}$. In implementation, the query feature $\mathbf{z}_{q}$ is fed to a linear projection operator of $3MK$ channels, where the first $2MK$ channels encode the sampling offsets $\Delta \mathbf{p}_{mqk}$, and the remaining $MK$ channels are fed to a softmax operator to obtain the attention weights $A_{mqk}$.
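
The computation above can be sketched for a single query in NumPy. This is a minimal illustration rather than the reference implementation, and it rests on several assumptions that are not stated in the text: the reference point and offsets are expressed in pixel coordinates with (x, y) ordering, samples falling outside the feature map contribute zero, the $3MK$ projection has a bias term, and the $2MK$ offset channels are laid out as (head, key, xy). The function names `bilinear_sample` and `deform_attn` are hypothetical.

```python
import numpy as np

def bilinear_sample(x, px, py):
    """Bilinearly sample x (C, H, W) at fractional pixel coords (px, py).
    Assumption: neighbours outside the map contribute zero."""
    C, H, W = x.shape
    x0, y0 = int(np.floor(px)), int(np.floor(py))
    fx, fy = px - x0, py - y0
    out = np.zeros(C)
    for yi, wy in ((y0, 1.0 - fy), (y0 + 1, fy)):
        for xi, wx in ((x0, 1.0 - fx), (x0 + 1, fx)):
            if 0 <= xi < W and 0 <= yi < H:
                out += wy * wx * x[:, yi, xi]
    return out

def deform_attn(z_q, p_q, x, W_proj, b_proj, W_val, W_out, M, K):
    """Single-query deformable attention.
    z_q: (C,) query feature; p_q: (2,) reference point (x, y) in pixels (assumed).
    W_proj: (3*M*K, C), b_proj: (3*M*K,) -- the 3MK-channel linear projection.
    W_val: (M, C_v, C) plays the role of W'_m; W_out: (M, C, C_v) the role of W_m."""
    proj = W_proj @ z_q + b_proj
    # First 2MK channels: sampling offsets; remaining MK channels: attention logits.
    offsets = proj[: 2 * M * K].reshape(M, K, 2)
    logits = proj[2 * M * K:].reshape(M, K)
    # Softmax over the K sampled keys of each head, so each row of A sums to 1.
    logits = logits - logits.max(axis=1, keepdims=True)
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)

    out = np.zeros(W_out.shape[1])
    for m in range(M):
        head = np.zeros(W_val.shape[1])
        for k in range(K):
            px, py = p_q + offsets[m, k]
            head += A[m, k] * (W_val[m] @ bilinear_sample(x, px, py))
        out += W_out[m] @ head
    return out, A

# Example with small, arbitrary sizes (C_v = C / M is a common choice).
rng = np.random.default_rng(0)
C, H, W, M, K = 8, 5, 6, 2, 4
C_v = C // M
x = rng.normal(size=(C, H, W))
z_q = rng.normal(size=C)
out, A = deform_attn(
    z_q, np.array([2.5, 1.5]), x,
    rng.normal(size=(3 * M * K, C)) * 0.1, np.zeros(3 * M * K),
    rng.normal(size=(M, C_v, C)), rng.normal(size=(M, C, C_v)),
    M, K,
)
print(out.shape, A.sum(axis=1))
```

Each query touches only $MK$ sampled locations, so the cost per query does not grow with $H \times W$. Because bilinear interpolation is linear, applying $\mathbf{W}_{m}^{\prime}$ after sampling, as done here, gives the same result as projecting the whole feature map first and then sampling.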

Source: Deformable DETR: Deformable Transformers for End-to-End Object Detection

Tasks

Task	Papers	Share
Object Detection	23	34.33%
Instance Segmentation	3	4.48%
Language Modelling	2	2.99%
Semi-Supervised Object Detection	2	2.99%
Autonomous Driving	2	2.99%
Semantic Segmentation	2	2.99%
Video Instance Segmentation	2	2.99%
2D Object Detection	2	2.99%
Real-Time Object Detection	2	2.99%

Categories

Attention Modules