Attention Mechanisms

Global Sub-Sampled Attention

Introduced by Chu et al. in Twins: Revisiting the Design of Spatial Attention in Vision Transformers

Global Sub-Sampled Attention, or GSA, is a global attention mechanism used in the Twins-SVT architecture, where it is paired with locally-grouped self-attention (LSA).

A single representative is used to summarize the key information of each of the $m \times n$ sub-windows (each sub-window having size $k_{1} \times k_{2}$, so that $m = H/k_{1}$ and $n = W/k_{2}$), and these representatives are used to communicate with the other sub-windows (serving as the keys in self-attention). This reduces the cost to $\mathcal{O}(m n H W d)=\mathcal{O}\left(\frac{H^{2} W^{2} d}{k_{1} k_{2}}\right)$. It is essentially equivalent to using the sub-sampled feature maps as the keys in the attention operation, and is therefore termed global sub-sampled attention (GSA).
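A minimal PyTorch sketch of this idea is given below. It is not the authors' implementation; the module name, the use of a strided convolution to produce the sub-window representatives, and the layer normalization on the sub-sampled tokens are assumptions based on the description above.

```python
import torch
import torch.nn as nn


class GlobalSubSampledAttention(nn.Module):
    """Minimal sketch of GSA: every position attends to one representative
    token per k x k sub-window (hypothetical layout, not the official code)."""

    def __init__(self, dim, num_heads=8, sub_sample=7):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        # One representative per sub-window, produced here by a strided
        # convolution (an assumption; average pooling would also work).
        self.sr = nn.Conv2d(dim, dim, kernel_size=sub_sample, stride=sub_sample)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N = H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        # Sub-sample the feature map: N tokens -> N / (k1 * k2) representatives.
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads)
        k, v = kv.permute(2, 0, 3, 1, 4)  # each: (B, heads, M, C // heads)

        # Full-resolution queries attend only to the sub-sampled keys/values.
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, M)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

For example, with a $56 \times 56$ feature map ($N = 3136$) and `sub_sample=7`, the keys and values shrink to $64$ representatives, which is where the $\frac{H^{2} W^{2} d}{k_{1} k_{2}}$ term in the cost comes from.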

If LSA and GSA are used alternately, analogous to separable convolutions (depth-wise followed by point-wise), the total computation cost is $\mathcal{O}\left(\frac{H^{2} W^{2} d}{k_{1} k_{2}}+k_{1} k_{2} H W d\right)$, where the first term is the GSA cost and the second is the LSA cost. We have:

$$\frac{H^{2} W^{2} d}{k_{1} k_{2}}+k_{1} k_{2} H W d \geq 2 H W d \sqrt{H W} $$
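This lower bound is the arithmetic-geometric mean (AM-GM) inequality applied to the two cost terms:

$$\frac{H^{2} W^{2} d}{k_{1} k_{2}}+k_{1} k_{2} H W d \geq 2 \sqrt{\frac{H^{2} W^{2} d}{k_{1} k_{2}} \cdot k_{1} k_{2} H W d}=2 \sqrt{H^{3} W^{3} d^{2}}=2 H W d \sqrt{H W},$$

with equality exactly when the two terms are equal.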

The minimum is obtained when $k_{1} \cdot k_{2}=\sqrt{H W}$. Note that $H=W=224$ is popular in classification. Without loss of generality, square sub-windows are used, i.e., $k_{1}=k_{2}$, so $k_{1}=k_{2}=15$ is close to the global optimum for $H=W=224$. However, the network is designed with several stages of different resolutions. Stage 1 has $56 \times 56$ feature maps, for which the minimum is obtained at $k_{1}=k_{2}=\sqrt{56} \approx 7$. In theory, the optimal $k_{1}$ and $k_{2}$ could be calibrated for each stage; for simplicity, $k_{1}=k_{2}=7$ is used everywhere. For the stages with lower resolutions, the sub-sampling window size of GSA is controlled so that too few keys are not generated: sizes of 4, 2, and 1 are used for the last three stages, respectively.
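As an illustrative sanity check of this window-size choice (not from the paper; the channel dimension below is an arbitrary assumption), the combined per-block cost can be evaluated numerically for the $56 \times 56$ stage:

```python
import math

def lsa_gsa_cost(H, W, d, k):
    """Combined cost of one LSA + GSA pair with square k x k sub-windows,
    i.e. H^2 W^2 d / (k1 k2) + k1 k2 H W d with k1 = k2 = k."""
    return (H ** 2) * (W ** 2) * d / (k * k) + k * k * H * W * d

H = W = 56   # stage-1 feature map size for a 224 x 224 input
d = 64       # illustrative channel dimension (an assumption)

for k in (2, 4, 7, 8, 14, 28):
    print(f"k = {k:2d}  cost = {lsa_gsa_cost(H, W, d, k):,.0f}")

# The analytic optimum is k1 = k2 = (H * W) ** 0.25 ~ 7.48 for 56 x 56 maps,
# so k = 7 is close to the minimum, matching the choice above.
print("analytic optimum k ~", round(math.sqrt(math.sqrt(H * W)), 2))
```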

Source: Twins: Revisiting the Design of Spatial Attention in Vision Transformers
