
Spatially Separable Self-Attention

Introduced by Chu et al. in Twins: Revisiting the Design of Spatial Attention in Vision Transformers

Spatially Separable Self-Attention, or SSSA, is an attention module used in the Twins-SVT architecture that reduces the computational cost of vision transformers on dense prediction tasks with high-resolution inputs. SSSA interleaves locally-grouped self-attention (LSA), which captures fine-grained information within non-overlapping sub-windows, with global sub-sampled attention (GSA), which models long-range interactions through a sub-sampled set of representative keys.
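
To see why this split is cheaper, consider (as a rough sketch following the paper's cost analysis, with constants and lower-order terms omitted) an $H \times W$ feature map with $d$ channels divided into $m \times n$ sub-windows of size $k_{1} \times k_{2}$, where $k_{1} = H/m$ and $k_{2} = W/n$. LSA attends within each sub-window, and GSA lets every position attend to one representative key per sub-window, giving a combined cost of

$$ \mathcal{O}\left(k_{1} k_{2} H W d+m n H W d\right)=\mathcal{O}\left(\frac{H^{2} W^{2} d}{m n}+m n H W d\right), $$

which is minimized when $mn=\sqrt{HW}$ and is therefore sub-quadratic in the number of tokens $HW$, unlike standard global self-attention at $\mathcal{O}\left(H^{2} W^{2} d\right)$.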

Formally, spatially separable self-attention (SSSA) can be written as:

$$ \hat{\mathbf{z}}_{ij}^{l}=\operatorname{LSA}\left(\operatorname{LayerNorm}\left(\mathbf{z}_{ij}^{l-1}\right)\right)+\mathbf{z}_{ij}^{l-1} $$

$$ \mathbf{z}_{ij}^{l}=\operatorname{FFN}\left(\operatorname{LayerNorm}\left(\hat{\mathbf{z}}_{ij}^{l}\right)\right)+\hat{\mathbf{z}}_{ij}^{l} $$

$$ \hat{\mathbf{z}}^{l+1}=\operatorname{GSA}\left(\operatorname{LayerNorm}\left(\mathbf{z}^{l}\right)\right)+\mathbf{z}^{l} $$

$$ \mathbf{z}^{l+1}=\operatorname{FFN}\left(\operatorname{LayerNorm}\left(\hat{\mathbf{z}}^{l+1}\right)\right)+\hat{\mathbf{z}}^{l+1} $$

$$ i \in\{1,2, \ldots, m\}, \quad j \in\{1,2, \ldots, n\} $$

where LSA denotes locally-grouped self-attention within a sub-window, and GSA denotes global sub-sampled attention, which interacts with the representative keys (generated by the sub-sampling function) from each sub-window $\hat{\mathbf{z}}_{ij} \in \mathbb{R}^{k_{1} \times k_{2} \times C}$. Both LSA and GSA use multiple heads, as in standard self-attention.
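
The block below is a minimal PyTorch sketch of this interleaving, not the official Twins implementation: the class names (`SSSABlock`, `LocallyGroupedAttention`, `GlobalSubsampledAttention`), the use of `torch.nn.MultiheadAttention`, and the strided convolution standing in for the sub-sampling function are illustrative assumptions, and details such as padding for feature maps whose sides are not multiples of the window size and positional encodings are omitted.

```python
import torch
import torch.nn as nn


class LocallyGroupedAttention(nn.Module):
    """LSA: multi-head self-attention restricted to non-overlapping k x k sub-windows."""

    def __init__(self, dim, num_heads=8, window_size=7):
        super().__init__()
        self.k = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):
        # x: (B, H*W, C); H and W are assumed divisible by the window size.
        B, N, C = x.shape
        k = self.k
        x = x.view(B, H // k, k, W // k, k, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, k * k, C)  # one row per sub-window
        x, _ = self.attn(x, x, x)  # attention only within each sub-window
        x = x.view(B, H // k, W // k, k, k, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)


class GlobalSubsampledAttention(nn.Module):
    """GSA: every position attends to one representative key/value per sub-window."""

    def __init__(self, dim, num_heads=8, sr_ratio=7):
        super().__init__()
        # Strided convolution as the sub-sampling function that produces the representatives.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.sr(kv).reshape(B, C, -1).transpose(1, 2)  # (B, m*n, C) representatives
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv)  # full-resolution queries, sub-sampled keys/values
        return out


class SSSABlock(nn.Module):
    """LSA block followed by GSA block, each with pre-norm residual attention and FFN."""

    def __init__(self, dim, num_heads=8, window_size=7, mlp_ratio=4):
        super().__init__()
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.lsa = LocallyGroupedAttention(dim, num_heads, window_size)
        self.gsa = GlobalSubsampledAttention(dim, num_heads, sr_ratio=window_size)
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                          nn.Linear(dim * mlp_ratio, dim))
            for _ in range(2)
        ])

    def forward(self, x, H, W):
        x = x + self.lsa(self.norms[0](x), H, W)  # z_hat^l
        x = x + self.ffns[0](self.norms[1](x))    # z^l
        x = x + self.gsa(self.norms[2](x), H, W)  # z_hat^(l+1)
        x = x + self.ffns[1](self.norms[3](x))    # z^(l+1)
        return x


# Toy usage: a 56x56 feature map with 64 channels and 7x7 sub-windows.
block = SSSABlock(dim=64, num_heads=8, window_size=7)
tokens = torch.randn(2, 56 * 56, 64)
print(block(tokens, H=56, W=56).shape)  # torch.Size([2, 3136, 64])
```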

Source: Twins: Revisiting the Design of Spatial Attention in Vision Transformers

Tasks


Task                   Papers  Share
Benchmarking           1       20.00%
Fact Verification      1       20.00%
Retrieval              1       20.00%
Image Classification   1       20.00%
Semantic Segmentation  1       20.00%
