Spatially Separable Self-Attention, or SSSA, is an attention module used in the Twins-SVT architecture that aims to reduce the computational complexity of vision transformers for dense prediction tasks (given high-resolution inputs). SSSA is composed of locally-grouped self-attention (LSA) and global sub-sampled attention (GSA).
Formally, spatially separable selfattention (SSSA) can be written as:
$$ \hat{\mathbf{z}}_{ij}^{l}=\mathrm{LSA}\left(\mathrm{LayerNorm}\left(\mathbf{z}_{ij}^{l-1}\right)\right)+\mathbf{z}_{ij}^{l-1} $$
$$ \mathbf{z}_{ij}^{l}=\mathrm{FFN}\left(\mathrm{LayerNorm}\left(\hat{\mathbf{z}}_{ij}^{l}\right)\right)+\hat{\mathbf{z}}_{ij}^{l} $$
$$ \hat{\mathbf{z}}^{l+1}=\mathrm{GSA}\left(\mathrm{LayerNorm}\left(\mathbf{z}^{l}\right)\right)+\mathbf{z}^{l} $$
$$ \mathbf{z}^{l+1}=\mathrm{FFN}\left(\mathrm{LayerNorm}\left(\hat{\mathbf{z}}^{l+1}\right)\right)+\hat{\mathbf{z}}^{l+1} $$
$$ i \in \{1, 2, \ldots, m\}, \quad j \in \{1, 2, \ldots, n\} $$
where LSA is locally-grouped self-attention within a sub-window, and GSA is global sub-sampled attention, which interacts with the representative keys (generated by the sub-sampling functions) from each sub-window $\hat{\mathbf{z}}_{ij} \in \mathbb{R}^{k_{1} \times k_{2} \times C}$. Both LSA and GSA use multiple heads, as in standard self-attention.
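The two attention steps above can be sketched in NumPy. This is a minimal single-head illustration, not the paper's implementation: the learned query/key/value projections, multi-head split, LayerNorm, and FFN are omitted, and the sub-sampling function is assumed to be average pooling over each $k_1 \times k_2$ sub-window (the paper also considers other pooling/convolution choices).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (no learned projections in this sketch).
    scale = q.shape[-1] ** -0.5
    return softmax(q @ k.swapaxes(-2, -1) * scale) @ v

def lsa(x, k1, k2):
    # Locally-grouped self-attention: attend only within each k1 x k2 sub-window.
    H, W, C = x.shape
    m, n = H // k1, W // k2
    # Partition the H x W map into m*n groups of k1*k2 tokens each.
    g = x.reshape(m, k1, n, k2, C).transpose(0, 2, 1, 3, 4).reshape(m * n, k1 * k2, C)
    out = attention(g, g, g)
    # Undo the grouping back to an H x W x C map.
    return out.reshape(m, n, k1, k2, C).transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def gsa(x, k1, k2):
    # Global sub-sampled attention: every token queries one representative
    # key/value per sub-window (average pooling assumed here).
    H, W, C = x.shape
    m, n = H // k1, W // k2
    rep = x.reshape(m, k1, n, k2, C).mean(axis=(1, 3)).reshape(m * n, C)
    q = x.reshape(H * W, C)
    out = attention(q, rep, rep)
    return out.reshape(H, W, C)

# One SSSA-style pass: local attention followed by global sub-sampled attention.
x = np.random.randn(8, 8, 16)
y = lsa(x, 4, 4)
z = gsa(y, 4, 4)
```

Because LSA costs $O(k_1 k_2 H W)$ and GSA attends to only $mn = HW / (k_1 k_2)$ representatives, the combination avoids the quadratic cost of full self-attention over all $HW$ tokens.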
Source: Twins: Revisiting the Design of Spatial Attention in Vision Transformers

| Task | Papers | Share |
| --- | --- | --- |
| Benchmarking | 1 | 20.00% |
| Fact Verification | 1 | 20.00% |
| Retrieval | 1 | 20.00% |
| Image Classification | 1 | 20.00% |
| Semantic Segmentation | 1 | 20.00% |
| Component | Type |
| --- | --- |
| Dense Connections | Feedforward Networks |
| Global Sub-Sampled Attention | Attention Mechanisms |
| Layer Normalization | Normalization |
| Locally-Grouped Self-Attention | Attention Mechanisms |