Spatially Separable Self-Attention, or SSSA, is an attention module used in the Twins-SVT architecture that aims to reduce the computational complexity of vision transformers for dense prediction tasks (given high-resolution inputs). SSSA is composed of locally-grouped self-attention (LSA) and global sub-sampled attention (GSA).
Formally, spatially separable selfattention (SSSA) can be written as:
$$ \hat{\mathbf{z}}_{ij}^{l}=\mathrm{LSA}\left(\mathrm{LayerNorm}\left(\mathbf{z}_{ij}^{l-1}\right)\right)+\mathbf{z}_{ij}^{l-1} $$
$$ \mathbf{z}_{ij}^{l}=\mathrm{FFN}\left(\mathrm{LayerNorm}\left(\hat{\mathbf{z}}_{ij}^{l}\right)\right)+\hat{\mathbf{z}}_{ij}^{l} $$
$$ \hat{\mathbf{z}}^{l+1}=\mathrm{GSA}\left(\mathrm{LayerNorm}\left(\mathbf{z}^{l}\right)\right)+\mathbf{z}^{l} $$
$$ \mathbf{z}^{l+1}=\mathrm{FFN}\left(\mathrm{LayerNorm}\left(\hat{\mathbf{z}}^{l+1}\right)\right)+\hat{\mathbf{z}}^{l+1} $$
$$ i \in \{1, 2, \ldots, m\}, \quad j \in \{1, 2, \ldots, n\} $$
where LSA is locally-grouped self-attention within a sub-window, and GSA is global sub-sampled attention, which interacts with the representative keys (generated by the sub-sampling functions) from each sub-window $\hat{\mathbf{z}}_{ij} \in \mathbb{R}^{k_{1} \times k_{2} \times C}$. Both LSA and GSA use multiple heads, as in standard self-attention.
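The two attention steps above can be sketched in NumPy. This is a minimal single-head illustration, not the paper's implementation: the learned query/key/value projections, multi-head split, LayerNorm, and FFN are omitted, and the sub-sampling function is assumed to be average pooling over each $k_1 \times k_2$ sub-window (the paper also considers other pooling/convolution choices).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (no learned projections in this sketch).
    scale = q.shape[-1] ** -0.5
    return softmax(q @ k.swapaxes(-2, -1) * scale) @ v

def lsa(x, k1, k2):
    # Locally-grouped self-attention: attend only within each k1 x k2 sub-window.
    H, W, C = x.shape
    m, n = H // k1, W // k2
    # Partition the H x W map into m*n groups of k1*k2 tokens each.
    g = x.reshape(m, k1, n, k2, C).transpose(0, 2, 1, 3, 4).reshape(m * n, k1 * k2, C)
    out = attention(g, g, g)
    # Undo the grouping back to an H x W x C map.
    return out.reshape(m, n, k1, k2, C).transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def gsa(x, k1, k2):
    # Global sub-sampled attention: every token queries one representative
    # key/value per sub-window (average pooling assumed here).
    H, W, C = x.shape
    m, n = H // k1, W // k2
    rep = x.reshape(m, k1, n, k2, C).mean(axis=(1, 3)).reshape(m * n, C)
    q = x.reshape(H * W, C)
    out = attention(q, rep, rep)
    return out.reshape(H, W, C)

# One SSSA-style pass: local attention followed by global sub-sampled attention.
x = np.random.randn(8, 8, 16)
y = lsa(x, 4, 4)
z = gsa(y, 4, 4)
```

Because LSA costs $O(k_1 k_2 H W)$ and GSA attends to only $mn = HW / (k_1 k_2)$ representatives, the combination avoids the quadratic cost of full self-attention over all $HW$ tokens.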
Source: Twins: Revisiting the Design of Spatial Attention in Vision Transformers

| Task | Papers | Share |
| --- | --- | --- |
| Benchmarking | 1 | 20.00% |
| Fact Verification | 1 | 20.00% |
| Retrieval | 1 | 20.00% |
| Image Classification | 1 | 20.00% |
| Semantic Segmentation | 1 | 20.00% |
| Component | Type |
| --- | --- |
| Dense Connections | Feedforward Networks |
| Global Sub-Sampled Attention | Attention Mechanisms |
| Layer Normalization | Normalization |
| Locally-Grouped Self-Attention | Attention Mechanisms |