Spatially Separable Self-Attention, or SSSA, is an attention module used in the Twins-SVT architecture that aims to reduce the computational complexity of vision transformers for dense prediction tasks (given high-resolution inputs). SSSA is composed of locally-grouped self-attention (LSA) and global sub-sampled attention (GSA).
Formally, spatially separable self-attention (SSSA) can be written as:
$$ \hat{\mathbf{z}}_{i j}^{l}=\text { LSA }\left(\text { LayerNorm }\left(\mathbf{z}_{i j}^{l-1}\right)\right)+\mathbf{z}_{i j}^{l-1} $$
$$\mathbf{z}_{i j}^{l}=\mathrm{FFN}\left(\operatorname{LayerNorm}\left(\hat{\mathbf{z}}_{i j}^{l}\right)\right)+\hat{\mathbf{z}}_{i j}^{l} $$
$$ \hat{\mathbf{z}}^{l+1}=\text { GSA }\left(\text { LayerNorm }\left(\mathbf{z}^{l}\right)\right)+\mathbf{z}^{l} $$
$$ \mathbf{z}^{l+1}=\text { FFN }\left(\text { LayerNorm }\left(\hat{\mathbf{z}}^{l+1}\right)\right)+\hat{\mathbf{z}}^{l+1}$$
$$ i \in\{1,2, \ldots, m\}, \quad j \in\{1,2, \ldots, n\} $$
where $i$ and $j$ index the $m \times n$ sub-windows, LSA denotes locally-grouped self-attention within a sub-window, and GSA denotes global sub-sampled attention, in which every position interacts with the representative keys (generated by the sub-sampling functions) from each sub-window $\hat{\mathbf{z}}_{i j} \in \mathcal{R}^{k_{1} \times k_{2} \times C}$. Both LSA and GSA have multiple heads, as in standard self-attention.
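The four update equations can be illustrated with a minimal NumPy sketch. It is not the Twins-SVT implementation, and it makes several simplifying assumptions that are not taken from the text above: a single attention head, LayerNorm without learnable parameters, a ReLU feed-forward network, and average pooling over each sub-window as the sub-sampling function. All function and parameter names (`to_windows`, `lsa`, `gsa`, `sssa_block`, `params`) are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the channel axis (learnable scale/shift omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q_in, kv_in, wq, wk, wv):
    # Single-head scaled dot-product attention (assumption: one head only).
    q, k, v = q_in @ wq, kv_in @ wk, kv_in @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def to_windows(z, k1, k2):
    # (H, W, C) -> (m*n, k1*k2, C): one row of tokens per sub-window.
    H, W, C = z.shape
    m, n = H // k1, W // k2
    z = z.reshape(m, k1, n, k2, C).transpose(0, 2, 1, 3, 4)
    return z.reshape(m * n, k1 * k2, C)

def from_windows(w, H, W, k1, k2):
    # Inverse of to_windows: (m*n, k1*k2, C) -> (H, W, C).
    m, n = H // k1, W // k2
    w = w.reshape(m, n, k1, k2, w.shape[-1]).transpose(0, 2, 1, 3, 4)
    return w.reshape(H, W, w.shape[-1])

def lsa(z, k1, k2, p):
    # Locally-grouped self-attention: tokens attend only within their sub-window.
    H, W, _ = z.shape
    w = to_windows(z, k1, k2)
    return from_windows(attention(w, w, *p), H, W, k1, k2)

def gsa(z, k1, k2, p):
    # Global sub-sampled attention: every token queries one representative
    # per sub-window (assumption: representative = average of the window).
    H, W, C = z.shape
    reps = to_windows(z, k1, k2).mean(axis=1)          # (m*n, C)
    return attention(z.reshape(H * W, C), reps, *p).reshape(H, W, C)

def ffn(x, w1, w2):
    # Two-layer feed-forward network (assumption: ReLU activation).
    return np.maximum(x @ w1, 0.0) @ w2

def sssa_block(z, k1, k2, params):
    # Mirrors the four equations: LSA + FFN, then GSA + FFN, each with a residual.
    z_hat = lsa(layer_norm(z), k1, k2, params["lsa"]) + z
    z = ffn(layer_norm(z_hat), *params["ffn1"]) + z_hat
    z_hat = gsa(layer_norm(z), k1, k2, params["gsa"]) + z
    return ffn(layer_norm(z_hat), *params["ffn2"]) + z_hat

# Usage: a 4 x 6 feature map with C = 8 channels and 2 x 3 sub-windows.
rng = np.random.default_rng(0)
C = 8
init = lambda *shape: rng.normal(scale=0.1, size=shape)
params = {
    "lsa": (init(C, C), init(C, C), init(C, C)),
    "gsa": (init(C, C), init(C, C), init(C, C)),
    "ffn1": (init(C, 4 * C), init(4 * C, C)),
    "ffn2": (init(C, 4 * C), init(4 * C, C)),
}
z = rng.normal(size=(4, 6, C))
out = sssa_block(z, 2, 3, params)
print(out.shape)  # (4, 6, 8)
```

In this sketch each LSA query attends to $k_1 k_2$ keys and each GSA query attends to $m n$ representatives, instead of all $H W$ positions as in full self-attention. That reduction is the source of the savings on high-resolution inputs.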
