Locally-grouped self-attention, or LSA, is a local attention mechanism used in the Twins-SVT architecture. Motivated by the group design in depth-wise convolutions for efficient inference, the 2D feature maps are first divided evenly into sub-windows, so that self-attention communication happens only within each sub-window. This design also resonates with the multi-head design in self-attention, where communication occurs only within the channels of the same head. Specifically, the feature maps are divided into $m \times n$ sub-windows. Without loss of generality, we assume $H \% m=0$ and $W \% n=0$. Each sub-window contains $\frac{H W}{m n}$ elements, so the computation cost of self-attention within one window is $\mathcal{O}\left(\frac{H^{2} W^{2}}{m^{2} n^{2}} d\right)$, and the total cost over all windows is $\mathcal{O}\left(\frac{H^{2} W^{2}}{m n} d\right)$. Letting $k_{1}=\frac{H}{m}$ and $k_{2}=\frac{W}{n}$, the cost can be written as $\mathcal{O}\left(k_{1} k_{2} H W d\right)$, which is significantly more efficient than global attention when $k_{1} \ll H$ and $k_{2} \ll W$, and grows linearly with $H W$ if $k_{1}$ and $k_{2}$ are fixed.
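The window partitioning above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the Twins-SVT implementation: it uses identity Q/K/V projections for brevity (a real layer learns these), and the helper names are our own. The key point is that attention is computed over $k_1 k_2$ tokens per window rather than all $HW$ tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def locally_grouped_attention(x, m, n):
    """Single-head LSA sketch: attention is restricted to each of the
    m x n sub-windows of an (H, W, d) feature map, with window sizes
    k1 = H // m and k2 = W // n."""
    H, W, d = x.shape
    assert H % m == 0 and W % n == 0
    k1, k2 = H // m, W // n
    # Partition into m*n windows of k1*k2 tokens each.
    win = x.reshape(m, k1, n, k2, d).transpose(0, 2, 1, 3, 4)
    win = win.reshape(m * n, k1 * k2, d)
    # Scaled dot-product attention inside each window
    # (identity projections stand in for learned Q/K/V).
    attn = softmax(win @ win.transpose(0, 2, 1) / np.sqrt(d), axis=-1)
    out = attn @ win
    # Undo the partition back to (H, W, d).
    out = out.reshape(m, n, k1, k2, d).transpose(0, 2, 1, 3, 4)
    return out.reshape(H, W, d)

x = np.random.randn(8, 8, 16)
y = locally_grouped_attention(x, m=2, n=2)
print(y.shape)  # (8, 8, 16): each token attended only to its 4x4 window
```

With $m = n = 1$ this reduces to global self-attention over the whole map, which makes the cost comparison in the text concrete: per-window cost scales with $(k_1 k_2)^2$, and there are $\frac{HW}{k_1 k_2}$ windows.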
Although the locally-grouped self-attention mechanism is computation-friendly, the image is divided into non-overlapping sub-windows, so a mechanism is needed to communicate between different sub-windows, as in Swin. Otherwise, information would only be processed locally, which keeps the receptive field small and significantly degrades performance, as shown in our experiments. This resembles the fact that we cannot replace all standard convolutions with depth-wise convolutions in CNNs.
Source: Twins: Revisiting the Design of Spatial Attention in Vision Transformers
