Global SubSampled Attention, or GSA, is a global attention mechanism used in the Twins-SVT architecture.
The $H \times W$ feature map is divided into $m \times n$ subwindows of size $k_{1} \times k_{2}$, so that $m = H / k_{1}$ and $n = W / k_{2}$. A single representative summarizes the key information of each subwindow, and this representative is used to communicate with the other subwindows (serving as the key in self-attention). This reduces the cost to $\mathcal{O}(m n H W d)=\mathcal{O}\left(\frac{H^{2} W^{2} d}{k_{1} k_{2}}\right)$, where $d$ is the feature dimension. This is essentially equivalent to using the subsampled feature maps as the key in attention operations, and thus it is termed global subsampled attention (GSA).
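As a rough illustration of the cost expression above, the following sketch counts query-key interactions for dense self-attention and for GSA. The function names are hypothetical, constant factors are ignored, the channel width `d = 64` is an arbitrary illustrative value, and the feature map is assumed to divide evenly into subwindows.

```python
def full_attention_cost(H, W, d):
    """Cost of dense self-attention: every position attends to every position."""
    return (H * W) ** 2 * d


def gsa_cost(H, W, d, k1, k2):
    """Cost of GSA: H*W queries attend to one representative key per subwindow."""
    if H % k1 or W % k2:
        raise ValueError("this sketch assumes k1 divides H and k2 divides W")
    m, n = H // k1, W // k2  # number of subwindows along each axis
    return m * n * H * W * d


if __name__ == "__main__":
    H = W = 56  # a 56 x 56 feature map
    d = 64      # illustrative channel width (assumption)
    full = full_attention_cost(H, W, d)
    gsa = gsa_cost(H, W, d, 7, 7)
    print(full, gsa, full // gsa)
```

Under this simplified model the saving relative to dense attention is exactly the factor $k_{1} k_{2}$, which is 49 for $7 \times 7$ subwindows.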
If LSA (locally-grouped self-attention) and GSA are used alternately, like the depthwise and pointwise steps of a separable convolution, the total computation cost is $\mathcal{O}\left(\frac{H^{2} W^{2} d}{k_{1} k_{2}}+k_{1} k_{2} H W d\right)$, where the second term is the cost of LSA. We have:
$$\frac{H^{2} W^{2} d}{k_{1} k_{2}}+k_{1} k_{2} H W d \geq 2 H W d \sqrt{H W} $$
By the AM-GM inequality, the minimum is obtained when $k_{1} \cdot k_{2}=\sqrt{H W}$. Note that an input resolution of $H=W=224$ is popular in classification. Without loss of generality, square subwindows are used, i.e., $k_{1}=k_{2}$. Therefore, $k_{1}=k_{2}=15$ is close to the global minimum for $H=W=224$. However, the network is designed to include several stages with different resolutions. Stage 1 has feature maps of $56 \times 56$, for which the minimum is obtained when $k_{1}=k_{2}=\sqrt{56} \approx 7$. In theory, the optimal $k_{1}$ and $k_{2}$ could be calibrated for each stage. For simplicity, $k_{1}=k_{2}=7$ is used everywhere. For stages with lower resolutions, the summarizing window size of GSA is controlled to avoid generating too few keys. Specifically, sizes of 4, 2, and 1 are used for the last three stages, respectively.
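The choice of window size can be checked numerically. The sketch below evaluates the combined LSA + GSA cost for square windows, computes the continuous optimum $k=(HW)^{1/4}$, and brute-forces the best integer $k$. All function names are hypothetical, constants are dropped, and the model ignores whether $k$ divides the feature-map size.

```python
import math


def total_cost(H, W, d, k):
    """Combined GSA + LSA cost for square k x k subwindows, constants dropped."""
    k_sq = k * k
    return (H * W) ** 2 * d / k_sq + k_sq * H * W * d


def lower_bound(H, W, d):
    """AM-GM lower bound 2 * H * W * d * sqrt(H * W) on total_cost."""
    return 2 * H * W * d * math.sqrt(H * W)


def continuous_optimum(H, W):
    """Side length k minimising total_cost, from k * k = sqrt(H * W)."""
    return (H * W) ** 0.25


def best_integer_window(H, W, d=1):
    """Brute-force search over integer k; ties resolve to the smallest k."""
    return min(range(1, min(H, W) + 1), key=lambda k: total_cost(H, W, d, k))


if __name__ == "__main__":
    for size in (224, 56):
        print(size, continuous_optimum(size, size), best_integer_window(size, size))
```

For $H=W=224$ the continuous optimum is about 14.97 and the best integer is 15. For $H=W=56$ the continuous optimum is about 7.48, and under this cost model $k=7$ and $k=8$ give the same cost because $7^{2} \cdot 8^{2}=56^{2}$.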
