The Spatial Gating Unit, or SGU, is a gating unit used in the gMLP architecture to capture spatial interactions. To enable cross-token interactions, the layer $s(\cdot)$ must contain a contraction operation over the spatial dimension. The layer $s(\cdot)$ is formulated as the output of linear gating:
$$ s(Z)=Z \odot f_{W, b}(Z) $$
where $\odot$ denotes element-wise multiplication and $f_{W, b}(Z) = WZ + b$ is a linear projection over the spatial (token) dimension. For training stability, the authors find it critical to initialize $W$ to near-zero values and $b$ to ones, so that $f_{W, b}(Z) \approx 1$ and therefore $s(Z) \approx Z$ at the beginning of training. This initialization ensures each gMLP block behaves like a regular FFN at the early stage of training, where each token is processed independently, and only gradually injects spatial information across tokens during the course of learning.
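The basic gating can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the shapes and variable names (`n` tokens, `d` channels) are assumptions for the example.

```python
import numpy as np

def spatial_gating_unit(Z, W, b):
    """Basic SGU: gate Z element-wise with a linear spatial projection.

    Z : (n, d) array of n tokens with d channels.
    W : (n, n) spatial projection weights (initialized near zero).
    b : (n,) bias (initialized to ones).
    """
    gate = W @ Z + b[:, None]  # f_{W,b}(Z): contraction over the token dim
    return Z * gate            # element-wise gating

# With W ~ 0 and b = 1, the gate is ~1 and s(Z) ~ Z, as in the paper.
rng = np.random.default_rng(0)
n, d = 8, 16
Z = rng.normal(size=(n, d))
W = rng.normal(scale=1e-3, size=(n, n))  # near-zero initialization
b = np.ones(n)
out = spatial_gating_unit(Z, W, b)
print(np.allclose(out, Z, atol=0.1))  # approximately the identity at init
```

Note that, unlike a channel-wise MLP, `W` mixes information *across tokens*, which is what gives the block its spatial receptive field.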
The authors find it further effective to split $Z$ into two independent parts $\left(Z_{1}, Z_{2}\right)$ along the channel dimension for the gating function and for the multiplicative bypass:
$$ s(Z)=Z_{1} \odot f_{W, b}\left(Z_{2}\right) $$
They also normalize the input to $f_{W, b}$, which empirically improved the stability of large NLP models.
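The split-and-normalize variant can be sketched as follows. Again this is a hedged NumPy sketch under assumed shapes (the input carries $2d$ channels so that $Z_1$ and $Z_2$ each get $d$), with a plain layer normalization standing in for whatever normalization the authors used.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    # Normalize each token (row) over the channel dimension.
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def sgu_split(Z, W, b):
    """Split SGU: Z1 gates the spatial projection of a normalized Z2.

    Z : (n, 2d) array, split into Z1, Z2 of shape (n, d) along channels.
    W : (n, n) spatial weights (near-zero init); b : (n,) bias (ones init).
    """
    Z1, Z2 = np.split(Z, 2, axis=-1)
    gate = W @ layer_norm(Z2) + b[:, None]
    return Z1 * gate

rng = np.random.default_rng(0)
n, d = 8, 16
Z = rng.normal(size=(n, 2 * d))
W = rng.normal(scale=1e-3, size=(n, n))
b = np.ones(n)
out = sgu_split(Z, W, b)
print(out.shape)  # (8, 16): the SGU halves the channel dimension
```

A design consequence worth noting: because only $Z_1$ passes through the multiplicative bypass, the SGU halves the channel dimension, so the surrounding block's projections must account for the $2d \to d$ reduction.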
Source: Pay Attention to MLPs

| Task | Papers | Share |
| --- | --- | --- |
| Image Classification | 3 | 17.65% |
| Instance Segmentation | 2 | 11.76% |
| Object Detection | 2 | 11.76% |
| Semantic Segmentation | 2 | 11.76% |
| Multi-Label Classification | 1 | 5.88% |
| Multi-Label Text Classification | 1 | 5.88% |
| Text Classification | 1 | 5.88% |
| Language Modelling | 1 | 5.88% |
| Zero-Shot Learning | 1 | 5.88% |