# Spatial Gating Unit

Introduced by Liu et al. in Pay Attention to MLPs

Spatial Gating Unit, or SGU, is a gating unit used in the gMLP architecture to capture spatial interactions. To enable cross-token interactions, the layer $s(\cdot)$ must contain a contraction operation over the spatial (token) dimension. The layer $s(\cdot)$ is formulated as the output of linear gating:

$$s(Z)=Z \odot f_{W, b}(Z)$$

where $\odot$ denotes element-wise multiplication and $f_{W, b}(Z)=W Z+b$ is a linear projection along the spatial dimension, with $W$ an $n \times n$ matrix for sequence length $n$ and $b$ a token-specific bias. For training stability, the authors find it critical to initialize $W$ with near-zero values and $b$ with ones, so that $f_{W, b}(Z) \approx 1$ and therefore $s(Z) \approx Z$ at the beginning of training. This initialization ensures that each gMLP block behaves like a regular FFN early in training, where each token is processed independently, and only gradually injects spatial information across tokens as learning proceeds.
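The effect of this initialization can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes $f_{W, b}(Z)=W Z+b$ with $Z$ stored as a `(n_tokens, d_channels)` array, and the function name `spatial_gate`, the shapes, and the `1e-4` initialization scale are illustrative choices.

```python
import numpy as np

def spatial_gate(Z, W, b):
    """Sketch of s(Z) = Z * f(Z), assuming f(Z) = W @ Z + b.

    Z: (n, d) token-by-channel activations.
    W: (n, n) spatial projection that mixes information across tokens.
    b: (n, 1) per-token bias, broadcast across channels.
    """
    return Z * (W @ Z + b)

rng = np.random.default_rng(0)
n, d = 4, 6  # hypothetical sequence length and channel width

Z = rng.normal(size=(n, d))
W = rng.normal(scale=1e-4, size=(n, n))  # near-zero init (scale is an assumption)
b = np.ones((n, 1))                      # bias initialized to ones

out = spatial_gate(Z, W, b)
max_dev = np.max(np.abs(out - Z))  # how far s(Z) is from the identity map
```

With these values `max_dev` is tiny, so the unit starts out close to an identity map on $Z$; cross-token mixing only appears once training moves $W$ away from zero.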

The authors further find it effective to split $Z$ into two independent parts $\left(Z_{1}, Z_{2}\right)$ along the channel dimension, using $Z_{2}$ as the input to the gating function and $Z_{1}$ as the multiplicative bypass:

$$s(Z)=Z_{1} \odot f_{W, b}\left(Z_{2}\right)$$

They also normalize the input to $f_{W, b}$, which empirically improved the stability of large NLP models.
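The split variant with a normalized gate input might be sketched as follows. This is an illustration under stated assumptions rather than the reference code: $f_{W, b}$ is taken to be the linear projection $W Z_{2}+b$, the normalization is assumed to be a layer norm over channels without learned scale or shift, and the names `layer_norm` and `split_spatial_gate` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token (row) over its channels; no learned parameters (assumption)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def split_spatial_gate(Z, W, b):
    """Sketch of s(Z) = Z1 * f(norm(Z2)), assuming f(X) = W @ X + b.

    Z: (n, d) with d even; split into Z1, Z2 of shape (n, d // 2).
    W: (n, n) spatial projection across tokens.
    b: (n, 1) per-token bias, broadcast across channels.
    """
    Z1, Z2 = np.split(Z, 2, axis=-1)    # split along the channel dimension
    gate = W @ layer_norm(Z2) + b       # cross-token mixing happens only here
    return Z1 * gate                    # Z1 is the multiplicative bypass

rng = np.random.default_rng(0)
n, d = 4, 8  # hypothetical sequence length and channel width
Z = rng.normal(size=(n, d))
W = np.zeros((n, n))  # exact-zero W used here only to make the check exact
b = np.ones((n, 1))

out = split_spatial_gate(Z, W, b)  # shape (n, d // 2)
```

Note that the output has half as many channels as the input, and that with $W=0$ and $b=1$ the unit simply returns $Z_{1}$.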

Source: Pay Attention to MLPs
