gMLP is an MLP-based alternative to Transformers that does without self-attention: it consists solely of channel projections and spatial projections with static parameterization, built from basic MLP layers with gating. The model is a stack of $L$ blocks of identical size and structure. Let $X \in \mathbb{R}^{n \times d}$ be the token representations with sequence length $n$ and dimension $d$. Each block is defined as:
$$ Z=\sigma(X U), \quad \tilde{Z}=s(Z), \quad Y=\tilde{Z} V $$
where $\sigma$ is an activation function such as GeLU, and $U$ and $V$ are linear projections along the channel dimension, the same as those in the FFNs of Transformers (e.g., their shapes are $768 \times 3072$ and $3072 \times 768$ for $\text{BERT}_{\text{base}}$).
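For concreteness, here is a minimal PyTorch sketch of the two channel projections with the $\text{BERT}_{\text{base}}$ shapes quoted above (class and attribute names are illustrative). With $s$ left as the identity, this is exactly a Transformer FFN; the spatial layer is sketched further below.

```python
import torch
import torch.nn as nn

class ChannelProjections(nn.Module):
    """The two channel projections U and V around a spatial layer s(.).

    With s = identity this reduces to a regular Transformer FFN.
    Default shapes follow the BERT_base example (768 -> 3072 -> 768).
    """

    def __init__(self, d_model=768, d_ffn=3072):
        super().__init__()
        self.proj_u = nn.Linear(d_model, d_ffn)  # U: d -> d_ffn
        self.proj_v = nn.Linear(d_ffn, d_model)  # V: d_ffn -> d
        self.act = nn.GELU()                     # sigma
        self.s = nn.Identity()                   # placeholder spatial layer

    def forward(self, x):                        # x: (batch, n, d)
        z = self.act(self.proj_u(x))             # Z = sigma(X U)
        z_tilde = self.s(z)                      # Z~ = s(Z)
        return self.proj_v(z_tilde)              # Y = Z~ V
```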
A key ingredient is $s(\cdot)$, a layer that captures spatial interactions. When $s$ is an identity mapping, the transformation above degenerates to a regular FFN, in which individual tokens are processed independently without any cross-token communication. A major focus is therefore to design an $s$ capable of capturing complex spatial interactions across tokens. This leads to the Spatial Gating Unit, a modified linear gating across the spatial dimension, sketched below.
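The text above does not spell out the gating. In the paper, the Spatial Gating Unit splits $Z$ into $(Z_1, Z_2)$ along the channel dimension and computes $s(Z) = Z_1 \odot f(Z_2)$, where $f$ is a linear projection over the sequence (spatial) dimension; the normalization of $Z_2$ and the near-identity initialization also follow the paper. A minimal sketch along those lines (names are illustrative):

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """Spatial Gating Unit: s(Z) = Z1 * f(Z2), with f a linear map over tokens.

    Z is split in half along channels; one half gates the other after a
    token-to-token (spatial) linear projection. The projection is initialized
    near zero with bias 1 so the unit starts out close to identity on Z1.
    """

    def __init__(self, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        # Linear map across the sequence (spatial) dimension: n -> n.
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        nn.init.zeros_(self.spatial_proj.weight)
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, z):                        # z: (batch, n, d_ffn)
        z1, z2 = z.chunk(2, dim=-1)              # split along channels
        z2 = self.norm(z2)
        z2 = self.spatial_proj(z2.transpose(1, 2)).transpose(1, 2)  # mix tokens
        return z1 * z2                           # element-wise gating
```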
The overall block layout is inspired by inverted bottlenecks, which define $s(\cdot)$ as a spatial depthwise convolution. Note that, unlike Transformers, gMLP does not require position embeddings, because such information is captured in $s(\cdot)$.
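Putting the pieces together, here is a sketch of one full block, reusing the `SpatialGatingUnit` sketched above. The layer normalization and residual shortcut correspond to the Normalization and Skip Connections components listed below; $V$ maps the halved width $d_{ffn}/2$ back to $d$ because the SGU splits the channels. Class and parameter names (e.g. `seq_len`) are illustrative.

```python
import torch
import torch.nn as nn

class GMLPBlock(nn.Module):
    """One gMLP block: norm -> U -> GELU -> Spatial Gating Unit -> V,
    wrapped in a residual connection."""

    def __init__(self, d_model=768, d_ffn=3072, seq_len=128):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_u = nn.Linear(d_model, d_ffn)       # U
        self.act = nn.GELU()                          # sigma
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)  # s(.), halves the width
        self.proj_v = nn.Linear(d_ffn // 2, d_model)  # V

    def forward(self, x):                             # x: (batch, n, d)
        shortcut = x
        z = self.act(self.proj_u(self.norm(x)))       # Z = sigma(X U)
        z_tilde = self.sgu(z)                         # Z~ = s(Z)
        return shortcut + self.proj_v(z_tilde)        # Y = Z~ V, plus residual


# A model is a stack of L identical blocks; no position embeddings are needed.
x = torch.randn(2, 128, 768)                          # (batch, n, d)
model = nn.Sequential(*[GMLPBlock(seq_len=128) for _ in range(4)])
print(model(x).shape)                                 # torch.Size([2, 128, 768])
```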
Source: Pay Attention to MLPs
Tasks addressed in papers that use gMLP:

Task | Papers | Share |
---|---|---|
Image Classification | 3 | 11.11% |
Instance Segmentation | 2 | 7.41% |
Object Detection | 2 | 7.41% |
Semantic Segmentation | 2 | 7.41% |
Question Answering | 2 | 7.41% |
Graph Representation Learning | 1 | 3.70% |
Node Classification | 1 | 3.70% |
Classification | 1 | 3.70% |
Decoder | 1 | 3.70% |
Components used in gMLP:

Component | Type |
---|---|
GELU | Activation Functions |
Layer Normalization | Normalization |
Residual Connection | Skip Connections |
Dense Connections | Feedforward Networks |