gMLP is an MLP-based alternative to Transformers that dispenses with self-attention, consisting only of channel projections and spatial projections with static parameterization. It is built from basic MLP layers with gating. The model is a stack of $L$ blocks of identical size and structure. Let $X \in \mathbb{R}^{n \times d}$ be the token representations with sequence length $n$ and dimension $d$. Each block is defined as:
$$ Z=\sigma(X U), \quad \tilde{Z}=s(Z), \quad Y=\tilde{Z} V $$
where $\sigma$ is an activation function such as GeLU. $U$ and $V$ are linear projections along the channel dimension, the same as those in the FFNs of Transformers (e.g., their shapes are $768 \times 3072$ and $3072 \times 768$ for $\text{BERT}_{\text{base}}$).
A key ingredient is $s(\cdot)$, a layer that captures spatial interactions. When $s$ is the identity mapping, the transformation above degenerates to a regular FFN, in which individual tokens are processed independently without any cross-token communication. A major focus is therefore to design an $s$ capable of capturing complex spatial interactions across tokens. This leads to the Spatial Gating Unit, a modified linear gating layer.
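To make the degenerate case concrete, here is a minimal NumPy sketch of the block with $s$ set to the identity; the dimensions are the BERT_base sizes quoted above, and the random initialization is purely illustrative:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

n, d, d_ffn = 16, 768, 3072   # sequence length, model dim, FFN hidden dim
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
U = 0.02 * rng.standard_normal((d, d_ffn))   # channel projection up
V = 0.02 * rng.standard_normal((d_ffn, d))   # channel projection down

Z = gelu(X @ U)   # Z = sigma(X U)
Y = Z @ V         # with s = identity, Y = Z V is a plain Transformer FFN
```

Each token (row of `X`) is transformed independently here, which is exactly why a non-trivial $s$ is needed for cross-token communication.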
The overall block layout is inspired by inverted bottlenecks, which define $s(\cdot)$ as a spatial depthwise convolution. Note that, unlike Transformers, gMLP requires no position embeddings, because such information is captured in $s(\cdot)$.
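A full block with the Spatial Gating Unit can be sketched as follows. This is a NumPy illustration, not the reference implementation; the internal details (splitting channels in half, normalizing the gate path, near-zero spatial weights with a bias of ones so the gate starts close to identity) follow the gMLP paper rather than anything stated in this excerpt, and the small dimensions are arbitrary:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def spatial_gating_unit(Z, W, b):
    # Split channels into a content half and a gating half.
    Z1, Z2 = np.split(Z, 2, axis=-1)
    # Normalize the gate path, then mix tokens with an n x n spatial projection.
    gate = W @ layer_norm(Z2) + b[:, None]
    return Z1 * gate

def gmlp_block(X, U, V, W, b):
    Z = gelu(X @ U)                         # channel projection up
    Z_tilde = spatial_gating_unit(Z, W, b)  # spatial interaction s(.)
    return X + Z_tilde @ V                  # channel projection down + residual

n, d, d_ffn = 8, 64, 256
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
U = 0.02 * rng.standard_normal((d, d_ffn))
V = 0.02 * rng.standard_normal((d_ffn // 2, d))  # half width: SGU halves the channels
W = 0.01 * rng.standard_normal((n, n))  # spatial weights, near zero at init
b = np.ones(n)                          # bias of ones: gate starts near identity
Y = gmlp_block(X, U, V, W, b)
```

Because the spatial matrix `W` acts along the sequence axis, each output token is a learned mixture over token positions, which is how positional information is absorbed without explicit position embeddings.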
Source: Pay Attention to MLPs

| Task | Papers | Share |
| --- | --- | --- |
| Image Classification | 3 | 17.65% |
| Instance Segmentation | 2 | 11.76% |
| Object Detection | 2 | 11.76% |
| Semantic Segmentation | 2 | 11.76% |
| Multi-Label Classification | 1 | 5.88% |
| Multi-Label Text Classification | 1 | 5.88% |
| Text Classification | 1 | 5.88% |
| Language Modelling | 1 | 5.88% |
| Zero-Shot Learning | 1 | 5.88% |
| Component | Type |
| --- | --- |
| GELU | Activation Functions |
| Layer Normalization | Normalization |
| Residual Connection | Skip Connections |
| Spatial Gating Unit | Feedforward Networks |