Image Models


Introduced by Tolstikhin et al. in MLP-Mixer: An all-MLP Architecture for Vision

The MLP-Mixer architecture (or “Mixer” for short) is a vision architecture that uses neither convolutions nor self-attention. Instead, Mixer is based entirely on multi-layer perceptrons (MLPs) that are repeatedly applied across either spatial locations or feature channels. Mixer relies only on basic matrix multiplication routines, changes to data layout (reshapes and transpositions), and scalar nonlinearities.

Mixer accepts as input a sequence of linearly projected image patches (also referred to as tokens), shaped as a “patches × channels” table, and maintains this dimensionality throughout. It uses two types of MLP layers: channel-mixing MLPs and token-mixing MLPs. The channel-mixing MLPs allow communication between different channels; they operate on each token independently, taking individual rows of the table as inputs. The token-mixing MLPs allow communication between different spatial locations (tokens); they operate on each channel independently, taking individual columns of the table as inputs. These two types of layers are interleaved to enable interaction along both input dimensions.
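The interleaving of the two MLP types can be sketched as a single Mixer block. Below is a minimal NumPy sketch, not the authors' implementation: the shapes, hidden widths, and parameter initialization are illustrative assumptions, and layer normalization plus skip connections are included as described in the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its channel (last) axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU nonlinearity.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, b1, w2, b2):
    # Two-layer MLP applied along the last axis of x.
    return gelu(x @ w1 + b1) @ w2 + b2

def mixer_block(x, token_params, channel_params):
    # x: a (patches, channels) table of patch embeddings.
    # Token-mixing: transpose so the MLP acts across patches (table columns).
    y = x + mlp(layer_norm(x).T, *token_params).T
    # Channel-mixing: the MLP acts across channels (table rows).
    return y + mlp(layer_norm(y), *channel_params)

# Illustrative sizes (assumed, not from the paper):
# 16 patches, 32 channels, hidden widths 64 (token) and 128 (channel).
rng = np.random.default_rng(0)
P, C, Dt, Dc = 16, 32, 64, 128
x = rng.normal(size=(P, C))
token_params = (rng.normal(size=(P, Dt)) * 0.02, np.zeros(Dt),
                rng.normal(size=(Dt, P)) * 0.02, np.zeros(P))
channel_params = (rng.normal(size=(C, Dc)) * 0.02, np.zeros(Dc),
                  rng.normal(size=(Dc, C)) * 0.02, np.zeros(C))

out = mixer_block(x, token_params, channel_params)
print(out.shape)  # (16, 32): the patches × channels shape is preserved
```

Note the only difference between the two mixing steps is a transpose: token-mixing shares one MLP across all channels, channel-mixing shares one MLP across all tokens, which is what keeps the parameter count independent of how the two dimensions interact.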





Task                   Papers  Share
Image Classification   5       20.00%
Object Detection       4       16.00%
Semantic Segmentation  4       16.00%
Language Modelling     2       8.00%
Instance Segmentation  2       8.00%
EEG                    1       4.00%
Speech Synthesis       1       4.00%
Adversarial Attack     1       4.00%
Compressive Sensing    1       4.00%