Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them is necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e., "mixing" the per-location features), and one with MLPs applied across patches (i.e., "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well-established CNNs and Transformers.
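
The two layer types described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The abstract does not specify the nonlinearity, the use of skip connections, or the hidden widths, so the GELU activation, the residual additions, the absence of biases and normalization, and all function names and sizes below are assumptions chosen for illustration. The input is a table of shape (patches, channels): the token-mixing MLP acts across patches (on the transposed table), and the channel-mixing MLP acts on each patch independently.

```python
import math
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def gelu(x):
    # tanh approximation of GELU (assumed nonlinearity, not stated in the abstract)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def mlp(x, w1, w2):
    """Two-layer MLP applied to every row of x independently (biases omitted)."""
    hidden = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def mixer_block(x, token_w1, token_w2, chan_w1, chan_w2):
    """One block of the sketch; x has shape (patches, channels)."""
    # Token mixing: transpose so each row holds one channel across all patches,
    # then mix across patches (spatial information).
    mixed_tokens = transpose(mlp(transpose(x), token_w1, token_w2))
    x = add(x, mixed_tokens)  # assumed skip connection
    # Channel mixing: each row is one patch, processed independently
    # (per-location features).
    x = add(x, mlp(x, chan_w1, chan_w2))  # assumed skip connection
    return x


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy sizes, far smaller than any real model.
rng = random.Random(0)
num_patches, channels = 4, 6
token_hidden, channel_hidden = 8, 12

x = random_matrix(num_patches, channels, rng, scale=1.0)
token_w1 = random_matrix(num_patches, token_hidden, rng)
token_w2 = random_matrix(token_hidden, num_patches, rng)
chan_w1 = random_matrix(channels, channel_hidden, rng)
chan_w2 = random_matrix(channel_hidden, channels, rng)

y = mixer_block(x, token_w1, token_w2, chan_w1, chan_w2)
print(len(y), len(y[0]))  # prints: 4 6
```

Both MLPs preserve the (patches, channels) shape, so blocks of this kind can be stacked. Note that the token-mixing weights are sized by the number of patches, which ties this sketch to a fixed input resolution.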

NeurIPS 2021
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Image Classification | ImageNet | ViT-L/16 (Dosovitskiy et al., 2021) | Top 1 Accuracy | 85.3% | #138 |
| Image Classification | ImageNet | Mixer-H/14 (JFT-300M pre-train) | Top 1 Accuracy | 87.94% | #39 |
| | | | Hardware Burden | None | #1 |
| | | | Operations per network pass | None | #1 |
| Image Classification | ImageNet | Mixer-B/16 | Top 1 Accuracy | 76.44% | #597 |
| | | | Number of params | 46M | #504 |
| Image Classification | ImageNet ReaL | Mixer-H/14-448 (JFT-300M pre-train) | Accuracy | 90.18% | #19 |
| | | | Params | 409M | #43 |
| Image Classification | ImageNet ReaL | Mixer-H/14 (JFT-300M pre-train) | Accuracy | 87.86% | #29 |
| | | | Params | 409M | #43 |
| Image Classification | OmniBenchmark | MLP-Mixer | Average Top-1 Accuracy | 32.2 | #18 |