MLP-Mixer: An all-MLP Architecture for Vision

Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them is necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models.
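The core building block described above is simple to write down. Below is a minimal sketch of one Mixer layer in PyTorch (the authors' reference implementation is in JAX/Flax); the class and parameter names are illustrative, and the example sizes correspond to the Mixer-B/16 configuration (hidden width 768, token-mixing MLP width 384, channel-mixing MLP width 3072, 196 patches for a 224x224 image with 16x16 patches).

```python
# Illustrative sketch of a single Mixer layer; not the authors' original code.
import torch
import torch.nn as nn


class MlpBlock(nn.Module):
    """Two dense layers with a GELU nonlinearity in between."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)


class MixerBlock(nn.Module):
    """One Mixer layer: a token-mixing MLP across patches, then a channel-mixing MLP per patch."""
    def __init__(self, num_patches, hidden_dim, tokens_mlp_dim, channels_mlp_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.token_mlp = MlpBlock(num_patches, tokens_mlp_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.channel_mlp = MlpBlock(hidden_dim, channels_mlp_dim)

    def forward(self, x):                      # x: (batch, num_patches, hidden_dim)
        y = self.norm1(x).transpose(1, 2)      # (batch, hidden_dim, num_patches)
        y = self.token_mlp(y).transpose(1, 2)  # mix information across patches
        x = x + y                              # residual connection
        y = self.norm2(x)
        x = x + self.channel_mlp(y)            # mix information across channels
        return x


# Example: a Mixer-B/16-sized block (196 patches of 16x16 from a 224x224 image).
block = MixerBlock(num_patches=196, hidden_dim=768, tokens_mlp_dim=384, channels_mlp_dim=3072)
out = block(torch.randn(2, 196, 768))
print(out.shape)  # torch.Size([2, 196, 768])
```

A full Mixer model stacks several such blocks on top of a per-patch linear projection (patch embedding), followed by global average pooling and a linear classifier head.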


Datasets

ImageNet, ImageNet ReaL, JFT-300M (pre-training)
Results from the Paper


Ranked #9 on Image Classification on ImageNet (using extra training data)

TASK                 | DATASET       | MODEL                           | METRIC         | METRIC VALUE | GLOBAL RANK
Image Classification | ImageNet      | Mixer-L/16                      | Top 1 Accuracy | 84.15%       | #77
Image Classification | ImageNet      | Mixer-H/14 (JFT-300M pre-train) | Top 1 Accuracy | 87.94%       | #9
Image Classification | ImageNet ReaL | Mixer-H/14                      | Accuracy       | 87.86%       | #18
Image Classification | ImageNet ReaL | Mixer-H/14 (JFT-300M pre-train) | Accuracy       | 90.18%       | #10

Methods used in the Paper


METHOD                       | TYPE
Adam                         | Stochastic Optimization
Layer Normalization          | Normalization
Residual Connection          | Skip Connections
Label Smoothing              | Regularization
BPE                          | Subword Segmentation
Dropout                      | Regularization
Softmax                      | Output Functions
Dense Connections            | Feedforward Networks
Scaled Dot-Product Attention | Attention Mechanisms
Transformer                  | Transformers
Multi-Head Attention         | Attention Modules
Vision Transformer           | Image Models