TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Classification	CIFAR-10	ConvMixer-256/16	Percentage correct	96.74	# 96
Image Classification	CIFAR-10	ConvMixer-256/16	PARAMS	1.34M	# 186
Image Classification	CIFAR-10	ConvMixer-256/8	Percentage correct	96.03	# 113
Image Classification	CIFAR-10	ConvMixer-256/8	PARAMS	0.71M	# 178
Image Classification	ImageNet	ConvMixer-1536/20	Top 1 Accuracy	82.20	# 510
Image Classification	ImageNet	ConvMixer-1536/20	Number of params	51.6M	# 732

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/patches-are-all-you-need-1/image-classification-on-cifar-10)](https://paperswithcode.com/sota/image-classification-on-cifar-10?p=patches-are-all-you-need-1)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/patches-are-all-you-need-1/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=patches-are-all-you-need-1)`

Patches Are All You Need?

24 Jan 2022 · Asher Trockman, J. Zico Kolter

Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet. Our code is available at https://github.com/locuslab/convmixer.
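The abstract's motivation for patch embeddings can be made concrete with a little arithmetic. The following is a minimal, dependency-free sketch, not the authors' code: `patchify` and `attention_pairs` are hypothetical helper names, and the 224 x 224 input with 16 x 16 patches is an assumed, illustrative setting that the abstract does not state. It shows how grouping small image regions into single input features shrinks the token count, and with it the quadratic cost of self-attention.

```python
def patchify(image, patch):
    """Split an H x W image (a list of rows) into non-overlapping
    patch x patch regions, each flattened row-major into one feature vector."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch)
                            for dx in range(patch)])
    return patches


def attention_pairs(num_tokens):
    """Token-to-token interactions in one self-attention layer (quadratic)."""
    return num_tokens * num_tokens


# Tiny 4 x 4 single-channel image holding the values 0..15 in row-major order.
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)  # four patches of four values each

# Assumed illustrative sizes (not from the abstract): 224 x 224 input, 16 x 16 patches.
pixel_tokens = 224 * 224           # one token per pixel
patch_tokens = (224 // 16) ** 2    # one token per patch
ratio = attention_pairs(pixel_tokens) // attention_pairs(patch_tokens)
```

Under these assumed sizes, treating every pixel as a token would need 256 times more tokens than the patch representation, so the attention cost differs by a factor of 256 squared. The ConvMixer keeps the same patch-based input but replaces attention with convolutions for the mixing steps.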

Code

Repository	Stars	Notes
tmp-iclr/convmixer	1,044	official
locuslab/convmixer	1,044	official
labmlai/annotated_deep_learning_pap…	48,424	annotated code at labml.ai
BR-IDL/PaddleViT	1,186	
Westlake-AI/openmixup	574	

See all 11 implementations

Tasks

Image Classification

Datasets

CIFAR-10

ImageNet

Results from the Paper

Ranked #96 on Image Classification on CIFAR-10

Methods

1x1 Convolution • Absolute Position Encodings • Adam • Average Pooling • Batch Normalization • Bottleneck Residual Block • BPE • Convolution • Dense Connections • Dropout • GELU • Global Average Pooling • Kaiming Initialization • Label Smoothing • Layer Normalization • Linear Layer • Max Pooling • MLP-Mixer • Multi-Head Attention • Position-Wise Feed-Forward Layer • ReLU • Residual Block • Residual Connection • ResNet • Scaled Dot-Product Attention • Softmax • Transformer • Vision Transformer
