Spatial-Channel Token Distillation for Vision MLPs

Recently, all-MLP neural architectures, built entirely from Multi-Layer Perceptrons (MLPs), have attracted great research interest from the computer vision community. However, inefficient mixing of spatial and channel information causes MLP-like vision models to demand tremendous pre-training on large-scale datasets. This work addresses the problem from a novel knowledge distillation perspective. We propose Spatial-channel Token Distillation (STD), which improves information mixing in both dimensions by introducing a distillation token into each of them. A mutual information regularization is further introduced so that the distillation tokens focus on their specific dimensions and maximize the performance gain. Extensive experiments on ImageNet with several MLP-like architectures demonstrate that the proposed token distillation mechanism efficiently improves accuracy. For example, STD boosts the top-1 accuracy of Mixer-S16 on ImageNet from 73.8% to 75.7% without any costly pre-training on JFT-300M. When applied to stronger architectures, e.g., CycleMLP-B1 and CycleMLP-B2, STD still yields about 1.1% and 0.5% accuracy gains, respectively.

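The abstract describes two architectural additions: a distillation token for the spatial (token-mixing) dimension and one for the channel (channel-mixing) dimension, each supervised by its own distillation signal. Below is a minimal PyTorch sketch of how such tokens could be attached to a Mixer-style backbone. The module names (e.g. STDMixerSketch), token shapes, readout heads, and pooling scheme are illustrative assumptions, not the authors' released implementation; the teacher losses and the mutual information regularization are omitted.

```python
# Sketch: a Mixer-style backbone with one learnable "spatial" distillation token
# (appended along the patch axis) and one "channel" distillation token (appended
# along the channel axis), each read out by its own head for distillation.
import torch
import torch.nn as nn


class MlpBlock(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)


class MixerLayer(nn.Module):
    """Standard Mixer layer operating on tensors of shape (B, tokens, channels)."""
    def __init__(self, tokens, channels, token_hidden, channel_hidden):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(channels), nn.LayerNorm(channels)
        self.token_mlp = MlpBlock(tokens, token_hidden)        # mixes the spatial axis
        self.channel_mlp = MlpBlock(channels, channel_hidden)  # mixes the channel axis

    def forward(self, x):
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x


class STDMixerSketch(nn.Module):
    def __init__(self, num_patches=196, dim=512, depth=8, num_classes=1000):
        super().__init__()
        # Learnable distillation tokens, one per mixing dimension (shapes are assumptions).
        self.spatial_token = nn.Parameter(torch.zeros(1, 1, dim))                  # extra patch row
        self.channel_token = nn.Parameter(torch.zeros(1, num_patches + 1, 1))      # extra channel column
        self.layers = nn.Sequential(*[
            MixerLayer(num_patches + 1, dim + 1, 256, 2048) for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(dim + 1)
        self.cls_head = nn.Linear(dim + 1, num_classes)
        self.spatial_dist_head = nn.Linear(dim + 1, num_classes)       # reads the extra patch row
        self.channel_dist_head = nn.Linear(num_patches + 1, num_classes)  # reads the extra channel column

    def forward(self, patches):                       # patches: (B, num_patches, dim)
        b = patches.shape[0]
        x = torch.cat([patches, self.spatial_token.expand(b, -1, -1)], dim=1)  # append spatial token
        x = torch.cat([x, self.channel_token.expand(b, -1, -1)], dim=2)        # append channel token
        x = self.norm(self.layers(x))
        cls_logits = self.cls_head(x[:, :-1, :].mean(dim=1))    # pooled patch features
        spatial_logits = self.spatial_dist_head(x[:, -1, :])    # appended patch row
        channel_logits = self.channel_dist_head(x[:, :, -1])    # appended channel column
        return cls_logits, spatial_logits, channel_logits


if __name__ == "__main__":
    model = STDMixerSketch()
    cls_out, sp_out, ch_out = model(torch.randn(2, 196, 512))
    print(cls_out.shape, sp_out.shape, ch_out.shape)  # each (2, 1000)
```

In training, the two distillation heads would be matched against a teacher's outputs while the classification head is trained on ground-truth labels; how the three streams are weighted and fused at inference is a design choice not specified in the abstract.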

Results from the Paper


Task: Image Classification (ImageNet)

Model               Top-1 Accuracy   Params    GFLOPs
ResMLP-B24 + STD    82.4%            122.6M    24.1
Mixer-S16 + STD     75.7%            22.2M     4.3
CycleMLP-B2 + STD   82.1%            30.1M     4.0
