Spatial-Channel Token Distillation for Vision MLPs

Recently, neural architectures built entirely from multi-layer perceptrons (MLPs) have attracted great research interest from the computer vision community. However, the inefficient mixing of spatial and channel information causes MLP-like vision models to demand extensive pre-training on large-scale datasets. This work addresses the problem from a knowledge distillation perspective. We propose Spatial-channel Token Distillation (STD), a method that improves information mixing in the two dimensions by introducing a distillation token to each of them. A mutual information regularization is further introduced so that each distillation token focuses on its own dimension, maximizing the performance gain. Extensive experiments on ImageNet with several MLP-like architectures demonstrate that the proposed token distillation mechanism improves accuracy. For example, STD boosts the top-1 accuracy of Mixer-S16 on ImageNet from 73.8% to 75.7% without any costly pre-training on JFT-300M. When applied to stronger architectures, e.g. CycleMLP-B1 and CycleMLP-B2, STD still yields accuracy gains of about 1.1% and 0.5%, respectively.
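To make the idea of one distillation token per dimension concrete, the following is a minimal NumPy sketch of a single Mixer-style block. It is an illustration under assumptions, not the paper's implementation: the function names, the shapes, the choice to append the spatial token as an extra row and the channel token as an extra column, the ReLU activation, and the omission of layer normalization are all hypothetical simplifications. The mutual information regularization and the teacher-based distillation loss are not shown.

```python
import numpy as np

def mlp(x, w1, w2):
    """Two-layer perceptron applied along the last axis (ReLU used for brevity)."""
    return np.maximum(x @ w1, 0.0) @ w2

def mixer_block_with_tokens(x, spatial_tok, channel_tok, params):
    """Hypothetical Mixer-style block with one distillation token per dimension.

    x:           (S, C)   patch tokens
    spatial_tok: (1, C)   assumed spatial distillation token (extra row)
    channel_tok: (S+1, 1) assumed channel distillation token (extra column)
    """
    # Token mixing: the spatial token joins as an extra position,
    # so the MLP mixes across S + 1 positions for every channel.
    xs = np.concatenate([x, spatial_tok], axis=0)            # (S+1, C)
    xs = xs + mlp(xs.T, params["tw1"], params["tw2"]).T      # residual connection

    # Channel mixing: the channel token joins as an extra channel,
    # so the MLP mixes across C + 1 channels for every position.
    xc = np.concatenate([xs, channel_tok], axis=1)           # (S+1, C+1)
    xc = xc + mlp(xc, params["cw1"], params["cw2"])          # residual connection
    return xc

# Toy sizes chosen only for illustration.
S, C, H = 4, 8, 16
rng = np.random.default_rng(0)
params = {
    "tw1": rng.normal(size=(S + 1, H)) * 0.1,
    "tw2": rng.normal(size=(H, S + 1)) * 0.1,
    "cw1": rng.normal(size=(C + 1, H)) * 0.1,
    "cw2": rng.normal(size=(H, C + 1)) * 0.1,
}
x = rng.normal(size=(S, C))
spatial_tok = rng.normal(size=(1, C))
channel_tok = rng.normal(size=(S + 1, 1))

out = mixer_block_with_tokens(x, spatial_tok, channel_tok, params)
patches = out[:S, :C]        # features for the ordinary patch tokens
spatial_out = out[S, :C]     # output of the spatial distillation token
channel_out = out[:S, C]     # output of the channel distillation token
```

In a setup like this, the two token outputs would each be fed to a head supervised by a teacher model, while the patch features feed the usual classifier. How the heads and losses are combined is left open here because the abstract does not specify it.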


Results from the Paper

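Task                  Dataset   Model              Metric Name       Metric Value  Global Rank
Image Classification  ImageNet  ResMLP-B24 + STD   Top 1 Accuracy    82.4%         # 473
Image Classification  ImageNet  ResMLP-B24 + STD   Number of params  122.6M        # 850
Image Classification  ImageNet  ResMLP-B24 + STD   GFLOPs            24.1          # 374
Image Classification  ImageNet  Mixer-S16 + STD    Top 1 Accuracy    75.7%         # 843
Image Classification  ImageNet  Mixer-S16 + STD    Number of params  22.2M         # 539
Image Classification  ImageNet  Mixer-S16 + STD    GFLOPs            4.3           # 198
Image Classification  ImageNet  CycleMLP-B2 + STD  Top 1 Accuracy    82.1%         # 506
Image Classification  ImageNet  CycleMLP-B2 + STD  Number of params  30.1M         # 625
Image Classification  ImageNet  CycleMLP-B2 + STD  GFLOPs            4.0           # 187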