Spatial-Channel Token Distillation for Vision MLPs
Recently, neural architectures built entirely from multi-layer perceptrons (MLPs) have attracted great research interest from the computer vision community. However, the inefficient mixing of spatial-channel information causes MLP-like vision models to require extensive pre-training on large-scale datasets. This work addresses the problem from a knowledge distillation perspective. We propose a novel Spatial-channel Token Distillation (STD) method, which improves the information mixing in the two dimensions by introducing distillation tokens into each of them. A mutual information regularization is further introduced to make the distillation tokens focus on their specific dimensions and maximize the performance gain. Extensive experiments on ImageNet with several MLP-like architectures demonstrate that the proposed token distillation mechanism efficiently improves accuracy. For example, STD boosts the top-1 accuracy of Mixer-S16 on ImageNet from 73.8% to 75.7% without any costly pre-training on JFT-300M. When applied to stronger architectures, e.g., CycleMLP-B1 and CycleMLP-B2, STD still harvests about 1.1% and 0.5% accuracy gains, respectively.
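The abstract describes inserting learnable distillation tokens into both the spatial (token-mixing) and channel (channel-mixing) paths of an MLP block, and distilling them from a teacher. Below is a minimal PyTorch sketch of that idea under stated assumptions; it is not the authors' implementation. The class and head names (`MixerBlockWithDistTokens`, `spatial_head`, `channel_head`), the layer sizes, and the use of random stand-in teacher logits are all illustrative, and the mutual information regularizer mentioned in the abstract is omitted.

```python
# Hypothetical sketch of spatial-channel distillation tokens in a Mixer-style block.
# Not the paper's released code; shapes, sizes, and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MlpBlock(nn.Module):
    """Two-layer MLP used for both token mixing and channel mixing."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class MixerBlockWithDistTokens(nn.Module):
    """Mixer-style block carrying one extra learnable token along the spatial
    (sequence) axis and one extra learnable element along the channel axis,
    used only to collect distillation signal (hidden sizes are illustrative)."""
    def __init__(self, num_tokens, dim, token_hidden=256, channel_hidden=2048):
        super().__init__()
        self.spatial_dist_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.channel_dist_token = nn.Parameter(torch.zeros(1, num_tokens + 1, 1))
        self.norm1 = nn.LayerNorm(dim)
        self.token_mix = MlpBlock(num_tokens + 1, token_hidden)   # mixes over the spatial dim
        self.norm2 = nn.LayerNorm(dim + 1)
        self.channel_mix = MlpBlock(dim + 1, channel_hidden)      # mixes over the channel dim

    def forward(self, x):                      # x: (B, N, C) patch embeddings
        b = x.size(0)
        # Append the spatial distillation token: (B, N, C) -> (B, N+1, C)
        x = torch.cat([x, self.spatial_dist_token.expand(b, -1, -1)], dim=1)
        x = x + self.token_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Append the channel distillation token: (B, N+1, C) -> (B, N+1, C+1)
        x = torch.cat([x, self.channel_dist_token.expand(b, -1, -1)], dim=2)
        x = x + self.channel_mix(self.norm2(x))
        # Split off the two distillation summaries and restore the (B, N, C) shape
        spatial_summary = x[:, -1, :-1]        # (B, C)
        channel_summary = x[:, :-1, -1]        # (B, N)
        return x[:, :-1, :-1], spatial_summary, channel_summary


# Usage sketch: map the two summaries to class logits and distill each of them
# from a teacher's predictions (random stand-in logits here, not a real teacher).
block = MixerBlockWithDistTokens(num_tokens=196, dim=512)
x = torch.randn(8, 196, 512)
x, s_sum, c_sum = block(x)
spatial_head, channel_head = nn.Linear(512, 1000), nn.Linear(196, 1000)
teacher_logits = torch.randn(8, 1000)          # stand-in for a pretrained teacher network
kd_loss = (
    F.kl_div(F.log_softmax(spatial_head(s_sum), dim=-1),
             F.softmax(teacher_logits, dim=-1), reduction="batchmean")
    + F.kl_div(F.log_softmax(channel_head(c_sum), dim=-1),
               F.softmax(teacher_logits, dim=-1), reduction="batchmean")
)
```

In this reading, the spatial token summarizes information mixed across patch locations while the channel element summarizes information mixed across features, so each receives a separate distillation signal; how the paper combines these losses with the classification objective and the mutual information term is not reproduced here.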
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Image Classification | ImageNet | ResMLP-B24 + STD | Top 1 Accuracy | 82.4% | # 537 |
| Image Classification | ImageNet | ResMLP-B24 + STD | Number of params | 122.6M | # 924 |
| Image Classification | ImageNet | ResMLP-B24 + STD | GFLOPs | 24.1 | # 416 |
| Image Classification | ImageNet | Mixer-S16 + STD | Top 1 Accuracy | 75.7% | # 928 |
| Image Classification | ImageNet | Mixer-S16 + STD | Number of params | 22.2M | # 586 |
| Image Classification | ImageNet | Mixer-S16 + STD | GFLOPs | 4.3 | # 210 |
| Image Classification | ImageNet | CycleMLP-B2 + STD | Top 1 Accuracy | 82.1% | # 572 |
| Image Classification | ImageNet | CycleMLP-B2 + STD | Number of params | 30.1M | # 675 |
| Image Classification | ImageNet | CycleMLP-B2 + STD | GFLOPs | 4.0 | # 197 |