Self-Slimming Vision Transformer

29 Sep 2021 · Zhuofan Zong, Kunchang Li, Guanglu Song, Yali Wang, Yu Qiao, Biao Leng, Yu Liu

Vision transformers (ViTs) have become popular architectures and outperform convolutional neural networks (CNNs) on various vision tasks. However, such powerful transformers incur a heavy computation burden due to exhaustive token-to-token comparison. To make ViTs more efficient, we can prune them along two orthogonal directions: model structure and token number. However, pruning the structure decreases the model capacity and struggles to speed up ViTs. Alternatively, we observe that ViTs exhibit sparse attention with high token similarity, and that reducing tokens can greatly improve throughput. Therefore, we propose a generic self-slimming learning approach for vanilla ViTs, namely SiT. Specifically, we first design a novel Token Slimming Module (TSM), which boosts the inference efficiency of ViTs through dynamic token aggregation. Unlike hard token dropping, our TSM softly integrates redundant tokens into fewer informative ones, which can dynamically zoom visual attention without cutting off discriminative token relations in the image. Furthermore, we introduce a concise Dense Knowledge Distillation (DKD) framework, which densely transfers token information in a flexible auto-encoder manner. Due to the similar structure between teacher and student, our framework can effectively leverage both parameter and structure knowledge to accelerate training convergence. Finally, we conduct extensive experiments to evaluate our SiT. In most cases, our method speeds up ViTs by 3.6x while maintaining 97% of their performance. Surprisingly, by simply arming LV-ViT with our SiT, we achieve new state-of-the-art performance on ImageNet, surpassing all the CNNs and ViTs in the recent literature.
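To make the idea of soft token aggregation concrete, below is a minimal PyTorch sketch of a module in the spirit of TSM. The class name, the use of a single linear scoring head, and the choice of output token count are illustrative assumptions, not the authors' implementation; it only shows how each output token can be formed as a convex combination of all input tokens so that no token is hard-dropped.

```python
# Minimal sketch of soft token aggregation (assumed design, not the official TSM).
import torch
import torch.nn as nn

class SoftTokenSlimming(nn.Module):
    def __init__(self, dim: int, num_out_tokens: int):
        super().__init__()
        # Scores each input token's contribution to each of the K output tokens.
        self.score = nn.Linear(dim, num_out_tokens)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) input tokens -> (B, K, C) aggregated "informative" tokens
        weights = self.score(x)            # (B, N, K) raw aggregation scores
        weights = weights.softmax(dim=1)   # normalize over the N input tokens
        return weights.transpose(1, 2) @ x # (B, K, N) @ (B, N, C) -> (B, K, C)

# Example: slim 196 patch tokens down to 49 aggregated tokens.
slim = SoftTokenSlimming(dim=384, num_out_tokens=49)
tokens = torch.randn(2, 196, 384)
print(slim(tokens).shape)  # torch.Size([2, 49, 384])
```

Because every output token is a weighted mixture of all inputs, subsequent transformer blocks operate on far fewer tokens while the information in redundant tokens is folded in rather than discarded.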
