HFSP: A Hardware-friendly Soft Pruning Framework for Vision Transformers
Recently, Vision Transformer (ViT) has continuously set new milestones in the computer vision field, but its high computational and memory costs hinder its adoption in industrial production. Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied to various DNN structures. Nevertheless, it remains unclear how to perform pruning tailored to the ViT structure. Considering three key points, namely the structural characteristics of ViT, its internal data pattern, and the constraints of edge-device deployment, we leverage input token sparsity and propose a hardware-friendly soft pruning framework (HFSP), which can be set up on vanilla Transformers of both flat and CNN-type structures, such as Pooling-based ViT (PiT). More concretely, we design a dynamic attention-based multi-head token selector, a lightweight module for adaptive instance-wise token selection. We further introduce a soft pruning technique that packages the pruned tokens: the less informative tokens identified by the selector module are integrated into a single package token that participates in subsequent calculations rather than being discarded completely. From a hardware standpoint, our framework ties the tradeoff between accuracy and specific hardware constraints to our proposed hardware-oriented progressive training, and all operators embedded in the framework are well supported on target devices. Experimental results demonstrate that the proposed framework significantly reduces the computational costs of ViTs while maintaining comparable performance on image classification. For example, our method reduces the FLOPs of DeiT-S by over 42.6% while sacrificing only 0.46% top-1 accuracy. Moreover, our framework guarantees that the identified model meets the resource specifications of mobile devices and FPGAs, and even achieves real-time execution of DeiT-T on mobile platforms. Code will be publicly released.
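To make the soft pruning idea concrete, the sketch below shows one plausible PyTorch realization of an attention-based token selector combined with a package token: per-token keep scores are predicted by a small head, the top-scoring tokens are retained, and the remaining tokens are fused into a single package token that stays in the sequence. The module name SoftTokenPruner, the keep_ratio parameter, and the score-weighted fusion are illustrative assumptions for this abstract, not the authors' released code.

```python
import torch
import torch.nn as nn


class SoftTokenPruner(nn.Module):
    """Illustrative token selector with soft pruning via a package token.

    Less informative tokens are not dropped outright; they are fused into a
    single package token that keeps flowing through later Transformer blocks.
    All names and hyper-parameters here are hypothetical.
    """

    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Lightweight head producing one keep-score per token.
        self.selector = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1 + num_tokens, dim); token 0 is the class token.
        cls_tok, tokens = x[:, :1], x[:, 1:]
        scores = self.selector(tokens).squeeze(-1)            # (B, N)
        probs = scores.softmax(dim=-1)

        num_keep = max(1, int(tokens.size(1) * self.keep_ratio))
        keep_idx = probs.topk(num_keep, dim=-1).indices.sort(dim=-1).values

        # Gather the informative tokens (original spatial order preserved).
        kept = tokens.gather(
            1, keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        )

        # Fuse the remaining tokens into one package token, weighted by score.
        keep_mask = torch.zeros_like(probs, dtype=torch.bool)
        keep_mask.scatter_(1, keep_idx, True)
        pruned_w = probs.masked_fill(keep_mask, 0.0)
        pruned_w = pruned_w / pruned_w.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        package = (pruned_w.unsqueeze(-1) * tokens).sum(dim=1, keepdim=True)

        return torch.cat([cls_tok, kept, package], dim=1)


# Example: a DeiT-S-like feature map (196 patch tokens + class token, dim 384).
pruner = SoftTokenPruner(dim=384, keep_ratio=0.7)
out = pruner(torch.randn(2, 197, 384))   # -> (2, 1 + 137 + 1, 384)
```

Keeping the pruned information in a single extra token, rather than discarding it, is what makes the scheme hardware friendly: the sequence length shrinks to a fixed, predictable size while no gather/scatter of variable shape is needed downstream.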