An Investigation on Hardware-Aware Vision Transformer Scaling

Vision Transformers (ViTs) have demonstrated promising performance on various computer vision tasks and have recently attracted substantial research attention. Many recent works propose new architectures to improve ViT or deploy it in real-world applications. However, little effort has been made to analyze and understand ViT's architecture design space and its implications for hardware cost on different devices. In this work, by simply scaling a ViT's depth, width, input size, and other basic configurations, we show that a scaled vanilla ViT model without bells and whistles can achieve an accuracy-efficiency trade-off comparable or superior to most of the latest ViT variants. Specifically, compared to DeiT-Tiny, our scaled model achieves a $\uparrow1.9\%$ higher ImageNet top-1 accuracy under the same FLOPs and a $\uparrow3.7\%$ higher ImageNet top-1 accuracy under the same latency on an NVIDIA TX2 edge GPU. Motivated by this, we further investigate the extracted scaling strategies from two aspects: (1) can these scaling strategies be transferred across different real hardware devices? and (2) can these scaling strategies be transferred to different ViT variants and tasks? For (1), our exploration across devices with different resource budgets indicates that transferability depends on the underlying device together with its corresponding deployment tool; for (2), we validate that the scaling strategies extracted from a vanilla ViT on image classification transfer effectively to PiT, a strong efficiency-oriented ViT variant, as well as to object detection and video classification tasks. In particular, when transferred to PiT, our scaling strategies boost ImageNet top-1 accuracy from $74.6\%$ to $76.7\%$ ($\uparrow2.1\%$) under the same 0.7G FLOPs; and when transferred to COCO object detection, they improve average precision by $\uparrow0.7\%$ at a similar throughput on a V100 GPU.
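
The scaling strategies above only adjust basic configuration knobs of a vanilla ViT encoder: depth, embedding width, number of heads, and input resolution. As a rough, hypothetical illustration of how these knobs trade off under a fixed compute budget (the configuration values and the FLOP approximation below are our own simplification, not the paper's actual scaled models), a minimal sketch:

```python
from dataclasses import dataclass

@dataclass
class ViTConfig:
    """Basic scaling knobs of a vanilla ViT encoder (illustrative names)."""
    depth: int = 12        # number of transformer blocks
    embed_dim: int = 192   # token width
    num_heads: int = 3     # attention heads per block
    img_size: int = 224    # input resolution
    patch_size: int = 16   # patch side length
    mlp_ratio: float = 4.0 # hidden width multiplier in the MLP

    @property
    def num_tokens(self) -> int:
        # patch tokens plus one class token
        return (self.img_size // self.patch_size) ** 2 + 1

    def approx_flops(self) -> float:
        """Rough multiply-accumulate count of the encoder blocks
        (ignores patch embedding, norms, and the classifier head)."""
        n, d = self.num_tokens, self.embed_dim
        attn = 4 * n * d * d + 2 * n * n * d          # QKV/proj + attention maps
        mlp = 2 * n * d * int(self.mlp_ratio * d)     # two linear layers
        return self.depth * (attn + mlp)


# Example: a DeiT-Tiny-like baseline versus a hypothetical scaled variant
# that trades depth for width at a roughly similar compute budget.
baseline = ViTConfig(depth=12, embed_dim=192, num_heads=3)
scaled = ViTConfig(depth=8, embed_dim=240, num_heads=4)
print(f"baseline ~{baseline.approx_flops() / 1e9:.2f} GFLOPs, "
      f"scaled ~{scaled.approx_flops() / 1e9:.2f} GFLOPs")
```

Because FLOPs do not map one-to-one to latency, comparing such configurations at equal FLOPs versus equal measured latency on a target device (e.g., the TX2 above) can favor different depth/width/resolution combinations, which is the motivation for the hardware-aware exploration in this work.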

