CvT: Introducing Convolutions to Vision Transformers

We present a new architecture, the Convolutional vision Transformer (CvT), that improves on the Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e., shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e., dynamic attention, global context, and better generalization). We validate CvT through extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, the performance gains are maintained when the model is pretrained on larger datasets (e.g., ImageNet-22k) and fine-tuned on downstream tasks. Pretrained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher-resolution vision tasks. Code will be released at https://github.com/leoxiaobin/CvT.
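The hierarchy built from convolutional token embeddings can be illustrated with a little arithmetic: each stage applies a strided convolution, so the token grid shrinks from stage to stage. The sketch below only computes the resulting grid sizes using the standard convolution output-size formula. The three-stage (kernel, stride, padding) values and the helper names are assumptions chosen for illustration; the abstract does not state them.

```python
from typing import List, Tuple


def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial size after a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def token_grid_per_stage(
    image_size: int,
    stages: List[Tuple[int, int, int]],
) -> List[Tuple[int, int]]:
    """Return (grid_side, num_tokens) after each stage's token embedding."""
    grids = []
    size = image_size
    for kernel, stride, padding in stages:
        # Each stage's embedding is modeled as one strided convolution.
        size = conv_output_size(size, kernel, stride, padding)
        grids.append((size, size * size))
    return grids


# Hypothetical three-stage configuration as (kernel, stride, padding).
# These values are assumptions for illustration, not taken from the abstract.
ASSUMED_STAGES = [(7, 4, 2), (3, 2, 1), (3, 2, 1)]

if __name__ == "__main__":
    grids = token_grid_per_stage(224, ASSUMED_STAGES)
    for i, (side, tokens) in enumerate(grids, start=1):
        print(f"stage {i}: {side}x{side} grid, {tokens} tokens")
```

Under these assumed settings, a 224-pixel input yields progressively coarser token grids, which is the sense in which the design is hierarchical: later stages attend over fewer tokens than a single-resolution ViT would.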

ICCV 2021
| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Image Classification | CIFAR-10 | CvT-W24 | Percentage correct | 99.39 | #6 |
| Image Classification | CIFAR-100 | CvT-W24 | Percentage correct | 94.09 | #5 |
| Image Classification | Flowers-102 | CvT-W24 | Accuracy | 99.72 | #3 |
| Image Classification | ImageNet | CvT-21 (384 res) | Top 1 Accuracy | 83.3% | #436 |
| | | | GFLOPs | 24.9 | #418 |
| Image Classification | ImageNet | CvT-W24 (384 res, ImageNet-22k pretrain) | Top 1 Accuracy | 87.7% | #74 |
| Image Classification | ImageNet | CvT-13-NAS | Top 1 Accuracy | 82.2% | #558 |
| | | | Number of params | 18M | #570 |
| | | | GFLOPs | 4.1 | #203 |
| Image Classification | ImageNet | CvT-21 | Top 1 Accuracy | 82.5% | #527 |
| | | | GFLOPs | 7.1 | #267 |
| Image Classification | ImageNet | CvT-13 | Top 1 Accuracy | 81.6% | #620 |
| | | | GFLOPs | 4.5 | #221 |
| Image Classification | ImageNet | CvT-13 (384 res) | Top 1 Accuracy | 83% | #474 |
| | | | Number of params | 20M | #580 |
| | | | GFLOPs | 16.3 | #377 |
| Image Classification | ImageNet | CvT-21 (384 res, ImageNet-22k pretrain) | Top 1 Accuracy | 84.9% | #274 |
| | | | Number of params | 32M | #705 |
| | | | GFLOPs | 25 | #419 |
| Image Classification | ImageNet ReaL | CvT-W24 (384 res, ImageNet-22k pretrain) | Accuracy | 90.6% | #14 |
| | | | Top 1 Accuracy | 87.7% | #1 |
| | | | Number of params | 277M | #3 |
| Image Classification | Oxford-IIIT Pets | CvT-W24 | Accuracy | 94.73 | #2 |

Methods