CvT: Introducing Convolutions to Vision Transformers

We present in this paper a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (i.e., shift, scale, and distortion invariance) while maintaining the merits of Transformers (i.e., dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when pretrained on larger datasets (e.g., ImageNet-22k) and fine-tuned to downstream tasks. Pre-trained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher resolution vision tasks. Code will be released at https://github.com/leoxiaobin/CvT.
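To make the "hierarchy of Transformers" with a convolutional token embedding more concrete, the following minimal sketch computes how many tokens each stage would see when every stage begins with a strided convolution. It uses only the standard convolution output-size formula. The function names and the per-stage (kernel, stride, padding) settings are illustrative assumptions, not values stated in the abstract above.

```python
from typing import List, Sequence, Tuple


def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length along one spatial dimension of a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical three-stage hierarchy: (kernel, stride, padding) per stage.
# These settings are assumptions chosen for illustration only.
STAGES: Sequence[Tuple[int, int, int]] = [(7, 4, 2), (3, 2, 1), (3, 2, 1)]


def token_counts(image_size: int,
                 stages: Sequence[Tuple[int, int, int]] = STAGES) -> List[int]:
    """Number of tokens (height * width of the token map) at each stage."""
    counts = []
    size = image_size
    for kernel, stride, padding in stages:
        # Each stage's convolutional embedding shrinks the token map.
        size = conv_output_size(size, kernel, stride, padding)
        counts.append(size * size)
    return counts


if __name__ == "__main__":
    for resolution in (224, 384):
        print(resolution, token_counts(resolution))
```

Under these assumed settings, each stage reduces the token count by roughly the square of its stride, so later stages attend over far fewer tokens than the first. The same function works for any input resolution, since the token map size follows from the convolution arithmetic rather than from a fixed patch grid.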

ICCV 2021

Results from the Paper


Ranked #3 on Image Classification on Flowers-102 (using extra training data)

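| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Image Classification | CIFAR-10 | CvT-W24 | Percentage correct | 99.39 | #6 |
| Image Classification | CIFAR-10 | CvT-W24 | Top-1 Accuracy | 99.39 | #2 |
| Image Classification | CIFAR-100 | CvT-W24 | Percentage correct | 94.09 | #5 |
| Image Classification | Flowers-102 | CvT-W24 | Accuracy | 99.72 | #3 |
| Image Classification | ImageNet | CvT-21 (384 res, ImageNet-22k pretrain) | Top 1 Accuracy | 84.9% | #265 |
| Image Classification | ImageNet | CvT-21 (384 res, ImageNet-22k pretrain) | Number of params | 32M | #653 |
| Image Classification | ImageNet | CvT-21 (384 res, ImageNet-22k pretrain) | GFLOPs | 25 | #382 |
| Image Classification | ImageNet | CvT-21 | Top 1 Accuracy | 82.5% | #482 |
| Image Classification | ImageNet | CvT-21 | GFLOPs | 7.1 | #250 |
| Image Classification | ImageNet | CvT-13 | Top 1 Accuracy | 81.6% | #569 |
| Image Classification | ImageNet | CvT-13 | GFLOPs | 4.5 | #211 |
| Image Classification | ImageNet | CvT-13 (384 res) | Top 1 Accuracy | 83% | #437 |
| Image Classification | ImageNet | CvT-13 (384 res) | Number of params | 20M | #536 |
| Image Classification | ImageNet | CvT-13 (384 res) | GFLOPs | 16.3 | #348 |
| Image Classification | ImageNet | CvT-13-NAS | Top 1 Accuracy | 82.2% | #510 |
| Image Classification | ImageNet | CvT-13-NAS | Number of params | 18M | #526 |
| Image Classification | ImageNet | CvT-13-NAS | GFLOPs | 4.1 | #196 |
| Image Classification | ImageNet | CvT-21 (384 res) | Top 1 Accuracy | 83.3% | #403 |
| Image Classification | ImageNet | CvT-21 (384 res) | GFLOPs | 24.9 | #381 |
| Image Classification | ImageNet ReaL | CvT-W24 (384 res, ImageNet-22k pretrain) | Accuracy | 90.6% | #14 |
| Image Classification | ImageNet ReaL | CvT-W24 (384 res, ImageNet-22k pretrain) | Params | 277M | #46 |
| Image Classification | ImageNet ReaL | CvT-W24 (384 res, ImageNet-22k pretrain) | Top 1 Accuracy | 87.7% | #1 |
| Image Classification | ImageNet ReaL | CvT-W24 (384 res, ImageNet-22k pretrain) | Number of params | 277M | #3 |
| Image Classification | Oxford-IIIT Pets | CvT-W24 | Accuracy | 94.73 | #3 |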

Methods