PVT v2: Improved Baselines with Pyramid Vision Transformer

25 Jun 2021  ·  Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao

Transformers have recently shown encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (1) a linear-complexity attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, the proposed PVT v2 achieves comparable or better performance than recent works such as Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
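The three designs can be sketched in PyTorch as below. This is a minimal illustration, not the authors' exact code: module names, channel sizes, and hyperparameters (7×7 overlapping patches with stride 4, a 3×3 depth-wise conv inside the FFN, a 7×7 average pool producing fixed-size keys/values for linear attention) are chosen to match the paper's description but are otherwise assumptions.

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Overlapping patch embedding: stride < kernel size, so adjacent
    patches overlap and local continuity is preserved."""
    def __init__(self, in_ch=3, embed_dim=64, patch=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, patch, stride, patch // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                      # (B, C, H, W)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)      # (B, H*W, C) token sequence
        return self.norm(x), H, W

class ConvFFN(nn.Module):
    """Feed-forward network with a 3x3 depth-wise conv between the two
    linear layers, injecting local position information."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, 1, 1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.dwconv(x).flatten(2).transpose(1, 2)
        return self.fc2(self.act(x))

class LinearSRA(nn.Module):
    """Linear spatial-reduction attention: keys/values come from a
    fixed-size (pool x pool) average pool of the feature map, so the
    attention cost grows linearly with the number of query tokens."""
    def __init__(self, dim, heads=1, pool=7):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.pool = nn.AdaptiveAvgPool2d(pool)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        d = C // self.heads
        q = self.q(x).reshape(B, N, self.heads, d).transpose(1, 2)
        y = x.transpose(1, 2).reshape(B, C, H, W)
        y = self.pool(y).flatten(2).transpose(1, 2)   # fixed-length K/V tokens
        kv = self.kv(y).reshape(B, -1, 2, self.heads, d).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

A PVT v2 block stacks LinearSRA and ConvFFN (each with a residual connection and LayerNorm), and each stage begins with an OverlapPatchEmbed that downsamples the feature map.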


Results from the Paper


| Task                 | Dataset      | Model                   | Metric               | Value | Global Rank |
|----------------------|--------------|-------------------------|----------------------|-------|-------------|
| Object Detection     | COCO minival | Sparse R-CNN (PVTv2-B2) | box AP               | 50.1  | # 76        |
| Object Detection     | COCO minival | Sparse R-CNN (PVTv2-B2) | AP50                 | 69.5  | # 23        |
| Object Detection     | COCO minival | Sparse R-CNN (PVTv2-B2) | AP75                 | 54.9  | # 18        |
| Object Detection     | COCO-O       | PVTv2-B5 (Mask R-CNN)   | Average mAP          | 28.2  | # 23        |
| Object Detection     | COCO-O       | PVTv2-B5 (Mask R-CNN)   | Effective Robustness | 6.85  | # 17        |
| Image Classification | ImageNet     | PVTv2-B0                | Top 1 Accuracy       | 70.5% | # 948       |
| Image Classification | ImageNet     | PVTv2-B0                | Number of params     | 3.4M  | # 372       |
| Image Classification | ImageNet     | PVTv2-B0                | GFLOPs               | 0.6   | # 65        |
| Image Classification | ImageNet     | PVTv2-B1                | Top 1 Accuracy       | 78.7% | # 746       |
| Image Classification | ImageNet     | PVTv2-B1                | Number of params     | 13.1M | # 506       |
| Image Classification | ImageNet     | PVTv2-B1                | GFLOPs               | 2.1   | # 151       |
| Image Classification | ImageNet     | PVTv2-B2                | Top 1 Accuracy       | 82%   | # 530       |
| Image Classification | ImageNet     | PVTv2-B2                | Number of params     | 25.4M | # 595       |
| Image Classification | ImageNet     | PVTv2-B2                | GFLOPs               | 4     | # 191       |
| Image Classification | ImageNet     | PVTv2-B3                | Top 1 Accuracy       | 83.2% | # 413       |
| Image Classification | ImageNet     | PVTv2-B3                | Number of params     | 45.2M | # 708       |
| Image Classification | ImageNet     | PVTv2-B3                | GFLOPs               | 6.9   | # 248       |
| Image Classification | ImageNet     | PVTv2-B4                | Top 1 Accuracy       | 83.8% | # 358       |
| Image Classification | ImageNet     | PVTv2-B4                | Number of params     | 82M   | # 808       |
| Image Classification | ImageNet     | PVTv2-B4                | GFLOPs               | 11.8  | # 313       |
