PVT v2: Improved Baselines with Pyramid Vision Transformer

25 Jun 2021  ·  Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao

Transformers have recently demonstrated encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (1) a linear-complexity attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, the proposed PVT v2 achieves comparable or better performance than recent works such as Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
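To give a feel for two of the three designs, the sketch below illustrates (1) linear spatial-reduction attention, where keys and values are average-pooled to a fixed grid so attention cost grows linearly with the number of tokens, and (2) overlapping patch embedding, where padded patches overlap between neighbours. This is a simplified NumPy sketch under stated assumptions, not the paper's implementation: it is single-head, omits the learnable Q/K/V projections and the depthwise convolution that PVT v2 applies after pooling, assumes the spatial size is divisible by the pool size, and the function names are ours.

```python
import numpy as np

def linear_sra(x, h, w, pool=7):
    """Simplified linear spatial-reduction attention (no learned projections).

    x: (h*w, c) token sequence laid out row-major over an h x w grid.
    Assumes h and w are divisible by `pool` (PVT v2 uses adaptive pooling).
    """
    n, c = x.shape
    # Average-pool the token grid to a fixed pool x pool grid of keys/values,
    # so the attention matrix is (n, pool^2) instead of (n, n) -> linear in n.
    grid = x.reshape(h, w, c)
    ph, pw = h // pool, w // pool
    kv = grid.reshape(pool, ph, pool, pw, c).mean(axis=(1, 3)).reshape(pool * pool, c)
    # Scaled dot-product attention against the pooled keys/values.
    logits = x @ kv.T / np.sqrt(c)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ kv  # (n, c)

def overlap_patch_embed(img, patch=7, stride=4):
    """Extract overlapping patches (flattened; a real model would project them).

    img: (h, w, c). Zero-padding by patch//2 makes adjacent patches overlap,
    unlike the non-overlapping patch split in PVT v1 / ViT.
    """
    p = patch // 2
    padded = np.pad(img, ((p, p), (p, p), (0, 0)))
    win = np.lib.stride_tricks.sliding_window_view(padded, (patch, patch), axis=(0, 1))
    win = win[::stride, ::stride]  # subsample windows -> (h/stride, w/stride, c, patch, patch)
    return win.reshape(win.shape[0] * win.shape[1], -1)

# Example: a 56x56 grid of 8-dim tokens attends to only 49 pooled tokens.
tokens = np.random.randn(56 * 56, 8)
out = linear_sra(tokens, 56, 56)        # shape (3136, 8)
patches = overlap_patch_embed(np.random.randn(224, 224, 3))  # shape (3136, 147)
```

With a 7x7 pooled grid, each of the 3136 query tokens attends to only 49 key/value tokens, which is what makes the attention cost linear in the input resolution rather than quadratic.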


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Object Detection | COCO minival | Sparse R-CNN (PVTv2-B2) | box AP | 50.1 | #76 |
| Object Detection | COCO minival | Sparse R-CNN (PVTv2-B2) | AP50 | 69.5 | #23 |
| Object Detection | COCO minival | Sparse R-CNN (PVTv2-B2) | AP75 | 54.9 | #18 |
| Object Detection | COCO-O | PVTv2-B5 (Mask R-CNN) | Average mAP | 28.2 | #23 |
| Object Detection | COCO-O | PVTv2-B5 (Mask R-CNN) | Effective Robustness | 6.85 | #17 |
| Image Classification | ImageNet | PVTv2-B3 | Top 1 Accuracy | 83.2% | #416 |
| Image Classification | ImageNet | PVTv2-B3 | Number of params | 45.2M | #707 |
| Image Classification | ImageNet | PVTv2-B3 | GFLOPs | 6.9 | #249 |
| Image Classification | ImageNet | PVTv2-B1 | Top 1 Accuracy | 78.7% | #750 |
| Image Classification | ImageNet | PVTv2-B1 | Number of params | 13.1M | #504 |
| Image Classification | ImageNet | PVTv2-B1 | GFLOPs | 2.1 | #151 |
| Image Classification | ImageNet | PVTv2-B0 | Top 1 Accuracy | 70.5% | #952 |
| Image Classification | ImageNet | PVTv2-B0 | Number of params | 3.4M | #373 |
| Image Classification | ImageNet | PVTv2-B0 | GFLOPs | 0.6 | #65 |
| Image Classification | ImageNet | PVTv2-B2 | Top 1 Accuracy | 82% | #534 |
| Image Classification | ImageNet | PVTv2-B2 | Number of params | 25.4M | #594 |
| Image Classification | ImageNet | PVTv2-B2 | GFLOPs | 4 | #191 |
| Image Classification | ImageNet | PVTv2-B4 | Top 1 Accuracy | 83.8% | #360 |
| Image Classification | ImageNet | PVTv2-B4 | Number of params | 82M | #807 |
| Image Classification | ImageNet | PVTv2-B4 | GFLOPs | 11.8 | #315 |
