MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it on ImageNet classification, COCO detection, and Kinetics video recognition, where it outperforms prior work. We further compare MViTv2's pooling attention to window attention mechanisms and find that pooling attention gives the better accuracy/compute trade-off. Without bells and whistles, MViTv2 has state-of-the-art performance in three domains: 88.8% accuracy on ImageNet classification, 58.7 box AP on COCO object detection, and 86.1% on Kinetics-400 video classification. Code and models are available at https://github.com/facebookresearch/mvit.
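To make the "decomposed relative positional embeddings" mentioned in the abstract more concrete, the following is a minimal pure-Python sketch for the 2D image case. It assumes queries and keys live on the same height x width token grid (no pooling stride between them), and every name in it (`make_offset_table`, `rel_embedding`, `rel_logit`, `table_sizes`) is hypothetical and chosen for illustration; it is not the code from the linked repository. The idea sketched here is that the embedding for a token pair is the sum of a per-axis height term and a per-axis width term, so the number of learned vectors grows with H + W rather than H x W.

```python
import random


def make_offset_table(size, dim, rng):
    """One vector per relative offset in [-(size - 1), size - 1] along one axis."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(2 * size - 1)]


def rel_embedding(i, j, height, width, rel_h, rel_w):
    """Decomposed embedding for flattened token indices i (query) and j (key).

    Assumption: tokens are flattened row-major over a height x width grid.
    """
    hi, wi = divmod(i, width)
    hj, wj = divmod(j, width)
    vec_h = rel_h[hi - hj + height - 1]  # shift offset to a non-negative index
    vec_w = rel_w[wi - wj + width - 1]
    return [a + b for a, b in zip(vec_h, vec_w)]


def rel_logit(q_i, i, j, height, width, rel_h, rel_w):
    """Positional term for pair (i, j): dot product of the query with the embedding."""
    r = rel_embedding(i, j, height, width, rel_h, rel_w)
    return sum(a * b for a, b in zip(q_i, r))


def table_sizes(height, width):
    """Number of learned vectors: joint 2D table vs. decomposed per-axis tables."""
    joint = (2 * height - 1) * (2 * width - 1)
    decomposed = (2 * height - 1) + (2 * width - 1)
    return joint, decomposed


# Toy setup (values are illustrative, not taken from the paper).
rng = random.Random(0)
HEIGHT, WIDTH, DIM = 3, 4, 8
REL_H = make_offset_table(HEIGHT, DIM, rng)
REL_W = make_offset_table(WIDTH, DIM, rng)
```

In this sketch, `rel_logit` would be added to the usual query-key dot product before the softmax. Because the embedding depends only on the row and column offsets between two tokens, any two pairs with the same offsets share the same vector, which is what makes the embedding relative rather than absolute.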

CVPR 2022

Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | AVA v2.2 | MViTv2-L (IN21k, K700) | mAP | 34.4 | #18 |
| Instance Segmentation | COCO minival | MViT-L (Mask R-CNN, single-scale) | mask AP | 46.2 | #42 |
| Instance Segmentation | COCO minival | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) | mask AP | 48.5 | #33 |
| Object Detection | COCO minival | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) | box AP | 56.1 | #43 |
| Instance Segmentation | COCO minival | MViTv2-L (Cascade Mask R-CNN, single-scale) | mask AP | 47.1 | #37 |
| Object Detection | COCO minival | MViTv2-L (Cascade Mask R-CNN, single-scale) | box AP | 54.3 | #55 |
| Instance Segmentation | COCO minival | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) | mask AP | 50.5 | #21 |
| Object Detection | COCO minival | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) | box AP | 58.7 | #31 |
| Object Detection | COCO minival | MViT-L (Mask R-CNN, single-scale, IN21k pre-train) | box AP | 52.7 | #62 |
| Object Detection | COCO-O | MViTv2-H (Cascade Mask R-CNN) | Average mAP | 30.9 | #16 |
| | | | Effective Robustness | 5.62 | #21 |
| Image Classification | ImageNet | MViTv2-H (512 res, ImageNet-21k pretrain) | Top 1 Accuracy | 88.8% | #38 |
| | | | Number of params | 667M | #950 |
| | | | GFLOPs | 763.5 | #490 |
| Image Classification | ImageNet | MViTv2-H (ImageNet-21k pretrain) | Top 1 Accuracy | 88% | #69 |
| | | | Number of params | 667M | #950 |
| | | | GFLOPs | 120.6 | #462 |
| Image Classification | ImageNet | MViTv2-L (384 res, ImageNet-21k pretrain) | Top 1 Accuracy | 88.4% | #55 |
| | | | Number of params | 218M | #903 |
| | | | GFLOPs | 140.7 | #465 |
| Image Classification | ImageNet | MViTv2-L (384 res) | Top 1 Accuracy | 86.3% | #153 |
| | | | Number of params | 218M | #903 |
| | | | GFLOPs | 140.2 | #464 |
| Image Classification | ImageNet | MViTv2-T | Top 1 Accuracy | 82.3% | #501 |
| | | | Number of params | 24M | #579 |
| | | | GFLOPs | 4.7 | #221 |
| Action Classification | Kinetics-400 | MViTv2-L (ImageNet-21k pretrain) | Acc@1 | 86.1 | #44 |
| | | | Acc@5 | 97.0 | #34 |
| Action Classification | Kinetics-400 | MViT-B (train from scratch) | FLOPs (G) x views | 225x5 | #1 |
| Action Classification | Kinetics-600 | MViTv2-L (ImageNet-21k pretrain) | Top-1 Accuracy | 87.9 | #23 |
| | | | Top-5 Accuracy | 97.9 | #10 |
| Action Classification | Kinetics-600 | MViTv2-L (train from scratch) | Top-1 Accuracy | 85.5 | #30 |
| Action Classification | Kinetics-600 | MViTv2-B (train from scratch) | Top-5 Accuracy | 97.2 | #17 |
| Action Classification | Kinetics-600 | MViT-L (train from scratch) | GFLOPs | 206x5 | #1 |
| Action Classification | Kinetics-700 | MoViNet-A6 | Top-1 Accuracy | 79.4 | #16 |
| Action Classification | Kinetics-700 | MViTv2-B | Top-1 Accuracy | 76.6 | #19 |
| | | | Top-5 Accuracy | 93.2 | #10 |
| Action Classification | Kinetics-700 | MViTv2-L (ImageNet-21k pretrain) | Top-1 Accuracy | 79.4 | #16 |
| | | | Top-5 Accuracy | 94.9 | #6 |
| Action Recognition | Something-Something V2 | MViTv2-B (IN-21K + Kinetics400 pretrain) | Top-5 Accuracy | 93.4 | #18 |
| | | | Parameters | 51.1 | #30 |
| Action Recognition | Something-Something V2 | MViT-L (IN-21K + Kinetics400 pretrain) | GFLOPs | 2828x3 | #6 |
| Action Recognition | Something-Something V2 | MViTv2-L (IN-21K + Kinetics400 pretrain) | Top-1 Accuracy | 73.3 | #20 |
| | | | Top-5 Accuracy | 94.1 | #12 |
| | | | Parameters | 213.1 | #20 |
| Action Recognition | Something-Something V2 | MViT-B (IN-21K + Kinetics400 pretrain) | Top-1 Accuracy | 72.1 | #25 |
| | | | GFLOPs | 225x3 | #6 |

Methods