Multiscale Vision Transformers

We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast
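The channel-resolution trade-off described in the abstract can be made concrete with a small sketch. The code below only computes a stage schedule and is not the authors' implementation. The specific numbers (a 224-pixel input, an initial stride of 4, 96 base channels, four stages, and a factor of 2 per stage) are illustrative assumptions and are not stated in the text above. The names `Stage` and `build_stage_schedule` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    """One scale stage: a channel width paired with a spatial resolution."""
    index: int
    channels: int
    height: int
    width: int

    @property
    def tokens(self) -> int:
        # Number of spatial positions the stage operates on.
        return self.height * self.width


def build_stage_schedule(
    input_size: int = 224,     # assumed input side length in pixels
    patch_stride: int = 4,     # assumed stride of the initial patch embedding
    base_channels: int = 96,   # assumed channel width of the first stage
    num_stages: int = 4,       # assumed number of scale stages
) -> List[Stage]:
    """Return a schedule that doubles channels and halves resolution per stage."""
    side = input_size // patch_stride
    channels = base_channels
    stages = []
    for i in range(num_stages):
        stages.append(Stage(index=i + 1, channels=channels, height=side, width=side))
        channels *= 2   # expand channel capacity
        side //= 2      # reduce spatial resolution
    return stages


if __name__ == "__main__":
    for s in build_stage_schedule():
        print(f"stage {s.index}: {s.channels:4d} channels, "
              f"{s.height}x{s.width} resolution, {s.tokens} tokens")
```

Under these assumptions the early stages see many tokens with few channels, and the late stages see few tokens with many channels, which is the pyramid the abstract describes.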

ICCV 2021

Results from the Paper


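| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | AVA v2.2 | MViT-B, 32x3 (Kinetics-600 pretraining) | mAP | 27.5 | #28 |
| Action Recognition | AVA v2.2 | MViT-B, 16x4 (Kinetics-400 pretraining) | mAP | 24.5 | #36 |
| Action Recognition | AVA v2.2 | MViT-B, 32x3 (Kinetics-400 pretraining) | mAP | 26.8 | #32 |
| Action Recognition | AVA v2.2 | MViT-B-24, 32x3 (Kinetics-600 pretraining) | mAP | 28.7 | #27 |
| Action Recognition | AVA v2.2 | MViT-B, 64x3 (Kinetics-400 pretraining) | mAP | 27.3 | #30 |
| Action Recognition | AVA v2.2 | MViT-B, 16x4 (Kinetics-600 pretraining) | mAP | 26.1 | #35 |
| Action Classification | Charades | MViT-B-24, 32x3 (Kinetics-600 pretraining) | mAP | 47.7 | #14 |
| Action Classification | Charades | MViT-B, 32x3 (Kinetics-600 pretraining) | mAP | 47.1 | #16 |
| Action Classification | Charades | MViT-B, 16x4 (Kinetics-600 pretraining) | mAP | 43.9 | #21 |
| Action Classification | Charades | MViT-B-24, 32x3 (Kinetics-400 pretraining) | mAP | 46.3 | #17 |
| Action Classification | Charades | MViT-B, 32x3 (Kinetics-400 pretraining) | mAP | 44.3 | #19 |
| Action Classification | Charades | MViT-B, 16x4 (Kinetics-400 pretraining) | mAP | 40 | #33 |
| Image Classification | ImageNet | MViT-B-16 | Top 1 Accuracy | 83.0% | #437 |
| Image Classification | ImageNet | MViT-B-16 | Number of params | 37.0M | #662 |
| Image Classification | ImageNet | MViT-B-24 | Top 1 Accuracy | 84.8% | #270 |
| Image Classification | ImageNet | MViT-B-24 | Number of params | 72.9M | #791 |
| Action Classification | Kinetics-400 | MViT-B, 16x4 | Acc@1 | 78.4 | #119 |
| Action Classification | Kinetics-400 | MViT-B, 16x4 | Acc@5 | 93.5 | #91 |
| Action Classification | Kinetics-400 | MViT-S | Acc@1 | 76 | #143 |
| Action Classification | Kinetics-400 | MViT-S | Acc@5 | 92.1 | #107 |
| Action Classification | Kinetics-400 | MViT-B, 64x3 | Acc@1 | 81.2 | #79 |
| Action Classification | Kinetics-400 | MViT-B, 64x3 | Acc@5 | 95.1 | #53 |
| Action Classification | Kinetics-400 | MViT-B, 32x3 | Acc@1 | 80.2 | #96 |
| Action Classification | Kinetics-400 | MViT-B, 32x3 | Acc@5 | 94.4 | #70 |
| Action Classification | Kinetics-600 | MViT-B, 32x3 | Top-1 Accuracy | 83.4 | #39 |
| Action Classification | Kinetics-600 | MViT-B, 32x3 | Top-5 Accuracy | 96.3 | #27 |
| Action Classification | Kinetics-600 | MViT-B, 16x4 | Top-1 Accuracy | 82.1 | #44 |
| Action Classification | Kinetics-600 | MViT-B, 16x4 | Top-5 Accuracy | 95.7 | #33 |
| Action Classification | Kinetics-600 | MViT-B-24, 32x3 | Top-1 Accuracy | 83.8 | #36 |
| Action Classification | Kinetics-600 | MViT-B-24, 32x3 | Top-5 Accuracy | 96.3 | #27 |
| Action Recognition | Something-Something V2 | MViT-B-24, 32x3 | Top-1 Accuracy | 68.7 | #48 |
| Action Recognition | Something-Something V2 | MViT-B-24, 32x3 | Top-5 Accuracy | 91.5 | #35 |
| Action Recognition | Something-Something V2 | MViT-B-24, 32x3 | Parameters | 53.2M | #1 |
| Action Recognition | Something-Something V2 | MViT-B-24, 32x3 | GFLOPs | 236x3 | #6 |
| Action Recognition | Something-Something V2 | MViT-B, 32x3 (Kinetics-600 pretraining) | Top-1 Accuracy | 67.8 | #56 |
| Action Recognition | Something-Something V2 | MViT-B, 32x3 (Kinetics-600 pretraining) | Top-5 Accuracy | 91.3 | #38 |
| Action Recognition | Something-Something V2 | MViT-B, 32x3 (Kinetics-600 pretraining) | Parameters | 36.6M | #32 |
| Action Recognition | Something-Something V2 | MViT-B, 32x3 (Kinetics-600 pretraining) | GFLOPs | 170x3 | #6 |
| Action Recognition | Something-Something V2 | MViT-B, 16x4 | Top-1 Accuracy | 66.2 | #81 |
| Action Recognition | Something-Something V2 | MViT-B, 16x4 | Top-5 Accuracy | 90.2 | #61 |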

Methods