| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | AVA v2.2 | MViTv2-L (IN21k, K700) | mAP | 34.4 | # 18 |
| Instance Segmentation | COCO minival | MViT-L (Mask R-CNN, single-scale) | mask AP | 46.2 | # 42 |
| Instance Segmentation | COCO minival | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) | mask AP | 48.5 | # 33 |
| Object Detection | COCO minival | MViTv2-H (Cascade Mask R-CNN, single-scale, IN21k pre-train) | box AP | 56.1 | # 43 |
| Instance Segmentation | COCO minival | MViTv2-L (Cascade Mask R-CNN, single-scale) | mask AP | 47.1 | # 37 |
| Object Detection | COCO minival | MViTv2-L (Cascade Mask R-CNN, single-scale) | box AP | 54.3 | # 55 |
| Instance Segmentation | COCO minival | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) | mask AP | 50.5 | # 21 |
| Object Detection | COCO minival | MViTv2-L (Cascade Mask R-CNN, multi-scale, IN21k pre-train) | box AP | 58.7 | # 31 |
| Object Detection | COCO minival | MViT-L (Mask R-CNN, single-scale, IN21k pre-train) | box AP | 52.7 | # 62 |
| Object Detection | COCO-O | MViTv2-H (Cascade Mask R-CNN) | Average mAP | 30.9 | # 16 |
| Object Detection | COCO-O | MViTv2-H (Cascade Mask R-CNN) | Effective Robustness | 5.62 | # 21 |
| Image Classification | ImageNet | MViTv2-H (512 res, ImageNet-21k pretrain) | Top 1 Accuracy | 88.8% | # 38 |
| Image Classification | ImageNet | MViTv2-H (512 res, ImageNet-21k pretrain) | Number of params | 667M | # 950 |
| Image Classification | ImageNet | MViTv2-H (512 res, ImageNet-21k pretrain) | GFLOPs | 763.5 | # 490 |
| Image Classification | ImageNet | MViTv2-H (ImageNet-21k pretrain) | Top 1 Accuracy | 88% | # 69 |
| Image Classification | ImageNet | MViTv2-H (ImageNet-21k pretrain) | Number of params | 667M | # 950 |
| Image Classification | ImageNet | MViTv2-H (ImageNet-21k pretrain) | GFLOPs | 120.6 | # 462 |
| Image Classification | ImageNet | MViTv2-L (384 res, ImageNet-21k pretrain) | Top 1 Accuracy | 88.4% | # 55 |
| Image Classification | ImageNet | MViTv2-L (384 res, ImageNet-21k pretrain) | Number of params | 218M | # 903 |
| Image Classification | ImageNet | MViTv2-L (384 res, ImageNet-21k pretrain) | GFLOPs | 140.7 | # 465 |
| Image Classification | ImageNet | MViTv2-L (384 res) | Top 1 Accuracy | 86.3% | # 153 |
| Image Classification | ImageNet | MViTv2-L (384 res) | Number of params | 218M | # 903 |
| Image Classification | ImageNet | MViTv2-L (384 res) | GFLOPs | 140.2 | # 464 |
| Image Classification | ImageNet | MViTv2-T | Top 1 Accuracy | 82.3% | # 501 |
| Image Classification | ImageNet | MViTv2-T | Number of params | 24M | # 579 |
| Image Classification | ImageNet | MViTv2-T | GFLOPs | 4.7 | # 221 |
| Action Classification | Kinetics-400 | MViTv2-L (ImageNet-21k pretrain) | Acc@1 | 86.1 | # 44 |
| Action Classification | Kinetics-400 | MViTv2-L (ImageNet-21k pretrain) | Acc@5 | 97.0 | # 34 |
| Action Classification | Kinetics-400 | MViT-B (train from scratch) | FLOPs (G) x views | 225x5 | # 1 |
| Action Classification | Kinetics-600 | MViTv2-L (ImageNet-21k pretrain) | Top-1 Accuracy | 87.9 | # 23 |
| Action Classification | Kinetics-600 | MViTv2-L (ImageNet-21k pretrain) | Top-5 Accuracy | 97.9 | # 10 |
| Action Classification | Kinetics-600 | MViTv2-L (train from scratch) | Top-1 Accuracy | 85.5 | # 30 |
| Action Classification | Kinetics-600 | MViTv2-B (train from scratch) | Top-5 Accuracy | 97.2 | # 17 |
| Action Classification | Kinetics-600 | MViT-L (train from scratch) | GFLOPs | 206x5 | # 1 |
| Action Classification | Kinetics-700 | MoViNet-A6 | Top-1 Accuracy | 79.4 | # 16 |
| Action Classification | Kinetics-700 | MViTv2-B | Top-1 Accuracy | 76.6 | # 19 |
| Action Classification | Kinetics-700 | MViTv2-B | Top-5 Accuracy | 93.2 | # 10 |
| Action Classification | Kinetics-700 | MViTv2-L (ImageNet-21k pretrain) | Top-1 Accuracy | 79.4 | # 16 |
| Action Classification | Kinetics-700 | MViTv2-L (ImageNet-21k pretrain) | Top-5 Accuracy | 94.9 | # 6 |
| Action Recognition | Something-Something V2 | MViTv2-B (IN-21K + Kinetics400 pretrain) | Top-5 Accuracy | 93.4 | # 18 |
| Action Recognition | Something-Something V2 | MViTv2-B (IN-21K + Kinetics400 pretrain) | Parameters | 51.1 | # 30 |
| Action Recognition | Something-Something V2 | MViT-L (IN-21K + Kinetics400 pretrain) | GFLOPs | 2828x3 | # 6 |
| Action Recognition | Something-Something V2 | MViTv2-L (IN-21K + Kinetics400 pretrain) | Top-1 Accuracy | 73.3 | # 20 |
| Action Recognition | Something-Something V2 | MViTv2-L (IN-21K + Kinetics400 pretrain) | Top-5 Accuracy | 94.1 | # 12 |
| Action Recognition | Something-Something V2 | MViTv2-L (IN-21K + Kinetics400 pretrain) | Parameters | 213.1 | # 20 |
| Action Recognition | Something-Something V2 | MViT-B (IN-21K + Kinetics400 pretrain) | Top-1 Accuracy | 72.1 | # 25 |
| Action Recognition | Something-Something V2 | MViT-B (IN-21K + Kinetics400 pretrain) | GFLOPs | 225x3 | # 6 |