MoViNets: Mobile Video Networks for Efficient Video Recognition

We present Mobile Video Networks (MoViNets), a family of computation and memory efficient video networks that can operate on streaming video for online inference. 3D convolutional neural networks (CNNs) are accurate at video recognition but require large computation and memory budgets and do not support online inference, making them difficult to work on mobile devices...
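
To make "online inference" on streaming video concrete, the following is a minimal sketch and not the MoViNet implementation. The abstract as shown here does not describe the mechanism, so the sketch assumes, purely for illustration, a single causal temporal convolution over per-frame feature vectors. It shows that processing frames one at a time with a small buffer of past frames can reproduce the result of processing the whole clip at once, while memory stays constant in the clip length. The names `causal_conv_full` and `StreamingCausalConv` are hypothetical.

```python
import numpy as np


def causal_conv_full(frames, kernel):
    """Causal temporal convolution over a whole clip at once.

    frames: array of shape (T, C); kernel: array of shape (K,).
    Output frame t depends only on frames t-K+1 .. t (zero-padded at the start).
    """
    k = len(kernel)
    padded = np.concatenate([np.zeros((k - 1, frames.shape[1])), frames], axis=0)
    return np.stack([kernel @ padded[t:t + k] for t in range(frames.shape[0])])


class StreamingCausalConv:
    """Same operation, one frame at a time, keeping only the last K-1 frames.

    This buffer is an illustrative assumption, not the paper's exact design.
    """

    def __init__(self, kernel, channels):
        self.kernel = np.asarray(kernel, dtype=float)
        self.buffer = np.zeros((len(self.kernel) - 1, channels))

    def step(self, frame):
        # Stack the buffered past frames with the new frame: shape (K, C).
        window = np.concatenate([self.buffer, frame[None, :]], axis=0)
        out = self.kernel @ window
        # Drop the oldest frame so the buffer keeps a fixed size.
        self.buffer = window[1:]
        return out


rng = np.random.default_rng(0)
clip = rng.normal(size=(8, 4))          # 8 frames, 4 feature channels
kernel = np.array([0.25, 0.5, 0.25])    # arbitrary temporal kernel

full = causal_conv_full(clip, kernel)
stream = StreamingCausalConv(kernel, channels=4)
online = np.stack([stream.step(f) for f in clip])

print(np.allclose(full, online))
```

The streaming variant holds only K-1 past frames regardless of how long the video runs, which is the property that makes frame-by-frame inference feasible under a tight memory budget.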


Results from the Paper


| Task | Dataset | Model | Metric | Value | Global rank |
|---|---|---|---|---|---|
| Action Classification | Charades | MoViNet-A6 | MAP | 63.2 | #1 |
| Action Classification | Charades | MoViNet-A4 | MAP | 48.5 | #6 |
| Action Classification | Charades | MoViNet-A2 | MAP | 32.5 | #25 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A6 | Action@1 | 47.7 | #1 |
| | | | Verb@1 | 72.2 | #1 |
| | | | Noun@1 | 57.3 | #1 |
| | | | GFLOPs | 117x1 | #1 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A0 | Action@1 | 36.8 | #8 |
| | | | Verb@1 | 64.8 | #6 |
| | | | Noun@1 | 47.4 | #6 |
| | | | GFLOPs | 1.74x1 | #1 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A2 | Action@1 | 41.2 | #5 |
| | | | Verb@1 | 67.1 | #4 |
| | | | Noun@1 | 52.3 | #5 |
| | | | GFLOPs | 7.59x1 | #1 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A4 | Action@1 | 44.4 | #3 |
| | | | Verb@1 | 68.8 | #3 |
| | | | Noun@1 | 56.2 | #3 |
| | | | GFLOPs | 42.2x1 | #1 |
| Action Recognition | EPIC-KITCHENS-100 | MoViNet-A5 | Action@1 | 44.5 | #2 |
| | | | Verb@1 | 69.1 | #2 |
| | | | Noun@1 | 55.1 | #4 |
| | | | GFLOPs | 74.9x1 | #1 |
| Action Classification | Kinetics-400 | MoViNet-A1 | Vid acc@1 | 72.7 | #74 |
| | | | Vid acc@5 | 91.2 | #45 |
| | | | Clip acc@1 | 72.7 | #6 |
| | | | Clip acc@5 | 91.2 | #6 |
| | | | Flops x views | 6.0x1 | #1 |
| Action Classification | Kinetics-400 | MoViNet-A6 | Vid acc@1 | 81.5 | #7 |
| | | | Vid acc@5 | 95.3 | #3 |
| | | | Clip acc@1 | 79.1 | #3 |
| | | | Clip acc@5 | 95.3 | #1 |
| | | | Flops x views | 386x1 | #1 |
| Action Classification | Kinetics-400 | MoViNet-A5 | Vid acc@1 | 80.9 | #12 |
| | | | Vid acc@5 | 94.9 | #8 |
| | | | Clip acc@1 | 80.9 | #1 |
| | | | Clip acc@5 | 94.9 | #2 |
| | | | Flops x views | 281x1 | #1 |
| Action Classification | Kinetics-400 | MoViNet-A4 | Vid acc@1 | 80.5 | #14 |
| | | | Vid acc@5 | 94.5 | #13 |
| | | | Clip acc@1 | 80.5 | #2 |
| | | | Clip acc@5 | 94.5 | #3 |
| | | | Flops x views | 105x1 | #1 |
| Action Classification | Kinetics-400 | MoViNet-A3 | Vid acc@1 | 78.2 | #34 |
| | | | Vid acc@5 | 93.8 | #22 |
| | | | Clip acc@1 | 78.2 | #4 |
| | | | Clip acc@5 | 93.8 | #4 |
| | | | Flops x views | 56.9x1 | #1 |
| Action Classification | Kinetics-400 | MoViNet-A2 | Vid acc@1 | 75.0 | #61 |
| | | | Vid acc@5 | 92.3 | #38 |
| | | | Clip acc@1 | 75.0 | #5 |
| | | | Clip acc@5 | 92.3 | #5 |
| | | | Flops x views | 10.3x1 | #1 |
| Action Classification | Kinetics-400 | MoViNet-A0 | Vid acc@1 | 65.8 | #84 |
| | | | Vid acc@5 | 87.4 | #57 |
| | | | Clip acc@1 | 65.8 | #7 |
| | | | Clip acc@5 | 87.4 | #7 |
| | | | Flops x views | 2.7x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A3 | Top-1 Accuracy | 80.8 | #16 |
| | | | Top-5 Accuracy | 80.8 | #22 |
| | | | GFLOPs | 56.9x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A6 | Top-1 Accuracy | 83.5 | #6 |
| | | | GFLOPs | 386x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A5 | Top-1 Accuracy | 82.7 | #9 |
| | | | Top-5 Accuracy | 95.7 | #7 |
| | | | GFLOPs | 281x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A4 | Top-1 Accuracy | 81.2 | #14 |
| | | | Top-5 Accuracy | 94.9 | #14 |
| | | | GFLOPs | 105x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A0 | Top-1 Accuracy | 71.5 | #28 |
| | | | Top-5 Accuracy | 90.4 | #21 |
| | | | GFLOPs | 2.7x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A2 | Top-1 Accuracy | 77.5 | #23 |
| | | | Top-5 Accuracy | 93.4 | #18 |
| | | | GFLOPs | 10.3x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A1 | Top-1 Accuracy | 76.0 | #25 |
| | | | Top-5 Accuracy | 92.6 | #19 |
| | | | GFLOPs | 6.0x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A6 (AutoAugment) | Top-1 Accuracy | 84.8 | #2 |
| | | | Top-5 Accuracy | 96.5 | #2 |
| | | | GFLOPs | 386x1 | #1 |
| Action Classification | Kinetics-600 | MoViNet-A5 (AutoAugment) | Top-1 Accuracy | 84.3 | #3 |
| | | | Top-5 Accuracy | 96.4 | #4 |
| | | | GFLOPs | 281x1 | #1 |
| Action Classification | Kinetics-700 | MoViNet-A1 | Top-1 Accuracy | 63.5 | #6 |
| Action Classification | Kinetics-700 | MoViNet-A0 | Top-1 Accuracy | 58.5 | #7 |
| Action Classification | Kinetics-700 | MoViNet-A6 | Top-1 Accuracy | 72.3 | #1 |
| Action Classification | Kinetics-700 | MoViNet-A5 | Top-1 Accuracy | 71.7 | #2 |
| Action Classification | Kinetics-700 | MoViNet-A4 | Top-1 Accuracy | 70.7 | #3 |
| Action Classification | Kinetics-700 | MoViNet-A3 | Top-1 Accuracy | 68.0 | #4 |
| Action Classification | Kinetics-700 | MoViNet-A2 | Top-1 Accuracy | 66.7 | #5 |
| Action Classification | Moments in Time | MoViNet-A1 | Top 1 Accuracy | 32.0 | #11 |
| Action Classification | Moments in Time | MoViNet-A0 | Top 1 Accuracy | 27.5 | #19 |
| Action Classification | Moments in Time | MoViNet-A4 | Top 1 Accuracy | 37.9 | #5 |
| Action Classification | Moments in Time | MoViNet-A5 | Top 1 Accuracy | 39.1 | #3 |
| Action Classification | Moments in Time | MoViNet-A3 | Top 1 Accuracy | 35.6 | #6 |
| Action Classification | Moments in Time | MoViNet-A2 | Top 1 Accuracy | 34.3 | #7 |
| Action Classification | Moments in Time | MoViNet-A6 | Top 1 Accuracy | 40.2 | #2 |
| Action Recognition | Something-Something V2 | MoViNet-A1 | Top-1 Accuracy | 62.7 | #24 |
| | | | Top-5 Accuracy | 89.0 | #20 |
| | | | Parameters | 4.6M | #6 |
| | | | GFLOPs | 6.0x1 | #1 |
| Action Recognition | Something-Something V2 | MoViNet-A0 | Top-1 Accuracy | 61.3 | #31 |
| | | | Top-5 Accuracy | 88.2 | #26 |
| | | | Parameters | 3.1M | #10 |
| | | | GFLOPs | 2.7x1 | #1 |
| Action Recognition | Something-Something V2 | MoViNet-A3 | Top-1 Accuracy | 64.1 | #19 |
| | | | Top-5 Accuracy | 88.8 | #22 |
| | | | Parameters | 5.3M | #4 |
| | | | GFLOPs | 23.7x1 | #1 |
| Action Recognition | Something-Something V2 | MoViNet-A2 | Top-1 Accuracy | 63.5 | #21 |
| | | | Top-5 Accuracy | 89.0 | #20 |
| | | | Parameters | 4.8M | #5 |
| | | | GFLOPs | 10.3x1 | #1 |
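
The cost entries above use a "GFLOPs x views" notation such as `386x1`. This page does not spell out the convention, so the sketch below assumes the common reading: GFLOPs for one evaluated view, times the number of views per video, with total inference cost being the product. The helper name `total_gflops` is hypothetical; the accuracy and cost figures are the Kinetics-600 values for MoViNet-A0 and MoViNet-A6 listed above.

```python
def total_gflops(cost: str) -> float:
    """Parse a 'GFLOPs x views' string such as '386x1' into total GFLOPs.

    Assumes the format '<per-view GFLOPs>x<number of views>'.
    """
    per_view, views = cost.lower().split("x")
    return float(per_view) * int(views)


# Kinetics-600 entries from the results listing: (top-1 accuracy, cost string).
entries = {
    "MoViNet-A0": (71.5, "2.7x1"),
    "MoViNet-A6": (83.5, "386x1"),
}

for name, (top1, cost) in entries.items():
    gflops = total_gflops(cost)
    print(f"{name}: {top1} top-1 at {gflops} GFLOPs per video")
```

Under this reading, every MoViNet entry is evaluated with a single view, so the per-view figure is also the full per-video cost.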
