Multiview Transformers for Video Recognition

Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state-of-the-art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes. Furthermore, we achieve state-of-the-art results on six standard datasets, and improve even further with large-scale pretraining. Code and checkpoints are available at:
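To make the idea concrete, here is a rough sketch (not the paper's implementation) of the two ingredients the abstract describes: tokenizing the same video at different spatiotemporal resolutions ("views") and fusing information across views with a lateral connection. All shapes, the single-head cross-view attention, and the `tubelet_tokens`/`lateral_fuse` helpers are illustrative assumptions, not MTV's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy video: (frames, height, width, channels)
video = rng.normal(size=(16, 32, 32, 3))

def tubelet_tokens(video, t, h, w, dim, rng):
    """Split the video into (t, h, w) tubelets and linearly project each
    one to a `dim`-dimensional token (a stand-in for tubelet embedding)."""
    T, H, W, C = video.shape
    proj = rng.normal(size=(t * h * w * C, dim)) / np.sqrt(t * h * w * C)
    tokens = []
    for ti in range(0, T, t):
        for hi in range(0, H, h):
            for wi in range(0, W, w):
                patch = video[ti:ti+t, hi:hi+h, wi:wi+w, :].reshape(-1)
                tokens.append(patch @ proj)
    return np.stack(tokens)  # (num_tokens, dim)

# Two views of the same clip: a fine view (small tubelets, many tokens,
# smaller embedding) and a coarse view (large tubelets, few tokens,
# larger embedding). Each view would feed its own transformer encoder.
fine   = tubelet_tokens(video, t=2, h=8,  w=8,  dim=64,  rng=rng)
coarse = tubelet_tokens(video, t=8, h=16, w=16, dim=128, rng=rng)

def lateral_fuse(target, source, rng):
    """Fuse information from `source` tokens into `target` tokens with
    single-head cross-view attention (one simple choice of lateral
    connection), returning a residual update of the target view."""
    d_t, d_s = target.shape[1], source.shape[1]
    Wq = rng.normal(size=(d_t, d_t)) / np.sqrt(d_t)
    Wk = rng.normal(size=(d_s, d_t)) / np.sqrt(d_s)
    Wv = rng.normal(size=(d_s, d_t)) / np.sqrt(d_s)
    q, k, v = target @ Wq, source @ Wk, source @ Wv
    attn = q @ k.T / np.sqrt(d_t)
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return target + attn @ v

fused = lateral_fuse(coarse, fine, rng)
print(fine.shape, coarse.shape, fused.shape)
```

The key property the sketch preserves is that fusion leaves each view's token count and width unchanged, so the per-view encoders can keep operating at their own resolution after every fusion step.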

CVPR 2022

Results from the Paper

Ranked #4 on Action Classification on Kinetics-700 (using extra training data)

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Action Recognition | EPIC-KITCHENS-100 | MTV-B (WTS 60M) | Action@1 | 50.5 | #5 |
| | | | Verb@1 | 69.9 | #10 |
| | | | Noun@1 | 63.9 | #4 |
| Action Classification | Kinetics-400 | MTV-H (WTS 60M) | Acc@1 | 89.9 | #8 |
| | | | Acc@5 | 98.3 | #8 |
| | | | FLOPs (G) x views | 735700x4x3 | #1 |
| Action Classification | Kinetics-600 | MTV-H (WTS 60M) | Top-1 Accuracy | 90.3 | #7 |
| | | | Top-5 Accuracy | 98.5 | #4 |
| Action Classification | Kinetics-700 | MTV-H (WTS 60M) | Top-1 Accuracy | 83.4 | #4 |
| | | | Top-5 Accuracy | 96.2 | #3 |
| Action Classification | Moments in Time | MTV-H (WTS 60M) | Top-1 Accuracy | 47.2 | #5 |
| | | | Top-5 Accuracy | 75.7 | #3 |
| Action Recognition | Something-Something V2 | MTV-B | Top-1 Accuracy | 68.5 | #43 |
| | | | Top-5 Accuracy | 90.4 | #54 |

