UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

Submitted to ICLR 2023 (2022) · Anonymous

Learning discriminative spatiotemporal representations is the key problem in video understanding. Recently, Vision Transformers (ViTs) have shown their power in learning long-term video dependencies with self-attention. Unfortunately, they are limited in tackling local video redundancy, because self-attention blindly compares all tokens globally. UniFormer alleviates this issue by unifying convolution and self-attention as a relation aggregator in the transformer format. However, UniFormer requires a tiresome and complicated image-pretraining phase before it can be fine-tuned on videos, which blocks its wide use in practice. By contrast, open-source ViTs are readily available and well pretrained with rich image supervision. Based on these observations, we propose a generic paradigm for building a powerful family of video networks by arming pretrained ViTs with efficient UniFormer designs. We call this family UniFormerV2, since it inherits the concise style of the UniFormer block. It also contains brand-new local and global relation aggregators, which allow for a preferable accuracy-computation balance by seamlessly integrating the advantages of both ViTs and UniFormer. Without any bells and whistles, UniFormerV2 achieves state-of-the-art recognition performance on 8 popular video benchmarks: the scene-related Kinetics-400/600/700 and Moments in Time, the temporal-related Something-Something V1/V2, and the untrimmed ActivityNet and HACS. In particular, to the best of our knowledge, it is the first model to achieve 90% top-1 accuracy on Kinetics-400. The models will be released afterward.

Results from the Paper

 Ranked #1 on Action Recognition on HACS (using extra training data)

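| Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Uses Extra Training Data |
|---|---|---|---|---|---|---|
| Action Classification | ActivityNet | UniFormerV2-L | Top 1 Accuracy | 94.7 | #1 | |
| Action Classification | ActivityNet | UniFormerV2-L | Top 5 Accuracy | 99.5 | #1 | |
| Action Recognition | HACS | UniFormerV2-L | Top 1 Accuracy | 95.5 | #1 | |
| Action Recognition | HACS | UniFormerV2-L | Top 5 Accuracy | 99.8 | #1 | |
| Action Classification | Kinetics-400 | UniFormerV2-L (ViT-L, 336) | Acc@1 | 90.0 | #7 | |
| Action Classification | Kinetics-400 | UniFormerV2-L (ViT-L, 336) | Acc@5 | 98.4 | #5 | |
| Action Classification | Kinetics-400 | UniFormerV2-L (ViT-L, 336) | FLOPs (G) x views | 75300x3x2 | #1 | |
| Action Classification | Kinetics-400 | UniFormerV2-L (ViT-L, 336) | Parameters (M) | 354 | #26 | |
| Action Classification | Kinetics-600 | UniFormerV2-L | Top-1 Accuracy | 90.1 | #8 | |
| Action Classification | Kinetics-600 | UniFormerV2-L | Top-5 Accuracy | 98.5 | #4 | |
| Action Classification | Kinetics-700 | UniFormerV2-L | Top-1 Accuracy | 82.7 | #6 | |
| Action Classification | Kinetics-700 | UniFormerV2-L | Top-5 Accuracy | 96.2 | #3 | |
| Action Classification | MiT | UniFormerV2-L | Top 1 Accuracy | 47.8 | #4 | |
| Action Classification | MiT | UniFormerV2-L | Top 5 Accuracy | 76.9 | #2 | |
| Action Recognition | Something-Something V1 | UniFormerV2-L | Top 1 Accuracy | 62.7 | #6 | |
| Action Recognition | Something-Something V1 | UniFormerV2-L | Top 5 Accuracy | 88.0 | #4 | |
| Action Recognition | Something-Something V2 | UniFormerV2-L | Top-1 Accuracy | 73.0 | #19 | |
| Action Recognition | Something-Something V2 | UniFormerV2-L | Top-5 Accuracy | 94.5 | #9 | |
| Action Recognition | Something-Something V2 | UniFormerV2-L | GFLOPs | 5154 | #2 | |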
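
The Kinetics-400 entry reports compute as "FLOPs (G) x views" with the value 75300x3x2. A minimal sketch of how such a string multiplies out into a total inference cost is given below. It assumes the first factor is the per-view cost in GFLOPs and the remaining factors are view counts; the listing does not say which factor counts temporal clips and which counts spatial crops, and the helper name `total_gflops` is hypothetical.

```python
def total_gflops(spec: str) -> float:
    """Multiply out a 'FLOPs x views' string such as '75300x3x2'.

    Assumption: the first factor is the per-view cost in GFLOPs and every
    following factor is a number of test-time views.
    """
    per_view, *views = spec.lower().split("x")
    total = float(per_view)
    for count in views:
        total *= int(count)
    return total


total = total_gflops("75300x3x2")
print(f"{total:,.0f} GFLOPs = {total / 1000:.1f} TFLOPs")
# prints: 451,800 GFLOPs = 451.8 TFLOPs
```

Under that reading, the figure is not directly comparable to the single GFLOPs value listed for Something-Something V2, which has no view factors attached.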