TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Classification	ActivityNet	UniFormerV2-L	Top 1 Accuracy	94.7	# 1
Action Classification	ActivityNet	UniFormerV2-L	Top 5 Accuracy	99.5	# 1
Action Recognition	HACS	UniFormerV2-L	Top 1 Accuracy	95.5	# 2
Action Recognition	HACS	UniFormerV2-L	Top 5 Accuracy	99.8	# 1
Action Classification	Kinetics-400	UniFormerV2-L (ViT-L, 336)	Acc@1	90.0	# 9
Action Classification	Kinetics-400	UniFormerV2-L (ViT-L, 336)	Acc@5	98.4	# 5
Action Classification	Kinetics-400	UniFormerV2-L (ViT-L, 336)	FLOPs (G) x views	75300x3x2	# 1
Action Classification	Kinetics-400	UniFormerV2-L (ViT-L, 336)	Parameters (M)	354	# 28
Action Classification	Kinetics-600	UniFormerV2-L	Top-1 Accuracy	90.1	# 10
Action Classification	Kinetics-600	UniFormerV2-L	Top-5 Accuracy	98.5	# 4
Action Classification	Kinetics-700	UniFormerV2-L	Top-1 Accuracy	82.7	# 8
Action Classification	Kinetics-700	UniFormerV2-L	Top-5 Accuracy	96.2	# 3
Action Classification	MiT	UniFormerV2-L	Top 1 Accuracy	47.8	# 5
Action Classification	MiT	UniFormerV2-L	Top 5 Accuracy	76.9	# 2
Action Recognition	Something-Something V1	UniFormerV2-L	Top 1 Accuracy	62.7	# 6
Action Recognition	Something-Something V1	UniFormerV2-L	Top 5 Accuracy	88.0	# 4
Action Recognition	Something-Something V2	UniFormerV2-L	Top-1 Accuracy	73.0	# 22
Action Recognition	Something-Something V2	UniFormerV2-L	Top-5 Accuracy	94.5	# 9
Action Recognition	Something-Something V2	UniFormerV2-L	GFLOPs	5154	# 2
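
The Top-1 and Top-5 accuracy rows above report the percentage of test videos whose ground-truth class is among the 1 or 5 highest-scoring predictions. The following is a minimal sketch of that metric, not the evaluation code used for these benchmarks; the function name `top_k_accuracy` and the toy scores and labels are made up for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("need at least one sample")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy data: 4 clips, 3 classes (illustrative numbers, not benchmark outputs).
toy_scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.3, 0.5],
]
toy_labels = [0, 2, 0, 1]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```

On the toy data, two of the four clips are correct at k=1 and all four at k=2, which is why Top-5 figures in the table are always at least as high as the matching Top-1 figures.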

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/uniformerv2-spatiotemporal-learning-by-arming/action-classification-on-activitynet)](https://paperswithcode.com/sota/action-classification-on-activitynet?p=uniformerv2-spatiotemporal-learning-by-arming)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/uniformerv2-spatiotemporal-learning-by-arming/action-recognition-on-hacs)](https://paperswithcode.com/sota/action-recognition-on-hacs?p=uniformerv2-spatiotemporal-learning-by-arming)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/uniformerv2-spatiotemporal-learning-by-arming/action-classification-on-moments-in-time)](https://paperswithcode.com/sota/action-classification-on-moments-in-time?p=uniformerv2-spatiotemporal-learning-by-arming)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/uniformerv2-spatiotemporal-learning-by-arming/action-recognition-in-videos-on-something-1)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something-1?p=uniformerv2-spatiotemporal-learning-by-arming)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/uniformerv2-spatiotemporal-learning-by-arming/action-classification-on-kinetics-700)](https://paperswithcode.com/sota/action-classification-on-kinetics-700?p=uniformerv2-spatiotemporal-learning-by-arming)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/uniformerv2-spatiotemporal-learning-by-arming/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=uniformerv2-spatiotemporal-learning-by-arming)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/uniformerv2-spatiotemporal-learning-by-arming/action-classification-on-kinetics-600)](https://paperswithcode.com/sota/action-classification-on-kinetics-600?p=uniformerv2-spatiotemporal-learning-by-arming)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/uniformerv2-spatiotemporal-learning-by-arming/action-recognition-in-videos-on-something)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something?p=uniformerv2-spatiotemporal-learning-by-arming)`
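
The badge lines above all follow one pattern: a paper slug and a benchmark slug are placed into an image URL and a link URL. A small sketch of that pattern, inferred only from the lines listed here; the helper name `pwc_badge` and the constant names are hypothetical, and the code makes no claim about whether the endpoints are still served.

```python
# URL prefixes copied from the badge lines above.
BADGE_ENDPOINT = (
    "https://img.shields.io/endpoint.svg"
    "?url=https://paperswithcode.com/badge"
)
SOTA_BASE = "https://paperswithcode.com/sota"


def pwc_badge(paper_slug: str, benchmark_slug: str) -> str:
    """Build the markdown for one badge from a paper slug and a benchmark slug."""
    image = f"{BADGE_ENDPOINT}/{paper_slug}/{benchmark_slug}"
    link = f"{SOTA_BASE}/{benchmark_slug}?p={paper_slug}"
    return f"[![PWC]({image})]({link})"


# Slugs taken verbatim from the first badge line above.
PAPER = "uniformerv2-spatiotemporal-learning-by-arming"
badge = pwc_badge(PAPER, "action-classification-on-activitynet")
```

Calling the helper with another benchmark slug from the list, such as `action-recognition-on-hacs`, yields the corresponding badge line.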

UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

ICLR 2023 submitted 2022 · Anonymous

Learning discriminative spatiotemporal representations is the key problem of video understanding. Recently, Vision Transformers (ViTs) have shown their power in learning long-term video dependencies with self-attention. Unfortunately, they exhibit limitations in tackling local video redundancy, due to the blind global comparison among tokens. UniFormer has successfully alleviated this issue by unifying convolution and self-attention as a relation aggregator in the transformer format. However, this model requires a tiresome and complicated image-pretraining phase before being finetuned on videos, which blocks its wide usage in practice. In contrast, open-sourced ViTs are readily available and well-pretrained with rich image supervision. Based on these observations, we propose a generic paradigm to build a powerful family of video networks by arming the pretrained ViTs with efficient UniFormer designs. We call this family UniFormerV2, since it inherits the concise style of the UniFormer block. However, it contains brand-new local and global relation aggregators, which allow for a preferable accuracy-computation balance by seamlessly integrating the advantages of both ViTs and UniFormer. Without any bells and whistles, our UniFormerV2 achieves state-of-the-art recognition performance on 8 popular video benchmarks, including the scene-related Kinetics-400/600/700 and Moments in Time, the temporal-related Something-Something V1/V2, and the untrimmed ActivityNet and HACS. In particular, it is the first model to achieve 90% top-1 accuracy on Kinetics-400, to the best of our knowledge. The models will be released afterward.

Code

OpenGVLab/UniFormerV2

267

innat/UniFormerV2

Tasks

Action Classification

Action Recognition

Video Understanding

Datasets

ImageNet

Kinetics

ActivityNet

Kinetics 400

Something-Something V2

Kinetics-600

Something-Something V1

MiT

Kinetics-700

HACS

Results from the Paper

Ranked #1 on Action Classification on ActivityNet (using extra training data)

Methods

Adam • Dense Connections • Dropout • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer
