TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Classification	Kinetics-400	UniFormer-B (ImageNet-1K)	Acc@1	82.9	# 64
Action Classification	Kinetics-400	UniFormer-B (ImageNet-1K)	Acc@5	94.5	# 65
Action Classification	Kinetics-400	UniFormer-B (ImageNet-1K)	FLOPs (G) x views	259x4	# 1
Action Classification	Kinetics-600	UniFormer-B (ImageNet-1K)	Top-1 Accuracy	84.8	# 31
Action Classification	Kinetics-600	UniFormer-B (ImageNet-1K)	Top-5 Accuracy	96.7	# 20
Action Classification	Kinetics-600	UniFormer-B (ImageNet-1K)	GFLOPs	259x4	# 1
Action Recognition	Something-Something V1	UniFormer-B (IN-1K + Kinetics400)	Top 1 Accuracy	60.9	# 8
Action Recognition	Something-Something V1	UniFormer-B (IN-1K + Kinetics400)	Top 5 Accuracy	87.3	# 5
Action Recognition	Something-Something V1	UniFormer-B (IN-1K + Kinetics400)	Param.	50.1	# 2
Action Recognition	Something-Something V1	UniFormer-B (IN-1K + Kinetics400)	GFLOPs	259x3	# 1
Action Recognition	Something-Something V1	UniFormer-S (IN-1K + Kinetics600)	Top 1 Accuracy	57.6	# 12
Action Recognition	Something-Something V1	UniFormer-S (IN-1K + Kinetics600)	Top 5 Accuracy	84.9	# 7
Action Recognition	Something-Something V1	UniFormer-S (IN-1K + Kinetics600)	Param.	21.4	# 1
Action Recognition	Something-Something V1	UniFormer-S (IN-1K + Kinetics600)	GFLOPs	41.8x3	# 1
Action Recognition	Something-Something V2	UniFormer-S (IN-1K + Kinetics600 pretrain)	Top-1 Accuracy	69.4	# 46
Action Recognition	Something-Something V2	UniFormer-S (IN-1K + Kinetics600 pretrain)	Top-5 Accuracy	92.1	# 30
Action Recognition	Something-Something V2	UniFormer-S (IN-1K + Kinetics600 pretrain)	Parameters	21.4	# 36
Action Recognition	Something-Something V2	UniFormer-S (IN-1K + Kinetics600 pretrain)	GFLOPs	41.8x3	# 6
Action Recognition	Something-Something V2	UniFormer-B (IN-1K + Kinetics400 pretrain)	Top-1 Accuracy	71.2	# 30
Action Recognition	Something-Something V2	UniFormer-B (IN-1K + Kinetics400 pretrain)	Top-5 Accuracy	92.8	# 20
Action Recognition	Something-Something V2	UniFormer-B (IN-1K + Kinetics400 pretrain)	Parameters	50.1	# 31
Action Recognition	Something-Something V2	UniFormer-B (IN-1K + Kinetics400 pretrain)	GFLOPs	259x3	# 6
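
The compute entries in the table use a "cost x views" notation such as 259x4. Reading this as per-view GFLOPs multiplied by the number of test-time views (as the metric name "FLOPs (G) x views" suggests), the total inference cost per video can be worked out with a small helper. This is a minimal sketch; the function name `total_gflops` is hypothetical and not part of any UniFormer code.

```python
def total_gflops(cost: str) -> float:
    """Parse a 'PER_VIEWxVIEWS' string (e.g. '259x4') and return the product.

    Assumption: the number before 'x' is GFLOPs for one view and the
    number after it is the count of views evaluated per video.
    """
    per_view, views = cost.lower().split("x")
    return float(per_view) * int(views)


if __name__ == "__main__":
    # Values copied from the table rows above.
    for label, cost in [
        ("UniFormer-B, Kinetics-400/600", "259x4"),
        ("UniFormer-B, Something-Something", "259x3"),
        ("41.8 GFLOPs/view entries, Something-Something", "41.8x3"),
    ]:
        print(f"{label}: {total_gflops(cost):.1f} GFLOPs in total")
```

Under this reading, the Kinetics entries cost 1036 GFLOPs per video in total, while the Something-Something entries with the same model cost 777 GFLOPs because they use three views instead of four.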

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/uniformer-unified-transformer-for-efficient/action-recognition-in-videos-on-something-1)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something-1?p=uniformer-unified-transformer-for-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/uniformer-unified-transformer-for-efficient/action-recognition-in-videos-on-something)](https://paperswithcode.com/sota/action-recognition-in-videos-on-something?p=uniformer-unified-transformer-for-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/uniformer-unified-transformer-for-efficient/action-classification-on-kinetics-600)](https://paperswithcode.com/sota/action-classification-on-kinetics-600?p=uniformer-unified-transformer-for-efficient)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/uniformer-unified-transformer-for-efficient/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=uniformer-unified-transformer-for-efficient)`

UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning

ICLR 2022 · Kunchang Li, Yali Wang, Gao Peng, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao

It is a challenging task to learn rich and multi-scale spatial-temporal semantics from high-dimensional videos, due to large local redundancy and complex global dependency between video frames. The recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers. Although 3D convolution can efficiently aggregate local context to suppress local redundancy from a small 3D neighborhood, it lacks the capability to capture global dependency because of the limited receptive field. Alternatively, vision transformers can effectively capture long-range dependency by self-attention mechanism, while having limitations on reducing local redundancy with blind similarity comparison among all the tokens in each layer. Based on these observations, we propose a novel Unified transFormer (UniFormer) which seamlessly integrates merits of 3D convolution and spatial-temporal self-attention in a concise transformer format, and achieves a preferable balance between computation and accuracy. Different from traditional transformers, our relation aggregator can tackle both spatial-temporal redundancy and dependency, by learning local and global token affinity respectively in shallow and deep layers. We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2. With only ImageNet-1K pretraining, our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods. For Something-Something V1 and V2, our UniFormer achieves new state-of-the-art performances of 60.9% and 71.2% top-1 accuracy respectively.
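
To make the contrast between local and global token affinity concrete, here is a toy one-dimensional sketch in plain Python. It is not the authors' implementation: the function names, the scalar tokens, and the specific affinity forms (fixed per-offset weights for the local case, softmax over dot products for the global case) are illustrative assumptions based only on the abstract's description of convolution-like and attention-like aggregation.

```python
import math


def local_aggregate(tokens, offset_weights):
    """Aggregate each token from a small neighborhood only.

    offset_weights[k] is the affinity for relative offset (k - radius).
    It depends only on relative position, as in a convolution, so no
    token-to-token similarity is computed. Neighbors that fall outside
    the sequence are skipped.
    """
    radius = len(offset_weights) // 2
    out = []
    for i in range(len(tokens)):
        acc = 0.0
        for k, w in enumerate(offset_weights):
            j = i + k - radius
            if 0 <= j < len(tokens):
                acc += w * tokens[j]
        out.append(acc)
    return out


def global_aggregate(tokens):
    """Aggregate each token from all tokens, self-attention style.

    The affinity between two scalar tokens is their product, normalised
    with a softmax over the whole sequence, so every output is a convex
    combination of all inputs.
    """
    out = []
    for q in tokens:
        scores = [q * k for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        out.append(sum(e / norm * v for e, v in zip(exps, tokens)))
    return out


if __name__ == "__main__":
    tokens = [1.0, 2.0, 3.0]
    print(local_aggregate(tokens, [0.25, 0.5, 0.25]))
    print(global_aggregate(tokens))
```

The local version touches a fixed number of neighbors per token, so its cost grows linearly with sequence length, whereas the global version compares every pair of tokens and grows quadratically. That difference is the computation/accuracy trade-off the abstract refers to when it places local aggregation in shallow layers and global aggregation in deep layers.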

Code

sense-x/uniformer official

777

towhee-io/towhee

2,986

lucidrains/uniformer-pytorch

Tasks

Action Classification

Action Recognition

Representation Learning

Datasets

ImageNet

Kinetics

Kinetics 400

Something-Something V2

Kinetics-600

Something-Something V1

Results from the Paper

Ranked #8 on Action Recognition on Something-Something V1

Methods

3D Convolution • Convolution
