MAR: Masked Autoencoders for Efficient Action Recognition

24 Jul 2022 · Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Xiang Wang, Yuehuan Wang, Yiliang Lv, Changxin Gao, Nong Sang

Standard approaches to video recognition usually operate on full input videos, which is inefficient due to the spatio-temporal redundancy widely present in videos. Recent progress in masked video modelling, i.e., VideoMAE, has shown the ability of vanilla Vision Transformers (ViT) to complement spatio-temporal contexts given only limited visual content. Inspired by this, we propose Masked Action Recognition (MAR), which reduces redundant computation by discarding a proportion of patches and operating only on part of each video. MAR contains two indispensable components: cell running masking and a bridging classifier. Specifically, to enable the ViT to easily perceive details beyond the visible patches, cell running masking is presented to preserve the spatio-temporal correlations in videos, ensuring that patches at the same spatial location can be observed in turn for easy reconstruction. Additionally, we notice that, although the partially observed features can reconstruct semantically explicit invisible patches, they fail to achieve accurate classification. To address this, a bridging classifier is proposed to bridge the semantic gap between the ViT-encoded features used for reconstruction and the features specialized for classification. Our proposed MAR reduces the computational cost of ViT by 53%, and extensive experiments show that MAR consistently outperforms existing ViT models by a notable margin. In particular, we find that a ViT-Large trained with MAR outperforms a ViT-Huge trained with a standard scheme by convincing margins on both Kinetics-400 and Something-Something v2, while the computational overhead of ViT-Large is only 14.5% of that of ViT-Huge.
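The abstract only names cell running masking at a high level. As a concrete illustration, below is a minimal, hypothetical sketch of a running-cell mask generator, assuming 2×2 spatial cells and a simple raster visiting order; the function name, arguments, and visiting order are illustrative and not the paper's exact implementation.

```python
import torch

def cell_running_mask(num_frames, height, width, cell=2, visit_order=None):
    """Hypothetical sketch of a running-cell masking pattern (names and
    details are assumptions, not the paper's implementation).

    The patch grid is split into `cell` x `cell` cells. In every cell,
    exactly one position is kept visible per frame, and that position
    cycles over time so each spatial location is observed in turn.
    With cell=2 this yields a 75% masking ratio.
    """
    assert height % cell == 0 and width % cell == 0
    # Order in which positions inside one cell are visited, e.g. a raster scan.
    order = visit_order or [(r, c) for r in range(cell) for c in range(cell)]
    mask = torch.ones(num_frames, height, width, dtype=torch.bool)  # True = masked
    for t in range(num_frames):
        dr, dc = order[t % len(order)]
        mask[t, dr::cell, dc::cell] = False  # keep this position visible at frame t
    return mask.flatten(1)  # (num_frames, height * width), ready to index patch tokens

# Example: 8 temporal tokens over a 14x14 patch grid -> ~75% of tokens masked.
m = cell_running_mask(8, 14, 14)
print(m.float().mean().item())  # ~0.75
```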

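The bridging classifier is likewise only described at a high level here. The following is a hedged sketch, assuming it can be approximated by a few additional transformer encoder blocks applied to the ViT features of the visible tokens, followed by pooling and a linear classification head; the layer count, widths, and pooling choice are assumptions.

```python
import torch
import torch.nn as nn

class BridgingClassifier(nn.Module):
    """Hedged sketch of a bridging classifier (depth, widths, and pooling
    are assumptions). It re-encodes reconstruction-oriented encoder features
    into features suitable for classification.
    """
    def __init__(self, dim=1024, depth=2, heads=16, num_classes=400):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, visible_tokens):        # (B, N_visible, dim)
        x = self.blocks(visible_tokens)       # refine reconstruction-oriented features
        x = self.norm(x.mean(dim=1))          # global average pooling over visible tokens
        return self.head(x)                   # (B, num_classes) logits
```

In use, the visible-token features from the masked ViT encoder would be passed through such a module to produce class logits, while a separate decoder handles the reconstruction objective.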
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Action Classification | Kinetics-400 | MAR (50% mask, ViT-L, 16x4) | Acc@1 | 85.3 | #49 |
| Action Classification | Kinetics-400 | MAR (50% mask, ViT-L, 16x4) | Acc@5 | 96.3 | #38 |
| Action Classification | Kinetics-400 | MAR (75% mask, ViT-B, 16x4) | Acc@1 | 79.4 | #103 |
| Action Classification | Kinetics-400 | MAR (75% mask, ViT-B, 16x4) | Acc@5 | 93.7 | #86 |
| Action Classification | Kinetics-400 | MAR (50% mask, ViT-B, 16x4) | Acc@1 | 81.0 | #82 |
| Action Classification | Kinetics-400 | MAR (50% mask, ViT-B, 16x4) | Acc@5 | 94.4 | #70 |
| Action Classification | Kinetics-400 | MAR (75% mask, ViT-L, 16x4) | Acc@1 | 83.9 | #57 |
| Action Classification | Kinetics-400 | MAR (75% mask, ViT-L, 16x4) | Acc@5 | 96.0 | #41 |
| Action Recognition | Something-Something V2 | MAR (50% mask, ViT-L, 16x4) | Top-1 Accuracy | 74.7 | #12 |
| Action Recognition | Something-Something V2 | MAR (50% mask, ViT-L, 16x4) | Top-5 Accuracy | 94.9 | #7 |
| Action Recognition | Something-Something V2 | MAR (50% mask, ViT-L, 16x4) | Parameters (M) | 311 | #14 |
| Action Recognition | Something-Something V2 | MAR (50% mask, ViT-L, 16x4) | GFLOPs | 276x6 | #6 |
| Action Recognition | Something-Something V2 | MAR (75% mask, ViT-B, 16x4) | Top-1 Accuracy | 69.5 | #44 |
| Action Recognition | Something-Something V2 | MAR (75% mask, ViT-B, 16x4) | Top-5 Accuracy | 91.9 | #31 |
| Action Recognition | Something-Something V2 | MAR (75% mask, ViT-B, 16x4) | Parameters (M) | 94 | #21 |
| Action Recognition | Something-Something V2 | MAR (75% mask, ViT-B, 16x4) | GFLOPs | 41x6 | #6 |
| Action Recognition | Something-Something V2 | MAR (50% mask, ViT-B, 16x4) | Top-1 Accuracy | 71.0 | #32 |
| Action Recognition | Something-Something V2 | MAR (50% mask, ViT-B, 16x4) | Top-5 Accuracy | 92.8 | #20 |
| Action Recognition | Something-Something V2 | MAR (50% mask, ViT-B, 16x4) | Parameters (M) | 94 | #21 |
| Action Recognition | Something-Something V2 | MAR (50% mask, ViT-B, 16x4) | GFLOPs | 86x6 | #6 |
| Action Recognition | Something-Something V2 | MAR (75% mask, ViT-L, 16x4) | Top-1 Accuracy | 73.8 | #16 |
| Action Recognition | Something-Something V2 | MAR (75% mask, ViT-L, 16x4) | Top-5 Accuracy | 94.4 | #10 |
| Action Recognition | Something-Something V2 | MAR (75% mask, ViT-L, 16x4) | Parameters (M) | 311 | #14 |
| Action Recognition | Something-Something V2 | MAR (75% mask, ViT-L, 16x4) | GFLOPs | 131x6 | #6 |
