TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Action Classification	Kinetics-400	AdaMAE	Acc@1	81.7	# 74
Action Classification	Kinetics-400	AdaMAE	Acc@5	95.2	# 50
Action Classification	Something-Something V2	AdaMAE	Acc@1	70.04	# 1
Action Classification	Something-Something V2	AdaMAE	Acc@5	92.7	# 1

AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

CVPR 2023 · Wele Gedara Chaminda Bandara, Naman Patel, Ali Gholami, Mehdi Nikkhah, Motilal Agrawal, Vishal M. Patel

Masked Autoencoders (MAEs) learn generalizable representations for image, text, audio, video, etc., by reconstructing masked input data from tokens of the visible data. Current MAE approaches for videos rely on random patch, tube, or frame-based masking strategies to select these tokens. This paper proposes AdaMAE, an adaptive masking strategy for MAEs that is end-to-end trainable. Our adaptive masking strategy samples visible tokens based on the semantic context using an auxiliary sampling network. This network estimates a categorical distribution over spacetime-patch tokens. The tokens that increase the expected reconstruction error are rewarded and selected as visible tokens, motivated by the policy gradient algorithm in reinforcement learning. We show that AdaMAE samples more tokens from the high spatiotemporal information regions, thereby allowing us to mask 95% of tokens, resulting in lower memory requirements and faster pre-training. We conduct ablation studies on the Something-Something v2 (SSv2) dataset to demonstrate the efficacy of our adaptive sampling approach and report state-of-the-art results of 70.0% and 81.7% in top-1 accuracy on SSv2 and Kinetics-400 action classification datasets with a ViT-Base backbone and 800 pre-training epochs.
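The sampling step described in the abstract can be illustrated with a small, standard-library-only sketch. It is not the authors' implementation: the function names, the token count of 1568, the random stand-in logits and reconstruction errors, and the Gumbel top-k trick for drawing tokens without replacement are all assumptions made for illustration. The loss shown is one REINFORCE-style surrogate that is consistent with "tokens that increase the expected reconstruction error are rewarded"; the abstract does not give the exact form used in the paper.

```python
import math
import random


def log_softmax(logits):
    """Return log-probabilities of a categorical distribution over tokens."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def sample_visible_tokens(logits, mask_ratio=0.95, rng=None):
    """Sample visible token indices without replacement (hypothetical helper).

    `logits` stands in for the output of an auxiliary sampling network.
    The Gumbel top-k trick is an assumed choice for drawing distinct tokens.
    """
    rng = rng or random.Random()
    log_probs = log_softmax(logits)
    num_visible = max(1, round(len(logits) * (1.0 - mask_ratio)))
    keys = []
    for index, log_p in enumerate(log_probs):
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        keys.append((log_p + gumbel, index))
    keys.sort(reverse=True)
    visible = sorted(index for _, index in keys[:num_visible])
    masked = sorted(index for _, index in keys[num_visible:])
    return visible, masked, log_probs


def sampling_loss(log_probs, masked, recon_error):
    """Assumed policy-gradient-style surrogate: minimizing it raises the
    probability of tokens whose reconstruction error is high."""
    return -sum(log_probs[i] * recon_error[i] for i in masked)


# Illustrative usage with made-up inputs (not data from the paper).
rng = random.Random(0)
num_tokens = 1568  # assumed number of spacetime-patch tokens
logits = [rng.gauss(0.0, 1.0) for _ in range(num_tokens)]
visible, masked, log_probs = sample_visible_tokens(logits, mask_ratio=0.95, rng=rng)
recon_error = [rng.random() for _ in range(num_tokens)]  # stand-in per-token errors
loss = sampling_loss(log_probs, masked, recon_error)
print(len(visible), len(masked))
```

In a real training loop the reconstruction errors would come from the MAE decoder and would be treated as constants when differentiating the sampling loss, so that only the sampling network receives this gradient.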

Code

wgcban/adamae (official)

Nithin-GK/UniteandConquer

Tasks

Action Classification

Representation Learning

Datasets

Kinetics

Kinetics 400

Something-Something V2

Results from the Paper

Ranked #1 on Action Classification on Something-Something V2

Methods

Adaptive Masking • L1 Regularization • MAE • Masked Convolution
