TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Self-Supervised Action Recognition	HMDB51	M3Video	Top-1 Accuracy	78.0	# 2
Self-Supervised Action Recognition	HMDB51	M3Video	Pre-Training Dataset	Kinetics400	# 1
Self-Supervised Action Recognition	HMDB51	M3Video	Frozen	false	# 1
Self-Supervised Action Recognition	UCF101	M3Video	3-fold Accuracy	96.5	# 4
Self-Supervised Action Recognition	UCF101	M3Video	Pre-Training Dataset	Kinetics400	# 1
Self-Supervised Action Recognition	UCF101	M3Video	Frozen	false	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/m-3-video-masked-motion-modeling-for-self/self-supervised-action-recognition-on-hmdb51)](https://paperswithcode.com/sota/self-supervised-action-recognition-on-hmdb51?p=m-3-video-masked-motion-modeling-for-self)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/m-3-video-masked-motion-modeling-for-self/self-supervised-action-recognition-on-ucf101)](https://paperswithcode.com/sota/self-supervised-action-recognition-on-ucf101?p=m-3-video-masked-motion-modeling-for-self)`

Masked Motion Encoding for Self-Supervised Video Representation Learning

CVPR 2023 · Xinyu Sun, Peihao Chen, LiangWei Chen, Changhao Li, Thomas H. Li, Mingkui Tan, Chuang Gan

Learning discriminative video representations from unlabeled videos is challenging but crucial for video analysis. The latest attempts learn a representation model by predicting the appearance contents of masked regions. However, simply masking and recovering appearance contents may not be sufficient to model temporal clues, because appearance contents can be easily reconstructed from a single frame. To overcome this limitation, we present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues. In MME, we focus on two critical challenges to improve representation performance: 1) how to represent possible long-term motion across multiple frames; and 2) how to obtain fine-grained temporal clues from sparsely sampled videos. Motivated by the fact that humans are able to recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions. In addition, given the sparse video input, we require the model to reconstruct dense motion trajectories in both the spatial and temporal dimensions. Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details. Code is available at https://github.com/XinyuSun/MME.
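To make the idea of a motion trajectory more concrete, the sketch below follows a single point through a sequence of dense displacement fields and records its per-step position changes. It is only an illustration under stated assumptions, not the authors' implementation: the function names `track_trajectory` and `uniform_flow`, the nested-list flow layout, and the nearest-neighbour lookup are all hypothetical choices, and the sketch covers position changes only, not the shape changes the abstract also mentions.

```python
def track_trajectory(flows, start_xy):
    """Follow one point through a sequence of dense displacement fields.

    flows: list of T fields; flows[t][y][x] is an assumed (dx, dy)
           displacement of pixel (x, y) from frame t to frame t + 1.
    start_xy: (x, y) position of the point in the first frame.
    Returns a list of T (dx, dy) position changes along the track.
    """
    x, y = float(start_xy[0]), float(start_xy[1])
    height, width = len(flows[0]), len(flows[0][0])
    displacements = []
    for flow in flows:
        # Nearest-neighbour lookup, clamped to the image bounds.
        xi = min(max(int(round(x)), 0), width - 1)
        yi = min(max(int(round(y)), 0), height - 1)
        dx, dy = flow[yi][xi]
        displacements.append((dx, dy))
        # Move the tracked point before reading the next field.
        x, y = x + dx, y + dy
    return displacements


def uniform_flow(height, width, dx, dy):
    """Toy field in which every pixel moves by the same (dx, dy)."""
    return [[(dx, dy) for _ in range(width)] for _ in range(height)]


# Toy input: four fields where everything shifts one pixel to the right.
flows = [uniform_flow(8, 8, 1.0, 0.0) for _ in range(4)]
trajectory = track_trajectory(flows, (2, 3))
print(trajectory)
```

In a masked pre-training setup, a sequence like `trajectory` would be the kind of quantity a model is asked to predict for a masked region instead of, or in addition to, raw pixel values.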

Code

xinyusun/mme official

XinyuSun/M3Video

Tasks

Optical Flow Estimation

Representation Learning

Self-Supervised Action Recognition

Self-Supervised Learning

Datasets

UCF101

HMDB51

Something-Something V2

Results from the Paper

Ranked #2 on Self-Supervised Action Recognition on HMDB51

