Please note that some benchmarks, e.g. Kinetics-400, may be listed under the Action Classification or Video Classification tasks instead.

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

tensorflow/models 22 Apr 2021

We show that the convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks.

Ranked #1 on Action Classification on Moments in Time (using extra training data)

Action Classification Action Recognition +6

MoViNets: Mobile Video Networks for Efficient Video Recognition

tensorflow/models CVPR 2021

We present Mobile Video Networks (MoViNets), a family of computation and memory efficient video networks that can operate on streaming video for online inference.

Action Classification Action Recognition +2

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

tensorflow/models CVPR 2018

The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.

Action Recognition Video Understanding

Non-local Neural Networks

facebookresearch/detectron CVPR 2018

Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time.

Ranked #7 on Action Classification on Toyota Smarthome dataset (using extra training data)

Action Classification Action Recognition +3

View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose

google-research/google-research 23 Oct 2020

We further show that keypoint occlusion augmentation during training significantly improves retrieval performance on partial 2D input poses.

3D Pose Estimation Action Recognition +1

Unsupervised Learning of Object Structure and Dynamics from Videos

google-research/google-research NeurIPS 2019

Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning.

Action Recognition Continuous Control +2

Large-scale weakly-supervised pre-training for video action recognition

microsoft/computervision-recipes CVPR 2019

Second, frame-based models perform quite well on action recognition; is pre-training for good image features sufficient or is pre-training for spatio-temporal features valuable for optimal transfer learning?

Ranked #1 on Egocentric Activity Recognition on EPIC-KITCHENS-55 (Actions Top-1 (S2) metric)

Action Classification Action Recognition +3

A Closer Look at Spatiotemporal Convolutions for Action Recognition

microsoft/computervision-recipes CVPR 2018

In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition.

Action Classification Action Recognition