About

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

Benchmarks

You can find evaluation results in the subtasks. You can also submit evaluation metrics for this task.

Greatest papers with code

Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection

CVPR 2020 tensorflow/models

In this paper we propose a method that leverages temporal context from the unlabeled frames of a novel camera to improve performance at that camera.

VIDEO OBJECT DETECTION VIDEO UNDERSTANDING

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

CVPR 2018 tensorflow/models

The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.

ACTION RECOGNITION VIDEO UNDERSTANDING
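
To give a rough sense of the scale described in the AVA entry above, the short sketch below turns the quoted figures (430 clips, 15 minutes each, about 1.58M labels) into total footage and an average label density. The variable names are illustrative, and the label count is the rounded value from the abstract, so the density is only an approximation.

```python
# Figures as quoted in the AVA abstract; the label count is rounded ("1.58M").
NUM_CLIPS = 430
CLIP_MINUTES = 15
NUM_LABELS = 1_580_000

# Total annotated footage, in minutes and hours.
total_minutes = NUM_CLIPS * CLIP_MINUTES
total_hours = total_minutes / 60

# Average number of action labels per minute of video (approximate).
labels_per_minute = NUM_LABELS / total_minutes

print(f"{total_hours:.1f} hours of annotated video")
print(f"about {labels_per_minute:.0f} action labels per minute on average")
```

This is an average over all clips; because several people can carry several labels each, the density in any particular segment may differ considerably.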

A Multigrid Method for Efficiently Training Video Models

CVPR 2020 facebookresearch/SlowFast

We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU).

ACTION DETECTION ACTION RECOGNITION VIDEO UNDERSTANDING

TSM: Temporal Shift Module for Efficient Video Understanding

ICCV 2019 MIT-HAN-LAB/temporal-shift-module

The explosive growth in video streaming gives rise to challenges on performing video understanding at high accuracy and low computation cost.

Ranked #4 on Action Recognition on Something-Something V2 (using extra training data)

ACTION CLASSIFICATION ACTION RECOGNITION VIDEO OBJECT DETECTION VIDEO RECOGNITION VIDEO UNDERSTANDING

Detect-and-Track: Efficient Pose Estimation in Videos

CVPR 2018 facebookresearch/DetectAndTrack

This paper addresses the problem of estimating and tracking human body keypoints in complex, multi-person video.

Ranked #5 on Pose Tracking on PoseTrack2017 (using extra training data)

HUMAN DETECTION MULTI-OBJECT TRACKING POSE ESTIMATION POSE TRACKING VIDEO UNDERSTANDING

Temporal Interlacing Network

17 Jan 2020 open-mmlab/mmaction2

In this way, a heavy temporal model is replaced by a simple interlacing operator.

OPTICAL FLOW ESTIMATION VIDEO UNDERSTANDING

TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition

30 Mar 2017 jeffreyhuang1/two-stream-action-recognition

We demonstrate that using both RNNs (using LSTMs) and Temporal-ConvNets on spatiotemporal feature matrices are able to exploit spatiotemporal dynamics to improve the overall performance.

ACTION CLASSIFICATION ACTION RECOGNITION VIDEO UNDERSTANDING

Learnable pooling with Context Gating for video classification

21 Jun 2017 antoine77340/Youtube-8M-WILLOW

In particular, we evaluate our method on the large-scale multi-modal Youtube-8M v2 dataset and outperform all other methods in the Youtube 8M Large-Scale Video Understanding challenge.

VIDEO CLASSIFICATION VIDEO UNDERSTANDING