Video Understanding

148 papers with code • 0 benchmarks • 31 datasets

A crucial task in Video Understanding is to recognise and localise (in space and time) the different actions or events that appear in a video.

Source: Action Detection from a Robot-Car Perspective



Most implemented papers

Is Space-Time Attention All You Need for Video Understanding?

facebookresearch/TimeSformer 9 Feb 2021

We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
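The "divided" space-time attention that TimeSformer proposes lets each patch attend first across time (same spatial location in other frames) and then across space (other patches in the same frame). A minimal NumPy sketch of that factorisation, with Q = K = V = x for brevity (the real model uses learned projections, multiple heads, and residual connections, none of which are shown here):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Plain scaled dot-product self-attention over the first axis.

    x: (N, d) array; Q = K = V = x here for brevity (a real model
    would use learned query/key/value projections).
    """
    scores = softmax(x @ x.T / np.sqrt(x.shape[1]))
    return scores @ x

def divided_space_time_attention(x):
    """Sketch of 'divided' attention: time first, then space.

    x: (T, S, d) patch embeddings -- T frames, S patches per frame,
    d embedding dims. Output has the same shape.
    """
    T, S, d = x.shape
    # Temporal step: for each spatial position, attend across frames.
    xt = np.stack([attend(x[:, s]) for s in range(S)], axis=1)
    # Spatial step: for each frame, attend across its patches.
    return np.stack([attend(xt[t]) for t in range(T)], axis=0)
```

Factorising the (T x S)-token joint attention into a T-token pass plus an S-token pass is what keeps the cost manageable for long clips.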

Video Swin Transformer

SwinTransformer/Video-Swin-Transformer 24 Jun 2021

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks.

TSM: Temporal Shift Module for Efficient Video Understanding

MIT-HAN-LAB/temporal-shift-module ICCV 2019

The explosive growth in video streaming gives rise to challenges on performing video understanding at high accuracy and low computation cost.
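TSM's core idea is a zero-FLOP temporal mixing operation: shift a fraction of the feature channels one step forward in time and another fraction one step backward, leaving the rest untouched, so an otherwise 2D network can exchange information between neighbouring frames. A NumPy sketch of the (offline, bidirectional) shift; `fold_div=8` follows the paper's default of shifting 1/8 of channels in each direction:

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Shift a fraction of channels along the time axis.

    x: (T, C, H, W) feature maps for one clip of T frames.
    The first C//fold_div channels are shifted left (frame t sees
    frame t+1), the next C//fold_div are shifted right (frame t sees
    frame t-1), and the remainder are passed through unchanged.
    Vacated positions are zero-padded.
    """
    T, C, H, W = x.shape
    fold = C // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # shift left
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # shift right
    out[:, 2 * fold:] = x[:, 2 * fold:]              # pass through
    return out
```

In the paper the shift is inserted inside residual blocks of a 2D CNN, so temporal modelling comes for free on top of per-frame computation.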

Representation Flow for Action Recognition

piergiaj/representation-flow-cvpr19 CVPR 2019

Our representation flow layer is a fully differentiable layer designed to capture the 'flow' of any representation channel within a convolutional neural network for action recognition.

Video Instance Segmentation

Epiphqny/VisTR ICCV 2019

The goal of this new task is simultaneous detection, segmentation and tracking of instances in videos.

TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition

chihyaoma/Activity-Recognition-with-CNN-and-RNN 30 Mar 2017

We demonstrate that both RNNs (with LSTMs) and Temporal-ConvNets operating on spatiotemporal feature matrices are able to exploit spatiotemporal dynamics to improve overall performance.

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

tensorflow/models CVPR 2018

The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.

Learnable pooling with Context Gating for video classification

antoine77340/Youtube-8M-WILLOW 21 Jun 2017

In particular, we evaluate our method on the large-scale multi-modal YouTube-8M v2 dataset and outperform all other methods in the YouTube-8M Large-Scale Video Understanding challenge.
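The Context Gating unit this paper introduces re-weights a feature vector by a sigmoid gate computed from the features themselves: y = sigma(Wx + b) * x (elementwise). A minimal NumPy sketch; `W` and `b` here are illustrative parameters, not trained weights:

```python
import numpy as np

def context_gating(x, W, b):
    """Context Gating: modulate features with a learned sigmoid gate.

    x: (d,) feature vector; W: (d, d) weight matrix; b: (d,) bias.
    Returns sigmoid(W @ x + b) * x elementwise, so each dimension is
    down- or up-weighted depending on the whole input vector.
    """
    gate = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    return gate * x
```

Because the gate depends on the full input, the unit can suppress features that are irrelevant in the current context, which is cheap to add on top of any pooled video representation.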

Long-Term Feature Banks for Detailed Video Understanding

facebookresearch/video-long-term-feature-banks CVPR 2019

To understand the world, we humans constantly need to relate the present to the past, and put events in context.

Temporal Interlacing Network

deepcs233/TIN 17 Jan 2020

In this way, a heavy temporal model is replaced by a simple interlacing operator.