Video Understanding

293 papers with code • 0 benchmarks • 42 datasets

A crucial task in Video Understanding is to recognise and localise (in space and time) the different actions or events appearing in a video.

Source: Action Detection from a Robot-Car Perspective

Most implemented papers

TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition

chihyaoma/Activity-Recognition-with-CNN-and-RNN 30 Mar 2017

We demonstrate that both RNNs (using LSTMs) and Temporal-ConvNets operating on spatiotemporal feature matrices are able to exploit spatiotemporal dynamics to improve overall performance.

VirtualHome: Simulating Household Activities via Programs

xavierpuigf/virtualhome CVPR 2018

We then implement the most common atomic (inter)actions in the Unity3D game engine, and use our programs to "drive" an artificial agent to execute tasks in a simulated household environment.

Long-Term Feature Banks for Detailed Video Understanding

facebookresearch/video-long-term-feature-banks CVPR 2019

To understand the world, we humans constantly need to relate the present to the past, and put events in context.

Temporal Interlacing Network

deepcs233/TIN 17 Jan 2020

In this way, a heavy temporal model is replaced by a simple interlacing operator.

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

google-research/scenic 21 Jun 2021

In this paper, we introduce a novel visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks.
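
The core TokenLearner operation is a set of learned spatial attention maps that pool a feature map down to a handful of tokens. A minimal framework-free sketch of that pooling step is below; in the paper the attention weights are learned by small conv layers, whereas here `w` is simply passed in as an assumed, pre-computed weight matrix:

```python
import numpy as np

def token_learner(x, w):
    """TokenLearner-style token pooling (illustrative sketch).

    x: (H*W, C) flattened spatial feature map for one frame.
    w: (C, S) weights producing S attention maps; learned in the
       paper, supplied here so the sketch stays self-contained.
    Returns (S, C): each token is a spatially weighted average
    (convex combination) of the input features.
    """
    logits = x @ w                              # (H*W, S) attention logits
    logits -= logits.max(axis=0, keepdims=True) # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=0, keepdims=True)     # softmax over spatial positions
    return attn.T @ x                           # (S, C) pooled tokens

rng = np.random.default_rng(0)
features = rng.normal(size=(196, 32))           # e.g. a 14x14 feature map
weights = rng.normal(size=(32, 8))              # 8 learned tokens
tokens = token_learner(features, weights)       # shape (8, 32)
```

Because each token is a convex combination over spatial positions, a transformer operating on 8 such tokens replaces attention over all 196 positions.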

TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Devices

MIT-HAN-LAB/temporal-shift-module 27 Sep 2021

TSM also has high efficiency: for online video recognition it achieves frame rates of 74 fps on Jetson Nano and 29 fps on Galaxy Note8.
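
The operation behind TSM is a zero-FLOP channel shift along the time axis: a slice of channels is shifted forward in time, another slice backward, and the rest are left in place, so a 2D CNN can mix temporal information for free. A minimal NumPy sketch of the offline (zero-padded) variant, assuming the paper's default 1/8 shift fraction:

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """Shift a fraction of channels along the time axis (TSM-style sketch).

    x: array of shape (T, C, H, W) -- one video clip.
    A C/shift_div slice of channels is shifted forward in time,
    another slice backward; the rest stay put. Clip boundaries are
    zero-padded, as in the offline variant.
    """
    t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                  # shift forward in time
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]  # shift backward in time
    out[:, 2 * fold:] = x[:, 2 * fold:]             # remaining channels unchanged
    return out
```

In the real model this shift is inserted inside residual blocks of a 2D ResNet, so it adds temporal modelling without extra parameters or multiply-adds.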

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

MCG-NJU/VideoMAE 23 Mar 2022

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets.
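
VideoMAE's data efficiency comes from masked autoencoding with a very high masking ratio and "tube" masking: one spatial mask is sampled and repeated across all frames, so a hidden patch cannot be trivially recovered from its temporal neighbours. A small sketch of that masking step (the 0.9 ratio reflects the high ratios the paper explores; exact values are hyperparameters):

```python
import numpy as np

def tube_mask(num_frames, num_patches, mask_ratio=0.9, seed=0):
    """VideoMAE-style tube masking (sketch).

    Samples one spatial mask over patch positions and tiles it across
    all frames, so each masked patch is hidden in every frame.
    Returns a boolean array (num_frames, num_patches), True = masked.
    """
    rng = np.random.default_rng(seed)
    num_masked = int(num_patches * mask_ratio)
    spatial = np.zeros(num_patches, dtype=bool)
    spatial[rng.choice(num_patches, num_masked, replace=False)] = True
    return np.tile(spatial, (num_frames, 1))
```

The encoder then sees only the small visible fraction of patches, which is what makes pre-training cheap enough to work well on relatively small datasets.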

Flamingo: a Visual Language Model for Few-Shot Learning

mlfoundations/open_flamingo NeurIPS 2022

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.

DeepSportradar-v1: Computer Vision Dataset for Sports Understanding with High Quality Annotations

deepsportradar/player-reidentification-challenge 17 Aug 2022

With the recent development of Deep Learning applied to Computer Vision, sports video understanding has gained a lot of attention, providing much richer information for both sports consumers and leagues.

A Multigrid Method for Efficiently Training Video Models

facebookresearch/SlowFast CVPR 2020

We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU).
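
The multigrid idea is to train on a cycle of coarser spatiotemporal grids, enlarging the mini-batch so that per-iteration compute (roughly batch × T × S²) stays constant, which lets early training see many more clips per second. A sketch of such a schedule; the scale factors and stage lengths here are illustrative assumptions, not the paper's exact long-cycle values:

```python
def multigrid_long_cycle(base_t=16, base_s=224, base_batch=8):
    """Sketch of a multigrid training schedule.

    Cycles from coarse (short, low-resolution) to fine grids,
    scaling the mini-batch so batch * T * S^2 stays constant.
    Scale factors are illustrative.
    Returns a list of (frames, spatial_size, batch_size) stages.
    """
    stages = []
    for t_scale, s_scale in [(0.25, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]:
        t = max(1, int(base_t * t_scale))
        s = int(base_s * s_scale)
        # enlarge the batch to keep per-iteration compute constant
        batch = int(base_batch * (base_t * base_s**2) / (t * s**2))
        stages.append((t, s, batch))
    return stages
```

Each stage trades resolution for throughput at fixed cost, and the final stage matches the baseline grid, so evaluation-time behaviour is unchanged.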