Video Understanding

293 papers with code • 0 benchmarks • 42 datasets

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective
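
To make "localise in time" concrete, here is a minimal sketch of temporal intersection-over-union, a common way to score how well a predicted action segment overlaps an annotated one. The function name `temporal_iou` and the `(start, end)` tuple layout in seconds are assumptions made for this illustration; they are not taken from the cited paper or from any leaderboard on this page.

```python
def temporal_iou(pred, truth):
    """Overlap ratio of two time segments given as (start, end) in seconds.

    Hypothetical helper for illustration; not an API from any listed library.
    """
    pred_start, pred_end = pred
    true_start, true_end = truth
    # Length of the overlapping part, clamped at zero for disjoint segments.
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    # Total time covered by either segment.
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0


# A prediction of 2s-6s against an annotation of 4s-8s overlaps for 2s out of 6s.
score = temporal_iou((2.0, 6.0), (4.0, 8.0))
```

A detection is typically counted as correct when this score exceeds a chosen threshold; the threshold itself depends on the benchmark.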

Benchmarks

These leaderboards are used to track progress in Video Understanding.

You can find evaluation results in the subtasks. You can also submit evaluation metrics for this task.

Libraries

Use these libraries to find Video Understanding models and implementations.

open-mmlab/mmaction2 • 7 papers • 3,865

towhee-io/towhee • 4 papers • 2,968

google-research/scenic • 2 papers • 2,983

MIT-HAN-LAB/temporal-shift-module • 2 papers • 2,016

Most implemented papers

TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition

chihyaoma/Activity-Recognition-with-CNN-and-RNN • 30 Mar 2017

We demonstrate that using both RNNs (using LSTMs) and Temporal-ConvNets on spatiotemporal feature matrices are able to exploit spatiotemporal dynamics to improve the overall performance.

VirtualHome: Simulating Household Activities via Programs

xavierpuigf/virtualhome • CVPR 2018

We then implement the most common atomic (inter)actions in the Unity3D game engine, and use our programs to "drive" an artificial agent to execute tasks in a simulated household environment.

Long-Term Feature Banks for Detailed Video Understanding

facebookresearch/video-long-term-feature-banks • CVPR 2019

To understand the world, we humans constantly need to relate the present to the past, and put events in context.

Temporal Interlacing Network

deepcs233/TIN • 17 Jan 2020

In this way, a heavy temporal model is replaced by a simple interlacing operator.

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

google-research/scenic • 21 Jun 2021

In this paper, we introduce a novel visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks.

TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Device

MIT-HAN-LAB/temporal-shift-module • 27 Sep 2021

Secondly, TSM has high efficiency; it achieves a high frame rate of 74fps and 29fps for online video recognition on Jetson Nano and Galaxy Note8.
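
The excerpt above does not describe the shift operation itself. As a rough illustration, the sketch below shows the commonly described bidirectional temporal shift: a fraction of channels takes its values from the next frame, an equal fraction from the previous frame, and the rest stay in place, with zero padding at the clip boundaries. The function name `temporal_shift`, the nested-list layout (frames, then channels), and the default `fold_div=8` are assumptions for this sketch, not the repository's API.

```python
def temporal_shift(frames, fold_div=8):
    """Shift a fraction of channels along the time axis.

    frames: list of T frames, each a list of C per-channel values.
    Hypothetical helper for illustration; real implementations operate
    on batched tensors rather than nested lists.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div  # number of channels in each shifted group
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1   # first group: read from the next frame
            elif c < 2 * fold:
                src = t - 1   # second group: read from the previous frame
            else:
                src = t       # remaining channels are left unshifted
            if 0 <= src < num_frames:
                out[t][c] = frames[src][c]  # out-of-range sources stay zero
    return out


# Three frames with four channels each; fold_div=4 shifts one channel each way.
clip = [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]]
shifted = temporal_shift(clip, fold_div=4)
```

Because the shift only moves existing values, it adds no parameters or multiplications, which is consistent with the efficiency claim quoted above. For online recognition only past frames are available, so a one-directional variant would drop the `t + 1` branch.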

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

MCG-NJU/VideoMAE • 23 Mar 2022

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets.

Flamingo: a Visual Language Model for Few-Shot Learning

mlfoundations/open_flamingo • DeepMind 2022

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research.

DeepSportradar-v1: Computer Vision Dataset for Sports Understanding with High Quality Annotations

deepsportradar/player-reidentification-challenge • 17 Aug 2022

With the recent development of Deep Learning applied to Computer Vision, sport video understanding has gained a lot of attention, providing much richer information for both sport consumers and leagues.

A Multigrid Method for Efficiently Training Video Models

facebookresearch/SlowFast • CVPR 2020

We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU).
