Video Understanding

300 papers with code • 0 benchmarks • 42 datasets

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective
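
As a toy illustration of what such a spatio-temporal detection looks like as data (hypothetical names, plain Python), an action can be represented as a "tube" of per-frame boxes with a label, a score, and a frame span:

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical container (illustrative names only): a detected action,
# localised in time by a frame span and in space by per-frame bounding
# boxes, often called an "action tube".

@dataclass
class ActionTube:
    label: str            # e.g. "crossing the road"
    score: float          # detector confidence
    start_frame: int
    end_frame: int
    boxes: Dict[int, Tuple[float, float, float, float]]  # frame -> (x1, y1, x2, y2)

det = ActionTube(
    label="crossing the road",
    score=0.91,
    start_frame=120,
    end_frame=121,
    boxes={120: (34.0, 50.0, 110.0, 220.0), 121: (36.0, 50.0, 112.0, 221.0)},
)
print(det.label, det.end_frame - det.start_frame + 1, "frames")
```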

Most implemented papers

A Multigrid Method for Efficiently Training Video Models

facebookresearch/SlowFast CVPR 2020

We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU).
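
For intuition, here is a minimal sketch of a multigrid-style schedule in plain Python, assuming hypothetical base settings and a four-phase long cycle; the official schedules live in facebookresearch/SlowFast. Early phases train on coarse clips (fewer frames, lower resolution) with proportionally larger batches, so the cost per iteration stays roughly constant:

```python
BASE_BATCH, BASE_FRAMES, BASE_SIZE = 8, 32, 224

# Four long-cycle phases scaling (temporal, spatial) resolution;
# the batch size grows to keep per-iteration cost roughly constant.
LONG_CYCLE = [(0.25, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]

def grid_for_epoch(epoch, total_epochs=100):
    """Return (batch, frames, crop_size) for the phase containing `epoch`."""
    phase = min(len(LONG_CYCLE) - 1, epoch * len(LONG_CYCLE) // total_epochs)
    t_scale, s_scale = LONG_CYCLE[phase]
    frames = max(1, int(BASE_FRAMES * t_scale))
    size = int(BASE_SIZE * s_scale)
    # Keep batch * frames * size^2 approximately constant.
    batch = int(BASE_BATCH * (BASE_FRAMES / frames) * (BASE_SIZE / size) ** 2)
    return batch, frames, size

for e in (0, 30, 60, 90):
    print(e, grid_for_epoch(e))  # coarse, cheap clips early; full grid late
```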

Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection

tensorflow/models CVPR 2020

In this paper we propose a method that leverages temporal context from the unlabeled frames of a novel camera to improve performance at that camera.
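
A hedged PyTorch sketch of the core idea, with illustrative names rather than the tensorflow/models API: box features from the current frame attend over a long-term memory bank built from the same camera's unlabeled frames.

```python
import torch
import torch.nn.functional as F

def attend_to_memory(query_feats, memory_bank, dim=256):
    """Illustrative long-term attention in the spirit of Context R-CNN.

    query_feats: (num_boxes, dim) features for detections in this frame
    memory_bank: (bank_size, dim) features pooled from past frames of
                 the same camera (potentially spanning weeks of footage)
    """
    attn = query_feats @ memory_bank.t() / dim ** 0.5   # (num_boxes, bank_size)
    weights = F.softmax(attn, dim=-1)
    context = weights @ memory_bank                     # (num_boxes, dim)
    return query_feats + context  # context-augmented box features

boxes = torch.randn(5, 256)
bank = torch.randn(1000, 256)
print(attend_to_memory(boxes, bank).shape)  # torch.Size([5, 256])
```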

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization

Siyu-C/ACAR-Net CVPR 2021

We propose to explicitly model the Actor-Context-Actor Relation, which is the relation between two actors based on their interactions with the context.
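
The sketch below is an illustrative PyTorch rendering of that idea, not the Siyu-C/ACAR-Net code: first-order actor-context relation maps are computed per actor, then a second operator relates actors to each other through the context they both interact with.

```python
import torch
import torch.nn as nn

class ActorContextActor(nn.Module):
    """Illustrative actor-context-actor relation module (a crude
    stand-in for ACAR-Net, with hypothetical layer choices)."""

    def __init__(self, dim=128):
        super().__init__()
        self.first_order = nn.Conv2d(2 * dim, dim, kernel_size=1)
        self.second_order = nn.Conv2d(2 * dim, dim, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, actors, context):
        # actors: (N, dim) pooled actor features; context: (dim, H, W)
        N, D = actors.shape
        _, H, W = context.shape
        ctx = context.unsqueeze(0).expand(N, D, H, W)
        act = actors.view(N, D, 1, 1).expand(N, D, H, W)
        # First-order actor-context relation maps, one per actor.
        ac = torch.relu(self.first_order(torch.cat([act, ctx], dim=1)))
        # Higher-order relation: pair each actor with the mean of the
        # other actors' relation maps, i.e. relate actors via context.
        others = (ac.sum(0, keepdim=True) - ac) / max(N - 1, 1)
        aca = torch.relu(self.second_order(torch.cat([ac, others], dim=1)))
        return self.pool(aca).flatten(1)  # (N, dim) relation features

m = ActorContextActor()
print(m(torch.randn(3, 128), torch.randn(128, 16, 16)).shape)  # (3, 128)
```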

SoccerNet-v2: A Dataset and Benchmarks for Holistic Understanding of Broadcast Soccer Videos

SilvioGiancola/SoccerNetv2-DevKit 26 Nov 2020

In this work, we propose SoccerNet-v2, a novel large-scale corpus of manual annotations for the SoccerNet video dataset, along with open challenges to encourage more research in soccer understanding and broadcast production.

Token Shift Transformer for Video Classification

VideoNetworks/TokShift-Transformer 5 Aug 2021

Notably, our TokShift transformer is a pioneering, purely convolution-free video transformer that remains computationally efficient for video understanding.
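
As a rough sketch of the underlying zero-parameter operation (illustrative only; see VideoNetworks/TokShift-Transformer for the reference implementation), a TSM-style shift moves slices of the [class] token's channels across adjacent frames:

```python
import torch

def token_shift(x, fold_div=4):
    """Sketch of a zero-parameter Token Shift. Only the [class] token's
    channels are shifted across adjacent frames; patch tokens are left
    untouched. x: (B, T, tokens, D), class token at index 0."""
    cls_tok = x[:, :, :1, :]                    # (B, T, 1, D)
    fold = x.shape[-1] // fold_div
    out = torch.zeros_like(cls_tok)
    out[:, 1:, :, :fold] = cls_tok[:, :-1, :, :fold]                  # shift forward in time
    out[:, :-1, :, fold:2 * fold] = cls_tok[:, 1:, :, fold:2 * fold]  # shift backward
    out[:, :, :, 2 * fold:] = cls_tok[:, :, :, 2 * fold:]             # remaining channels stay
    return torch.cat([out, x[:, :, 1:, :]], dim=2)

x = torch.randn(2, 8, 197, 768)   # ViT-style: 1 class + 196 patch tokens
print(token_shift(x).shape)       # torch.Size([2, 8, 197, 768])
```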

Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection

mcg-nju/ddm CVPR 2022

Generic event boundary detection is an important yet challenging task in video understanding, which aims at detecting the moments where humans naturally perceive event boundaries.
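
A minimal sketch of what a dense difference map can look like, assuming simple per-frame features and an L2 distance (the actual mcg-nju/ddm design is richer, with multi-level maps and progressive attention):

```python
import torch

def dense_difference_map(feats):
    """Pairwise frame-feature distances for boundary detection
    (illustrative). Event boundaries show up as block structure:
    frames within one event are mutually similar, frames across a
    boundary are not. feats: (T, D) -> (T, T) difference map."""
    diff = feats.unsqueeze(0) - feats.unsqueeze(1)  # (T, T, D)
    return diff.norm(dim=-1)

feats = torch.randn(16, 512)
ddm = dense_difference_map(feats)
print(ddm.shape)  # torch.Size([16, 16]); a detector head reads this map
```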

UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

OpenGVLab/UniFormerV2 17 Nov 2022

UniFormer has successfully alleviated this issue (large local redundancy alongside complex global dependency) by unifying convolution and self-attention as a relation aggregator in the transformer format.
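
An illustrative PyTorch sketch of that unified "relation aggregator" slot, assuming a depthwise 3D convolution for the local operator and standard multi-head attention for the global one (not the OpenGVLab/UniFormerV2 code):

```python
import torch
import torch.nn as nn

class UnifiedRelationAggregator(nn.Module):
    """Shallow blocks aggregate token relations with a local,
    convolution-style operator; deep blocks use global self-attention.
    Both fill the same slot in a transformer block (illustrative)."""

    def __init__(self, dim=256, local=True):
        super().__init__()
        self.local = local
        if local:
            # Local relation: depthwise 3D conv over (T, H, W) tokens.
            self.agg = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)
        else:
            self.agg = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, x):
        # x: (B, dim, T, H, W)
        if self.local:
            return x + self.agg(x)
        B, D, T, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (B, T*H*W, dim)
        out, _ = self.agg(tokens, tokens, tokens)  # global self-attention
        return x + out.transpose(1, 2).reshape(B, D, T, H, W)

x = torch.randn(2, 256, 4, 14, 14)
print(UnifiedRelationAggregator(local=True)(x).shape)
print(UnifiedRelationAggregator(local=False)(x).shape)
```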

Panoptic Video Scene Graph Generation

jingkang50/openpvsg CVPR 2023

PVSG relates to the existing video scene graph generation (VidSGG) problem, which focuses on temporal interactions between humans and objects grounded with bounding boxes in videos.
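
A hypothetical data layout for such a graph, with illustrative field names rather than the jingkang50/openpvsg schema; the key difference from box-based VidSGG is that entities are grounded by per-frame panoptic masks and relations carry time spans:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Entity:
    entity_id: int
    category: str                               # e.g. "person", "dog"
    masks: dict = field(default_factory=dict)   # frame_idx -> segmentation mask

@dataclass
class Relation:
    subject_id: int
    object_id: int
    predicate: str      # e.g. "feeding"
    frame_span: tuple   # (start_frame, end_frame), inclusive

@dataclass
class VideoSceneGraph:
    video_id: str
    entities: List[Entity]
    relations: List[Relation]

g = VideoSceneGraph(
    video_id="0001",
    entities=[Entity(0, "person"), Entity(1, "dog")],
    relations=[Relation(0, 1, "feeding", (12, 48))],
)
print(g.relations[0].predicate)  # "feeding"
```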

VideoMamba: State Space Model for Efficient Video Understanding

opengvlab/videomamba 11 Mar 2024

Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts Mamba to the video domain.
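
For intuition only, here is a toy linear state-space scan over flattened video tokens. It is a drastic simplification of Mamba's selective scan (see opengvlab/videomamba for the real model), but it shows the linear-time, bidirectional token mixing that replaces quadratic self-attention:

```python
import torch

def scan(x, decay=0.9):
    """One direction of a toy linear recurrence over tokens.
    x: (B, L, D), where L = T * H * W after flattening a clip."""
    B, L, D = x.shape
    state = torch.zeros(B, D)
    out = []
    for t in range(L):
        state = decay * state + (1 - decay) * x[:, t]  # O(L) global mixing
        out.append(state)
    return torch.stack(out, dim=1)

def bidirectional_scan(x):
    # Scan the flattened spatiotemporal tokens in both directions so
    # every token can see the whole clip, still in linear time.
    return scan(x) + torch.flip(scan(torch.flip(x, [1])), [1])

T, H, W, D = 8, 14, 14, 192
tokens = torch.randn(2, T * H * W, D)      # a clip flattened to 1568 tokens
print(bidirectional_scan(tokens).shape)    # torch.Size([2, 1568, 192])
```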

Constrained-size Tensorflow Models for YouTube-8M Video Understanding Challenge

boliu61/youtube-8m 21 Aug 2018

This paper presents our 7th-place solution to the second YouTube-8M video understanding competition, which challenged participants to build a constrained-size model that classifies millions of YouTube videos into thousands of classes.
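
A hedged sketch of what a constrained-size video-level model can look like, assuming the 2018 setup of 3862 classes and 1152-dimensional precomputed frame features, with hypothetical layer sizes (the competition's constraint was on total model size):

```python
import torch
import torch.nn as nn

NUM_CLASSES = 3862   # YouTube-8M (2018) vocabulary size
FEATURE_DIM = 1152   # 1024 visual + 128 audio features per frame

# YouTube-8M ships precomputed frame features, so a compact model can
# simply pool them over time and map to class logits.
model = nn.Sequential(
    nn.Linear(FEATURE_DIM, 1024),  # applied to mean-pooled frame features
    nn.ReLU(),
    nn.Linear(1024, NUM_CLASSES),
)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params * 4 / 2**20:.1f} MiB at fp32")  # well under the size cap

frames = torch.randn(16, 300, FEATURE_DIM)   # batch of 300-frame videos
logits = model(frames.mean(dim=1))           # (16, 3862) class logits
```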