Video Understanding

234 papers with code • 0 benchmarks • 38 datasets

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective
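
As a rough illustration of what localising an action "in time" means in practice, predicted and annotated time intervals are often compared by temporal intersection-over-union (IoU). The sketch below is an assumption for illustration only: the function name `temporal_iou` and the `(start, end)` interval format in seconds are hypothetical and are not taken from the source paper or from any listed library.

```python
def temporal_iou(a, b):
    """Return the intersection-over-union of two (start, end) intervals in seconds."""
    a_start, a_end = a
    b_start, b_end = b
    # Overlap is zero when the intervals do not intersect.
    intersection = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical example: a predicted interval against an annotated one.
predicted = (2.0, 6.0)
ground_truth = (4.0, 8.0)
print(temporal_iou(predicted, ground_truth))
```

The same idea extends to spatial localisation by computing IoU over bounding boxes instead of time intervals.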



Most implemented papers

Video Swin Transformer

SwinTransformer/Video-Swin-Transformer CVPR 2022

The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks.

TSM: Temporal Shift Module for Efficient Video Understanding

MIT-HAN-LAB/temporal-shift-module ICCV 2019

The explosive growth in video streaming gives rise to challenges in performing video understanding at high accuracy and low computation cost.

Is Space-Time Attention All You Need for Video Understanding?

facebookresearch/TimeSformer 9 Feb 2021

We present a convolution-free approach to video classification built exclusively on self-attention over space and time.

SoccerNet 2022 Challenges Results

soccernet/sn-calibration 5 Oct 2022

The SoccerNet 2022 challenges were the second annual video understanding challenges organized by the SoccerNet team.

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

tensorflow/models CVPR 2018

The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.
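
For a sense of scale, the figures quoted above can be turned into a total duration and an average label density. This is only back-of-the-envelope arithmetic on the numbers in the excerpt; the variable names are illustrative, and the per-minute average is a derived figure, not one reported by the authors.

```python
# Figures quoted in the AVA excerpt above.
num_clips = 430
minutes_per_clip = 15
total_labels = 1_580_000  # "1.58M action labels"

total_minutes = num_clips * minutes_per_clip      # 6450 minutes
total_hours = total_minutes / 60                  # 107.5 hours
labels_per_minute = total_labels / total_minutes  # derived average, about 245

print(f"{total_hours} hours of annotated video")
print(f"{labels_per_minute:.0f} labels per minute on average")
```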

Video Instance Segmentation

open-mmlab/mmdetection ICCV 2019

The goal of this new task is simultaneous detection, segmentation and tracking of instances in videos.

Learnable pooling with Context Gating for video classification

antoine77340/Youtube-8M-WILLOW 21 Jun 2017

In particular, we evaluate our method on the large-scale multi-modal Youtube-8M v2 dataset and outperform all other methods in the Youtube 8M Large-Scale Video Understanding challenge.

Representation Flow for Action Recognition

piergiaj/representation-flow-cvpr19 CVPR 2019

Our representation flow layer is a fully-differentiable layer designed to capture the 'flow' of any representation channel within a convolutional neural network for action recognition.

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

ArrowLuo/CLIP4Clip 18 Apr 2021

In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.

TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition

chihyaoma/Activity-Recognition-with-CNN-and-RNN 30 Mar 2017

We demonstrate that both RNNs (using LSTMs) and Temporal-ConvNets operating on spatiotemporal feature matrices are able to exploit spatiotemporal dynamics to improve the overall performance.