Video Understanding

300 papers with code • 0 benchmarks • 42 datasets

A crucial task of Video Understanding is to recognise and localise (in space and time) different actions or events appearing in the video.

Source: Action Detection from a Robot-Car Perspective

$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

yeliudev/R2-Tuning 31 Mar 2024

Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to ground relevant clips in untrimmed videos given natural language queries.

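A minimal sketch of the grounding setup this line of work targets: score each clip of an untrimmed video against a sentence embedding and keep the contiguous high-scoring span around the best clip. The random features and the threshold rule below are illustrative stand-ins, not $R^2$-Tuning's method (which adapts a frozen CLIP backbone).

```python
# Toy video temporal grounding: query-clip similarity plus span expansion.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_clips, dim = 32, 512                                        # untrimmed video split into 32 clips
clip_feats = F.normalize(torch.randn(num_clips, dim), dim=-1)   # stand-in per-clip features
query_feat = F.normalize(torch.randn(dim), dim=-1)              # stand-in sentence embedding

# Relevance of every clip to the query (cosine similarity).
scores = clip_feats @ query_feat                                # (num_clips,)

# Ground the query: expand a contiguous span of above-threshold clips
# around the best-scoring clip.
thresh = scores.mean() + scores.std()
best = scores.argmax().item()
start = end = best
while start > 0 and scores[start - 1] > thresh:
    start -= 1
while end < num_clips - 1 and scores[end + 1] > thresh:
    end += 1
print(f"grounded clips [{start}, {end}] for the query")
```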

ST-LLM: Large Language Models Are Effective Temporal Learners

TencentARC/ST-LLM 30 Mar 2024

In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLMs?

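The question above can be made concrete with a toy sketch: flatten every spatial-temporal patch token into one long sequence and hand it to the language model, which then does all the video sequence modeling. The projector and the small encoder standing in for the LLM below are illustrative assumptions, not the paper's architecture.

```python
# Sketch: feed ALL spatial-temporal tokens of a video to one sequence model.
import torch
import torch.nn as nn

T, H, W, C = 8, 14, 14, 768                         # frames x patch grid x visual feature dim
llm_dim = 1024                                      # hidden size of the (stand-in) LLM

video_tokens = torch.randn(1, T, H * W, C)          # per-frame patch features
tokens = video_tokens.flatten(1, 2)                 # (1, T*H*W, C): one long token sequence
proj = nn.Linear(C, llm_dim)                        # visual-to-LLM projector
llm_input = proj(tokens)                            # (1, 1568, llm_dim)

# Stand-in "LLM": in practice a pretrained decoder would model the sequence.
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
    num_layers=2,
)
out = llm(llm_input)                                # temporal modeling happens inside
print(out.shape)                                    # torch.Size([1, 1568, 1024])
```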

Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality

lancopku/mr-vpc 28 Mar 2024

Video paragraph captioning (VPC) involves generating detailed narratives for long videos, utilizing supportive modalities such as speech and event boundaries.

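One generic recipe for robustness to a missing modality (not necessarily the authors' method) is modality dropout: randomly zero out a modality's features during training so the captioner cannot over-rely on it. A minimal sketch:

```python
# Modality dropout for multimodal captioning, sketched with random features.
import torch

def modality_dropout(video_feats, speech_feats, p_drop=0.3, training=True):
    """Randomly drop the speech stream during training to simulate a missing modality."""
    if training and torch.rand(1).item() < p_drop:
        speech_feats = torch.zeros_like(speech_feats)   # speech "missing" this step
    return video_feats, speech_feats

v = torch.randn(1, 100, 512)    # stand-in video features (100 steps)
s = torch.randn(1, 100, 256)    # stand-in speech features
v, s = modality_dropout(v, s)
```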

OmniVid: A Generative Framework for Universal Video Understanding

wangjk666/omnivid 26 Mar 2024

The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution.

Understanding Long Videos in One Multimodal Language Model Pass

kahnchana/mvu 25 Mar 2024

In addition to faster inference, we find that the resulting models yield surprisingly good accuracy on long-video tasks, even with no video-specific information.

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

opengvlab/internvideo 22 Mar 2024

We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.

Language Repository for Long Video Understanding

kkahatapitiya/langrepo 21 Mar 2024

In this paper, we introduce a Language Repository (LangRepo) for LLMs that maintains concise and structured information as an interpretable (i.e., all-textual) representation.

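A rough sketch of what an all-textual repository could look like: per-segment notes stored as plain strings, pruned to stay concise, and serialized back into an LLM prompt. The class and method names below are illustrative, not the paper's API.

```python
# Toy "language repository": structured, interpretable, all-textual memory.
from dataclasses import dataclass, field

@dataclass
class LangRepo:
    entries: dict[int, list[str]] = field(default_factory=dict)  # segment -> notes
    max_notes: int = 5

    def write(self, segment: int, note: str) -> None:
        notes = self.entries.setdefault(segment, [])
        notes.append(note)
        if len(notes) > self.max_notes:          # prune: keep only the most recent notes
            del notes[: len(notes) - self.max_notes]

    def read(self) -> str:
        """Serialize the repository as plain text for an LLM prompt."""
        return "\n".join(
            f"[segment {s}] " + "; ".join(ns) for s, ns in sorted(self.entries.items())
        )

repo = LangRepo()
repo.write(0, "a person enters the kitchen")
repo.write(1, "they open the fridge and take out milk")
print(repo.read())
```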

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

buxiangzhiren/vd-it 18 Mar 2024

We hypothesize that the latent representation learned from a pretrained generative T2V model encapsulates rich semantics and coherent temporal correspondences, thereby naturally facilitating video understanding.

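That hypothesis is typically tested by tapping intermediate activations of the pretrained T2V network and feeding them to a downstream head. A sketch using a PyTorch forward hook; the `t2v_backbone` below is a stand-in module, not the actual pretrained model.

```python
# Grab intermediate latents from a (stand-in) T2V backbone via a forward hook.
import torch
import torch.nn as nn

t2v_backbone = nn.Sequential(                       # stand-in for one T2V UNet stage
    nn.Conv3d(3, 64, kernel_size=3, padding=1),
    nn.SiLU(),
    nn.Conv3d(64, 128, kernel_size=3, padding=1),
)

latents = {}
def save_latent(module, inputs, output):
    latents["mid"] = output                         # cache intermediate features

t2v_backbone[2].register_forward_hook(save_latent)

video = torch.randn(1, 3, 8, 64, 64)                # (B, C, T, H, W)
_ = t2v_backbone(video)
feats = latents["mid"]                              # (1, 128, 8, 64, 64)
# feats would then feed a segmentation head in the RVOS pipeline.
print(feats.shape)
```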

Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

opengvlab/video-mamba-suite 14 Mar 2024

We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks.

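For intuition, the building block Mamba generalizes is a state space model: a recurrence that scans a token sequence in linear time, which is what makes it attractive for long videos. A minimal diagonal SSM scan, omitting Mamba's input-dependent selectivity:

```python
# Minimal (non-selective) diagonal state space model scan.
import torch

def ssm_scan(x, A, B, C):
    """y_t = C * h_t,  h_t = A * h_{t-1} + B * x_t  (elementwise/diagonal A)."""
    L, d = x.shape
    h = torch.zeros(d)
    ys = []
    for t in range(L):                 # O(L): one pass over the sequence
        h = A * h + B * x[t]
        ys.append(C * h)
    return torch.stack(ys)

L, d = 16, 32                          # 16 video tokens, 32-dim state
x = torch.randn(L, d)
A = torch.rand(d) * 0.9                # stable decay per state dimension
B, C = torch.randn(d), torch.randn(d)
print(ssm_scan(x, A, B, C).shape)      # torch.Size([16, 32])
```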

Don't Judge by the Look: Towards Motion Coherent Video Representation

bespontaneous/mca-pytorch 14 Mar 2024

Current training pipelines in object recognition neglect hue jittering during data augmentation, as it not only introduces appearance changes that are detrimental to classification, but is also inefficient to implement in practice.

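One way to make such an appearance change coherent with motion is to sample a single hue shift per video and apply it to every frame, rather than jittering each frame independently. The sketch below illustrates that idea with torchvision; it is not the paper's exact Motion Coherent Augmentation.

```python
# Clip-level hue jitter: one shift shared by all frames of a video.
import torch
import torchvision.transforms.functional as TF

def coherent_hue_jitter(video, max_hue=0.25):
    """video: (T, C, H, W) float tensor in [0, 1]."""
    hue = (torch.rand(1).item() * 2 - 1) * max_hue    # one shift for the whole clip
    return torch.stack([TF.adjust_hue(frame, hue) for frame in video])

video = torch.rand(8, 3, 64, 64)       # 8 RGB frames
aug = coherent_hue_jitter(video)
print(aug.shape)                       # torch.Size([8, 3, 64, 64])
```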