Video Grounding

42 papers with code • 2 benchmarks • 8 datasets

Video grounding is the task of linking natural language descriptions to specific video segments. The model is given a video and a description, such as a sentence or a caption, and must identify the segment of the video that corresponds to that description. This can involve localizing the objects or actions mentioned in the description within the video, or associating a specific time interval with the description.
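
The time-interval case can be made concrete with a small sketch. The code below is an illustration only and is not part of the task definition: it assumes a segment is represented as a (start, end) pair in seconds, and it scores a predicted segment against an annotated one with temporal intersection over union, a commonly used overlap measure for this kind of localization. The `Segment` alias, the `temporal_iou` function, and the example timestamps are hypothetical.

```python
from typing import Tuple

# Hypothetical representation: a segment is (start_sec, end_sec).
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, target: Segment) -> float:
    """Intersection over union of two time intervals, in [0, 1]."""
    # Overlap is the gap between the later start and the earlier end,
    # clamped at zero when the intervals do not intersect.
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0


# Made-up example: the description is annotated at 10-20 s
# and a model predicts 12-22 s (8 s of overlap, 12 s of union).
print(temporal_iou((12.0, 22.0), (10.0, 20.0)))
```

A prediction is typically counted as correct when this overlap exceeds a chosen threshold; the threshold itself depends on the benchmark and is not specified here.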

Benchmarks

These leaderboards are used to track progress in Video Grounding.

Dataset	Best Model
QVHighlights	InternVideo2-6B
MAD	DenoiseLoc

Most implemented papers

Detecting Moments and Highlights in Videos via Natural Language Queries

jayleicn/moment_detr • NeurIPS 2021

Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t.

Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos

sangminwoo/explore-and-match • 25 Jan 2022

To our surprise, we found that the training schedule shows a divide-and-conquer-like pattern: time segments are first diversified regardless of the target, then coupled with each target, and fine-tuned to the target again.

UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection

tencentarc/umt • CVPR 2022

Finding relevant moments and highlights in videos according to natural language queries is a natural and highly valuable common need in the current video content explosion era.

TubeDETR: Spatio-Temporal Video Grounding with Transformers

antoyang/TubeDETR • CVPR 2022

We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.

Animal Kingdom: A Large and Diverse Dataset for Animal Behavior Understanding

SUTDCV/Animal-Kingdom • CVPR 2022

More specifically, our dataset contains 50 hours of annotated videos to localize relevant animal behavior segments in long videos for the video grounding task, 30K video sequences for the fine-grained multi-label action recognition task, and 33K frames for the pose estimation task, which correspond to a diverse range of animals with 850 species across 6 major animal classes.

Video-Guided Curriculum Learning for Spoken Video Grounding

marmot-xy/spoken-video-grounding • 1 Sep 2022

To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) during the audio pre-training process, which can make use of the vital visual perceptions to help understand the spoken language and suppress the external noise.

CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

houzhijian/cone • 22 Sep 2022

This paper tackles an emerging and challenging problem of long video temporal grounding (VTG) that localizes video moments related to a natural language (NL) query.

Towards Parameter-Efficient Integration of Pre-Trained Language Models In Temporal Video Grounding

ericashimomoto/parameter-efficient-tvg • 26 Sep 2022

This paper explores the task of Temporal Video Grounding (TVG) where, given an untrimmed video and a natural language sentence query, the goal is to recognize and determine temporal boundaries of action instances in the video described by the query.

Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding

jy0205/stcat • 27 Sep 2022

Spatio-temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object depicted by a free-form textual expression.

Graph2Vid: Flow graph to Video Grounding for Weakly-supervised Multi-Step Localization

samsunglabs/graph2vid • 10 Oct 2022

In this setup, we seek the optimal step ordering consistent with the procedure flow graph and a given video.
