Video Grounding

41 papers with code • 2 benchmarks • 8 datasets

Video grounding is the task of linking natural language descriptions to specific video segments. The model is given a video and a natural language description, such as a sentence or a caption, and must identify the segment of the video that corresponds to that description. This can involve localizing the objects or actions mentioned in the description within the video, or associating a specific time interval with the description.
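
To make the time-interval variant of the task concrete, here is a minimal sketch of how a predicted interval could be compared with an annotated one using temporal intersection-over-union (IoU). This metric is a common choice for interval localization but is not described on this page; the function name `temporal_iou` and the example timestamps are illustrative assumptions.

```python
def temporal_iou(pred, gt):
    """Return the IoU of two (start, end) intervals given in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap length is zero when the intervals are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical prediction and annotation for a single query.
predicted_segment = (12.0, 20.0)
annotated_segment = (10.0, 18.0)
score = temporal_iou(predicted_segment, annotated_segment)
```

A prediction is often counted as correct when this score exceeds a chosen threshold; the threshold itself is a benchmark-specific choice and is not specified here.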

Benchmarks

These leaderboards are used to track progress in Video Grounding.

Dataset	Best Model
QVHighlights	InternVideo2-6B
MAD	DenoiseLoc

Most implemented papers

Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

mcg-nju/mmn • 10 Sep 2021

Viewing temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN) to directly model the similarity between language queries and video moments in a joint embedding space.
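
As a rough illustration of matching queries to moments in a joint embedding space, the sketch below picks the candidate moment whose embedding is most similar to the query embedding. It is not the MMN implementation: the use of cosine similarity, the helper names `cosine_similarity` and `best_moment`, and the toy vectors are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_moment(query_embedding, moment_embeddings):
    """Return the key of the candidate moment most similar to the query."""
    return max(
        moment_embeddings,
        key=lambda moment: cosine_similarity(
            query_embedding, moment_embeddings[moment]
        ),
    )


# Toy embeddings (hypothetical values), keyed by (start, end) in seconds.
query = [1.0, 0.0, 1.0]
moments = {
    (0.0, 5.0): [0.0, 1.0, 0.0],
    (5.0, 10.0): [0.9, 0.1, 1.1],
    (10.0, 15.0): [1.0, 1.0, 0.0],
}
selected = best_moment(query, moments)
```

In a trained model the embeddings would come from learned text and video encoders; here they are fixed by hand so the ranking step can be read in isolation.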

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

opengvlab/internvideo2 • 22 Mar 2024

We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.

Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

WuJie1010/Temporally-language-grounding • 21 Jan 2019

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos.

Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences

Guaranteer/VidSTG-Dataset • CVPR 2020

In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG).

Dense Regression Network for Video Grounding

alvin-zeng/drn • CVPR 2020

The key idea of this paper is to use the distances between each frame within the ground truth and the starting (ending) frame as dense supervision to improve video grounding accuracy.
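
The dense supervision idea can be sketched as follows: every frame inside the annotated segment receives a pair of regression targets, its distance to the starting frame and its distance to the ending frame. This is only a minimal reading of the sentence above, not the paper's code; the function name `dense_regression_targets`, the use of integer frame indices, and the choice to leave frames outside the segment unsupervised (`None`) are assumptions.

```python
def dense_regression_targets(num_frames, gt_start, gt_end):
    """Build per-frame (distance to start, distance to end) targets.

    Frames outside the ground-truth segment get None, meaning they
    contribute no regression supervision in this sketch.
    """
    targets = []
    for t in range(num_frames):
        if gt_start <= t <= gt_end:
            targets.append((t - gt_start, gt_end - t))
        else:
            targets.append(None)
    return targets


# Hypothetical 6-frame video whose annotated moment spans frames 2 to 4.
targets = dense_regression_targets(num_frames=6, gt_start=2, gt_end=4)
```

Compared with supervising only the two boundary frames, this gives one training signal per frame inside the segment, which is the sense in which the supervision is dense.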

Human-centric Spatio-Temporal Video Grounding With Visual Transformers

tzhhhh123/HC-STVG • 10 Nov 2020

HC-STVG is a video grounding task that requires both spatial (where) and temporal (when) localization.

VLG-Net: Video-Language Graph Matching Network for Video Grounding

Soldelli/VLG-Net • 19 Nov 2020

Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query.

Cross-Modal learning for Audio-Visual Video Parsing

jayaprakash-a/Cross-Modal-learning-for-Audio-Visual-Video-Parsing • 3 Apr 2021

In this paper, we present a novel approach to the audio-visual video parsing (AVVP) task that demarcates events from a video separately for audio and visual modalities.

Interventional Video Grounding with Dual Contrastive Learning

nanguoshun/IVG • CVPR 2021

We introduce a dual contrastive learning approach (DCL) to better align the text and video by maximizing the mutual information (MI) between query and video clips, and the MI between start/end frames of a target moment and the others within a video to learn more informative visual representations.
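
For intuition about contrastive objectives of this kind, the sketch below computes an InfoNCE-style loss, one widely used contrastive objective that lower-bounds mutual information. Whether the paper uses this particular estimator is not stated in the excerpt above, so treat it as a stand-in; the function name `info_nce_loss`, the `temperature` parameter, and the example scores are assumptions.

```python
import math


def info_nce_loss(positive_score, negative_scores, temperature=0.1):
    """InfoNCE-style loss for one positive pair against several negatives.

    Scores are similarity values (for example, query-clip similarities).
    The loss is the negative log-probability of the positive pair under
    a softmax over all scaled scores.
    """
    logits = [positive_score / temperature]
    logits += [score / temperature for score in negative_scores]
    # Subtract the maximum logit before exponentiating for stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_sum_exp - logits[0]


# Hypothetical similarities: a well-aligned pair and a poorly aligned pair.
aligned_loss = info_nce_loss(0.9, [0.1, 0.2])
misaligned_loss = info_nce_loss(0.1, [0.9, 0.2])
```

Minimizing such a loss pushes the matching query-clip pair to score higher than the non-matching pairs, which is the alignment effect the excerpt describes.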

VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

zinengtang/VidLanKD • NeurIPS 2021

We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
