Video Grounding

37 papers with code • 2 benchmarks • 6 datasets

Video grounding is the task of linking natural language descriptions to specific video segments. Given a video and a description, such as a sentence or a caption, the model must identify the segment of the video that corresponds to the description. This can involve localizing the objects or actions mentioned in the description within the video, or associating a specific time interval with the description.
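Temporal grounding results are commonly evaluated with recall at a temporal intersection-over-union (IoU) threshold, e.g. R@1, IoU=0.5. As a minimal sketch (the function name and interval representation are illustrative, not taken from any specific benchmark):

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) time intervals, in seconds.

    A predicted moment counts as correct when its IoU with the
    ground-truth interval exceeds a chosen threshold (e.g. 0.5).
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds over a 15-second union, giving an IoU of 1/3 and failing an IoU=0.5 threshold.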

Most implemented papers

Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

mcg-nju/mmn 10 Sep 2021

Instead, viewing temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN) to directly model the similarity between language queries and video moments in a joint embedding space.
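The core of a joint-embedding formulation is ranking candidate moments by their similarity to the query embedding. A toy stand-in (embedding shapes and the cosine-similarity choice are assumptions, not MMN's exact architecture):

```python
import numpy as np

def rank_moments(query_emb, moment_embs):
    """Rank candidate moment embeddings by cosine similarity to a query.

    query_emb:   (d,) language-query embedding
    moment_embs: (n, d) embeddings of n candidate video moments
    Returns indices sorted from best to worst match, plus the scores.
    """
    q = query_emb / np.linalg.norm(query_emb)
    m = moment_embs / np.linalg.norm(moment_embs, axis=1, keepdims=True)
    scores = m @ q  # cosine similarity in the shared space
    return np.argsort(-scores), scores
```

Metric-learning losses then pull matched query-moment pairs together and push mismatched (negative) pairs apart in this space.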

Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

WuJie1010/Temporally-language-grounding 21 Jan 2019

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos.

Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences

Guaranteer/VidSTG-Dataset CVPR 2020

In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG).

Dense Regression Network for Video Grounding

alvin-zeng/drn CVPR 2020

The key idea of this paper is to use the distances from each frame within the ground-truth interval to the starting (ending) frame as dense supervision to improve video grounding accuracy.
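The dense-supervision idea can be sketched as follows: every frame inside the ground-truth interval regresses its offsets to the interval's boundaries, so many frames contribute a training signal instead of only the two boundary frames (the function below is an illustrative simplification, not the paper's implementation):

```python
def dense_regression_targets(num_frames, gt_start, gt_end):
    """Regression targets for frames inside [gt_start, gt_end].

    Each in-interval frame t is supervised with (t - gt_start, gt_end - t),
    its distances to the interval's start and end frames; frames outside
    the ground truth receive no regression target.
    """
    targets = {}
    for t in range(num_frames):
        if gt_start <= t <= gt_end:
            targets[t] = (t - gt_start, gt_end - t)
    return targets
```

At inference, each frame's predicted offsets yield a candidate interval, and the highest-scoring candidate is taken as the grounded moment.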

Human-centric Spatio-Temporal Video Grounding With Visual Transformers

tzhhhh123/HC-STVG 10 Nov 2020

HC-STVG is a video grounding task that requires both spatial (where) and temporal (when) localization.

VLG-Net: Video-Language Graph Matching Network for Video Grounding

Soldelli/VLG-Net 19 Nov 2020

Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query.

Cross-Modal learning for Audio-Visual Video Parsing

jayaprakash-a/Cross-Modal-learning-for-Audio-Visual-Video-Parsing 3 Apr 2021

In this paper, we present a novel approach to the audio-visual video parsing (AVVP) task that demarcates events from a video separately for audio and visual modalities.

Interventional Video Grounding with Dual Contrastive Learning

nanguoshun/IVG CVPR 2021

Meanwhile, we introduce a dual contrastive learning approach (DCL) to better align text and video by maximizing the mutual information (MI) between the query and video clips, and the MI between the start/end frames of a target moment and the other frames within the video, to learn more informative visual representations.
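Contrastive MI maximization of this kind is typically realized with an InfoNCE-style objective, which lower-bounds MI by scoring matched pairs against in-batch negatives. A generic sketch (not IVG's exact formulation; the temperature value is an assumption):

```python
import numpy as np

def info_nce(sim_matrix, temperature=0.1):
    """InfoNCE loss over a (n, n) similarity matrix.

    Row i scores sample i of one modality against all n samples of the
    other; the diagonal holds the positive (matched) pairs. Minimizing
    this loss maximizes a lower bound on the mutual information.
    """
    logits = sim_matrix / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

When matched pairs score much higher than negatives, the loss approaches zero; with uninformative (uniform) similarities over n candidates, it sits near log(n).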

VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

zinengtang/VidLanKD NeurIPS 2021

We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.

Detecting Moments and Highlights in Videos via Natural Language Queries

jayleicn/moment_detr NeurIPS 2021

Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query.