Moment Retrieval

48 papers with code • 2 benchmarks • 5 datasets

Moment retrieval can be defined as the task of "localizing moments in a video given a user query".

Description from: QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
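As a rough illustration of the task definition above, the sketch below scores each clip of a video against a text query embedding and returns the best contiguous window. It is a minimal sketch, not taken from any paper listed here; the feature arrays, the cosine-similarity scoring, and the `max_len` window cap are all assumptions.

```python
import numpy as np

def retrieve_moment(clip_feats: np.ndarray, query_feat: np.ndarray,
                    max_len: int = 10) -> tuple[int, int]:
    """Return (start, end) clip indices of the best-scoring window.

    clip_feats: (num_clips, d) per-clip video features (placeholder).
    query_feat: (d,) text query feature (placeholder).
    """
    # Per-clip relevance: cosine similarity between each clip and the query.
    clips = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    query = query_feat / np.linalg.norm(query_feat)
    scores = clips @ query  # (num_clips,)

    # Exhaustively score contiguous windows of up to max_len clips and
    # keep the one with the highest mean clip-query similarity.
    best, best_span = -np.inf, (0, 0)
    for start in range(len(scores)):
        for end in range(start + 1, min(start + max_len, len(scores)) + 1):
            s = scores[start:end].mean()
            if s > best:
                best, best_span = s, (start, end)
    return best_span
```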

Most implemented papers

QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries

jayleicn/moment_detr 20 Jul 2021

Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
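For concreteness, a QVHighlights-style annotation record might look roughly like the following. The field names loosely follow the jayleicn/moment_detr data release, but this is a sketch with hypothetical values, not the exact file format.

```python
# Illustrative QVHighlights-style annotation record (a sketch: field names
# loosely follow the jayleicn/moment_detr release, all values hypothetical).
annotation = {
    "qid": 8737,                                          # hypothetical query id
    "query": "A man in a suit gives a speech on stage.",  # (1) free-form NL query
    "vid": "abc123",                                      # hypothetical video id
    "duration": 150,                                      # video length in seconds
    "relevant_windows": [[42.0, 60.0], [90.0, 102.0]],    # (2) moments w.r.t. the query
    "saliency_scores": [[3, 4, 4], [2, 3, 2]],            # (3) per-clip ratings from several annotators
}
```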

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

linjieli222/HERO EMNLP 2020

We present HERO, a novel framework for large-scale video+language omni-representation learning.
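The hierarchy in HERO refers to two levels of encoding: a cross-modal layer fuses each frame with its aligned subtitle text, and a temporal layer then contextualizes the fused frames across the video. The toy sketch below shows that structure only; it is not the official HERO implementation, and the dimensions and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Toy two-level encoder in the spirit of HERO (not the official model):
    a cross-modal layer fuses each frame with its local subtitle tokens,
    then a temporal layer contextualizes the fused frames over the video."""

    def __init__(self, d: int = 256, heads: int = 4):
        super().__init__()
        self.cross_modal = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.temporal = nn.TransformerEncoderLayer(d, heads, batch_first=True)

    def forward(self, frames: torch.Tensor, subs: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, d) frame features; subs: (B, T, L, d) subtitle tokens.
        B, T, L, d = subs.shape
        # Level 1: fuse each frame with its temporally aligned subtitle tokens.
        local = torch.cat(
            [frames.reshape(B * T, 1, d), subs.reshape(B * T, L, d)], dim=1)
        fused = self.cross_modal(local)[:, 0].reshape(B, T, d)  # keep the frame slot
        # Level 2: model temporal context across the fused frame sequence.
        return self.temporal(fused)
```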

Finding Moments in Video Collections Using Natural Language

jayleicn/TVRetrieval 30 Jul 2019

We evaluate our approach on two recently proposed datasets for temporal localization of moments in video with natural language (DiDeMo and Charades-STA) extended to our video corpus moment retrieval setting.
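Video corpus moment retrieval stacks a retrieval problem on top of localization: rank the videos in a corpus by how well they match the query, then localize a moment within the top-ranked candidates. The following two-stage sketch illustrates the setting; it is not the paper's actual model, and the greedy window expansion with its 0.5 threshold is an arbitrary assumption.

```python
import numpy as np

def corpus_moment_retrieval(corpus_feats, query_feat, top_k=5):
    """Two-stage sketch of video corpus moment retrieval (illustrative only):
    rank videos by their best clip-query similarity, then localize a window
    inside each video by greedily expanding around the best clip.

    corpus_feats: list of (num_clips_i, d) arrays, one per video.
    query_feat:   (d,) text query feature (placeholder).
    """
    query = query_feat / np.linalg.norm(query_feat)
    results = []
    for vid_idx, clips in enumerate(corpus_feats):
        clips = clips / np.linalg.norm(clips, axis=1, keepdims=True)
        scores = clips @ query
        # Stage 1 score: the video's best clip-query similarity.
        video_score = scores.max()
        # Stage 2: expand a window around the best clip while neighbors
        # stay above an arbitrary fraction of the peak score.
        start = end = int(scores.argmax())
        while start > 0 and scores[start - 1] > 0.5 * video_score:
            start -= 1
        while end + 1 < len(scores) and scores[end + 1] > 0.5 * video_score:
            end += 1
        results.append((video_score, vid_idx, (start, end + 1)))
    results.sort(reverse=True)
    return results[:top_k]
```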

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

jayleicn/TVRetrieval ECCV 2020

The queries are also labeled with query types that indicate whether each query is more related to the video, the subtitle, or both, allowing for in-depth analysis of the dataset and of the methods built on top of it.

Correlation-guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding

wjun0830/cgdetr 15 Nov 2023

Dummy tokens conditioned by text query take portions of the attention weights, preventing irrelevant video clips from being represented by the text query.
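A minimal sketch of that dummy-token idea is below. It is not the official CG-DETR code; the shapes, the pooled-query conditioning, and the number of dummy tokens are assumptions.

```python
import torch
import torch.nn as nn

class DummyTokenCrossAttention(nn.Module):
    """Sketch of the dummy-token mechanism described above (not the official
    CG-DETR code): learnable dummy tokens, conditioned on the text query,
    are appended to the attention keys/values so clips irrelevant to the
    query can place their attention mass on the dummies instead."""

    def __init__(self, d: int = 256, heads: int = 4, num_dummies: int = 3):
        super().__init__()
        self.dummies = nn.Parameter(torch.randn(num_dummies, d))
        self.condition = nn.Linear(d, d)  # conditions dummies on the query
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, clips: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # clips: (B, T, d) video clip features; words: (B, L, d) query tokens.
        B = clips.size(0)
        # Condition the dummy tokens on the mean-pooled text query.
        pooled = self.condition(words.mean(dim=1, keepdim=True))        # (B, 1, d)
        dummies = self.dummies.unsqueeze(0).expand(B, -1, -1) + pooled  # (B, K, d)
        # Dummies absorb attention weight from query-irrelevant clips.
        kv = torch.cat([words, dummies], dim=1)                         # (B, L+K, d)
        out, _ = self.attn(clips, kv, kv)
        return out
```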

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

opengvlab/internvideo2 22 Mar 2024

We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.

Weakly Supervised Video Moment Retrieval From Text Queries

niluthpol/weak_supervised_video_moment CVPR 2019

The supervision is weak because, during training, we only have access to video-text pairs rather than the temporal extent of the video to which different text descriptions relate.
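A common way to learn from such weak supervision is a multiple-instance-learning-style contrastive objective: score a video against a query by its best-matching clip, then contrast matched video-text pairs against mismatched ones in the batch. The sketch below is illustrative and not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def weak_alignment_loss(clip_feats: torch.Tensor,
                        query_feats: torch.Tensor) -> torch.Tensor:
    """MIL-style contrastive loss sketch for weakly supervised moment
    retrieval (illustrative, not the paper's exact objective): with only
    video-text pairs, a video's score for a query is its best-matching
    clip, and matched pairs are contrasted against in-batch mismatches.

    clip_feats:  (B, T, d) clip features for B videos.
    query_feats: (B, d) text features; query i describes video i.
    """
    clips = F.normalize(clip_feats, dim=-1)
    queries = F.normalize(query_feats, dim=-1)
    # Clip-level similarities between every video and every query: (B, T, B).
    sims = torch.einsum("btd,qd->btq", clips, queries)
    # Video-level score = best clip (the moment is the latent instance).
    video_scores = sims.max(dim=1).values  # (B, B): [video, query]
    # Each query should rank its own video above the other videos.
    targets = torch.arange(video_scores.size(0), device=video_scores.device)
    return F.cross_entropy(video_scores.t(), targets)
```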

Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos

ikuinen/CMIN 6 Jun 2019

Query-based moment retrieval aims to localize the most relevant moment in an untrimmed video according to the given natural language query.

Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos

ikuinen/regularized_two-branch_proposal_network 19 Aug 2020

Existing weakly-supervised methods fail to distinguish the target moment from plausible negative moments.
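One generic remedy is to mine hard negatives from within the same video and enforce a margin between the target proposal and those negatives. The sketch below illustrates that idea only; it is not the paper's two-branch proposal network.

```python
import torch
import torch.nn.functional as F

def intra_video_margin_loss(proposal_scores: torch.Tensor,
                            margin: float = 0.2) -> torch.Tensor:
    """Illustrative margin loss for separating the target moment from
    plausible negatives mined within the same video (a generic sketch,
    not the paper's two-branch architecture): the highest-scoring
    proposal is treated as the positive, and all remaining proposals
    as hard intra-video negatives.

    proposal_scores: (N,) query-matching scores for N candidate moments.
    """
    pos = proposal_scores.max()
    # Hard negatives: every other proposal in the same video.
    neg = proposal_scores[proposal_scores < pos]
    if neg.numel() == 0:
        return proposal_scores.new_zeros(())
    # Hinge: push the positive above each negative by at least `margin`.
    return F.relu(margin - (pos - neg)).mean()
```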

VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval

dbstjswo505/SQuiDNet ECCV 2020

This paper explores methods for performing video moment retrieval (VMR) in a weakly-supervised manner (wVMR): training is performed without temporal moment labels, using only the text query that describes a segment of the video.