Moment Retrieval
69 papers with code • 2 benchmarks • 5 datasets
Moment retrieval can be defined as the task of "localizing moments in a video given a user query".
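The definition above can be pictured as a scoring problem: embed the query and each video clip, score candidate temporal spans, and return the best-scoring one. A minimal sketch with toy random features (function and variable names are illustrative, not from any specific paper):

```python
import numpy as np

def retrieve_moment(clip_feats, query_feat, max_len=5):
    """Return the (start, end) clip-index span whose mean feature is
    most similar to the query embedding under cosine similarity."""
    n = len(clip_feats)
    best, best_score = (0, 1), -np.inf
    for s in range(n):
        for e in range(s + 1, min(n, s + max_len) + 1):
            span = clip_feats[s:e].mean(axis=0)  # average clip features over the span
            score = span @ query_feat / (
                np.linalg.norm(span) * np.linalg.norm(query_feat)
            )
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Toy example: 8 clips of 4-d features; the query matches clips 3-5.
rng = np.random.default_rng(0)
clips = rng.normal(size=(8, 4))
query = clips[3:6].mean(axis=0)
print(retrieve_moment(clips, query))  # best-matching span
```

Real systems replace the random features with learned video and text encoders and the exhaustive span search with proposal networks or direct span regression, but the scoring structure is the same.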
Description from: QVHIGHLIGHTS: Detecting Moments and Highlights in Videos via Natural Language Queries
Most implemented papers
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
Each video in the dataset is annotated with: (1) a human-written free-form NL query, and (2) relevant moments in the video w.r.t. the query.
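The annotation scheme described above can be sketched as a simple record per video. A hypothetical illustration (field names are illustrative, not the dataset's actual JSON schema):

```python
# Hypothetical moment-retrieval annotation record (illustrative field names).
annotation = {
    # (1) a human-written free-form NL query
    "query": "A man in a blue shirt talks to the camera",
    # (2) relevant moments w.r.t. the query, as [start, end] in seconds
    "relevant_windows": [[12.0, 38.0]],
}

def moment_duration(record):
    """Total duration (in seconds) covered by the annotated moments."""
    return sum(end - start for start, end in record["relevant_windows"])

print(moment_duration(annotation))  # 26.0
```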
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
We present HERO, a novel framework for large-scale video+language omni-representation learning.
UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection
Finding relevant moments and highlights in videos according to natural language queries is a natural and highly valuable need in the current era of exploding video content.
Finding Moments in Video Collections Using Natural Language
We evaluate our approach on two recently proposed datasets for temporal localization of moments in video with natural language (DiDeMo and Charades-STA) extended to our video corpus moment retrieval setting.
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
The queries are also labeled with query types that indicate whether each of them is more related to the video, the subtitles, or both, allowing for in-depth analysis of the dataset and the methods built on top of it.
Correlation-Guided Query-Dependency Calibration for Video Temporal Grounding
Dummy tokens conditioned by text query take portions of the attention weights, preventing irrelevant video clips from being represented by the text query.
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue.
Weakly Supervised Video Moment Retrieval From Text Queries
The supervision is weak because, during training, we only have access to the video-text pairs rather than the temporal extent of the video to which different text descriptions relate.
Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos
Query-based moment retrieval aims to localize the most relevant moment in an untrimmed video according to the given natural language query.
Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos
Thus, these methods fail to distinguish the target moment from plausible negative moments.