Natural Language Moment Retrieval
12 papers with code • 4 benchmarks • 3 datasets
Latest papers
UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection
Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify the events described by open-ended natural language within untrimmed videos.
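Both tasks reduce to predicting temporal spans conditioned on text; below is a minimal sketch of one way to expose them through a single grounding interface (function names and the prompt template are illustrative assumptions, not UniMD's design):

```python
# Sketch: one text-conditioned grounding interface serving both tasks.
# TAD's fixed label set is turned into natural-language queries, so the
# same retrieve(video, text) call covers pre-defined and open-ended cases.

TAD_CLASSES = ["high jump", "pole vault", "diving"]  # pre-defined actions

def retrieve(video, query: str) -> list[tuple[float, float, float]]:
    # Stand-in for a text-conditioned grounding model; returns (start, end, score).
    return [(0.0, 1.0, 0.5)]

def detect_actions(video) -> dict[str, list[tuple[float, float, float]]]:
    # TAD as repeated moment retrieval: one query per pre-defined class.
    return {cls: retrieve(video, f"a person performs {cls}") for cls in TAD_CLASSES}

def retrieve_moment(video, sentence: str) -> list[tuple[float, float, float]]:
    # MR: open-ended language goes straight through the same interface.
    return retrieve(video, sentence)
```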
BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos
However, existing methods suffer from center misalignment arising from the inherent ambiguity of moment centers, leading to inaccurate predictions.
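The contrast between center-based and boundary-oriented moment parameterization can be seen in a small sketch (a simplification in our own notation, not BAM-DETR's exact formulation):

```python
# Center-based vs. boundary-oriented decoding of a moment (start, end).
# A toy contrast; names and formulas are ours, not the paper's.

def decode_center_width(center: float, width: float) -> tuple[float, float]:
    # An error in the predicted center shifts BOTH boundaries at once,
    # so ambiguity about where the "center" lies degrades both edges.
    return center - width / 2, center + width / 2

def decode_boundary_oriented(anchor: float, d_start: float, d_end: float) -> tuple[float, float]:
    # Each edge is regressed as its own distance from an anchor point,
    # so the two boundaries can be localized and refined independently.
    return anchor - d_start, anchor + d_end

print(decode_center_width(0.5, 0.4))            # (0.3, 0.7)
print(decode_boundary_oriented(0.5, 0.2, 0.2))  # (0.3, 0.7)
```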
Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection
Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis.
Correlation-guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding
Dummy tokens conditioned on the text query absorb a portion of the attention weights, preventing irrelevant video clips from being represented by the text query.
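A toy sketch of how query-conditioned dummy tokens can divert attention mass in video-to-text cross-attention (the shapes, additive conditioning, and mean pooling are assumptions made for illustration, not CG-DETR's implementation):

```python
import torch
import torch.nn.functional as F

d, n_clips, n_words, n_dummies = 64, 10, 5, 3
clips = torch.randn(n_clips, d)                # video clip features (queries)
words = torch.randn(n_words, d)                # text token features (keys)
pooled_text = words.mean(dim=0, keepdim=True)  # pooled text-query representation

# Learnable dummy tokens, conditioned here on the text query by simple addition.
dummy_base = torch.nn.Parameter(torch.randn(n_dummies, d))
dummies = dummy_base + pooled_text

# Each clip attends over text tokens plus dummies; clips irrelevant to the
# query can route their attention mass to the dummies instead of the text.
keys = torch.cat([words, dummies], dim=0)
attn = F.softmax(clips @ keys.T / d ** 0.5, dim=-1)  # (n_clips, n_words + n_dummies)

text_mass = attn[:, :n_words].sum(dim=-1)  # per-clip weight left on real text tokens
print(text_mass)
```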
UnLoc: A Unified Framework for Video Localization Tasks
While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos remains relatively unexplored.
UniVTG: Towards Unified Video-Language Temporal Grounding
Most methods in this direction develop task-specific models trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their ability to generalize across VTG tasks and labels.
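One way to make such heterogeneous labels interchangeable is a common per-clip form carrying an interval indicator, boundary offsets, and a saliency score; the sketch below is loosely in that spirit, with field names and details as assumptions rather than UniVTG's exact definition:

```python
from dataclasses import dataclass

@dataclass
class ClipLabel:
    foreground: bool              # is this clip inside the target interval?
    offsets: tuple[float, float]  # distances to interval start/end (moment retrieval)
    saliency: float               # per-clip worthiness score (highlight detection)

# A moment-retrieval interval and a highlight "worthiness curve" can both be
# expressed as one sequence of ClipLabel, so a single head/loss covers both.
labels = [ClipLabel(True, (0.5, 2.0), 0.8), ClipLabel(False, (0.0, 0.0), 0.1)]
```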
Overcoming Weak Visual-Textual Alignment for Video Moment Retrieval
Video moment retrieval (VMR) identifies a specific moment in an untrimmed video for a given natural language query.
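Concretely, the system maps a (video, query) pair to a (start, end) span, and predictions are commonly scored by temporal IoU against the ground-truth interval (e.g., Recall@1 at IoU 0.5); a minimal sketch with our own function names:

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Temporal IoU between two (start, end) spans, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical prediction vs. ground truth for one query.
pred_span, gt_span = (12.0, 18.5), (11.0, 17.0)
print(temporal_iou(pred_span, gt_span))         # ~0.667
print(temporal_iou(pred_span, gt_span) >= 0.5)  # counts as a hit at IoU@0.5
```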
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
Localizing Moments in Long Video Via Multimodal Guidance
In this paper, we propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows.
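A small sketch of this guidance-then-grounding flow (the scoring function, threshold, and fallback are placeholder assumptions, not the paper's model):

```python
from typing import Callable

def ground_with_pruning(
    windows: list[tuple[float, float]],
    describability: Callable[[tuple[float, float]], float],
    ground: Callable[[tuple[float, float]], tuple[float, float, float]],
    keep_threshold: float = 0.5,
) -> tuple[float, float]:
    # 1) Score each candidate window for how "describable" it is w.r.t. the query.
    survivors = [w for w in windows if describability(w) >= keep_threshold]
    survivors = survivors or windows  # fall back if everything was pruned
    # 2) Run the (expensive) grounding model only on the surviving windows;
    #    each call returns (start, end, confidence).
    candidates = [ground(w) for w in survivors]
    # 3) Keep the highest-confidence span.
    best = max(candidates, key=lambda c: c[2])
    return best[0], best[1]
```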
MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques.