33 papers with code • 2 benchmarks • 5 datasets
Moment retrieval can be defined as the task of "localizing moments in a video given a user query".
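Predictions in this task are typically scored by temporal intersection-over-union (tIoU) between a predicted moment and the ground-truth moment, with metrics such as Recall@K at a tIoU threshold. As a minimal sketch (the function name and tuple convention are illustrative, not from any specific benchmark):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two moments, each given as (start_sec, end_sec).

    tIoU = length of overlap / length of union of the two intervals.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# Example: a prediction of (5, 15) against ground truth (0, 10)
# overlaps for 5 seconds over a 15-second union.
print(temporal_iou((5.0, 15.0), (0.0, 10.0)))
```

A prediction is then counted as correct if its tIoU exceeds a chosen threshold (commonly 0.5 or 0.7).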
Each video in the dataset is annotated with: (1) a human-written free-form NL query, and (2) the relevant moments in the video w.r.t. that query.
We evaluate our approach on two recently proposed datasets for temporal localization of moments in video with natural language (DiDeMo and Charades-STA) extended to our video corpus moment retrieval setting.
The queries are also labeled with query types indicating whether each is more related to the video, the subtitles, or both, allowing in-depth analysis of the dataset and of methods built on top of it.
Correlation-guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding
Dummy tokens conditioned by text query take a portion of the attention weights, preventing irrelevant video clips from being represented by the text query.
The supervision is weak because, during training, we only have access to video-text pairs rather than the temporal extent of the video to which each text description relates.
Query-based moment retrieval aims to localize the most relevant moment in an untrimmed video according to the given natural language query.
In this paper, we present a series of experiments assessing how well the benchmark results reflect the true progress in solving the moment retrieval task.
Another contribution is an additional predictor that utilizes internal frames during model training to improve localization accuracy.