Video Retrieval
221 papers with code • 18 benchmarks • 31 datasets
The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the text query. Typically, the videos are returned as a ranked list of candidates and scored via standard retrieval metrics such as Recall@K and median rank.
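As a minimal sketch of how such evaluation works, the snippet below computes Recall@K and median rank from a query-by-video similarity matrix. The matrix, the helper name `retrieval_metrics`, and the convention that query `i`'s ground-truth video sits at index `i` are illustrative assumptions, not taken from any particular benchmark's evaluation code.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Recall@K and median rank from a similarity matrix.

    sim[i, j] is the similarity of text query i to candidate video j;
    the ground-truth video for query i is assumed to be at index i
    (a common convention in paired text-video test sets).
    """
    # Rank of the correct video for each query (0 = retrieved first).
    order = np.argsort(-sim, axis=1)  # candidate indices, best first
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1)
    metrics = {f"R@{k}": float(np.mean(ranks < k)) * 100 for k in ks}
    metrics["MedR"] = float(np.median(ranks)) + 1  # 1-indexed median rank
    return metrics

# Toy example: 3 queries, 3 candidate videos.
sim = np.array([
    [0.9, 0.1, 0.3],  # query 0's video is ranked first  -> rank 0
    [0.2, 0.4, 0.8],  # query 1's video is ranked second -> rank 1
    [0.1, 0.2, 0.7],  # query 2's video is ranked first  -> rank 0
])
print(retrieval_metrics(sim))
```

In practice the similarity matrix comes from scoring embedded queries against embedded videos (e.g. cosine similarity of cross-modal features); the metric computation itself is model-agnostic.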
Libraries
Use these libraries to find Video Retrieval models and implementations.
Latest papers with no code
Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models
Our contributions encompass the development of an innovative interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a meticulously designed evaluation dataset, and thorough experimental validation.
Learning text-to-video retrieval from image captioning
In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide supervision signal into unlabeled videos.
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
In particular, text-video retrieval, which aims to find the top matching videos given text descriptions from a vast video corpus, is an essential function, the primary challenge of which is to bridge the modality gap.
ProTA: Probabilistic Token Aggregation for Text-Video Retrieval
Text-video retrieval aims to find the most relevant cross-modal samples for a given query.
Event-aware Video Corpus Moment Retrieval
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos using a natural language query.
Video Editing for Video Retrieval
The teacher model is employed to edit the clips in the training set whereas the student model trains on the edited clips.
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings and extracts the audiovisual features most informative for the corresponding text.
Distilling Vision-Language Models on Millions of Videos
Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.
Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks
Compared to conventional textual retrieval, the main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content.
Detours for Navigating Instructional Videos
We introduce the video detours problem for navigating instructional videos.