Video Retrieval

221 papers with code • 18 benchmarks • 31 datasets

The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the query. Typically, the candidates are returned as a ranked list and scored with standard retrieval metrics such as Recall@K and median rank.
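The ranking setup described above can be sketched in a few lines. This is a minimal, illustrative example (not the method of any paper listed here): it assumes the query and each video have already been embedded into a shared vector space, ranks videos by cosine similarity, and scores the ranking with Recall@K.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_videos(query_emb, video_embs):
    """Return video ids sorted by descending similarity to the query."""
    scores = {vid: cosine(query_emb, emb) for vid, emb in video_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)

def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the ground-truth video appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

# Toy data: one query embedding and three candidate video embeddings
query = [0.9, 0.1, 0.0]
videos = {
    "vid_a": [0.1, 0.9, 0.0],  # weak match
    "vid_b": [0.8, 0.2, 0.1],  # best match
    "vid_c": [0.0, 0.0, 1.0],  # unrelated
}
ranking = rank_videos(query, videos)
print(ranking)                            # ['vid_b', 'vid_a', 'vid_c']
print(recall_at_k(ranking, "vid_b", 1))   # 1.0
```

In practice the embeddings come from a trained dual-encoder (text encoder plus video encoder), and Recall@K is averaged over all queries in the test set.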


Latest papers with no code

Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models

no code yet • 29 Apr 2024

Our contributions encompass the development of an innovative interactive image retrieval system, the integration of an LLM-based denoiser, the curation of a meticulously designed evaluation dataset, and thorough experimental validation.

Learning text-to-video retrieval from image captioning

no code yet • 26 Apr 2024

In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal on unlabeled videos.

SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval

no code yet • 22 Apr 2024

In particular, text-video retrieval, which aims to find the top-matching videos in a vast video corpus given a text description, is an essential function; its primary challenge is bridging the modality gap.

ProTA: Probabilistic Token Aggregation for Text-Video Retrieval

no code yet • 18 Apr 2024

Text-video retrieval aims to find the most relevant cross-modal samples for a given query.

Event-aware Video Corpus Moment Retrieval

no code yet • 21 Feb 2024

Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos using a natural language query.
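A toy sketch of the VCMR setting (an assumed sliding-window formulation, not the approach of this paper): each video is represented as a sequence of per-clip embeddings, every fixed-length window is mean-pooled and scored against the query, and the best-scoring (video, start, end) triple is returned.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_moment(query_emb, corpus, window, stride=1):
    """Scan fixed-length windows over every video in the corpus.

    corpus maps video id -> list of per-clip embeddings.
    Returns (score, video_id, start, end) for the best window.
    """
    best = None
    for vid, clip_embs in corpus.items():
        for start in range(0, max(1, len(clip_embs) - window + 1), stride):
            seg = clip_embs[start:start + window]
            # Mean-pool the clip embeddings inside the window (assumed aggregation)
            pooled = [sum(dim) / len(seg) for dim in zip(*seg)]
            score = cosine(query_emb, pooled)
            if best is None or score > best[0]:
                best = (score, vid, start, start + window)
    return best

# Toy corpus: the query matches the second half of video "v1"
query = [1.0, 0.0]
corpus = {
    "v1": [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]],
    "v2": [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
}
print(retrieve_moment(query, corpus, window=2))  # best moment is v1, clips 2..4
```

Real VCMR systems replace the exhaustive window scan with learned moment proposals or span prediction, but the input/output contract is the same: a query in, a (video, temporal span) pair out.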

Video Editing for Video Retrieval

no code yet • 4 Feb 2024

The teacher model is employed to edit the clips in the training set whereas the student model trains on the edited clips.

CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

no code yet • 22 Jan 2024

To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings, and extracts the audio-visual features most informative for the corresponding text.

Distilling Vision-Language Models on Millions of Videos

no code yet • 11 Jan 2024

Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%.

Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks

no code yet • 6 Jan 2024

Compared to conventional textual retrieval, the main obstacle for text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content.

Detours for Navigating Instructional Videos

no code yet • 3 Jan 2024

We introduce the video detours problem for navigating instructional videos.