Video Retrieval
246 papers with code • 19 benchmarks • 35 datasets
The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the text query. Typically, the candidates are returned as a ranked list and scored with document retrieval metrics such as Recall@K and median rank.
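A minimal sketch of this setup, assuming precomputed embeddings: `text_emb` for the query and `video_embs` for the candidate pool. The names, shapes, and toy data are illustrative, not taken from any specific paper.

```python
import numpy as np

def rank_videos(text_emb: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Return candidate indices sorted from most to least similar (cosine)."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    video_embs = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    scores = video_embs @ text_emb          # cosine similarity per candidate
    return np.argsort(-scores)              # descending ranked list

def recall_at_k(ranked: np.ndarray, gt_index: int, k: int) -> float:
    """Document-retrieval metric: is the ground-truth video in the top k?"""
    return float(gt_index in ranked[:k])

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
query, pool = rng.normal(size=512), rng.normal(size=(100, 512))
ranked = rank_videos(query, pool)
print(recall_at_k(ranked, gt_index=int(ranked[0]), k=5))  # trivially 1.0 here
```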
Libraries
Use these libraries to find Video Retrieval models and implementations.
Most implemented papers
ECO: Efficient Convolutional Network for Online Video Understanding
In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets.
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval.
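A sketch of why a joint (dual-encoder) embedding makes text-to-video retrieval efficient: video embeddings are computed once offline, and each new text query needs only one encoder pass plus a matrix product. The encoders below are placeholders (`nn.Linear`), not the Frozen in Time architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

video_encoder = nn.Linear(2048, 256)   # stand-in for a video transformer
text_encoder = nn.Linear(768, 256)     # stand-in for a text transformer

# Offline: embed and L2-normalise the whole candidate pool once.
video_feats = torch.randn(10_000, 2048)            # e.g. pooled frame features
with torch.no_grad():
    video_index = F.normalize(video_encoder(video_feats), dim=-1)

# Online: a single query embedding scores every candidate with one matmul.
query_feat = torch.randn(1, 768)
with torch.no_grad():
    q = F.normalize(text_encoder(query_feat), dim=-1)
scores = q @ video_index.T                         # (1, 10_000) similarities
top10 = scores.topk(10).indices                    # ranked shortlist
```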
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.
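A rough sketch of the simplest way to reuse a CLIP-style image/text encoder for video-text similarity: encode sampled frames independently and mean-pool them into a single video embedding (CLIP4Clip studies this parameter-free pooling alongside richer aggregators). `clip_image_encode` and `clip_text_encode` are placeholders for whatever CLIP implementation is used.

```python
import torch
import torch.nn.functional as F

def video_text_similarity(frames: torch.Tensor, text_tokens: torch.Tensor,
                          clip_image_encode, clip_text_encode) -> torch.Tensor:
    """frames: (T, 3, H, W) sampled frames; text_tokens: one tokenized query."""
    frame_embs = clip_image_encode(frames)                     # (T, D) per-frame embeddings
    video_emb = F.normalize(frame_embs.mean(dim=0), dim=-1)    # parameter-free mean pooling
    text_emb = F.normalize(clip_text_encode(text_tokens).squeeze(0), dim=-1)
    return video_emb @ text_emb                                # cosine similarity
```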
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
CoCa: Contrastive Captioners are Image-Text Foundation Models
We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the outputs of the multimodal decoder, which predicts text tokens autoregressively.
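A hedged sketch of this two-part objective: a symmetric contrastive (InfoNCE) loss over unimodal image/text embeddings plus an autoregressive captioning cross-entropy over decoder logits. The tensor shapes, temperature, and loss weighting are illustrative, not CoCa's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def coca_style_loss(img_emb, txt_emb, caption_logits, caption_targets,
                    temperature=0.07, caption_weight=2.0):
    # img_emb, txt_emb: (B, D); caption_logits: (B, L, V); caption_targets: (B, L)
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature          # (B, B) pairwise similarities
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.T, labels)) / 2
    captioning = F.cross_entropy(caption_logits.flatten(0, 1),
                                 caption_targets.flatten())
    return contrastive + caption_weight * captioning
```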
Dense-Captioning Events in Videos
We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events.
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.
End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Annotating videos is cumbersome, expensive and not scalable.
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs.