Text to Video Retrieval
40 papers with code • 2 benchmarks • 3 datasets
Most implemented papers
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Our objective in this work is video-text retrieval, in particular a joint embedding that enables efficient text-to-video retrieval.
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.
End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Annotating videos is cumbersome, expensive and not scalable.
MDMMT: Multidomain Multimodal Transformer for Video Retrieval
We present a new state-of-the-art on the text-to-video retrieval task on the MSRVTT and LSMDC benchmarks, where our model outperforms all previous solutions by a large margin.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
Bridging Video-text Retrieval with Multiple Choice Questions
As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.
Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval
Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval.
FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks
Large-scale pretrained image-text models have shown incredible zero-shot performance in a handful of tasks, including video ones such as action recognition and text-to-video retrieval.
Revealing Single Frame Bias for Video-and-Language Learning
Training an effective video-and-language model intuitively requires multiple frames as model inputs.
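
Several of the papers listed above describe text-to-video retrieval in terms of a joint embedding: text and video are mapped into one vector space, and videos are ranked by their similarity to the query. The sketch below illustrates that general idea only and is not taken from any of the listed papers. The function names (l2_normalize, rank_videos, recall_at_k), the toy three-dimensional embeddings, and the use of cosine similarity with Recall@K as the metric are all assumptions made for illustration. Real systems would obtain the embeddings from trained text and video encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_videos(text_emb, video_embs):
    """Return video indices sorted by descending cosine similarity to the query."""
    query = l2_normalize(text_emb)
    scored = []
    for idx, video_emb in enumerate(video_embs):
        video = l2_normalize(video_emb)
        similarity = sum(a * b for a, b in zip(query, video))
        scored.append((similarity, idx))
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored]


def recall_at_k(text_embs, video_embs, k):
    """Fraction of queries whose matching video (same index) is in the top k."""
    hits = 0
    for i, text_emb in enumerate(text_embs):
        if i in rank_videos(text_emb, video_embs)[:k]:
            hits += 1
    return hits / len(text_embs)


# Hypothetical toy data: caption i is assumed to describe video i.
video_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embs = [[0.9, 0.1, 0.0], [0.6, 0.5, 0.0], [0.0, 0.2, 0.8]]

print(rank_videos(text_embs[1], video_embs))
print(recall_at_k(text_embs, video_embs, k=1))
print(recall_at_k(text_embs, video_embs, k=2))
```

In this toy setup the second caption lies slightly closer to the first video than to its own, so it counts as a miss at k=1 and a hit at k=2. Because video embeddings can be computed once and stored, answering a new query needs only one text embedding plus a similarity search, which is one reason joint-embedding approaches are described as efficient for retrieval.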