Text to Video Retrieval
18 papers with code • 0 benchmarks • 0 datasets
Most implemented papers
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.
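The recipe this describes, a joint text-video embedding trained so that a clip and its own narration score higher than mismatched pairs, can be sketched in a few lines. The module below is a minimal, illustrative sketch of that general idea, not the paper's exact architecture: the feature dimensions, projection layers, and margin value are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch (illustrative assumptions throughout): two linear
# projections map precomputed video and narration features into a
# shared space; a max-margin ranking loss pushes each clip closer to
# its own narration than to the other narrations in the batch.

class JointEmbedding(nn.Module):
    def __init__(self, video_dim=4096, text_dim=300, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def max_margin_loss(v, t, margin=0.2):
    # Similarity matrix: diagonal entries are matching clip/narration pairs.
    sim = v @ t.T
    pos = sim.diag().unsqueeze(1)
    # Penalize negatives within `margin` of the positive score, in both
    # the text-to-video and video-to-text directions.
    cost = (margin + sim - pos).clamp(min=0) + (margin + sim - pos.T).clamp(min=0)
    cost = cost - torch.diag(cost.diag())  # ignore the diagonal itself
    return cost.mean()
```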
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Our objective in this work is video-text retrieval, in particular a joint embedding that enables efficient text-to-video retrieval.
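The efficiency argument is worth spelling out: with a joint embedding, video embeddings are computed once offline, and answering a text query reduces to a matrix-vector product plus a top-k sort. A hedged sketch under assumed shapes and names:

```python
import torch
import torch.nn.functional as F

# Precomputed, L2-normalized video embeddings (sizes are illustrative).
video_bank = F.normalize(torch.randn(10_000, 512), dim=-1)

def retrieve(query_emb, k=5):
    """Rank all videos against one text query embedding."""
    q = F.normalize(query_emb, dim=-1)
    scores = video_bank @ q          # cosine similarity to every video
    return scores.topk(k).indices    # ids of the top-k videos

top_ids = retrieve(torch.randn(512))
```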
End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Annotating videos is cumbersome, expensive and not scalable.
MDMMT: Multidomain Multimodal Transformer for Video Retrieval
We present a new state of the art for text-to-video retrieval on the MSRVTT and LSMDC benchmarks, where our model outperforms all previous solutions by a large margin.
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
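Multimodal contrastive losses of this kind generally follow an NCE-style template: aligned pairs from two modalities sit on the diagonal of a similarity matrix, and a symmetric cross-entropy pulls them together. The sketch below shows that general template, not VATT's exact loss or hyperparameters; the batch size, embedding dimension, and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric NCE-style contrastive loss between two modality batches."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature
    targets = torch.arange(a.size(0))   # matching pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Illustrative aligned batches of video, audio, and text embeddings.
video, audio, text = (torch.randn(32, 512) for _ in range(3))
loss = info_nce(video, audio) + info_nce(video, text)
```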
Revitalize Region Feature for Democratizing Video-Language Pre-training
Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language tasks.
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning
One of the key factors in enabling machine learning models to comprehend and solve real-world tasks is leveraging multimodal data.
Condensed Movies: Story Based Retrieval with Contextual Embeddings
Our objective in this work is long range understanding of the narrative structure of movies.
The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)
This report summarizes the results of the first edition of the challenge together with the findings of the participants.
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Experiments on text-to-video retrieval and video question answering across six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos. This suggests that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.
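The sparse-sampling idea itself is simple to illustrate: rather than encoding every frame of a video, draw a handful of short clips per training step and aggregate their predictions. The helper below is a minimal sketch of that sampling step only; the clip count, clip length, and frame rate are illustrative assumptions, not the paper's exact settings.

```python
import random

def sample_sparse_clips(num_frames, num_clips=4, clip_len=16):
    """Return frame indices for a few randomly placed short clips."""
    starts = sorted(random.randrange(num_frames - clip_len + 1)
                    for _ in range(num_clips))
    return [list(range(s, s + clip_len)) for s in starts]

# e.g. a ~2-minute video at 25 fps: 4 clips of 16 frames instead of 3000.
clips = sample_sparse_clips(num_frames=3000)
```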