Text to Video Retrieval
46 papers with code • 3 benchmarks • 6 datasets
Given a natural language query, find the most relevant video from a large set of candidate videos.
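A common approach to this task embeds the query and each candidate video into a shared vector space and ranks videos by similarity. The sketch below is a minimal, hypothetical illustration of that ranking step: the toy 4-dimensional vectors stand in for outputs of text and video encoders (which in practice would be learned models such as those listed on this page), and `retrieve` simply returns the indices of the top-k most similar videos by cosine similarity.

```python
import numpy as np

def cosine_sim(a, b):
    # Normalize rows so the dot product equals cosine similarity.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def retrieve(query_emb, video_embs, k=1):
    """Return indices of the top-k candidate videos most similar to the query."""
    sims = cosine_sim(query_emb[None, :], video_embs)[0]
    return np.argsort(-sims)[:k]

# Toy embeddings standing in for real encoder outputs (hypothetical values).
videos = np.array([
    [1.0, 0.0, 0.0, 0.0],  # video 0
    [0.0, 1.0, 0.0, 0.0],  # video 1
    [0.7, 0.7, 0.0, 0.1],  # video 2
])
query = np.array([0.0, 0.9, 0.1, 0.0])  # text query embedding

print(retrieve(query, videos, k=2))  # ranked indices of the two best matches
```

Retrieval quality then depends entirely on how well the encoders align the two modalities, which is the focus of the pre-training and representation-learning papers below.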
Libraries
Use these libraries to find Text to Video Retrieval models and implementations.
Most implemented papers
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
Vision language pre-training aims to learn alignments between vision and language from a large amount of data.
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning
One of the key factors of enabling machine learning models to comprehend and solve real-world tasks is to leverage multimodal data.
Condensed Movies: Story Based Retrieval with Contextual Embeddings
Our objective in this work is long range understanding of the narrative structure of movies.
Retrieving and Highlighting Action with Spatiotemporal Reference
In this paper, we present a framework that jointly retrieves and spatiotemporally highlights actions in videos by enhancing current deep cross-modal retrieval methods.
The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)
This report summarizes the results of the first edition of the challenge together with the findings of the participants.
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Experiments on text-to-video retrieval and video question answering across six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos. This suggests that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
Multimodal self-supervised learning is getting more and more attention as it allows not only to train large networks without human supervision but also to search and retrieve data across various modalities.
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization
To alleviate the temporal misalignment issue, our method incorporates an entropy minimization-based constrained attention loss that encourages the model to automatically focus on the correct caption from a pool of candidate ASR captions.
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation
Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task.
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
To enable VL pre-training, we jointly optimize the HD-VILA model by a hybrid Transformer that learns rich spatiotemporal features, and a multimodal Transformer that enforces interactions of the learned video features with diversified texts.