Text to Video Retrieval

46 papers with code • 3 benchmarks • 6 datasets

Given a natural language query, find the most relevant video from a large set of candidate videos.
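
In practice, this is typically done by embedding the query and each candidate video into a shared space and ranking candidates by similarity. Below is a minimal sketch using CLIP from the Hugging Face transformers library, with per-frame embeddings mean-pooled into a video embedding; the model choice and pooling strategy are illustrative assumptions, not the method of any particular paper listed below.

```python
# A minimal text-to-video retrieval sketch via a joint embedding space.
# Assumes CLIP from Hugging Face transformers and pre-decoded PIL frames;
# frame embeddings are mean-pooled into a single video embedding.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_video(frames: list[Image.Image]) -> torch.Tensor:
    """Mean-pool per-frame CLIP image embeddings into one unit-norm vector."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        frame_emb = model.get_image_features(**inputs)  # (num_frames, dim)
    video_emb = frame_emb.mean(dim=0)
    return video_emb / video_emb.norm()

def rank_videos(query: str, video_embs: torch.Tensor) -> torch.Tensor:
    """Return candidate indices sorted by cosine similarity to the query.

    video_embs: (num_videos, dim) stack of unit-norm video embeddings.
    """
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(**inputs)[0]
    text_emb = text_emb / text_emb.norm()
    scores = video_embs @ text_emb                      # (num_videos,)
    return scores.argsort(descending=True)
```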

Most implemented papers

X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

zengyan-97/x2-vlm 22 Nov 2022

Vision-language pre-training aims to learn alignments between vision and language from large amounts of data.
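
In most such models, "learning alignments" concretely means a symmetric contrastive (InfoNCE-style) objective over paired image and text embeddings. The sketch below shows that general recipe; it illustrates the common objective, not X$^2$-VLM's exact loss.

```python
# A hedged sketch of the symmetric contrastive alignment objective common
# in vision-language pre-training (illustrative, not X^2-VLM's exact loss).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb: torch.Tensor,
                               txt_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (batch, dim) embeddings of matched image-text pairs.

    Pairs sharing an index are positives; all other in-batch pairs are
    negatives. The loss is symmetric cross-entropy over the similarity matrix.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```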

Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning

elad-amrani/ssml 6 Mar 2020

A key factor in enabling machine learning models to comprehend and solve real-world tasks is leveraging multimodal data.

Condensed Movies: Story Based Retrieval with Contextual Embeddings

m-bain/CondensedMovies 8 May 2020

Our objective in this work is long-range understanding of the narrative structure of movies.

Retrieving and Highlighting Action with Spatiotemporal Reference

yiskw713/yiskw713 19 May 2020

In this paper, we present a framework that jointly retrieves and spatiotemporally highlights actions in videos by enhancing current deep cross-modal retrieval methods.

The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)

albanie/collaborative-experts 3 Aug 2020

This report summarizes the results of the first edition of the challenge together with the findings of the participants.

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

jayleicn/ClipBERT CVPR 2021

Experiments on text-to-video retrieval and video question answering across six datasets demonstrate that ClipBERT outperforms, or is on par with, existing methods that exploit full-length videos. This suggests that end-to-end learning from just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, bearing out the proverbial less-is-more principle.
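
The sparse-sampling idea reduces to: score a handful of short clips per video and aggregate, rather than encoding the full video. A hedged sketch follows, where `scorer` stands in for an arbitrary cross-modal relevance model (a hypothetical placeholder; ClipBERT's actual model is an end-to-end CNN-plus-Transformer encoder):

```python
# A hedged sketch of sparse clip sampling: score a few randomly sampled
# short clips and average, instead of processing the full-length video.
import random
import torch

def sparse_clip_score(scorer, query: str, frames: list,
                      num_clips: int = 4, clip_len: int = 8) -> torch.Tensor:
    """Average query-clip relevance over `num_clips` random short clips.

    scorer(query, clip) -> scalar tensor; `scorer` is a hypothetical
    stand-in for any cross-modal encoder.
    """
    scores = []
    for _ in range(num_clips):
        start = random.randint(0, max(0, len(frames) - clip_len))
        clip = frames[start:start + clip_len]
        scores.append(scorer(query, clip))
    return torch.stack(scores).mean()  # mean aggregation across clips
```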

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

brian7685/Multimodal-Clustering-Network ICCV 2021

Multimodal self-supervised learning is attracting growing attention, as it makes it possible not only to train large networks without human supervision but also to search for and retrieve data across modalities.

DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization

zinengtang/decembert NAACL 2021

To alleviate the temporal misalignment issue, our method incorporates an entropy-minimization-based constrained attention loss that encourages the model to automatically focus on the correct caption from a pool of candidate ASR captions.
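
Such a constraint can be expressed as an entropy penalty on the attention distribution over candidate captions, pushing the model to commit to a single caption. A minimal sketch of that term, as an illustration of the idea rather than DeCEMBERT's exact formulation:

```python
# A hedged sketch of an entropy-minimization attention constraint:
# low entropy forces the attention over candidate ASR captions to
# concentrate on one caption (illustrative, not the paper's exact loss).
import torch

def attention_entropy_loss(attn_weights: torch.Tensor) -> torch.Tensor:
    """attn_weights: (batch, num_candidate_captions), rows summing to 1.

    Returns the mean Shannon entropy of the rows; adding this term to the
    training loss encourages peaked (confident) caption selection.
    """
    eps = 1e-8  # numerical stability for log(0)
    entropy = -(attn_weights * (attn_weights + eps).log()).sum(dim=-1)
    return entropy.mean()
```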

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

VALUE-Leaderboard/StarterCode 8 Jun 2021

Most existing video-and-language (VidL) research focuses on a single dataset, or on multiple datasets for a single task.

Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

microsoft/xpretrain CVPR 2022

To enable VL pre-training, we jointly optimize the HD-VILA model with a hybrid Transformer that learns rich spatiotemporal features and a multimodal Transformer that enforces interactions between the learned video features and diversified texts.
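
Structurally, that amounts to a video encoder followed by a fusion encoder over concatenated video and text tokens. A hedged sketch with illustrative dimensions and layer counts (assumptions for illustration, not HD-VILA's actual configuration):

```python
# A hedged structural sketch of a two-stage video-language model: one
# Transformer for spatiotemporal video features, a second for cross-modal
# fusion. Sizes are illustrative, not HD-VILA's real configuration.
import torch
import torch.nn as nn

class TwoStageVideoLanguageModel(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, depth: int = 2):
        super().__init__()
        video_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        fusion_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_encoder = nn.TransformerEncoder(video_layer, depth)
        self.multimodal_encoder = nn.TransformerEncoder(fusion_layer, depth)

    def forward(self, video_tokens: torch.Tensor,
                text_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, num_patches, dim); text_tokens: (batch, seq, dim)
        video_feats = self.video_encoder(video_tokens)        # spatiotemporal
        fused = torch.cat([video_feats, text_tokens], dim=1)  # joint sequence
        return self.multimodal_encoder(fused)                 # cross-modal fusion
```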