Text to Video Retrieval
46 papers with code • 3 benchmarks • 6 datasets
Given a natural language query, find the most relevant video from a large set of candidate videos.
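A common approach to this task embeds the query and each candidate video into a shared vector space and ranks videos by similarity. The sketch below is a minimal, hypothetical illustration of that ranking step: the toy 4-dimensional vectors stand in for outputs of text and video encoders (which in practice would be learned models such as those listed on this page), and `retrieve` simply returns the indices of the top-k most similar videos by cosine similarity.

```python
import numpy as np

def cosine_sim(a, b):
    # Normalize rows so the dot product equals cosine similarity.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def retrieve(query_emb, video_embs, k=1):
    """Return indices of the top-k candidate videos most similar to the query."""
    sims = cosine_sim(query_emb[None, :], video_embs)[0]
    return np.argsort(-sims)[:k]

# Toy embeddings standing in for real encoder outputs (hypothetical values).
videos = np.array([
    [1.0, 0.0, 0.0, 0.0],  # video 0
    [0.0, 1.0, 0.0, 0.0],  # video 1
    [0.7, 0.7, 0.0, 0.1],  # video 2
])
query = np.array([0.0, 0.9, 0.1, 0.0])  # text query embedding

print(retrieve(query, videos, k=2))  # ranked indices of the two best matches
```

Retrieval quality then depends entirely on how well the encoders align the two modalities, which is the focus of the pre-training and representation-learning papers below.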
Libraries
Use these libraries to find Text to Video Retrieval models and implementations.
Most implemented papers
X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
Vision language pre-training aims to learn alignments between vision and language from a large amount of data.
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning
One of the key factors of enabling machine learning models to comprehend and solve real-world tasks is to leverage multimodal data.
Condensed Movies: Story Based Retrieval with Contextual Embeddings
Our objective in this work is long range understanding of the narrative structure of movies.
Retrieving and Highlighting Action with Spatiotemporal Reference
In this paper, we present a framework that jointly retrieves and spatiotemporally highlights actions in videos by enhancing current deep cross-modal retrieval methods.
The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)
This report summarizes the results of the first edition of the challenge together with the findings of the participants.
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
Experiments on text-to-video retrieval and video question answering across six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos. This suggests that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
Multimodal self-supervised learning is getting more and more attention as it allows not only to train large networks without human supervision but also to search and retrieve data across various modalities.
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization
To alleviate the temporal misalignment issue, our method incorporates an entropy minimization-based constrained attention loss that encourages the model to automatically focus on the correct caption from a pool of candidate ASR captions.
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation
Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task.
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
To enable VL pre-training, we jointly optimize the HD-VILA model by a hybrid Transformer that learns rich spatiotemporal features, and a multimodal Transformer that enforces interactions of the learned video features with diversified texts.