Video Retrieval

220 papers with code • 18 benchmarks • 31 datasets

Video retrieval takes a text query and a pool of candidate videos and selects the video that corresponds to the query. Typically, the candidates are returned as a ranked list and evaluated with document retrieval metrics.
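As a rough illustration of how a ranked list might be scored, the sketch below computes Recall@K and median rank from a query-by-video score matrix. The choice of these two metrics, the function names, and the toy scores are assumptions made for the example; the page itself only says "document retrieval metrics", and individual benchmarks may define their protocols differently.

```python
from statistics import median


def rank_of_match(scores, target_index):
    """Return the 1-based rank of the correct video (higher score = better)."""
    target = scores[target_index]
    # Count candidates scored strictly higher than the correct one.
    return 1 + sum(1 for s in scores if s > target)


def retrieval_metrics(similarity, ks=(1, 5, 10)):
    """similarity[i][j] is the score of video j for query i.

    Assumption for this sketch: video i is the single correct match for query i.
    """
    ranks = [rank_of_match(row, i) for i, row in enumerate(similarity)]
    metrics = {f"R@{k}": sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["MedR"] = median(ranks)
    return metrics


# Toy 3-query x 3-video score matrix (made-up numbers).
similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
metrics = retrieval_metrics(similarity, ks=(1, 2))
print(metrics)
```

Here the second query ranks its correct video second, so Recall@1 is 2/3 while Recall@2 is 1.0.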

Benchmarks

These leaderboards are used to track progress in Video Retrieval.

Dataset	Best Model
MSR-VTT-1kA	HunYuan_tvr (huge)
LSMDC	InternVideo2-6B
MSR-VTT	VAST
DiDeMo	InternVideo2-6B
ActivityNet	InternVideo2-6B
MSVD	InternVideo2-6B
FIVR-200K	S2VS
YouCook2	VAST
VATEX	VAST
QuerYD	QB-Norm+TT-CE+
SSv2-label retrieval	UMT-L (ViT-L/16)
SSv2-template retrieval	UMT-L (ViT-L/16)
Condensed Movies	TESTA (ViT-B/16)
TVR	Hero w/ pre-training
TGIF	MDMMT-2
RUDDER	PO Loss
Charades-STA	PO Loss
MSVD-Indonesian	X-CLIP (Cross-Lingual)

Libraries

Use these libraries to find Video Retrieval models and implementations.

towhee-io/towhee • 5 papers • 2,986
jpthu17/diffusionret • 4 papers
albanie/collaborative-experts • 3 papers • 327
pytorch/fairseq • 2 papers • 29,233

See all 5 libraries.

Datasets

Subtasks

Replay Grounding

Composed Video Retrieval (CoVR)

Most implemented papers

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

alibaba/AliceMind • 1 Feb 2023

In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.

DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

jpthu17/diffusionret • ICCV 2023

Existing text-video retrieval solutions are, in essence, discriminant models focused on maximizing the conditional likelihood, i.e., p(candidates|query).
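To make the conditional likelihood p(candidates|query) concrete, here is a minimal sketch that turns one query's candidate scores into a softmax distribution and takes the negative log-likelihood of the correct candidate. This illustrates the general discriminative setup the abstract describes, not DiffusionRet's own method; the function names, the `temperature` parameter, and the scores are hypothetical.

```python
import math


def candidate_log_probs(scores, temperature=1.0):
    """Log-softmax over candidate scores, i.e. log p(candidate | query)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def negative_log_likelihood(scores, correct_index, temperature=1.0):
    """Loss that a discriminative retriever would minimize for one query."""
    return -candidate_log_probs(scores, temperature)[correct_index]


# Made-up similarity scores between one query and three candidate videos.
scores = [2.0, 0.5, -1.0]
probs = [math.exp(lp) for lp in candidate_log_probs(scores)]
loss = negative_log_likelihood(scores, correct_index=0)
print(probs, loss)
```

Minimizing this loss raises the score of the correct candidate relative to the others, which is what "maximizing the conditional likelihood" amounts to in this simplified setting.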

Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

jpthu17/HBI • CVPR 2023

Contrastive learning-based video-language representation learning approaches, e.g., CLIP, have achieved outstanding performance, which pursue semantic interaction upon pre-defined video-text pairs.

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment

jpthu17/dicosa • 20 May 2023

In this paper, we propose the Disentangled Conceptualization and Set-to-set Alignment (DiCoSA) to simulate the conceptualizing and reasoning process of human beings.

IMAE for Noise-Robust Learning: Mean Absolute Error Does Not Treat Examples Equally and Gradient Magnitude's Variance Matters

XinshaoAmosWang/Improving-Mean-Absolute-Error-against-CCE • 28 Mar 2019

In this work, we study robust deep learning against abnormal training data from the perspective of example weighting built in empirical loss functions, i.e., gradient magnitude with respect to logits, an angle that is not thoroughly studied so far.
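The idea of example weighting through gradient magnitude can be checked numerically. The sketch below computes the L1 norm of the gradient with respect to the logits for categorical cross-entropy (CCE) and for mean absolute error (MAE) between the softmax output and a one-hot label. It assumes these standard loss definitions, which may differ in detail from the paper's formulation, and it does not implement IMAE itself; all function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_l1_norm_cce(logits, label):
    """L1 norm of d(-log p_y)/dz, where d/dz_j = p_j - 1[j == y]."""
    p = softmax(logits)
    return sum(abs(pj - (1.0 if j == label else 0.0)) for j, pj in enumerate(p))


def grad_l1_norm_mae(logits, label):
    """L1 norm of d(sum_j |p_j - onehot_j|)/dz.

    The loss equals 2 * (1 - p_y), so d/dz_j = -2 * p_y * (1[j == y] - p_j).
    """
    p = softmax(logits)
    py = p[label]
    return sum(
        abs(-2.0 * py * ((1.0 if j == label else 0.0) - pj))
        for j, pj in enumerate(p)
    )


# A confidently correct example and a low-probability ("hard") example.
easy_logits, hard_logits, label = [3.0, 0.0, 0.0], [-3.0, 0.0, 0.0], 0
p_easy = softmax(easy_logits)[label]
p_hard = softmax(hard_logits)[label]
print(grad_l1_norm_cce(hard_logits, label), grad_l1_norm_mae(hard_logits, label))
```

Under these definitions the CCE gradient norm works out to 2(1 - p_y) and the MAE gradient norm to 4 p_y (1 - p_y), so MAE gives very little weight to examples whose labeled class has low predicted probability, while CCE weights them most heavily.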

Use What You Have: Video Retrieval Using Representations From Collaborative Experts

albanie/collaborative-experts • 31 Jul 2019

The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge.

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

linjieli222/HERO • EMNLP 2020

We present HERO, a novel framework for large-scale video+language omni-representation learning.

On Semantic Similarity in Video Retrieval

mwray/Semantic-Video-Retrieval • CVPR 2021

Current video retrieval efforts all found their evaluation on an instance-based assumption, that only a single caption is relevant to a query video and vice versa.

MDMMT: Multidomain Multimodal Transformer for Video Retrieval

papermsucode/mdmmt • 19 Mar 2021

We present a new state-of-the-art on the text to video retrieval task on MSRVTT and LSMDC benchmarks where our model outperforms all previous solutions by a large margin.

Learning from Video and Text via Large-Scale Discriminative Clustering

jpeyre/unrel • ICCV 2017

Discriminative clustering has been successfully applied to a number of weakly-supervised learning tasks.
