Video Retrieval
220 papers with code • 18 benchmarks • 31 datasets
Given a text query and a pool of candidate videos, video retrieval aims to select the video that corresponds to the query. Typically, the candidates are returned as a ranked list and scored with document retrieval metrics.
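As a rough illustration of the ranking-and-scoring setup described above, the following minimal sketch ranks candidate videos for each query and computes two commonly reported retrieval metrics, Recall@K and median rank. It assumes that queries and videos have already been mapped to vectors in a shared embedding space and that similarity is measured by cosine similarity; neither assumption comes from this page. The toy embeddings and the helper names (`rank_videos`, `recall_at_k`, `median_rank`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_videos(query_vec, video_vecs):
    """Return video indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, v) for v in video_vecs]
    return sorted(range(len(video_vecs)), key=lambda i: scores[i], reverse=True)

def recall_at_k(rankings, targets, k):
    """Fraction of queries whose ground-truth video appears in the top k."""
    hits = sum(1 for ranking, target in zip(rankings, targets) if target in ranking[:k])
    return hits / len(targets)

def median_rank(rankings, targets):
    """Median 1-based rank of the ground-truth video across all queries."""
    ranks = sorted(r.index(t) + 1 for r, t in zip(rankings, targets))
    mid = len(ranks) // 2
    return ranks[mid] if len(ranks) % 2 else (ranks[mid - 1] + ranks[mid]) / 2

# Toy, made-up embeddings: three candidate videos and three text queries.
videos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9], [0.2, 1.0]]
targets = [0, 1, 2]  # index of the correct video for each query

rankings = [rank_videos(q, videos) for q in queries]
print("Recall@1:", recall_at_k(rankings, targets, 1))
print("Recall@2:", recall_at_k(rankings, targets, 2))
print("Median rank:", median_rank(rankings, targets))
```

In this toy example the third query ranks its ground-truth video second, so Recall@1 is 2/3 while Recall@2 is 1.0. Real systems differ mainly in how the embeddings and similarity scores are produced, not in how the ranked list is evaluated.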
Most implemented papers
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
Existing text-video retrieval solutions are, in essence, discriminant models focused on maximizing the conditional likelihood, i.e., p(candidates|query).
Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Contrastive learning-based video-language representation learning approaches, e.g., CLIP, have achieved outstanding performance, which pursue semantic interaction upon pre-defined video-text pairs.
Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment
In this paper, we propose the Disentangled Conceptualization and Set-to-set Alignment (DiCoSA) to simulate the conceptualizing and reasoning process of human beings.
IMAE for Noise-Robust Learning: Mean Absolute Error Does Not Treat Examples Equally and Gradient Magnitude's Variance Matters
In this work, we study robust deep learning against abnormal training data from the perspective of example weighting built in empirical loss functions, i.e., gradient magnitude with respect to logits, an angle that is not thoroughly studied so far.
Use What You Have: Video Retrieval Using Representations From Collaborative Experts
The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge.
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
We present HERO, a novel framework for large-scale video+language omni-representation learning.
On Semantic Similarity in Video Retrieval
Current video retrieval efforts all found their evaluation on an instance-based assumption, that only a single caption is relevant to a query video and vice versa.
MDMMT: Multidomain Multimodal Transformer for Video Retrieval
We present a new state-of-the-art on the text to video retrieval task on MSRVTT and LSMDC benchmarks where our model outperforms all previous solutions by a large margin.
Learning from Video and Text via Large-Scale Discriminative Clustering
Discriminative clustering has been successfully applied to a number of weakly-supervised learning tasks.