Video to Text Retrieval
7 papers with code • 2 benchmarks • 2 datasets
Most implemented papers
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets.
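Retrieval results on benchmarks like MSR-VTT are conventionally reported as Recall@K. The sketch below shows the standard computation, assuming a similarity matrix where caption i is the ground-truth match for video i; it is a generic illustration, not this paper's evaluation code.

```python
# Hypothetical sketch: standard Recall@K for video-to-text retrieval.
# `sims` is an (N_videos, N_texts) similarity matrix; ground truth
# assumes text i describes video i.
import numpy as np

def recall_at_k(sims: np.ndarray, k: int) -> float:
    """Fraction of videos whose matching caption ranks in the top-k."""
    ranks = (-sims).argsort(axis=1)           # captions sorted by descending score
    gt = np.arange(sims.shape[0])[:, None]    # ground-truth caption index per video
    hit = (ranks[:, :k] == gt).any(axis=1)
    return float(hit.mean())

# Toy usage with random scores:
sims = np.random.randn(100, 100)
print({f"R@{k}": recall_at_k(sims, k) for k in (1, 5, 10)})
```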
Bridging Video-text Retrieval with Multiple Choice Questions
As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.

CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
We present the CLIP2Video network to transfer the image-language pre-training model to video-text retrieval in an end-to-end manner.
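A minimal sketch of the underlying idea, transferring an image CLIP model to video by pooling per-frame features into a video embedding. CLIP2Video adds temporal modeling blocks on top of this; the Hugging Face checkpoint name and the stand-in frames here are illustrative assumptions, not the paper's setup.

```python
# Sketch only: mean-pooled per-frame CLIP features as a video embedding,
# scored against candidate captions by cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frames = [Image.new("RGB", (224, 224)) for _ in range(8)]  # stand-in for sampled frames
captions = ["a man plays guitar", "a dog runs on the beach"]

with torch.no_grad():
    pix = processor(images=frames, return_tensors="pt")
    frame_feats = model.get_image_features(**pix)        # (8, 512) per-frame features
    video_feat = frame_feats.mean(dim=0, keepdim=True)   # temporal mean pooling
    txt = processor(text=captions, return_tensors="pt", padding=True)
    text_feats = model.get_text_features(**txt)          # (2, 512) caption features

video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
print(video_feat @ text_feats.T)                         # retrieval scores
```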
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on.
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations, but detailed local semantics are ignored.
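For reference, the dual-encoder contrast described above typically reduces to a symmetric InfoNCE loss over global embeddings; the sketch below shows that baseline objective (which MILES augments), with random tensors standing in for the encoder outputs.

```python
# Sketch of the "dual-encoder" contrastive objective: two separate encoders
# produce global video/text embeddings, trained with symmetric InfoNCE.
import torch
import torch.nn.functional as F

def dual_encoder_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched (video, text) pairs."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # (B, B) similarity matrix
    targets = torch.arange(v.size(0))         # row i matches column i
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy usage: random "global" embeddings standing in for encoder outputs.
loss = dual_encoder_loss(torch.randn(32, 256), torch.randn(32, 256))
print(loss.item())
```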
MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian
Since pretraining resources with Indonesian sentences are relatively limited, the applicability of those approaches to our dataset remains questionable.
Prototype-based Aleatoric Uncertainty Quantification for Cross-modal Retrieval
In this paper, we propose a novel Prototype-based Aleatoric Uncertainty Quantification (PAU) framework to provide trustworthy predictions by quantifying the uncertainty arising from inherent data ambiguity.
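One plausible reading of prototype-based uncertainty, sketched below, scores an embedding by how diffusely it matches a set of learnable prototypes (higher assignment entropy suggesting a more ambiguous input). This is a hedged illustration of the general idea, not PAU's exact formulation; the class and parameter names are hypothetical.

```python
# Hedged illustration, not the paper's method: normalized entropy of an
# embedding's soft assignment over learnable prototypes as an uncertainty score.
import torch
import torch.nn.functional as F

class PrototypeUncertainty(torch.nn.Module):
    def __init__(self, num_prototypes: int = 16, dim: int = 256):
        super().__init__()
        self.prototypes = torch.nn.Parameter(torch.randn(num_prototypes, dim))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        """Return a per-sample uncertainty score in [0, 1]."""
        sims = F.normalize(emb, dim=-1) @ F.normalize(self.prototypes, dim=-1).T
        probs = sims.softmax(dim=-1)          # soft assignment over prototypes
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
        return entropy / torch.log(torch.tensor(float(probs.size(-1))))

# Toy usage: score a batch of cross-modal embeddings.
scorer = PrototypeUncertainty()
print(scorer(torch.randn(4, 256)))
```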