Zero-Shot Video Retrieval

30 papers with code • 8 benchmarks • 7 datasets

This task has no description! Would you like to contribute one?

Benchmarks

Add a Result

These leaderboards are used to track progress in Zero-Shot Video Retrieval

Dataset	Best Model	Compare
MSR-VTT	InternVideo2-6B	See all
DiDeMo	InternVideo2-6B	See all
LSMDC	InternVideo2-6B	See all
MSVD	InternVideo2-6B	See all
ActivityNet	InternVideo2-6B	See all
YouCook2	Norton	See all
VATEX	InternVideo2-6B	See all
MSR-VTT-full	InternVL-G	See all

Libraries

Use these libraries to find Zero-Shot Video Retrieval models and implementations

towhee-io/towhee

3 papers

2,987

Datasets

Most implemented papers

Most implemented Social Latest No code

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

m-bain/frozen-in-time • • ICCV 2021

Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval.

Paper
Code

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

ArrowLuo/CLIP4Clip • • 18 Apr 2021

In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.

Paper
Code

End-to-End Learning of Visual Representations from Uncurated Instructional Videos

antoine77340/MIL-NCE_HowTo100M • • CVPR 2020

Annotating videos is cumbersome, expensive and not scalable.

Paper
Code

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

alibaba/AliceMind • • 1 Feb 2023

In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.

Paper
Code

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

pku-yuangroup/languagebind • • 3 Oct 2023

We thus propose VIDAL-10M with Video, Infrared, Depth, Audio and their corresponding Language, naming as VIDAL-10M.

Paper
Code

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

google-research/google-research • • NeurIPS 2021

We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.

Paper
Code

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

pytorch/fairseq • • EMNLP 2021

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.

Paper
Code

Bridging Video-text Retrieval with Multiple Choice Questions

tencentarc/mcq • • CVPR 2022

As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e. g., action recognition with linear evaluation.

Paper
Code

Revealing Single Frame Bias for Video-and-Language Learning

jayleicn/singularity • • 7 Jun 2022

Training an effective video-and-language model intuitively requires multiple frames as model inputs.

Paper
Code

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

opengvlab/internvl • • 21 Dec 2023

However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.

Paper
Code

Zero-Shot Video Retrieval

Benchmarks Add a Result

Libraries

Datasets

Most implemented papers

Content

Benchmarks

Add a Result