Zero-Shot Video Retrieval
6 papers with code • 7 benchmarks • 6 datasets
Most implemented papers
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
Our objective in this work is video-text retrieval, in particular a joint embedding that enables efficient text-to-video retrieval.
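A minimal sketch of how retrieval works once text and video live in a joint embedding space: the query is embedded once and compared against precomputed video embeddings by cosine similarity. The embeddings below are random stand-ins for encoder outputs, not the Frozen-in-Time model.

```python
# Text-to-video retrieval in a shared embedding space (illustrative only).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_videos, dim = 1000, 256

# Assume video embeddings were precomputed offline by a video encoder.
video_embeds = F.normalize(torch.randn(num_videos, dim), dim=-1)

# A text query is embedded once at search time by a text encoder.
query_embed = F.normalize(torch.randn(1, dim), dim=-1)

# Retrieval is a single matrix multiply: cosine similarity against all videos.
scores = query_embed @ video_embeds.T           # shape: (1, num_videos)
top_scores, top_idx = scores.topk(k=5, dim=-1)  # indices of the 5 best-matching videos
print(top_idx.tolist(), top_scores.tolist())
```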
Revealing Single Frame Bias for Video-and-Language Learning
Training an effective video-and-language model intuitively requires multiple frames as model inputs.
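A small sketch of the single-frame idea, assuming uniform temporal sampling: train on a single sampled frame per clip and optionally sample more densely at test time. The `sample_frames` helper is illustrative, not taken from the paper's code.

```python
# Single-frame vs. multi-frame sampling from a video clip (illustrative helper).
import torch

def sample_frames(video: torch.Tensor, num_frames: int) -> torch.Tensor:
    """video: (T, C, H, W) -> uniformly sampled (num_frames, C, H, W)."""
    t = video.shape[0]
    idx = torch.linspace(0, t - 1, num_frames).long()
    return video[idx]

video = torch.randn(64, 3, 224, 224)   # a 64-frame clip
single = sample_frames(video, 1)        # single-frame input, e.g. at training time
multi = sample_frames(video, 8)         # denser sampling, e.g. at test time
print(single.shape, multi.shape)
```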
Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval
Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a fused representation in a joint multi-modal embedding space.
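A hedged sketch of what such a modality-agnostic fusion transformer can look like: token sequences from video, audio, and text receive modality embeddings, are concatenated, and pass through one shared encoder before being pooled into a fused embedding. This uses stock PyTorch modules and random tokens as an assumption, not the authors' implementation.

```python
# Modality-agnostic fusion over concatenated video/audio/text tokens (sketch).
import torch
import torch.nn as nn

dim = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
modality_embed = nn.Embedding(3, dim)  # 0 = video, 1 = audio, 2 = text

video_tok = torch.randn(1, 16, dim)
audio_tok = torch.randn(1, 8, dim)
text_tok = torch.randn(1, 12, dim)

# Tag each token sequence with its modality, then let one encoder fuse them all.
tokens = torch.cat([
    video_tok + modality_embed(torch.tensor([0])),
    audio_tok + modality_embed(torch.tensor([1])),
    text_tok + modality_embed(torch.tensor([2])),
], dim=1)

fused = encoder(tokens).mean(dim=1)  # (1, dim) fused multi-modal embedding
print(fused.shape)
```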
Clover: Towards A Unified Video-Language Alignment and Fusion Model
We then introduce Clover, a Correlated Video-Language pre-training method, towards a universal Video-Language model that solves multiple video understanding tasks without compromising performance or efficiency.
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.
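One of the two objectives named above, video-language contrastive learning, can be sketched as a symmetric InfoNCE-style loss over a batch of paired video and text embeddings. The embeddings and temperature below are placeholders, not InternVideo's actual configuration.

```python
# Symmetric video-text contrastive loss over in-batch pairs (illustrative).
import torch
import torch.nn.functional as F

batch, dim, temperature = 8, 256, 0.07
video_embeds = F.normalize(torch.randn(batch, dim), dim=-1)
text_embeds = F.normalize(torch.randn(batch, dim), dim=-1)

logits = video_embeds @ text_embeds.T / temperature   # (batch, batch) similarities
targets = torch.arange(batch)                          # matched pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) +             # video -> text direction
        F.cross_entropy(logits.T, targets)) / 2        # text -> video direction
print(loss.item())
```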