Video Retrieval

212 papers with code • 18 benchmarks • 31 datasets

The objective of video retrieval is: given a text query and a pool of candidate videos, select the video that corresponds to the text query. Typically, the candidates are returned as a ranked list and scored with document retrieval metrics such as recall@K and median rank.
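A minimal sketch of this setup, assuming precomputed query and video embeddings (the function names and cosine-similarity scoring are illustrative, not tied to any particular paper below):

```python
import numpy as np

def rank_videos(query_emb, video_embs):
    """Rank candidate videos by cosine similarity to a text query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    scores = v @ q                      # one similarity score per candidate video
    return np.argsort(-scores)          # video indices, best match first

def recall_at_k(rankings, ground_truth, k):
    """Fraction of queries whose correct video appears in the top-k of the ranked list."""
    hits = [gt in r[:k] for r, gt in zip(rankings, ground_truth)]
    return float(np.mean(hits))
```

Median rank, the other common metric, is simply the median position of the correct video across all queries.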


Most implemented papers

ECO: Efficient Convolutional Network for Online Video Understanding

mzolfaghari/ECO-efficient-video-understanding ECCV 2018

In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

antoine77340/Mixture-of-Embedding-Experts 7 Apr 2018

We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets.

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

m-bain/frozen-in-time ICCV 2021

Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval.

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

ArrowLuo/CLIP4Clip 18 Apr 2021

In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.
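A rough sketch of the parameter-free mean-pooling variant of this transfer, assuming the open_clip interface (the ArrowLuo/CLIP4Clip repo ships its own CLIP wrapper, and the frame sampling and learned pooling heads studied in the paper are not shown):

```python
import torch
import open_clip  # assumption: using open_clip's CLIP interface rather than the repo's own wrapper

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def video_embedding(frames):
    """Mean-pool per-frame CLIP image features into a single video embedding.
    `frames` is a (num_frames, 3, H, W) tensor of preprocessed frames (sampling not shown)."""
    frame_feats = model.encode_image(frames)                          # (num_frames, dim)
    frame_feats = frame_feats / frame_feats.norm(dim=-1, keepdim=True)
    return frame_feats.mean(dim=0)                                    # parameter-free temporal pooling

@torch.no_grad()
def text_embedding(caption):
    tokens = tokenizer([caption])
    feats = model.encode_text(tokens)
    return (feats / feats.norm(dim=-1, keepdim=True)).squeeze(0)
```

The retrieval score is then just the cosine similarity between the pooled video embedding and the text embedding.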

CoCa: Contrastive Captioners are Image-Text Foundation Models

mlfoundations/open_clip 4 May 2022

We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively.
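A minimal sketch of that combined objective, with illustrative tensor shapes and an assumed loss weight (the encoders and the multimodal decoder themselves are not shown):

```python
import torch
import torch.nn.functional as F

def coca_style_loss(image_emb, text_emb, caption_logits, caption_tokens,
                    temperature=0.07, caption_weight=2.0):
    """Contrastive loss between unimodal image/text embeddings plus an
    autoregressive captioning loss on the multimodal decoder outputs.
    Shapes: image_emb/text_emb (B, D); caption_logits (B, T, V); caption_tokens (B, T)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2    # symmetric image<->text loss
    captioning = F.cross_entropy(caption_logits.reshape(-1, caption_logits.size(-1)),
                                 caption_tokens.reshape(-1))    # next-token prediction loss
    return contrastive + caption_weight * captioning
```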

Dense-Captioning Events in Videos

sangminwoo/explore-and-match ICCV 2017

We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events.

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

antoine77340/MIL-NCE_HowTo100M ICCV 2019

In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

jpthu17/emcl 21 Nov 2022

Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs.
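The shared-latent-space baseline described here usually amounts to a pair of projection heads trained with the symmetric contrastive objective sketched for CoCa above; a schematic example with illustrative dimensions (EMCL's expectation-maximization step itself is not shown):

```python
import torch.nn as nn
import torch.nn.functional as F

class DualProjection(nn.Module):
    """Project video and text features into a shared latent space (CLIP-style baseline)."""
    def __init__(self, video_dim=768, text_dim=512, latent_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)   # (B, latent_dim)
        t = F.normalize(self.text_proj(text_feats), dim=-1)     # (B, latent_dim)
        return v, t   # matched pairs are pulled together by the contrastive loss
```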

Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?

whwu95/Cap4Video CVPR 2023

Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences.