Video Retrieval

178 papers with code • 15 benchmarks • 27 datasets

The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the text query. Typically, the candidates are returned as a ranked list, and the ranking is evaluated with document retrieval metrics.
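The ranking-and-scoring setup described above can be illustrated with a minimal sketch. It assumes that text queries and videos have already been embedded into a shared vector space, that similarity is measured with cosine similarity, and that the ranking is scored with Recall@K and median rank. These metric choices are assumptions for illustration (the description above only says "document retrieval metrics"), and the function names `cosine`, `rank_videos`, and `evaluate` are hypothetical, not taken from any of the listed repositories.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_videos(query_vec, video_vecs):
    """Return video indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, v) for v in video_vecs]
    return sorted(range(len(video_vecs)), key=lambda i: scores[i], reverse=True)


def evaluate(query_vecs, video_vecs, gt, ks=(1, 5, 10)):
    """Compute Recall@K and median rank.

    gt[i] is the index of the correct video for query i (assumed: exactly
    one correct video per query).
    """
    ranks = []
    for q, target in zip(query_vecs, gt):
        ranking = rank_videos(q, video_vecs)
        ranks.append(ranking.index(target) + 1)  # 1-based rank of the correct video
    metrics = {f"R@{k}": sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["MedR"] = median(ranks)
    return metrics


# Toy data (made up for illustration): three videos and three queries
# in a shared 2-D embedding space.
videos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]
ground_truth = [0, 1, 2]  # query i should retrieve video ground_truth[i]

results = evaluate(queries, videos, ground_truth, ks=(1, 2))
print(results)
```

In this toy example the first two queries rank their correct video first, while the third ranks it second, so Recall@1 is 2/3, Recall@2 is 1.0, and the median rank is 1. Real systems replace the hand-written vectors with the outputs of learned text and video encoders.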



Most implemented papers

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

antoine77340/Mixture-of-Embedding-Experts 7 Apr 2018

We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets.

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

m-bain/frozen-in-time ICCV 2021

Our objective in this work is video-text retrieval - in particular a joint embedding that enables efficient text-to-video retrieval.

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

ArrowLuo/CLIP4Clip 18 Apr 2021

In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.

Dense-Captioning Events in Videos

sangminwoo/explore-and-match ICCV 2017

We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events.

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

antoine77340/MIL-NCE_HowTo100M ICCV 2019

In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

alibaba/AliceMind 1 Feb 2023

In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement.

ECO: Efficient Convolutional Network for Online Video Understanding

mzolfaghari/ECO-efficient-video-understanding ECCV 2018

In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.

IMAE for Noise-Robust Learning: Mean Absolute Error Does Not Treat Examples Equally and Gradient Magnitude's Variance Matters

XinshaoAmosWang/Improving-Mean-Absolute-Error-against-CCE 28 Mar 2019

In this work, we study robust deep learning against abnormal training data from the perspective of example weighting built in empirical loss functions, i.e., gradient magnitude with respect to logits, an angle that is not thoroughly studied so far.

Use What You Have: Video Retrieval Using Representations From Collaborative Experts

albanie/collaborative-experts 31 Jul 2019

The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge.