The objective of video retrieval is as follows: given a text query and a pool of candidate videos, select the video that corresponds to the text query. Typically, the candidate videos are returned as a ranked list and the ranking is evaluated with document retrieval metrics.
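
As a rough illustration of how a ranked list can be scored, the sketch below computes two metrics that are commonly used for this task, Recall@K and median rank, from a query-by-video score matrix. The function names, the toy scores, and the assumption that the correct video for query `i` is video `i` are all illustrative and not taken from any particular paper listed here.

```python
from statistics import median


def retrieval_ranks(sim):
    """Return the 1-based rank of the correct video for each text query.

    sim[i][j] is the score of query i against video j. Assumption for this
    sketch: the ground-truth video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        correct = row[i]
        # Rank = 1 + number of videos scored strictly higher than the correct one.
        ranks.append(1 + sum(1 for s in row if s > correct))
    return ranks


def recall_at_k(ranks, k):
    """Fraction of queries whose correct video appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy scores for 3 queries and 3 videos (made-up numbers).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
ranks = retrieval_ranks(sim)
r_at_1 = recall_at_k(ranks, 1)
r_at_2 = recall_at_k(ranks, 2)
med_r = median(ranks)
```

Higher Recall@K and lower median rank indicate better retrieval; ties are resolved optimistically here, which is one of several possible conventions.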


We evaluate our method on the task of video retrieval and report results for the MPII Movie Description and MSR-VTT datasets.

We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events.

In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.

Our objective in this work is video-text retrieval, in particular a joint embedding that enables efficient text-to-video retrieval.
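
One way to read the efficiency argument: if text and video are mapped into the same vector space, video embeddings can be computed once and a query then only needs one similarity computation per candidate. The sketch below ranks videos by cosine similarity under that assumption. The embeddings are made-up two-dimensional vectors and `rank_videos` is a hypothetical helper, not the method of any paper listed here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_videos(text_emb, video_embs):
    """Return video indices sorted from most to least similar to the text query."""
    q = l2_normalize(text_emb)
    scored = []
    for idx, v in enumerate(video_embs):
        v = l2_normalize(v)
        score = sum(a * b for a, b in zip(q, v))
        scored.append((score, idx))
    # Sort by descending score; break ties by the lower video index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored]


# Hypothetical precomputed video embeddings and one text query embedding.
video_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
order = rank_videos([2.0, 0.1], video_embs)
```

In practice the video embeddings would be normalized once and stored, so only the query side is computed at search time.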

In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time.

In this work, we study robust deep learning against abnormal training data from the perspective of example weighting built into empirical loss functions, i.e., gradient magnitude with respect to logits, an angle that has not been thoroughly studied so far.

The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge.

In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.

Discriminative clustering has been successfully applied to a number of weakly-supervised learning tasks.