The WebVid-CoVR dataset is a collection of video-text-video triplets for composed video retrieval (CoVR). In CoVR, the goal is to retrieve videos that match both a query image and a query text, where the text typically specifies the desired modification to the query image.
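
To make the task concrete, here is a minimal sketch of how a triplet and a retrieval step might be represented. The names `Triplet`, `cosine`, and `rank_videos` are hypothetical, the field layout is an assumption rather than the dataset's actual schema, and the embeddings are toy vectors standing in for whatever a real CoVR model would produce from the query image and text.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Triplet:
    """One video-text-video triplet (hypothetical field names)."""
    query_video: str        # source of the query image
    modification_text: str  # describes the desired change
    target_video: str       # video that should be retrieved


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_videos(query_embedding, gallery):
    """Return gallery video ids sorted from most to least similar to the query."""
    return sorted(
        gallery,
        key=lambda video_id: cosine(query_embedding, gallery[video_id]),
        reverse=True,
    )


# Toy example: the query embedding is assumed to already combine image and text.
triplet = Triplet("video_a", "make it snow", "video_b")
gallery = {
    "video_a": [1.0, 0.0],
    "video_b": [0.0, 1.0],
    "video_c": [-1.0, 0.0],
}
query_embedding = [0.1, 0.9]
ranking = rank_videos(query_embedding, gallery)
```

In this toy setup the target `video_b` ranks first because the assumed query embedding points mostly in its direction. How the query image and text are actually combined is model-specific and not covered here.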

WebVid-CoVR is generated automatically from web-scraped video-caption pairs, using a language model to produce the modification text. It contains 1.6 million triplets covering diverse content and variations, and it includes a manually annotated test set of 2.5K triplets for evaluating CoVR models.
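
As an illustration of how a test set like this could be used, the sketch below computes Recall@K, a common retrieval metric. Choosing Recall@K is an assumption here, since the text above does not name an evaluation metric. The function name `recall_at_k` and the toy rankings are hypothetical.

```python
def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target video appears in the top-k results.

    rankings: one ranked list of video ids per query (best first).
    targets:  the ground-truth target video id for each query.
    """
    if not rankings:
        return 0.0
    hits = sum(
        1 for ranked, target in zip(rankings, targets) if target in ranked[:k]
    )
    return hits / len(rankings)


# Toy rankings for three queries (hypothetical ids, not real dataset entries).
rankings = [
    ["v2", "v1", "v3"],  # target v2 is at rank 1
    ["v3", "v1", "v2"],  # target v1 is at rank 2
    ["v1", "v2", "v3"],  # target v3 is at rank 3
]
targets = ["v2", "v1", "v3"]

r_at_1 = recall_at_k(rankings, targets, 1)
r_at_2 = recall_at_k(rankings, targets, 2)
r_at_3 = recall_at_k(rankings, targets, 3)
```

With these toy rankings, one of three targets is found at K=1, two at K=2, and all three at K=3.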


