CoVR: Learning Composed Video Retrieval from Web Video Captions

28 Aug 2023 · Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol

Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available at
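The pair-mining step described in the abstract can be illustrated with a minimal sketch. The abstract only says that paired videos have "a similar caption"; the criterion used here (two captions of equal length that differ in exactly one word) is an assumption made for illustration, and the function name `mine_caption_pairs` is hypothetical. The subsequent step, where a large language model writes the modification text for each pair, is not shown.

```python
from collections import defaultdict


def mine_caption_pairs(captions):
    """Return sorted index pairs (i, j), i < j, whose captions differ in exactly one word.

    Assumption: "similar" means same word count with a single differing word.
    """
    # Bucket every caption under each "caption with one word removed" key,
    # so captions that share a bucket agree everywhere except that position.
    buckets = defaultdict(list)
    for idx, caption in enumerate(captions):
        words = caption.lower().split()
        for pos in range(len(words)):
            key = (len(words), pos, tuple(words[:pos] + words[pos + 1:]))
            buckets[key].append((idx, words[pos]))

    pairs = set()
    for members in buckets.values():
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                (i, word_i), (j, word_j) = members[a], members[b]
                # Skip exact duplicate captions: the removed word must differ.
                if word_i != word_j:
                    pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


captions = [
    "A dog runs on the beach",
    "A cat runs on the beach",
    "A man cooks dinner",
    "A dog runs on the grass",
]
pairs = mine_caption_pairs(captions)
print(pairs)
```

Each mined pair would supply the query video, the target video, and the two captions from which a modification text could be generated, giving one video-text-video triplet.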

Datasets: WebVid, Fashion IQ, CIRR
Task                                         Dataset      Model      Metric                        Value  Rank
Image Retrieval                              CIRR         CoVR-BLIP  (Recall@5+Recall_subset@1)/2  76.81  # 6
Zero-Shot Composed Image Retrieval (ZS-CIR)  CIRR         CoVR-BLIP  R@5                           66.7   # 2
Composed Image Retrieval (CoIR)              CIRR         CoVR-BLIP  (Recall@5+Recall_subset@1)/2  76.81  # 1
Zero-Shot Composed Image Retrieval (ZS-CIR)  Fashion IQ   CoVR-BLIP  (Recall@10+Recall@50)/2       36.17  # 12
Image Retrieval                              Fashion IQ   CoVR-BLIP  (Recall@10+Recall@50)/2       59.39  # 5
Composed Video Retrieval (CoVR)              WebVid-CoVR  CoVR-BLIP  R@5                           79.93  # 1
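The metrics reported above are built from Recall@K, the percentage of queries whose target appears among the top K retrieved items. The following is a minimal sketch of that computation and of an averaged score such as (Recall@10+Recall@50)/2. The function name and the toy rankings are hypothetical, and each query is assumed to have exactly one correct target; this is not the benchmarks' official evaluation code.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Percentage of queries whose target is among the top-k ranked candidates.

    ranked_ids: one list of candidate ids per query, best match first.
    target_ids: the single correct candidate id for each query (assumption).
    """
    hits = sum(
        1 for ranked, target in zip(ranked_ids, target_ids) if target in ranked[:k]
    )
    return 100.0 * hits / len(target_ids)


# Toy example: three queries, each ranking the same three candidates.
ranked = [[3, 1, 2], [2, 3, 1], [1, 2, 3]]
targets = [1, 1, 1]

r1 = recall_at_k(ranked, targets, k=1)
r2 = recall_at_k(ranked, targets, k=2)
r3 = recall_at_k(ranked, targets, k=3)

# Averaged score in the style of (Recall@10+Recall@50)/2, here with K=1 and K=3.
mean_r1_r3 = (r1 + r3) / 2
print(round(r1, 2), round(r2, 2), round(r3, 2), round(mean_r1_r3, 2))
```

A subset variant of the metric could be sketched by passing rankings computed over a restricted candidate list instead of the full database.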