TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK	EXTRA DATA	REMOVE
Vision and Language Navigation	RxR	CLEAR-CLIP	ndtw	53.69	# 4
Visual Entailment	SNLI-VE val	CLIP-ViL	Accuracy	80.20	# 6

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/how-much-can-clip-benefit-vision-and-language/vision-and-language-navigation-on-rxr)](https://paperswithcode.com/sota/vision-and-language-navigation-on-rxr?p=how-much-can-clip-benefit-vision-and-language)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/how-much-can-clip-benefit-vision-and-language/visual-entailment-on-snli-ve-val)](https://paperswithcode.com/sota/visual-entailment-on-snli-ve-val?p=how-much-can-clip-benefit-vision-and-language)`

How Much Can CLIP Benefit Vision-and-Language Tasks?

13 Jul 2021 · Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer ·

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world. However, it has been observed that large-scale pretraining usually can result in better generalization performance, e.g., CLIP (Contrastive Language-Image Pre-training), trained on a massive amount of image-caption pairs, has shown a strong zero-shot capability on various vision tasks. To further study the advantage brought by CLIP, we propose to use CLIP as the visual encoder in various V&L models in two typical scenarios: 1) plugging CLIP into task-specific fine-tuning; 2) combining CLIP with V&L pre-training and transferring to downstream tasks. We show that CLIP significantly outperforms widely-used visual encoders trained with in-domain annotated data, such as BottomUp-TopDown. We achieve competitive or better results on diverse V&L tasks, while establishing new state-of-the-art results on Visual Question Answering, Visual Entailment, and V&L Navigation tasks. We release our code at https://github.com/clip-vil/CLIP-ViL.

PDF Abstract