ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training

30 Sep 2022 · Bin Shan, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang

Recent Vision-Language Pre-trained (VLP) models based on dual encoders have attracted extensive attention from academia and industry due to their superior performance on various cross-modal tasks and their high computational efficiency. They attempt to learn cross-modal representations using contrastive learning on image-text pairs; however, the inter-modal correlations they build rely on only a single view of each modality. In reality, an image or a text contains many potential views, just as humans can capture a real-world scene through diverse descriptions or photos. In this paper, we propose ERNIE-ViL 2.0, a multi-view contrastive learning framework that builds intra-modal and inter-modal correlations between diverse views simultaneously, aiming to learn a more robust cross-modal representation. Specifically, we construct multiple views within each modality and learn intra-modal correlations that strengthen the single-modal representations. Besides the inherent visual/textual views, we construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs. Pre-trained on 29M publicly available image-text pairs, ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval. Additionally, to generalize our method to Chinese cross-modal tasks, we train ERNIE-ViL 2.0 by scaling the pre-training data up to 1.5B Chinese image-text pairs, resulting in significant improvements over previous SOTA results on Chinese cross-modal retrieval. We release our pre-trained models at https://github.com/PaddlePaddle/ERNIE.
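The multi-view objective described above can be summarized as contrasting every pair of views, both across modalities (image vs. text) and within a modality (image vs. image, text vs. text, including the object-tag sequence as an extra textual view). The snippet below is a minimal sketch of that idea, not the official PaddlePaddle implementation; names such as `info_nce`, `image_views`, `text_views`, and the temperature value are illustrative assumptions.

```python
# Minimal sketch (assumed names, PyTorch for brevity) of a multi-view
# contrastive loss: InfoNCE summed over every pair of views, covering
# inter-modal and intra-modal correlations.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings; matching rows are positives."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)    # diagonal = positive pairs
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def multi_view_contrastive_loss(image_views, text_views, temperature: float = 0.07):
    """image_views: list of (batch, dim) image embeddings, e.g. two augmented crops.
    text_views:  list of (batch, dim) text embeddings, e.g. the caption and an
                 encoded sequence of detected object tags (the special textual view)."""
    all_views = list(image_views) + list(text_views)
    loss, n_pairs = 0.0, 0
    for i in range(len(all_views)):
        for j in range(i + 1, len(all_views)):
            loss = loss + info_nce(all_views[i], all_views[j], temperature)
            n_pairs += 1
    return loss / max(n_pairs, 1)
```

With two visual views and two textual views, this yields six pairwise terms: one classic image-text pair plus five additional inter- and intra-modal pairs that regularize the single-modal encoders.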


Results from the Paper


Task                               | Dataset   | Model         | Metric              | Value | Global Rank
-----------------------------------|-----------|---------------|---------------------|-------|------------
Image Retrieval                    | AIC-ICC   | ERNIE-ViL 2.0 | Recall@1            | 19.0  | #1
                                   |           |               | Recall@5            | 35.3  | #2
                                   |           |               | Recall@10           | 43.5  | #1
Image-to-Text Retrieval            | AIC-ICC   | ERNIE-ViL 2.0 | Recall@1            | 33.7  | #1
                                   |           |               | Recall@5            | 52.1  | #1
                                   |           |               | Recall@10           | 60.0  | #1
Cross-Modal Retrieval              | COCO 2014 | ERNIE-ViL 2.0 | Image-to-text R@1   | 77.4  | #13
                                   |           |               | Image-to-text R@5   | 93.6  | #13
                                   |           |               | Image-to-text R@10  | 97.1  | #11
                                   |           |               | Text-to-image R@1   | 59.5  | #17
                                   |           |               | Text-to-image R@5   | 83.4  | #15
                                   |           |               | Text-to-image R@10  | 90.1  | #13
Zero-Shot Cross-Modal Retrieval    | COCO 2014 | ERNIE-ViL 2.0 | Image-to-text R@1   | 63.1  | #11
                                   |           |               | Image-to-text R@5   | 85.7  | #11
                                   |           |               | Image-to-text R@10  | 91.4  | #10
                                   |           |               | Text-to-image R@1   | 46.0  | #11
                                   |           |               | Text-to-image R@5   | 71.4  | #10
                                   |           |               | Text-to-image R@10  | 80.4  | #11
Zero-shot Text-to-Image Retrieval  | COCO-CN   | ERNIE-ViL 2.0 | Recall@1            | 69.6  | #2
                                   |           |               | Recall@5            | 91.2  | #2
                                   |           |               | Recall@10           | 96.9  | #2
Zero-shot Image Retrieval          | COCO-CN   | ERNIE-ViL 2.0 | Recall@1            | 69.6  | #3
                                   |           |               | Recall@5            | 91.2  | #4
                                   |           |               | Recall@10           | 96.9  | #3
Cross-Modal Retrieval              | Flickr30k | ERNIE-ViL 2.0 | Image-to-text R@1   | 97.2  | #5
                                   |           |               | Image-to-text R@5   | 100.0 | #1
                                   |           |               | Image-to-text R@10  | 100.0 | #1
                                   |           |               | Text-to-image R@1   | 93.3  | #1
                                   |           |               | Text-to-image R@5   | 99.4  | #1
                                   |           |               | Text-to-image R@10  | 99.8  | #1
Image-to-Text Retrieval            | Flickr30k | ERNIE-ViL 2.0 | Recall@1            | 96.1  | #6
                                   |           |               | Recall@5            | 99.9  | #6
                                   |           |               | Recall@10           | 100.0 | #1
Zero-Shot Cross-Modal Retrieval    | Flickr30k | ERNIE-ViL 2.0 | Image-to-text R@1   | 91.2  | #6
                                   |           |               | Image-to-text R@5   | 99.1  | #9
                                   |           |               | Image-to-text R@10  | 99.8  | #5
                                   |           |               | Text-to-image R@1   | 77.4  | #9
                                   |           |               | Text-to-image R@5   | 93.8  | #10
                                   |           |               | Text-to-image R@10  | 96.4  | #11
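All of the metrics above are standard Recall@K scores for dual-encoder retrieval: each query is embedded, ranked against every candidate in the other modality, and counted as a hit if its ground-truth match appears in the top K. The following is a minimal sketch of that evaluation (illustrative helper names, and it assumes a simple one-to-one image-caption pairing, whereas COCO/Flickr30k have several captions per image).

```python
# Sketch of Recall@K for a dual-encoder retrieval model (assumed setup:
# one caption per image; real COCO/Flickr30k evaluation handles multiple captions).
import torch
import torch.nn.functional as F


def recall_at_k(image_emb: torch.Tensor, text_emb: torch.Tensor, ks=(1, 5, 10)):
    """image_emb: (N, dim) image embeddings; text_emb: (N, dim) paired caption embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sims = image_emb @ text_emb.t()                 # rows: images as queries, cols: texts
    gt = torch.arange(sims.size(0), device=sims.device)

    def _recall(sim):
        ranked = sim.argsort(dim=-1, descending=True)        # candidate indices, best first
        hit_positions = (ranked == gt.unsqueeze(1)).float()   # 1 where the true match sits
        return {f"R@{k}": hit_positions[:, :k].sum(dim=1).mean().item() for k in ks}

    return {"image-to-text": _recall(sims), "text-to-image": _recall(sims.t())}
```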
