TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Cross-Modal Retrieval	COCO 2014	TCL	Image-to-text R@1	71.4	# 3
Zero-Shot Cross-Modal Retrieval	COCO 2014	TCL	Image-to-text R@5	90.8	# 3
Zero-Shot Cross-Modal Retrieval	COCO 2014	TCL	Image-to-text R@10	95.4	# 2
Zero-Shot Cross-Modal Retrieval	COCO 2014	TCL	Text-to-image R@1	53.5	# 4
Zero-Shot Cross-Modal Retrieval	COCO 2014	TCL	Text-to-image R@5	79.0	# 3
Zero-Shot Cross-Modal Retrieval	COCO 2014	TCL	Text-to-image R@10	87.1	# 3
Cross-Modal Retrieval	COCO 2014	TCL	Image-to-text R@1	75.6	# 16
Cross-Modal Retrieval	COCO 2014	TCL	Image-to-text R@5	92.8	# 16
Cross-Modal Retrieval	COCO 2014	TCL	Image-to-text R@10	96.7	# 15
Cross-Modal Retrieval	COCO 2014	TCL	Text-to-image R@1	59.0	# 17
Cross-Modal Retrieval	COCO 2014	TCL	Text-to-image R@5	83.2	# 17
Cross-Modal Retrieval	COCO 2014	TCL	Text-to-image R@10	89.9	# 15
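
The R@K columns in the table are recall-at-K scores: the percentage of queries for which at least one correct match appears among the top K retrieved candidates. The following is a minimal sketch of that computation, not the evaluation code behind the numbers above. The function name `recall_at_k`, the toy similarity scores, and the nested-list layout are all assumptions made for illustration.

```python
def recall_at_k(similarity, relevant, k):
    """Percentage of queries with at least one relevant item in the top k.

    similarity[q][c] is the score of candidate c for query q.
    relevant[q] is the set of candidate indices that count as correct for q.
    """
    hits = 0
    for scores, correct in zip(similarity, relevant):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if correct & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data (made up): 3 image queries scored against 4 candidate captions.
toy_similarity = [
    [0.9, 0.1, 0.3, 0.2],  # correct caption 0 is ranked first
    [0.2, 0.4, 0.8, 0.5],  # correct caption 1 is ranked third
    [0.1, 0.2, 0.3, 0.7],  # correct caption 3 is ranked first
]
toy_relevant = [{0}, {1}, {3}]

r_at_1 = recall_at_k(toy_similarity, toy_relevant, k=1)
r_at_3 = recall_at_k(toy_similarity, toy_relevant, k=3)
```

Using a set of relevant indices per query lets the same function cover image-to-text retrieval, where an image may have several correct captions, and text-to-image retrieval, where each caption has a single correct image.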

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-language-pre-training-with-triple/zero-shot-cross-modal-retrieval-on-coco-2014)](https://paperswithcode.com/sota/zero-shot-cross-modal-retrieval-on-coco-2014?p=vision-language-pre-training-with-triple)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vision-language-pre-training-with-triple/cross-modal-retrieval-on-coco-2014)](https://paperswithcode.com/sota/cross-modal-retrieval-on-coco-2014?p=vision-language-pre-training-with-triple)`

Vision-Language Pre-Training with Triple Contrastive Learning

CVPR 2022 · Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath Chanda, Liqun Chen, Belinda Zeng, Trishul Chilimbi, Junzhou Huang

Vision-language representation learning benefits greatly from image-text alignment through contrastive losses (e.g., the InfoNCE loss). The success of this alignment strategy is attributed to its ability to maximize the mutual information (MI) between an image and its matched text. However, performing cross-modal alignment (CMA) alone ignores the potential of the data within each modality, which may result in degraded representations. For instance, although CMA-based models can map image-text pairs close together in the embedding space, they do not ensure that similar inputs from the same modality stay close to each other. This problem can become worse when the pre-training data is noisy. In this paper, we propose triple contrastive learning (TCL) for vision-language pre-training, which leverages both cross-modal and intra-modal self-supervision. Besides CMA, TCL introduces an intra-modal contrastive objective to provide complementary benefits in representation learning. To take advantage of localized and structural information from the image and text inputs, TCL further maximizes the average MI between local regions of the image/text and their global summary. To the best of our knowledge, ours is the first work that takes local structure information into account for multi-modal representation learning. Experimental evaluations show that our approach is competitive and achieves a new state of the art on various common downstream vision-language tasks such as image-text retrieval and visual question answering.
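
To make the combination of cross-modal and intra-modal objectives concrete, here is a rough sketch of an InfoNCE loss applied in both settings. It is not the authors' implementation (see the official repository for that), and it leaves out the local-to-global MI term entirely. The helper names, the use of a second augmented view as the intra-modal positive, the equal weighting of the terms, and the temperature of 0.07 are all assumptions chosen for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss: keys[i] is the positive for queries[i],
    and every other key in the batch acts as a negative."""
    queries = [normalize(q) for q in queries]
    keys = [normalize(k) for k in keys]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        # Numerically stable log-sum-exp over all keys.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def combined_contrastive_loss(img, txt, img_view, txt_view, temperature=0.07):
    # Cross-modal alignment, averaged over both retrieval directions.
    cross = 0.5 * (info_nce(img, txt, temperature) + info_nce(txt, img, temperature))
    # Intra-modal terms: each embedding against a second view of the same input
    # (assumption: how positives are formed is not specified in the abstract).
    intra = 0.5 * (info_nce(img, img_view, temperature)
                   + info_nce(txt, txt_view, temperature))
    return cross + intra


# Toy 2-D embeddings (made up): matched pairs point in similar directions.
toy_img = [[1.0, 0.0], [0.0, 1.0]]
toy_txt = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(toy_img, toy_txt)
shuffled_loss = info_nce(toy_img, toy_txt[::-1])
combined = combined_contrastive_loss(toy_img, toy_txt, toy_img, toy_txt)
```

In the toy data the loss for correctly matched pairs is far smaller than for shuffled pairs, which is the behavior a contrastive alignment objective is meant to reward.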

Code

uta-smile/TCL (official) · 253 stars

Tasks

Contrastive Learning

Cross-Modal Retrieval

Question Answering

Representation Learning

Retrieval

Text Retrieval

Visual Question Answering

Visual Question Answering (VQA)

Zero-Shot Cross-Modal Retrieval

Datasets

MS COCO

SNLI-VE

Results from the Paper

Ranked #3 on Zero-Shot Cross-Modal Retrieval on COCO 2014

Methods

Contrastive Learning • InfoNCE
