TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Cross-Modal Retrieval	COCO 2014	CoCa	Image-to-text R@1	66.3	# 8
Zero-Shot Cross-Modal Retrieval	COCO 2014	CoCa	Image-to-text R@5	86.2	# 9
Zero-Shot Cross-Modal Retrieval	COCO 2014	CoCa	Image-to-text R@10	91.8	# 9
Zero-Shot Cross-Modal Retrieval	COCO 2014	CoCa	Text-to-image R@1	51.2	# 6
Zero-Shot Cross-Modal Retrieval	COCO 2014	CoCa	Text-to-image R@5	74.2	# 8
Zero-Shot Cross-Modal Retrieval	COCO 2014	CoCa	Text-to-image R@10	82.0	# 9
Image Captioning	COCO Captions	CoCa	BLEU-4	40.9	# 16
Image Captioning	COCO Captions	CoCa	METEOR	33.9	# 1
Image Captioning	COCO Captions	CoCa	CIDEr	143.6	# 12
Image Captioning	COCO Captions	CoCa	SPICE	24.7	# 10
Zero-Shot Cross-Modal Retrieval	Flickr30k	CoCa	Image-to-text R@1	92.5	# 4
Zero-Shot Cross-Modal Retrieval	Flickr30k	CoCa	Image-to-text R@5	99.5	# 4
Zero-Shot Cross-Modal Retrieval	Flickr30k	CoCa	Image-to-text R@10	99.9	# 2
Zero-Shot Cross-Modal Retrieval	Flickr30k	CoCa	Text-to-image R@1	80.4	# 7
Zero-Shot Cross-Modal Retrieval	Flickr30k	CoCa	Text-to-image R@5	95.7	# 5
Zero-Shot Cross-Modal Retrieval	Flickr30k	CoCa	Text-to-image R@10	97.7	# 7
Zero-Shot Transfer Image Classification	ImageNet	CoCa	Accuracy (Private)	86.3	# 3
Image Classification	ImageNet	CoCa (frozen)	Top 1 Accuracy	90.6%	# 9
Image Classification	ImageNet	CoCa (frozen)	Number of params	2100M	# 966
Image Classification	ImageNet	CoCa (finetuned)	Top 1 Accuracy	91.0%	# 3
Image Classification	ImageNet	CoCa (finetuned)	Number of params	2100M	# 966
Zero-Shot Transfer Image Classification	ImageNet-A	CoCa	Accuracy (Private)	90.2	# 1
Zero-Shot Transfer Image Classification	ImageNet-R	CoCa	Accuracy	96.5	# 2
Zero-Shot Transfer Image Classification	ImageNet-Sketch	CoCa	Accuracy (Private)	77.6	# 1
Zero-Shot Transfer Image Classification	ImageNet V2	CoCa	Accuracy (Private)	80.7	# 3
Action Classification	Kinetics-400	CoCa (finetuned)	Acc@1	88.9	# 15
Action Classification	Kinetics-400	CoCa (frozen)	Acc@1	88.0	# 22
Action Classification	Kinetics-600	CoCa (finetuned)	Top-1 Accuracy	89.4	# 15
Action Classification	Kinetics-600	CoCa (frozen)	Top-1 Accuracy	88.5	# 19
Action Classification	Kinetics-700	CoCa (frozen)	Top-1 Accuracy	81.1	# 10
Action Classification	Kinetics-700	CoCa (finetuned)	Top-1 Accuracy	82.7	# 8
Action Classification	MiT	CoCa (finetuned)	Top 1 Accuracy	49.0	# 3
Action Classification	MiT	CoCa (frozen)	Top 1 Accuracy	47.4	# 6
Video Retrieval	MSR-VTT	CoCa (zero-shot)	text-to-video R@1	30.0	# 24
Video Retrieval	MSR-VTT	CoCa (zero-shot)	text-to-video R@5	52.4	# 25
Video Retrieval	MSR-VTT	CoCa (zero-shot)	text-to-video R@10	61.6	# 27
Video Retrieval	MSR-VTT	CoCa (zero-shot)	video-to-text R@1	49.9	# 8
Video Retrieval	MSR-VTT	CoCa (zero-shot)	video-to-text R@5	73.4	# 6
Video Retrieval	MSR-VTT	CoCa (zero-shot)	video-to-text R@10	81.4	# 6
Visual Reasoning	NLVR2 Dev	CoCa	Accuracy	86.1	# 5
Visual Reasoning	NLVR2 Test	CoCa	Accuracy	87.0	# 4
Zero-Shot Transfer Image Classification	ObjectNet	CoCa	Accuracy (Private)	82.7	# 3
Image Classification	ObjectNet	CoCa	Top-1 Accuracy	82.7	# 1
Visual Entailment	SNLI-VE test	CoCa	Accuracy	87.1	# 3
Visual Entailment	SNLI-VE val	CoCa	Accuracy	87.0	# 3
Visual Question Answering	VQA v2 test-dev	CoCa	Accuracy	82.3	# 1
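
The retrieval rows above report R@1, R@5 and R@10. As a minimal sketch of how such a Recall@K figure is commonly computed (assuming the standard definition; the paper's exact evaluation protocol is not given on this page, and `recall_at_k`, `similarity` and `ground_truth` are illustrative names with toy data):

```python
def recall_at_k(similarity, ground_truth, k):
    """Percentage of queries whose top-k candidates contain a correct match.

    similarity[i][j]: score between query i and candidate j (higher is better).
    ground_truth[i]:  set of candidate indices that count as correct for query i.
    """
    hits = 0
    for scores, relevant in zip(similarity, ground_truth):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if relevant.intersection(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3-query, 3-candidate similarity matrix (made-up numbers, not CoCa outputs).
similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
ground_truth = [{0}, {1}, {1}]

r1 = recall_at_k(similarity, ground_truth, 1)
r2 = recall_at_k(similarity, ground_truth, 2)
print(f"R@1 = {r1:.1f}, R@2 = {r2:.1f}")
```

In this toy case two of three queries rank a correct candidate first, and all three have one within the top two. Using a set per query allows several correct candidates, as when an image has multiple reference captions.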

CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022 · Jiahui Yu, ZiRui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu

Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.
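
The two training objectives described in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the symmetric image-to-text and text-to-image form of the contrastive term, the `temperature` value, and the weights `w_con` and `w_cap` are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings (assumed form)."""
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each image should pick out its own paired text.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: the same computation over the columns.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return i2t + t2i


def captioning_loss(token_logits, target_ids):
    """Mean next-token cross-entropy over one caption."""
    total = sum(log_softmax(l)[t] for l, t in zip(token_logits, target_ids))
    return -total / len(target_ids)


def combined_loss(image_emb, text_emb, token_logits, target_ids,
                  w_con=1.0, w_cap=1.0):
    """Weighted sum of both objectives; the weights here are placeholders."""
    return (w_con * contrastive_loss(image_emb, text_emb)
            + w_cap * captioning_loss(token_logits, target_ids))


# Toy data: two image/text pairs with 2-d embeddings, a 4-token vocabulary.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
uniform_caption = captioning_loss([[0.0, 0.0, 0.0, 0.0]], [2])
total = combined_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]],
                      [[0.0, 0.0, 0.0, 0.0]], [2])
```

Correctly paired embeddings give a much lower contrastive term than mismatched ones, and uniform logits over four tokens give a captioning term of log 4. In the model itself, the embeddings would come from the image encoder and the unimodal half of the text decoder, and the token logits from the multimodal half that cross-attends to the image encoder.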

Code

mlfoundations/open_clip

8,439

facebookresearch/multimodal

1,291

lucidrains/CoCa-pytorch

975

PaddlePaddle/PaddleMIX

212

amitakamath/whatsup_vlms

Tasks

Action Classification

Image Captioning

Image Classification

Representation Learning

Retrieval

Video Retrieval

Visual Entailment

Visual Question Answering

Visual Question Answering (VQA)

Visual Reasoning

Zero-Shot Cross-Modal Retrieval

Zero-Shot Transfer Image Classification

Datasets

ImageNet

MS COCO

Kinetics

Flickr30k

Kinetics 400

MSR-VTT

Visual Question Answering v2.0

ImageNet-R

ImageNet-A

ImageNet-Sketch

COCO Captions

NoCaps

ObjectNet

Kinetics-600

SNLI-VE

MiT

Kinetics-700

NLVR

JFT-3B

Results from the Paper

Ranked #1 on Visual Question Answering on VQA v2 test-dev

Methods

CLIP • SimVLM
