TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-Shot Cross-Modal Retrieval	Flickr30k	UNITER	Image-to-text R@1	80.7	# 16
Zero-Shot Cross-Modal Retrieval	Flickr30k	UNITER	Image-to-text R@5	95.7	# 17
Zero-Shot Cross-Modal Retrieval	Flickr30k	UNITER	Image-to-text R@10	98.0	# 15
Zero-Shot Cross-Modal Retrieval	Flickr30k	UNITER	Text-to-image R@1	66.2	# 17
Zero-Shot Cross-Modal Retrieval	Flickr30k	UNITER	Text-to-image R@5	88.4	# 17
Zero-Shot Cross-Modal Retrieval	Flickr30k	UNITER	Text-to-image R@10	92.9	# 15
Visual Reasoning	NLVR2 Test	UNITER (Large)	Accuracy	79.5	# 10
Referring Expression Comprehension	RefCOCO	UNITER-L	Val	81.41	# 14
Referring Expression Comprehension	RefCOCO	UNITER-L	Test A	87.04	# 12
Referring Expression Comprehension	RefCOCO	UNITER-L	Test B	74.17	# 14
Visual Entailment	SNLI-VE test	UNITER (Large)	Accuracy	78.98	# 7
Visual Entailment	SNLI-VE val	UNITER	Accuracy	78.98	# 8
Visual Question Answering (VQA)	VCR (Q-AR) test	UNITER (Large)	Accuracy	62.8	# 3
Visual Question Answering (VQA)	VCR (QA-R) test	UNITER-large (ensemble of 10 models)	Accuracy	83.4	# 3
Visual Question Answering (VQA)	VCR (QA-R) test	UNITER (Large)	Accuracy	80.8	# 4
Visual Question Answering (VQA)	VCR (Q-A) test	UNITER-large (10 ensemble)	Accuracy	79.8	# 3
Visual Question Answering (VQA)	VCR (Q-A) test	UNITER (Large)	Accuracy	77.3	# 5
Visual Question Answering (VQA)	VQA v2 test-dev	UNITER (Large)	Accuracy	73.24	# 22
Visual Question Answering (VQA)	VQA v2 test-std	UNITER (Large)	overall	73.4	# 20
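
The R@1, R@5, and R@10 columns in the retrieval rows above are recall-at-K scores: the percentage of queries for which at least one correct item appears among the top K retrieved results. The sketch below shows that computation on toy data. The function name `recall_at_k` and the toy IDs are hypothetical, chosen only for illustration; this is not the evaluation script behind the listed numbers.

```python
def recall_at_k(rankings, relevant, k):
    """Percentage of queries with at least one relevant item in the top k.

    rankings: one ranked list of candidate IDs per query (best first).
    relevant: one set of correct candidate IDs per query.
    """
    hits = sum(
        1
        for ranking, correct in zip(rankings, relevant)
        if any(item in correct for item in ranking[:k])
    )
    return 100.0 * hits / len(rankings)


# Toy example (hypothetical IDs): two queries, three candidates each.
rankings = [["c3", "c1", "c2"], ["c2", "c3", "c1"]]
relevant = [{"c1"}, {"c1"}]

print(recall_at_k(rankings, relevant, 1))  # 0.0
print(recall_at_k(rankings, relevant, 2))  # 50.0
print(recall_at_k(rankings, relevant, 3))  # 100.0
```

Using a set of correct IDs per query covers the case where a query has several valid matches, as in image-to-text retrieval when an image has more than one reference caption.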

UNITER: UNiversal Image-TExt Representation Learning

ECCV 2020 · Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR$^2$. Code is available at https://github.com/ChenRocks/UNITER.
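
The difference between conditional masking and joint random masking described in the abstract can be illustrated with a minimal sketch. It assumes words and image regions are given as non-empty Python lists and uses a string placeholder for masked positions. The names `conditional_mask`, `joint_mask`, and `MASK`, the default `mask_prob=0.15`, and the 50/50 choice between modalities are assumptions made for illustration; they are not taken from the official repository.

```python
import random

MASK = "[MASK]"  # illustrative placeholder for a masked word or region


def _mask(seq, mask_prob, rng):
    """Return a masked copy of seq and the masked positions (at least one)."""
    positions = [i for i in range(len(seq)) if rng.random() < mask_prob]
    if not positions:
        # Force one masked position so every example has a prediction target.
        positions = [rng.randrange(len(seq))]
    masked = [MASK if i in positions else item for i, item in enumerate(seq)]
    return masked, positions


def conditional_mask(words, regions, mask_prob=0.15, rng=None):
    """Mask exactly one modality; the other stays fully observed."""
    rng = rng or random.Random()
    if rng.random() < 0.5:  # assumed 50/50 split between the two tasks
        # Masked language modeling: all regions remain visible.
        masked_words, positions = _mask(words, mask_prob, rng)
        return masked_words, list(regions), "text", positions
    # Masked region modeling: all words remain visible.
    masked_regions, positions = _mask(regions, mask_prob, rng)
    return list(words), masked_regions, "image", positions


def joint_mask(words, regions, mask_prob=0.15, rng=None):
    """Contrast case: mask both modalities in the same example."""
    rng = rng or random.Random()
    masked_words, _ = _mask(words, mask_prob, rng)
    masked_regions, _ = _mask(regions, mask_prob, rng)
    return masked_words, masked_regions


words = ["a", "dog", "chases", "a", "ball"]
regions = ["region_0", "region_1", "region_2"]
print(conditional_mask(words, regions, rng=random.Random(0)))
print(joint_mask(words, regions, rng=random.Random(0)))
```

With joint masking, a masked word and the region it describes can be hidden at the same time, leaving the model nothing to align against. Conditional masking avoids that case by construction, since one modality is always fully observed.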

Code

ChenRocks/UNITER (official), 764 stars

YIKUAN8/Transformers-VQA, 161 stars

necla-ml/SNLI-VE, 104 stars

lichengunc/pretrain-vl-data

vladsandulescu/hatefulmemes

See all 7 implementations

Tasks

Image-text matching

Language Modelling

Masked Language Modeling

Question Answering

Referring Expression

Referring Expression Comprehension

Representation Learning

Retrieval

Text Matching

Text Retrieval

Visual Commonsense Reasoning

Visual Entailment

Visual Question Answering

Visual Question Answering (VQA)

Visual Reasoning

Zero-Shot Cross-Modal Retrieval

Datasets

MS COCO

Visual Question Answering

Visual Genome

Flickr30k

Visual Question Answering v2.0

Conceptual Captions

RefCOCO

VCR

SNLI-VE

NLVR

Results from the Paper

Ranked #3 on Visual Question Answering (VQA) on VCR (Q-A) test

Methods

UNITER
