An Empirical Study of Training End-to-End Vision-and-Language Transformers

Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments and provide insights on how to train a performant VL transformer. METER achieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for pre-training, surpassing the state-of-the-art region-feature-based model by 1.04%, and outperforming the previous best fully transformer-based model by 1.6%. Notably, when further scaled up, our best VQA model achieves an accuracy of 80.54%. Code and pre-trained models are released at https://github.com/zdou0830/METER.
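To make the fusion-module comparison concrete, below is a minimal, illustrative sketch (not the authors' released code) of the two fusion designs the abstract contrasts: "merged attention", where text and image tokens are concatenated and passed through a single self-attention block, versus "co-attention", where each modality keeps its own attention block and additionally cross-attends to the other. All class names, dimensions, and layer choices here are assumptions for illustration.

```python
# Sketch of merged-attention vs. co-attention multimodal fusion (assumed names/dims).
import torch
import torch.nn as nn


class MergedAttentionBlock(nn.Module):
    """Concatenate text and image tokens, then run one shared self-attention layer."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_tokens):
        x = torch.cat([text_tokens, image_tokens], dim=1)   # (B, Lt+Lv, D)
        out, _ = self.attn(x, x, x)
        x = self.norm(x + out)
        # Split the fused sequence back into the two modalities.
        return x[:, : text_tokens.size(1)], x[:, text_tokens.size(1):]


class CoAttentionBlock(nn.Module):
    """Separate self-attention per modality, plus cross-attention to the other modality."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.text_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text_tokens, image_tokens):
        t, _ = self.text_self(text_tokens, text_tokens, text_tokens)
        v, _ = self.img_self(image_tokens, image_tokens, image_tokens)
        # Each modality queries the other one.
        t_cross, _ = self.text_cross(t, v, v)
        v_cross, _ = self.img_cross(v, t, t)
        return self.norm_t(t + t_cross), self.norm_v(v + v_cross)


if __name__ == "__main__":
    text = torch.randn(2, 16, 768)    # stand-in for text-encoder (e.g., RoBERTa) features
    image = torch.randn(2, 196, 768)  # stand-in for vision-encoder (e.g., CLIP-ViT) patch features
    t1, v1 = MergedAttentionBlock()(text, image)
    t2, v2 = CoAttentionBlock()(text, image)
    print(t1.shape, v1.shape, t2.shape, v2.shape)
```

In practice the trade-off is parameter count versus modality-specific capacity: merged attention shares one set of attention weights across both modalities, while co-attention doubles the attention layers but lets each modality be processed on its own terms.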


Results from the Paper


Ranked #20 on Cross-Modal Retrieval on COCO 2014 (using extra training data)

Task: Cross-Modal Retrieval    Dataset: COCO 2014    Model: METER    Uses extra training data: Yes

Metric                Value    Global Rank
Image-to-text R@1     76.16    #15
Image-to-text R@5     93.16    #15
Image-to-text R@10    96.82    #13
Text-to-image R@1     57.08    #20
Text-to-image R@5     82.66    #19
Text-to-image R@10    90.07    #14
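For readers unfamiliar with the retrieval metrics above, the following small sketch (an assumption, not the benchmark's official evaluation script) shows how Recall@K is typically computed from a similarity matrix between image and text embeddings. It assumes a one-to-one pairing between queries and candidates; the real COCO evaluation has five captions per image and counts a hit if any ground-truth caption ranks in the top K.

```python
# Illustrative Recall@K computation over a similarity matrix (assumed setup).
import torch


def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    """similarity[i, j] = score of query i against candidate j; candidate i is the match."""
    topk = similarity.topk(k, dim=1).indices                  # (N, k) candidate indices
    targets = torch.arange(similarity.size(0)).unsqueeze(1)   # (N, 1) ground-truth indices
    hits = (topk == targets).any(dim=1).float()
    return hits.mean().item()


if __name__ == "__main__":
    # Random unit-normalized embeddings standing in for model outputs.
    img = torch.nn.functional.normalize(torch.randn(100, 256), dim=1)
    txt = torch.nn.functional.normalize(torch.randn(100, 256), dim=1)
    sim = img @ txt.t()
    print("image-to-text R@5:", recall_at_k(sim, 5))
    print("text-to-image R@5:", recall_at_k(sim.t(), 5))
```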
