TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Cross-Modal Retrieval	COCO 2014	XFM (base)	Image-to-text R@1	84.2	# 3
Cross-Modal Retrieval	COCO 2014	XFM (base)	Image-to-text R@5	96.4	# 3
Cross-Modal Retrieval	COCO 2014	XFM (base)	Image-to-text R@10	98.4	# 3
Cross-Modal Retrieval	COCO 2014	XFM (base)	Text-to-image R@1	67.0	# 5
Cross-Modal Retrieval	COCO 2014	XFM (base)	Text-to-image R@5	87.2	# 5
Cross-Modal Retrieval	COCO 2014	XFM (base)	Text-to-image R@10	92.4	# 4
Visual Reasoning	NLVR2 Dev	XFM (base)	Accuracy	87.6	# 3
Visual Reasoning	NLVR2 Test	XFM (base)	Accuracy	88.4	# 3
Visual Grounding	RefCOCO+ testA	XFM (base)	Accuracy (%)	90.4	# 3
Visual Grounding	RefCOCO+ test B	XFM (base)	Accuracy (%)	79.8	# 3
Visual Grounding	RefCOCO+ val	XFM (base)	Accuracy (%)	86.1	# 3
Visual Question Answering (VQA)	VQA v2 test-dev	XFM (base)	Accuracy	80.4	# 10
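
The retrieval rows above report Recall@K (R@1, R@5, R@10). The page does not define the metric, so the following is a minimal sketch assuming the common convention for image-text retrieval: a query counts as a hit if at least one of its ground-truth items appears among the top K ranked candidates, and the score is the percentage of queries that hit. The function name `recall_at_k` and the toy IDs are hypothetical and are not taken from the XFM code base.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Percentage of queries with at least one relevant item in the top k.

    ranked_ids:   list of candidate-ID lists, one per query, best match first.
    relevant_ids: list of sets of ground-truth IDs, one per query.
    """
    if len(ranked_ids) != len(relevant_ids):
        raise ValueError("need one relevant set per query")
    hits = 0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        # A hit if any of the top-k candidates is a ground-truth match.
        if any(item in relevant for item in ranking[:k]):
            hits += 1
    return 100.0 * hits / len(ranked_ids)


# Toy data (made up for illustration): four queries, three candidates each.
rankings = [
    ["c1", "c7", "c3"],
    ["c9", "c4", "c2"],
    ["c5", "c6", "c8"],
    ["c3", "c0", "c1"],
]
relevant = [{"c1"}, {"c2"}, {"c0"}, {"c0"}]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(rankings, relevant, k):.1f}")
```

Under this convention, image-to-text retrieval on COCO typically has several ground-truth captions per image, while text-to-image retrieval has a single ground-truth image per caption, which is one reason the two directions can score differently.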

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/toward-building-general-foundation-models-for/visual-reasoning-on-nlvr2-dev)](https://paperswithcode.com/sota/visual-reasoning-on-nlvr2-dev?p=toward-building-general-foundation-models-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/toward-building-general-foundation-models-for/visual-reasoning-on-nlvr2-test)](https://paperswithcode.com/sota/visual-reasoning-on-nlvr2-test?p=toward-building-general-foundation-models-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/toward-building-general-foundation-models-for/visual-grounding-on-refcoco-testa)](https://paperswithcode.com/sota/visual-grounding-on-refcoco-testa?p=toward-building-general-foundation-models-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/toward-building-general-foundation-models-for/visual-grounding-on-refcoco-test-b)](https://paperswithcode.com/sota/visual-grounding-on-refcoco-test-b?p=toward-building-general-foundation-models-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/toward-building-general-foundation-models-for/visual-grounding-on-refcoco-val)](https://paperswithcode.com/sota/visual-grounding-on-refcoco-val?p=toward-building-general-foundation-models-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/toward-building-general-foundation-models-for/cross-modal-retrieval-on-coco-2014)](https://paperswithcode.com/sota/cross-modal-retrieval-on-coco-2014?p=toward-building-general-foundation-models-for)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/toward-building-general-foundation-models-for/visual-question-answering-on-vqa-v2-test-dev)](https://paperswithcode.com/sota/visual-question-answering-on-vqa-v2-test-dev?p=toward-building-general-foundation-models-for)`

Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks

12 Jan 2023 · Xinsong Zhang, Yan Zeng, Jipeng Zhang, Hang Li

Foundation models, or pre-trained models, have substantially improved performance on a variety of language, vision, and vision-language understanding tasks. However, existing foundation models perform best on only one type of task: language, vision, or vision-language. It remains an open question whether a single foundation model can perform best across all of these understanding tasks; we call such a model a general foundation model. In this paper, we propose a new general foundation model, X-FM (the X-Foundation Model). X-FM has one language encoder, one vision encoder, and one fusion encoder, together with a new training method. The training method includes two new techniques for learning X-FM from text, image, and image-text pair data. The first stops gradients from the vision-language training when learning the language encoder. The second leverages the vision-language training to guide the learning of the vision encoder. Extensive experiments on benchmark datasets show that X-FM significantly outperforms existing general foundation models and performs better than or comparably to foundation models built specifically for language, vision, or vision-language understanding. Code and pre-trained models are released at https://github.com/zhangxinsong-nlp/XFM.
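
To make the stop-gradient idea in the abstract concrete, here is a toy sketch, not the authors' implementation. It assumes a scalar "language encoder" weight `w_lang`, a scalar "fusion encoder" weight `w_fuse`, and squared-error losses, all of which are hypothetical stand-ins chosen so the gradients can be written by hand. With the stop-gradient enabled, the vision-language loss still updates the fusion weight but contributes nothing to the language encoder's gradient.

```python
def forward(w_lang, w_fuse, x):
    h = w_lang * x   # toy "language encoder" output
    z = w_fuse * h   # toy "fusion encoder" output built on top of h
    return h, z


def gradients(w_lang, w_fuse, x, y_lang, y_vl, stop_gradient=True):
    """Hand-derived gradients of loss_lang + loss_vl for the toy model.

    loss_lang = (h - y_lang)^2   language-only objective
    loss_vl   = (z - y_vl)^2     vision-language objective
    """
    h, z = forward(w_lang, w_fuse, x)
    # The language loss always reaches the language encoder.
    g_lang = 2.0 * (h - y_lang) * x
    # The vision-language loss always reaches the fusion encoder.
    g_fuse = 2.0 * (z - y_vl) * h
    if not stop_gradient:
        # Without the stop-gradient, the vision-language loss also flows
        # back through h into the language encoder (chain rule).
        g_lang += 2.0 * (z - y_vl) * w_fuse * x
    return g_lang, g_fuse


# Made-up values for illustration only.
args = dict(w_lang=0.5, w_fuse=2.0, x=1.0, y_lang=1.0, y_vl=0.0)
print("with stop-gradient:   ", gradients(**args, stop_gradient=True))
print("without stop-gradient:", gradients(**args, stop_gradient=False))
```

In an autodiff framework the same effect is usually obtained by detaching the language encoder's output before it enters the vision-language branch, rather than by deriving gradients manually; how X-FM wires this in practice is described in the paper and the released code, not here.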

Code

zhangxinsong-nlp/XFM official

Tasks

Cross-Modal Retrieval

Open-Ended Question Answering

Visual Grounding

Visual Question Answering (VQA)

Visual Reasoning

Datasets

CIFAR-10

MS COCO

CIFAR-100

GLUE

SST

MultiNLI

SST-2

QNLI

Oxford 102 Flower

MRPC

DTD

CoLA

Food-101

Visual Question Answering v2.0

RefCOCO

NLVR

Results from the Paper

Ranked #3 on Visual Grounding on RefCOCO+ test B
