TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Captioning	Flickr30k Captions test	MetaLM	CIDEr	43.3	# 4
Image Captioning	Flickr30k Captions test	MetaLM	SPICE	11.7	# 3
Image Captioning	nocaps val	MetaLM	CIDEr	58.7	# 2
Image Captioning	nocaps val	MetaLM	SPICE	8.6	# 2
Visual Question Answering (VQA)	OK-VQA	MetaLM	Accuracy	11.4	# 34
Visual Question Answering (VQA)	VQA v2 val	MetaLM	Accuracy	41.1	# 9

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/language-models-are-general-purpose/image-captioning-on-nocaps-val)](https://paperswithcode.com/sota/image-captioning-on-nocaps-val?p=language-models-are-general-purpose)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/language-models-are-general-purpose/image-captioning-on-flickr30k-captions-test)](https://paperswithcode.com/sota/image-captioning-on-flickr30k-captions-test?p=language-models-are-general-purpose)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/language-models-are-general-purpose/visual-question-answering-on-vqa-v2-val)](https://paperswithcode.com/sota/visual-question-answering-on-vqa-v2-val?p=language-models-are-general-purpose)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/language-models-are-general-purpose/visual-question-answering-on-ok-vqa)](https://paperswithcode.com/sota/visual-question-answering-on-ok-vqa?p=language-models-are-general-purpose)`

Language Models are General-Purpose Interfaces

13 Jun 2022 · Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, Furu Wei

Foundation models have received much attention due to their effectiveness across a broad range of downstream applications. Though architectures have largely converged, most pretrained models are still developed for specific tasks or modalities. In this work, we propose to use language models as a general-purpose interface to various foundation models. A collection of pretrained encoders perceive diverse modalities (such as vision and language), and they dock with a language model that plays the role of a universal task layer. We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders. We subsume the advantages and capabilities of both causal and non-causal modeling, thereby combining the best of both worlds. Specifically, the proposed method not only inherits the capabilities of in-context learning and open-ended generation from causal language modeling, but is also conducive to finetuning because of the bidirectional encoders. More importantly, our approach seamlessly unlocks combinations of the above capabilities, e.g., enabling in-context learning or instruction following with finetuned encoders. Experimental results across various language-only and vision-language benchmarks show that our model outperforms or is competitive with specialized models on finetuning, zero-shot generalization, and few-shot learning.
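One way to picture the semi-causal idea described in the abstract is as an attention-visibility pattern: some spans are read bidirectionally, while everything else is read left to right. The sketch below is only an illustration under loud assumptions. It collapses the separate encoders and the language-model interface into a single mask, which the abstract does not say the model does, and the function name `semi_causal_mask` and the span format are hypothetical.

```python
from typing import List, Tuple


def semi_causal_mask(seq_len: int, spans: List[Tuple[int, int]]) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    `spans` holds hypothetical half-open [start, end) ranges that are
    treated as bidirectional; all other positions stay strictly causal.
    """
    # Start from a standard causal (lower-triangular) mask.
    mask = [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    for start, end in spans:
        if not (0 <= start < end <= seq_len):
            raise ValueError(f"invalid span: {(start, end)}")
        # Inside a bidirectional span, every position sees the whole span.
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = True
    return mask


def render(mask: List[List[bool]]) -> str:
    """Draw the mask as text: 'x' means visible, '.' means hidden."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    # Six positions; positions 1..3 form one bidirectional span.
    print(render(semi_causal_mask(6, [(1, 4)])))
```

Positions inside the span see each other in both directions, while positions after it still see only what precedes them, which is the property that keeps open-ended generation possible in this simplified picture.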

Code

microsoft/unilm official

18,274

Tasks

Causal Language Modeling

Few-Shot Learning

Image Captioning

In-Context Learning

Instruction Following

Language Modelling

Visual Question Answering (VQA)

Zero-shot Generalization

Datasets

MS COCO

GLUE

SST

MultiNLI

IMDb Movie Reviews

SST-2

SNLI

QNLI

Flickr30k

MRPC

HellaSwag

BoolQ

Visual Question Answering v2.0

PIQA

OpenBookQA

WebText

WinoGrande

DROP

COPA

OK-VQA

ANLI

NoCaps

e-SNLI-VE

Results from the Paper

Ranked #2 on Image Captioning on nocaps val
