VLIS: Unimodal Language Models Guide Multimodal Language Generation

15 Oct 2023 · Jiwan Chung, Youngjae Yu

Multimodal language generation, which leverages the synergy of language and vision, is a rapidly expanding field. However, existing vision-language models struggle on tasks that require complex linguistic understanding. To address this issue, we introduce Visual-Language models as Importance Sampling weights (VLIS), a novel framework that combines the visual conditioning capability of vision-language models with the language understanding of unimodal text-only language models, without further training. VLIS extracts the pointwise mutual information of each image and text from a vision-language model and uses this value as an importance sampling weight to adjust the token likelihood from a text-only model. VLIS improves vision-language models on diverse tasks, including commonsense understanding (WHOOPS, OK-VQA, and ScienceQA) and complex text generation (Concadia, Image Paragraph Captioning, and ROCStories). Our results suggest that VLIS represents a promising new direction for multimodal language generation.
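
Below is a minimal sketch of the per-token scoring rule described in the abstract, assuming next-token log-probability vectors are available from both models. The function name, the weight `alpha`, and the idea of estimating the image-unconditional VLM likelihood with a separate text-only VLM pass are illustrative assumptions, not the paper's exact implementation.

```python
import torch


def vlis_next_token_logprobs(lm_logprobs: torch.Tensor,
                             vlm_cond_logprobs: torch.Tensor,
                             vlm_uncond_logprobs: torch.Tensor,
                             alpha: float = 1.0) -> torch.Tensor:
    """Combine a text-only LM with a vision-language model (VLM), VLIS-style.

    lm_logprobs:         [vocab] log p_text(x_t | text context)
    vlm_cond_logprobs:   [vocab] log p_vlm(x_t | image, text context)
    vlm_uncond_logprobs: [vocab] log p_vlm(x_t | text context), e.g. from a
                         VLM pass without (or with a blank) image -- an
                         assumed estimator, not necessarily the paper's.
    alpha:               weight on the visual evidence (hypothetical knob).
    """
    # Pointwise mutual information between the image and each candidate token:
    # pmi = log p_vlm(x | image, ctx) - log p_vlm(x | ctx).
    pmi = vlm_cond_logprobs - vlm_uncond_logprobs

    # exp(pmi) acts as an importance-sampling weight on the text-only
    # likelihood; in log space this is a simple weighted sum.
    scores = lm_logprobs + alpha * pmi

    # Renormalize over the vocabulary to obtain a proper distribution.
    return torch.log_softmax(scores, dim=-1)
```

At decoding time one would sample or take an argmax from these combined log-probabilities at each step, so the text-only model drives fluency while the PMI term keeps generation grounded in the image.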

| Task | Dataset | Model | Metric | Value | Global Rank |
|------|---------|-------|--------|-------|-------------|
| Caption Generation | Concadia | VLIS (BLIP-2) | CIDEr | 44.1 | #1 |
| Zero-Shot Image Paragraph Captioning | Image Paragraph Captioning | VLIS (BLIP-2) | METEOR | 14.6 | #1 |
| Zero-Shot Image Paragraph Captioning | Image Paragraph Captioning | VLIS (BLIP-2) | CIDEr | 14.8 | #1 |
| Zero-Shot Image Paragraph Captioning | Image Paragraph Captioning | VLIS (BLIP-2) | BLEU-4 | 6.4 | #1 |
| Zero-Shot Image Paragraph Captioning | Image Paragraph Captioning | BLIP-2 | METEOR | 10.8 | #2 |
| Zero-Shot Image Paragraph Captioning | Image Paragraph Captioning | BLIP-2 | CIDEr | 6.5 | #2 |
| Zero-Shot Image Paragraph Captioning | Image Paragraph Captioning | BLIP-2 | BLEU-4 | 4.9 | #2 |
| Explanation Generation | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | VLIS (Lynx) | Accuracy | 80 | #1 |
| Explanation Generation | WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images | VLIS (LLaVA) | Accuracy | 73 | #2 |

Methods


No methods listed for this paper.