TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Text-to-Image Generation	GeNeVA (CoDraw)	LatteGAN	F1-score	77.51 ± 0.52	#1
Text-to-Image Generation	GeNeVA (CoDraw)	LatteGAN	rsim	54.16 ± 0.21	#1
Text-to-Image Generation	GeNeVA (i-CLEVR)	LatteGAN	F1-score	97.26 ± 1.56	#1
Text-to-Image Generation	GeNeVA (i-CLEVR)	LatteGAN	rsim	83.21 ± 1.70	#1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/lattegan-visually-guided-language-attention/text-to-image-generation-on-geneva-codraw)](https://paperswithcode.com/sota/text-to-image-generation-on-geneva-codraw?p=lattegan-visually-guided-language-attention)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/lattegan-visually-guided-language-attention/text-to-image-generation-on-geneva-i-clevr)](https://paperswithcode.com/sota/text-to-image-generation-on-geneva-i-clevr?p=lattegan-visually-guided-language-attention)`
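
Both badge strings above follow the same URL pattern, differing only in the benchmark slug. As a minimal sketch, the pattern can be reproduced from the two slugs. The function name `pwc_badge_markdown` and its parameter names are hypothetical, chosen here for illustration; only the URL layout is taken from the badges listed above.

```python
def pwc_badge_markdown(paper_slug: str, task_slug: str) -> str:
    """Build a badge Markdown string in the layout shown on this page.

    Both arguments are URL slugs; the names are illustrative assumptions.
    """
    # Image URL: a shields.io endpoint badge that wraps the per-benchmark badge URL.
    badge = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{task_slug}"
    )
    # Link target: the leaderboard page, with the paper slug as the `p` query parameter.
    link = f"https://paperswithcode.com/sota/{task_slug}?p={paper_slug}"
    return f"[![PWC]({badge})]({link})"


PAPER = "lattegan-visually-guided-language-attention"
TASKS = (
    "text-to-image-generation-on-geneva-codraw",
    "text-to-image-generation-on-geneva-i-clevr",
)

badges = [pwc_badge_markdown(PAPER, task) for task in TASKS]
for line in badges:
    print(line)
```

Pasting either printed line into a README renders the corresponding ranking badge.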

LatteGAN: Visually Guided Language Attention for Multi-Turn Text-Conditioned Image Manipulation

28 Dec 2021 · Shoya Matsumori, Yuki Abe, Kosuke Shingyouchi, Komei Sugiura, Michita Imai

Text-guided image manipulation tasks have recently gained attention in the vision-and-language community. While most prior studies focused on single-turn manipulation, our goal in this paper is to address the more challenging multi-turn image manipulation (MTIM) task. Previous models for this task successfully generate images iteratively, given a sequence of instructions and a previously generated image. However, these models suffer from under-generation and from low quality of the generated objects that are described in the instructions, which degrades overall performance. To overcome these problems, we present a novel architecture called the Visually Guided Language Attention GAN (LatteGAN). We address the limitations of previous approaches by introducing a Visually Guided Language Attention (Latte) module, which extracts fine-grained text representations for the generator, and a Text-Conditioned U-Net discriminator architecture, which discriminates both the global and local representations of fake or real images. Extensive experiments on two distinct MTIM datasets, CoDraw and i-CLEVR, demonstrate the state-of-the-art performance of the proposed model.
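
The abstract does not spell out how the Latte module computes its text representations. One plausible reading of "visually guided language attention" is a cross-attention in which each spatial location of the image feature map queries the instruction's word embeddings. The NumPy sketch below illustrates that reading only; the function name, the projection matrices `w_q`, `w_k`, `w_v`, and all shapes are assumptions made for illustration, not the authors' implementation (see smatsumori/lattegan for that).

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def visually_guided_text_attention(image_feats, word_feats, w_q, w_k, w_v):
    """Hypothetical cross-attention: image locations attend over words.

    image_feats: (H*W, C_img) flattened feature map of the previous image.
    word_feats:  (T, C_txt) word embeddings of the current instruction.
    Returns a per-location text representation (H*W, d) and the
    attention weights (H*W, T).
    """
    q = image_feats @ w_q                      # queries from visual features
    k = word_feats @ w_k                       # keys from words
    v = word_feats @ w_v                       # values from words
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product scores
    weights = softmax(scores, axis=-1)         # each location's weights sum to 1
    return weights @ v, weights


# Toy shapes, chosen arbitrarily for the illustration.
rng = np.random.default_rng(0)
H, W, C_IMG, T, C_TXT, D = 4, 4, 8, 5, 6, 16
image_feats = rng.normal(size=(H * W, C_IMG))
word_feats = rng.normal(size=(T, C_TXT))
w_q = rng.normal(size=(C_IMG, D))
w_k = rng.normal(size=(C_TXT, D))
w_v = rng.normal(size=(C_TXT, D))

text_map, attn = visually_guided_text_attention(image_feats, word_feats, w_q, w_k, w_v)
print(text_map.shape, attn.shape)
```

Under this reading, every spatial location receives its own mixture of word features, which is one way a generator could be given a fine-grained, location-specific text signal rather than a single sentence vector.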

Code

smatsumori/lattegan (official)

Tasks

Image Manipulation

Text-to-Image Generation

Datasets

CLEVR

CoDraw

Results from the Paper

Ranked #1 on Text-to-Image Generation on GeNeVA (CoDraw)

Methods

Concatenated Skip Connection • Convolution • Max Pooling • ReLU • U-Net
