LayoutTransformer: Relation-Aware Scene Layout Generation

1 Jan 2021  ·  Cheng-Fu Yang, Wan-Cyuan Fan, Fu-En Yang, Yu-Chiang Frank Wang

In machine learning and computer vision, text-to-image synthesis aims at producing image outputs given input text. In particular, the task of layout generation requires one to describe the spatial information of each object component while modeling the relationships among them. In this paper, we present the LayoutTransformer Network (LT-Net), a generative model for text-conditioned layout generation. By extracting semantics-aware yet object-discriminative contextual features from the input, we utilize Gaussian mixture models to describe the layout of each object with relation consistency enforced. Finally, a co-attention mechanism across textual and visual features is deployed to produce the final output. We conduct extensive experiments on both the MS-COCO and Visual Genome (VG) datasets, confirming the effectiveness and superiority of our LT-Net over recent text-to-image and layout generation models.
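As a rough illustration of the Gaussian-mixture layout modeling described above, the sketch below shows how a contextual object feature could be mapped to a mixture-of-Gaussians distribution over bounding-box coordinates, from which a layout box can be sampled or a ground-truth box scored for training. This is a minimal, hypothetical sketch (the class `GMMBoxHead`, its dimensions, and the number of mixture components are assumptions, not the authors' implementation).

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal, Independent, MixtureSameFamily


class GMMBoxHead(nn.Module):
    """Hypothetical GMM head: maps a contextual object feature to a
    mixture-of-Gaussians distribution over box coordinates (x, y, w, h)."""

    def __init__(self, feat_dim=256, num_components=5, box_dim=4):
        super().__init__()
        self.num_components = num_components
        self.box_dim = box_dim
        # mixture logits + per-component means and log-stds for each coordinate
        self.proj = nn.Linear(feat_dim, num_components * (1 + 2 * box_dim))

    def forward(self, feat):
        K, D = self.num_components, self.box_dim
        params = self.proj(feat)                               # (B, K*(1+2D))
        logits, mu, log_sigma = params.split([K, K * D, K * D], dim=-1)
        mu = mu.view(-1, K, D)
        sigma = log_sigma.view(-1, K, D).exp()
        mix = Categorical(logits=logits)                       # which component
        comp = Independent(Normal(mu, sigma), 1)               # 4-D Gaussian per component
        return MixtureSameFamily(mix, comp)


# Usage: sample a layout box for one object and score a ground-truth box.
head = GMMBoxHead()
feat = torch.randn(1, 256)                                     # contextual object feature
dist = head(feat)
box = dist.sample()                                            # (1, 4) sampled box
nll = -dist.log_prob(torch.tensor([[0.2, 0.3, 0.4, 0.5]]))     # negative log-likelihood loss term
```

Training such a head with the negative log-likelihood keeps the box distribution multi-modal, which is one plausible way to capture the layout diversity the paper targets.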
