Learning Canonical Representations for Scene Graph to Image Generation

Generating realistic images of complex visual scenes becomes challenging when one wishes to control the structure of the generated images. Previous approaches showed that scenes with few entities can be controlled using scene graphs, but this approach struggles as the complexity of the graph (the number of objects and edges) increases. In this work, we show that one limitation of current methods is their inability to capture semantic equivalence in graphs. We present a novel model that addresses these issues by learning canonical graph representations from the data, resulting in improved image generation for complex visual scenes. Our model demonstrates improved empirical performance on large scene graphs, robustness to noise in the input scene graph, and generalization on semantically equivalent graphs. Finally, we show improved performance of the model on three different benchmarks: Visual Genome, COCO, and CLEVR.
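To illustrate the notion of semantic equivalence that the abstract refers to, the sketch below shows one simple, purely rule-based way a scene graph could be mapped to a canonical form: converse relations (e.g. "right of" vs. "left of") are rewritten in a single chosen direction and transitive relations are closed, so that graphs describing the same scene coincide. This is only a minimal illustration of the idea; the relation names, the CANONICAL_RELATION / TRANSITIVE tables, and the canonicalize function are hypothetical, and in the paper the relevant relation properties are learned from data rather than hard-coded.

```python
# Hypothetical sketch of scene-graph canonicalization; relation names,
# converse pairs, and the transitive-closure step are illustrative
# assumptions, not the authors' implementation.

# A scene graph is represented as a set of (subject, relation, object) triples.

# Assumed converse pairs: each relation maps to a canonical direction,
# with a flag indicating whether subject and object must be swapped.
CANONICAL_RELATION = {
    "right of": ("left of", True),   # "A right of B" becomes "B left of A"
    "left of": ("left of", False),
    "below": ("above", True),
    "above": ("above", False),
}

# Relations assumed to be transitive for the closure step.
TRANSITIVE = {"left of", "above"}


def canonicalize(graph):
    """Map a scene graph to a canonical form so that semantically
    equivalent graphs (e.g. using 'left of' vs. 'right of') coincide."""
    canonical = set()
    for subj, rel, obj in graph:
        rel_c, flip = CANONICAL_RELATION[rel]
        canonical.add((obj, rel_c, subj) if flip else (subj, rel_c, obj))

    # Transitive closure: keep adding implied edges until a fixed point.
    changed = True
    while changed:
        changed = False
        for a, r1, b in list(canonical):
            if r1 not in TRANSITIVE:
                continue
            for b2, r2, c in list(canonical):
                if b2 == b and r2 == r1 and (a, r1, c) not in canonical:
                    canonical.add((a, r1, c))
                    changed = True
    return canonical


# Two semantically equivalent graphs map to the same canonical set.
g1 = {("cup", "right of", "plate"), ("plate", "right of", "fork")}
g2 = {("plate", "left of", "cup"), ("fork", "left of", "plate")}
assert canonicalize(g1) == canonicalize(g2)
```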

ECCV 2020
Task: Layout-to-Image Generation    Model: AttSPADE

Dataset                  Metric            Value   Global Rank
COCO-Stuff 256x256       Inception Score   15.6    #2
COCO-Stuff 256x256       FID               54.7    #5
COCO-Stuff 256x256       LPIPS             0.44    #1
Visual Genome 256x256    Inception Score   11.0    #3
Visual Genome 256x256    FID               36.4    #3
Visual Genome 256x256    LPIPS             0.51    #1
