In this work, we argue that captions contain much richer information about the image, including attributes of objects and their relations.
Human action is naturally compositional: humans can easily recognize and perform actions with objects that are different from those used in training demonstrations.
Generating realistic images of complex visual scenes becomes challenging when one wishes to control the structure of the generated images.
We propose a hybrid coarse-to-fine approach that leverages visual and GPS location cues.
In many domains, it is preferable to train systems jointly in an end-to-end manner, but scene graphs (SGs) are not commonly used as intermediate components in visual reasoning systems because, being discrete and sparse, they are non-differentiable and difficult to optimize.
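The reasoning in the preceding excerpt is concrete enough to sketch: selecting discrete edges (e.g., by argmax over predicate logits) cuts the gradient path, while a soft relaxation keeps a scene-graph-like representation trainable end-to-end. The following minimal PyTorch sketch illustrates only this general point; the tensor shapes, the `logits` variable, and the softmax relaxation are illustrative assumptions, not any of the cited papers' actual methods.

```python
import torch

# Illustrative only: relation logits for 4 objects and 3 candidate predicates.
logits = torch.randn(4, 4, 3, requires_grad=True)

# Discrete scene graph: argmax picks one predicate per object pair.
# argmax is piecewise constant, so no gradient can flow back to the logits.
hard_edges = logits.argmax(dim=-1)           # integer tensor, detached from autograd
print(hard_edges.requires_grad)              # False: the gradient path is cut

# Soft relaxation: a softmax over predicates yields continuous edge weights,
# so a downstream loss can backpropagate through the "scene graph".
soft_edges = torch.softmax(logits, dim=-1)   # probabilistic edges, still differentiable
loss = soft_edges.sum()                      # stand-in for any downstream task loss
loss.backward()
print(logits.grad is not None)               # True: gradients reach the logits
```

In practice, approaches that keep scene graphs in the training loop often sample from or anneal such relaxations (e.g., with Gumbel-softmax) so the representation approaches a discrete graph at inference time.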
Events defined by the interaction of objects in a scene are often of critical importance, yet such events may have too few labeled examples to train a conventional deep model that generalizes to unseen object appearances.