Today's scene graph generation (SGG) task is still far from practical, mainly due to severe training bias, e.g., collapsing the diverse relations "human walk on / sit on / lay on beach" into "human on beach".
We propose to compose dynamic tree structures that place the objects in an image into a visual context, helping visual reasoning tasks such as scene graph generation and visual Q&A.
More specifically, we show that the statistical correlations between objects appearing in images and their relationships can be explicitly represented by a structured knowledge graph, and that a routing mechanism can be learned to propagate messages through the graph to explore their interactions.
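One way to picture such message propagation is a minimal sketch below: per-object features are refined by routing messages along edges weighted by co-occurrence statistics, with a learned gate deciding how much of each incoming message to absorb. The function names, gating form, and all weights here are illustrative assumptions, not the paper's actual mechanism.

```python
import numpy as np

def propagate(node_feats, co_occurrence, W_msg, W_gate, steps=3):
    """Hypothetical gated message passing over a statistical knowledge graph.

    node_feats:     (n, d) per-object features
    co_occurrence:  (n, n) row-normalised object/relation statistics
    W_msg, W_gate:  (d, d) learned transforms (randomly initialised here)
    """
    h = node_feats
    for _ in range(steps):
        # Messages from neighbours, scaled by statistical correlation.
        msgs = co_occurrence @ (h @ W_msg)
        # Learned routing: a sigmoid gate controls message absorption.
        gate = 1.0 / (1.0 + np.exp(-(h @ W_gate)))
        h = h + gate * msgs
    return h

rng = np.random.default_rng(0)
n, d = 5, 8                                   # 5 objects, 8-dim features
h0 = rng.normal(size=(n, d))
A = rng.random((n, n))
A /= A.sum(axis=1, keepdims=True)             # row-normalised statistics
W_m = rng.normal(size=(d, d)) * 0.1
W_g = rng.normal(size=(d, d)) * 0.1
h = propagate(h0, A, W_m, W_g)
print(h.shape)  # (5, 8)
```

After a few propagation steps, each object's feature vector mixes in context from statistically correlated objects, which is the intuition behind exploring interactions through the graph.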
In this work, we explicitly model the objects and their relationships using scene graphs, a visually-grounded graphical structure of an image.
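A scene graph of this kind can be sketched as a set of detected objects plus (subject, predicate, object) triples grounded in image regions. The labels and boxes below are made-up illustrations, not data from the paper:

```python
# Minimal sketch of a visually-grounded scene graph: objects carry a
# class label and a bounding box; relations are index triples.
scene_graph = {
    "objects": {
        0: {"label": "human", "box": [12, 40, 110, 300]},
        1: {"label": "beach", "box": [0, 250, 640, 480]},
    },
    "relations": [
        (0, "walk on", 1),  # human walk on beach
    ],
}

def describe(graph):
    """Render each relation triple as a readable phrase."""
    return [
        f'{graph["objects"][s]["label"]} {p} {graph["objects"][o]["label"]}'
        for s, p, o in graph["relations"]
    ]

print(describe(scene_graph))  # ['human walk on beach']
```

Keeping predicates fine-grained ("walk on" rather than a collapsed "on") is exactly what the biased-training discussion above is concerned with.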
For this, we introduce the Toulouse Road Network dataset, based on real-world, publicly available data.
Indeed, as we demonstrate, their performance degrades significantly for larger molecules.