Scene Graph Generation
71 papers with code • 4 benchmarks • 5 datasets
A scene graph is a structured representation of an image, where nodes in a scene graph correspond to object bounding boxes with their object categories, and edges correspond to their pairwise relationships between objects. The task of Scene Graph Generation is to generate a visually-grounded scene graph that most accurately correlates with an image.
We propose to compose dynamic tree structures that place the objects in an image into a visual context, helping visual reasoning tasks such as scene graph generation and visual Q&A.
Today's scene graph generation (SGG) task is still far from practical, mainly due to the severe training bias, e. g., collapsing diverse "human walk on / sit on / lay on beach" into "human on beach".
In this work, we explicitly model the objects and their relationships using scene graphs, a visually-grounded graphical structure of an image.
The first, Entity Instance Confusion, occurs when the model confuses multiple instances of the same type of entity (e. g. multiple cups).
More specifically, we show that the statistical correlations between objects appearing in images and their relationships, can be explicitly represented by a structured knowledge graph, and a routing mechanism is learned to propagate messages through the graph to explore their interactions.
Scene graph generation models understand the scene through object and predicate recognition, but are prone to mistakes due to the challenges of perception in the wild.
Learning to infer graph representations and performing spatial reasoning in a complex surgical environment can play a vital role in surgical scene understanding in robotic surgery.