Scene Graph Generation from Objects, Phrases and Region Captions

ICCV 2017 · Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, Xiaogang Wang

Object detection, scene graph generation and region captioning, which are three scene understanding tasks at different semantic levels, are tied together: scene graphs are generated on top of objects detected in an image with their pairwise relationship predicted, while region captioning gives a language description of the objects, their attributes, relations, and other context information. In this work, to leverage the mutual connections across semantic levels, we propose a novel neural network model, termed as Multi-level Scene Description Network (denoted as MSDN), to solve the three vision tasks jointly in an end-to-end manner...

Evaluation results from the paper

Task                     Dataset        Model  Metric name  Metric value  Global rank
Scene Graph Generation   Visual Genome  MSDN   Recall@50    10.72         # 2
Object Detection         Visual Genome  MSDN   mAP          7.43          # 1