Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes from a Single Image

Semantic reconstruction of indoor scenes refers to both scene understanding and object reconstruction. Existing works either address only one part of this problem or focus on objects in isolation. In this paper, we bridge the gap between understanding and reconstruction, and propose an end-to-end solution to jointly reconstruct the room layout, object bounding boxes, and meshes from a single image. Instead of resolving scene understanding and object reconstruction separately, our method builds on holistic scene context and proposes a coarse-to-fine hierarchy with three components: (1) room layout with camera pose; (2) 3D object bounding boxes; (3) object meshes. We argue that understanding the context of each component assists in parsing the others, which enables joint understanding and reconstruction. Experiments on the SUN RGB-D and Pix3D datasets demonstrate that our method consistently outperforms existing methods in indoor layout estimation, 3D object detection, and mesh reconstruction.
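
The coarse-to-fine hierarchy described above can be sketched as a three-stage forward pass. This is a minimal illustrative sketch only; the class and function names (`Layout`, `estimate_box`, etc.) are hypothetical stand-ins, not the authors' actual interfaces, and the stages return dummy values.

```python
from dataclasses import dataclass
from typing import Any, List

# Hypothetical containers for the paper's three output levels (names are
# illustrative, not the authors' actual code).
@dataclass
class Layout:
    pitch: float         # camera pitch
    roll: float          # camera roll
    cuboid: List[float]  # 3D room-layout box parameters

@dataclass
class Scene:
    layout: Layout
    boxes: List[Any]     # per-object 3D bounding boxes
    meshes: List[Any]    # per-object reconstructed meshes

def estimate_layout(image) -> Layout:
    # Stage 1 (coarsest): predict room layout and camera pose from the full image.
    return Layout(pitch=0.0, roll=0.0, cuboid=[0.0] * 6)  # dummy output

def estimate_box(image, detection, layout):
    # Stage 2: lift each 2D detection to a 3D box, conditioned on layout/camera.
    return {"detection": detection, "box": [0.0] * 7}  # dummy 3D box

def reconstruct_mesh(image, box):
    # Stage 3 (finest): recover a mesh for each detected object.
    return {"box": box, "mesh": None}  # dummy mesh

def total3d_forward(image, detections) -> Scene:
    # The three stages are trained jointly in the paper, so scene context
    # (layout, boxes, meshes) can refine each component's prediction.
    layout = estimate_layout(image)
    boxes = [estimate_box(image, d, layout) for d in detections]
    meshes = [reconstruct_mesh(image, b) for b in boxes]
    return Scene(layout, boxes, meshes)
```

The key design choice the sketch reflects is that each finer stage consumes the coarser stage's output, rather than solving layout estimation, detection, and reconstruction independently.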

CVPR 2020

Datasets

SUN RGB-D, Pix3D

Results from the Paper


| Task                          | Dataset   | Model              | Metric                 | Value  | Global Rank |
|-------------------------------|-----------|--------------------|------------------------|--------|-------------|
| 3D Shape Reconstruction       | Pix3D     | MGN                | CD                     | 0.0836 | #2          |
| 3D Shape Reconstruction       | Pix3D     | MGN                | EMD                    | N/A    | #3          |
| 3D Shape Reconstruction       | Pix3D     | MGN                | IoU                    | N/A    | #2          |
| Monocular 3D Object Detection | SUN RGB-D | Total3D w/o. joint | AP@0.15 (10 / NYU-37)  | 23.32  | #5          |
| Monocular 3D Object Detection | SUN RGB-D | Total3D w/o. joint | AP@0.15 (NYU-37)       | 13.25  | #4          |
| Monocular 3D Object Detection | SUN RGB-D | Total3D joint      | AP@0.15 (10 / NYU-37)  | 26.38  | #3          |
| Monocular 3D Object Detection | SUN RGB-D | Total3D joint      | AP@0.15 (NYU-37)       | 14.28  | #3          |
| Room Layout Estimation        | SUN RGB-D | Total3D joint      | IoU                    | 59.2   | #3          |
| Room Layout Estimation        | SUN RGB-D | Total3D joint      | Camera Pitch           | 3.15   | #3          |
| Room Layout Estimation        | SUN RGB-D | Total3D joint      | Camera Roll            | 2.09   | #2          |
| Room Layout Estimation        | SUN RGB-D | Total3D w/o. joint | IoU                    | 57.6   | #4          |
| Room Layout Estimation        | SUN RGB-D | Total3D w/o. joint | Camera Pitch           | 3.68   | #5          |
| Room Layout Estimation        | SUN RGB-D | Total3D w/o. joint | Camera Roll            | 2.59   | #5          |
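
The CD metric in the Pix3D shape-reconstruction rows is the Chamfer distance between sampled point clouds, a standard mesh-reconstruction metric. As a point of reference, a minimal NumPy sketch of the symmetric (squared-distance) Chamfer distance is below; the exact sampling and normalization used for the benchmark numbers may differ.

```python
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3).

    For each point in p, take the squared distance to its nearest neighbor
    in q, average over p, then add the same term with p and q swapped.
    """
    # (N, M) matrix of pairwise squared Euclidean distances.
    d = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```

A brute-force pairwise matrix is fine for small clouds; benchmark implementations typically use a KD-tree or GPU nearest-neighbor search instead.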

Methods


No methods listed for this paper.