ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection

2 Jun 2021 · Danila Rukhovich, Anna Vorontsova, Anton Konushin

In this paper, we introduce the task of multi-view RGB-based 3D object detection as an end-to-end optimization problem. To address this problem, we propose ImVoxelNet, a novel fully convolutional method of 3D object detection based on monocular or multi-view RGB images. The number of monocular images in each multi-view input can vary during training and inference; in fact, this number may be different for each multi-view input. ImVoxelNet successfully handles both indoor and outdoor scenes, which makes it general-purpose. Specifically, it achieves state-of-the-art results in car detection on the KITTI (monocular) and nuScenes (multi-view) benchmarks among all methods that accept RGB images. Moreover, it surpasses existing RGB-based 3D object detection methods on the SUN RGB-D dataset. On ScanNet, ImVoxelNet sets a new benchmark for multi-view 3D object detection. The source code and the trained models are available at
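The abstract does not spell out how image features reach the 3D volume, but the title names an image-to-voxels projection and the text stresses that the number of views may change from input to input. The sketch below illustrates one plausible reading under strong assumptions: each view is a pinhole camera with no rotation, features are single scalars, and a voxel takes the mean of the features from the views in which its center is visible. The names `View`, `project` and `aggregate`, and the choice of averaging, are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class View:
    """Hypothetical container: an axis-aligned pinhole camera plus a 2D feature map."""
    fx: float
    fy: float
    cx: float
    cy: float
    cam_pos: Point                 # camera center in world coordinates, looking along +z
    features: List[List[float]]    # feature map indexed as features[row][col]


def project(point: Point, view: View) -> Optional[float]:
    """Return the feature under the projection of `point`, or None if it is not visible."""
    x = point[0] - view.cam_pos[0]
    y = point[1] - view.cam_pos[1]
    z = point[2] - view.cam_pos[2]
    if z <= 0:                     # point lies behind the camera
        return None
    u = int(round(view.fx * x / z + view.cx))   # pixel column
    v = int(round(view.fy * y / z + view.cy))   # pixel row
    height, width = len(view.features), len(view.features[0])
    if 0 <= u < width and 0 <= v < height:
        return view.features[v][u]
    return None                    # projection falls outside the image


def aggregate(voxel_centers: Sequence[Point], views: Sequence[View]) -> List[float]:
    """Average per-view features for each voxel; voxels seen by no view get 0.0."""
    volume = []
    for center in voxel_centers:
        samples = [s for s in (project(center, v) for v in views) if s is not None]
        volume.append(sum(samples) / len(samples) if samples else 0.0)
    return volume


# The same function handles one view (monocular) or several (multi-view).
grid = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
view_a = View(1.0, 1.0, 1.0, 1.0, (0.0, 0.0, 0.0), grid)
view_c = View(1.0, 1.0, 1.0, 1.0, (2.0, 0.0, 0.0), [[9.0] * 3 for _ in range(3)])
monocular = aggregate([(0.0, 0.0, 2.0)], [view_a])
multi_view = aggregate([(0.0, 0.0, 2.0)], [view_a, view_c])
```

Because the mean is taken only over the views that actually see a voxel, the list of views can have any length, which is one simple way to accommodate a view count that varies between inputs.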

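| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| 3D Object Detection | DAIR-V2X-I | ImVoxelNet | AP\|R40 (moderate) | 37.6 | #8 |
| 3D Object Detection | DAIR-V2X-I | ImVoxelNet | AP\|R40 (easy) | 44.8 | #8 |
| 3D Object Detection | DAIR-V2X-I | ImVoxelNet | AP\|R40 (hard) | 37.6 | #8 |
| 3D Object Detection | ScanNetV2 | ImVoxelNet (RGB only) | mAP@0.25 | 48.1 | #24 |
| 3D Object Detection | ScanNetV2 | ImVoxelNet (RGB only) | mAP@0.5 | 22.7 | #24 |
| Monocular 3D Object Detection | SUN RGB-D | ImVoxelNet | AP@0.15 (10 / NYU-37) | 42.69 | #2 |
| Monocular 3D Object Detection | SUN RGB-D | ImVoxelNet | AP@0.15 (NYU-37) | 21.08 | #2 |
| Monocular 3D Object Detection | SUN RGB-D | ImVoxelNet | AP@0.15 (10 / PNet-30) | 48.74 | #1 |
| Room Layout Estimation | SUN RGB-D | ImVoxelNet | IoU | 59.3 | #2 |
| Room Layout Estimation | SUN RGB-D | ImVoxelNet | Camera Pitch | 2.63 | #1 |
| Room Layout Estimation | SUN RGB-D | ImVoxelNet | Camera Roll | 1.96 | #1 |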

