Digging Into Self-Supervised Monocular Depth Estimation

4 Jun 2018 · Clément Godard, Oisin Mac Aodha, Michael Firman, Gabriel Brostow

Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.
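Two of the proposed components, the minimum reprojection loss (i) and the auto-masking loss (iii), can be illustrated with a short sketch. The snippet below is a simplified illustration under stated assumptions, not the paper's implementation: it uses a plain L1 photometric error (the paper combines SSIM and L1), operates on already-warped source frames, and the function name and array shapes are hypothetical.

```python
import numpy as np

def masked_min_reprojection_loss(target, warped_sources, sources):
    """Simplified sketch of per-pixel minimum reprojection with auto-masking.

    target:          (H, W, 3) target frame
    warped_sources:  list of (H, W, 3) source frames warped into the target view
    sources:         list of (H, W, 3) unwarped source frames
    """
    # Photometric (L1) error between the target and each warped source,
    # averaged over color channels: shape (num_sources, H, W).
    reproj_err = np.stack(
        [np.abs(target - w).mean(axis=-1) for w in warped_sources]
    )
    # (i) Minimum reprojection: take the per-pixel minimum over source frames,
    # which is robust to pixels occluded in some (but not all) sources.
    min_reproj = reproj_err.min(axis=0)

    # (iii) Auto-masking: compare against the error of the *unwarped* sources.
    # Pixels where doing nothing beats warping (e.g. a static camera, or
    # objects moving at the same speed as the camera) violate the camera
    # motion assumption and are excluded from the loss.
    identity_err = np.stack(
        [np.abs(target - s).mean(axis=-1) for s in sources]
    ).min(axis=0)
    mask = min_reproj < identity_err

    # Average the minimum reprojection error over the unmasked pixels only.
    return (min_reproj * mask).sum() / max(mask.sum(), 1)
```

In the paper's training loop this loss is computed per scale after upsampling each predicted depth to the input resolution (component (ii)); here it is shown for a single scale only.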

Results

Task | Dataset | Model | Metric | Value | Global Rank
--- | --- | --- | --- | --- | ---
Monocular Depth Estimation | KITTI Eigen split | Monodepth2 (M) | Absolute relative error | 0.106 | #37
Monocular Depth Estimation | Make3D | Monodepth2 | Abs Rel | 0.322 | #3
Monocular Depth Estimation | Make3D | Monodepth2 | Sq Rel | 3.589 | #2
Monocular Depth Estimation | Make3D | Monodepth2 | RMSE | 7.417 | #4
Monocular Depth Estimation | Mid-Air Dataset | Monodepth2 | Abs Rel | 0.717 | #6
Monocular Depth Estimation | Mid-Air Dataset | Monodepth2 | Sq Rel | 37.164 | #1
Monocular Depth Estimation | Mid-Air Dataset | Monodepth2 | RMSE | 74.552 | #5
Monocular Depth Estimation | Mid-Air Dataset | Monodepth2 | RMSE log | 0.882 | #6
Monocular Depth Estimation | VA (Virtual Apartment) | Monodepth2 | RMSE | 0.432 | #3
Monocular Depth Estimation | VA (Virtual Apartment) | Monodepth2 | RMSE log | 0.251 | #3
Monocular Depth Estimation | VA (Virtual Apartment) | Monodepth2 | MAE | 0.295 | #3
Monocular Depth Estimation | VA (Virtual Apartment) | Monodepth2 | Abs Rel | 0.203 | #3

