TransDSSL: Transformer based Depth Estimation via Self-Supervised Learning

Recently, transformers have been widely adopted for various computer vision tasks and show promising results due to their ability to effectively encode long-range spatial dependencies in an image. However, very few studies have been conducted on adopting transformers for self-supervised depth estimation. When replacing the CNN architecture with a transformer in self-supervised learning of depth, we encounter several problems, such as a multi-scale photometric loss function that becomes problematic when used with transformers and an insufficient ability to capture local details. In this paper, we propose an attention-based decoder module, Pixel-Wise Skip Attention (PWSA), to enhance fine details in feature maps while keeping the global context from transformers. In addition, we propose using a self-distillation loss together with a single-scale photometric loss to alleviate the instability of transformer training by providing correct training signals. We demonstrate that the proposed model makes accurate predictions on large objects and thin structures that require global context and local details. Our model achieves state-of-the-art performance among self-supervised monocular depth estimation methods on the KITTI and DDAD benchmarks.
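
The abstract only sketches the PWSA decoder block and the self-distillation signal. Below is a minimal, hedged PyTorch sketch of how such a pixel-wise skip-attention fusion block and a distillation term could look; the class name, the 1x1 channel projections, the sigmoid gating, and the L1 form of the distillation term are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelWiseSkipAttention(nn.Module):
    """Fuse a coarse transformer (global-context) feature with a
    higher-resolution encoder skip feature via per-pixel attention weights."""

    def __init__(self, decoder_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        # project both streams to a common channel width
        self.proj_dec = nn.Conv2d(decoder_ch, out_ch, kernel_size=1)
        self.proj_skip = nn.Conv2d(skip_ch, out_ch, kernel_size=1)
        # per-pixel gate computed from both streams (assumed form)
        self.attn = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, dec_feat, skip_feat):
        # upsample the global-context feature to the skip resolution
        dec = F.interpolate(dec_feat, size=skip_feat.shape[-2:],
                            mode="bilinear", align_corners=False)
        dec = self.proj_dec(dec)
        skip = self.proj_skip(skip_feat)
        # pixel-wise weights decide how much local detail to re-inject
        w = self.attn(torch.cat([dec, skip], dim=1))
        return F.elu(self.fuse(dec + w * skip))


def self_distillation_loss(pred_depth, teacher_depth):
    # L1 consistency against a detached (stop-gradient) teacher prediction;
    # one plausible form of the self-distillation term described in the paper
    return (pred_depth - teacher_depth.detach()).abs().mean()
```

In a decoder built this way, one such block would typically sit at each upsampling stage before the depth head, with the single-scale photometric loss applied only at the full output resolution and the self-distillation term added as an auxiliary training signal.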


Results from the Paper


Monocular Depth Estimation on DDAD (model: TransDSSL)
  absolute relative error: 0.151  (global rank #3)
  Sq Rel: 3.591  (global rank #3)
  RMSE: 14.350  (global rank #3)
  RMSE log: 0.172  (global rank #2)

Monocular Depth Estimation on KITTI Eigen split unsupervised (model: TransDSSL)
  absolute relative error: 0.095  (global rank #7)
  RMSE: 4.321  (global rank #10)
  Sq Rel: 0.711  (global rank #15)
  RMSE log: 0.172  (global rank #6)
  Delta < 1.25: 0.906  (global rank #6)
  Delta < 1.25^2: 0.967  (global rank #7)
  Delta < 1.25^3: 0.984  (global rank #5)
  Mono: O  (global rank #1)
