Experiments on the KITTI and DDAD datasets show that our DepthFormer architecture establishes a new state of the art in self-supervised monocular depth estimation, and is even competitive with highly specialized supervised single-frame architectures.
However, the simultaneous self-supervised learning of depth and scene flow is ill-posed, as there are infinitely many depth and scene flow combinations that explain the same observed 3D point.
Camera calibration is integral to robotics and computer vision algorithms that seek to infer geometric properties of the scene from visual input streams.
This top-down approach outperforms its bottom-up counterpart in which object bounding box prediction follows per-pixel depth estimation, since it does not suffer from the compounding error introduced by a depth prediction model.
Recent progress in 3D object detection from single images leverages monocular depth estimation as a way to produce 3D pointclouds, turning cameras into pseudo-lidar sensors.
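The core of such pseudo-lidar pipelines is back-projecting a predicted depth map through the camera intrinsics into a 3D pointcloud. A minimal sketch of that unprojection step (the function name and setup are illustrative, not any particular paper's implementation):

```python
import numpy as np

def depth_to_pointcloud(depth, K):
    """Back-project a depth map into a pseudo-lidar pointcloud.

    depth: (H, W) per-pixel metric depths
    K:     (3, 3) camera intrinsics matrix
    Returns an (H*W, 3) array of 3D points in the camera frame.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates, one column per pixel: (3, H*W)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    rays = np.linalg.inv(K) @ pix        # normalized camera rays
    pts = rays * depth.reshape(-1)       # scale each ray by its depth
    return pts.T
```

The resulting points can then be fed to any lidar-based 3D detector, which is what makes the "pseudo-lidar" framing attractive.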
Artists and video game designers often construct 2D animations using libraries of sprites -- textured patches of objects and characters.
In this work, we extend monocular self-supervised depth and ego-motion estimation to large-baseline multi-camera rigs.
Estimating scene geometry from data obtained with cost-effective sensors is key for robots and self-driving cars.
Simulators can efficiently generate large amounts of labeled synthetic data with perfect supervision for hard-to-label tasks like semantic segmentation.
Fluid-filled soft visuotactile sensors such as the Soft-bubbles alleviate key challenges for robust manipulation, as they enable reliable grasps along with the ability to obtain high-resolution sensory feedback on contact geometry and forces.
Self-supervised learning has emerged as a powerful tool for depth and ego-motion estimation, leading to state-of-the-art results on benchmark datasets.
Instead of using semantic labels and proxy losses in a multi-task approach, we propose a new architecture leveraging fixed pretrained semantic segmentation networks to guide self-supervised representation learning via pixel-adaptive convolutions.
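The general mechanism behind pixel-adaptive convolutions is to modulate a fixed spatial kernel by a per-pixel affinity computed on guidance features, so the effective filter adapts to local semantics. A single-channel NumPy sketch of that mechanism (a Gaussian affinity and scalar features are simplifying assumptions, not the paper's exact formulation):

```python
import numpy as np

def pixel_adaptive_conv(x, guide, weights, sigma=1.0):
    """Minimal single-channel pixel-adaptive convolution sketch.

    x:       (H, W) input map
    guide:   (H, W, C) guidance features (e.g. from a segmentation net)
    weights: (k, k) spatial convolution weights, k odd
    The spatial weights are modulated, per pixel, by a Gaussian
    affinity between the centre pixel's guidance feature and each
    neighbour's feature.
    """
    k = weights.shape[0]
    r = k // 2
    H, W = x.shape
    xp = np.pad(x, r)                                  # zero-pad input
    gp = np.pad(guide, ((r, r), (r, r), (0, 0)), mode="edge")
    out = np.zeros_like(x, dtype=float)
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k]
            gpatch = gp[i:i + k, j:j + k]
            diff = gpatch - guide[i, j]
            kern = np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / sigma ** 2)
            out[i, j] = np.sum(kern * weights * patch)
    return out
```

When the guidance features are constant, the affinity is 1 everywhere and the operation reduces to an ordinary convolution; spatial variation in the guidance is what makes the kernel content-adaptive.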
By making the sampling of inlier-outlier sets from point-pair correspondences fully differentiable within the keypoint learning framework, we show that we are able to simultaneously self-supervise keypoint description and improve keypoint matching.
Detecting and matching robust viewpoint-invariant keypoints is critical for visual SLAM and Structure-from-Motion.
This paper addresses the problem of learning instantaneous occupancy levels of dynamic environments and predicting future occupancy levels.
Panoptic segmentation is a complex full scene parsing task requiring simultaneous instance and semantic segmentation at high resolution.
Learning depth and camera ego-motion from raw unlabeled RGB video streams is seeing exciting progress through self-supervision from strong geometric cues.
Dense depth estimation from a single image is a key problem in computer vision, with exciting applications in a multitude of robotic tasks.
Although cameras are ubiquitous, robotic platforms typically rely on active sensors like LiDAR for direct 3D perception.
This paper addresses the problem of single image depth estimation (SIDE), focusing on improving the quality of deep neural network predictions.
On the other hand, the cost of evaluating the policy's performance might also be high, making it desirable to find a solution with as few interactions with the real system as possible.
This paper introduces the concept of continuous convolution to neural networks and deep learning applications in general.
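The idea behind a continuous convolution is to replace a fixed grid kernel with a parametric function that maps each real-valued relative offset to a weight, so the operation is defined over irregular point sets. A minimal sketch of that idea (the callable weight function stands in for the small MLP typically used in practice; names and the scalar-feature setup are illustrative):

```python
import numpy as np

def continuous_conv(points, feats, weight_fn, radius=1.0):
    """Minimal continuous-convolution sketch over an irregular point set.

    points:    (N, D) real-valued coordinates
    feats:     (N,)   scalar feature per point
    weight_fn: maps an offset vector (D,) to a scalar weight
    Each output is a weighted sum of neighbouring features, with
    weights produced by evaluating weight_fn at the continuous offset.
    """
    N = points.shape[0]
    out = np.zeros(N)
    for i in range(N):
        for j in range(N):
            offset = points[j] - points[i]
            if np.linalg.norm(offset) <= radius:   # local support
                out[i] += weight_fn(offset) * feats[j]
    return out
```

Because the kernel is a function of the offset rather than a lookup table, the same operator applies to pointclouds, graphs, or any data without a regular grid structure.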
In outdoor environments, mobile robots are required to navigate through terrain with varying characteristics, some of which might significantly affect the integrity of the platform.