Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks.
Autonomous vehicle software is typically structured as a modular pipeline of individual components (e.g., perception, prediction, and planning) to help separate concerns into interpretable sub-tasks.
Experiments on the KITTI and DDAD datasets show that our DepthFormer architecture establishes a new state of the art in self-supervised monocular depth estimation, and is even competitive with highly specialized supervised single-frame architectures.
This paper proposes a self-supervised objective for learning representations that localize objects under occlusion -- a property known as object permanence.
However, the simultaneous self-supervised learning of depth and scene flow is ill-posed, as there are infinitely many combinations that result in the same 3D point.
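The ill-posedness can be seen directly from the pinhole projection model: any point scaled along the camera ray projects to the same pixel, so depth and 3D motion cannot be disentangled from a single reprojection constraint. A minimal sketch, with illustrative intrinsics not taken from the paper:

```python
# Minimal sketch of the depth / scene-flow ambiguity with a pinhole camera.
# The intrinsics (fx, fy, cx, cy) are illustrative placeholders.

def project(X, Y, Z, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Project a 3D camera-frame point to pixel coordinates."""
    return (fx * X / Z + cx, fy * Y / Z + cy)

# A point at depth 2 m, and the same ray scaled to depth 4 m.
p_near = project(0.5, 0.2, 2.0)
p_far = project(1.0, 0.4, 4.0)  # 2x the point: same ray, same pixel

print(p_near)  # identical projections
print(p_far)

# Consequently, depth and scene flow are entangled: the point (X, Z)
# moving by t and the point (2X, 2Z) moving by 2t produce the same pixel
# motion, so infinitely many (depth, flow) pairs explain one observation.
```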
Our experiments demonstrate that, despite only capturing a small subset of the objects that move, this signal is enough to generalize to segment both moving and static instances of dynamic objects.
The ability to learn reward functions plays an important role in enabling the deployment of intelligent agents in the real world.
Camera calibration is integral to robotics and computer vision algorithms that seek to infer geometric properties of the scene from visual input streams.
Third, inspired by the theoretical insights, we devise a re-weighted regularization technique that consistently improves the SSL representation quality on imbalanced datasets with several evaluation criteria, closing the small gap between balanced and imbalanced datasets with the same number of examples.
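The paper's regularizer is specific to its theoretical analysis, but the general re-weighting pattern it builds on can be sketched as follows -- this is an illustrative inverse-frequency scheme, not the paper's exact method:

```python
# Illustrative sketch only: generic inverse-frequency re-weighting of a
# per-sample loss term, up-weighting rare classes on imbalanced data.
from collections import Counter

def class_weights(labels):
    """Weight each class inversely to its frequency, normalized to mean 1."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

def reweighted_loss(per_sample_losses, labels):
    w = class_weights(labels)
    return sum(w[y] * l for l, y in zip(per_sample_losses, labels)) / len(labels)

labels = [0] * 90 + [1] * 10            # long-tailed: 90 vs 10 examples
losses = [1.0] * 100                    # uniform raw loss for clarity
print(reweighted_loss(losses, labels))  # weights are mean-normalized
```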
Deep learning models for semantic segmentation rely on expensive, large-scale, manually annotated datasets.
Recent progress in 3D object detection from single images leverages monocular depth estimation as a way to produce 3D pointclouds, turning cameras into pseudo-lidar sensors.
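The core pseudo-lidar step is unprojecting a predicted depth map into a 3D point cloud via the camera intrinsics. A minimal sketch, with a toy depth map and placeholder intrinsics:

```python
# Pseudo-lidar sketch: lift a depth map to camera-frame 3D points.
# The tiny 2x2 "depth map" and intrinsics are illustrative placeholders.

def unproject(depth, fx, fy, cx, cy):
    """Lift a dense depth map (list of rows) to camera-frame 3D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points

depth = [[2.0, 2.0],
         [4.0, 4.0]]                 # toy 2x2 depth map, in meters
cloud = unproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(len(cloud))  # one 3D point per pixel, usable by lidar-style detectors
```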
We use a hierarchical Lovász hinge loss to learn a low-dimensional embedding space structured into a unified semantic and instance hierarchy without requiring separate network branches or object proposals.
Despite the empirical successes, theoretical foundations are limited -- prior analyses assume conditional independence of the positive pairs given the same class label, but recent empirical applications use heavily correlated positive pairs (i.e., data augmentations of the same image).
Reasoning about the future behavior of other agents is critical to safe robot navigation.
In this work, we extend monocular self-supervised depth and ego-motion estimation to large-baseline multi-camera rigs.
Estimating scene geometry from data obtained with cost-effective sensors is key for robots and self-driving cars.
Simulators can efficiently generate large amounts of labeled synthetic data with perfect supervision for hard-to-label tasks like semantic segmentation.
no code implementations • 29 Mar 2021 • Sharada Mohanty, Jyotish Poonganam, Adrien Gaidon, Andrey Kolobov, Blake Wulfe, Dipam Chakraborty, Gražvydas Šemetulskis, João Schapke, Jonas Kubilius, Jurgis Pašukonis, Linas Klimas, Matthew Hausknecht, Patrick MacAlpine, Quang Nhat Tran, Thomas Tumiel, Xiaocheng Tang, Xinwei Chen, Christopher Hesse, Jacob Hilton, William Hebgen Guss, Sahika Genc, John Schulman, Karl Cobbe
We present the design of a centralized benchmark for Reinforcement Learning that measures sample efficiency and generalization by performing end-to-end evaluation of the training and rollout phases of thousands of user-submitted code bases in a scalable way.
In this work, we introduce an end-to-end trainable approach for joint object detection and tracking that is capable of such reasoning.
Fluid-filled soft visuotactile sensors such as the Soft-bubbles alleviate key challenges for robust manipulation, as they enable reliable grasps along with the ability to obtain high-resolution sensory feedback on contact geometry and forces.
no code implementations • 24 Nov 2020 • Daisuke Nishiyama, Mario Ynocente Castro, Shirou Maruyama, Shinya Shiroshita, Karim Hamzaoui, Yi Ouyang, Guy Rosman, Jonathan DeCastro, Kuan-Hui Lee, Adrien Gaidon
Automated Vehicles require exhaustive testing in simulation to detect as many safety-critical failures as possible before deployment on public roads.
Traffic simulators are important tools in autonomous driving development.
3D object detection from monocular images is an ill-posed problem due to the projective entanglement of depth and scale.
Reasoning about human motion is a core component of modern human-robot interactive systems.
This paper presents a novel online framework for safe crowd-robot interaction based on risk-sensitive stochastic optimal control, wherein the risk is modeled by the entropic risk measure.
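The entropic risk measure mentioned above has the standard closed form R_theta(X) = (1/theta) log E[exp(theta X)]; for theta > 0 it upper-bounds the plain expectation, penalizing rare high-cost outcomes. A small sketch with made-up cost samples:

```python
# Entropic risk of equally likely cost samples:
#   R_theta(X) = (1/theta) * log E[exp(theta * X)]
# The cost samples below are invented for illustration.
import math

def entropic_risk(costs, theta):
    """Risk-sensitive aggregate of a set of equally likely cost samples."""
    m = sum(math.exp(theta * c) for c in costs) / len(costs)
    return math.log(m) / theta

costs = [0.0, 1.0, 5.0]              # hypothetical collision-cost samples
mean_cost = sum(costs) / len(costs)
risk = entropic_risk(costs, theta=1.0)
print(mean_cost, risk)               # risk > mean: rare high costs dominate
```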
In this work, we propose a behavioral cloning approach that can safely leverage imperfect perception without being conservative.
Self-supervised learning has emerged as a powerful tool for depth and ego-motion estimation, leading to state-of-the-art results on benchmark datasets.
In autonomous driving, accurately estimating the state of surrounding obstacles is critical for safe and robust path planning.
To address driving in near-accident scenarios, we propose a hierarchical reinforcement and imitation learning (H-ReIL) approach that consists of low-level policies learned by IL for discrete driving modes, and a high-level policy learned by RL that switches between different driving modes.
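Structurally, that hierarchy amounts to a discrete RL-learned selector dispatching to continuous IL-learned controllers. A sketch of the control flow only -- the mode policies, thresholds, and observation keys below are hypothetical stand-ins, not the paper's models:

```python
# Structural sketch of a hierarchical RL + IL policy (hypothetical stand-ins).

def aggressive_mode(obs):   # stand-in for an IL-trained low-level policy
    return {"accel": 1.0, "steer": obs["lane_error"]}

def cautious_mode(obs):     # stand-in for a second IL-trained mode
    return {"accel": 0.2, "steer": obs["lane_error"]}

def high_level_policy(obs):
    """Stand-in for the RL mode selector: slow down near other agents."""
    return cautious_mode if obs["nearest_agent_dist"] < 10.0 else aggressive_mode

def act(obs):
    mode = high_level_policy(obs)   # discrete mode choice (RL)
    return mode(obs)                # continuous control (IL)

print(act({"lane_error": 0.1, "nearest_agent_dist": 5.0}))   # cautious
print(act({"lane_error": 0.1, "nearest_agent_dist": 50.0}))  # aggressive
```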
Real-world large-scale datasets are heteroskedastic and imbalanced -- labels have varying levels of uncertainty and label distributions are long-tailed.
Deep neural networks (DNNs) have shown remarkable performance improvements on vision-related tasks such as object detection or image segmentation.
In this work, we present Predicted Endpoint Conditioned Network (PECNet) for flexible human trajectory prediction.
In this paper, we propose a novel spatio-temporal graph model for video captioning that exploits object interactions in space and time.
Instead of using semantic labels and proxy losses in a multi-task approach, we propose a new architecture leveraging fixed pretrained semantic segmentation networks to guide self-supervised representation learning via pixel-adaptive convolutions.
In addition, we introduce a new dataset designed specifically for autonomous-driving scenarios in areas with dense pedestrian populations: the Stanford-TRI Intent Prediction (STIP) dataset.
Detecting and matching robust viewpoint-invariant keypoints is critical for visual SLAM and Structure-from-Motion.
Panoptic segmentation is a complex full scene parsing task requiring simultaneous instance and semantic segmentation at high resolution.
We present an automatic annotation pipeline to recover 9D cuboids and 3D shapes from pre-trained off-the-shelf 2D detectors and sparse LIDAR data.
In contrast to the previous work that aims to solve either the task of pose prediction or trajectory forecasting in isolation, we propose a framework to unify the two problems and address the practically useful task of pedestrian locomotion prediction in the wild.
With this model we generate a diverse, realistic, and physically plausible dataset of human action videos, called PHAV for "Procedural Human Action Videos".
Learning depth and camera ego-motion from raw unlabeled RGB video streams is seeing exciting progress through self-supervision from strong geometric cues.
Dense depth estimation from a single image is a key problem in computer vision, with exciting applications in a multitude of robotic tasks.
Deep learning algorithms can fare poorly when the training dataset suffers from heavy class-imbalance but the testing criterion requires good generalization on less frequent classes.
Vehicle taillight recognition is an important application for automated driving, especially for intent prediction of ado vehicles and trajectory planning of the ego vehicle.
Although cameras are ubiquitous, robotic platforms typically rely on active sensors like LiDAR for direct 3D perception.
Driving requires reacting to a wide variety of complex environment conditions and agent behaviors.
We present a deep learning method for end-to-end monocular 3D object detection and metric shape retrieval.
We propose an end-to-end learning approach for panoptic segmentation, a novel task unifying instance (things) and semantic (stuff) segmentation.
Deep Learning for Computer Vision depends mainly on the source of supervision. Photo-realistic simulators can generate large-scale, automatically labeled synthetic data, but introduce a domain gap that negatively impacts performance.
Both contributions provide significant performance gains over the state-of-the-art in self-supervised depth and pose estimation on the public KITTI benchmark.
Deep learning for human action recognition in videos is making significant progress, but is slowed down by its dependency on expensive manual labeling of large video collections.
Action recognition in videos is a challenging task due to the complexity of the spatio-temporal patterns to model and the difficulty of acquiring and learning from large quantities of video data.
We provide quantitative experimental evidence suggesting that (i) modern deep learning algorithms pre-trained on real data behave similarly in real and virtual worlds, and (ii) pre-training on virtual data improves performance.
We quantitatively measure the benefit of our domain adaptation strategy on the KITTI tracking benchmark and on a new dataset (PASCAL-to-KITTI) we introduce to study the domain mismatch problem in MOT.
Convolutional Networks (ConvNets) have recently improved image recognition performance thanks to end-to-end learning of deep feed-forward models from raw pixels.
Stochastic Gradient Descent (SGD) is one of the most widely used techniques for online optimization in machine learning.
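For reference, the SGD update the sentence refers to is simply w <- w - lr * grad(loss_i, w), applied one sampled example at a time. A toy 1-D least-squares run (learning rate and data invented for illustration):

```python
# Minimal SGD sketch: fit the slope of y = 2x from one sample at a time.
import random

random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 11)]   # samples of y = 2x
w, lr = 0.0, 0.001

for _ in range(200):                # one gradient step per sampled example
    x, y = random.choice(data)
    grad = 2.0 * (w * x - y) * x    # d/dw of the squared error (w*x - y)^2
    w -= lr * grad

print(w)  # close to the true slope 2.0
```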
In this paper, we address the problem of self-learning detectors in an autonomous manner, i.e. (i) detectors continuously updating themselves to efficiently adapt to streaming data sources (contrary to transductive algorithms), (ii) without any labeled data strongly related to the target data stream (contrary to self-paced learning), and (iii) without manual intervention to set and update hyper-parameters.