Concurrently, recent breakthroughs in visual representation learning have sparked a paradigm shift leading to the advent of large foundation models that can be trained with completely unlabeled images.
no code implementations • 10 Aug 2023 • D. Adriana Gómez-Rosal, Max Bergau, Georg K. J. Fischer, Andreas Wachaja, Johannes Gräter, Matthias Odenweller, Uwe Piechottka, Fabian Hoeflinger, Nikhil Gosala, Niklas Wetzel, Daniel Büscher, Abhinav Valada, Wolfram Burgard
In today's chemical plants, human field operators perform frequent integrity checks to guarantee high safety standards, and thus are possibly the first to encounter dangerous operating conditions.
A popular approach to robot localization is based on image-to-point cloud registration, which combines illumination-invariant LiDAR-based mapping with economical image-based localization.
We employ our method to learn challenging multi-object robot manipulation tasks from wrist camera observations and demonstrate superior utility for policy learning compared to other representation learning techniques.
By training everything end-to-end with the loss of the dynamics model, we force the latent mapper to learn an update rule for the latent map that is useful for the subsequent dynamics model.
Early stopping based on the validation set performance is a popular approach to find the right balance between under- and overfitting in the context of supervised learning.
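For reference, a minimal early-stopping loop tracks validation loss and halts training once it stops improving for a fixed number of epochs; the sketch below is this generic recipe, not the procedure studied in the paper (class name and parameters are hypothetical).

```python
# Minimal early-stopping sketch (illustrative only).
class EarlyStopping:
    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum change that counts as improvement
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Returns True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience
```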
Operating a robot in the open world requires a high level of robustness with respect to previously unseen environments.
While interacting in the world is a multi-sensory experience, many robots continue to predominantly rely on visual perception to map and navigate in their environments.
In this work, we propose EvCenterNet, a novel uncertainty-aware 2D object detection framework utilizing evidential learning to directly estimate both classification and regression uncertainties.
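For context, evidential classification commonly maps network outputs to Dirichlet evidence and derives a closed-form uncertainty from the total evidence; the snippet below is a generic sketch of that idea (assuming a PyTorch logits tensor), not EvCenterNet's actual prediction head.

```python
import torch
import torch.nn.functional as F

def dirichlet_uncertainty(logits):
    """Generic evidential-classification sketch: logits -> Dirichlet evidence.
    Returns expected class probabilities and a per-sample uncertainty in [0, 1]."""
    evidence = F.softplus(logits)             # non-negative evidence per class
    alpha = evidence + 1.0                    # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)
    probs = alpha / strength                  # expected class probabilities
    num_classes = logits.shape[-1]
    uncertainty = num_classes / strength      # vacuity: high when evidence is low
    return probs, uncertainty.squeeze(-1)
```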
To overcome these challenges, we propose a novel bottom-up approach to lane graph estimation from aerial imagery that aggregates multiple overlapping graphs into a single consistent graph.
Implicit supervision trains the model by enforcing spatial consistency of the scene over time based on FV semantic sequences, while explicit supervision exploits BEV pseudolabels generated from FV semantic annotations and self-supervised depth estimates.
Grounding language to the visual observations of a navigating agent can be performed using off-the-shelf visual-language models pretrained on Internet-scale data (e.g., image captions).
Current learning-based methods typically try to achieve maximum performance for this task, while neglecting a proper estimation of the associated uncertainties.
Recent works have shown that Large Language Models (LLMs) can be applied to ground natural language to a wide variety of robot skills.
Ranked #1 on Avg. sequence length on CALVIN
Concretely, we combine a low-level policy that learns latent skills via imitation learning and a high-level policy learned from offline reinforcement learning for skill-chaining the latent behavior priors.
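Schematically, such a hierarchy lets a high-level policy pick a latent skill at a coarse time scale while the low-level policy decodes it into actions; the sketch below illustrates that control flow only, assuming a gym-style environment interface (function names and horizons are hypothetical, not the paper's implementation).

```python
def run_episode(env, high_level_policy, low_level_policy, skill_horizon=16):
    """Generic hierarchical-control sketch: the high level selects latent
    skills; the low level executes each skill for a fixed horizon."""
    obs = env.reset()
    done = False
    while not done:
        latent_skill = high_level_policy(obs)             # e.g. offline-RL policy
        for _ in range(skill_horizon):
            action = low_level_policy(obs, latent_skill)  # e.g. imitation-learned
            obs, reward, done, info = env.step(action)
            if done:
                break
```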
To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera.
Robustly classifying ground infrastructure such as roads and street crossings is an essential task for mobile robots operating alongside pedestrians.
In this paper we propose USegScene, a framework for semantically guided unsupervised learning of depth, optical flow and ego-motion estimation for stereo camera images using convolutional neural networks.
In this work, we introduce the novel task of uncertainty-aware panoptic segmentation, which aims to predict per-pixel semantic and instance segmentations, together with per-pixel uncertainty estimates.
We have open-sourced our implementation to facilitate future research in learning to perform many complex manipulation skills in sequence, as specified with natural language.
Robots operating in human-centered environments should have the ability to understand how objects function: what can be done with each object, where this interaction may occur, and how the object is used to achieve a goal.
In extensive experiments carried out with a real-world dataset, we demonstrate that our approach provides accurate detections of moving vehicles and does not require manual annotations.
We show that a baseline model based on multi-context imitation learning performs poorly on CALVIN, suggesting that this benchmark leaves significant room for developing innovative agents that learn to relate human language to their world models.
A core challenge for an autonomous agent acting in the real world is to adapt its repertoire of skills to cope with its noisy perception and dynamics.
Robust localization in dense urban scenarios using a low-cost sensor setup and sparse HD maps is highly relevant for the current advances in autonomous driving, but remains a challenging topic in research.
Our reinforcement learning agent learns a policy for a centralized controller to let connected autonomous vehicles at unsignalized intersections give up their right of way and yield to other vehicles to optimize traffic flow.
Lane-level scene annotations provide invaluable data in autonomous vehicles for trajectory planning in complex environments such as urban areas and cities.
Ranked #2 on Lane Detection on nuScenes
Visual domain randomization in simulated environments is a widely used method to transfer policies trained in simulation to real robots.
Estimating scene geometry from data obtained with cost-effective sensors is key for robots and self-driving cars.
In this work, we introduce an end-to-end trainable approach for joint object detection and tracking that is capable of such reasoning.
Panoptic segmentation of point clouds is a crucial task that enables autonomous vehicles to comprehend their vicinity using their highly accurate and reliable LiDAR sensors.
Controlling robots to perform tasks via natural language is one of the most challenging topics in human-robot interaction.
The advances in computer processor technology have enabled the application of nonlinear model predictive control (NMPC) to agile systems, such as quadrotors.
Robotics • Systems and Control • Optimization and Control
Deep neural networks (DNNs) are usually over-parameterized to increase the likelihood of getting adequate initial weights by random initialization.
In this work, we propose a behavioral cloning approach that can safely leverage imperfect perception without being conservative.
Self-supervised learning has emerged as a powerful tool for depth and ego-motion estimation, leading to state-of-the-art results on benchmark datasets.
In autonomous driving, accurately estimating the state of surrounding obstacles is critical for safe and robust path planning.
A key challenge for an agent learning to interact with the world is to reason about physical properties of objects and to foresee their dynamics under the effect of applied forces.
To further improve the architecture, we introduce a weighting function that re-balances the classes, increasing the network's attention to under-represented objects.
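One common way to realize such a weighting is inverse-frequency class weights plugged into the cross-entropy loss; the sketch below shows that generic recipe with a log-damped inverse frequency (the exact weight function used in the paper may differ, and the pixel counts are hypothetical).

```python
import numpy as np
import torch
import torch.nn as nn

def inverse_frequency_weights(class_pixel_counts):
    """Generic class re-balancing sketch: rarer classes get larger weights."""
    counts = np.asarray(class_pixel_counts, dtype=np.float64)
    freqs = counts / counts.sum()
    weights = 1.0 / np.log(1.02 + freqs)      # log-damped inverse frequency
    return torch.tensor(weights, dtype=torch.float32)

# Usage: pass the weights to a standard segmentation loss.
weights = inverse_frequency_weights([9e6, 5e5, 2e4])   # hypothetical counts
criterion = nn.CrossEntropyLoss(weight=weights)
```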
Imitation learning is a powerful family of techniques for learning sensorimotor coordination in immersive environments.
To achieve this, we fine-tune an existing DeepMask network for instance segmentation on the self-labeled training data acquired by the robot.
In this paper, we introduce a novel perception task denoted as multi-object panoptic tracking (MOPT), which unifies the conventionally disjoint tasks of semantic segmentation, instance segmentation, and multi-object tracking.
We avoid the expensive annotation of nighttime images by leveraging an existing daytime RGB dataset and propose a teacher-student training approach that transfers the dataset's knowledge to the nighttime domain.
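As a rough illustration of such teacher-student transfer, a frozen daytime-trained teacher can supervise a student on corresponding nighttime images via a distillation loss; the sketch below is a generic version that assumes paired or style-translated day/night inputs, not the paper's exact training scheme.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic teacher-student sketch: the student matches the (frozen)
    teacher's softened class distribution on corresponding images."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits.detach() / t, dim=1)
    student_logp = F.log_softmax(student_logits / t, dim=1)
    # KL divergence between teacher and student distributions, scaled by t^2.
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)
```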
In this paper we present an approach to learning policies for signal controllers using deep reinforcement learning aiming for optimized traffic flow.
We propose SYMOG (symmetric mixture of Gaussian modes), which significantly decreases the complexity of DNNs through low-bit fixed-point quantization.
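For intuition, low-bit fixed-point quantization maps each weight to the nearest value on a uniform grid determined by the bit width; the snippet below shows only that baseline operation and is not an implementation of SYMOG's symmetric Gaussian-mixture formulation.

```python
import torch

def fixed_point_quantize(weights, num_bits=4):
    """Baseline symmetric fixed-point quantization sketch (not SYMOG itself):
    rounds weights to a uniform grid with 2**num_bits levels."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 7 for 4-bit signed values
    scale = weights.abs().max() / qmax        # per-tensor scale factor
    quantized = torch.clamp(torch.round(weights / scale), -qmax, qmax)
    return quantized * scale                  # dequantized (simulated) weights
```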
One particular requirement for such robots is that they can understand spatial relations and place objects in accordance with the relations expressed by their user.
In this work, we propose a novel terrain classification framework leveraging an unsupervised proprioceptive classifier that learns from vehicle-terrain interaction sounds to self-supervise an exteroceptive classifier for pixel-wise semantic segmentation of images.
Whether it is object detection, model reconstruction, laser odometry, or point cloud registration: Plane extraction is a vital component of many robotic systems.
However, many common lidar models perform poorly in unstructured, unpredictable environments, lack a consistent physical model for both mapping and localization, and do not exploit all the information the sensor provides, e.g., out-of-range measurements.
A popular class of lidar-based grid mapping algorithms computes for each map cell the probability that it reflects an incident laser beam.
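In the simplest form of this class of algorithms, a cell's reflection probability is estimated from how many beams ended in the cell versus how many passed through it; the sketch below illustrates that basic ratio (a simplification of the models discussed in the paper; the smoothing prior is an added assumption).

```python
import numpy as np

def reflection_map(hits, misses, prior=1.0):
    """Basic reflection-probability sketch: per-cell estimate of the
    probability that an incident beam is reflected. `hits` counts beams
    ending in the cell, `misses` counts beams traversing it."""
    hits = np.asarray(hits, dtype=np.float64)
    misses = np.asarray(misses, dtype=np.float64)
    # Laplace-style smoothing via `prior` avoids division by zero.
    return (hits + prior) / (hits + misses + 2.0 * prior)
```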
Due to their ubiquity and long-term stability, pole-like objects are well suited to serve as landmarks for vehicle localization in urban environments.
Most robot mapping techniques for lidar sensors tessellate the environment into pixels or voxels and assume uniformity of the environment within them.
Our method learns a general skill embedding independently from the task context by using an adversarial loss.
We propose Adaptive Curriculum Generation from Demonstrations (ACGD) for reinforcement learning in the presence of sparse rewards.
We present a convolutional neural network for joint 3D shape prediction and viewpoint estimation from a single input image.
Due to their high computational complexity, deep neural networks are still limited to powerful processing units.
In this paper we present CMRNet, a realtime approach based on a Convolutional Neural Network to localize an RGB image of a scene in a map built from LiDAR data.
This problem is extremely challenging as pre-existing maps cannot be leveraged for navigation due to structural changes that may have occurred.
Many state-of-the-art methods use intrinsic motivation to complement the sparse extrinsic reward signal, giving the agent more opportunities to receive feedback during exploration.
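In its most common form, the intrinsic bonus is a scaled novelty or prediction-error term added to the sparse extrinsic reward; the sketch below shows a generic curiosity-style bonus of that kind (the forward model and feature encodings are assumed inputs, and this is not a specific method from the paper).

```python
import torch
import torch.nn.functional as F

def curiosity_bonus(forward_model, phi_s, action, phi_s_next, beta=0.01):
    """Generic intrinsic-motivation sketch: the bonus is the forward model's
    prediction error in a learned feature space; it is added to the sparse
    extrinsic reward, i.e. total_reward = extrinsic + bonus."""
    predicted = forward_model(torch.cat([phi_s, action], dim=-1))
    bonus = 0.5 * F.mse_loss(predicted, phi_s_next, reduction="none").sum(dim=-1)
    return beta * bonus.detach()
```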
Indoor localization is one of the crucial enablers for deployment of service robots.
Our proposed architecture consists of a Siamese network for learning a feature descriptor and a metric learning network for matching the descriptors.
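As a rough sketch of such a two-stage design, a shared-weight encoder produces a descriptor per input and a small matching head scores descriptor pairs; the module below is a generic, simplified illustration (layer sizes and names are hypothetical, not the paper's architecture).

```python
import torch
import torch.nn as nn

class SiameseMatcher(nn.Module):
    """Generic sketch: a shared encoder (Siamese branch) produces descriptors,
    and a metric-learning head predicts the match score for a pair."""
    def __init__(self, in_dim=256, desc_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, desc_dim))
        self.matcher = nn.Sequential(nn.Linear(2 * desc_dim, 64), nn.ReLU(),
                                     nn.Linear(64, 1))

    def forward(self, x_a, x_b):
        d_a, d_b = self.encoder(x_a), self.encoder(x_b)   # shared weights
        return torch.sigmoid(self.matcher(torch.cat([d_a, d_b], dim=-1)))
```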
Learned representations from the traffic light recognition stream are fused with the estimated trajectories from the motion prediction stream to learn the crossing decision.
To address this limitation, we propose a multimodal semantic segmentation framework that dynamically adapts the fusion of modality-specific features while being sensitive to the object category, spatial location and scene context in a self-supervised manner.
Ranked #1 on Semantic Segmentation on Freiburg Forest
Deep learning techniques have revolutionized the field of machine learning and were recently successfully applied to various classification problems in noninvasive electroencephalography (EEG).
Semantic understanding and localization are fundamental enablers of robot autonomy that have for the most part been tackled as disjoint problems.
no code implementations • 18 Apr 2018 • Niko Sünderhauf, Oliver Brock, Walter Scheirer, Raia Hadsell, Dieter Fox, Jürgen Leitner, Ben Upcroft, Pieter Abbeel, Wolfram Burgard, Michael Milford, Peter Corke
In this paper we discuss a number of robotics-specific learning, reasoning, and embodiment challenges for deep learning.
A video of our experimental results can be found at https://goo.gl/pWbpcF.
Terrain classification is a critical component of any autonomous mobile robot system operating in unknown real-world environments.
We evaluate our proposed VLocNet on indoor as well as outdoor datasets and show that even our single-task model exceeds the performance of state-of-the-art deep architectures for global localization, while achieving competitive performance for visual odometry estimation.
We propose an approach to estimate 3D human pose in real world units from a single RGBD image and show that it exceeds performance of monocular 3D pose estimation approaches from color as well as pose estimation exclusively from depth.
Ranked #14 on 3D Human Pose Estimation on Total Capture
In this paper, we deal with the reality gap from a novel perspective, targeting transferring Deep Reinforcement Learning (DRL) policies learned in simulated environments to the real-world domain for visual control tasks.
Analysis of brain signals from a human interacting with a robot may help identify robot errors, but the accuracy of such analyses still leaves substantial room for improvement.
We recorded high-density EEG in a flanker task experiment (31 subjects) and an online BCI control paradigm (4 subjects).
Experiments show that our GAIL-based approach greatly improves the safety and efficiency of mobile robot behavior over pure behavior cloning.
Agricultural robots are expected to increase yields in a sustainable way and automate precision tasks, such as weeding and plant monitoring.
Our findings suggest that non-invasive recordings of brain responses elicited when observing robots indeed contain decodable information about the correctness of the robot's action and the type of observed robot.
In this paper, we propose a depth-based perception pipeline that estimates the position and velocity of people in the environment and categorizes them according to the mobility aids they use: pedestrian, person in a wheelchair, person in a wheelchair with a person pushing them, person with crutches, and person using a walker.
no code implementations • 20 Jul 2017 • Felix Burget, Lukas Dominique Josef Fiederer, Daniel Kuhner, Martin Völker, Johannes Aldinger, Robin Tibor Schirrmeister, Chau Do, Joschka Boedecker, Bernhard Nebel, Tonio Ball, Wolfram Burgard
As our results demonstrate, our system is capable of adapting to frequent changes in the environment and reliably completing given tasks within a reasonable amount of time.
Object detection is an essential task for autonomous robots operating in dynamic and changing environments.
To operate intelligently in domestic environments, robots require the ability to understand arbitrary spatial relations between objects and to generalize them to objects of varying sizes and shapes.
We present an approach for agents to learn representations of a global map from sensor data, to aid their exploration in new environments.
Compared to LiDAR-based localization methods, which provide high accuracy but rely on expensive sensors, visual localization approaches only require a camera and thus are more cost-effective, while their accuracy and reliability are typically inferior to those of LiDAR-based methods.
To learn the distinction between movable and non-movable points in the environment, we introduce an approach based on a deep neural network, and to detect the dynamic points, we estimate pointwise motion.
5 code implementations • 15 Mar 2017 • Robin Tibor Schirrmeister, Jost Tobias Springenberg, Lukas Dominique Josef Fiederer, Martin Glasstetter, Katharina Eggensperger, Michael Tangermann, Frank Hutter, Wolfram Burgard, Tonio Ball
PLEASE READ AND CITE THE REVISED VERSION at Human Brain Mapping: http://onlinelibrary.wiley.com/doi/10.1002/hbm.23730/full. Code available here: https://github.com/robintibor/braindecode
We carry out our discussions on the two main paradigms for learning control with deep networks: deep reinforcement learning and imitation learning.
We propose a successor feature based deep reinforcement learning algorithm that can learn to transfer knowledge from previously mastered navigation tasks to new problem instances.
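The key property behind successor features is that action values factor into task-independent successor features and a task-specific reward weight vector, Q(s, a) = psi(s, a) · w, so only w has to be re-learned for a new navigation task; the snippet below is a minimal numeric illustration of that factorization with made-up numbers, not the paper's algorithm.

```python
import numpy as np

# Minimal successor-feature illustration (hypothetical numbers).
# psi(s, a): expected discounted sum of state features under the policy.
psi_sa = np.array([0.8, 1.5, 0.2])    # successor features for one (s, a) pair

# Transferring to a new task only requires a new reward weight vector w,
# because r(s) = phi(s) . w implies Q(s, a) = psi(s, a) . w.
w_task_A = np.array([1.0, 0.0, -0.5])
w_task_B = np.array([0.0, 2.0, 0.0])

q_A = psi_sa @ w_task_A   # 0.8*1.0 + 1.5*0.0 + 0.2*(-0.5) = 0.7
q_B = psi_sa @ w_task_B   # 1.5*2.0 = 3.0
```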
In this paper, we address this issue and present a dataset consisting of 5,000 images covering 25 different classes of groceries, with at least 97 images per class.
Inverse Reinforcement Learning (IRL) describes the problem of learning an unknown reward function of a Markov Decision Process (MDP) from observed behavior of an agent.
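A classic instantiation of this problem is feature-expectation matching: the reward is assumed linear in state features, r(s) = w · phi(s), and w is adjusted until the learner's expected feature counts match the expert's; the sketch below shows one such gradient step (a simplified textbook variant, not a specific contribution of the paper).

```python
import numpy as np

def irl_feature_matching_step(w, expert_features, learner_features, lr=0.1):
    """Simplified IRL sketch: with a linear reward r(s) = w . phi(s), move the
    reward weights toward the expert's feature expectations and away from the
    current learner policy's feature expectations."""
    gradient = np.asarray(expert_features) - np.asarray(learner_features)
    return w + lr * gradient
```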
Robust object recognition is a crucial ingredient of many, if not all, real-world robotics applications.
In this paper, we address the localization problem when the map of the environment is not present beforehand, and the robot relies on a hand-drawn map from a non-expert user.
In this paper we present a novel approach to global localization using an RGB-D camera in maps of visual features.