1) We propose a non-parametric prior distribution over the appearance of image parts so that the latent variable ``what-to-draw'' per step becomes a categorical random variable.
As data collection is often significantly cheaper than labeling in this domain, the decision of which subset of examples to label can have a profound impact on model performance.
Specifically, at each iteration, the neural network takes the feedback as input and outputs an update on the current estimation.
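This feedback loop can be sketched as a generic iterative-refinement routine. The names `feedback_fn` and `update_net` are hypothetical, and the toy instance below stands in for a learned network; this is a minimal illustration of the pattern, not the paper's architecture:

```python
import numpy as np

def refine(estimate, feedback_fn, update_net, n_iters=5):
    """Iterative refinement: at each step, map the current feedback
    to an additive update of the running estimate."""
    for _ in range(n_iters):
        feedback = feedback_fn(estimate)          # e.g. residual vs. observations
        estimate = estimate + update_net(feedback)
    return estimate

# Toy instance: feedback is the residual to a target and the "network"
# is a fixed damped step, so the estimate converges geometrically.
target = np.array([1.0, 2.0])
est = refine(np.zeros(2),
             feedback_fn=lambda x: target - x,
             update_net=lambda f: 0.5 * f)
```

With step size 0.5, the residual halves every iteration, so five iterations leave the estimate within a few percent of the target.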
In this paper, we introduce a non-parametric memory representation for spatio-temporal segmentation that captures the local space and time around an autonomous vehicle (AV).
An intelligent agent operating in the real-world must balance achieving its goal with maintaining the safety and comfort of not only itself, but also other participants within the surrounding scene.
Reconstructing high-quality 3D objects from sparse, partial observations from a single view is of crucial importance for various applications in computer vision, robotics, and graphics.
In this paper, we propose a neural motion planner (NMP) for learning to drive autonomously in complex urban scenarios that include traffic-light handling, yielding, and interactions with multiple road-users.
Over the last few years, we have witnessed tremendous progress on many subtasks of autonomous driving, including perception, motion forecasting, and motion planning.
Standard convolutional neural networks assume a grid structured input is available and exploit discrete convolutions as their fundamental building blocks.
Ranked #8 on Semantic Segmentation on S3DIS Area5 (mAcc metric)
Yet, there have been limited studies on the adversarial robustness of multi-modal models that fuse LiDAR features with image features.
We show TrafficSim generates significantly more realistic and diverse traffic scenarios as compared to a diverse set of baselines.
Growing at a fast pace, modern autonomous systems will soon be deployed at scale, opening up the possibility for cooperative multi-agent systems.
Constructing and animating humans is an important component for building virtual worlds in a wide variety of applications such as virtual reality or robotics testing in simulation.
The key idea is to decompose the 4D object label into two parts: the object's 3D size, which is fixed over time for rigid objects, and the motion path describing the evolution of the object's pose through time.
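This decomposition can be captured in a small data structure: one size shared across all frames, plus a per-frame pose. The class and field names below are hypothetical, and the bird's-eye-view pose parameterization `(x, y, theta)` is an assumption; this is a minimal sketch, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Pose:
    x: float
    y: float
    theta: float  # heading in radians

@dataclass
class Track4D:
    """A 4D label: one rigid 3D size, plus a pose per timestep."""
    size: Tuple[float, float, float]  # (length, width, height), fixed over time
    path: List[Pose]                  # the object's pose at each timestep

    def box_at(self, t: int):
        """Recover the full 3D box at timestep t from size + pose."""
        p = self.path[t]
        return (p.x, p.y, p.theta, *self.size)

track = Track4D(size=(4.5, 1.8, 1.5),
                path=[Pose(0.0, 0.0, 0.0), Pose(1.0, 0.1, 0.02)])
```

Storing the size once per track, rather than once per frame, is what enforces the rigidity constraint by construction.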
Recent work on hyperparameter optimization (HPO) has shown the possibility of training certain hyperparameters together with regular parameters.
In this paper we propose a model that unifies these two tasks and performs them in the same metric space.
In this paper, we propose LaneRCNN, a graph-centric motion forecasting model.
Ranked #77 on Motion Forecasting on Argoverse CVPR 2020
Towards this goal, in this paper we propose a bottom-up approach where, given a single click for each object in a video, we obtain the segmentation masks of these objects in the full video.
Importantly, by simulating directly from sensor data, we obtain adversarial scenarios that are safety-critical for the full autonomy stack.
Motivated by this ability, we present a new self-supervised learning representation framework that can be directly deployed on a video stream of complex scenes with many moving objects.
Our experiments on a wide range of tasks and models show that the proposed curation pipeline is able to select datasets that lead to better generalization and higher performance.
Existing methods typically insert actors into the scene according to a set of hand-crafted heuristics and are limited in their ability to model the true complexity and diversity of real traffic scenes, thus inducing a content gap between synthesized traffic scenes versus real ones.
Scalable sensor simulation is an important yet challenging open problem for safety-critical domains such as self-driving.
In this paper, we present LookOut, a novel autonomy system that perceives the environment, predicts a diverse set of futures of how the scene might unroll and estimates the trajectory of the SDV by optimizing a set of contingency plans over these future realizations.
On two large-scale real-world datasets, nuScenes and ATG4D, we showcase that our scene-occupancy predictions are more accurate and better calibrated than those from state-of-the-art motion forecasting methods, while also matching their performance in pedestrian motion forecasting metrics.
We are interested in understanding whether retrieval-based localization approaches are good enough in the context of self-driving vehicles.
One of the fundamental challenges in scaling self-driving is being able to create accurate high definition maps (HD maps) at low cost.
In this paper we propose a novel deep neural network that is able to jointly reason about 3D detection, tracking and motion forecasting given data captured by a 3D sensor.
In this paper we propose to exploit multiple related tasks for accurate multi-sensor 3D object detection.
Ranked #12 on 3D Object Detection on KITTI Cars Easy
Creating high definition maps that contain precise information about the static elements of the scene is of utmost importance for enabling self-driving cars to drive safely.
In this paper we show that High-Definition (HD) maps provide strong priors that can boost the performance and robustness of modern 3D object detectors.
In this paper we propose a real-time, calibration-agnostic and effective localization system for self-driving cars.
One of the main difficulties of scaling current localization systems to large environments is the on-board storage required for the maps.
In this paper, we propose a novel 3D object detector that can exploit both LIDAR as well as cameras to perform very accurate localization.
In this paper, we derive generalization bounds for the two primary classes of graph neural networks (GNNs), namely graph convolutional networks (GCNs) and message passing GNNs (MPGNNs), via a PAC-Bayesian approach.
Note that GeoNet++ is generic and can be used in other depth/normal prediction frameworks to improve the quality of 3D reconstruction and pixel-wise accuracy of depth and surface normals.
We then incorporate the reconstructed pedestrian assets bank in a realistic LiDAR simulation system by performing motion retargeting, and show that the simulated LiDAR data can be used to significantly reduce the amount of annotated real-world data required for visual perception tasks.
Our model exploits spatio-temporal relationships across multiple LiDAR sweeps to reduce the bitrate of both geometry and intensity values.
In this paper, we tackle the problem of spatio-temporal tagging of self-driving scenes from raw sensor data.
In this paper we propose StrObe, a novel approach that minimizes latency by ingesting LiDAR packets and emitting a stream of detections without waiting for the full sweep to be built.
Learned communication makes multi-agent systems more effective by aggregating distributed information.
In this paper, we propose an end-to-end self-driving network featuring a sparse attention module that learns to automatically attend to important regions of the input.
Compressing large neural networks is an important step for their deployment in resource-constrained computational platforms.
In this paper, we present LiRaNet, a novel end-to-end trajectory prediction method which utilizes radar sensor information along with widely used lidar and high definition (HD) maps.
3D shape completion for real data is important but challenging, since partial point clouds acquired by real-world sensors are usually sparse, noisy and unaligned.
We propose a very simple and efficient video compression framework that only focuses on modeling the conditional entropy between frames.
In this paper, we explore the use of vehicle-to-vehicle (V2V) communication to improve the perception and motion forecasting performance of self-driving vehicles.
In this paper, we propose the Deep Structured self-Driving Network (DSDNet), which performs object detection, motion prediction, and motion planning with a single neural network.
We present a novel method for testing the safety of self-driving vehicles in simulation.
In this paper we propose a novel end-to-end learnable network that performs joint perception, prediction and motion planning for self-driving vehicles and produces interpretable intermediate representations.
We show that our approach can outperform the state-of-the-art on both datasets.
Deep neural nets typically perform end-to-end backpropagation to learn the weights, a procedure that creates synchronization constraints in the weight update step across layers and is not biologically plausible.
Obtaining precise instance segmentation masks is of high importance in many modern applications such as robotic manipulation and autonomous driving.
We tackle the problem of exploiting Radar for perception in the context of self-driving as Radar provides complementary information to other sensors such as LiDAR or cameras in the form of Doppler velocity.
We propose a motion forecasting model that exploits a novel structured map representation as well as actor-map interactions.
We show that, under certain conditions on the algorithm parameters, LayerCert provably reduces the number and size of the convex programs that one needs to solve compared to GeoCert.
In order to plan a safe maneuver an autonomous vehicle must accurately perceive its environment, and understand the interactions among traffic participants.
We first utilize ray casting over the 3D scene and then use a deep neural network to produce deviations from the physics-based simulation, producing realistic LiDAR point clouds.
Towards this goal, we design a framework that leverages REINFORCE to incorporate non-differentiable priors over sample trajectories from a probabilistic model, thus optimizing the whole distribution.
We tackle the problem of joint perception and motion forecasting in the context of self-driving vehicles.
Our shape-aware adversarial attacks are orthogonal to existing point cloud based attacks and shed light on the vulnerability of 3D deep neural networks.
Modern autonomous driving systems rely heavily on deep learning models to process point cloud sensory data; meanwhile, deep models have been shown to be susceptible to adversarial attacks with visually imperceptible perturbations.
We present a new object representation, called Dense RepPoints, that utilizes a large set of points to describe an object at multiple levels, including both box level and pixel level.
In this paper, we propose PolyTransform, a novel instance segmentation algorithm that produces precise, geometry-preserving masks by combining the strengths of prevailing segmentation approaches and modern polygon-based methods.
Ranked #1 on Instance Segmentation on Cityscapes test (using extra training data)
In the past few years, we have seen great progress in perception algorithms, particularly through the use of deep learning.
A graph neural network then iteratively updates the actor states via a message passing process.
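The message-passing update can be sketched as follows: each actor node aggregates transformed messages from its neighbours, then updates its own state. The weight matrices, `tanh` nonlinearity, and sum aggregation are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def message_passing(states, edges, W_msg, W_upd, n_rounds=3):
    """GNN-style updates: each node sums messages from its in-neighbours,
    then combines them with its own transformed state."""
    for _ in range(n_rounds):
        msgs = np.zeros_like(states)
        for i, j in edges:                        # directed edge j -> i
            msgs[i] += np.tanh(states[j] @ W_msg)
        states = np.tanh(states @ W_upd + msgs)   # per-node state update
    return states

rng = np.random.default_rng(0)
states = rng.normal(size=(4, 8))                  # 4 actors, 8-dim states
edges = [(0, 1), (1, 0), (2, 3), (3, 2)]          # two interacting pairs
W_msg = 0.1 * rng.normal(size=(8, 8))
W_upd = 0.1 * rng.normal(size=(8, 8))
out = message_passing(states, edges, W_msg, W_upd)
```

After the rounds, each actor's state reflects information from the actors it exchanges messages with, which is the mechanism that lets the model reason about interactions.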
Particularly difficult is the prediction of human behavior.
Recent studies on catastrophic forgetting during sequential learning typically focus on fixing the accuracy of the predictions for a previously learned task.
The motion planners used in self-driving vehicles need to generate trajectories that are safe, comfortable, and obey the traffic rules.
Our model generates graphs one block of nodes and associated edges at a time.
In practice, it performs similarly to the Hungarian algorithm during inference.
Our goal is to significantly speed up the runtime of current state-of-the-art stereo algorithms to enable real-time inference.
In this paper we tackle the problem of stereo image compression, and leverage the fact that the two images have overlapping fields of view to further compress the representations.
8 Aug 2019 • Wei-Chiu Ma, Ignacio Tartavull, Ioan Andrei Bârsan, Shenlong Wang, Min Bai, Gellert Mattyus, Namdar Homayounfar, Shrinidhi Kowshika Lakshmikanth, Andrei Pokrovsky, Raquel Urtasun
In this paper we propose a novel semantic localization algorithm that exploits multiple sensors and has precision on the order of a few centimeters.
Reliable and accurate lane detection has been a long-standing problem in the field of autonomous driving.
Most deep learning models rely on expressive high-dimensional representations to achieve good performance on tasks such as classification.
More importantly, we introduce a parameter-free panoptic head which solves panoptic segmentation via pixel-wise classification.
Ranked #3 on Panoptic Segmentation on KITTI Panoptic Segmentation
We propose the Lanczos network (LanczosNet), which uses the Lanczos algorithm to construct low rank approximations of the graph Laplacian for graph convolution.
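The core numerical tool here is the Lanczos iteration, which reduces a symmetric matrix such as the graph Laplacian to a small tridiagonal matrix through an orthonormal basis. Below is a textbook k-step Lanczos sketch on a toy path-graph Laplacian, shown only to illustrate the low-rank machinery, not LanczosNet itself:

```python
import numpy as np

def lanczos(L, k, seed=0):
    """k-step Lanczos: build an orthonormal basis Q and a tridiagonal T
    with Q.T @ L @ Q = T, a rank-k view of the symmetric matrix L."""
    n = L.shape[0]
    rng = np.random.default_rng(seed)
    q = rng.normal(size=n)
    q /= np.linalg.norm(q)
    Q = np.zeros((n, k))
    q_prev, beta = np.zeros(n), 0.0
    alphas, betas = [], []
    for j in range(k):
        Q[:, j] = q
        w = L @ q - beta * q_prev       # three-term recurrence
        alpha = q @ w
        w -= alpha * q
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        betas.append(beta)
        q_prev, q = q, w / (beta + 1e-12)
    T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
    return Q, T

# Toy graph Laplacian of a 5-node path graph: L = D - A
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
L = np.diag(A.sum(axis=1)) - A
Q, T = lanczos(L, k=3)
```

The eigenvalues of the small tridiagonal `T` approximate extremal eigenvalues of `L`, which is what makes a low-rank graph convolution cheap.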
On the other hand, 3D convolution wastes a large amount of memory on mostly unoccupied 3D space, since the occupied portion consists of only the surface visible to the sensor.
Neural architecture search (NAS) automatically finds the best task-specific neural network topology, outperforming many manual architecture designs.
Synthesizing programs using example input/outputs is a classic problem in artificial intelligence.
At inference time, our model can be easily reduced to a single stream module that performs intrinsic decomposition on a single input image.
In this paper we propose a novel approach to tracking by detection that can exploit both cameras as well as LIDAR data to produce very accurate 3D trajectories.
Ranked #5 on 3D Multi-Object Tracking on KITTI
In this paper, we propose Geometric Neural Network (GeoNet) to jointly predict depth and surface normal maps from a single image.
We argue that the main difficulty of applying CGANs to supervised tasks is that the generator training consists of optimizing a loss function that does not depend directly on the ground truth labels.
Deep neural networks have been shown to be very powerful modeling tools for many supervised learning tasks involving complex input patterns.
Message-passing algorithms, such as belief propagation, are a natural way to disseminate evidence amongst correlated variables while exploiting the graph structure, but these algorithms can struggle when the conditional dependency graphs contain loops.
The world is covered with millions of buildings, and precisely knowing each instance's position and extents is vital to a multitude of applications.
We examine all RBP variants along with BPTT and TBPTT in three different application domains: associative memory with continuous Hopfield networks, document classification in citation networks using graph neural networks and hyperparameter optimization for fully connected networks.
We present graph partition neural networks (GPNN), an extension of graph neural networks (GNNs) able to handle extremely large graphs.
Conventional deep convolutional neural networks (CNNs) apply convolution operators uniformly in space across all feature maps for hundreds of layers - this incurs a high computational cost for real-time applications.
In the second stage, a generative model with a newly proposed compositional mapping layer is used to render the final image with precise regions and textures conditioned on this map.
By exploiting two-directional information, the second network groups horizontal and vertical lines into connected components.
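Grouping line segments into connected components is commonly done with union-find. The sketch below joins two segments whenever their pixel sets intersect; the representation of segments as pixel sets is an illustrative assumption, not the paper's second-network formulation:

```python
class UnionFind:
    """Minimal union-find with path halving."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def group_segments(segments):
    """segments: list of sets of (row, col) pixels; returns a component
    id per segment, where intersecting segments share an id."""
    uf = UnionFind(len(segments))
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if segments[i] & segments[j]:        # shared pixel -> same structure
                uf.union(i, j)
    roots = [uf.find(i) for i in range(len(segments))]
    ids = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [ids[r] for r in roots]

h = {(5, c) for c in range(10)}                  # horizontal line at row 5
v = {(r, 3) for r in range(10)}                  # vertical line crossing it
far = {(20, c) for c in range(5)}                # disconnected line
labels = group_segments([h, v, far])             # h and v merge; far stays alone
```

The crossing horizontal and vertical lines end up in one component, and the distant segment in another.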
Each node in the graph corresponds to a set of points and is associated with a hidden representation vector initialized with an appearance feature extracted by a unary CNN from 2D images.
We derive a closed-form expression for the gradient that is efficient to compute: the complexity to compute the gradient is linear in the size of the training mini-batch and quadratic in the representation dimensionality.
Deep residual networks (ResNets) have significantly pushed forward the state-of-the-art on image classification, increasing in performance as networks grow both deeper and wider.
Classic approaches alternate the optimization over the learned metric and the assignment of similar instances.
We show that our approach speeds up the annotation process by a factor of 4.7 across all classes in Cityscapes, while achieving 78.4% agreement in IoU with original ground-truth, matching the typical agreement between human annotators.
Despite the substantial progress in recent years, image captioning techniques are still far from perfect. Sentences produced by existing methods, e.g. those based on RNNs, are often overly rigid and lacking in variability.
While most approaches to semantic reasoning have focused on improving performance, in this paper we argue that computational times are very important in order to enable real time applications such as autonomous driving.
In this paper we introduce the TorontoCity benchmark, which covers the full greater Toronto area (GTA) with 712.5 $km^2$ of land, 8439 $km$ of road and around 400,000 buildings.
In this paper we aim at facilitating generalization for deep networks while supporting interpretability of the learned representations.
Most contemporary approaches to instance segmentation use complex pipelines involving conditional random fields, recurrent neural networks, object proposals, or template matching schemes.
Ranked #10 on Instance Segmentation on Cityscapes test
On the other hand, layer normalization normalizes the activations across all activities within a layer.
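Layer normalization is simple enough to state in a few lines: each sample is normalized across its own features, using one mean and standard deviation per row, in contrast to batch normalization's per-feature statistics across the batch. A minimal NumPy sketch (without the usual learned gain and bias):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row of x to zero mean and unit variance
    across its features (the last axis)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

x = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
y = layer_norm(x)
# Each row of y now has approximately zero mean and unit variance,
# regardless of the row's original scale.
```

Because the statistics are computed per sample, the operation behaves identically at batch size 1 and at training time, which is one of its practical advantages over batch normalization.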
Towards this goal, we introduce a simple mechanism that first reads the input sequence before committing to a representation of each word.
We then exploit a CNN on top of these proposals to perform object detection.
In this paper we present a robust, efficient and affordable approach to self-localization which requires neither GPS nor knowledge about the appearance of the world.
In this paper we present an approach to enhance existing maps with fine-grained segmentation categories such as parking spots and sidewalks, as well as the number and location of road lanes.
The focus of this paper is on proposal generation.
Ranked #8 on Vehicle Pose Estimation on KITTI Cars Hard
We tackle the problem of estimating optical flow from a monocular camera in the context of autonomous driving.
Our aim is to provide a pixel-wise instance-level labeling of a monocular image in the context of autonomous driving.
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text.
The goal of this paper is to generate high-quality 3D object proposals in the context of autonomous driving.
Ranked #10 on Vehicle Pose Estimation on KITTI Cars Hard
In this paper we propose a novel approach to localization in very large indoor spaces (i.e., shopping malls with 200+ stores) that takes a single image and a floor plan of the environment as input.
In recent years, contextual models that exploit maps have been shown to be very effective for many recognition and localization tasks.
Supervised training of deep neural nets typically relies on minimizing cross-entropy.
Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images.
Ranked #85 on Natural Language Inference on SNLI
A diverse set of CNNs is analyzed showing that compared to a conventional implementation using a 32-bit floating-point representation for all layers, and with less than 1% loss in relative accuracy, the data footprint required by these networks can be reduced by an average of 74% and up to 92%.
Books are a rich source of both fine-grained information, such as how a character, an object or a scene looks, and high-level semantics, such as what someone is thinking or feeling and how these states evolve through a story.
The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice.
Ranked #2 on Semantic Similarity on SICK
Despite the promising performance of conventional fully supervised algorithms, semantic segmentation has remained an important, yet challenging task.
What sets us apart from past work in layout estimation is the use of floor plans as a source of prior knowledge, as well as localization of each image within a larger space (the apartment).
In this paper we tackle the problem of instance-level segmentation and depth ordering from a single monocular image.
Convolutional neural networks with many layers have recently been shown to achieve excellent results on many high-level tasks such as image classification, object detection and more recently also semantic segmentation.
In this paper, we propose an approach that exploits object segmentation in order to improve the accuracy of object detection.
Importantly, our model is able to give rich feedback back to the user, conveying which garments or even scenery she/he should change in order to improve fashionability.
In this paper, we prove that every multivariate polynomial with even degree can be decomposed into a sum of convex and concave polynomials.
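As an illustrative instance of this claim (our example, not one from the paper), the even-degree polynomial $p(x) = x^4 - 3x^2 + x$ splits as:

```latex
p(x) \;=\; x^4 - 3x^2 + x
     \;=\; \underbrace{\left(x^4 + x\right)}_{\text{convex}}
     \;+\; \underbrace{\left(-3x^2\right)}_{\text{concave}}
```

Here $\frac{d^2}{dx^2}\left(x^4 + x\right) = 12x^2 \ge 0$, so the first part is convex, while $-3x^2$ is concave; the linear term, being both convex and concave, may be placed in either part.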
To keep up with the Big Data challenge, parallelized algorithms based on dual decomposition have been proposed to perform inference in Markov random fields.
One of the most popular approaches to multi-target tracking is tracking-by-detection.
Ranked #22 on Multiple Object Tracking on KITTI Tracking test
Towards this goal, we propose a training algorithm that is able to learn structured models jointly with deep features that form the MRF potentials.
Recent trends in image understanding have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning, and local appearance based classifiers.
Our model automatically decouples the holistic object or body parts from the model when they are hard to detect.
We tackle the problem of weakly labeled semantic segmentation, where the only source of annotation is image tags encoding which classes are present in the scene.
In this paper we study the role of context in existing state-of-the-art detection and segmentation approaches.
Labeling large-scale datasets with very accurate object segmentations is an elaborate task that requires a high degree of quality control and a budget of tens or hundreds of thousands of dollars.
In this paper we exploit natural sentential descriptions of RGB-D scenes in order to improve 3D semantic parsing.
When employing the parts, we outperform the original DPM in 19 out of 20 classes, achieving an improvement of 8% AP.
Recent trends in semantic image segmentation have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning.
In this paper we propose an affordable solution to self-localization, which utilizes visual odometry and road maps as the only inputs.
We demonstrate the effectiveness of our approach in indoor and outdoor scenarios, and show that our approach outperforms the state-of-the-art in both 2D [Felz09] and 3D [Hedau12] object detection.
While finding the exact solution for the MAP inference problem is intractable for many real-world tasks, MAP LP relaxations have been shown to be very effective in practice.
In this paper we derive an efficient algorithm to learn the parameters of structured predictors in general graphical models.
A common approach for handling the complexity and inherent ambiguities of 3D human pose estimation is to use pose priors learned from training data.
We then propose an intuitive approximation for structured prediction problems using Fenchel duality based on a local entropy approximation that computes the exact gradients of the approximated problem and is guaranteed to converge.