Moreover, PanopticNeRF-360 enables omnidirectional rendering of high-fidelity, multi-view and spatiotemporally consistent appearance, semantic and instance labels.
Prior work in 3D object detection evaluates models using offline metrics like average precision since closed-loop online evaluation on the downstream driving task is costly.
The autonomous driving community has witnessed a rapid growth in approaches that embrace an end-to-end algorithm framework, utilizing raw sensor input to generate vehicle motion plans, instead of concentrating on individual tasks such as detection and motion prediction.
The release of nuPlan marks a new era in vehicle motion planning research, offering the first large-scale real-world dataset and evaluation schemes requiring both precise short-term planning and long-horizon ego-forecasting.
In this paper, we propose a novel method for joint recovery of camera pose, object geometry and spatially-varying Bidirectional Reflectance Distribution Function (svBRDF) of 3D scenes that exceed object-scale and hence cannot be captured with stationary light stages.
The key to progress is hence to learn generative models of 3D avatars from abundant unstructured 2D image collections.
We design a lossless procedure for baking the parameterization used during training into a model that achieves real-time rendering while still preserving the photorealistic view synthesis quality of a volumetric radiance field.
Neural implicit representations have recently become popular in simultaneous localization and mapping (SLAM), especially in dense visual SLAM.
Our experiments show that DiF leads to improvements in approximation quality, compactness, and training time when compared to previous fast reconstruction methods.
Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models.
Ranked #18 on Text-to-Image Generation on COCO
We address the task of open-world class-agnostic object detection, i.e., detecting every object in an image by learning from a limited number of base object classes.
Ranked #1 on Open World Object Detection on COCO VOC to non-VOC
A key challenge in making such methods applicable to articulated objects, such as the human body, is to model the deformation of 3D locations between the rest pose (a canonical space) and the deformed space.
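A common way to model this deformation, used here only as an illustrative baseline rather than the paper's specific formulation, is linear blend skinning (LBS): each canonical point is transformed by a weighted combination of per-bone rigid transformations. The sketch below assumes NumPy and hypothetical variable names; note that the inverse (deformed-to-canonical) mapping has no such closed form, which is exactly what makes the backward warp difficult.

```python
import numpy as np

def lbs_forward(x_canonical, weights, rotations, translations):
    """Map rest-pose (canonical) points to the deformed (posed) space with linear blend skinning.

    x_canonical:  (N, 3) points in the canonical space
    weights:      (N, J) per-point skinning weights, each row summing to 1
    rotations:    (J, 3, 3) per-bone rotation matrices
    translations: (J, 3) per-bone translations
    """
    # Transform every point by every bone, then blend with the skinning weights.
    per_bone = np.einsum('jab,nb->nja', rotations, x_canonical) + translations  # (N, J, 3)
    return np.einsum('nj,nja->na', weights, per_bone)
```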
We present a unified formulation and model for three motion and 3D perception tasks: optical flow, rectified stereo matching and unrectified stereo depth estimation from posed images.
Ranked #1 on Optical Flow Estimation on Sintel-clean
Freely exploring a real-world 4D spatiotemporal space in VR has been a long-standing quest.
In this survey, we thoroughly review the ongoing developments of 3D generative models, including methods that employ 2D and 3D supervision.
Planning an optimal route in a complex environment requires efficient reasoning about the surrounding scene.
We demonstrate that our proposed pipeline can generate clothed avatars with high-quality pose-dependent geometry and appearance from a sparse set of multi-view RGB videos.
State-of-the-art 3D-aware generative models rely on coordinate-based MLPs to parameterize 3D radiance fields.
Motivated by recent advances in the area of monocular geometry prediction, we systematically explore the utility these cues provide for improving neural implicit surface reconstruction.
At the time of submission, TransFuser outperforms all prior work on the CARLA leaderboard in terms of driving score by a large margin.
Ranked #4 on Autonomous Driving on CARLA Leaderboard
Simulators offer the possibility of safe, low-cost development of self-driving systems.
In this work, we present a novel 3D-to-2D label transfer method, Panoptic NeRF, which aims to obtain per-pixel 2D semantic and instance labels from easy-to-obtain coarse 3D bounding primitives.
We demonstrate that applying traditional CP decomposition, which factorizes tensors into rank-one components with compact vectors, in our framework leads to improvements over vanilla NeRF.
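For intuition, the following minimal NumPy sketch (hypothetical, not the paper's implementation) factorizes a dense 3D grid into R rank-one components, each an outer product of three compact 1D vectors, so that individual cells can be queried without ever materializing the full grid:

```python
import numpy as np

def cp_reconstruct(v_x, v_y, v_z):
    """Reconstruct a dense (I, J, K) grid from R rank-one CP components.

    v_x: (R, I), v_y: (R, J), v_z: (R, K) compact 1D factor vectors.
    Returns sum_r v_x[r] (outer) v_y[r] (outer) v_z[r].
    """
    return np.einsum('ri,rj,rk->ijk', v_x, v_y, v_z)

def cp_query(v_x, v_y, v_z, idx):
    """Query a single grid cell (i, j, k) directly from the factors."""
    i, j, k = idx
    return np.sum(v_x[:, i] * v_y[:, j] * v_z[:, k])

# Toy usage: a 128^3 grid approximated with rank 16.
R, I, J, K = 16, 128, 128, 128
rng = np.random.default_rng(0)
v_x, v_y, v_z = rng.normal(size=(R, I)), rng.normal(size=(R, J)), rng.normal(size=(R, K))
dense = cp_reconstruct(v_x, v_y, v_z)
assert np.isclose(dense[3, 5, 7], cp_query(v_x, v_y, v_z, (3, 5, 7)))
print(dense.size, "dense entries vs", v_x.size + v_y.size + v_z.size, "factor parameters")
```

The compactness claim follows directly from the parameter counts printed above: three sets of 1D vectors replace a cubic number of grid entries.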
We present a novel method to learn Personalized Implicit Neural Avatars (PINA) from a short RGB-D sequence.
StyleGAN in particular sets new standards for generative modeling regarding image quality and controllability.
Ranked #1 on Image Generation on ImageNet 512x512
Furthermore, we show that our method can be used on the task of fitting human models to raw scans, outperforming the previous state-of-the-art.
We observe that the majority of artifacts in sparse input scenarios are caused by errors in the estimated scene geometry, and by divergent behavior at the start of training.
Generative Adversarial Networks (GANs) produce high-quality images but are challenging to train.
Ranked #1 on Image Generation on Stanford Dogs
The ability to synthesize realistic and diverse indoor furniture layouts, automatically or based on partial input, unlocks many applications, from better interactive 3D tools to data synthesis for training and simulation.
For the last few decades, several major subfields of artificial intelligence including computer vision, graphics, and robotics have progressed largely independently from each other.
Efficient reasoning about the semantic, spatial, and temporal structure of a scene is a crucial prerequisite for autonomous driving.
Ranked #10 on Autonomous Driving on CARLA Leaderboard
Deep learning has proven crucial for handling the increasingly complex tasks that autonomous driving entails, such as 3D detection and instance segmentation.
In contrast, we propose an approach that can quickly generate realistic clothed human avatars, represented as controllable neural SDFs, given only monocular depth images.
However, the implicit nature of neural implicit representations results in slow inference time and requires careful initialization.
At the same time, neural radiance fields have revolutionized novel view synthesis.
How should representations from complementary sensors be integrated for autonomous driving?
Ranked #1 on Autonomous Driving on Town05 Short
We combine PTF with multi-class occupancy networks, obtaining a novel learning-based framework that learns to simultaneously predict shape and per-point correspondences between the posed space and the canonical space for clothed humans.
However, this is problematic since the backward warp field is pose dependent and thus requires large amounts of data to learn.
Although stereo matching accuracy has greatly improved with deep learning in the last few years, efficiently recovering sharp boundaries and high-resolution outputs remains challenging.
At test time, our model generates images with explicit control over the camera as well as the shape and appearance of the scene.
The INN allows us to compute the inverse mapping of the homeomorphism, which, in turn, enables the efficient computation of both the implicit surface function of a primitive and its mesh, without any additional post-processing.
The task of assigning semantic classes and track identities to every pixel in a video is called video panoptic segmentation.
Prior works on image classification show that instead of learning a connection to object shape, deep classifiers tend to exploit spurious correlations with low-level texture or the background for solving the classification task.
Recently, several frameworks for self-supervised learning of 3D scene flow on point clouds have emerged.
While several recent works investigate how to disentangle underlying factors of variation in the data, most of them operate in 2D and hence ignore that our world is three-dimensional.
Multi-Object Tracking (MOT) has been notoriously difficult to evaluate.
Many object pose estimation algorithms rely on the analysis-by-synthesis framework which requires explicit representations of individual object instances.
In contrast to voxel-based representations, radiance fields are not confined to a coarse discretization of the 3D space, yet allow for disentangling camera and scene properties while degrading gracefully in the presence of reconstruction ambiguity.
Ranked #2 on Scene Generation on VizDoom
Neural rendering techniques promise efficient photo-realistic image synthesis while at the same time providing rich control over scene parameters by learning the physical image formation process.
Perceiving the world in terms of objects and tracking them through time is a crucial prerequisite for reasoning and scene understanding.
In recent years, deep generative models have gained significance due to their ability to synthesize natural-looking images with applications ranging from virtual reality to data augmentation for training computer vision models.
Beyond label efficiency, we find several additional training benefits when leveraging visual abstractions, such as a significant reduction in the variance of the learned policy when compared to state-of-the-art end-to-end driving models.
Humans perceive the 3D world as a set of distinct objects that are characterized by various low-level (geometry, reflectance) and high-level (connectivity, adjacency, symmetry) properties.
In this work, we propose a novel implicit representation for capturing the visual appearance of an object in terms of its surface light field.
Motion-blurred images challenge many computer vision algorithms, e.g., feature detection, motion estimation, or object recognition.
In this work, we propose a differentiable rendering formulation for implicit shape and texture representations.
We define the new task of 3D controllable image synthesis and propose an approach for solving it by reasoning both in 3D space and in the 2D image domain.
In this paper, we extend adversarial patch attacks to optical flow networks and show that such attacks can compromise their performance.
A major reason for these limitations is that common representations of texture are inefficient or hard to interface for modern deep learning techniques.
We use both instance-aware semantic segmentation and sparse scene flow to classify objects as either background, moving, or potentially moving, thereby ensuring that the system is able to model objects with the potential to transition from static to dynamic, such as parked cars.
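The classification rule described above can be sketched in a few lines; the class names and motion threshold below are illustrative assumptions, not the system's actual configuration:

```python
# Semantic class decides whether an object could ever move; residual sparse scene flow
# (after ego-motion compensation) decides whether it is actually moving right now.
import numpy as np

POTENTIALLY_DYNAMIC = {"car", "truck", "bus", "person", "bicycle"}

def classify_object(semantic_class, flow_vectors, motion_thresh=0.5):
    """Return 'background', 'moving', or 'potentially moving' for one detected instance.

    semantic_class: class predicted by instance-aware semantic segmentation.
    flow_vectors:   (N, 3) residual 3D scene flow of the instance's sparse matches.
    """
    if semantic_class not in POTENTIALLY_DYNAMIC:
        return "background"            # e.g., buildings, vegetation, road surface
    if len(flow_vectors) and np.median(np.linalg.norm(flow_vectors, axis=1)) > motion_thresh:
        return "moving"                # evidence of independent motion
    return "potentially moving"        # e.g., a parked car that may start driving

print(classify_object("car", np.zeros((10, 3))))   # -> 'potentially moving'
```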
This paper extends the popular task of multi-object tracking to multi-object tracking and segmentation (MOTS).
Ranked #6 on Multi-Object Tracking on MOTS20
RayNet integrates a CNN that learns view-invariant feature representations with an MRF that explicitly encodes the physics of perspective projection and occlusion.
In this paper, we provide a modern synthesis of the classic inverse compositional algorithm for dense image alignment.
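For reference, the textbook inverse compositional scheme for a purely translational warp is sketched below; this is only the classic baseline the paper builds on, not its learned variant, and it assumes SciPy for interpolation. The defining trick is that the Jacobian and Hessian are precomputed once on the template, while each iteration only warps the input image and solves a tiny linear system.

```python
import numpy as np
from scipy.ndimage import shift as warp_translate

def ic_align_translation(T, I, n_iters=50):
    """Estimate a translation p such that I(x + p) approximately equals T(x)."""
    # Precompute on the template (done once, the key efficiency of the IC formulation):
    gy, gx = np.gradient(T)                          # template image gradients
    J = np.stack([gx.ravel(), gy.ravel()], axis=1)   # steepest-descent images (N, 2)
    H_inv = np.linalg.inv(J.T @ J)                   # inverse 2x2 Gauss-Newton Hessian

    p = np.zeros(2)                                  # (p_x, p_y)
    for _ in range(n_iters):
        I_warped = warp_translate(I, shift=(-p[1], -p[0]), order=1)  # sample I(x + p)
        r = (I_warped - T).ravel()                   # residual image
        dp = H_inv @ (J.T @ r)                       # update computed on the template
        p = p - dp                                   # inverse composition: W(p) <- W(p) o W(dp)^-1
        if np.linalg.norm(dp) < 1e-4:
            break
    return p
```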
With the advent of deep neural networks, learning-based approaches for 3D reconstruction have gained popularity.
The task of generating natural images from 3D scenes has been a long-standing goal in computer graphics.
Omnidirectional cameras offer great benefits over classical cameras wherever a wide field of view is essential, such as in virtual reality applications or in autonomous robots.
In this paper, we propose a framework for unsupervised learning of optical flow and occlusions over multiple frames.
In contrast to existing variational methods for semantic 3D reconstruction, our model is end-to-end trainable and captures more complex dependencies between the semantic labels and the 3D geometry.
Most existing approaches to autonomous driving fall into one of two categories: modular pipelines, which build an extensive model of the environment, and imitation learning approaches, which map images directly to control outputs.
In this paper, we propose to estimate 3D motion from such unstructured point clouds using a deep neural network.
Existing learning-based solutions to 3D surface prediction cannot be trained end-to-end as they operate on intermediate representations (e.g., TSDF) from which 3D surface meshes must be extracted in a post-processing step (e.g., via the marching cubes algorithm).
Learning-based approaches, in contrast, avoid the expensive optimization step and instead directly predict the complete shape from the incomplete observations using deep neural networks.
In this paper, we show that the requirement of absolute continuity is necessary: we describe a simple yet prototypical counterexample showing that in the more realistic case of distributions that are not absolutely continuous, unregularized GAN training is not always convergent.
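A counterexample of this flavor, often referred to as the Dirac-GAN, can be simulated in a few lines: the data distribution is a Dirac at 0, the generator outputs a Dirac at theta, and the discriminator is linear, D(x) = psi * x. The step size and loss below are illustrative choices; the point is that plain simultaneous gradient descent/ascent circles around the equilibrium (0, 0) and drifts away from it instead of converging.

```python
import numpy as np

def f_prime(t):
    # Derivative of the standard GAN loss term f(t) = -log(1 + exp(-t)).
    return 1.0 / (1.0 + np.exp(t))

theta, psi, h = 1.0, 1.0, 0.1        # generator parameter, discriminator slope, step size
r0 = np.hypot(theta, psi)            # distance to the equilibrium (0, 0)
for _ in range(500):
    g = f_prime(psi * theta)
    # Simultaneous (unregularized) updates: generator descends, discriminator ascends.
    theta, psi = theta - h * psi * g, psi + h * theta * g
print(f"distance to equilibrium grew from {r0:.3f} to {np.hypot(theta, psi):.3f}")
```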
Here we take a deeper look at the combination of flow and action recognition, and investigate why optical flow is helpful, what makes a flow method good for action recognition, and how we can make it better.
Existing methods for 3D scene flow estimation often fail in the presence of large displacement or local ambiguities, e.g., at texture-less or reflective surfaces.
In this paper, we consider convolutional neural networks operating on sparse inputs with an application to depth upsampling from sparse laser scan data.
Ranked #16 on Depth Completion on KITTI Depth Completion
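One common way to realize the sparsity-aware convolutions mentioned two lines above is to let only observed pixels contribute and to normalize each response by how many observed pixels fell under the kernel; the PyTorch sketch below is a minimal illustration under these assumptions, with hypothetical layer and variable names rather than a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.pad = kernel_size // 2
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=self.pad, bias=False)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # Fixed all-ones kernel that counts valid input pixels under each window.
        self.register_buffer("ones", torch.ones(1, 1, kernel_size, kernel_size))

    def forward(self, x, mask):
        # x:    (B, C, H, W) sparse input (values at unobserved pixels are arbitrary)
        # mask: (B, 1, H, W) with 1 where x is observed, 0 elsewhere
        num_valid = F.conv2d(mask, self.ones, padding=self.pad)
        y = self.conv(x * mask) / (num_valid + 1e-8)        # normalize by observation count
        y = y + self.bias.view(1, -1, 1, 1)
        # Propagate the mask: an output pixel is valid if any input under the kernel was.
        new_mask = F.max_pool2d(mask, kernel_size=2 * self.pad + 1, stride=1, padding=self.pad)
        return y, new_mask

# Usage on a toy sparse depth map with roughly 5% of pixels observed:
depth = torch.rand(1, 1, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.95).float()
out, out_mask = SparseConv2d(1, 16)(depth * mask, mask)
```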
Further, we demonstrate the utility of our approach on training standard deep models for semantic instance segmentation and object detection of cars in outdoor driving scenes.
Existing optical flow datasets are limited in size and variability due to the difficulty of capturing dense ground truth.
Adding knowledge of the direction of triangulation, we are able to approximate the position of the camera from two matches alone.
Motivated by the limitations of existing multi-view stereo benchmarks, we present a novel dataset for this task.
Due to its probabilistic nature, the approach is able to cope with the approximate geometry of the 3D models as well as input shapes that are not present in the scene.
Towards this goal, we analyze the performance of the state of the art on several challenging benchmarking datasets, including KITTI, MOT, and Cityscapes.
In this paper, we present a learning-based approach to depth fusion, i.e., dense 3D reconstruction from multiple depth images.
We show that in the nonparametric limit our method yields an exact maximum-likelihood assignment for the parameters of the generative model, as well as the exact posterior distribution over the latent variables given an observation.
In this paper, we propose a non-local structured prior for volumetric multi-view 3D reconstruction.
Semantic annotations are vital for training models for object recognition, semantic segmentation or scene understanding.
One of the most popular approaches to multi-target tracking is tracking-by-detection.
Ranked #23 on Multiple Object Tracking on KITTI Tracking test
In this paper, we propose an affordable solution to self-localization, which utilizes visual odometry and road maps as the only inputs.