As a result, outlier detection is a fundamental problem in computer vision and a wide range of approaches, from simple checks based on descriptor similarity to geometric verification, have been proposed over the last decades.
The recent trend towards cloud-based localization and mapping systems has raised significant privacy concerns.
Temporal alignment of fine-grained human actions in videos is important for numerous applications in computer vision, robotics, and mixed reality.
Spatial computing -- the ability of devices to be aware of their surroundings and to represent this digitally -- offers novel capabilities in human-robot interaction.
We propose the Model Quality Network (MQ-Net for short) for predicting the quality, e.g., the pose error of essential matrices, of models generated inside RANSAC.
Neural implicit representations have recently shown encouraging results in various domains, including promising progress in simultaneous localization and mapping (SLAM).
Building upon the recent progress in novel view synthesis, we propose its application to improve monocular depth estimation.
However, research in this area is currently hindered by the lack of data.
In this work, we propose a novel neural implicit representation for the human body, which is fully differentiable and optimizable with disentangled shape and pose latent spaces.
We propose a method for jointly estimating the 3D motion, 3D shape, and appearance of highly motion-blurred objects from a video.
To this end, we propose an approach to enforce temporal priors on the optimal transport matrix, which leverages temporal consistency, while allowing for variations in the order of actions.
However, existing recurrent methods only model the local dependencies in the depth domain, which greatly limits the capability of capturing the global scene context along the depth dimension.
Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may have large appearance variations given varying conditions and versions of the same product.
To prove the effectiveness of the proposed motion priors, we combine them into a novel pipeline for 4D human body capture in 3D scenes.
Finding local features that are repeatable across multiple views is a cornerstone of sparse 3D reconstruction.
In this paper, we propose a solution to the uncalibrated privacy preserving localization and mapping problem.
We address the novel task of jointly reconstructing the 3D shape, texture, and motion of an object from a single motion-blurred image.
However, the implicit nature of neural implicit representations results in slow inference time and requires careful initialization.
To this end, we propose a method to create a unified dataset for egocentric 3D interaction recognition.
In this paper, we aim at improving the computational efficiency of graph convolutional networks (GCNs) for learning on point clouds.
We therefore introduce the first joint detection and description of line segments in a single deep network.
1 code implementation • • Paul-Edouard Sarlin, Ajaykumar Unagar, Måns Larsson, Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys, Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, Torsten Sattler
In this paper, we go Back to the Feature: we argue that deep networks should focus on learning robust and invariant visual features, while the geometric estimation should be left to principled algorithms.
We not only propose an image-based local structured implicit network to improve the object shape estimation, but also refine the 3D object pose and scene layout via a novel implicit scene graph neural network that exploits the implicit local object features.
Ranked #1 on Room Layout Estimation on SUN RGB-D (using extra training data)
We look at the general case where neither the emission times of the sources nor the reference time frames of the receivers are known.
State-of-the-art GCNs adopt $K$-nearest neighbor (KNN) searches for local feature aggregation and feature extraction operations from layer to layer.
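The KNN-based aggregation mentioned above can be sketched as follows. This is a minimal brute-force illustration, not the implementation of any particular GCN: the function names `knn_indices` and `aggregate_max`, the choice of max pooling as the aggregation, and the toy data are all assumptions made for the example.

```python
import math

def knn_indices(points, query_index, k):
    """Return indices of the k points closest to points[query_index] (brute force)."""
    q = points[query_index]
    dists = [
        (math.dist(q, p), i)
        for i, p in enumerate(points)
        if i != query_index
    ]
    dists.sort()
    return [i for _, i in dists[:k]]

def aggregate_max(points, features, k):
    """For each point, take the channel-wise max over its own feature
    and the features of its k nearest neighbors (an assumed aggregation)."""
    pooled = []
    for i in range(len(points)):
        group = [i] + knn_indices(points, i, k)
        dim = len(features[i])
        pooled.append([max(features[j][c] for j in group) for c in range(dim)])
    return pooled

# Toy point cloud: three nearby points and one far-away point.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
features = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [9.0, 9.0]]
pooled = aggregate_max(points, features, k=2)
```

The brute-force search costs one distance computation per pair of points for every layer that repeats it, which is the kind of overhead that makes the efficiency of these operations worth studying.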
In this work, we present a lightweight, tightly-coupled deep depth network and visual-inertial odometry (VIO) system, which can provide accurate state estimates and dense depth maps of the immediate surroundings.
Compared to other methods, such as deblatting, inference is several orders of magnitude faster, which enables applications such as real-time detection of fast-moving objects and retrieval in large video collections.
For geometrical and temporal consistency, our approach explicitly creates a 3D point cloud representation of the scene and maintains dense 3D-2D correspondences across frames that reflect the geometric scene configuration inferred from the satellite view.
We propose an online multi-view depth prediction approach on posed video streams, where the scene geometry information computed in the previous time steps is propagated to the current time step in an efficient and geometrically plausible way.
We present PatchmatchNet, a novel and learnable cascade formulation of Patchmatch for high-resolution multi-view stereo.
Ranked #7 on 3D Reconstruction on DTU
Visual localization and mapping is the key technology underlying the majority of mixed reality and robotics systems.
We propose a method that, given a single image with its estimated background, outputs the object's appearance and position in a series of sub-frames as if captured by a high-speed camera (i.e., temporal super-resolution).
Ranked #1 on Video Super-Resolution on Falling Objects
Localization of a robotic system within a previously mapped environment is important for reducing estimation drift and for reusing previously built maps.
Our approach learns to decompose images of synthetic scenes with multiple objects on a planar surface into their constituent objects and to infer their 3D properties from a single view.
Surface reconstruction from magnetic resonance (MR) imaging data is indispensable in medical image analysis and clinical research.
By differentiable rendering, we train our model to decompose scenes self-supervised from RGB-D images.
Most of the current scene flow methods choose to model scene flow as a per point translation vector without differentiating between static and dynamic components of 3D motion.
1 code implementation • 25 Aug 2020 • Dorin Ungureanu, Federica Bogo, Silvano Galliani, Pooja Sama, Xin Duan, Casey Meekhof, Jan Stühmer, Thomas J. Cashman, Bugra Tekin, Johannes L. Schönberger, Pawel Olszta, Marc Pollefeys
Mixed reality headsets, such as the Microsoft HoloLens 2, are powerful sensing devices with integrated compute capabilities, which makes them ideal platforms for computer vision research.
Only the tracked planar points belonging to the same plane will be used for plane initialization, which makes the plane extraction efficient and robust.
We present a novel 3D shape completion method that operates directly on unstructured point clouds, thus avoiding resource-intensive data structures like voxel grids.
In particular, our approach is more robust than the naive approach of first estimating intrinsic parameters and pose per camera before refining the extrinsic parameters of the system.
To be invariant, or not to be invariant: that is the question formulated in this work about local descriptors.
Many computer vision systems require users to upload image features to the cloud for processing and storage.
Local feature matching is a critical component of many computer vision pipelines, including among others Structure-from-Motion, SLAM, and Visual Localization.
Previous methods for estimating detailed human depth often require supervised training with 'ground truth' depth data.
Modeling hand-object manipulations is essential for understanding how humans interact with their environment.
In this paper, we present an omnidirectional localization and dense mapping system for a wide-baseline multiview stereo setup with ultra-wide field-of-view (FOV) fisheye cameras, which provides 360-degree coverage of stereo observations of the environment.
In this work, we address the problem of refining the geometry of local image features from multiple views without known scene or camera geometry.
Motion-blurred images challenge many computer vision algorithms, e.g., feature detection, motion estimation, or object recognition.
In this paper, we propose a depth completion and uncertainty estimation approach that better handles the challenges of aerial platforms, such as large viewpoint and depth variations, and limited computing resources.
We present a super-resolution method capable of creating a high-resolution texture map for a virtual 3D object from a set of lower-resolution images of that object.
In contrast, generic camera models allow for very accurate calibration due to their flexibility.
When we take photos through glass windows or doors, the transmitted background scene is often blended with undesirable reflection.
We propose a differentiable sphere tracing algorithm to bridge the gap between inverse graphics methods and the recently proposed deep learning based implicit signed distance function.
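For context, classic (non-differentiable) sphere tracing against a signed distance function can be sketched as below. This shows only the standard forward ray-marching step, not the differentiable variant the abstract proposes. The analytic sphere SDF stands in for a learned implicit function, and the names `sphere_sdf` and `sphere_trace` and all parameter values are assumptions chosen for illustration.

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 3.0), radius=1.0):
    """Signed distance from point p to a sphere (stand-in for a learned SDF)."""
    return math.dist(p, center) - radius

def sphere_trace(origin, direction, sdf, max_steps=64, eps=1e-6, max_dist=100.0):
    """March along a ray; `direction` is assumed to be unit length.

    Each step advances by the SDF value, which is the largest step
    guaranteed not to cross the surface. Returns the ray parameter t
    at the hit, or None if no surface is found.
    """
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < eps:
            return t
        t += dist
        if t > max_dist:
            break
    return None

hit = sphere_trace((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), sphere_sdf)   # ray towards the sphere
miss = sphere_trace((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), sphere_sdf)  # ray that passes beside it
```

Here the first ray reaches the sphere surface at depth 2 along the z-axis, while the second ray never gets within `eps` of the surface and is abandoned once it exceeds `max_dist`.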
This work presents and evaluates a novel compact scene representation based on Stixels that infers geometric and semantic information.
Our method learns sensor or algorithm properties jointly with semantic depth fusion and scene completion and can also be used as an expert system, e.g., to unify the strengths of various photometric stereo algorithms.
Using a classical feature-based approach within this framework, we show state-of-the-art performance.
Our approach spans from offline model building to real-time client-side pose fusion.
The depth and semantic information is incorporated as a unary potential, smoothed by a pairwise regularizer.
The second goal is to learn instance information by densely estimating directional information of the instance's center of mass for each voxel.
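One plausible way to construct such per-voxel directional targets is sketched below, under stated assumptions: every voxel has unit mass, so the center of mass is the centroid of the instance's voxels, and the target is the unit vector from each voxel towards that centroid. The function name `center_directions` and the toy data are hypothetical and are not taken from the paper.

```python
import math

def center_directions(voxels, instance_ids):
    """Return, for each voxel, the unit vector pointing to its instance's centroid."""
    # Accumulate coordinate sums and voxel counts per instance.
    sums = {}
    for v, inst in zip(voxels, instance_ids):
        s = sums.setdefault(inst, [0.0, 0.0, 0.0, 0])
        for axis in range(3):
            s[axis] += v[axis]
        s[3] += 1
    centers = {
        inst: tuple(s[axis] / s[3] for axis in range(3))
        for inst, s in sums.items()
    }

    directions = []
    for v, inst in zip(voxels, instance_ids):
        offset = tuple(centers[inst][axis] - v[axis] for axis in range(3))
        norm = math.sqrt(sum(x * x for x in offset))
        # A voxel located exactly at the centroid has no defined direction.
        directions.append(
            tuple(x / norm for x in offset) if norm > 0 else (0.0, 0.0, 0.0)
        )
    return directions

voxels = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
instance_ids = [1, 1, 2]
directions = center_directions(voxels, instance_ids)
```

In this toy case the two voxels of instance 1 point towards each other along the x-axis, and the single voxel of instance 2 coincides with its own centroid and receives the zero vector.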
Ranked #2 on 3D Semantic Instance Segmentation on ScanNetV2
Experimental results demonstrate that our proposed networks successfully incorporate the 3D geometric information and super-resolve the texture maps.
In this work we address the problem of finding reliable pixel-level correspondences under difficult imaging conditions.
Ranked #10 on Image Matching on IMC PhotoTourism
We use both instance-aware semantic segmentation and sparse scene flow to classify objects as either background, moving, or potentially moving, thereby ensuring that the system is able to model objects with the potential to transition from static to dynamic, such as parked cars.
Given a single RGB image, our model jointly estimates the 3D hand and object poses, models their interactions, and recognizes the object and action classes with a single feed-forward pass through a neural network.
We furthermore use our model to show that pose regression is more closely related to pose approximation via image retrieval than to accurate pose estimation via 3D structure.
Current localization systems rely on the persistent storage of 3D point clouds of the scene to enable camera pose estimation, but such data reveals potentially sensitive scene information.
We propose instead to tightly couple mesh regularization and state estimation by detecting and enforcing structural regularities in a novel factor-graph formulation.
In this paper, we propose a deep learning architecture that produces accurate dense depth for outdoor scenes from a single color image and sparse depth.
This paper addresses the challenge of dense pixel correspondence estimation between two images.
Ranked #2 on Dense Pixel Correspondence Estimation on HPatches
One solution to this problem is to allow the agent to create rewards for itself, thus making rewards dense and more suitable for learning.
We then compare the daytime and translated night images to obtain a pose estimate for the night image using the known 6-DOF position of the closest day image.
This results in a system that provides reliable and drift-free pose estimates for high-speed autonomous driving.
In contrast to existing variational methods for semantic 3D reconstruction, our model is end-to-end trainable and captures more complex dependencies between the semantic labels and the 3D geometry.
Robust data association is a core problem of visual odometry, where image-to-image correspondences provide constraints for camera pose and map estimation.
We present a dataset of thousands of ambient and flash illumination pairs to enable studying flash photography and other applications that can benefit from having separate illuminations.
Robust and accurate visual localization across large appearance variations due to changes in time of day, seasons, or changes of the environment is a challenging problem which is of importance to application areas such as navigation of autonomous robots.
Surface reconstruction is a vital tool in a wide range of areas of medical image analysis and clinical research.
Besides outperforming previous compression techniques in terms of pose accuracy under the same memory constraints, our compression scheme itself is also more efficient.
Image-based 3D reconstruction for Internet photo collections has become a robust technology to produce impressive virtual representations of real-world scenes.
We seek to predict the 6 degree-of-freedom (6DoF) pose of a query photograph with respect to a large indoor 3D map.
Surface reconstruction from a point cloud is a standard subproblem in many algorithms for dense 3D reconstruction from RGB images or depth maps.
Accurate segmentation of the heart is an important step towards evaluating cardiac function.
To minimize the number of cameras needed for surround perception, we utilize fisheye cameras.
2 code implementations • • Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, Tomas Pajdla
Visual localization enables autonomous vehicles to navigate in their surroundings and augmented reality applications to link virtual to real worlds.
In this work we present a novel compact scene representation based on Stixels that infers geometric and semantic information.
Our resulting novel linear system formulation can be solved in closed-form and is robust against several fundamental challenges of natural matting such as holes and remote intricate structures.
Motivated by the limitations of existing multi-view stereo benchmarks, we present a novel dataset for this task.
3D structure-based methods employ 3D models of the scene to estimate the full 6DOF pose of a camera very accurately.
Moreover, we propose a novel SGM parameterization, which deploys different penalties depending on either positive or negative disparity changes in order to represent the object structures more discriminatively.
In terms of matching performance, we evaluate the different descriptors regarding standard criteria. However, considering matching performance in isolation only provides an incomplete measure of a descriptor’s quality.
While randomized methods like RANSAC are fast, they do not guarantee global optimality and fail to manage large amounts of outliers.
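The randomized hypothesize-and-verify loop of RANSAC can be illustrated on a toy problem. The sketch below fits a 2D line, which is an assumption made for brevity and not the estimation problem of the paper. The names `fit_line` and `ransac_line`, the vertical-residual inlier test, and all parameter values are likewise illustrative.

```python
import random

def fit_line(p, q):
    """Fit y = slope * x + intercept through two points; None for vertical pairs."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        return None
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

def ransac_line(points, iterations=200, threshold=0.1, seed=0):
    """Keep the minimal-sample hypothesis with the most inliers.

    Only `iterations` random samples are tested, so the result is the best
    hypothesis seen, with no guarantee of global optimality.
    """
    rng = random.Random(seed)
    best_model, best_inliers = None, -1
    for _ in range(iterations):
        model = fit_line(*rng.sample(points, 2))
        if model is None:
            continue
        slope, intercept = model
        inliers = sum(
            1 for x, y in points if abs(y - (slope * x + intercept)) <= threshold
        )
        if inliers > best_inliers:
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Ten points on y = 2x + 1 plus three gross outliers.
inlier_points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
outlier_points = [(1.0, 9.0), (4.0, -3.0), (7.0, 30.0)]
model, support = ransac_line(inlier_points + outlier_points)
```

The chance that a random minimal sample is outlier-free shrinks quickly as the outlier ratio grows, so the number of iterations needed rises sharply. That is the failure mode with many outliers noted above.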
Adding the knowledge of direction of triangulation, we are able to approximate the position of the camera from two matches alone.
Our resulting novel linear system formulation can be solved in closed-form and is robust against several fundamental challenges in natural matting such as holes and remote intricate structures.
We present a method to jointly refine the geometry and semantic segmentation of 3D surface meshes.
We propose to use a hierarchical semantic representation of the objects, coming from a convolutional neural network, to solve this ambiguity.
With the massive data set presented in this paper, we aim at closing this data gap to help unleash the full potential of deep learning methods for 3D labelling tasks.
We believe this challenge should be faced by introducing a representation of the sensory data that provides compressed and structured access to all relevant visual content of the scene.
In this paper, we ask a fundamental question: can we learn such detectors from scratch?
We present a novel method for accurate and efficient upsampling of sparse depth data, guided by high-resolution imagery.
It is well known that the rolling shutter effect in images captured with a moving rolling shutter camera causes inaccuracies in 3D reconstructions.
In particular, due to the differences in spectral sensitivities of the cameras, different cameras yield different RGB measurements for the same spectral signal.
Visual location recognition is the task of determining the place depicted in a query image from a given database of geo-tagged images.
In this paper we propose a new approach to incrementally initialize a manifold surface for automatic 3D reconstruction from images.
This more efficient use of training data results in better performance on popular benchmark datasets with a smaller number of parameters, compared to standard convolutional neural networks with dataset augmentation and to other baselines.
We propose an approach for dense semantic 3D reconstruction which uses a data term that is defined as potentials over viewing rays, combined with continuous surface area penalization.
An important variant of this problem is the case in which individual sides of a building can be reconstructed but not joined due to the missing visual overlap.
As a second step, we obtain the calibration by finding the translation of the camera center using an ordering constraint.
Despite their enormous success in solving hard combinatorial problems, convex relaxation approaches often suffer from the fact that the computed solutions are far from binary and that subsequent heuristic binarization may substantially degrade the quality of computed solutions.
Hand motion capture is a popular research field, recently gaining more attention due to the ubiquity of RGB-D sensors.
Videos consisting of thousands of high-resolution frames are challenging for existing structure from motion (SfM) and simultaneous localization and mapping (SLAM) techniques.
In this work we make use of recent advances in data-driven classification to improve standard approaches for binocular stereo matching and single-view depth estimation.
In this paper we propose a method that learns the matching function and automatically finds the space of allowed changes in visual appearance, such as those due to motion blur, chromatic distortions, different colour calibration, or seasonal changes.
The limitations of current state-of-the-art methods for single-view depth estimation and semantic segmentation are closely tied to a property of perspective geometry: the perceived size of objects scales inversely with their distance.
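The inverse relation between perceived size and distance follows from the standard pinhole camera model. As a brief worked form, assume focal length $f$, an object of physical height $H$ at depth $Z$, and image-plane height $h$ (these symbols are introduced here for illustration only):

```latex
h = \frac{f\,H}{Z}
\qquad\Longrightarrow\qquad
\frac{h_1}{h_2} = \frac{Z_2}{Z_1}
\quad \text{for the same object at depths } Z_1 \text{ and } Z_2 .
```

For example, moving an object from $Z_1 = 2\,\mathrm{m}$ to $Z_2 = 4\,\mathrm{m}$ halves its height in the image.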
In this paper, we present our minimal 4-point and linear 8-point algorithms to estimate the relative pose of a multi-camera system with known vertical directions, i.e., known absolute roll and pitch angles.
Motivated by a Bayesian view of the problem of multi-view 3D reconstruction from images, we propose a dense 3D reconstruction technique that jointly refines the shape and the camera parameters of a scene by minimizing the photometric reprojection error between a generated model and the observed images, hence considering all pixels in the original images.
We propose a sequential optimization technique for segmenting a rectified image of a facade into semantic categories.
Image segmentations provide geometric cues about which surface orientations are more likely to appear at a certain location in space whereas a dense 3D reconstruction yields a suitable regularization for the segmentation problem by lifting the labeling from 2D images to 3D space.
By modeling the multicamera system as a generalized camera and applying the non-holonomic motion constraint of a car, we show that this leads to a novel 2-point minimal solution for the generalized essential matrix where the full relative motion including metric scale can be obtained.
In this paper, we propose a method to detect changes in the geometry of a city using panoramic images captured by a car driving around the city.
While finding the exact solution for the MAP inference problem is intractable for many real-world tasks, MAP LP relaxations have been shown to be very effective in practice.
We describe a "log-bilinear" model that computes class probabilities by combining an input vector multiplicatively with a vector of binary latent variables.