The recent trend towards cloud-based localization and mapping systems has raised significant privacy concerns.
As a result, outlier detection is a fundamental problem in computer vision and a wide range of approaches, from simple checks based on descriptor similarity to geometric verification, have been proposed over the last decades.
Skeletal action recognition from an egocentric view is important for applications such as interfaces in AR/VR glasses and human-robot interaction, where the device has limited resources.
Towards this end, both explicit and implicit 3D representations are heavily studied for holistic modeling and capture of the whole human (e.g., body, clothing, face, and hair), but neither representation is an optimal choice in terms of representation efficacy, since different parts of the human avatar have different modeling desiderata.
Neural fields, a category of neural networks trained to represent high-frequency signals, have gained significant attention in recent years due to their impressive performance in modeling complex 3D data, especially large signed distance fields (SDFs) or neural radiance fields (NeRFs), via a single multi-layer perceptron (MLP).
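For concreteness, such a coordinate-based neural field can be as small as a single MLP mapping a 3D point to a signed distance. The sketch below is a generic illustration (not any specific paper's architecture), using a Fourier positional encoding to help the network fit high-frequency detail:

```python
import torch
import torch.nn as nn

class SDFField(nn.Module):
    # Minimal sketch of a coordinate-based neural field: one MLP mapping
    # a 3D point to a signed distance, with a Fourier positional encoding.
    def __init__(self, n_freqs=6, hidden=256):
        super().__init__()
        self.n_freqs = n_freqs
        in_dim = 3 + 3 * 2 * n_freqs  # raw xyz + sin/cos per frequency
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz):  # (N, 3) points -> (N, 1) signed distances
        enc = [xyz]
        for i in range(self.n_freqs):
            enc += [torch.sin(2**i * xyz), torch.cos(2**i * xyz)]
        return self.mlp(torch.cat(enc, dim=-1))
```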
We propose R3D3, a multi-camera system for dense 3D reconstruction and ego-motion estimation.
We tackle the problem of estimating a Manhattan frame, i.e., three orthogonal vanishing points, and the unknown focal length of the camera, leveraging a prior vertical direction.
Therefore, RLSAC can learn from the features and the feedback of downstream tasks for end-to-end robust estimation without requiring a differentiable pipeline.
no code implementations • 5 Aug 2023 • Florentin Liebmann, Marco von Atzigen, Dominik Stütz, Julian Wolf, Lukas Zingg, Daniel Suter, Laura Leoty, Hooman Esfandiari, Jess G. Snedeker, Martin R. Oswald, Marc Pollefeys, Mazda Farshad, Philipp Fürnstahl
Intuitive surgical guidance is provided thanks to the integration into an augmented-reality-based navigation system.
Predictive variability due to data ambiguities has typically been addressed via construction of dedicated models with built-in probabilistic capabilities that are trained to predict uncertainty estimates as variables of interest.
Moreover, we derive a new minimal solver for homography estimation, requiring only a single affine correspondence (AC) and a gravity prior.
The proposed attention mechanism and one-step transformer provide an adaptive behavior that enhances the performance of RANSAC, making it a more effective tool for robust estimation.
Estimating camera motion in deformable scenes poses a complex and open research challenge.
In this work, we address this limitation and propose OpenMask3D, a zero-shot approach for open-vocabulary 3D instance segmentation.
We introduce LightGlue, a deep neural network that learns to match local features across images.
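As a usage sketch, the publicly released implementation (github.com/cvg/LightGlue) exposes roughly the following interface; the names below are taken from its README and may change between versions, and the image paths are placeholders:

```python
# Install from github.com/cvg/LightGlue before running.
from lightglue import LightGlue, SuperPoint
from lightglue.utils import load_image, rbd

extractor = SuperPoint(max_num_keypoints=2048).eval()  # local feature extractor
matcher = LightGlue(features='superpoint').eval()      # the matcher itself

image0 = load_image('image0.jpg')
image1 = load_image('image1.jpg')
feats0 = extractor.extract(image0)
feats1 = extractor.extract(image1)
matches01 = matcher({'image0': feats0, 'image1': feats1})
feats0, feats1, matches01 = [rbd(x) for x in (feats0, feats1, matches01)]

matches = matches01['matches']                 # (K, 2) indices into keypoints
points0 = feats0['keypoints'][matches[..., 0]]  # matched points in image 0
points1 = feats1['keypoints'][matches[..., 1]]  # matched points in image 1
```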
Semantic 2D maps are commonly used by humans and machines for navigation purposes, whether it's walking or driving.
A particular focus in computer-assisted surgery is to replace marker-based tracking systems for instrument localization with pure image-based 6DoF pose estimation.
Our approach compares favorably to previous state-of-the-art object-level matching approaches and achieves improved performance over a pure keypoint-based approach for large viewpoint changes.
We propose SGAligner, the first method for aligning pairs of 3D scene graphs that is robust to in-the-wild scenarios (i.e., unknown overlap, if any, and changes in the environment).
Ranked #1 on Point Cloud Registration on 3RScan
We argue that this representation is limited and instead propose to guide and improve 2D tracking with an explicit object representation, namely the textured 3D shape and 6DoF pose in each video frame.
The key idea is to tackle the inverse problem of image deblurring by modeling the forward problem with a 3D human model, a texture map, and a sequence of poses to describe human motion.
In contrast to sparse keypoints, a handful of line segments can concisely encode the high-level scene layout, as they often delineate the main structural elements.
Specifically, a projection-aware hierarchical transformer is proposed to capture long-range dependencies and filter outliers by extracting point features globally.
Neural implicit representations have recently become popular in simultaneous localization and mapping (SLAM), especially in dense visual SLAM.
The minimal case for reconstruction requires 13 points in 4 views, for both calibrated and uncalibrated cameras.
We validate our approach using a new and still-challenging dataset for the task of NeRF inpainting.
Their learned counterparts are more repeatable and can handle challenging images, but at the cost of a lower accuracy and a bias towards wireframe lines.
The success of Neural Radiance Fields (NeRF) in novel view synthesis has inspired researchers to propose neural implicit scene reconstruction.
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision.
To close this gap, we introduce LaMAR, a new benchmark with a comprehensive capture and ground-truth (GT) pipeline that co-registers realistic trajectories and sensor streams captured by heterogeneous AR devices in large, unconstrained scenes.
The generation of triangle meshes from point clouds, i.e., meshing, is a core task in computer graphics and computer vision.
Building on this insight, we propose SCARF (Segmented Clothed Avatar Radiance Field), a hybrid model combining a mesh-based body with a neural radiance field.
Existing methods that combine inverse rendering with neural rendering can perform editable novel view synthesis only on object-specific scenes. We present intrinsic neural radiance fields, dubbed IntrinsicNeRF, which introduce intrinsic decomposition into NeRF-based neural rendering and extend its application to room-scale scenes.
A distinctive representation of image patches in the form of features is a key component of many computer vision and robotics tasks, such as image matching, image retrieval, and visual localization.
3D textured shape recovery from partial scans is crucial for many real-world applications.
Visual (re)localization addresses the problem of estimating the 6-DoF (degrees of freedom) camera pose of a query image captured in a known scene, which is a key building block of many computer vision and robotics applications.
We introduce a scalable framework for novel view synthesis from RGB-D images with largely incomplete scene coverage.
We tested NeFSAC on more than 100k image pairs from three publicly available real-world datasets and found that it leads to an order-of-magnitude speed-up, while often finding more accurate results than USAC alone.
Temporal alignment of fine-grained human actions in videos is important for numerous applications in computer vision, robotics, and mixed reality.
Spatial computing -- the ability of devices to be aware of their surroundings and to represent this digitally -- offers novel capabilities in human-robot interaction.
We propose the Model Quality Network, MQ-Net in short, for predicting the quality, e.g., the pose error of essential matrices, of models generated inside RANSAC.
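This idea slots into a standard RANSAC loop by replacing the usual inlier count with the learned score. The sketch below is a generic illustration in the spirit of that design, where `solve` and `score_model` are hypothetical stand-ins for a minimal solver and the quality network, not the paper's actual interface:

```python
import numpy as np

def ransac_with_learned_quality(data, solve, score_model, n_iters=1000, seed=0):
    # Generic RANSAC loop where a learned quality predictor replaces the
    # usual inlier count. `solve` is a minimal solver (e.g. 5-point
    # essential matrix); `score_model` predicts model quality.
    rng = np.random.default_rng(seed)
    best_model, best_q = None, -np.inf
    for _ in range(n_iters):
        sample = data[rng.choice(len(data), size=5, replace=False)]
        model = solve(sample)
        if model is None:          # degenerate minimal sample
            continue
        q = score_model(model, data)  # predicted quality (higher is better)
        if q > best_q:
            best_q, best_model = q, model
    return best_model
```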
Building upon the recent progress in novel view synthesis, we propose its application to improve monocular depth estimation.
Ranked #15 on Monocular Depth Estimation on KITTI Eigen split
Neural implicit representations have recently shown encouraging results in various domains, including promising progress in simultaneous localization and mapping (SLAM).
Key to reasoning about interactions is to understand the body pose and motion of the interaction partner from the egocentric view.
In this work, we propose a novel neural implicit representation for the human body, which is fully differentiable and optimizable with disentangled shape and pose latent spaces.
We propose a method for jointly estimating the 3D motion, 3D shape, and appearance of highly motion-blurred objects from a video.
To this end, we propose an approach to enforce temporal priors on the optimal transport matrix, which leverages temporal consistency, while allowing for variations in the order of actions.
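One simple way to realize such a prior, shown purely as an illustration (the paper's formulation differs and, notably, allows for variations in the order of actions), is to weight the entropic-OT Gibbs kernel toward temporally consistent couplings before running Sinkhorn iterations:

```python
import numpy as np

def sinkhorn_with_temporal_prior(C, sigma=0.3, eps=0.1, n_iters=100):
    # Entropic optimal transport between two sequences with a hypothetical
    # temporal prior favoring couplings near the diagonal. C is the
    # (T1, T2) cost matrix between frame and action-step embeddings.
    T1, T2 = C.shape
    t1 = np.linspace(0, 1, T1)[:, None]
    t2 = np.linspace(0, 1, T2)[None, :]
    prior = np.exp(-((t1 - t2) ** 2) / (2 * sigma ** 2))
    K = np.exp(-C / eps) * prior                     # prior-weighted kernel
    a, b = np.full(T1, 1 / T1), np.full(T2, 1 / T2)  # uniform marginals
    u, v = np.ones(T1), np.ones(T2)
    for _ in range(n_iters):                         # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]               # soft alignment matrix
```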
However, existing recurrent methods only model the local dependencies in the depth domain, which greatly limits the capability of capturing the global scene context along the depth dimension.
Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may have large appearance variations given varying conditions and versions of the same product.
To prove the effectiveness of the proposed motion priors, we combine them into a novel pipeline for 4D human body capture in 3D scenes.
Finding local features that are repeatable across multiple views is a cornerstone of sparse 3D reconstruction.
In this paper, we propose a solution to the uncalibrated privacy-preserving localization and mapping problem.
We address the novel task of jointly reconstructing the 3D shape, texture, and motion of an object from a single motion-blurred image.
However, the implicit nature of neural implicit representations results in slow inference time and requires careful initialization.
To this end, we propose a method to create a unified dataset for egocentric 3D interaction recognition.
In this paper, we aim at improving the computational efficiency of graph convolutional networks (GCNs) for learning on point clouds.
We thus introduce the first joint detection and description of line segments in a single deep network.
1 code implementation • Paul-Edouard Sarlin, Ajaykumar Unagar, Måns Larsson, Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys, Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, Torsten Sattler
In this paper, we go Back to the Feature: we argue that deep networks should focus on learning robust and invariant visual features, while the geometric estimation should be left to principled algorithms.
We not only propose an image-based local structured implicit network to improve the object shape estimation, but also refine the 3D object pose and scene layout via a novel implicit scene graph neural network that exploits the implicit local object features.
Ranked #1 on 3D Shape Reconstruction on Pix3D
We look at the general case where neither the emission times of the sources nor the reference time frames of the receivers are known.
State-of-the-art GCNs adopt $K$-nearest neighbor (KNN) searches for local feature aggregation and feature extraction operations from layer to layer.
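The per-layer KNN aggregation in question, in its common EdgeConv-style form, looks roughly as follows; this is a sketch of the baseline operation whose cost such work targets, not of any accelerated variant:

```python
import torch

def knn_edge_aggregate(x, k=16):
    # EdgeConv-style KNN aggregation on a point cloud: find each point's
    # k nearest neighbors, build edge features, then max-pool over them.
    # x: (N, C) per-point features.
    dist = torch.cdist(x, x)                              # (N, N) distances
    idx = dist.topk(k + 1, largest=False).indices[:, 1:]  # drop self-match
    neighbors = x[idx]                                    # (N, k, C)
    center = x.unsqueeze(1).expand_as(neighbors)
    edge_feat = torch.cat([center, neighbors - center], dim=-1)  # (N, k, 2C)
    # In a full layer, a shared MLP on edge_feat would precede this pooling.
    return edge_feat.max(dim=1).values                    # (N, 2C)
```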
In this work, we present a lightweight, tightly-coupled deep depth network and visual-inertial odometry (VIO) system, which can provide accurate state estimates and dense depth maps of the immediate surroundings.
Compared to other methods, such as deblatting, the inference is several orders of magnitude faster and allows applications such as real-time fast-moving object detection and retrieval in large video collections.
For geometrical and temporal consistency, our approach explicitly creates a 3D point cloud representation of the scene and maintains dense 3D-2D correspondences across frames that reflect the geometric scene configuration inferred from the satellite view.
We propose an online multi-view depth prediction approach on posed video streams, where the scene geometry information computed in the previous time steps is propagated to the current time step in an efficient and geometrically plausible way.
Visual localization and mapping is the key technology underlying the majority of mixed reality and robotics systems.
We present PatchmatchNet, a novel and learnable cascade formulation of Patchmatch for high-resolution multi-view stereo.
Ranked #5 on Point Clouds on Tanks and Temples
We propose a method that, given a single image with its estimated background, outputs the object's appearance and position in a series of sub-frames as if captured by a high-speed camera (i.e., temporal super-resolution).
Ranked #1 on Video Super-Resolution on Falling Objects
Localization of a robotic system within a previously mapped environment is important for reducing estimation drift and for reusing previously built maps.
Our approach learns to decompose images of synthetic scenes with multiple objects on a planar surface into their constituent scene objects and to infer their 3D properties from a single view.
Surface reconstruction from magnetic resonance (MR) imaging data is indispensable in medical image analysis and clinical research.
Using differentiable rendering, we train our model to decompose scenes from RGB-D images in a self-supervised manner.
Most of the current scene flow methods choose to model scene flow as a per-point translation vector without differentiating between static and dynamic components of 3D motion.
1 code implementation • 25 Aug 2020 • Dorin Ungureanu, Federica Bogo, Silvano Galliani, Pooja Sama, Xin Duan, Casey Meekhof, Jan Stühmer, Thomas J. Cashman, Bugra Tekin, Johannes L. Schönberger, Pawel Olszta, Marc Pollefeys
Mixed reality headsets, such as the Microsoft HoloLens 2, are powerful sensing devices with integrated compute capabilities, which makes them an ideal platform for computer vision research.
Only tracked planar points belonging to the same plane are used for plane initialization, which makes plane extraction efficient and robust.
We present a novel 3D shape completion method that operates directly on unstructured point clouds, thus avoiding resource-intensive data structures like voxel grids.
In particular, our approach is more robust than the naive approach of first estimating intrinsic parameters and pose per camera before refining the extrinsic parameters of the system.
To be invariant, or not to be invariant: that is the question formulated in this work about local descriptors.
Many computer vision systems require users to upload image features to the cloud for processing and storage.
Local feature matching is a critical component of many computer vision pipelines, including among others Structure-from-Motion, SLAM, and Visual Localization.
Previous methods for estimating detailed human depth often require supervised training with "ground truth" depth data.
Modeling hand-object manipulations is essential for understanding how humans interact with their environment.
In this paper, we present an omnidirectional localization and dense mapping system for a wide-baseline multiview stereo setup with ultra-wide field-of-view (FOV) fisheye cameras, which provides 360-degree coverage of stereo observations of the environment.
In this work, we address the problem of refining the geometry of local image features from multiple views without known scene or camera geometry.
Motion-blurred images challenge many computer vision algorithms, e.g., feature detection, motion estimation, or object recognition.
In this paper, we propose a depth completion and uncertainty estimation approach that better handles the challenges of aerial platforms, such as large viewpoint and depth variations, and limited computing resources.
We present a super-resolution method capable of creating a high-resolution texture map for a virtual 3D object from a set of lower-resolution images of that object.
In contrast, generic camera models allow for very accurate calibration due to their flexibility.
When we take photos through glass windows or doors, the transmitted background scene is often blended with undesirable reflection.
We propose a differentiable sphere tracing algorithm to bridge the gap between inverse graphics methods and the recently proposed deep learning based implicit signed distance function.
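Sphere tracing itself is compact; a minimal sketch, assuming the network returns a valid signed-distance bound, looks like this:

```python
import torch

def sphere_trace(sdf, origins, dirs, n_steps=64, eps=1e-4):
    # Sphere tracing sketch: march each ray forward by the signed distance
    # the network predicts at the current point. Because an SDF bounds the
    # distance to the surface, each step is safe; all operations are
    # differentiable, so gradients can flow back into the network.
    # sdf: callable mapping (N, 3) points to (N,) or (N, 1) distances.
    t = torch.zeros(origins.shape[0], device=origins.device)
    for _ in range(n_steps):
        p = origins + t.unsqueeze(-1) * dirs
        d = sdf(p).squeeze(-1)
        t = t + d
        if (d.abs() < eps).all():  # all rays converged onto the surface
            break
    return origins + t.unsqueeze(-1) * dirs  # approximate surface points
```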
This work presents and evaluates a novel compact scene representation based on Stixels that infers geometric and semantic information.
Our method learns sensor or algorithm properties jointly with semantic depth fusion and scene completion and can also be used as an expert system, e.g., to unify the strengths of various photometric stereo algorithms.
Using a classical feature-based approach within this framework, we show state-of-the-art performance.
Our approach spans from offline model building to real-time client-side pose fusion.
The depth and semantic information is incorporated as a unary potential, smoothed by a pairwise regularizer.
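In the usual notation, such an energy has the standard form

E(x) = \sum_i \psi_u(x_i) + \lambda \sum_{(i,j) \in \mathcal{N}} \psi_p(x_i, x_j),

where the unary potential \psi_u fuses the depth and semantic evidence at each site i, the pairwise potential \psi_p penalizes disagreement between neighboring sites (i,j) \in \mathcal{N}, and \lambda balances the two terms. This is the generic formulation; the specific potentials are the paper's own.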
The second goal is to learn instance information by densely estimating directional information of the instance's center of mass for each voxel.
Ranked #2 on 3D Semantic Instance Segmentation on ScanNetV2
Experimental results demonstrate that our proposed networks successfully incorporate the 3D geometric information and super-resolve the texture maps.
In this work we address the problem of finding reliable pixel-level correspondences under difficult imaging conditions.
Ranked #8 on Image Matching on IMC PhotoTourism
We use both instance-aware semantic segmentation and sparse scene flow to classify objects as either background, moving, or potentially moving, thereby ensuring that the system is able to model objects with the potential to transition from static to dynamic, such as parked cars.
Given a single RGB image, our model jointly estimates the 3D hand and object poses, models their interactions, and recognizes the object and action classes with a single feed-forward pass through a neural network.
We furthermore use our model to show that pose regression is more closely related to pose approximation via image retrieval than to accurate pose estimation via 3D structure.
Current localization systems rely on the persistent storage of 3D point clouds of the scene to enable camera pose estimation, but such data reveals potentially sensitive scene information.
We propose instead to tightly couple mesh regularization and state estimation by detecting and enforcing structural regularities in a novel factor-graph formulation.
In this paper, we propose a deep learning architecture that produces accurate dense depth for outdoor scenes from a single color image and a sparse depth map.
This paper addresses the challenge of dense pixel correspondence estimation between two images.
Ranked #2 on Dense Pixel Correspondence Estimation on HPatches
One solution to this problem is to allow the agent to create rewards for itself, thus making rewards dense and more suitable for learning.
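A common instantiation of this idea, sketched generically below (not necessarily the paper's exact reward), pays the agent the prediction error of a learned forward model, so that every transition yields a dense learning signal:

```python
import torch

def intrinsic_reward(forward_model, state_feat, action, next_state_feat):
    # Curiosity-style sketch: the agent rewards itself with the prediction
    # error of a learned forward model. `forward_model` is a hypothetical
    # network predicting the next state features from (state, action).
    pred = forward_model(torch.cat([state_feat, action], dim=-1))
    return 0.5 * (pred - next_state_feat).pow(2).sum(dim=-1)  # per-sample
```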
We then compare the daytime and translated night images to obtain a pose estimate for the night image using the known 6-DOF position of the closest day image.
This results in a system that provides reliable and drift-free pose estimates for high-speed autonomous driving.
We present a dataset of thousands of ambient and flash illumination pairs to enable studying flash photography and other applications that can benefit from having separate illuminations.
Robust and accurate visual localization across large appearance variations, caused by changes in time of day, season, or the environment itself, is a challenging problem of importance to application areas such as navigation of autonomous robots.
Robust data association is a core problem of visual odometry, where image-to-image correspondences provide constraints for camera pose and map estimation.
In contrast to existing variational methods for semantic 3D reconstruction, our model is end-to-end trainable and captures more complex dependencies between the semantic labels and the 3D geometry.
Surface reconstruction is a vital tool in a wide range of areas of medical image analysis and clinical research.
Besides outperforming previous compression techniques in terms of pose accuracy under the same memory constraints, our compression scheme itself is also more efficient.
Image-based 3D reconstruction for Internet photo collections has become a robust technology to produce impressive virtual representations of real-world scenes.
We seek to predict the 6 degree-of-freedom (6DoF) pose of a query photograph with respect to a large indoor 3D map.
Surface reconstruction from a point cloud is a standard subproblem in many algorithms for dense 3D reconstruction from RGB images or depth maps.
Accurate segmentation of the heart is an important step towards evaluating cardiac function.
To minimize the number of cameras needed for surround perception, we utilize fisheye cameras.
2 code implementations • Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, Tomas Pajdla
Visual localization enables autonomous vehicles to navigate in their surroundings and augmented reality applications to link virtual to real worlds.
Our resulting novel linear system formulation can be solved in closed-form and is robust against several fundamental challenges of natural matting such as holes and remote intricate structures.
Adding knowledge of the triangulation direction, we are able to approximate the position of the camera from two matches alone.
While randomized methods like RANSAC are fast, they do not guarantee global optimality and fail to manage large amounts of outliers.
In terms of matching performance, we evaluate the different descriptors according to standard criteria. However, matching performance in isolation provides only an incomplete measure of a descriptor's quality.
Motivated by the limitations of existing multi-view stereo benchmarks, we present a novel dataset for this task.
3D structure-based methods employ 3D models of the scene to estimate the full 6DOF pose of a camera very accurately.
Moreover, we propose a novel SGM parameterization, which deploys different penalties depending on either positive or negative disparity changes in order to represent the object structures more discriminatively.
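Splitting the standard SGM smoothness penalty by the sign of the disparity change, as described, yields a path-aggregation recurrence of roughly this form (a sketch of the idea, not necessarily the paper's exact parameterization):

L_r(p, d) = C(p, d) + \min\left( L_r(p-r, d),\; L_r(p-r, d-1) + P_1^{+},\; L_r(p-r, d+1) + P_1^{-},\; \min_k L_r(p-r, k) + P_2 \right) - \min_k L_r(p-r, k),

where C is the matching cost, r the path direction, P_1^{+} penalizes a unit increase in disparity along the path, P_1^{-} a unit decrease, and P_2 larger jumps; standard SGM uses a single P_1 for both signs.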
We present a method to jointly refine the geometry and semantic segmentation of 3D surface meshes.
We propose to use a hierarchical semantic representation of the objects, coming from a convolutional neural network, to solve this ambiguity.
With the massive data set presented in this paper, we aim at closing this data gap to help unleash the full potential of deep learning methods for 3D labelling tasks.
We believe this challenge should be faced by introducing a representation of the sensory data that provides compressed and structured access to all relevant visual content of the scene.
In this paper, we ask a fundamental question: can we learn such detectors from scratch?
We present a novel method for accurate and efficient upsampling of sparse depth data, guided by high-resolution imagery.
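As a point of reference, the classic joint-bilateral form of image-guided depth upsampling, sketched below purely for illustration (the paper's method differs), weights nearby valid depth samples by spatial proximity and by similarity in the guide image:

```python
import numpy as np

def guided_upsample(depth, valid, guide, r=5, sigma_s=3.0, sigma_r=0.1):
    # Joint-bilateral sketch of image-guided depth upsampling.
    # depth: (H, W) sparse depth (0 where missing); valid: (H, W) bool;
    # guide: (H, W) grayscale high-resolution image in [0, 1].
    H, W = guide.shape
    out = np.zeros((H, W))
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    spatial = np.exp(-(ys**2 + xs**2) / (2 * sigma_s**2))
    for y in range(r, H - r):
        for x in range(r, W - r):
            d = depth[y - r:y + r + 1, x - r:x + r + 1]
            m = valid[y - r:y + r + 1, x - r:x + r + 1]
            g = guide[y - r:y + r + 1, x - r:x + r + 1]
            # Weight by spatial distance, guide similarity, and validity.
            w = spatial * np.exp(-(g - guide[y, x])**2 / (2 * sigma_r**2)) * m
            if w.sum() > 0:
                out[y, x] = (w * d).sum() / w.sum()
    return out
```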
In particular, due to the differences in spectral sensitivities of the cameras, different cameras yield different RGB measurements for the same spectral signal.
Visual location recognition is the task of determining the place depicted in a query image from a given database of geo-tagged images.
It is well known that the rolling shutter effect in images captured with a moving rolling shutter camera causes inaccuracies in 3D reconstructions.
This more efficient use of training data results in better performance on popular benchmark datasets with a smaller number of parameters, when compared to standard convolutional neural networks with dataset augmentation and to other baselines.
In this paper we propose a new approach to incrementally initialize a manifold surface for automatic 3D reconstruction from images.
We propose an approach for dense semantic 3D reconstruction which uses a data term that is defined as potentials over viewing rays, combined with continuous surface area penalization.
An important variant of this problem is the case in which individual sides of a building can be reconstructed but not joined due to the missing visual overlap.
As a second step, we obtain the calibration by finding the translation of the camera center using an ordering constraint.
Despite their enormous success in solving hard combinatorial problems, convex relaxation approaches often suffer from the fact that the computed solutions are far from binary and that subsequent heuristic binarization may substantially degrade the quality of computed solutions.
Hand motion capture is a popular research field, recently gaining more attention due to the ubiquity of RGB-D sensors.
In this work we make use of recent advances in data-driven classification to improve standard approaches for binocular stereo matching and single-view depth estimation.