We demonstrate that it is possible to perform face-related computer vision in the wild using synthetic data alone.
Recent work on Neural Radiance Fields (NeRF) showed how neural networks can be used to encode complex 3D environments that can be rendered photorealistically from novel viewpoints.
The increased availability and maturity of head-mounted and wearable devices open up opportunities for remote communication and collaboration.
Analysis of faces is one of the core applications of computer vision, with tasks including landmark alignment, head pose estimation, expression recognition, and face recognition, among others.
Realtime perceptual and interaction capabilities in mixed reality require a range of 3D tracking problems to be solved at low latency on resource-constrained hardware such as head-mounted devices.
In contrast to computer graphics approaches, generative models learned from more readily available 2D image data have been shown to produce samples of human faces that are hard to distinguish from real data.
Our ability to sample realistic natural images, particularly faces, has advanced by leaps and bounds in recent years, yet our ability to exert fine-tuned control over the generative process has lagged behind.
In this work we propose to learn an efficient algorithm for the task of 6D object pose estimation.
The most promising approach is inspired by reinforcement learning, namely to replace the deterministic hypothesis selection by a probabilistic selection for which we can derive the expected loss w.r.t.
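The core idea behind such a probabilistic selection can be sketched as follows: instead of deterministically picking the highest-scoring hypothesis, hypotheses are selected with softmax probabilities, so the expected task loss becomes a smooth function of the scores. This is a toy illustration under assumed inputs (the scores and per-hypothesis losses are hypothetical stand-ins for a learned scoring function and a pose-error metric), not the paper's actual implementation.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over hypothesis scores."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def expected_loss(scores, losses):
    """Expected task loss under probabilistic hypothesis selection.

    Deterministic argmax selection is replaced by sampling hypothesis i
    with probability softmax(scores)[i]; the expectation of the loss is
    then differentiable w.r.t. the scores, enabling end-to-end learning.
    """
    p = softmax(np.asarray(scores, dtype=float))
    return float(p @ np.asarray(losses, dtype=float))

# Toy usage: three hypotheses; the best-scored one also has the lowest loss.
print(expected_loss([3.0, 1.0, -1.0], [0.1, 0.5, 2.0]))
```

Because the expectation is smooth in the scores, its gradient can be propagated into whatever model produces them.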
We present a fast, practical method for personalizing a hand shape basis to an individual user's detailed hand shape using only a small set of depth images.
We present a systematic analysis of how to fuse conditional computation with representation learning and achieve a continuum of hybrid models with different ratios of accuracy vs. efficiency.
In this paper, we show how to perform model-based object tracking that allows us to reconstruct the object's depth at an order of magnitude higher frame-rate through simple modifications to an off-the-shelf depth camera.
In this paper, we show that we can significantly improve upon black box optimization by exploiting high-level knowledge of the structure of the parameters and using a local surrogate energy function.
Applying our method to a near state-of-the-art network for CIFAR, we achieved comparable accuracy with 46% less compute and 55% fewer parameters.
Recent advances in camera relocalization use predictions from a regression forest to guide the camera pose optimization procedure.
We represent the observed surface using Loop subdivision of a control mesh that is deformed by our learned parametric shape and pose model.
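One step of Loop subdivision can be sketched in a few dozen lines. The sketch below assumes a closed, watertight triangle mesh (every edge shared by exactly two triangles) and uses the standard Loop weights: new edge points are 3/8 of the edge endpoints plus 1/8 of the two opposite vertices, and original vertices are smoothed with a valence-dependent weight. It is a minimal illustration, not the paper's surface model; the octahedron example is hypothetical.

```python
import numpy as np

def loop_subdivide(verts, faces):
    """One level of Loop subdivision for a closed triangle mesh.

    verts: (V, 3) float array; faces: list of (i, j, k) index triples.
    Returns (new_verts, new_faces). Boundaries are not handled.
    """
    verts = np.asarray(verts, dtype=float)
    # Map each undirected edge to the opposite vertices of its two triangles.
    edge_opp = {}
    neighbors = {i: set() for i in range(len(verts))}
    for (a, b, c) in faces:
        for (u, v, w) in ((a, b, c), (b, c, a), (c, a, b)):
            edge_opp.setdefault(frozenset((u, v)), []).append(w)
            neighbors[u].update((v, w))
    # Odd (new edge) vertices: 3/8 * endpoints + 1/8 * opposite vertices.
    edge_index, new_verts = {}, list(verts)
    for edge, opp in edge_opp.items():
        u, v = tuple(edge)
        p = 0.375 * (verts[u] + verts[v]) + 0.125 * (verts[opp[0]] + verts[opp[1]])
        edge_index[edge] = len(new_verts)
        new_verts.append(p)
    # Even (original) vertices: smoothed with valence-dependent beta.
    for i in range(len(verts)):
        n = len(neighbors[i])
        beta = (1.0 / n) * (0.625 - (0.375 + 0.25 * np.cos(2 * np.pi / n)) ** 2)
        new_verts[i] = (1 - n * beta) * verts[i] + beta * sum(verts[j] for j in neighbors[i])
    # Each triangle splits into four.
    new_faces = []
    for (a, b, c) in faces:
        ab, bc, ca = (edge_index[frozenset(e)] for e in ((a, b), (b, c), (c, a)))
        new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
    return np.array(new_verts), new_faces

# Usage: one step on an octahedron: 6 verts / 8 faces -> 18 verts / 32 faces.
octa_v = np.array([[1,0,0], [-1,0,0], [0,1,0], [0,-1,0], [0,0,1], [0,0,-1]], float)
octa_f = [(0,2,4), (2,1,4), (1,3,4), (3,0,4), (2,0,5), (1,2,5), (3,1,5), (0,3,5)]
v2, f2 = loop_subdivide(octa_v, octa_f)
print(len(v2), len(f2))  # 18 32
```

Repeating the step converges toward the smooth limit surface that the control mesh represents.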
To spur further progress we introduce a challenging new dataset with diverse, cluttered scenes.
We focus on modeling the human hand, and assume that a single rough template model is available.
We formulate this problem as inversion of the generative rendering procedure, i.e., we want to find the camera pose corresponding to a rendering of the 3D scene model that is most similar to the observed input.
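This analysis-by-synthesis formulation can be illustrated with a toy example: render the scene under candidate poses and pick the pose whose rendering best matches the observation. Everything here is a deliberately simplified stand-in (a 2D rotation as the "pose", orthographic x-projection as the "renderer", grid search as the optimizer), not the actual inversion procedure.

```python
import numpy as np

def render(pose, points):
    """Toy 'renderer': rotate 2D scene points by `pose` (radians) and keep
    their x-coordinates, standing in for a full image rendering."""
    c, s = np.cos(pose), np.sin(pose)
    R = np.array([[c, -s], [s, c]])
    return (R @ points.T).T[:, 0]

def invert_rendering(observed, points, candidates):
    """Analysis-by-synthesis: choose the candidate pose whose rendering is
    most similar (least-squares) to the observed input."""
    errors = [np.sum((render(p, points) - observed) ** 2) for p in candidates]
    return candidates[int(np.argmin(errors))]

points = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]])
observed = render(0.6, points)                  # observation at true pose 0.6
candidates = np.linspace(0.0, np.pi / 2, 91)    # coarse pose grid
print(invert_rendering(observed, points, candidates))  # close to 0.6
```

In practice the search over a pose grid would be replaced by a proper optimizer over the full 6D camera pose, but the objective, similarity between the rendering and the observation, is the same.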
We propose 'filter forests' (FF), an efficient new discriminative approach for predicting continuous variables given a signal and its context.
Randomized decision trees and forests have a rich history in machine learning and have seen considerable success in application, perhaps particularly so for computer vision.
This paper presents a new and efficient forest based model that achieves spatially consistent semantic image segmentation by encoding variable dependencies directly in the feature space the forests operate on.
We address the problem of inferring the pose of an RGB-D camera relative to a known 3D scene, given only a single acquired image.