3D hand reconstruction from images is a widely-studied problem in computer vision and graphics, and has a particularly high relevance for virtual and augmented reality.
We introduce the first dense neural non-rigid structure from motion (N-NRSfM) approach, which can be trained end-to-end in an unsupervised manner from 2D point tracks.
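For context, the classical low-rank formulation underlying NRSfM is the following (a minimal sketch in standard notation, not taken from this paper): under orthographic projection, the 2D tracks of N points over F frames are stacked into a measurement matrix and factorized as

```latex
W = R\,S, \qquad W \in \mathbb{R}^{2F \times N},
```

where R stacks the per-frame camera matrices and the time-varying shapes in S are commonly restricted to a low-rank basis, S_f = sum_k c_{fk} B_k. The dense, end-to-end approach above learns such a model from point tracks rather than solving a factorization per sequence.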
To address the first issue, we perform volume rendering only to produce a low-resolution feature map and then progressively upsample it in 2D.
We evaluate GraviCap on a new dataset with ground-truth annotations for persons and different objects undergoing free flights.
Experiments demonstrate that NeuRay can quickly generate high-quality novel view images of unseen scenes with little finetuning and can handle complex scenes with severe self-occlusions which previous methods struggle with.
We demonstrate that it suffices to train our temporal architecture for about 6 hours on only 10 minutes of footage of a single subject.
In this work, we address such problems with emerging quantum computing technology and propose several reformulations of QAPs as unconstrained problems suitable for efficient execution on quantum hardware.
Even holding a mobile phone camera in front of the face while sitting for a long duration is not convenient.
Outdoor scene relighting is a challenging problem that requires a good understanding of the scene geometry, illumination and albedo.
Our texture term exploits the orientation information in the micro-structures of the objects, e.g., the yarn patterns of fabrics.
To address the limitations of the existing methods, we develop HandVoxNet++, i.e., a voxel-based deep network with 3D and graph convolutions trained in a fully supervised manner.
Moreover, we demonstrate that our approach offers previously unseen two-hand tracking performance from RGB, and quantitatively and qualitatively outperforms existing RGB-based methods that were not explicitly designed for two-hand interactions.
The problem of simultaneous rigid alignment of multiple unordered point sets, unbiased towards any of the inputs, has recently attracted increasing interest, and several reliable methods have recently been proposed.
In NeuS, we propose to represent a surface as the zero-level set of a signed distance function (SDF) and develop a new volume rendering method to train a neural SDF representation.
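As a sketch of how an SDF can drive volume rendering (following the formulation commonly associated with NeuS; details such as normalization may differ from the paper): with f the learned SDF and Phi_s the logistic sigmoid of slope s, the discrete opacity of the i-th ray sample is

```latex
\alpha_i = \max\!\left( \frac{\Phi_s\big(f(\mathbf{p}_i)\big) - \Phi_s\big(f(\mathbf{p}_{i+1})\big)}{\Phi_s\big(f(\mathbf{p}_i)\big)},\; 0 \right),
```

so opacity concentrates where the SDF crosses zero, and standard alpha compositing of the sampled colors yields the pixel value.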
We present a novel method for real-time pose and shape reconstruction of two strongly interacting hands.
To address this problem, we utilize a coarse body model as the proxy to unwarp the surrounding 3D space into a canonical pose.
Finding shape correspondences can be formulated as an NP-hard quadratic assignment problem (QAP) that becomes infeasible for shapes with high sampling density.
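For reference, one standard statement of this QAP (Koopmans-Beckmann form; A and B are assumed to encode intra-shape distances or affinities):

```latex
\min_{P \in \mathcal{P}_n} \; \big\| A - P B P^\top \big\|_F^2 ,
```

where A, B are n x n matrices describing the two shapes and P ranges over the n x n permutation matrices. The feasible set grows as n!, which is what makes densely sampled shapes infeasible.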
We propose a deep videorealistic 3D human character model displaying highly realistic shape, motion, and dynamic appearance learned in a new weakly supervised way from multi-view imagery.
We present a new trainable system for physically plausible markerless 3D human motion capture, which achieves state-of-the-art results in a broad range of challenging scenarios.
This paper introduces the first differentiable simulator of event streams, i.e., streams of asynchronous brightness change signals recorded by event cameras.
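For intuition, here is a minimal sketch of the standard (hard-threshold, hence non-differentiable) event-generation model that such a simulator starts from; the paper's differentiable treatment of the threshold is not shown, and the function names and threshold value below are illustrative:

```python
import numpy as np

def generate_events(frames, timestamps, C=0.2, eps=1e-6):
    """A pixel emits an event when its log-brightness deviates from a
    per-pixel reference by the contrast threshold C. For brevity, this
    sketch emits at most one event per pixel per frame."""
    log_ref = np.log(frames[0] + eps)        # per-pixel reference log-brightness
    events = []                              # tuples (t, x, y, polarity)
    for frame, t in zip(frames[1:], timestamps[1:]):
        diff = np.log(frame + eps) - log_ref
        ys, xs = np.nonzero(np.abs(diff) >= C)
        for x, y in zip(xs, ys):
            polarity = 1 if diff[y, x] > 0 else -1
            events.append((t, x, y, polarity))
            log_ref[y, x] += polarity * C    # advance reference by one threshold step
    return events
```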
Furthermore, these methods suffer from limited accuracy and temporal instability due to ambiguities caused by the monocular setup and the severe occlusion in a strongly distorted egocentric perspective.
In this paper, we build our work on the aforementioned approaches and propose a new method that greatly improves reconstruction quality and robustness in general scenes.
We present a novel method for single image depth estimation using surface normal constraints.
Our model can generate plausible pose sequences for short sentences describing single actions as well as long compositional sentences describing multiple sequential and superimposed actions.
We present an approach for high-quality intuitive editing of the camera viewpoint and scene illumination in a portrait image.
We address these limitations and present a generative model for images of dressed humans offering control over pose, local body part appearance and garment style.
Photo-realistic re-rendering of a human from a single image with explicit control over body pose, shape and appearance enables a wide range of applications, such as human appearance transfer, virtual try-on, motion imitation, and novel view synthesis.
We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures, as well as 3D face and head animations, of a virtual character from speech input.
Human re-rendering from a single image is a starkly under-constrained problem, and state-of-the-art algorithms often exhibit undesired artefacts, such as over-smoothing, unrealistic distortions of the body parts and garments, or implausible changes of the texture.
We next combine the target pose image and the textures into a combined feature image, which is transformed into the output color image using a neural image translation network.
We show that a single handheld consumer-grade camera is sufficient to synthesize sophisticated renderings of a dynamic scene from novel virtual camera views, e.g., a 'bullet-time' video effect.
We address these limitations for the first time in the literature and present a new framework which performs high-fidelity and temporally-consistent human motion transfer with natural pose-dependent non-rigid deformations, for several types of loose garments.
Due to the different data modality of event cameras compared to classical cameras, existing methods cannot be directly applied to and re-trained for event streams.
We present the first method for real-time full body capture that estimates shape and motion of body and hands together with a dynamic 3D face model from a single color image.
We present a new pose transfer method for synthesizing a human animation from a single image of a person controlled by a sequence of body poses.
Our approach has the following favorable properties: (i) It is the first full head morphable model that includes hair.
We present a novel method for multi-view depth estimation from a single video, which is a critical task in various applications, such as perception, reconstruction and robot navigation.
Recent monocular human performance capture approaches have shown compelling dense tracking results of the full body from a single RGB camera.
Formulating this closed loop is not straightforward, as it is non-trivial to force the output of the NN to lie on the surface of the human model; outside this surface, the human model is not even defined.
Our network design and loss functions ensure a disentangled parameterization of not only identity and albedo, but also, for the first time, an expression basis.
This article introduces a new physics-based method for rigid point set alignment called Fast Gravitational Approach (FGA).
We present the first approach for embedding real portrait images in the latent space of StyleGAN, which allows for intuitive editing of the head pose, facial expression, and scene illumination in the image.
The reflectance field of a face describes the reflectance properties responsible for complex lighting effects, including diffuse, specular, inter-reflection and self-shadowing.
We therefore present PhysCap, the first algorithm for physically plausible, real-time and marker-less human 3D motion capture with a single colour camera at 25 fps.
At the level of patches, objects across different categories share similarities, which leads to more generalizable models.
Our goal is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion.
In this work, we present methodology that combines detail-rich implicit functions and parametric representations in order to reconstruct 3D models of people that remain controllable and accurate even in the presence of clothing.
We propose to use a model-based generative loss for training hand pose estimators on depth images based on a volumetric hand model.
We introduce a new benchmark dataset for face video forgery detection, of unprecedented quality.
We propose the first approach that simultaneously estimates camera motion and reconstructs the geometry of complex 3D thin structures in high quality from a color video captured by a handheld camera.
Neural rendering is a new and rapidly emerging field that combines generative machine learning techniques with physical knowledge from computer graphics, e.g., by integrating differentiable rendering into network training.
The input to our method is a 3D voxelized depth map, and we rely on two hand shape representations.
We present a new learning-based method for multi-frame depth estimation from a color video, which is a fundamental problem in scene understanding, robot navigation or handheld 3D reconstruction.
StyleGAN generates photorealistic portrait images of faces with eyes, teeth, hair and context (neck, shoulders, background), but lacks a rig-like control over semantic face parameters that are interpretable in 3D, such as face pose, expressions, and scene illumination.
We present a novel method for monocular hand shape and pose estimation at an unprecedented runtime of 100 fps and at state-of-the-art accuracy.
Human performance capture is a highly important computer vision problem with many applications in movie production and virtual/augmented reality.
In this paper, we propose a novel human video synthesis method that approaches these limiting factors by explicitly disentangling the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space.
Based on this 3D proxy, the appearance of a captured view can be warped into a new target view as in classical image-based rendering.
Neural Voice Puppetry has a variety of use-cases, including audio-driven video avatars, video dubbing, and text-driven video synthesis of a talking head.
Applying data-driven approaches to non-rigid 3D reconstruction has been difficult, which we believe can be attributed to the lack of a large-scale training corpus.
We consider the problem of inverse kinematics (IK), where one wants to find the parameters of a given kinematic skeleton that best explain a set of observed 3D joint locations.
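As a toy illustration of this least-squares view of IK (a hypothetical two-joint planar chain, not the skeleton model or solver used in the paper):

```python
import numpy as np
from scipy.optimize import least_squares

BONE_LENGTHS = np.array([1.0, 0.8])  # illustrative bone lengths

def forward_kinematics(angles):
    """2D positions of both joints of a planar two-bone chain."""
    a1, a2 = angles
    p1 = BONE_LENGTHS[0] * np.array([np.cos(a1), np.sin(a1)])
    p2 = p1 + BONE_LENGTHS[1] * np.array([np.cos(a1 + a2), np.sin(a1 + a2)])
    return np.concatenate([p1, p2])

def residuals(angles, observed):
    # IK as least squares: FK output should explain the observed joints.
    return forward_kinematics(angles) - observed

observed = np.array([0.7, 0.7, 1.4, 1.0])  # noisy observed joint positions
result = least_squares(residuals, x0=np.zeros(2), args=(observed,))
print("estimated joint angles:", result.x)
```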
We present a style-preserving visual dubbing approach from single video inputs, which maintains the signature style of target actors when modifying facial expressions, including mouth motions, to match foreign languages.
The reasons for the slow dissemination are the severe ill-posedness, the high sensitivity to motion and deformation cues, and the difficulty of obtaining reliable point tracks in the vast majority of practical scenarios.
In this paper, we provide a detailed survey of 3D Morphable Face Models over the 20 years since they were first proposed.
We present Multi-Garment Network (MGN), a method to predict body shape and clothing, layered on top of the SMPL model from a few frames (1-8) of a video.
We propose the first approach for the decomposition of a monocular color video into direct and indirect illumination components in real time.
We introduce a supervised-learning framework for non-rigid point set alignment of a new kind, Displacements on Voxels Networks (DispVoxNets), which abstracts away from the point set representation and regresses 3D displacement fields on regularly sampled proxy 3D voxel grids.
The first stage is a convolutional neural network (CNN) that estimates 2D and 3D pose features along with identity assignments for all visible joints of all individuals. We contribute a new architecture for this CNN, called SelecSLS Net, that uses novel selective long and short range skip connections to improve the information flow allowing for a drastically faster network without compromising accuracy.
To edit a video, the user has to only edit the transcript, and an optimization strategy then chooses segments of the input corpus as base material.
Our lightweight setup allows operations in uncontrolled environments, and lends itself to telepresence applications such as video-conferencing from dynamic environments.
Mesh autoencoders are commonly used for dimensionality reduction, sampling and mesh modeling.
We show that implicit filter-level sparsity manifests in convolutional neural networks (CNNs) that employ Batch Normalization and ReLU activations and are trained with adaptive gradient descent techniques and L2 regularization or weight decay.
The majority of the existing methods for non-rigid 3D surface regression from monocular 2D images require an object template or point tracks over multiple frames as an input, and are still far from real-time processing rates.
From a partial texture, we estimate detailed normal and vector displacement maps, which can be applied to a low-resolution smooth body model to add detail and clothing.
Convolutional Neural Network based approaches for monocular 3D human pose estimation usually require a large amount of training images with 3D pose annotations.
We present a learning-based model to infer the personalized 3D shape of people from a few frames (1-8) of a monocular video in which the person is moving, in less than 10 seconds with a reconstruction accuracy of 5mm.
We propose a novel approach to jointly perform 3D shape retrieval and pose estimation from monocular images. In order to make the method robust to real-world image variations, e.g., complex textures and backgrounds, we learn an embedding space from 3D data that only includes the relevant information, namely the shape and pose.
In contrast, we propose multi-frame video-based self-supervised training of a deep network that (i) learns a face identity model both in shape and appearance while (ii) jointly learning to reconstruct 3D faces.
We investigate the filter-level sparsity that emerges in convolutional neural networks (CNNs) that employ Batch Normalization and ReLU activations and are trained with adaptive gradient descent techniques and L2 regularization or weight decay.
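One quick way to observe this effect (an illustrative sketch; the model and threshold are stand-ins, not the paper's measurement protocol, and the effect is reported for networks trained with adaptive optimizers) is to count Batch Normalization scale parameters with near-zero magnitude, since a near-zero scale effectively switches off its filter:

```python
import torch
import torchvision

# Stand-in model; apply this to a network trained with Adam + L2
# to reproduce the effect described above.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
threshold = 1e-2  # illustrative near-zero cutoff

total, near_zero = 0, 0
for module in model.modules():
    if isinstance(module, torch.nn.BatchNorm2d):
        gamma = module.weight.detach().abs()  # BN scale parameters
        total += gamma.numel()
        near_zero += int((gamma < threshold).sum())

print(f"near-zero BN scales: {near_zero}/{total}")
```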
In contrast to conventional human character rendering, we do not require the availability of a production-quality photo-realistic 3D model of the human, but instead rely on a video sequence in conjunction with a (medium-quality) controllable 3D template model of the person.
In order to enable source-to-target video re-animation, we render a synthetic target video with the reconstructed head animation parameters from a source video, and feed it into the trained network, thus taking full control of the target.
We propose HeadOn, the first real-time source-to-target reenactment approach for complete human portrait videos that enables transfer of torso and head motion, face expression, and eye gaze.
As more and more personal photos are shared and tagged in social media, avoiding privacy risks such as unintended recognition becomes increasingly challenging.
In order to deal with the inherent non-convexity of the permutation synchronisation problem, we use an initialisation procedure based on a novel rotation scheme applied to the solution of the spectral relaxation.
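For context, the synchronisation problem assumed here, in its standard form: given noisy pairwise permutations P_ij between m point sets of size n, recover absolute permutations P_1, ..., P_m such that

```latex
P_{ij} \approx P_i P_j^\top \quad \text{for all } i, j ,
```

i.e., cycle consistency holds. The spectral relaxation stacks the P_ij into an mn x mn block matrix, takes its top-n eigenvectors, and projects each resulting n x n block (after gauge fixing, e.g., against the first block) onto the nearest permutation matrix.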
We tackle these challenges based on a novel lightweight setup that converts a standard baseball cap to a device for high-quality pose estimation based on a single cap-mounted fisheye camera.
This paper describes how to obtain accurate 3D body models and texture of arbitrary people from a single, monocular video in which a person is moving.
We present the first end-to-end approach for real-time material estimation for general object shapes with uniform material that only requires a single color image as input.
Our approach uses novel occlusion-robust pose-maps (ORPM) which enable full body pose inference even under strong partial occlusions by other people and objects in the scene.
To alleviate this problem, we present the first approach that jointly learns 1) a regressor for face shape, expression, reflectance and illumination on the basis of 2) a concurrently learned parametric face model.
We address the highly challenging problem of real-time 3D hand tracking based on a monocular RGB-only sequence.
We propose an automatic method for generating high-quality annotations for depth-based hand segmentation, and introduce a large-scale hand segmentation dataset.
Reconstruction from monocular video alone is drastically more challenging, since strong occlusions and the inherent depth ambiguity lead to a highly ill-posed reconstruction problem.
Large databases are often organized by hand-labeled metadata, or criteria, which are expensive to collect.
A real-time kinematic skeleton fitting method uses the CNN output to yield temporally stable 3D global pose reconstructions on the basis of a coherent kinematic skeleton.
We present an approach for real-time, robust and accurate hand pose estimation from moving egocentric RGB-D cameras in cluttered real environments.
In contrast, we propose to recover high-quality facial pose, shape, expression, reflectance and illumination using a deep neural network that is trained using a large, synthetically created training corpus.
In this work we propose a novel model-based deep convolutional autoencoder that addresses the highly challenging problem of reconstructing a 3D human face from a single in-the-wild color image.
Marker-based and marker-less optical skeletal motion-capture methods use an outside-in arrangement of cameras placed around a scene, with viewpoints converging on the center.
We propose a CNN-based approach for 3D human body pose estimation from single RGB images that addresses the issue of limited generalizability of models trained solely on the starkly limited publicly available 3D pose data.
Our extensive experiments and evaluations show that our approach produces high-quality dense reconstructions of 3D geometry and scene flow at real-time frame rates, and compares favorably to the state of the art.
We propose a new model-based method to accurately reconstruct human performances captured outdoors in a multi-camera setup.
However, due to difficult occlusions, fast motions, and uniform hand appearance, jointly tracking hand and object pose is more challenging than tracking either of the two separately.
Many compelling video post-processing effects, in particular aesthetic focus editing and refocusing effects, are feasible if per-frame depth information is available.
Based on reenactment of a prerecorded stereo video of the person without the HMD, FaceVR incorporates photo-realistic re-rendering in real time, thus allowing artificial modifications of face and eye appearances.
We therefore propose a new method for real-time, marker-less and egocentric motion capture which estimates the full-body skeleton pose from a lightweight stereo pair of fisheye cameras that are attached to a helmet or virtual reality headset.
Our method uses a new image formation model with analytic visibility and analytically differentiable alignment energy.
Many graphics and vision problems can be expressed as non-linear least squares optimizations of objective functions over visual data, such as images and meshes.
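In symbols (generic form, not specific to any one problem): such objectives read

```latex
\min_{\mathbf{x}} \; F(\mathbf{x}) = \sum_i r_i(\mathbf{x})^2 ,
```

and a Gauss-Newton solver iterates x <- x - (J^T J)^{-1} J^T r, where r stacks the residuals r_i and J is the Jacobian of r.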
Our approach estimates globally optimized (i.e., bundle-adjusted) poses in real time, supports robust tracking with recovery from gross tracking failures (i.e., relocalization), and re-estimates the 3D model in real time to ensure global consistency; all within a single framework.
We cast finding the optimal deformation of space as a non-linear regularized variational optimization problem by enforcing local smoothness and proximity to the input constraints.
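A generic instance of such an energy (notation assumed here; the paper's exact terms may differ): for a space deformation T and point constraints (p_k, q_k),

```latex
E(\mathcal{T}) = \sum_k \big\| \mathcal{T}(\mathbf{p}_k) - \mathbf{q}_k \big\|^2 \; + \; \lambda \int_\Omega \big\| \nabla \mathcal{T}(\mathbf{x}) - \mathbf{I} \big\|_F^2 \, d\mathbf{x} ,
```

where the first term enforces proximity to the input constraints and the second keeps the deformation locally close to the identity, i.e., smooth.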
Existing approaches for diffusion on graphs, e.g., for label propagation, are mainly focused on isotropic diffusion, which is induced by the commonly used graph Laplacian regularizer.
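Concretely, with the graph Laplacian L = D - A, isotropic diffusion/label propagation can be written as implicit diffusion steps (standard notation, not specific to the paper):

```latex
\mathbf{u}^{t+1} = (\mathbf{I} + \tau L)^{-1}\, \mathbf{u}^{t} ,
```

which spreads label mass equally in all graph directions; anisotropic variants replace L with an operator that weights directions unequally.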
In the optimization step, a novel objective function combines the detected part labels and a Gaussian mixture representation of the depth to estimate a pose that best fits the depth.
In this paper, we propose a new approach that tracks the full skeleton motion of the hand from multiple RGB cameras in real-time.
In many learning tasks, the structure of the target space of a function holds rich information about the relationships between evaluations of functions on different data points.
Generative reconstruction methods compute the 3D configuration (such as pose and/or geometry) of a shape by optimizing the overlap of the projected 3D shape model with images.
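One concrete instance of such an overlap objective (illustrative; the exact functional varies between methods): with S(theta) the silhouette of the 3D model rendered under configuration theta and S_I the observed image silhouette,

```latex
\max_{\theta} \; \frac{\lvert S(\theta) \cap S_I \rvert}{\lvert S(\theta) \cup S_I \rvert} ,
```

i.e., the configuration is chosen to maximize the intersection-over-union of the projected model with the image evidence.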
The iterated graph Laplacian enables high-order regularization, but it has a high computational complexity and so cannot be applied to large problems.
We propose an image-based, facial reenactment system that replaces the face of an actor in an existing target video with the face of a user from a source video, while preserving the original target performance.
We present a new effective way for performance capture of deforming meshes with fine-scale time-varying surface detail from multi-view video.
Our approach unites a discriminative image-based joint detection method with a model-based generative motion tracking algorithm through a combined pose optimization energy.
Statistical models of 3D human shape and pose learned from scan databases have developed into valuable tools to solve a variety of vision and graphics problems.