While a few such approaches exist, they have limited generalization capabilities and are prone to learning spurious (chance) correlations between irrelevant body parts, resulting in implausible deformations and missing body parts on unseen poses.
While the quality of this pseudo-ground-truth is challenging to assess due to the lack of actual ground-truth SMPL, on the Human3.6M dataset we qualitatively show that our joint locations are more accurate and that our regressor leads to improved pose estimation results on the test set without any need for retraining.
To that end, we propose to learn exercise-specific representations from unlabeled samples such that a small dataset annotated by experts suffices for supervised error detection.
Injury analysis may be one of the most beneficial applications of deep-learning-based human pose estimation.
To this end, we propose AdaptPose, an end-to-end framework that generates synthetic 3D human motions from a source dataset and uses them to fine-tune a 3D pose estimator.
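As a rough sketch of what such synthetic-motion fine-tuning can look like (all names below, including the motion generator, the projection function and the loss, are placeholders rather than AdaptPose's actual components):

```python
import torch.nn.functional as F

def finetune_on_synthetic_motions(estimator, motion_generator, project_to_2d,
                                  source_motions, optimizer):
    """Placeholder loop: synthesize 3D motions from source-domain data,
    project them to 2D, and use the pairs to supervise a 2D-to-3D estimator."""
    estimator.train()
    for poses_3d in source_motions:              # batches of source 3D motions
        synth_3d = motion_generator(poses_3d)    # synthesized 3D motions
        synth_2d = project_to_2d(synth_3d)       # camera projection (assumed given)
        loss = F.mse_loss(estimator(synth_2d), synth_3d)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```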
Human pose estimation from single images is a challenging problem that is typically solved by supervised learning.
Segmenting an image into its parts is a frequent preprocessing step for high-level vision tasks such as image editing.
Our texture term exploits the orientation information in the micro-structures of the objects, e.g., the yarn patterns of fabrics.
Estimating 3D human poses from video is a challenging problem.
Generative adversarial networks (GANs) have attained photo-realistic quality in image generation.
We propose a method to learn a generative neural body model from unlabelled monocular videos by extending Neural Radiance Fields (NeRFs).
A long-standing goal in the field of sensory substitution is enabling sound perception for deaf people by visualizing audio content.
Self-supervised detection and segmentation of foreground objects aims for accuracy without annotated training data.
In this paper we propose an unsupervised learning method to extract temporal information from monocular videos, in which we detect and encode the subject of interest in each frame and leverage contrastive self-supervised (CSS) learning to extract rich latent vectors.
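As an illustration of the contrastive component (a sketch under the assumption of an InfoNCE-style objective; the actual loss, sampling strategy and hyper-parameters in the paper may differ), latent vectors of temporally neighbouring crops of the same subject are pulled together while the other samples in the batch serve as negatives:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_anchor, z_positive, temperature=0.1):
    """Contrastive InfoNCE loss: row i of z_positive is the embedding of a
    temporally close (or augmented) crop of the same subject as row i of
    z_anchor; all other rows act as negatives."""
    z_anchor = F.normalize(z_anchor, dim=1)
    z_positive = F.normalize(z_positive, dim=1)
    logits = z_anchor @ z_positive.t() / temperature        # (B, B) similarities
    targets = torch.arange(z_anchor.size(0), device=z_anchor.device)
    return F.cross_entropy(logits, targets)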
Human pose estimation from single images is a challenging problem in computer vision that requires large amounts of labeled training data to be solved accurately.
Our conclusion is that it is important to utilize camera calibration information when available, for classical and deep-learning-based computer vision alike.
While supervised object detection and segmentation methods achieve impressive accuracy, they generalize poorly to images whose appearance significantly differs from the data they have been trained on.
Specific to the lumber application, we also propose an algorithm to correct any misalignment in the raw timber images during scanning, and contribute the first open-source lumber knot dataset by labeling the elliptical knots in the preprocessed images.
We compare our approach with existing domain transfer methods and demonstrate improved pose estimation accuracy on Drosophila melanogaster (fruit fly), Caenorhabditis elegans (worm) and Danio rerio (zebrafish), without requiring any manual annotation on the target domain and despite using simplistic off-the-shelf animal characters for simulation, or simple geometric shapes as models.
Reconstruction of a 3D shape from a single 2D image is a classical computer vision problem, whose difficulty stems from the inherent ambiguity of recovering occluded or only partially observed surfaces.
The accuracy of monocular 3D human pose estimation depends on the viewpoint from which the image is captured.
We show theoretically and empirically that a simple motion trajectory analysis suffices to translate from pixel measurements to the person's metric height, reaching an MAE of up to 3.9 cm on jumping motions, and that this works without camera and ground-plane calibration.
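One way to picture the idea (a simplified sketch with hypothetical variable names, not necessarily the paper's exact procedure): during the airborne phase of a jump the body follows a ballistic trajectory, so the curvature of the vertical pixel trajectory reveals gravity in pixel units and thereby the pixels-to-metres scale:

```python
import numpy as np

G = 9.81  # gravitational acceleration in m/s^2

def height_from_jump(t, y_px, person_height_px):
    """t: timestamps (s) of airborne frames; y_px: vertical pixel coordinate
    of a tracked body point; person_height_px: the person's height in pixels."""
    # Fit y_px(t) = a*t^2 + b*t + c; in free flight |a| = 0.5 * g_px,
    # where g_px is gravity expressed in pixels per second squared.
    a, _, _ = np.polyfit(t, y_px, 2)
    g_px = 2.0 * abs(a)
    metres_per_pixel = G / g_px          # the curvature fixes the metric scale
    return person_height_px * metres_per_pixel
```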
We propose a method for estimating an athlete's global 3D position and articulated pose using multiple cameras.
While supervised object detection methods achieve impressive accuracy, they generalize poorly to images whose appearance significantly differs from the data they have been trained on.
The first stage is a convolutional neural network (CNN) that estimates 2D and 3D pose features along with identity assignments for all visible joints of all individuals. We contribute a new architecture for this CNN, called SelecSLS Net, that uses novel selective long- and short-range skip connections to improve the information flow, allowing for a drastically faster network without compromising accuracy.
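The definition below is only a toy illustration of mixing short-range and long-range skips within one block (layer sizes and structure are arbitrary choices of ours, not the published SelecSLS Net architecture):

```python
import torch
import torch.nn as nn

class SelectiveSkipBlock(nn.Module):
    """Toy block fusing new features with a short-range skip (previous conv
    output) and a long-range skip (features from earlier in the network)."""
    def __init__(self, in_ch, long_skip_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.fuse = nn.Conv2d(out_ch * 2 + long_skip_ch, out_ch, 1)

    def forward(self, x, long_skip):
        a = torch.relu(self.conv1(x))      # new features (short-range skip source)
        b = torch.relu(self.conv2(a))
        return torch.relu(self.fuse(torch.cat([a, b, long_skip], dim=1)))
```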
To this end, we introduce a self-supervised approach to learning what we call a neural scene decomposition (NSD) that can be exploited for 3D pose estimation.
Recovering a person's height from a single image is important for virtual garment fitting, autonomous driving and surveillance; however, it is also very challenging due to the absence of absolute scale information.
In this paper, we propose to overcome this problem by learning a geometry-aware body representation from multi-view images without annotations.
We tackle these challenges based on a novel lightweight setup that converts a standard baseball cap to a device for high-quality pose estimation based on a single cap-mounted fisheye camera.
Accurate 3D human pose estimation from single images is possible with sophisticated deep-net architectures that have been trained on very large datasets.
Reconstruction from monocular video alone is drastically more challenging, since strong occlusions and the inherent depth ambiguity lead to a highly ill-posed reconstruction problem.
A real-time kinematic skeleton fitting method uses the CNN output to yield temporally stable 3D global pose reconstructions on the basis of a coherent kinematic skeleton.
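A highly simplified version of such a fitting step might look as follows (the real system optimizes a richer energy, e.g. with 2D reprojection and joint-limit terms; the forward-kinematics function and weights here are assumptions):

```python
import numpy as np
from scipy.optimize import least_squares

def fit_skeleton(theta_prev, joints_3d_pred, forward_kinematics, smooth_weight=0.1):
    """theta_prev: previous-frame skeleton parameters (root pose + joint angles);
    joints_3d_pred: (J, 3) CNN-predicted joint positions;
    forward_kinematics: maps parameters to (J, 3) joint positions."""
    def residuals(theta):
        data_term = (forward_kinematics(theta) - joints_3d_pred).ravel()
        smooth_term = smooth_weight * (theta - theta_prev)   # temporal stability
        return np.concatenate([data_term, smooth_term])
    return least_squares(residuals, x0=theta_prev).x
```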
Marker-based and marker-less optical skeletal motion-capture methods use an outside-in arrangement of cameras placed around a scene, with viewpoints converging on the center.
We propose a CNN-based approach for 3D human body pose estimation from single RGB images that addresses the issue of limited generalizability of models trained solely on the starkly limited publicly available 3D pose data.
We propose a new model-based method to accurately reconstruct human performances captured outdoors in a multi-camera setup.
We therefore propose a new method for real-time, marker-less and egocentric motion capture which estimates the full-body skeleton pose from a lightweight stereo pair of fisheye cameras that are attached to a helmet or virtual reality headset.
Our method uses a new image formation model with analytic visibility and analytically differentiable alignment energy.
Generative reconstruction methods compute the 3D configuration (such as pose and/or geometry) of a shape by optimizing the overlap of the projected 3D shape model with images.
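In its simplest form (notation ours, not necessarily that of the paper), such an objective over the pose/shape parameters compares the rendered model silhouette with the observed one,

E(\theta) = \sum_{p} \bigl( S_{\mathrm{model}}(p;\theta) - S_{\mathrm{image}}(p) \bigr)^2, \qquad \hat{\theta} = \arg\min_{\theta} E(\theta),

where $S_{\mathrm{model}}(\cdot;\theta)$ is the silhouette of the projected 3D model and $S_{\mathrm{image}}$ the observed segmentation mask; making the rendering step differentiable in $\theta$ is what enables gradient-based optimization.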
In this paper, we propose a new approach that tracks the full skeleton motion of the hand from multiple RGB cameras in real-time.