Estimating the 3D pose of a hand is an essential part of human-computer
interaction. Estimating 3D pose using depth or multi-view sensors has become
easier with recent advances in computer vision; however, regressing pose from a
single RGB image is much less straightforward. The main difficulty arises from
the fact that 3D pose requires some form of depth estimate, which is
ambiguous given only an RGB image. In this paper we propose a new method for 3D
hand pose estimation from a monocular image through a novel 2.5D pose
representation. Our new representation estimates pose up to a scaling factor,
which can additionally be estimated if a prior on the hand size is given. We
implicitly learn depth maps and heatmap distributions with a novel CNN
architecture. Our system achieves state-of-the-art 2D and 3D hand pose
estimation on several challenging datasets in the presence of severe occlusions.
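
To illustrate what recovering the scaling factor from a hand-size prior can look like, the following minimal sketch rescales an up-to-scale 3D pose so that one reference bone matches a known length. It is not the method proposed in this paper: the function name `recover_scale`, the choice of reference bone, and the toy joint coordinates are all assumptions made purely for illustration.

```python
import math


def recover_scale(pose_normalized, bone, bone_length_prior):
    """Rescale an up-to-scale 3D pose so a reference bone has a known length.

    pose_normalized:   list of (x, y, z) joints, correct only up to a global scale
    bone:              (i, j) joint indices of the reference bone (hypothetical choice)
    bone_length_prior: assumed true length of that bone, e.g. in centimetres
    """
    i, j = bone
    estimated = math.dist(pose_normalized[i], pose_normalized[j])
    if estimated == 0.0:
        raise ValueError("reference bone has zero length; scale is undefined")
    scale = bone_length_prior / estimated
    # A global scale multiplies every coordinate of every joint equally.
    scaled = [tuple(scale * c for c in joint) for joint in pose_normalized]
    return scale, scaled


# Toy three-joint pose (illustrative values, not data from the paper).
pose = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 1.5, 0.5)]
scale, pose_metric = recover_scale(pose, bone=(0, 1), bone_length_prior=9.0)
```

Here the reference bone has unit length in the normalized pose, so the recovered scale equals the prior, and every joint is multiplied by the same factor. Without such a prior the pose remains valid only up to that factor.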