Towards Good Practices for Deep 3D Hand Pose Estimation

23 Jul 2017 · Hengkai Guo, Guijin Wang, Xinghao Chen, Cairong Zhang

3D hand pose estimation from a single depth image is an important and challenging problem for human-computer interaction. Recently, deep convolutional networks (ConvNets) with sophisticated designs have been employed to address it, but their improvement over traditional random-forest-based methods is not so apparent. To identify good practices and improve performance for hand pose estimation, we propose a tree-structured Region Ensemble Network (REN) for direct 3D coordinate regression. It first partitions the last convolution outputs of the ConvNet into several grid regions. The results from separate fully-connected (FC) regressors on each region are then integrated by another FC layer to perform the estimation. By exploiting several training strategies, including data augmentation and smooth $L_1$ loss, the proposed REN significantly improves the performance of a ConvNet in localizing hand joints. The experimental results demonstrate that our approach achieves the best performance among state-of-the-art algorithms on three public hand pose datasets. We also evaluate our method on fingertip detection and human pose datasets and obtain state-of-the-art accuracy.
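The region-ensemble idea and the smooth $L_1$ loss described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the grid size, feature-map shape, hidden width, joint count, and the names `split_into_regions`, `region_ensemble_forward`, and `smooth_l1` are all assumptions chosen for illustration, and biases, dropout, and the convolutional backbone are omitted.

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth L1 loss: quadratic for |error| < 1, linear otherwise."""
    diff = np.abs(pred - target)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).mean()

def split_into_regions(feat, grid=2):
    """Split a (C, H, W) feature map into grid x grid non-overlapping regions."""
    _, h, w = feat.shape
    rh, rw = h // grid, w // grid
    return [feat[:, i * rh:(i + 1) * rh, j * rw:(j + 1) * rw]
            for i in range(grid) for j in range(grid)]

def region_ensemble_forward(feat, region_weights, fusion_weight, grid=2):
    """One FC branch per region, then a final FC layer fuses all branches."""
    regions = split_into_regions(feat, grid)
    # Each region is flattened and passed through its own FC layer with ReLU.
    branches = [np.maximum(r.reshape(-1) @ w, 0.0)
                for r, w in zip(regions, region_weights)]
    # Concatenate branch outputs and regress all 3D joint coordinates at once.
    return np.concatenate(branches) @ fusion_weight

# Toy sizes (assumed): 8 channels, 4x4 map, 2x2 grid, 16 joints.
rng = np.random.default_rng(0)
channels, size, grid, hidden, num_joints = 8, 4, 2, 32, 16
feat = rng.standard_normal((channels, size, size))
region_dim = channels * (size // grid) ** 2
region_weights = [0.1 * rng.standard_normal((region_dim, hidden))
                  for _ in range(grid * grid)]
fusion_weight = 0.1 * rng.standard_normal((grid * grid * hidden, 3 * num_joints))

pred = region_ensemble_forward(feat, region_weights, fusion_weight, grid)
loss = smooth_l1(pred, np.zeros_like(pred))  # dummy all-zero target
```

Here `pred` holds `3 * num_joints` values, one (x, y, z) triple per joint. In a real network the branch and fusion weights would be trained jointly with the backbone rather than drawn at random.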


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Hand Pose Estimation | ICVL Hands | Tree Region Ensemble Network | Average 3D Error | 7.31 | # 13 |
| Pose Estimation | ITOP front-view | REN | Mean mAP | 84.9 | # 5 |
| Pose Estimation | ITOP top-view | REN | Mean mAP | 75.5 | # 4 |
| Hand Pose Estimation | NYU Hands | REN | Average 3D Error | 15.6 | # 17 |
