As a fundamental and challenging problem in computer vision, hand pose
estimation aims to estimate the hand joint locations from depth images.
Typically, the problem is modeled as learning a mapping function from images to
hand joint coordinates in a data-driven manner. In this paper, we propose the
Context-Aware Deep Spatio-Temporal Network (CADSTN), a novel method to jointly
model the spatio-temporal properties for hand pose estimation. Our proposed
network learns representations of the spatial information and the temporal
structure from image sequences. Moreover, by adopting an adaptive
fusion method, the model is capable of dynamically weighting different
predictions to emphasize sufficient context. We evaluate our method on
two common benchmarks; the experimental results demonstrate that our proposed
approach achieves the best or second-best performance compared with
state-of-the-art methods and runs at 60 fps.
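
The idea of dynamically weighting different predictions can be illustrated with a minimal sketch. It assumes each branch produces a joint-coordinate prediction together with a scalar score, and that the scores are normalized with a softmax before the predictions are averaged. The softmax choice, the function names `softmax` and `fuse_predictions`, and the hand-supplied scores are illustrative assumptions, not details taken from the network described above, where such weights would be produced by the model itself.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_predictions(predictions, scores):
    """Fuse K predictions into one by a score-dependent weighted average.

    predictions: list of K predictions, each a flat list of joint coordinates
                 (hypothetical layout: x, y, z per joint, concatenated).
    scores:      list of K raw scores, one per prediction (assumed to be given;
                 in a learned model they would depend on the input).
    Returns the fused coordinates and the weights that were used.
    """
    if len(predictions) != len(scores):
        raise ValueError("need exactly one score per prediction")
    weights = softmax(scores)
    n_coords = len(predictions[0])
    fused = [
        sum(w * pred[i] for w, pred in zip(weights, predictions))
        for i in range(n_coords)
    ]
    return fused, weights


if __name__ == "__main__":
    # Two toy predictions for a single joint (x, y, z).
    spatial_pred = [10.0, 20.0, 300.0]
    temporal_pred = [12.0, 18.0, 310.0]
    fused, weights = fuse_predictions([spatial_pred, temporal_pred], [0.0, 0.0])
    print(weights, fused)
```

With equal scores the fusion reduces to a plain average; raising one score shifts the fused result toward that branch's prediction, which is the sense in which the weighting adapts.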