We use a parametric 3D deformable human mesh model (SMPL-X) as the representation and focus on real-time estimation of body pose, hand pose, and facial expression parameters from an Azure Kinect RGB-D camera.
To quantitatively evaluate performance on transitions and generalization to longer time horizons, we present well-defined in-betweening benchmarks on a subset of the widely used Human3.6M dataset and on LaFAN1, a novel high-quality motion capture dataset that is better suited to transition generation.
In this paper, we apply multiscale area attention in a deep convolutional neural network to attend to emotional characteristics at varied granularities, so that the classifier can benefit from an ensemble of attentions at different scales.
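The core idea of area attention is that keys are formed not only from single positions but from pooled contiguous spans ("areas") of several sizes, and attention is computed over all of them at once. A minimal NumPy sketch of this mechanism, assuming mean pooling over 1D frame features; the function names and the use of pooled areas as both keys and values are illustrative simplifications, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def area_keys(feat, max_area):
    """Mean-pool every contiguous span of 1..max_area frames into a single key."""
    T, _ = feat.shape
    keys = []
    for size in range(1, max_area + 1):
        for start in range(T - size + 1):
            keys.append(feat[start:start + size].mean(axis=0))
    return np.stack(keys)  # (num_areas, d)

def multiscale_area_attention(query, feat, max_area=3):
    """Attend over areas of multiple granularities with one softmax."""
    keys = area_keys(feat, max_area)
    scores = keys @ query / np.sqrt(len(query))
    weights = softmax(scores)
    return weights @ keys  # summary mixing coarse and fine spans
```

Because fine (single-frame) and coarse (multi-frame) areas compete within the same softmax, the model can pick the granularity that best matches each emotional cue.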
In this paper, we propose, for the first time, to learn system dynamics from irregularly-sampled partial observations with an underlying graph structure.
Physical-world experiments show how the proposed method can be applied to a wide range of robotic applications that require visual feedback, such as camera-to-robot calibration, robotic tool tracking, and whole-arm pose estimation.
We argue that diverse temporal scales are important because they allow the model to look at past frames with different receptive fields, which can lead to better predictions.
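Looking at the past with several receptive fields can be illustrated by filtering the same sequence with causal kernels of different lengths and stacking the results as parallel feature channels. A minimal NumPy sketch, assuming moving-average kernels; the function names and window sizes are illustrative, not the paper's architecture:

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1D convolution: the output at time t depends only on frames <= t."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])  # left-pad so no future leaks in
    return np.array([padded[t:t + k] @ kernel[::-1] for t in range(len(x))])

def multiscale_features(x, windows=(1, 2, 4, 8)):
    """Stack causal moving averages at several temporal scales (receptive fields)."""
    feats = []
    for w in windows:
        kernel = np.full(w, 1.0 / w)
        feats.append(causal_conv1d(x, kernel))
    return np.stack(feats, axis=-1)  # (T, num_scales)
```

Short windows preserve fast motion detail while long windows summarize slower trends, and a predictor consuming the stacked channels sees both at once.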
The task of predicting human motion is complicated by the natural heterogeneity and compositionality of actions, necessitating robustness to distributional shifts, up to and including out-of-distribution (OoD) inputs.
Extracting behavioral measurements non-invasively from video is stymied by the difficulty of the underlying computational problem.