The ability to forecast future human motion is important for human-machine interaction systems, which must understand human behavior and interact accordingly.
We propose an effective two-stage approach to tackle the language-based Human-centric Spatio-Temporal Video Grounding (HC-STVG) task.
In the latent feature learned by the autoencoder, global structures are enhanced and local details are suppressed, which makes the feature more predictive.
Recently, segmentation neural networks have improved significantly, demonstrating very promising accuracy on public benchmarks.
Our formulation of the soft regression framework 1) overcomes a common assumption in existing early action prediction systems that the progress level of the ongoing sequence is given at the testing stage; and 2) presents a theoretical framework to better resolve the ambiguity and uncertainty of subsequences at an early stage of the action.
Ranked #41 on Skeleton Based Action Recognition on NTU RGB+D 120
Rather than simply recognizing the action of each person individually, collective activity recognition aims to determine what a group of people is doing in a collective scene.
The proposed model, formulated in a unified framework, is capable of: 1) jointly mining a set of subspaces with the same dimensionality to exploit latent shared features across different feature channels; 2) simultaneously quantifying the shared and feature-specific components of features in the subspaces; and 3) transferring feature-specific intermediate transforms (i-transforms) to learn the fusion of heterogeneous features across datasets.
Ranked #8 on Skeleton Based Action Recognition on SYSU 3D