The static stream performs cross-modal understanding in a single frame and learns to attend to the target object spatially according to intra-frame visual cues like object appearances.
Specifically, we develop a hierarchical encoder that encodes the multi-modal inputs into semantics-aligned representations at different levels.
The static branch performs cross-modal understanding in a single frame and learns to localize the target object spatially according to intra-frame visual cues like object appearances.
Ranked #1 on Spatio-Temporal Video Grounding on HC-STVG2
The ability to forecast future human motion is important for human-machine interaction systems to understand human behaviors and interact accordingly.
We propose an effective two-stage approach to tackle the language-based Human-centric Spatio-Temporal Video Grounding (HC-STVG) task.
In the latent feature learned by the autoencoder, global structures are enhanced and local details are suppressed, which makes the feature more predictive.
Recently, segmentation neural networks have improved significantly, demonstrating very promising accuracy on public benchmarks.
In this paper, we focus on exploring modality-temporal mutual information for RGB-D action recognition.
Our soft regression formulation 1) removes the usual assumption in existing early action prediction systems that the progress level of the ongoing sequence is given at test time; and 2) provides a theoretical framework to better resolve the ambiguity and uncertainty of subsequences at the early stage of an action.
Ranked #67 on Skeleton Based Action Recognition on NTU RGB+D 120
Rather than simply recognizing the action of each person individually, collective activity recognition aims to determine what a group of people is doing in a collective scene.
The proposed model, formulated as a unified framework, is capable of: 1) jointly mining a set of subspaces with the same dimensionality to exploit latent shared features across different feature channels; 2) quantifying the shared and feature-specific components of the features in those subspaces; and 3) transferring feature-specific intermediate transforms (i-transforms) to learn the fusion of heterogeneous features across datasets.
Ranked #8 on Skeleton Based Action Recognition on SYSU 3D