Compositional Zero-Shot Learning (CZSL) aims to recognize unseen compositions from seen states and objects.
In this paper, we study the problem of one-shot skeleton-based action recognition, which poses unique challenges in learning representations that transfer from base classes to novel classes, particularly for fine-grained actions.
Most generative ZSL methods combine category-level semantic attributes with Gaussian noise to generate visual features.
We design our model as a purely factorised architecture that alternates between spatial feature aggregation and temporal feature aggregation.
The task of skeleton-based action recognition remains a core challenge in human-centred scene understanding due to the multiple granularities and large variation in human motion.
In this article, we propose a novel invariant deep compressible covariance pooling (IDCCP) method to address nuisance variations in aerial scene categorization.
To obtain an appropriate crowd representation, in this work we propose SOFA-Net (Second-Order and First-Order Attention Network): second-order statistics are extracted to retain the selectivity of channel-wise spatial information for dense heads, while first-order statistics, which can enhance feature discrimination in head regions, are used as complementary information.
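
As a rough illustration of the distinction between first-order and second-order statistics mentioned above, the sketch below computes per-channel means and a channel covariance matrix for a toy feature map. It is not SOFA-Net's implementation: the function name `channel_statistics`, the list-of-lists layout, and the toy values are all assumptions made for this example, and the attention mechanism itself is not modelled.

```python
def channel_statistics(features):
    """Return (means, covariance) for a feature map.

    `features` is assumed to be a list of C channels, each a list of
    N spatial activations (a hypothetical layout for this sketch).
    """
    num_positions = len(features[0])

    # First-order statistic: mean activation of each channel.
    means = [sum(channel) / num_positions for channel in features]

    # Second-order statistic: covariance between every pair of channels,
    # computed from the mean-centred activations.
    centered = [[v - m for v in channel] for channel, m in zip(features, means)]
    covariance = [
        [sum(a * b for a, b in zip(row_i, row_j)) / num_positions
         for row_j in centered]
        for row_i in centered
    ]
    return means, covariance


# Toy feature map with 2 channels and 4 spatial positions (made-up values).
toy_features = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
means, covariance = channel_statistics(toy_features)
print(means)       # [2.5, 5.0]
print(covariance)  # [[1.25, 2.5], [2.5, 5.0]]
```

The means summarise each channel on its own, whereas the off-diagonal covariance entries capture how channels co-vary across spatial positions, which is the kind of pairwise information a first-order summary cannot express.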