Graph convolutional networks (GCNs) have been widely used and achieved remarkable results in skeleton-based action recognition.
However, the most suitable positions for inferring different targets, i.e., the object category and its boundaries, generally differ.
Due to the rapid emergence of short videos and the requirement for content understanding and creation, the video captioning task has received increasing attention in recent years.
Extensive experiments on classification and regression datasets demonstrate that DIFER can significantly improve the performance of various machine learning algorithms and outperforms current state-of-the-art AutoFE methods in both efficiency and effectiveness.
Bayesian optimization is a broadly applied methodology for optimizing expensive black-box functions.
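To make the idea concrete, below is a minimal, self-contained sketch of Bayesian optimization: a Gaussian-process surrogate with an RBF kernel and an expected-improvement acquisition, maximized over a candidate grid. The kernel, length-scale, noise jitter, and grid search are illustrative choices, not details from the source.

```python
import math
import numpy as np

def rbf(A, B, ls=0.2):
    # squared-exponential kernel between two sets of points
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # GP posterior mean / std at candidate points Xs given data (X, y)
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    alpha = np.linalg.solve(K, y)
    mu = Ks.T @ alpha
    v = np.linalg.solve(K, Ks)
    var = np.clip(np.diag(rbf(Xs, Xs) - Ks.T @ v), 1e-12, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, best):
    # EI for minimisation: expected gain below the current best value
    z = (best - mu) / sigma
    Phi = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best - mu) * Phi + sigma * phi

def bayes_opt(f, bounds=(0.0, 1.0), n_init=3, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(bounds[0], bounds[1], size=(n_init, 1))
    y = np.array([f(x[0]) for x in X])
    grid = np.linspace(bounds[0], bounds[1], 200)[:, None]
    for _ in range(n_iter):
        mu, sigma = gp_posterior(X, y, grid)
        x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))
    i = int(np.argmin(y))
    return X[i, 0], y[i]
```

In practice the acquisition would be optimized with restarts rather than a fixed grid, and kernel hyper-parameters would be fitted by marginal likelihood; both are omitted here for brevity.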
In this paper, we propose a complete video captioning system including both a novel model and an effective training strategy.
Multi-modal information is essential to describe what has happened in a video.
Inspired by the fact that different modalities in videos carry complementary information, we propose a Multimodal Semantic Attention Network (MSAN), a new encoder-decoder framework that incorporates multimodal semantic attributes for video captioning.
Furthermore, since different layers of a deep network capture feature maps at different scales, we use these maps to construct a spatial pyramid and exploit the multi-scale information to obtain more accurate attention scores; these scores weight the local features at every spatial position of the feature maps to produce attention maps.
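The multi-scale attention described above can be sketched as follows: score each spatial position of every pyramid level against a query vector, upsample the coarser score maps to the finest resolution, fuse them, and use the normalized scores to weight local features. The dot-product scoring, nearest-neighbour upsampling, and score averaging are simplifying assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_scale_attention(feature_maps, query):
    """Fuse attention scores across feature maps of different scales.

    feature_maps: list of arrays (C, H_l, W_l), finest map first; each
                  coarser map's size must divide the finest map's size
    query: array (C,), e.g. a decoder state projected to C dimensions
    Returns an attended feature vector (C,) over the finest map.
    """
    target_h, target_w = feature_maps[0].shape[1:]
    score_maps = []
    for fmap in feature_maps:
        _, H, W = fmap.shape
        # dot-product attention score at each spatial position
        scores = np.einsum('chw,c->hw', fmap, query)
        # nearest-neighbour upsample to the finest resolution
        scores = np.kron(scores, np.ones((target_h // H, target_w // W)))
        score_maps.append(scores)
    # average scores over scales, then normalise over all positions
    fused = np.mean(score_maps, axis=0)
    attn = softmax(fused.ravel()).reshape(target_h, target_w)
    # weight the local features at every spatial position
    return np.einsum('chw,hw->c', feature_maps[0], attn)
```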
In dynamic object detection, it is challenging to construct an effective model to sufficiently characterize the spatial-temporal properties of the background.
In this paper, we study the problem of online action detection from streaming skeleton data.
In this paper, a multi-feature max-margin hierarchical Bayesian model (M3HBM) is proposed for action recognition.
In this paper, we model interactions between neighbor targets by pair-wise motion context, and further encode such context into the global association optimization.
In this paper, we formulate human action recognition as a novel Multi-Task Sparse Learning (MTSL) framework which aims to reconstruct a test sample, represented by multiple features, from as few bases as possible.
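The core idea, reconstructing a multi-feature test sample from few bases, can be illustrated with per-modality sparse coding followed by residual-based classification. The sketch below uses ISTA to solve the l1-regularized reconstruction and pools class-wise residuals over feature modalities; it is a generic sparse-representation classifier in the spirit of MTSL, not the paper's exact model, and `lam`, `n_iter`, and the residual pooling are illustrative choices.

```python
import numpy as np

def ista(D, x, lam=0.1, n_iter=200):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 via ISTA.

    D: (d, k) dictionary whose columns are training-sample bases
    x: (d,) test sample in one feature modality
    Returns a sparse coefficient vector a of shape (k,).
    """
    lr = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = a - lr * (D.T @ (D @ a - x))              # gradient step
        a = np.sign(a) * np.maximum(np.abs(a) - lr * lam, 0.0)  # soft threshold
    return a

def classify(features, dicts, labels):
    """Assign the class whose bases best reconstruct the test sample,
    pooling reconstruction residuals over all feature modalities.

    features: list of (d_m,) sample vectors, one per modality
    dicts:    list of (d_m, k) dictionaries, columns aligned with labels
    labels:   (k,) class label of each dictionary column
    """
    classes = np.unique(labels)
    residuals = np.zeros(len(classes))
    for x, D in zip(features, dicts):
        a = ista(D, x)
        for i, c in enumerate(classes):
            mask = labels == c
            # reconstruct using only this class's bases and coefficients
            residuals[i] += np.linalg.norm(x - D[:, mask] @ a[mask])
    return classes[np.argmin(residuals)]
```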
In this paper, we propose a new global feature to capture the detailed geometrical distribution of interest points.