Video moment retrieval seeks an efficient, generalizable way to identify the temporal segments within an untrimmed video that correspond to a given language description.
This process can be viewed as an open-book exam: with the reference set as a cheat sheet, the proposed method does not need to memorize all visual concepts in the training data.
Specifically, by replacing the Atrous Spatial Pyramid Pooling (ASPP) module in DeepLab v3 with the proposed Vortex Pooling, our approach outperforms the previous state of the art, DeepLab v3, by 1.5% on the PASCAL VOC 2012 val set and by 0.6% on the test set.
However, it is difficult to collect sufficient training images with precise labels in some domains such as apparent age estimation, head pose estimation, multi-label classification and semantic segmentation.
Ranked #1 on Head Pose Estimation on BJUT-3D
Large receptive field and dense prediction are both important for achieving high accuracy in pixel labeling tasks such as semantic segmentation.
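The tension between these two goals can be illustrated with a little arithmetic. The sketch below is not taken from any of the papers listed here. It only applies the standard formula for a dilated (atrous) convolution, where a kernel of size k with dilation d spans k + (k - 1)(d - 1) input positions. The function names are hypothetical, and the dilation rates in the usage note are illustrative values.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Number of input positions spanned by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size: int, dilation: int) -> int:
    """Padding per side that preserves length at stride 1 (odd kernel sizes)."""
    return (effective_kernel_size(kernel_size, dilation) - 1) // 2


def output_length(input_length: int, kernel_size: int, dilation: int = 1,
                  stride: int = 1, padding: int = 0) -> int:
    """Output length of a 1-D convolution along one axis."""
    k_eff = effective_kernel_size(kernel_size, dilation)
    return (input_length + 2 * padding - k_eff) // stride + 1


if __name__ == "__main__":
    # Illustrative dilation rates for a 3x3 kernel (assumed values).
    for rate in (1, 6, 12, 18):
        span = effective_kernel_size(3, rate)
        length = output_length(65, 3, dilation=rate,
                               padding=same_padding(3, rate))
        print(f"dilation={rate:2d}  span={span:2d}  output_length={length}")
```

With stride 1 and matching padding, the output keeps the input length (dense prediction) while the span of each kernel grows with the dilation rate (larger receptive field). A strided convolution also widens the receptive field of later layers, but at the cost of a shorter output.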
Fine-grained image recognition is a challenging computer vision problem, due to the small inter-class variations caused by highly similar subordinate categories, and the large intra-class variations in poses, scales and rotations.