However, this option traditionally comes at a significant cost in detection performance.
We introduce a new image segmentation task, termed Entity Segmentation (ES), which aims to segment all visual entities in an image without considering semantic category labels.
The learned AU semantic embeddings are then used as guidance for the generation of attention maps through a cross-modality attention network.
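As a rough illustration (not the authors' implementation), such a cross-modality attention step can be sketched as projecting the AU semantic embeddings and the visual feature map into a shared space and using scaled dot-product scores as spatial attention maps; all names, shapes, and the use of numpy here are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modality_attention(au_embs, feat_map, Wq, Wk):
    """Generate one spatial attention map per AU embedding.

    au_embs:  (num_au, d_txt) semantic embeddings, one per action unit.
    feat_map: (H, W, d_img) visual feature map.
    Wq: (d_txt, d), Wk: (d_img, d) projections into a shared space.
    Returns (num_au, H, W) maps, each summing to 1 over the H*W grid.
    """
    H, W, _ = feat_map.shape
    q = au_embs @ Wq                          # (num_au, d) text queries
    k = feat_map.reshape(H * W, -1) @ Wk      # (H*W, d) visual keys
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (num_au, H*W) similarities
    return softmax(scores, axis=-1).reshape(-1, H, W)

# toy usage with hypothetical sizes
rng = np.random.default_rng(0)
maps = cross_modality_attention(
    rng.standard_normal((3, 32)),      # 3 AU embeddings
    rng.standard_normal((7, 7, 64)),   # 7x7 visual grid
    rng.standard_normal((32, 16)) * 0.1,
    rng.standard_normal((64, 16)) * 0.1,
)
```

Each row of the output is a normalized spatial distribution, i.e. a soft attention map guided by one AU embedding.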
For downstream usage, we propose a novel modality-adaptive attention mechanism for multimodal feature fusion by adaptively emphasizing language and vision signals.
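One common way to realize such adaptive emphasis (a minimal sketch under assumed names and shapes, not the proposed mechanism itself) is a learned gate that produces per-sample weights over the two modalities and fuses them as a convex combination:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modality_adaptive_fusion(lang_feat, vis_feat, W_gate):
    """Fuse language and vision features with adaptive per-sample weights.

    lang_feat, vis_feat: (batch, d) features from each modality.
    W_gate: (2*d, 2) projection producing one logit per modality.
    Returns the weighted sum of the two modality features and the weights.
    """
    joint = np.concatenate([lang_feat, vis_feat], axis=-1)  # (batch, 2d)
    weights = softmax(joint @ W_gate, axis=-1)              # (batch, 2)
    fused = weights[:, :1] * lang_feat + weights[:, 1:] * vis_feat
    return fused, weights

# toy usage with hypothetical dimensions
rng = np.random.default_rng(0)
lang = rng.standard_normal((4, 8))
vis = rng.standard_normal((4, 8))
W = rng.standard_normal((16, 2)) * 0.1
fused, w = modality_adaptive_fusion(lang, vis, W)
```

Because the gate weights sum to one per sample, the fusion smoothly interpolates between relying on the language signal and the vision signal.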
These two observations are further employed to formulate a measurement which can quantify the shortcut degree of each training sample.
Structured representations of images that model visual relationships are beneficial for many vision and vision-language applications.
Research in image captioning has mostly focused on English because of the availability of image-caption paired datasets in this language.
In this paper, we propose a novel visual encoder to explicitly distinguish viewpoint changes from semantic changes in the change captioning task.
Mobile energy storage systems (MESSs) provide mobility and flexibility to enhance distribution system resilience.
With the rapid growth of video data and the increasing demands of applications such as intelligent video search and assistance for visually impaired people, the video captioning task has recently received considerable attention in the computer vision and natural language processing fields.
Scene graph generation has received growing attention with the advancements in image understanding tasks such as object detection, attribute and relationship prediction,~\etc.
In this paper, we propose a boundary-aware hierarchical language decoder for video captioning, which consists of a high-level GRU-based language decoder, working as a global (caption-level) language model, and a low-level GRU-based language decoder, working as a local (phrase-level) language model.
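The two-level interaction can be sketched roughly as follows: a low-level GRU runs over words conditioned on the high-level context, and at detected phrase boundaries the high-level GRU summarizes the finished phrase while the low-level state is reset. This is a toy numpy sketch under assumed names and dimensions, not the paper's architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell (no biases), for illustration only."""
    def __init__(self, in_dim, hid_dim, rng):
        s = 0.1
        self.Wz = rng.standard_normal((in_dim + hid_dim, hid_dim)) * s
        self.Wr = rng.standard_normal((in_dim + hid_dim, hid_dim)) * s
        self.Wh = rng.standard_normal((in_dim + hid_dim, hid_dim)) * s

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(xh @ self.Wz)      # update gate
        r = sigmoid(xh @ self.Wr)      # reset gate
        h_tilde = np.tanh(np.concatenate([x, r * h]) @ self.Wh)
        return (1 - z) * h + z * h_tilde

def hierarchical_decode(word_embs, boundaries, high, low, d):
    """Run a two-level decoder over a sequence of word embeddings.

    boundaries[t] is True when word t ends a phrase; at those steps the
    high-level (caption-level) GRU consumes the low-level state and the
    low-level (phrase-level) GRU is re-initialized for the next phrase.
    """
    h_high = np.zeros(d)
    h_low = np.zeros(d)
    states = []
    for x, is_boundary in zip(word_embs, boundaries):
        h_low = low.step(np.concatenate([x, h_high]), h_low)
        states.append(h_low)
        if is_boundary:
            h_high = high.step(h_low, h_high)  # summarize finished phrase
            h_low = np.zeros(d)                # start a fresh phrase
    return np.stack(states)

# toy usage: 6 words forming two 3-word phrases
rng = np.random.default_rng(0)
d, e = 16, 8
high = GRUCell(d, d, rng)
low = GRUCell(e + d, d, rng)
words = rng.standard_normal((6, e))
bounds = [False, False, True, False, False, True]
states = hierarchical_decode(words, bounds, high, low, d)
```

The returned per-word states would then feed a word-prediction head; the key design point is that caption-level context changes only at phrase boundaries.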
Image captioning is a multimodal task involving computer vision and natural language processing, where the goal is to learn a mapping from the image to its natural language description.
Textual-visual cross-modal retrieval has been a hot research topic in both computer vision and natural language processing communities.
In the last few years, deep learning has led to strong performance on a variety of problems, such as visual recognition, speech recognition, and natural language processing.