This paper presents a new lightweight yet powerful solution for the tasks of Emotion Recognition and Sentiment Analysis.
Expressed sentiment and emotion are two crucial factors to understand in human multimodal language.
The proposed solution ranks first on Multimodal Sentiment Analysis on CMU-MOSEI.
As new datasets for real-world visual reasoning and compositional question answering emerge, it may become necessary to perform visual feature extraction end-to-end during training.
Even with the growing interest in problems at the intersection of Computer Vision and Natural Language, grounding (i.e., identifying) the components of a structured description in an image remains a challenging task.
When searching for an object, humans navigate a scene using semantic information and spatial relationships.
So far, the goal has been to maximize scores on automated metrics, and doing so has required devising a multitude of new modules and techniques.
This paper describes the UMONS solution for the Multimodal Machine Translation Task presented at the Third Conference on Machine Translation (WMT18).
We propose a new, fully end-to-end approach to multimodal translation in which the source text encoder modulates the entire visual input processing using conditional batch normalization, in order to compute the most informative image features for the task.
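To give a rough idea of the mechanism, the following is a minimal sketch of conditional batch normalization, in which a text-encoder summary vector predicts per-channel scale and shift offsets for a visual feature map. The module and variable names are hypothetical, and this is an assumption-laden illustration rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    """Batch norm whose affine parameters are modulated by a conditioning
    vector (here, a summary of the source sentence). Illustrative sketch,
    not the authors' exact implementation."""
    def __init__(self, num_channels, cond_dim):
        super().__init__()
        # Plain batch norm without its own learnable affine parameters.
        self.bn = nn.BatchNorm2d(num_channels, affine=False)
        # Predict per-channel deltas for scale (gamma) and shift (beta)
        # from the text-encoder state.
        self.delta_gamma = nn.Linear(cond_dim, num_channels)
        self.delta_beta = nn.Linear(cond_dim, num_channels)

    def forward(self, x, cond):
        # x: (B, C, H, W) visual feature map; cond: (B, cond_dim) text summary.
        out = self.bn(x)
        gamma = 1.0 + self.delta_gamma(cond)  # start near identity scaling
        beta = self.delta_beta(cond)
        return gamma.unsqueeze(-1).unsqueeze(-1) * out + beta.unsqueeze(-1).unsqueeze(-1)

# Usage: modulate a ResNet-style feature map with a 512-d sentence encoding.
cbn = ConditionalBatchNorm2d(num_channels=256, cond_dim=512)
feats = torch.randn(4, 256, 14, 14)
text_state = torch.randn(4, 512)
modulated = cbn(feats, text_state)  # (4, 256, 14, 14)
```

Predicting offsets around an identity scaling keeps the visual pathway close to its unconditioned behavior early in training, which is a common design choice for this kind of modulation.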
In Multimodal Neural Machine Translation (MNMT), a neural model generates a translated sentence that describes an image, given the image itself and one source description in English.
In state-of-the-art Neural Machine Translation (NMT), an attention mechanism is used during decoding to enhance the translation.
Recently, the effectiveness of the attention mechanism has also been explored for multimodal tasks, where it becomes possible to focus on both sentence parts and image regions.
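As a concrete illustration of attending over both modalities, below is a hedged sketch of additive (Bahdanau-style) attention applied separately to source-word annotations and image-region features during one decoding step. All names, dimensions, and the simple concatenation-based fusion are illustrative assumptions, not taken from any of the cited systems.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAttention(nn.Module):
    """Additive attention over a set of annotation vectors, reusable for
    source words or image regions. Illustrative sketch only."""
    def __init__(self, dec_dim, ann_dim, att_dim):
        super().__init__()
        self.proj_dec = nn.Linear(dec_dim, att_dim)
        self.proj_ann = nn.Linear(ann_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, dec_state, annotations):
        # dec_state: (B, dec_dim); annotations: (B, N, ann_dim)
        energy = self.score(torch.tanh(
            self.proj_dec(dec_state).unsqueeze(1) + self.proj_ann(annotations)))
        alpha = F.softmax(energy, dim=1)            # (B, N, 1) attention weights
        context = (alpha * annotations).sum(dim=1)  # (B, ann_dim) weighted context
        return context, alpha

# One decoding step attending to source words and image regions separately,
# then fusing the two contexts by concatenation.
text_att = ModalityAttention(dec_dim=512, ann_dim=512, att_dim=256)
img_att = ModalityAttention(dec_dim=512, ann_dim=2048, att_dim=256)
dec_state = torch.randn(2, 512)
word_ann = torch.randn(2, 20, 512)     # 20 encoded source words
region_ann = torch.randn(2, 49, 2048)  # 7x7 grid of image-region features
text_ctx, _ = text_att(dec_state, word_ann)
img_ctx, _ = img_att(dec_state, region_ann)
fused = torch.cat([text_ctx, img_ctx], dim=-1)  # feeds the next decoder step
```

Each modality gets its own attention weights, so the decoder can emphasize a source word, an image region, or both at a given time step; how the two contexts are fused varies across systems, and concatenation is only one option.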