( Image credit: Deep Visual-Semantic Alignments for Generating Image Descriptions )
Several mechanisms to focus attention of a neural network on selected parts of its input or memory have been used successfully in deep learning models in recent years.
#29 best model for Machine Translation on WMT2014 English-French
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing.
We observe that our method consistently outperforms BS and previously proposed techniques for diverse decoding from neural sequence models.
In this paper, we design a benchmark task and provide the associated datasets for recognizing face images and link them to corresponding entity keys in a knowledge base.
Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data.
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.
Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images.
We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units.
#26 best model for Machine Translation on WMT2014 English-French