(Image credit: Deep Visual-Semantic Alignments for Generating Image Descriptions)
Several mechanisms for focusing the attention of a neural network on selected parts of its input or memory have been used successfully in deep learning models in recent years.
Ranked #37 on Machine Translation on WMT2014 English-French
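To make the idea concrete, here is a minimal sketch of one common attention mechanism, scaled dot-product attention over a set of memory slots, in NumPy. The function name and shapes are illustrative assumptions, not the specific mechanism of any paper listed here.

```python
import numpy as np

def scaled_dot_product_attention(query, keys, values):
    """Weight `values` by the softmax similarity between `query` and `keys`.

    query:  (d,)    vector the network attends from
    keys:   (n, d)  one key per input/memory slot
    values: (n, d)  one value per input/memory slot
    """
    scores = keys @ query / np.sqrt(query.shape[-1])  # (n,) similarity scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over slots
    return weights @ values                           # convex combination of values

# Toy usage: attend over 4 memory slots of dimension 8.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
context = scaled_dot_product_attention(q, K, V)
```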
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing.
We observe that our method consistently outperforms beam search (BS) and previously proposed techniques for diverse decoding from neural sequence models.
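As one illustration of how diverse decoding departs from plain beam search, the sketch below applies a Hamming-style diversity penalty across beam groups at a single decoding step, in the spirit of Diverse Beam Search. The function and the width-1 groups are simplifying assumptions, not the paper's exact algorithm.

```python
import numpy as np

def diverse_step(group_log_probs, diversity_strength=0.5):
    """One decoding step over G beam groups (simplified sketch).

    group_log_probs: list of (vocab,) log-probability arrays, one per group.
    Later groups are penalized for repeating tokens already chosen by
    earlier groups at this step, encouraging diverse continuations.
    """
    chosen = []
    counts = np.zeros_like(group_log_probs[0])
    for scores in group_log_probs:
        penalized = scores - diversity_strength * counts
        tok = int(np.argmax(penalized))   # greedy within group (beam width 1 here)
        chosen.append(tok)
        counts[tok] += 1.0
    return chosen

# Toy usage: three groups over a 10-token vocabulary.
rng = np.random.default_rng(0)
print(diverse_step([rng.normal(size=10) for _ in range(3)]))
```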
In this paper, we design a benchmark task and provide the associated datasets for recognizing face images and linking them to corresponding entity keys in a knowledge base.
Regional dropout strategies have been proposed to enhance the performance of convolutional neural network classifiers.
Ranked #2 on Domain Generalization on ImageNet-A
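Below is a minimal sketch of a CutMix-style regional operation: instead of zeroing out a region as in regional dropout, a patch from a second image is pasted in and the labels are mixed in proportion to area. The function name and argument conventions are illustrative assumptions.

```python
import numpy as np

def cutmix(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Paste a random rectangle from x2 into x1 and mix the labels by area.

    x1, x2: (H, W, C) images; y1, y2: one-hot label vectors.
    """
    if rng is None:
        rng = np.random.default_rng()
    H, W = x1.shape[:2]
    lam = rng.beta(alpha, alpha)                     # sampled mixing ratio
    cut_h = int(H * np.sqrt(1 - lam))
    cut_w = int(W * np.sqrt(1 - lam))
    cy, cx = int(rng.integers(H)), int(rng.integers(W))  # patch center
    top, bottom = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, H)
    left, right = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, W)
    mixed = x1.copy()
    mixed[top:bottom, left:right] = x2[top:bottom, left:right]
    # Recompute the ratio from the actual pasted area after clipping.
    lam_adj = 1 - (bottom - top) * (right - left) / (H * W)
    return mixed, lam_adj * y1 + (1 - lam_adj) * y2
```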
In this work we present Ludwig, a flexible, extensible, and easy-to-use toolbox that allows users to train deep learning models and obtain predictions from them without writing code.
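A rough sketch of Ludwig's declarative workflow driven from Python is below. The dataset paths are hypothetical, and the exact configuration keys can differ across Ludwig versions; treat this as an outline rather than a verified script.

```python
from ludwig.api import LudwigModel

# Declarative config: describe inputs and outputs instead of writing model code.
config = {
    "input_features": [{"name": "review_text", "type": "text"}],
    "output_features": [{"name": "sentiment", "type": "category"}],
}

model = LudwigModel(config)
results = model.train(dataset="reviews_train.csv")       # hypothetical CSV path
predictions = model.predict(dataset="reviews_test.csv")  # hypothetical CSV path
```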
Tasks: Image Captioning, Image Classification, Language Modelling, Machine Translation, Multi-Label Classification, Multi-Task Learning, Named Entity Recognition, Natural Language Understanding, One-Shot Learning, Sentiment Analysis, Speaker Verification, Text Classification, Time Series Forecasting, Visual Question Answering
Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions.
Ranked #3 on Image Retrieval with Multi-Modal Query on MIT-States
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.
Ranked #46 on Visual Question Answering on VQA v2 test-std
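The sketch below shows the common soft top-down attention pattern used in such models: a top-down query (e.g., a caption-LSTM state or question embedding) scores bottom-up region features, and their softmax-weighted average becomes the attended image feature. All parameter and function names here are illustrative assumptions.

```python
import numpy as np

def top_down_attention(regions, query, W_r, W_q, w_a):
    """Soft top-down attention over image-region features (minimal sketch).

    regions: (k, d_r) bottom-up region features (e.g., from an object detector)
    query:   (d_q,)   top-down signal (e.g., question or caption-LSTM state)
    W_r:     (d_r, d_h), W_q: (d_q, d_h), w_a: (d_h,) hypothetical learned params
    """
    hidden = np.tanh(regions @ W_r + query @ W_q)  # (k, d_h); query broadcasts
    scores = hidden @ w_a                          # (k,) one score per region
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over regions
    return weights @ regions                       # attended image feature

# Toy usage: 36 regions of dim 16, query of dim 8, hidden dim 12.
rng = np.random.default_rng(0)
out = top_down_attention(rng.normal(size=(36, 16)), rng.normal(size=8),
                         rng.normal(size=(16, 12)), rng.normal(size=(8, 12)),
                         rng.normal(size=12))
```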