Visual Question Answering
486 papers with code • 48 benchmarks • 94 datasets
Visual Question Answering (VQA) is a multimodal task that aims to answer natural-language questions about the content of a given image.
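As a concrete starting point, the sketch below runs VQA inference with the Hugging Face transformers library and the publicly released ViLT checkpoint fine-tuned on VQA v2; the checkpoint name, image URL, and question are illustrative, and any VQA-capable checkpoint would work the same way.

```python
# Minimal VQA inference sketch (assumes `transformers`, `Pillow`, and
# `requests` are installed; checkpoint and inputs are illustrative).
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
question = "How many cats are there?"

inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits                     # one logit per candidate answer
answer = model.config.id2label[logits.argmax(-1).item()]
print(answer)                                       # e.g. "2"
```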
Image Source: visualqa.org
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.
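A minimal PyTorch sketch of such a top-down (question-guided) attention layer over region features follows; the feature dimensions, the 36-region convention, and the scoring MLP are illustrative assumptions, not the exact architecture of any one paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Question-guided ("top-down") attention over K image-region features."""
    def __init__(self, v_dim=2048, q_dim=512, hidden=512):
        super().__init__()
        self.proj_v = nn.Linear(v_dim, hidden)
        self.proj_q = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, v, q):
        # v: (batch, K, v_dim) region features; q: (batch, q_dim) question encoding
        joint = torch.tanh(self.proj_v(v) + self.proj_q(q).unsqueeze(1))
        alpha = F.softmax(self.score(joint), dim=1)   # (batch, K, 1) attention weights
        return (alpha * v).sum(dim=1)                 # attended image feature

att = TopDownAttention()
v = torch.randn(2, 36, 2048)    # e.g. 36 bottom-up region proposals per image
q = torch.randn(2, 512)
print(att(v, q).shape)          # torch.Size([2, 2048])
```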
Relational reasoning is a central component of generally intelligent behavior, but has proven difficult for neural networks to learn.
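One instantiation of this idea is the Relation Network: a shared MLP g scores every ordered pair of object features (conditioned on the question), and an MLP f maps the summed pair codes to an answer. A PyTorch sketch under assumed feature sizes:

```python
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    """Sketch of a Relation Network; layer sizes are illustrative assumptions."""
    def __init__(self, obj_dim=256, q_dim=128, hidden=256, n_answers=10):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * obj_dim + q_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, n_answers))

    def forward(self, objects, q):
        # objects: (batch, N, obj_dim); q: (batch, q_dim)
        b, n, d = objects.shape
        o_i = objects.unsqueeze(2).expand(b, n, n, d)   # first object of each pair
        o_j = objects.unsqueeze(1).expand(b, n, n, d)   # second object of each pair
        q_rep = q.unsqueeze(1).unsqueeze(1).expand(b, n, n, q.size(-1))
        pairs = torch.cat([o_i, o_j, q_rep], dim=-1)
        return self.f(self.g(pairs).sum(dim=(1, 2)))    # sum over all pairs, then f

rn = RelationNetwork()
print(rn(torch.randn(2, 8, 256), torch.randn(2, 128)).shape)  # torch.Size([2, 10])
```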
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.
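ViLBERT's central architectural move is the co-attentional transformer layer, in which each stream's queries attend over the other stream's keys and values. A minimal sketch of that exchange using PyTorch's stock multi-head attention; dimensions are assumed, and the full block's feed-forward sublayers and residual connections are omitted.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Sketch of a co-attentional exchange between visual and text streams."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.vis_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, txt):
        # vis: (batch, regions, dim) image stream; txt: (batch, tokens, dim) text stream
        vis_out, _ = self.vis_attends_txt(query=vis, key=txt, value=txt)
        txt_out, _ = self.txt_attends_vis(query=txt, key=vis, value=vis)
        return vis_out, txt_out

co = CoAttention()
v, t = co(torch.randn(2, 36, 768), torch.randn(2, 20, 768))
print(v.shape, t.shape)   # torch.Size([2, 36, 768]) torch.Size([2, 20, 768])
```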
Neural network architectures with memory and attention mechanisms exhibit certain reasoning capabilities required for question answering.
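A sketch of one such memory-plus-attention reasoning step, in the spirit of episodic memory modules: attention over encoded facts conditioned on the question and the current memory, followed by a gated memory update, repeated for several hops. Layer sizes and hop count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EpisodicMemory(nn.Module):
    """Sketch of an attention-based memory module; sizes are illustrative."""
    def __init__(self, dim=256, hops=3):
        super().__init__()
        self.hops = hops
        self.score = nn.Sequential(nn.Linear(3 * dim, dim), nn.Tanh(),
                                   nn.Linear(dim, 1))
        self.update = nn.GRUCell(dim, dim)

    def forward(self, facts, q):
        # facts: (batch, N, dim) encoded inputs (e.g. image regions); q: (batch, dim)
        m = q
        for _ in range(self.hops):                       # multiple reasoning steps
            m_exp = m.unsqueeze(1).expand_as(facts)
            q_exp = q.unsqueeze(1).expand_as(facts)
            a = F.softmax(self.score(torch.cat([facts, m_exp, q_exp], -1)), dim=1)
            context = (a * facts).sum(1)                 # attended evidence this hop
            m = self.update(context, m)                  # fold evidence into memory
        return m

mem = EpisodicMemory()
print(mem(torch.randn(2, 10, 256), torch.randn(2, 256)).shape)  # torch.Size([2, 256])
```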
This paper presents a state-of-the-art model for visual question answering (VQA), which won first place in the 2017 VQA Challenge.
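That model treats answer prediction as multi-label classification: independent sigmoid outputs are trained against soft target scores derived from annotator agreement rather than a softmax over a single ground-truth answer. A sketch of that loss, where the vocabulary size and the min(count / 3, 1) target rule follow common VQA v2 practice and the shapes are illustrative:

```python
import torch
import torch.nn.functional as F

# Soft-score multi-label answer loss: each candidate answer gets a sigmoid
# output trained against the fraction of annotators that gave that answer,
# capped at 1 (targets below are hand-set for illustration).
logits = torch.randn(4, 3129)          # (batch, answer vocabulary)
targets = torch.zeros(4, 3129)
targets[0, 7] = 1.0                    # enough annotators agreed on answer 7
targets[0, 42] = 1.0 / 3               # one of several annotators said 42

loss = F.binary_cross_entropy_with_logits(logits, targets)
print(loss.item())
```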