Visual Question Answering
324 papers with code • 42 benchmarks • 71 datasets
Visual Question Answering (VQA) is a multimodal task that aims to answer natural-language questions about the content of an image.
Image Source: visualqa.org
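In the common formulation, a model takes an image-question pair and scores answers from a fixed vocabulary. As a minimal illustration, here is an inference sketch using the publicly available ViLT VQA checkpoint from Hugging Face Transformers; the model choice and image URL are illustrative examples, not tied to any specific paper listed below.

```python
# Minimal VQA inference sketch with an off-the-shelf ViLT checkpoint.
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"

# Encode the image-question pair and pick the highest-scoring answer
# from the model's fixed answer vocabulary.
inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```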
In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
Ranked #1 on Visual Question Answering on VizWiz 2018
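A hedged sketch of how LXMERT's three-encoder layout is exposed through the Hugging Face implementation: the language encoder consumes the tokenized question, the object-relationship encoder consumes region features, and the cross-modality encoder fuses the two. The random tensors below are placeholders for the Faster R-CNN region features and box coordinates the real pipeline would supply.

```python
import torch
from transformers import LxmertModel, LxmertTokenizer

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

inputs = tokenizer("What color is the dog?", return_tensors="pt")
num_boxes = 36
visual_feats = torch.randn(1, num_boxes, 2048)  # region features (placeholder)
visual_pos = torch.rand(1, num_boxes, 4)        # normalized box coordinates (placeholder)

outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
# Language and object-relationship encoder outputs, fused by the
# cross-modality encoder into a pooled representation:
print(outputs.language_output.shape)  # (1, seq_len, 768)
print(outputs.vision_output.shape)    # (1, 36, 768)
print(outputs.pooled_output.shape)    # (1, 768)
```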
In this work we present Ludwig, a flexible, extensible and easy to use toolbox which allows users to train deep learning models and use them for obtaining predictions without writing code.
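Ludwig's declarative style means a VQA-like model is specified as a configuration rather than as model code. A sketch via Ludwig's Python API is shown below; the column names and CSV paths are hypothetical.

```python
from ludwig.api import LudwigModel

# Declarative model spec: image + text inputs, categorical answer output.
config = {
    "input_features": [
        {"name": "image_path", "type": "image"},
        {"name": "question", "type": "text"},
    ],
    "output_features": [
        {"name": "answer", "type": "category"},
    ],
}

model = LudwigModel(config)
# Expects a CSV with image_path, question, and answer columns (hypothetical files).
train_stats, _, _ = model.train(dataset="vqa_train.csv")
predictions, _ = model.predict(dataset="vqa_test.csv")
```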
We demonstrate that by making subtle but important changes to the model architecture and the learning rate schedule, fine-tuning image features, and adding data augmentation, we can significantly improve the performance of the up-down model on the VQA v2.0 dataset, from 65.67% to 70.22%.
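The exact schedule changes are not spelled out here; as a generic illustration only, this is a linear warmup-then-decay schedule of the kind commonly used when fine-tuning VQA models, expressed with PyTorch's LambdaLR. All hyperparameters and the stand-in model are hypothetical.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(10, 2)  # stand-in model for the sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_steps, total_steps = 1000, 10000

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps  # linear warmup from 0 to the base LR
    # linear decay from the base LR back to 0 after warmup
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)
for step in range(total_steps):
    optimizer.step()
    scheduler.step()
```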
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.
Ranked #48 on Visual Question Answering on VQA v2 test-std
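The core operation in this family of models is question-guided ("top-down") attention over bottom-up region features: the question encoding scores each image region, and a weighted sum yields a single attended image feature. A minimal PyTorch sketch follows; the dimensions and the omission of the paper's gated activations are simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    def __init__(self, v_dim=2048, q_dim=512, hid_dim=512):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, hid_dim)
        self.q_proj = nn.Linear(q_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, v, q):
        # v: (batch, num_regions, v_dim) bottom-up region features
        # q: (batch, q_dim) question encoding
        joint = torch.tanh(self.v_proj(v) + self.q_proj(q).unsqueeze(1))
        weights = F.softmax(self.score(joint), dim=1)  # one weight per region
        return (weights * v).sum(dim=1)                # attended image feature

att = TopDownAttention()
v = torch.randn(2, 36, 2048)
q = torch.randn(2, 512)
print(att(v, q).shape)  # torch.Size([2, 2048])
```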