Visual Question Answering is a semantic task that aims to answer questions based on an image.
Image Source: visualqa.org
|TREND||DATASET||BEST METHOD||PAPER TITLE||PAPER||CODE||COMPARE|
Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems.
Ranked #19 on Visual Question Answering on VQA v2 test-dev
In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.
Ranked #1 on Visual Question Answering on VizWiz 2018
In this work we present Ludwig, a flexible, extensible and easy to use toolbox which allows users to train deep learning models and use them for obtaining predictions without writing code.
IMAGE CAPTIONING IMAGE CLASSIFICATION LANGUAGE MODELLING MACHINE TRANSLATION MULTI-LABEL CLASSIFICATION MULTI-TASK LEARNING NAMED ENTITY RECOGNITION NATURAL LANGUAGE UNDERSTANDING ONE-SHOT LEARNING SENTIMENT ANALYSIS SPEAKER VERIFICATION TEXT CLASSIFICATION TIME SERIES FORECASTING VISUAL QUESTION ANSWERING
We demonstrate that by making subtle but important changes to the model architecture and the learning rate schedule, fine-tuning image features, and adding data augmentation, we can significantly improve the performance of the up-down model on VQA v2. 0 dataset -- from 65. 67% to 70. 22%.
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.
Ranked #46 on Visual Question Answering on VQA v2 test-std