Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems.
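A toy sketch of what such a decomposition might look like; the module names (find, relocate, query) and the example question are hypothetical illustrations, not taken from the paper:

```python
# Hypothetical illustration: one compositional question represented as a
# sequence of modular sub-problems. Module names are not from the paper.

from dataclasses import dataclass

@dataclass
class ModuleCall:
    name: str       # which reusable module to invoke
    argument: str   # the text span the module conditions on

# "What color is the ball left of the dog?" decomposed into sub-problems:
program = [
    ModuleCall("find", "dog"),          # locate the dog in the image
    ModuleCall("relocate", "left of"),  # shift attention to its left
    ModuleCall("find", "ball"),         # locate the ball in that region
    ModuleCall("query", "color"),       # read out the attribute
]

for step in program:
    print(f"{step.name}({step.argument!r})")
```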
We demonstrate that by making subtle but important changes to the model architecture and the learning rate schedule, fine-tuning image features, and adding data augmentation, we can significantly improve the performance of the up-down model on the VQA v2.0 dataset, from 65.67% to 70.22%.
#3 best model for Visual Question Answering on VQA v2
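Among the changes the abstract names is the learning rate schedule. A minimal sketch of a warmup-then-decay schedule in PyTorch follows; the optimizer choice and every numeric value are assumptions for illustration, not the paper's hyperparameters:

```python
# A warmup-then-step-decay schedule sketched with PyTorch's LambdaLR.
# All values here (base LR, warmup length, decay points) are assumptions
# for illustration, not the paper's actual hyperparameters.

import torch

model = torch.nn.Linear(2048, 3129)  # stand-in for a VQA answer head
optimizer = torch.optim.Adamax(model.parameters(), lr=2e-3)

WARMUP_ITERS = 1000
DECAY_AT = (5000, 7000)  # iterations at which the LR is halved

def lr_lambda(it: int) -> float:
    if it < WARMUP_ITERS:
        return (it + 1) / WARMUP_ITERS            # linear warmup from ~0
    return 0.5 ** sum(it >= d for d in DECAY_AT)  # step decay afterwards

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for _ in range(8000):     # one scheduler step per training iteration
    optimizer.step()
    scheduler.step()
```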
In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize the given vision-language information seamlessly.
#4 best model for Visual Question Answering on VQA v2
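A minimal sketch of a single bilinear attention map between question-token features and image-region features, following the general low-rank bilinear idea; all dimensions and the choice of activation are assumptions, not the paper's configuration:

```python
# A minimal sketch of one bilinear attention map between question token
# features X and image region features Y. Low-rank projections U, V and
# the pooling layer p follow the general bilinear-attention idea;
# dimensions are illustrative, not the paper's configuration.

import torch
import torch.nn.functional as F

num_tokens, num_regions = 14, 36
dx, dy, dh = 768, 2048, 512

X = torch.randn(num_tokens, dx)    # question token features
Y = torch.randn(num_regions, dy)   # image region features

U = torch.nn.Linear(dx, dh)        # project the question side to a joint space
V = torch.nn.Linear(dy, dh)        # project the image side to the same space
p = torch.nn.Linear(dh, 1)         # pool the joint space to a scalar logit

# logit[i, j] = p(relu(U x_i) * relu(V y_j)), computed for all pairs
joint = torch.relu(U(X)).unsqueeze(1) * torch.relu(V(Y)).unsqueeze(0)
logits = p(joint).squeeze(-1)      # (num_tokens, num_regions)

# normalize over all token-region pairs to get the attention map
attention = F.softmax(logits.flatten(), dim=0).view_as(logits)
print(attention.shape)             # torch.Size([14, 36])
```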
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.
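A minimal sketch of the top-down step: a question vector scores a set of bottom-up region features, and the image is summarized as their weighted sum. The one-layer scorer and all dimensions are illustrative assumptions:

```python
# Top-down attention sketch: a question encoding scores each region
# feature, and the image is summarized as the attention-weighted sum.
# The single-layer tanh scorer and all sizes are assumptions.

import torch
import torch.nn.functional as F

num_regions, dv, dq, dh = 36, 2048, 512, 512

regions = torch.randn(num_regions, dv)   # bottom-up region features
question = torch.randn(dq)               # top-down query (question encoding)

proj_v = torch.nn.Linear(dv, dh)
proj_q = torch.nn.Linear(dq, dh)
scorer = torch.nn.Linear(dh, 1)

# score every region against the question, then normalize
scores = scorer(torch.tanh(proj_v(regions) + proj_q(question)))  # (36, 1)
weights = F.softmax(scores, dim=0)

attended = (weights * regions).sum(dim=0)  # (2048,) attended image feature
```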
We propose a technique for producing "visual explanations" for decisions from a large class of CNN-based models, making them more transparent.
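A minimal Grad-CAM-style sketch: feature maps of a late convolutional layer are weighted by the spatially pooled gradients of a class score, summed, and passed through ReLU to form a coarse localization map. The backbone (torchvision's ResNet-18) and target layer are assumed for illustration:

```python
# Grad-CAM-style sketch: weight the last conv block's feature maps by the
# pooled gradients of a class score, then ReLU the weighted sum.
# ResNet-18 and its layer4 are assumed stand-ins for any CNN backbone.

import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()

activations, gradients = {}, {}

def save_activation(module, inputs, output):
    activations["a"] = output            # feature maps from the forward pass

def save_gradient(module, grad_input, grad_output):
    gradients["g"] = grad_output[0]      # gradient w.r.t. those feature maps

layer = model.layer4                     # assumed target: last conv block
layer.register_forward_hook(save_activation)
layer.register_full_backward_hook(save_gradient)

image = torch.randn(1, 3, 224, 224)
logits = model(image)
logits[0, logits.argmax()].backward()    # backprop the top class score

a, g = activations["a"], gradients["g"]           # each (1, 512, 7, 7)
weights = g.mean(dim=(2, 3), keepdim=True)        # pool gradients spatially
cam = F.relu((weights * a).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
```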
This paper presents a state-of-the-art model for visual question answering (VQA), which won first place in the 2017 VQA Challenge.
#2 best model for Visual Question Answering on VizWiz
In this paper, we instead present a modular deep architecture capable of decomposing referential expressions into their component parts, identifying the entities and relationships mentioned in the input expression and grounding them all in the scene.
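An illustrative sketch of that compositional scoring: unary modules score each region against the subject and object phrases, a pairwise module scores region pairs against the relationship, and the scores combine additively. Every module here is a simplified stand-in, not the paper's actual networks:

```python
# Illustrative compositional scoring for an expression like
# "the ball left of the dog": unary subject/object scores plus a
# pairwise relationship score. All modules are simplified stand-ins.

import torch

num_regions, dv, dt = 36, 2048, 300

regions = torch.randn(num_regions, dv)   # candidate region features
subj_emb = torch.randn(dt)               # e.g. embedding of "ball"
rel_emb = torch.randn(dt)                # e.g. embedding of "left of"
obj_emb = torch.randn(dt)                # e.g. embedding of "dog"

localize = torch.nn.Bilinear(dv, dt, 1)      # unary: region vs. text part
relate = torch.nn.Bilinear(2 * dv, dt, 1)    # pairwise: region pair vs. relation

subj_scores = localize(regions, subj_emb.expand(num_regions, dt)).squeeze(-1)
obj_scores = localize(regions, obj_emb.expand(num_regions, dt)).squeeze(-1)

# score every (subject, object) region pair under the relationship
pairs = torch.cat([
    regions.unsqueeze(1).expand(-1, num_regions, -1),
    regions.unsqueeze(0).expand(num_regions, -1, -1),
], dim=-1)                                              # (36, 36, 4096)
rel_scores = relate(
    pairs, rel_emb.expand(num_regions, num_regions, dt)
).squeeze(-1)

total = subj_scores[:, None] + rel_scores + obj_scores[None, :]
best_subj, best_obj = divmod(total.argmax().item(), num_regions)
print(best_subj, best_obj)  # indices of the best-scoring region pair
```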