Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems.
We demonstrate that by making subtle but important changes to the model architecture and the learning rate schedule, fine-tuning image features, and adding data augmentation, we can significantly improve the performance of the up-down model on the VQA v2.0 dataset -- from 65.67% to 70.22%.
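One of the changes mentioned above is the learning rate schedule. A common pattern for this kind of fine-tuning is linear warmup followed by a constant rate; the sketch below is a hypothetical illustration of that idea (the `base_lr` and `warmup_steps` values are assumptions, not the paper's exact recipe).

```python
def warmup_lr(step, base_lr=0.002, warmup_steps=1000):
    # Linearly ramp the learning rate up over the first warmup_steps
    # updates, then hold it constant. Warmup helps stabilize early
    # training when features are being fine-tuned.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

The schedule would be queried once per optimizer step, e.g. `warmup_lr(global_step)`.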
#3 best model for Visual Question Answering on VQA v2
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning.
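The core of a top-down attention mechanism is scoring a set of image-region features against a question representation and taking a weighted sum. A minimal sketch, assuming a simple dot-product scorer (real models typically use a learned MLP over joint features):

```python
import numpy as np

def top_down_attention(region_feats, question_vec):
    # region_feats: (k, d) array of k image-region features.
    # question_vec: (d,) question embedding that drives the attention.
    # Score each region against the question, softmax-normalize the
    # scores, and return the attention-weighted image feature.
    scores = region_feats @ question_vec            # (k,) relevance scores
    weights = np.exp(scores - scores.max())         # stable softmax
    weights /= weights.sum()
    return weights @ region_feats                   # (d,) attended feature
```

The attended feature is then fused with the question embedding to predict an answer.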
This paper presents a state-of-the-art model for visual question answering (VQA), which won the first place in the 2017 VQA Challenge.
#2 best model for Visual Question Answering on VizWiz