Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions...
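To make "attention calculated at the level of objects and other salient image regions" concrete, here is a minimal sketch of soft attention over a set of per-region feature vectors. It is a generic illustration, not the paper's implementation: the truncated abstract above does not specify the scoring function, so a plain dot product is swapped in as an assumption, and the names `attend`, `regions`, and `query` are hypothetical. The region features are assumed to come from some upstream region proposal step, and the query stands in for task context such as a question encoding or a caption decoder state.

```python
import math


def attend(regions, query):
    """Soft attention over region feature vectors (illustrative sketch).

    regions: list of k feature vectors (lists of d floats), one per image
             region; assumed to be produced by an upstream detector.
    query:   list of d floats standing in for the task context (assumption).
    Returns (weights, attended): weights sum to 1, attended has length d.
    """
    if not regions:
        raise ValueError("need at least one region")

    # Relevance score per region. A dot product is an assumption made for
    # simplicity; it is not taken from the source text.
    scores = [sum(r_j * q_j for r_j, q_j in zip(r, query)) for r in regions]

    # Numerically stable softmax turns scores into attention weights.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attended feature: the attention-weighted sum of region features.
    d = len(regions[0])
    attended = [sum(w * r[j] for w, r in zip(weights, regions)) for j in range(d)]
    return weights, attended


# Three toy regions with 2-dimensional features (made-up values).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, attended = attend(regions, query)
```

With these toy inputs, the first and third regions score equally against the query and receive equal weight, while the second region receives less. The point of the sketch is only that attention weights are assigned per region rather than per cell of a uniform grid.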

CVPR 2018
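TASK                       DATASET          MODEL     METRIC NAME   METRIC VALUE  GLOBAL RANK
Visual Question Answering  GQA Test2019     BottomUp  Accuracy      49.74         # 93
Visual Question Answering  GQA Test2019     BottomUp  Binary        66.64         # 94
Visual Question Answering  GQA Test2019     BottomUp  Open          34.83         # 95
Visual Question Answering  GQA Test2019     BottomUp  Consistency   78.71         # 96
Visual Question Answering  GQA Test2019     BottomUp  Plausibility  84.57         # 59
Visual Question Answering  GQA Test2019     BottomUp  Validity      96.18         # 71
Visual Question Answering  GQA Test2019     BottomUp  Distribution  5.98          # 52
Visual Question Answering  VQA v2 test-std  Up-Down   overall       70.34         # 47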
