Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

CVPR 2018 · Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions...
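To make the idea concrete, the following is a minimal sketch of a top-down soft-attention layer over a set of bottom-up region features (e.g. per-object features from a region proposal network). This is not the authors' implementation; the tensor shapes, hidden size, and the additive (tanh) scoring form are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Soft top-down attention over a set of bottom-up region features.

    Each image is represented by k region feature vectors; a task context vector
    (e.g. a partial-caption LSTM state or a question encoding) scores each region,
    and the region features are averaged with the normalized attention weights.
    """
    def __init__(self, region_dim, context_dim, hidden_dim):
        super().__init__()
        self.proj_region = nn.Linear(region_dim, hidden_dim)
        self.proj_context = nn.Linear(context_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, context):
        # regions: (batch, k, region_dim), context: (batch, context_dim)
        joint = torch.tanh(self.proj_region(regions)
                           + self.proj_context(context).unsqueeze(1))
        alpha = F.softmax(self.score(joint).squeeze(-1), dim=1)   # (batch, k) weights
        attended = (alpha.unsqueeze(-1) * regions).sum(dim=1)     # (batch, region_dim)
        return attended, alpha

# Example usage (dimensions are illustrative assumptions):
# 36 bottom-up regions with 2048-d features, a 512-d context vector.
att = TopDownAttention(region_dim=2048, context_dim=512, hidden_dim=512)
v = torch.randn(4, 36, 2048)
q = torch.randn(4, 512)
v_hat, weights = att(v, q)
print(v_hat.shape, weights.shape)  # torch.Size([4, 2048]) torch.Size([4, 36])
```

The attended feature vector can then be fed to a captioning decoder or a VQA answer classifier; the attention weights themselves fall on object-level regions rather than a uniform spatial grid, which is the key difference the abstract describes.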

TASK | DATASET | MODEL | METRIC | VALUE | GLOBAL RANK
Visual Question Answering | VQA v2 test-std | Up-Down | Accuracy | 70.34 | #9