Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

CVPR 2018 · Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions...
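The core idea can be sketched in code: a bottom-up stage (in the paper, a Faster R-CNN detector) proposes a set of region features, and a top-down stage weights those regions conditioned on the task context (e.g. the question, or the caption decoder state). The sketch below is a minimal, hypothetical illustration of the top-down weighting step only; the parameter names (`W_v`, `W_q`, `w_a`) and the single-layer scoring function are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def top_down_attention(V, q, W_v, W_q, w_a):
    """Attend over k bottom-up region features V (k x d_v),
    conditioned on a task-specific query vector q (d_q,).
    W_v, W_q, w_a are illustrative learned parameters."""
    # Project regions and query into a shared space and combine nonlinearly
    h = np.tanh(V @ W_v + q @ W_q)   # (k, d_h)
    scores = h @ w_a                 # (k,) one score per region
    alpha = softmax(scores)          # attention distribution over regions
    return alpha @ V, alpha          # attended feature (d_v,), weights

# Toy example: 5 regions with 8-dim features, 6-dim query
rng = np.random.default_rng(0)
k, d_v, d_q, d_h = 5, 8, 6, 4
V = rng.standard_normal((k, d_v))
q = rng.standard_normal(d_q)
v_hat, alpha = top_down_attention(
    V, q,
    rng.standard_normal((d_v, d_h)),
    rng.standard_normal((d_q, d_h)),
    rng.standard_normal(d_h),
)
print(alpha.sum())  # attention weights sum to 1
```

Because attention operates over detected object regions rather than a uniform grid of CNN cells, the attended feature `v_hat` aggregates evidence at the level of whole objects and salient regions.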

| Task | Dataset | Model | Metric name | Metric value | Global rank |
|---|---|---|---|---|---|
| Visual Question Answering | COCO Visual Question Answering (VQA) real images 2.0, open-ended | Up-Down | Percentage correct | 70.34 | #1 |