TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Visual Question Answering (VQA)	GQA Test2019	BottomUp	Accuracy	49.74	# 107
Visual Question Answering (VQA)	GQA Test2019	BottomUp	Binary	66.64	# 108
Visual Question Answering (VQA)	GQA Test2019	BottomUp	Open	34.83	# 109
Visual Question Answering (VQA)	GQA Test2019	BottomUp	Consistency	78.71	# 110
Visual Question Answering (VQA)	GQA Test2019	BottomUp	Plausibility	84.57	# 65
Visual Question Answering (VQA)	GQA Test2019	BottomUp	Validity	96.18	# 79
Visual Question Answering (VQA)	GQA Test2019	BottomUp	Distribution	5.98	# 61
Visual Question Answering (VQA)	VQA v2 test-std	Up-Down	overall	70.34	# 29

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

CVPR 2018 · Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
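The division of labour described in the abstract (region proposals with feature vectors from the bottom-up stage, feature weightings from the top-down stage) can be illustrated with a small sketch. This is not the authors' implementation: the additive `tanh` scoring function, the parameter names `W_v`, `W_c`, `w_a`, the tiny dimensions, and the random inputs are all assumptions chosen for illustration, and the region features stand in for what a detector such as Faster R-CNN would supply.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def top_down_attention(regions, context, W_v, W_c, w_a):
    """Weight region features by their relevance to a context vector.

    regions: k feature vectors, one per proposed image region (bottom-up output)
    context: task-specific vector, e.g. an encoding of the question (assumed)
    Returns (weights, attended): k softmax weights and the weighted-sum feature.
    """
    projected_context = matvec(W_c, context)
    scores = []
    for v in regions:
        projected_region = matvec(W_v, v)
        # Additive scoring (an assumption): w_a . tanh(W_v v + W_c c)
        hidden = [math.tanh(a + b)
                  for a, b in zip(projected_region, projected_context)]
        scores.append(sum(w * h for w, h in zip(w_a, hidden)))

    # Softmax over regions, shifted by the max score for numerical stability
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attended feature: weighted sum of the region vectors
    dim = len(regions[0])
    attended = [sum(w * v[j] for w, v in zip(weights, regions))
                for j in range(dim)]
    return weights, attended


# Hypothetical sizes and random values, for demonstration only
random.seed(0)
k, d_v, d_c, d_h = 5, 8, 4, 6
regions = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(k)]
context = [random.gauss(0, 1) for _ in range(d_c)]
W_v = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(d_h)]
W_c = [[random.gauss(0, 1) for _ in range(d_c)] for _ in range(d_h)]
w_a = [random.gauss(0, 1) for _ in range(d_h)]

weights, attended = top_down_attention(regions, context, W_v, W_c, w_a)
print([round(w, 3) for w in weights])
```

The weights are non-negative and sum to one, so the attended feature is a convex combination of the region vectors. In a trained model the projection parameters would be learned rather than sampled at random.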

Code

Repository	Stars
peteanderson80/bottom-up-attention (official)	1,404
facebookresearch/mmf	5,415
ruotianluo/neuraltalk2.pytorch	1,411
ruotianluo/ImageCaptioning.pytorch	1,411
ruotianluo/self-critical.pytorch	984

See all 65 implementations

Tasks

Image Captioning

Visual Question Answering

Visual Question Answering (VQA)

Datasets

Visual Question Answering

GQA

Visual Question Answering v2.0

Results from the Paper

Ranked #29 on Visual Question Answering (VQA) on VQA v2 test-std
