TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Phrase Grounding	Flickr30k Entities Test	BAN (Bottom-Up detector)	R@1	69.69	# 11
Phrase Grounding	Flickr30k Entities Test	BAN (Bottom-Up detector)	R@10	86.35	# 5
Phrase Grounding	Flickr30k Entities Test	BAN (Bottom-Up detector)	R@5	84.22	# 5
Visual Question Answering (VQA)	VQA v2 test-dev	BAN+Glove+Counter	Accuracy	70.04	# 31
Visual Question Answering (VQA)	VQA v2 test-std	BAN+Glove+Counter	overall	70.4	# 28

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/bilinear-attention-networks/phrase-grounding-on-flickr30k-entities-test)](https://paperswithcode.com/sota/phrase-grounding-on-flickr30k-entities-test?p=bilinear-attention-networks)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/bilinear-attention-networks/visual-question-answering-on-vqa-v2-test-std)](https://paperswithcode.com/sota/visual-question-answering-on-vqa-v2-test-std?p=bilinear-attention-networks)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/bilinear-attention-networks/visual-question-answering-on-vqa-v2-test-dev)](https://paperswithcode.com/sota/visual-question-answering-on-vqa-v2-test-dev?p=bilinear-attention-networks)`

Bilinear Attention Networks

NeurIPS 2018 · Jin-Hwa Kim, Jaehyun Jun, Byoung-Tak Zhang

Attention networks in multimodal learning provide an efficient way to utilize given visual information selectively. However, the computational cost of learning attention distributions for every pair of multimodal input channels is prohibitively expensive. To solve this problem, co-attention builds two separate attention distributions, one for each modality, neglecting the interaction between multimodal inputs. In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions among two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit the eight attention maps of the BAN efficiently. We quantitatively and qualitatively evaluate our model on the visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves a new state of the art on both datasets.
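
The idea in the abstract, a single attention distribution over every pair of input channels combined with low-rank bilinear pooling, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the tensor shapes, the rank `K`, and the use of plain linear projections (no nonlinearities, dropout, or multiple attention maps) are all assumptions made for illustration.

```python
import numpy as np


def softmax_all(logits):
    """Softmax over every entry of a 2-D array, so the whole map sums to 1."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def bilinear_attention_map(X, Y, U, V, p):
    """Attention over all (x-channel, y-channel) pairs via a low-rank bilinear form.

    Assumed shapes (hypothetical, for illustration only):
      X: (n_x, d_x)  e.g. question token features
      Y: (n_y, d_y)  e.g. image region features
      U: (d_x, K), V: (d_y, K), p: (K,)
    """
    XU = X @ U                 # (n_x, K) projected x-channels
    YV = Y @ V                 # (n_y, K) projected y-channels
    logits = (XU * p) @ YV.T   # (n_x, n_y): sum_k p_k * XU[i, k] * YV[j, k]
    return softmax_all(logits)


def attended_joint_feature(X, Y, U, V, A):
    """Joint feature f_k = sum_ij A[i, j] * (X U)[i, k] * (Y V)[j, k]."""
    XU = X @ U
    YV = Y @ V
    return np.einsum("ik,ij,jk->k", XU, A, YV)


# Toy usage with random inputs; the sizes below are arbitrary choices.
rng = np.random.default_rng(0)
n_x, d_x, n_y, d_y, K = 14, 8, 36, 16, 4
X = rng.normal(size=(n_x, d_x))
Y = rng.normal(size=(n_y, d_y))
U, V = rng.normal(size=(d_x, K)), rng.normal(size=(d_y, K))
p = rng.normal(size=K)

A = bilinear_attention_map(X, Y, U, V, p)
f = attended_joint_feature(X, Y, U, V, A)
print(A.shape, f.shape)
```

The low-rank factorization is the point of the sketch: the pairwise logits are computed from two `K`-dimensional projections rather than from a full `d_x × d_y` weight matrix per output, which is what keeps attention over every channel pair affordable.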

Code

Repository	GitHub stars
jnhwkim/ban-vqa (official)	534
facebookresearch/mmf	5,413
facebookresearch/pythia	5,413
Cyanogenoid/vqa-counting	200
ZephyrZhuQi/ssbaseline	

See all 8 implementations

Tasks

Visual Question Answering

Visual Question Answering (VQA)

Datasets

Visual Question Answering

Visual Genome

Flickr30k

Visual Question Answering v2.0

Flickr30K Entities

Results from the Paper

Ranked #11 on Phrase Grounding on Flickr30k Entities Test
