Dual Attention Networks for Visual Reference Resolution in Visual Dialog

IJCNLP 2019 · Gi-Cheon Kang, Jaeseo Lim, Byoung-Tak Zhang

Visual dialog (VisDial) is a task that requires an AI agent to answer a series of questions grounded in an image. Unlike in visual question answering (VQA), the agent must capture temporal context from the dialog history and exploit visually grounded information. These challenges come together in a problem called visual reference resolution, which requires the agent to resolve ambiguous references in a given question and to locate those references in a given image. In this paper, we propose Dual Attention Networks (DAN) for visual reference resolution. DAN consists of two kinds of attention networks, REFER and FIND. Specifically, the REFER module learns latent relationships between a given question and the dialog history by employing a self-attention mechanism. The FIND module takes image features and reference-aware representations (i.e., the output of the REFER module) as input and performs visual grounding via a bottom-up attention mechanism. We evaluate our model qualitatively and quantitatively on the VisDial v1.0 and v0.9 datasets, showing that DAN outperforms the previous state-of-the-art model by a significant margin.
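
The abstract describes a two-module attention architecture: REFER attends from the question over the dialog history to produce a reference-aware representation, and FIND uses that representation to attend over bottom-up image region features. Below is a minimal PyTorch sketch of that idea; the layer sizes, tensor shapes, and the use of nn.MultiheadAttention are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the REFER/FIND idea described in the abstract.
# Shapes, hidden sizes, and module details are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReferModule(nn.Module):
    """Attends from the question over the dialog history (multi-head attention)."""
    def __init__(self, dim=512, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question, history):
        # question: (B, 1, D) pooled question embedding
        # history:  (B, T, D) one embedding per previous dialog round
        attended, _ = self.attn(question, history, history)
        # residual connection yields a "reference-aware" question representation
        return self.norm(question + attended)


class FindModule(nn.Module):
    """Grounds the reference-aware representation in bottom-up region features."""
    def __init__(self, dim=512, img_dim=2048):
        super().__init__()
        self.proj = nn.Linear(img_dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, ref_aware, img_feats):
        # ref_aware: (B, 1, D); img_feats: (B, R, img_dim) region features
        regions = self.proj(img_feats)                         # (B, R, D)
        logits = self.score(torch.tanh(regions + ref_aware))   # (B, R, 1)
        weights = F.softmax(logits, dim=1)                     # attention over regions
        return (weights * regions).sum(dim=1)                  # (B, D) attended visual feature


if __name__ == "__main__":
    B, T, R, D = 2, 10, 36, 512
    refer, find = ReferModule(D), FindModule(D)
    q = torch.randn(B, 1, D)       # question embedding
    h = torch.randn(B, T, D)       # dialog-history embeddings
    v = torch.randn(B, R, 2048)    # bottom-up region features (e.g., from Faster R-CNN)
    context = find(refer(q, h), v)
    print(context.shape)           # torch.Size([2, 512])
```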


Results from the Paper


Task           Dataset                      Model  Metric        Value  Global Rank
Visual Dialog  VisDial v0.9 val             DAN    MRR           66.38  #2
Visual Dialog  VisDial v0.9 val             DAN    Mean Rank     4.04   #6
Visual Dialog  VisDial v0.9 val             DAN    R@1           53.33  #5
Visual Dialog  VisDial v0.9 val             DAN    R@10          90.38  #6
Visual Dialog  VisDial v0.9 val             DAN    R@5           82.42  #6
Visual Dialog  Visual Dialog v1.0 test-std  DAN    NDCG (x 100)  57.59  #55
Visual Dialog  Visual Dialog v1.0 test-std  DAN    MRR (x 100)   63.2   #32
Visual Dialog  Visual Dialog v1.0 test-std  DAN    R@1           49.63  #31
Visual Dialog  Visual Dialog v1.0 test-std  DAN    R@5           79.75  #35
Visual Dialog  Visual Dialog v1.0 test-std  DAN    R@10          89.35  #34
Visual Dialog  Visual Dialog v1.0 test-std  DAN    Mean Rank     4.3    #45
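
The table reports the standard VisDial retrieval metrics, which are computed from the rank of the ground-truth answer among the 100 candidate answers for each question (NDCG on v1.0 additionally uses dense human relevance scores and is omitted here). The sketch below shows how MRR, R@k, and mean rank are typically computed; the function name is illustrative and this is not the official evaluation script.

```python
# Sketch of the standard VisDial retrieval metrics, computed from the 1-based
# rank of the ground-truth answer among the 100 candidates for each question.
# Function and variable names are illustrative assumptions.
from typing import Sequence


def retrieval_metrics(gt_ranks: Sequence[int]) -> dict:
    n = len(gt_ranks)
    return {
        "MRR": sum(1.0 / r for r in gt_ranks) / n,        # mean reciprocal rank
        "R@1": sum(r <= 1 for r in gt_ranks) / n,          # recall at k = 1
        "R@5": sum(r <= 5 for r in gt_ranks) / n,
        "R@10": sum(r <= 10 for r in gt_ranks) / n,
        "Mean Rank": sum(gt_ranks) / n,
    }


# Example: three questions whose ground-truth answers were ranked 1, 4, and 20.
print(retrieval_metrics([1, 4, 20]))
```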

Methods


No methods listed for this paper.