Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

Following a navigation instruction such as 'Walk down the stairs and stop at the brown sofa' requires embodied AI agents to ground scene elements referenced via language (e.g. 'stairs') to visual content in the environment (pixels corresponding to 'stairs'). We ask the following question -- can we leverage abundant 'disembodied' web-scraped vision-and-language corpora (e.g. Conceptual Captions) to learn visual groundings (what do 'stairs' look like?) that improve performance on a relatively data-starved embodied perception task (Vision-and-Language Navigation)? Specifically, we develop VLN-BERT, a visiolinguistic transformer-based model for scoring the compatibility between an instruction ('...stop at the brown sofa') and a sequence of panoramic RGB images captured by the agent. We demonstrate that pretraining VLN-BERT on image-text pairs from the web before fine-tuning on embodied path-instruction data significantly improves performance on VLN -- outperforming the prior state-of-the-art in the fully-observed setting by 4 absolute percentage points on success rate. Ablations of our pretraining curriculum show each stage to be impactful -- with their combination resulting in further positive synergistic effects.
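At its core, VLN-BERT scores how compatible an instruction is with a candidate path of panoramic images, and the highest-scoring path is selected. Below is a minimal PyTorch-style sketch of that scoring interface under stated assumptions; the class and function names (CompatibilityScorer, forward arguments, feature dimensions) are illustrative placeholders, not the authors' actual architecture or API, which is ViLBERT-based.

```python
# Minimal sketch of instruction-path compatibility scoring in the spirit of
# VLN-BERT. All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class CompatibilityScorer(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=768):
        super().__init__()
        # Project panoramic image features and text token features into a
        # shared embedding space.
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # A small cross-modal transformer stands in for the two-stream
        # ViLBERT-style encoder used in the paper.
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score_head = nn.Linear(hidden_dim, 1)

    def forward(self, instr_tokens, path_images):
        # instr_tokens: (B, L_t, text_dim) pre-extracted instruction features
        # path_images:  (B, L_v, image_dim) panorama features along a path
        tokens = torch.cat([self.text_proj(instr_tokens),
                            self.image_proj(path_images)], dim=1)
        fused = self.encoder(tokens)
        # Pool and score: higher means instruction and path are more compatible.
        return self.score_head(fused.mean(dim=1)).squeeze(-1)

# Usage: score candidate paths for one instruction and pick the best.
scorer = CompatibilityScorer()
instr = torch.randn(4, 20, 768)   # 4 candidate paths, shared instruction
paths = torch.randn(4, 7, 2048)   # 7 panoramas per candidate path
best = scorer(instr, paths).argmax()
```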

Published at ECCV 2020.
Task: Vision and Language Navigation
Dataset: VLN Challenge
Model: VLN-BERT

Metric            Value     Global Rank
success           0.73      #6
length            686.62    #8
error             3.09      #138
oracle success    0.99      #2
spl               0.01      #134
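For reference, success on the VLN Challenge counts episodes where the agent stops within 3 m of the goal, and SPL (success weighted by path length, Anderson et al. 2018) discounts success by the ratio of shortest-path length to distance actually travelled; the very long beam-search trajectories (length ~687 m) are why SPL collapses to 0.01 despite the high success rate. A short reference sketch of the SPL computation, assuming per-episode inputs (this is not the evaluation server's code):

```python
# SPL = (1/N) * sum_i  S_i * l_i / max(p_i, l_i)
# S_i: success indicator, l_i: shortest-path length, p_i: length travelled.
def spl(successes, shortest_lengths, traveled_lengths):
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, traveled_lengths):
        total += s * l / max(p, l)
    return total / len(successes)

# With trajectories hundreds of meters long, l / max(p, l) approaches zero
# even when the episode succeeds, which drives SPL toward 0.01.
print(spl([1, 1, 0], [10.0, 12.0, 8.0], [700.0, 650.0, 500.0]))
```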

Methods


No methods listed for this paper.