Towards VQA Models That Can Read

CVPR 2019 · Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, Marcus Rohrbach

Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today's VQA models cannot read! Our paper takes a first step towards addressing this problem. First, we introduce a new "TextVQA" dataset to facilitate progress on this important problem. Existing datasets either have a small proportion of questions about text (e.g., the VQA dataset) or are too small (e.g., the VizWiz dataset). TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. Second, we introduce a novel model architecture that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or composed of the strings found in the image. Consequently, we call our approach Look, Read, Reason & Answer (LoRRA). We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset. We find that the gap between human performance and machine performance is significantly larger on TextVQA than on VQA 2.0, suggesting that TextVQA is well-suited to benchmark progress along directions complementary to VQA 2.0.
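
The abstract says the predicted answer may be either a deduction or a string found in the image. One way to picture that, as a rough sketch only and not the authors' implementation, is to score a fixed answer vocabulary together with the OCR tokens read from the image and return the best-scoring candidate. The function name `pick_answer`, its parameters, and the scores in the usage note are hypothetical and chosen purely for illustration.

```python
from typing import Sequence


def pick_answer(
    vocab: Sequence[str],
    vocab_scores: Sequence[float],
    ocr_tokens: Sequence[str],
    ocr_scores: Sequence[float],
) -> str:
    """Return the top-scoring candidate from a fixed vocabulary
    extended with the OCR tokens found in one image (illustrative only)."""
    if len(vocab) != len(vocab_scores) or len(ocr_tokens) != len(ocr_scores):
        raise ValueError("each candidate needs exactly one score")

    # The answer space is the fixed vocabulary followed by this image's OCR tokens.
    candidates = list(vocab) + list(ocr_tokens)
    scores = list(vocab_scores) + list(ocr_scores)
    if not candidates:
        raise ValueError("no candidates to choose from")

    # Pick the index with the highest score across both sources.
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]
```

For example, with the made-up scores `pick_answer(["yes", "no"], [0.1, 0.2], ["stop"], [0.9])`, the OCR token `"stop"` wins over both vocabulary entries, which corresponds to copying a string from the image.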

Code

facebookresearch/pythia (official, 5,415 stars)

facebookresearch/mmf (5,415 stars)

ZephyrZhuQi/ssbaseline

ronghanghu/pythia

allenai/pythia

Tasks

Visual Question Answering (VQA)

Datasets

Introduced in the Paper:

TextVQA

Used in the Paper:

Visual Question Answering

Visual Question Answering v2.0

VizWiz

DVQA

Results from the Paper

Ranked #3 on Visual Question Answering (VQA) on VizWiz 2018

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Visual Question Answering (VQA)	VizWiz 2018	Pythia v0.3	overall	54.72	#3
Visual Question Answering (VQA)	VQA v2 test-dev	Pythia v0.3 + LoRRA	Accuracy	69.21	#34
