Understanding visually situated language requires recognizing text and visual elements, and interpreting complex layouts.
Internet links enable users to deepen their understanding of a topic by providing convenient access to related information.
To study the ability of retrieval systems to meet such information needs, we construct QUEST, a dataset of 3,357 natural language queries with implicit set operations, each mapping to a set of entities corresponding to Wikipedia documents.
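Since the answers here are entity sets rather than single documents, a natural way to score a system is set-level precision, recall, and F1 against the gold entity set. A minimal sketch, with purely illustrative entity names (not from the dataset):

```python
# Hedged sketch: set-level evaluation for queries whose answers are entity sets.
# Entity names below are placeholders, not examples from QUEST.

def set_f1(predicted: set, gold: set):
    """Precision, recall, and F1 between a predicted and a gold entity set."""
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    overlap = len(predicted & gold)
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A query with an implicit intersection combines two underlying entity sets.
category_a = {"Entity 1", "Entity 2", "Entity 3"}  # entities matching the first condition
category_b = {"Entity 2", "Entity 4"}              # entities matching the second condition
gold = category_a & category_b                     # implicit intersection: {"Entity 2"}

predicted = {"Entity 2", "Entity 3"}
print(set_f1(predicted, gold))
```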
Large-scale multi-modal pre-trained models such as CLIP and PaLI exhibit strong generalization on various visual domains and tasks.
Visually-situated language is ubiquitous: sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms.
Meanwhile, recent work has shown considerable improvements on many NLP tasks from model scaling.
Generic unstructured neural networks have been shown to struggle on out-of-distribution compositional generalization.
Zero-shot cross-lingual transfer is emerging as a practical solution: pre-trained models that are later fine-tuned on one transfer language exhibit surprisingly strong performance when tested on many target languages.
We study multi-answer retrieval, an under-explored problem that requires retrieving passages to cover multiple distinct answers for a given question.
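In this setting, what matters is how many of the distinct gold answers the retrieved passages cover, not just whether one correct passage appears. A hedged sketch of such a coverage-style metric (function and variable names are illustrative, not the paper's):

```python
# Hedged sketch: answer coverage at k for multi-answer retrieval.
# Counts the fraction of distinct gold answers that appear in the top-k retrieved passages.

def answer_coverage_at_k(retrieved_passages: list, gold_answers: set, k: int) -> float:
    top_k_text = " ".join(retrieved_passages[:k]).lower()
    covered = {ans for ans in gold_answers if ans.lower() in top_k_text}
    return len(covered) / len(gold_answers) if gold_answers else 0.0

passages = [
    "... the capital of Turkey moved to Ankara in 1923 ...",
    "... Istanbul served as the Ottoman capital ...",
]
print(answer_coverage_at_k(passages, {"Ankara", "Istanbul"}, k=2))  # 1.0
```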
Tables in Web documents are pervasive and can be directly used to answer many of the queries searched on the Web, motivating their integration in question answering.
This has motivated new specialized architectures with stronger compositional biases, but most of these approaches have only been evaluated on synthetically-generated datasets, which are not representative of natural language variation.
We address the problem of extractive question answering using document-level distant supervision, pairing questions and relevant documents with answer strings.
Dual encoders perform retrieval by encoding documents and queries into dense low-dimensional vectors, scoring each document by its inner product with the query.
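A minimal sketch of that scoring scheme, with placeholder encoders standing in for the learned query and document towers (not any particular paper's model):

```python
import numpy as np

# Hedged sketch of dual-encoder retrieval: encode documents and queries into fixed-size
# vectors, score by inner product, and rank. encode_query / encode_document are
# placeholders for learned encoders.

rng = np.random.default_rng(0)
DIM = 768

def encode_document(text: str) -> np.ndarray:
    return rng.standard_normal(DIM)  # placeholder for a learned document tower

def encode_query(text: str) -> np.ndarray:
    return rng.standard_normal(DIM)  # placeholder for a learned query tower

documents = ["doc one ...", "doc two ...", "doc three ..."]
doc_matrix = np.stack([encode_document(d) for d in documents])  # (num_docs, DIM)

query_vec = encode_query("some query")
scores = doc_matrix @ query_vec     # inner product of each document with the query
ranking = np.argsort(-scores)       # highest score first
print([documents[i] for i in ranking])
```

In practice the document vectors are precomputed and indexed, so retrieval reduces to a (approximate) maximum inner product search over this matrix.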
We present a method to represent input texts by contextualizing them jointly with dynamically retrieved textual encyclopedic background knowledge from multiple documents.
Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training.
First, we show that strong reading comprehension models pre-trained on large unlabeled data can be used to generalize to unseen entities.
The public release consists of 307,373 training examples with single annotations, 7,830 examples with 5-way annotations for development data, and a further 7,842 5-way annotated examples sequestered as test data.
We show for the first time that it is possible to jointly learn the retriever and reader from question-answer string pairs and without any IR system.
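One way to read this setup is that the retrieved evidence is treated as a latent variable: the answer likelihood is marginalized over candidate passages, so only question-answer pairs are needed as supervision. A hedged sketch of that marginalization (placeholder scores, not the paper's exact model):

```python
import torch

# Hedged sketch of learning with retrieval as a latent variable:
#   P(answer | question) = sum_z P(z | question) * P(answer | question, z)
# where z ranges over retrieved passages. Scores below are illustrative placeholders.

retrieval_logits = torch.tensor([1.2, 0.3, -0.5])    # retriever score for each candidate passage
answer_log_probs = torch.tensor([-0.7, -2.1, -3.0])  # log P(answer | question, passage)

log_p_passage = torch.log_softmax(retrieval_logits, dim=0)
loss = -torch.logsumexp(log_p_passage + answer_log_probs, dim=0)  # -log P(answer | question)
print(loss.item())
```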
In this paper we study yes/no questions that are naturally occurring, meaning that they are generated in unprompted and unconstrained settings.
Hierarchical neural architectures are often used to capture long-distance dependencies and have been applied to many document-level tasks such as summarization, document segmentation, and sentiment analysis.
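A minimal sketch of this idea, assuming a simple two-level recurrent design (a word-level encoder per sentence feeding a sentence-level encoder over the document); dimensions and names are illustrative, not tied to any specific paper:

```python
import torch
import torch.nn as nn

# Hedged sketch: a two-level hierarchical document encoder. A word-level GRU encodes each
# sentence into a vector; a sentence-level GRU then encodes the sequence of sentence vectors.

class HierarchicalEncoder(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.word_rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.sent_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, doc_token_ids: torch.Tensor) -> torch.Tensor:
        # doc_token_ids: (num_sentences, max_sentence_len), one document per call
        embedded = self.embedding(doc_token_ids)        # (num_sents, sent_len, emb_dim)
        _, word_h = self.word_rnn(embedded)             # (1, num_sents, hidden_dim)
        sent_vecs = word_h.squeeze(0).unsqueeze(0)      # (1, num_sents, hidden_dim)
        _, doc_h = self.sent_rnn(sent_vecs)             # (1, 1, hidden_dim)
        return doc_h.squeeze(0).squeeze(0)              # (hidden_dim,) document vector

encoder = HierarchicalEncoder(vocab_size=30522)
doc = torch.randint(0, 30522, (4, 20))  # 4 sentences of 20 token ids each
print(encoder(doc).shape)               # torch.Size([256])
```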
We study approaches to improving fine-grained short answer question answering models by integrating coarse-grained data annotated for paragraph-level relevance, and show that coarsely annotated data can bring significant performance gains.
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers.
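For illustration only, here is how a publicly released BERT checkpoint can be loaded through the Hugging Face transformers library to obtain contextual token representations; this is a common way to use the model, not the original paper's codebase:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a standard public BERT checkpoint and run one sentence through it.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Bidirectional encoders read the whole sentence at once.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per (sub)word token, including the [CLS] and [SEP] markers.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 12, 768])
```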
Grammatical error correction (GEC) systems strive to correct both global errors in word order and usage, and local errors in spelling and inflection.
In addition, it contains automatically produced annotations of named entities, part-of-speech tags, and syntactic parses for the same queries.