This has led to the development of various editing methods that allow updating facts encoded by the model.
Modern systems for multi-hop question answering (QA) typically break questions into a sequence of reasoning steps, termed chain-of-thought (CoT), before arriving at a final answer.
Our results highlight the need for developing ODQA models that handle a broad range of question types, including single and multi-answer questions.
Constructing benchmarks that test the abilities of modern natural language understanding models is difficult: pre-trained language models exploit artifacts in benchmarks to achieve human parity, but still fail on adversarial examples and make errors that demonstrate a lack of common sense.
NLP benchmarks have largely focused on short texts, such as sentences and paragraphs, even though long texts comprise a considerable amount of natural language in the wild.
Models pre-trained with a language modeling objective possess ample world knowledge and language skills, but are known to struggle in tasks that require reasoning.
When answering complex questions, people can seamlessly combine information from visual, textual and tabular sources.