no code implementations • Findings (EMNLP) 2021 • Faeze Brahman, Meng Huang, Oyvind Tafjord, Chao Zhao, Mrinmaya Sachan, Snigdha Chaturvedi
When reading a literary piece, readers often make inferences about various characters’ roles, personalities, relationships, intents, actions, etc.
no code implementations • EMNLP 2020 • Vered Shwartz, Rachel Rudinger, Oyvind Tafjord
Pre-trained language models (LMs) may perpetuate biases originating in their training corpus to downstream models.
no code implementations • 15 Apr 2025 • Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge
Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs.
no code implementations • 20 Mar 2025 • Peter Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Daniel S. Weld, Peter Clark
Despite the surge of interest in autonomous scientific discovery (ASD) of software artifacts (e. g., improved ML algorithms), current ASD systems face two key limitations: (1) they largely explore variants of existing codebases or similarly constrained design spaces, and (2) they produce large volumes of research artifacts (such as automatically generated papers and code) that are typically evaluated using conference-style paper review with limited evaluation of code.
3 code implementations • 31 Dec 2024 • Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, Hannaneh Hajishirzi
Our modified model architecture and training recipe achieve both better training stability and improved per-token efficiency.
1 code implementation • 23 Dec 2024 • Nathaniel Weir, Bhavana Dalvi Mishra, Orion Weller, Oyvind Tafjord, Sam Hornstein, Alexander Sabol, Peter Jansen, Benjamin Van Durme, Peter Clark
We show that, when added to a general corpus (e. g., Wikipedia), microtheories can supply critical, topical information not necessarily present in the corpus, improving both a model's ability to ground its answers to verifiable knowledge (i. e., show how answers are systematically entailed by documents in the corpus, fully grounding up to +8% more answers), and the accuracy of those grounded answers (up to +8% absolute).
no code implementations • 5 Dec 2024 • Akshita Bhagia, Jiacheng Liu, Alexander Wettig, David Heineman, Oyvind Tafjord, Ananya Harsh Jha, Luca Soldaini, Noah A. Smith, Dirk Groeneveld, Pang Wei Koh, Jesse Dodge, Hannaneh Hajishirzi
We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting.
1 code implementation • 22 Nov 2024 • Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh Hajishirzi
Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones.
no code implementations • 17 Oct 2024 • Yuling Gu, Oyvind Tafjord, Hyunwoo Kim, Jared Moore, Ronan Le Bras, Peter Clark, Yejin Choi
"), and (c) judgment ("Mary paid for the chips.
2 code implementations • 3 Sep 2024 • Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, Hannaneh Hajishirzi
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE).
no code implementations • 21 Jul 2024 • Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov, Hannaneh Hajishirzi, Ashish Sabharwal
Multiple-choice question answering (MCQA) is a key competence of performant transformer language models that is tested by mainstream benchmarks.
no code implementations • 12 Jun 2024 • Yuling Gu, Oyvind Tafjord, Bailey Kuehl, Dany Haddad, Jesse Dodge, Hannaneh Hajishirzi
Evaluating language models in particular is challenging, as small changes to how a model is evaluated on a task can lead to large changes in measured performance.
1 code implementation • 10 Jun 2024 • Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, Peter Clark
However, developing and evaluating an AI agent's capacity for end-to-end scientific reasoning is challenging as running real-world experiments is often prohibitively expensive or infeasible.
no code implementations • 22 Feb 2024 • Nathaniel Weir, Kate Sanders, Orion Weller, Shreya Sharma, Dongwei Jiang, Zhengping Jiang, Bhavana Dalvi Mishra, Oyvind Tafjord, Peter Jansen, Peter Clark, Benjamin Van Durme
Recent language models enable new opportunities for structured reasoning with text, such as the construction of intuitive, proof-like textual entailment trees without relying on brittle formal logic.
3 code implementations • 1 Feb 2024 • Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hannaneh Hajishirzi
Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs.
1 code implementation • 31 Jan 2024 • Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo
As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations.
no code implementations • 16 Dec 2023 • Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah A. Smith, Kyle Richardson, Jesse Dodge
Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training.
1 code implementation • 15 Dec 2023 • Dirk Groeneveld, Anas Awadalla, Iz Beltagy, Akshita Bhagia, Ian Magnusson, Hao Peng, Oyvind Tafjord, Pete Walsh, Kyle Richardson, Jesse Dodge
The success of large language models has shifted the evaluation paradigms in natural language processing (NLP).
no code implementations • 12 Dec 2023 • Peter Clark, Bhavana Dalvi Mishra, Oyvind Tafjord
This shows the clear progression of models towards improved factual accuracy and entailment reasoning, and the dataset provides a new benchmark that more cleanly separates and quantifies these two notions.
no code implementations • 16 Nov 2023 • Yuling Gu, Oyvind Tafjord, Peter Clark
While LLMs can provide reasoned explanations along with their answers, the nature and quality of those explanations are still poorly understood.
no code implementations • 16 Oct 2023 • Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, Peter Clark
Language agents have shown some ability to interact with an external environment, e. g., a virtual world such as ScienceWorld, to perform complex tasks, e. g., growing a plant, without the startup costs of reinforcement learning.
1 code implementation • 24 May 2023 • Sarah Wiegreffe, Matthew Finlayson, Oyvind Tafjord, Peter Clark, Ashish Sabharwal
For example, both normalization and prompting methods for reducing SFC can be ineffective or even detrimental to task performance for some LMs.
no code implementations • 23 May 2023 • Nora Kassner, Oyvind Tafjord, Ashish Sabharwal, Kyle Richardson, Hinrich Schuetze, Peter Clark
To address this, our goals are to make model beliefs and their inferential relationships explicit, and to resolve inconsistencies that may exist, so that answers are supported by interpretable chains of reasoning drawn from a consistent network of beliefs.
1 code implementation • 31 Oct 2022 • Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, Ashwin Kalyan
Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling.
Ranked #1 on
Mathematical Reasoning
on Lila (OOD)
no code implementations • 21 Oct 2022 • Oyvind Tafjord, Bhavana Dalvi Mishra, Peter Clark
Our goal is a question-answering (QA) system that can show how its answers are implied by its own internal beliefs via a systematic chain of reasoning.
1 code implementation • 20 Sep 2022 • Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, Ashwin Kalyan
We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions.
Ranked #6 on
Science Question Answering
on ScienceQA
no code implementations • 27 Apr 2022 • Bhavana Dalvi Mishra, Oyvind Tafjord, Peter Clark
Our goal is a teachable reasoning system for question-answering (QA), where a user can interact with faithful answer explanations, and correct its errors so that the system improves over time.
no code implementations • EMNLP 2021 • Nora Kassner, Oyvind Tafjord, Hinrich Schütze, Peter Clark
We show that, in a controlled experimental setting, these two mechanisms result in more consistent beliefs in the overall system, improving both the accuracy and consistency of its answers over time.
no code implementations • 12 Sep 2021 • Faeze Brahman, Meng Huang, Oyvind Tafjord, Chao Zhao, Mrinmaya Sachan, Snigdha Chaturvedi
When reading a literary piece, readers often make inferences about various characters' roles, personalities, relationships, intents, actions, etc.
2 code implementations • 6 Sep 2021 • Oyvind Tafjord, Peter Clark
Despite the successes of pretrained language models, there are still few high-quality, general-purpose QA systems that are freely available.
1 code implementation • EMNLP 2021 • Bhavana Dalvi, Peter Jansen, Oyvind Tafjord, Zhengnan Xie, Hannah Smith, Leighanna Pipatanangkura, Peter Clark
Our approach is to generate explanations in the form of entailment trees, namely a tree of multipremise entailment steps from facts that are known, through intermediate conclusions, to the hypothesis of interest (namely the question + answer).
no code implementations • 16 Apr 2021 • Nora Kassner, Oyvind Tafjord, Hinrich Schutze, Peter Clark
(This is an old and now obsolete draft.
no code implementations • 5 Feb 2021 • Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Peter Clark
We present the ARC-DA dataset, a direct-answer ("open response", "freeform") version of the ARC (AI2 Reasoning Challenge) multiple-choice dataset.
no code implementations • Findings (ACL) 2021 • Oyvind Tafjord, Bhavana Dalvi Mishra, Peter Clark
In this work we show that a generative model, called ProofWriter, can reliably generate both implications of a theory and the natural language proof(s) that support them.
1 code implementation • NeurIPS 2020 • Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, Jonathan Berant
In this work, we provide a first demonstration that LMs can be trained to reliably perform systematic reasoning combining both implicit, pre-trained knowledge and explicit natural language statements.
2 code implementations • Findings of the Association for Computational Linguistics 2020 • Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, Hannaneh Hajishirzi
As evidence, we use the latest advances in language modeling to build a single pre-trained QA model, UnifiedQA, that performs surprisingly well across 17 QA datasets spanning 4 diverse formats.
Ranked #5 on
Common Sense Reasoning
on WinoGrande
1 code implementation • 6 Apr 2020 • Vered Shwartz, Rachel Rudinger, Oyvind Tafjord
Pre-trained language models (LMs) may perpetuate biases originating in their training corpus to downstream models.
2 code implementations • 14 Feb 2020 • Peter Clark, Oyvind Tafjord, Kyle Richardson
However, expressing the knowledge in a formal (logical or probabilistic) representation has been a major obstacle to this research.
1 code implementation • ACL 2020 • Lucy Lu Wang, Oyvind Tafjord, Arman Cohan, Sarthak Jain, Sam Skjonsberg, Carissa Schoenick, Nick Botner, Waleed Ammar
We fine-tune the contextualized word representations of the RoBERTa language model using labeled DDI data, and apply the fine-tuned model to identify supplement interactions.
no code implementations • IJCNLP 2019 • Oyvind Tafjord, Matt Gardner, Kevin Lin, Peter Clark
QuaRTz contains general qualitative statements, e. g., "A sunscreen with a higher SPF protects the skin longer.
no code implementations • 4 Sep 2019 • Peter Clark, Oren Etzioni, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon, Sumithra Bhakthavatsalam, Dirk Groeneveld, Michal Guerquin, Michael Schmitz
This paper reports unprecedented success on the Grade 8 New York Regents Science Exam, where for the first time a system scores more than 90% on the exam's non-diagram, multiple choice (NDMC) questions.
no code implementations • WS 2019 • Kevin Lin, Oyvind Tafjord, Peter Clark, Matt Gardner
A system is presented a background passage containing at least one of these relations, a novel situation that uses this background, and questions that require reasoning about effects of the relationships in the background passage in the context of the situation.
1 code implementation • LREC 2020 • Dongfang Xu, Peter Jansen, Jaycie Martin, Zhengnan Xie, Vikas Yadav, Harish Tayyar Madabushi, Oyvind Tafjord, Peter Clark
Prior work has demonstrated that question classification (QC), recognizing the problem domain of a question, can help answer it more accurately.
1 code implementation • 1 May 2019 • Arindam Mitra, Peter Clark, Oyvind Tafjord, Chitta Baral
While in recent years machine learning (ML) based approaches have been the popular approach in developing end-to-end question answering systems, such systems often struggle when additional knowledge is needed to correctly answer the questions.
no code implementations • 20 Nov 2018 • Oyvind Tafjord, Peter Clark, Matt Gardner, Wen-tau Yih, Ashish Sabharwal
Many natural language questions require recognizing and reasoning with qualitative relationships (e. g., in science, economics, and medicine), but are challenging to answer with corpus-based methods.
2 code implementations • WS 2018 • Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, Luke Zettlemoyer
This paper describes AllenNLP, a platform for research on deep learning methods in natural language understanding.
1 code implementation • 14 Mar 2018 • Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord
We present a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering.
no code implementations • EMNLP 2016 • Jayant Krishnamurthy, Oyvind Tafjord, Aniruddha Kembhavi
Situated question answering is the problem of answering questions about an environment such as an image or diagram.
3 code implementations • 14 Apr 2016 • Carissa Schoenick, Peter Clark, Oyvind Tafjord, Peter Turney, Oren Etzioni
Given recent successes in AI (e. g., AlphaGo's victory against Lee Sedol in the game of GO), it's become increasingly important to assess: how close are AI systems to human-level intelligence?