no code implementations • 19 Dec 2024 • Simon Frieder, Jonas Bayer, Katherine M. Collins, Julius Berner, Jacob Loader, András Juhász, Fabian Ruehle, Sean Welleck, Gabriel Poesia, Ryan-Rhys Griffiths, Adrian Weller, Anirudh Goyal, Thomas Lukasiewicz, Timothy Gowers
The suite of datasets commonly used to train and evaluate the mathematical capabilities of AI-based mathematical copilots (primarily large language models) exhibits several shortcomings.
no code implementations • 9 Dec 2024 • Pranjal Aggarwal, Bryan Parno, Sean Welleck
Automated code generation with large language models has gained significant traction, but there remains no guarantee on the correctness of generated code.
1 code implementation • 4 Dec 2024 • Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, Graham Neubig
Given the increasing use of synthetic data in language model (LM) post-training, an LM's ability to generate high-quality data has become nearly as crucial as its ability to solve problems directly.
2 code implementations • 7 Oct 2024 • Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck
To this end, we study a new problem of automated proof optimization: rewriting a proof so that it is correct and optimizes for an arbitrary criterion, such as length or readability.
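To make the proof-optimization task concrete, here is a minimal sketch of a propose-and-verify loop, assuming hypothetical `propose_rewrite` (an LLM call) and `check_proof` (a proof-checker call) helpers that are not part of the paper's code, and using proof length as the optimization criterion:

```python
# A minimal sketch of a proof-optimization loop: repeatedly ask a model for a
# rewrite, keep only rewrites that still verify, and prefer shorter proofs.
# `propose_rewrite` and `check_proof` are hypothetical stand-ins.

def propose_rewrite(proof: str, criterion: str) -> str:
    """Hypothetical LLM call: return a candidate rewrite of `proof`."""
    raise NotImplementedError

def check_proof(proof: str) -> bool:
    """Hypothetical verifier call: True iff the rewritten proof still checks."""
    raise NotImplementedError

def optimize_proof(proof: str, criterion: str = "length", rounds: int = 8) -> str:
    """Keep the best correct rewrite found over several proposal rounds."""
    best, best_score = proof, len(proof)
    for _ in range(rounds):
        candidate = propose_rewrite(best, criterion)
        if check_proof(candidate) and len(candidate) < best_score:
            best, best_score = candidate, len(candidate)
    return best
```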
3 code implementations • 5 Aug 2024 • Jiewen Hu, Thomas Zhu, Sean Welleck
We introduce miniCTX, which tests a model's ability to prove formal mathematical theorems that depend on new context that is not seen during training.
no code implementations • 1 Aug 2024 • Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, Yiming Yang
As a first step towards understanding and designing compute-optimal inference methods, we studied cost-performance trade-offs for inference strategies such as greedy search, majority voting, best-of-$n$, weighted voting, and two different tree search algorithms, using different model sizes and compute budgets.
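To illustrate three of the sampling-based strategies compared in that study, here is a minimal sketch under a fixed sample budget; `sample_answer` is a hypothetical stand-in for drawing one answer from an LM together with a reward-model score:

```python
# Majority voting, best-of-n, and weighted voting over n sampled answers.
import random
from collections import Counter, defaultdict

def sample_answer(problem: str) -> tuple[str, float]:
    """Hypothetical: one sampled answer and its reward-model score (toy values)."""
    return random.choice(["42", "41", "42"]), random.random()

def majority_voting(problem: str, n: int) -> str:
    answers = [sample_answer(problem)[0] for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(problem: str, n: int) -> str:
    return max((sample_answer(problem) for _ in range(n)), key=lambda x: x[1])[0]

def weighted_voting(problem: str, n: int) -> str:
    votes = defaultdict(float)
    for _ in range(n):
        answer, score = sample_answer(problem)
        votes[answer] += score  # weight each vote by its reward-model score
    return max(votes, key=votes.get)
```

Under a fixed compute budget, the interesting comparison is how accuracy changes as `n` grows for each strategy and model size.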
no code implementations • 14 Jul 2024 • Haohan Lin, Zhiqing Sun, Yiming Yang, Sean Welleck
We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof, thereby boosting the model's theorem-proving capabilities.
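A minimal sketch of the thought-then-tactic proving loop that Lean-STaR trains for is shown below. The prompt tags (`[STATE]`, `[THOUGHT]`, `[TACTIC]`) and the `generate` / `apply_tactic` helpers are assumptions for illustration, not the paper's exact format or code:

```python
# Before each tactic, the model first emits an informal thought conditioned
# on the current proof state; the tactic is then generated given both.

def generate(prompt: str) -> str:
    """Hypothetical LM call."""
    raise NotImplementedError

def apply_tactic(state: str, tactic: str) -> tuple[str, bool]:
    """Hypothetical Lean interaction: new proof state and whether the proof closed."""
    raise NotImplementedError

def prove(initial_state: str, max_steps: int = 50) -> list[tuple[str, str]]:
    state, trace = initial_state, []
    for _ in range(max_steps):
        thought = generate(f"[STATE]\n{state}\n[THOUGHT]\n")
        tactic = generate(f"[STATE]\n{state}\n[THOUGHT]\n{thought}\n[TACTIC]\n")
        state, done = apply_tactic(state, tactic)
        trace.append((thought, tactic))
        if done:
            break
    return trace
```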
no code implementations • 24 Jun 2024 • Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui
One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results.
no code implementations • 16 Jun 2024 • Evan Lohn, Sean Welleck
We publicly release miniCodeProps as a benchmark for furthering automated theorem proving in the context of formally verified code.
1 code implementation • 9 Jun 2024 • Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Lee, Namgyu Ho, Se June Joo, Miyoung Ko, Yoonjoo Lee, Hyungjoo Chae, Jamin Shin, Joel Jang, Seonghyeon Ye, Bill Yuchen Lin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
To overcome these limitations, we introduce the BiGGen Bench, a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
1 code implementation • 2 May 2024 • Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs.
1 code implementation • 14 Mar 2024 • Zhiqing Sun, Longhui Yu, Yikang Shen, Weiyang Liu, Yiming Yang, Sean Welleck, Chuang Gan
This paper answers this question in the context of tackling hard reasoning tasks (e.g., level 4-5 MATH problems) via learning from human annotations on easier tasks (e.g., level 1-3 MATH problems), which we term easy-to-hard generalization.
1 code implementation • 13 Nov 2023 • Skyler Hallinan, Faeze Brahman, Ximing Lu, JaeHun Jung, Sean Welleck, Yejin Choi
We propose STEER: Unified Style Transfer with Expert Reinforcement, a unified framework developed to overcome the challenge of limited parallel data for style transfer.
1 code implementation • 27 Oct 2023 • Sean Welleck, Rahul Saha
LLMSTEP is a Lean 4 tactic that sends a user's proof state to a server hosting a language model.
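To illustrate the client/server split, here is a minimal sketch of a suggestion server: an HTTP endpoint that receives a proof state and returns tactic suggestions. The route, payload fields, and `suggest_tactics` stub are assumptions for illustration, not LLMSTEP's actual protocol:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def suggest_tactics(state: str, n: int = 5) -> list[str]:
    """Hypothetical LM call that proposes next-step tactics (toy output)."""
    return ["simp", "ring", "omega"][:n]

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The Lean-side tactic would POST the current goal state as JSON.
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        suggestions = suggest_tactics(body["proof_state"])
        payload = json.dumps({"suggestions": suggestions}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()
```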
4 code implementations • 16 Oct 2023 • Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck
We present Llemma, a large language model for mathematics.
Ranked #20 on Automated Theorem Proving on miniF2F-test
1 code implementation • NeurIPS 2023 • Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, Yejin Choi
We formulate compositional tasks as computation graphs to systematically quantify the level of complexity, and break down reasoning steps into intermediate sub-procedures.
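A minimal sketch of that view follows: a task instance is a DAG of primitive operations, and its complexity can be quantified by graph size and depth. The tiny example graph (2-digit by 1-digit multiplication) is illustrative, not the paper's exact construction:

```python
from functools import lru_cache

# node -> list of parent nodes it depends on
GRAPH = {
    "a1": [], "a0": [], "b0": [],   # input digits of 21 * 3
    "p0": ["a0", "b0"],             # 1 * 3
    "p1": ["a1", "b0"],             # 2 * 3
    "out": ["p1", "p0"],            # combine partial products
}

def depth(graph: dict[str, list[str]]) -> int:
    @lru_cache(maxsize=None)
    def d(node: str) -> int:
        parents = graph[node]
        return 0 if not parents else 1 + max(d(p) for p in parents)
    return max(d(n) for n in graph)

print("size =", len(GRAPH), "depth =", depth(GRAPH))   # size = 6, depth = 2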
1 code implementation • 24 May 2023 • Ximing Lu, Faeze Brahman, Peter West, Jaehun Jang, Khyathi Chandu, Abhilasha Ravichander, Lianhui Qin, Prithviraj Ammanabrolu, Liwei Jiang, Sahana Ramnath, Nouha Dziri, Jillian Fisher, Bill Yuchen Lin, Skyler Hallinan, Xiang Ren, Sean Welleck, Yejin Choi
While extreme-scale language models have demonstrated exceptional performance on a variety of language tasks, the degree of control over these language models through pure prompting can often be limited.
3 code implementations • NeurIPS 2023 • Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, Peter Clark
Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement.
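A minimal sketch of the Self-Refine loop is below: the same model generates an output, critiques it, and refines it using its own feedback. `llm` is a hypothetical stand-in for a chat-completion call, and the stopping heuristic is an assumption for illustration:

```python
def llm(prompt: str) -> str:
    """Hypothetical LM call."""
    raise NotImplementedError

def self_refine(task: str, iterations: int = 3) -> str:
    output = llm(f"Task: {task}\nAnswer:")
    for _ in range(iterations):
        feedback = llm(f"Task: {task}\nAnswer: {output}\n"
                       "Give concrete, actionable feedback on this answer:")
        if "no issues" in feedback.lower():   # simple stopping heuristic
            break
        output = llm(f"Task: {task}\nAnswer: {output}\n"
                     f"Feedback: {feedback}\nRevised answer:")
    return output
```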
1 code implementation • 30 Dec 2022 • Krishna Pillutla, Lang Liu, John Thickstun, Sean Welleck, Swabha Swayamdipta, Rowan Zellers, Sewoong Oh, Yejin Choi, Zaid Harchaoui
We present MAUVE, a family of comparison measures between pairs of distributions such as those encountered in the generative modeling of text or images.
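A simplified, self-contained sketch of a MAUVE-style score is shown below: quantize embeddings of human and model text into a shared histogram, trace the KL divergence frontier between the two histograms, and report the area under it. The official implementation is the `mauve-text` package; the random "embeddings" and the scaling constant `c=5` here are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def mauve_score(p_embed: np.ndarray, q_embed: np.ndarray,
                k: int = 8, c: float = 5.0, grid: int = 25) -> float:
    # Joint quantization of both samples into k clusters.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
        np.vstack([p_embed, q_embed]))
    p = np.bincount(labels[: len(p_embed)], minlength=k) + 1e-6
    q = np.bincount(labels[len(p_embed):], minlength=k) + 1e-6
    p, q = p / p.sum(), q / q.sum()

    def kl(a, b):
        return float(np.sum(a * np.log(a / b)))

    # Divergence frontier traced by mixtures r = lam*p + (1-lam)*q.
    xs, ys = [], []
    for lam in np.linspace(1e-3, 1 - 1e-3, grid):
        r = lam * p + (1 - lam) * q
        xs.append(np.exp(-c * kl(q, r)))
        ys.append(np.exp(-c * kl(p, r)))
    order = np.argsort(xs)
    xs_s, ys_s = np.array(xs)[order], np.array(ys)[order]
    # Trapezoid-rule area under the frontier.
    return float(np.sum((xs_s[1:] - xs_s[:-1]) * (ys_s[1:] + ys_s[:-1]) / 2))

human = np.random.randn(200, 16)           # stand-ins for text embeddings
model = np.random.randn(200, 16) + 0.5
print(round(mauve_score(human, model), 3))
```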
1 code implementation • 20 Dec 2022 • Pan Lu, Liang Qiu, Wenhao Yu, Sean Welleck, Kai-Wei Chang
Mathematical reasoning is a fundamental aspect of human intelligence and is applicable in various fields, including science, engineering, finance, and everyday life.
no code implementations • 31 Oct 2022 • Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, Yejin Choi
Sequence generation applications require satisfying semantic constraints, such as ensuring that programs are correct, using certain keywords, or avoiding undesirable content.
1 code implementation • 31 Oct 2022 • Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, Ashwin Kalyan
Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling.
Ranked #1 on Mathematical Reasoning on Lila (OOD)
3 code implementations • 21 Oct 2022 • Albert Q. Jiang, Sean Welleck, Jin Peng Zhou, Wenda Li, Jiacheng Liu, Mateja Jamnik, Timothée Lacroix, Yuhuai Wu, Guillaume Lample
In this work, we introduce Draft, Sketch, and Prove (DSP), a method that maps informal proofs to formal proof sketches, and uses the sketches to guide an automated prover by directing its search to easier sub-problems.
Ranked #3 on Automated Theorem Proving on miniF2F-valid (Pass@100 metric)
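A minimal sketch of the Draft, Sketch, and Prove pipeline described above follows. `draft`, `sketch`, and `close_gap` are hypothetical stand-ins for the informal-proof LM, the formal-sketch LM, and an off-the-shelf automated prover, and `<HOLE>` is an assumed placeholder token for unproved sub-goals:

```python
def draft(statement: str) -> str:
    """Hypothetical LM call: write an informal proof of the statement."""
    raise NotImplementedError

def sketch(statement: str, informal_proof: str) -> list[str]:
    """Hypothetical LM call: a formal proof sketch, one line per step,
    with '<HOLE>' marking sub-goals left to an automated prover."""
    raise NotImplementedError

def close_gap(goal: str) -> str | None:
    """Hypothetical automated-prover call: a proof of `goal`, or None."""
    raise NotImplementedError

def draft_sketch_prove(statement: str) -> list[str] | None:
    informal = draft(statement)
    lines = sketch(statement, informal)
    proof = []
    for line in lines:
        if "<HOLE>" in line:
            filled = close_gap(line)
            if filled is None:        # a single open gap fails the attempt
                return None
            line = line.replace("<HOLE>", filled)
        proof.append(line)
    return proof
```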
1 code implementation • 6 Oct 2022 • Jiacheng Liu, Skyler Hallinan, Ximing Lu, Pengfei He, Sean Welleck, Hannaneh Hajishirzi, Yejin Choi
Our work is the first to report that knowledge generated by models that are orders of magnitude smaller than GPT-3, even without direct supervision on the knowledge itself, can exceed the quality of commonsense knowledge elicited from GPT-3.
1 code implementation • 26 May 2022 • Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, Yejin Choi
Large-scale language models often learn behaviors that are misaligned with user expectations.
1 code implementation • 25 May 2022 • Sean Welleck, Jiacheng Liu, Ximing Lu, Hannaneh Hajishirzi, Yejin Choi
Theorem proving in natural mathematical language - the mixture of symbolic and natural language used by humans - plays a central role in mathematical advances and education, and tests aspects of reasoning that are core to intelligence.
no code implementations • 24 May 2022 • JaeHun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, Yejin Choi
Despite their impressive capabilities, large pre-trained language models (LMs) struggle with consistent reasoning; recently, prompting LMs to generate explanations that self-guide the inference has emerged as a promising direction to amend this.
2 code implementations • 23 Feb 2022 • Lianhui Qin, Sean Welleck, Daniel Khashabi, Yejin Choi
Many applications of text generation require incorporating different constraints to control the semantics or style of generated text.
1 code implementation • NAACL 2022 • Ximing Lu, Sean Welleck, Peter West, Liwei Jiang, Jungo Kasai, Daniel Khashabi, Ronan Le Bras, Lianhui Qin, Youngjae Yu, Rowan Zellers, Noah A. Smith, Yejin Choi
To enable constrained generation, we build on NeuroLogic decoding (Lu et al., 2021), combining its flexibility in incorporating logical constraints with A*esque estimates of future constraint satisfaction.
Ranked #1 on Text Generation on ROCStories
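A minimal sketch of the A*esque idea described above is shown below: ordinary beam search, but each candidate's score is augmented with an estimate of how well a short greedy lookahead satisfies the lexical constraints. `step_logprobs` and `greedy_continue` are hypothetical stand-ins for the underlying LM, and the simple coverage bonus is an assumption for illustration:

```python
def step_logprobs(prefix: list[str]) -> dict[str, float]:
    """Hypothetical: next-token log-probabilities given the prefix."""
    raise NotImplementedError

def greedy_continue(prefix: list[str], horizon: int = 5) -> list[str]:
    """Hypothetical: greedy rollout of `horizon` tokens from the prefix."""
    raise NotImplementedError

def constraint_bonus(tokens: list[str], constraints: set[str]) -> float:
    """Fraction of constraint words covered by the (lookahead) sequence."""
    return len(constraints & set(tokens)) / max(len(constraints), 1)

def astar_esque_beam(constraints: set[str], beam_size: int = 4,
                     steps: int = 20, alpha: float = 1.0) -> list[str]:
    beam = [([], 0.0)]
    for _ in range(steps):
        candidates = []
        for prefix, score in beam:
            for token, logp in step_logprobs(prefix).items():
                new_prefix = prefix + [token]
                lookahead = new_prefix + greedy_continue(new_prefix)
                future = constraint_bonus(lookahead, constraints)
                candidates.append((new_prefix, score + logp + alpha * future))
        beam = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
    return beam[0][0]
```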
1 code implementation • NAACL 2022 • Daniel Khashabi, Shane Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sean Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, Sameer Singh, Yejin Choi
Fine-tuning continuous prompts for target tasks has recently emerged as a compact alternative to full model fine-tuning.
1 code implementation • ACL 2022 • Jiacheng Liu, Alisa Liu, Ximing Lu, Sean Welleck, Peter West, Ronan Le Bras, Yejin Choi, Hannaneh Hajishirzi
It remains an open question whether incorporating external knowledge benefits commonsense reasoning while maintaining the flexibility of pretrained sequence models.
1 code implementation • NAACL 2022 • Peter West, Chandra Bhagavatula, Jack Hessel, Jena D. Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, Yejin Choi
We apply this to the ATOMIC resource, and share our new symbolic knowledge graph and commonsense models.
1 code implementation • 28 Sep 2021 • Sean Welleck, Peter West, Jize Cao, Yejin Choi
Neural sequence models trained with maximum likelihood estimation have led to breakthroughs in many tasks, where success is defined by the gap between training and test performance.
1 code implementation • NeurIPS 2021 • Lang Liu, Krishna Pillutla, Sean Welleck, Sewoong Oh, Yejin Choi, Zaid Harchaoui
The spectacular success of deep generative models calls for quantitative tools to measure their statistical performance.
1 code implementation • ACL (spnlp) 2021 • Ilia Kulikov, Sean Welleck, Kyunghyun Cho
We propose to study these phenomena by investigating how the modes, or local maxima, of a distribution are maintained throughout the full learning chain of the ground-truth, empirical, learned, and decoding-induced distributions, using a newly proposed measure, the mode recovery cost.
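As an illustrative proxy for that idea (not the paper's exact definition of mode recovery cost), the sketch below measures how many of one distribution's top-k modes are missing from another's, for two toy distributions over the same sequences:

```python
def top_k_modes(dist: dict[str, float], k: int) -> set[str]:
    return set(sorted(dist, key=dist.get, reverse=True)[:k])

def mode_recovery_gap(reference: dict[str, float],
                      learned: dict[str, float], k: int = 3) -> float:
    """Fraction of the reference distribution's top-k modes that the
    learned distribution fails to rank in its own top-k."""
    ref_modes = top_k_modes(reference, k)
    missing = ref_modes - top_k_modes(learned, k)
    return len(missing) / len(ref_modes)

ground_truth = {"a b": 0.5, "a c": 0.3, "b c": 0.15, "c c": 0.05}
model = {"a b": 0.6, "c c": 0.2, "b c": 0.15, "a c": 0.05}
print(mode_recovery_gap(ground_truth, model, k=2))   # 0.5: the mode "a c" is lost
```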
1 code implementation • 24 Mar 2021 • Sean Welleck, Jiacheng Liu, Ronan Le Bras, Hannaneh Hajishirzi, Yejin Choi, Kyunghyun Cho
Understanding and creating mathematics using natural mathematical language - the mixture of symbolic and natural language used by humans - is a challenging and important problem for driving progress in machine learning.
5 code implementations • NeurIPS 2021 • Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, Zaid Harchaoui
As major progress is made in open-ended text generation, measuring how close machine-generated text is to human language remains a critical open problem.
1 code implementation • 4 Jun 2020 • Sean Welleck, Kyunghyun Cho
Typical approaches to directly optimizing the task loss, such as policy gradient and minimum risk training, are based on sampling in the sequence space to obtain candidate update directions, each scored by the loss of a single sequence.
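To make that baseline approach concrete (this illustrates the typical policy-gradient setup the sentence describes, not the paper's proposed method), here is a minimal REINFORCE-style sketch; `sample_fn` is a hypothetical stand-in that returns a sampled sequence together with its log-probability under the model, and the toy `task_loss` is an assumption:

```python
import torch

def task_loss(sequence: list[int]) -> float:
    """Hypothetical sequence-level loss (e.g. a negative quality score); toy stand-in."""
    return float(len(sequence))

def policy_gradient_step(model, optimizer, sample_fn, n_samples: int = 8):
    """One REINFORCE update: sample sequences, score each with the task loss,
    and weight each log-probability by (loss - baseline)."""
    samples = [sample_fn(model) for _ in range(n_samples)]   # (sequence, log_prob with grad)
    losses = torch.tensor([task_loss(seq) for seq, _ in samples])
    baseline = losses.mean()                                 # variance-reduction baseline
    surrogate = torch.stack([(loss_val - baseline) * logp
                             for (_, logp), loss_val in zip(samples, losses)]).mean()
    optimizer.zero_grad()
    surrogate.backward()   # descending on the surrogate decreases the expected loss
    optimizer.step()
```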
1 code implementation • EMNLP 2020 • Sean Welleck, Ilia Kulikov, Jaedeok Kim, Richard Yuanzhe Pang, Kyunghyun Cho
Despite strong performance on a variety of tasks, neural sequence models trained with maximum likelihood have been shown to exhibit issues such as length bias and degenerate repetition.
1 code implementation • ACL 2020 • Margaret Li, Stephen Roller, Ilia Kulikov, Sean Welleck, Y-Lan Boureau, Kyunghyun Cho, Jason Weston
Generative dialogue models currently suffer from a number of problems which standard maximum likelihood training does not address.
6 code implementations • ICLR 2020 • Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, Jason Weston
Neural text generation is a key tool in natural language applications, but it is well known there are major problems at its core.
1 code implementation • 29 May 2019 • Elman Mansimov, Alex Wang, Sean Welleck, Kyunghyun Cho
We investigate this problem by proposing a generalized model of sequence generation that unifies decoding in directed and undirected models.
no code implementations • RANLP 2019 • Sean Welleck, Kyunghyun Cho
We propose a method for non-projective dependency parsing by incrementally predicting a set of edges.
1 code implementation • WS 2019 • Sean Welleck, Kianté Brantley, Hal Daumé III, Kyunghyun Cho
Standard sequential generation methods assume a pre-specified generation order, such as text generation methods which generate words from left to right.
no code implementations • ACL 2019 • Sean Welleck, Jason Weston, Arthur Szlam, Kyunghyun Cho
Consistency is a long-standing issue faced by dialogue models.
no code implementations • NeurIPS 2017 • Sean Welleck, Jialin Mao, Kyunghyun Cho, Zheng Zhang
Humans process visual scenes selectively and sequentially using attention.
no code implementations • ICLR 2018 • Sean Welleck, Zixin Yao, Yu Gai, Jialin Mao, Zheng Zhang, Kyunghyun Cho
In this paper, we propose a novel multiset loss function by viewing this problem from the perspective of sequential decision making.
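A minimal sketch of that sequential-decision view follows: at each step an oracle policy puts uniform mass on the target items not yet predicted, and the loss sums the per-step KL divergence from the oracle to the model's distribution. The toy step distributions and the greedy removal rule are illustrative assumptions, not the paper's exact training procedure:

```python
import math
from collections import Counter

def multiset_loss(target: list[str], step_dists: list[dict[str, float]]) -> float:
    remaining = Counter(target)
    total = 0.0
    for dist in step_dists:                       # one model distribution per step
        support = list(remaining.elements())
        oracle = Counter(support)
        for item, count in oracle.items():
            p = count / len(support)              # oracle: uniform over remaining items
            q = max(dist.get(item, 0.0), 1e-12)
            total += p * math.log(p / q)          # KL(oracle || model) term
        # greedily "predict" the model's argmax and remove it if it is still needed
        pred = max(dist, key=dist.get)
        if remaining[pred] > 0:
            remaining[pred] -= 1
    return total

target = ["cat", "cat", "dog"]
steps = [{"cat": 0.7, "dog": 0.3}, {"cat": 0.5, "dog": 0.5}, {"dog": 0.9, "cat": 0.1}]
print(round(multiset_loss(target, steps), 3))
```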