2 code implementations • 18 Jan 2024 • Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston
We posit that to achieve superhuman agents, future models require superhuman feedback to provide an adequate training signal.
1 code implementation • 20 Nov 2023 • David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman
We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.
no code implementations • 26 Jul 2023 • Richard Yuanzhe Pang, Stephen Roller, Kyunghyun Cho, He He, Jason Weston
We study improving social conversational agents by learning from natural dialogue between users and a deployed model, without extra annotations.
1 code implementation • NeurIPS 2023 • Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish Joshi, Seyed Mehran Kazemi, Najoung Kim, He He
Given the intractably large size of the space of proofs, any model that is capable of general deductive reasoning must generalize to proofs of greater complexity.
1 code implementation • 8 Mar 2023 • Vishakh Padmakumar, Richard Yuanzhe Pang, He He, Ankur P. Parikh
We study the problem of extrapolative controlled generation, i.e., generating sequences with attribute values beyond the range seen in training.
no code implementations • 16 Nov 2022 • Richard Yuanzhe Pang, Vishakh Padmakumar, Thibault Sellam, Ankur P. Parikh, He He
To align conditional text generation model outputs with desired behaviors, there has been an increasing focus on training the model using reinforcement learning (RL) with reward functions learned from human annotations.
no code implementations • 26 Aug 2022 • Julian Michael, Ari Holtzman, Alicia Parrish, Aaron Mueller, Alex Wang, Angelica Chen, Divyam Madaan, Nikita Nangia, Richard Yuanzhe Pang, Jason Phang, Samuel R. Bowman
We present the results of the NLP Community Metasurvey.
1 code implementation • 23 May 2022 • Alex Wang, Richard Yuanzhe Pang, Angelica Chen, Jason Phang, Samuel R. Bowman
Summarization datasets are often assembled either by scraping naturally occurring public-domain summaries -- which are nearly always in difficult-to-work-with technical domains -- or by using approximate heuristics to extract them from everyday text -- which frequently yields unfaithful summaries.
no code implementations • ACL 2022 • Le Hou, Richard Yuanzhe Pang, Tianyi Zhou, Yuexin Wu, Xinying Song, Xiaodan Song, Denny Zhou
Transformer-based models generally allocate the same amount of computation for each token in a given sequence.
2 code implementations • NAACL 2022 • Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, Samuel R. Bowman
To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process.
no code implementations • 16 Dec 2021 • Richard Yuanzhe Pang, He He, Kyunghyun Cho
For all three approaches, the generated translations fail to achieve rewards comparable to BSR, but the translation quality approximated by BLEU and BLEURT is similar to the quality of BSR-produced translations.
no code implementations • Findings (ACL) 2021 • Richard Yuanzhe Pang, Adam D. Lelkes, Vinh Q. Tran, Cong Yu
Given the lack of existing datasets, we create a dataset for AgreeSum, and provide annotations on article-summary entailment relations for a subset of the clusters in the dataset.
no code implementations • ACL 2021 • Clara Vania, Phu Mon Htut, William Huang, Dhara Mungra, Richard Yuanzhe Pang, Jason Phang, Haokun Liu, Kyunghyun Cho, Samuel R. Bowman
Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks.
1 code implementation • ICLR 2021 • Richard Yuanzhe Pang, He He
Current approaches to text generation largely rely on autoregressive models and maximum likelihood estimation.
no code implementations • ACL 2020 • Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, Samuel R. Bowman
However, we fail to observe more granular correlations between probing and target task performance, highlighting the need for further work on broad-coverage probing benchmarks.
1 code implementation • ACL 2020 • Lifu Tu, Richard Yuanzhe Pang, Sam Wiseman, Kevin Gimpel
We propose to train a non-autoregressive machine translation model to minimize the energy defined by a pretrained autoregressive model.
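A minimal sketch of the idea in that snippet: a pretrained autoregressive model defines an energy over output sequences (the negative AR log-probability), and a non-autoregressive model is trained to produce low-energy outputs. The toy bigram "teacher", the vocabulary, the sequence length, and the exhaustive search standing in for the non-autoregressive student are all assumptions of this sketch, not the paper's implementation.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
V, T = 4, 3                       # assumed vocab size and output length
start = rng.normal(size=V)        # logits for the first token
logits = rng.normal(size=(V, V))  # bigram logits of the "pretrained" AR teacher

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def energy(y):
    """Energy of sequence y = negative log-probability under the AR teacher."""
    e = -log_softmax(start)[y[0]]
    for t in range(1, len(y)):
        e -= log_softmax(logits[y[t - 1]])[y[t]]
    return e

# Exhaustive minimization over the (tiny) output space stands in for
# training a non-autoregressive student to output low-energy sequences.
best = min(product(range(V), repeat=T), key=energy)
print(best, float(energy(best)))
```

In the actual setting the output space is far too large to enumerate, so the student's token distributions are relaxed and the energy is minimized by gradient descent; the sketch only shows what quantity is being minimized.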
1 code implementation • EMNLP 2020 • Sean Welleck, Ilia Kulikov, Jaedeok Kim, Richard Yuanzhe Pang, Kyunghyun Cho
Despite strong performance on a variety of tasks, neural sequence models trained with maximum likelihood have been shown to exhibit issues such as length bias and degenerate repetition.
no code implementations • EMNLP (spnlp) 2020 • Lifu Tu, Richard Yuanzhe Pang, Kevin Gimpel
Deep energy-based models are powerful, but pose challenges for learning and inference (Belanger and McCallum, 2016).
no code implementations • WS 2019 • Richard Yuanzhe Pang
Regarding the problem of automatically generating paraphrases with modified styles or attributes, the difficulty lies in the lack of parallel corpora.
no code implementations • 9 Oct 2019 • Richard Yuanzhe Pang
The difficulty of textual style transfer lies in the lack of parallel corpora.
no code implementations • WS 2019 • Richard Yuanzhe Pang, Kevin Gimpel
We show that the metric of post-transfer classification accuracy is insufficient on its own, and propose additional metrics based on semantic preservation and fluency as well as a way to combine them into a single overall score.