no code implementations • 31 Jan 2025 • Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez
Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, such as manufacturing illegal substances at scale.
no code implementations • 3 Dec 2024 • Tony T. Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir Shavit, Ethan Perez
Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem.
1 code implementation • 10 Jan 2024 • Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez
We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it).
no code implementations • 19 May 2023 • Dhara Yu, Noah D. Goodman, Jesse Mu
Humans teach others about the world through language and demonstration.
1 code implementation • NeurIPS 2023 • Jesse Mu, Xiang Lisa Li, Noah Goodman
Prompting is the primary way to utilize the multitask capabilities of language models (LMs), but prompts occupy valuable space in the input context window, and repeatedly encoding the same prompt is computationally inefficient.
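Below is a minimal sketch of the gist-masking idea behind this prompt-compression work, assuming a sequence laid out as [prompt | gist | input] tokens. The function name and layout are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def gist_attention_mask(n_prompt: int, n_gist: int, n_input: int) -> np.ndarray:
    """Build a causal attention mask in which tokens after the gist tokens
    cannot attend back to the raw prompt, only to the gist tokens.

    Assumed layout: [prompt | gist | input]; True = attention allowed.
    """
    total = n_prompt + n_gist + n_input
    # Standard causal (lower-triangular) mask.
    mask = np.tril(np.ones((total, total), dtype=bool))
    # Positions after the gist span may not see the original prompt tokens,
    # forcing the prompt's information to flow through the gist tokens.
    post_gist_start = n_prompt + n_gist
    mask[post_gist_start:, :n_prompt] = False
    return mask

# Example: a 5-token prompt compressed into 2 gist tokens, followed by 3 input tokens.
print(gist_attention_mask(5, 2, 3).astype(int))
```

Because later tokens can only reach the prompt through the gist positions, the gist activations can be cached and reused in place of re-encoding the full prompt.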
1 code implementation • 30 Sep 2022 • Victor Zhong, Jesse Mu, Luke Zettlemoyer, Edward Grefenstette, Tim Rocktäschel
Recent work has shown that augmenting environments with language descriptions improves policy learning.
1 code implementation • 18 Apr 2022 • Alex Tamkin, Dat Nguyen, Salil Deshpande, Jesse Mu, Noah Goodman
Models can fail in unpredictable ways during deployment due to task ambiguity, when multiple behaviors are consistent with the provided training data.
1 code implementation • 28 Mar 2022 • Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman
We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers, and performs comparably to fine-tuning a 30× larger state-of-the-art language model on CommonsenseQA.
Ranked #18 on Common Sense Reasoning on CommonsenseQA
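A rough sketch of one STaR bootstrapping iteration as described in the paper follows; `generate_rationale`, `rationalize_with_hint`, and `finetune` are hypothetical stand-ins for the model's sampling and training routines.

```python
def star_iteration(model, dataset, generate_rationale, rationalize_with_hint, finetune):
    """One iteration of the STaR loop: collect rationales that yield correct
    answers, rationalize failures with an answer hint, then fine-tune.
    """
    training_examples = []
    for question, answer in dataset:
        # Sample a rationale and answer from the current model.
        rationale, predicted = generate_rationale(model, question)
        if predicted == answer:
            # Keep rationales that lead to the correct answer.
            training_examples.append((question, rationale, answer))
        else:
            # "Rationalization": retry with the correct answer given as a hint,
            # but store the example without the hint so the model learns to
            # produce the reasoning unaided.
            rationale = rationalize_with_hint(model, question, answer)
            if rationale is not None:
                training_examples.append((question, rationale, answer))
    # Fine-tune (from the original pretrained model) on the collected rationales.
    return finetune(training_examples)
```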
1 code implementation • 17 Feb 2022 • Jesse Mu, Victor Zhong, Roberta Raileanu, Minqi Jiang, Noah Goodman, Tim Rocktäschel, Edward Grefenstette
Reinforcement learning (RL) agents are particularly hard to train when rewards are sparse.
1 code implementation • Findings (EMNLP) 2021 • Rose E. Wang, Julia White, Jesse Mu, Noah D. Goodman
We propose a method that uses a population of neural listeners to regularize speaker training.
no code implementations • 29 Sep 2021 • Alex Tamkin, Dat Nguyen, Salil Deshpande, Jesse Mu, Noah Goodman
An important barrier to the safe deployment of machine learning systems is the risk of task ambiguity, where multiple behaviors are consistent with the provided examples.
1 code implementation • NeurIPS 2021 • Jesse Mu, Noah Goodman
To build agents that can collaborate effectively with others, recent research has trained artificial agents to communicate with each other in Lewis-style referential games.
2 code implementations • NeurIPS 2020 • Jesse Mu, Jacob Andreas
We describe a procedure for explaining neurons in deep representations by identifying compositional logical concepts that closely approximate neuron behavior.
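As a minimal sketch of the underlying idea, assuming binarized neuron activations and concept masks over a probing dataset: the greedy pairwise search below stands in for the paper's beam search over longer logical formulas, and all names are illustrative.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def best_pairwise_explanation(neuron_mask, concept_masks):
    """Find the pair of concepts whose AND / OR / AND-NOT composition best
    approximates the neuron's binarized activation mask, scored by IoU.

    concept_masks: dict mapping concept name -> boolean mask over the dataset.
    """
    best_label, best_score = None, 0.0
    names = list(concept_masks)
    for i, ci in enumerate(names):
        for cj in names[i + 1:]:
            candidates = [
                (f"{ci} AND {cj}", concept_masks[ci] & concept_masks[cj]),
                (f"{ci} OR {cj}", concept_masks[ci] | concept_masks[cj]),
                (f"{ci} AND NOT {cj}", concept_masks[ci] & ~concept_masks[cj]),
            ]
            for label, composed in candidates:
                score = iou(neuron_mask, composed)
                if score > best_score:
                    best_label, best_score = label, score
    return best_label, best_score
```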
1 code implementation • 31 May 2020 • Julia White, Jesse Mu, Noah D. Goodman
A hallmark of human language is the ability to effectively and efficiently convey contextually relevant information.
2 code implementations • ACL 2020 • Jesse Mu, Percy Liang, Noah Goodman
By describing the features and abstractions of our world, language is a crucial tool for human learning and a promising source of supervision for machine learning models.
1 code implementation • NAACL 2019 • Jesse Mu, Helen Yannakoudakis, Ekaterina Shutova
Most current approaches to metaphor identification use restricted linguistic contexts, e.g., by considering only a verb's arguments or the sentence containing a phrase.
no code implementations • EMNLP 2017 • Jesse Mu, Joshua K. Hartshorne, Timothy O'Donnell
Verbs can only be used with a few specific arrangements of their arguments (syntactic frames).