1 code implementation • 24 Apr 2024 • Jacob Pfau, William Merrill, Samuel R. Bowman
We also provide a theoretical characterization of the class of problems where filler tokens are useful in terms of the quantifier depth of a first-order formula.
1 code implementation • 20 Oct 2023 • Henning Bartsch, Ole Jorgensen, Domenic Rosati, Jason Hoelscher-Obermaier, Jacob Pfau
Using this test, we find that despite increases in self-consistency, models usually place significant weight on alternative, inconsistent answers.
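The quantity in question can be illustrated with a minimal sketch. Assuming (hypothetically — the paper's actual test may differ) that a model's answer distribution is estimated by repeated sampling, the "weight on alternative, inconsistent answers" is simply the probability mass outside the modal answer:

```python
from collections import Counter

# Hypothetical sampled answers from a model for one question,
# e.g. from several temperature-sampled completions.
samples = ["B", "B", "A", "B", "C", "B", "A", "B"]

counts = Counter(samples)
total = sum(counts.values())
top_answer, top_count = counts.most_common(1)[0]

# Probability mass the model places on answers inconsistent
# with its most frequent (modal) answer.
alt_weight = 1 - top_count / total
print(top_answer, alt_weight)  # "B", 0.375
```

A model can thus look self-consistent by its modal answer while still spreading substantial probability over alternatives.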
no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.
4 code implementations • 28 May 2021 • Lauro Langosco, Jack Koch, Lee Sharkey, Jacob Pfau, Laurent Orseau, David Krueger
We study goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL).
1 code implementation • 6 Apr 2021 • Jacob Pfau, Albert T. Young, Jerome Wei, Maria L. Wei, Michael J. Keiser
Our proposed method, Robust Concept Activation Vectors (RCAV), quantifies the effects of semantic concepts on individual model predictions and on model behavior as a whole.
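A minimal sketch of the underlying concept-activation-vector idea (not the paper's RCAV method itself): fit a linear probe separating concept-labeled activations from random activations, take its normalized weight vector as the concept direction, and measure a model output's directional derivative along it. All data and the toy scorer below are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden-layer width of the hypothetical network

# Synthetic "activations": concept examples are shifted along dimension 0.
concept_acts = rng.normal(size=(200, d))
concept_acts[:, 0] += 2.0
random_acts = rng.normal(size=(200, d))

# Logistic-regression probe (plain gradient ascent) separating the two sets;
# its weight vector, normalized, is the concept activation vector (CAV).
X = np.vstack([concept_acts, random_acts])
y = np.concatenate([np.ones(200), np.zeros(200)])
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w += 0.1 * X.T @ (y - p) / len(y)
cav = w / np.linalg.norm(w)

# Concept sensitivity: directional derivative of a model output along the CAV.
# Here the "model output" is a toy linear score, so its gradient is constant.
score_weights = rng.normal(size=d)
sensitivity = float(score_weights @ cav)
```

Averaging (or, in the robust variant, statistically testing) such sensitivities over many inputs is what turns a per-example derivative into a claim about model behavior as a whole.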
no code implementations • 16 Oct 2019 • Jacob Pfau, Albert T. Young, Maria L. Wei, Michael J. Keiser
In high-stakes applications of machine learning models, interpretability methods are expected to provide guarantees that models are right for the right reasons.