no code implementations • 16 Apr 2024 • Vincent Conitzer, Rachel Freedman, Jobst Heitzig, Wesley H. Holliday, Bob M. Jacobs, Nathan Lambert, Milan Mossé, Eric Pacuit, Stuart Russell, Hailey Schoelkopf, Emanuel Tewolde, William S. Zwicker
Foundation models such as GPT-4 are fine-tuned to avoid unsafe or otherwise problematic behavior, for example helping to commit crimes or producing racist text.
no code implementations • 23 Oct 2023 • Rachel Freedman, Justin Svegliato, Kyle Wray, Stuart Russell
The Hidden Utility Bandit (HUB) framework and Active Teacher Selection (ATS) algorithm demonstrate the importance of leveraging differences between teachers to learn accurate reward models, facilitating future research on active teacher selection for robust reward modeling.
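For intuition only, here is a minimal sketch of the active-teacher-selection idea, assuming a hidden-utility-bandit-like setting with two items of unknown relative utility and teachers of differing (estimated) reliability. The teacher names, accuracies, and the expected-information-gain criterion are illustrative assumptions, not the paper's exact algorithm.

```python
import math

def entropy(p):
    """Binary entropy in nats; 0 at p in {0, 1}."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def expected_info_gain(p_a_better, teacher_acc):
    """Expected reduction in uncertainty about 'A beats B' from one query
    to a teacher who answers correctly with probability teacher_acc."""
    # Probability the teacher answers "A" under the current belief.
    p_says_a = p_a_better * teacher_acc + (1 - p_a_better) * (1 - teacher_acc)
    # Posterior beliefs after each possible answer (Bayes' rule).
    post_a = p_a_better * teacher_acc / p_says_a
    post_b = p_a_better * (1 - teacher_acc) / (1 - p_says_a)
    expected_posterior_entropy = (p_says_a * entropy(post_a)
                                  + (1 - p_says_a) * entropy(post_b))
    return entropy(p_a_better) - expected_posterior_entropy

# Estimated accuracies of three hypothetical teachers (e.g. posterior means).
teacher_accuracy = {"expert": 0.9, "crowd": 0.7, "noisy": 0.55}
belief_a_better = 0.5  # current belief that item A has higher utility

# Query the teacher whose answer is expected to be most informative.
best = max(teacher_accuracy,
           key=lambda t: expected_info_gain(belief_a_better, teacher_accuracy[t]))
print(best)  # -> "expert" for a 50/50 belief; the ranking can change as beliefs sharpen
```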
no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.
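As a concrete illustration of the reward-modeling step common to RLHF pipelines, here is a minimal numpy sketch that fits a linear reward model to pairwise preference labels with a Bradley-Terry (logistic) loss. The feature dimension, learning rate, and data are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: each response is a feature vector; a human labels which of a pair is preferred.
true_w = np.array([1.0, -2.0, 0.5])
pairs = rng.normal(size=(200, 2, 3))            # 200 pairs of 3-dim "responses"
logits = pairs[:, 0] @ true_w - pairs[:, 1] @ true_w
prefs = (rng.random(200) < 1 / (1 + np.exp(-logits))).astype(float)  # 1 = first preferred

# Fit reward weights w by gradient descent on the Bradley-Terry log-loss:
# P(first preferred) = sigmoid(r(x0) - r(x1)), with r(x) = w . x
w = np.zeros(3)
for _ in range(2000):
    diff = pairs[:, 0] @ w - pairs[:, 1] @ w
    p = 1 / (1 + np.exp(-diff))
    grad = ((p - prefs)[:, None] * (pairs[:, 0] - pairs[:, 1])).mean(axis=0)
    w -= 0.5 * grad

print(np.round(w, 2))  # approximately recovers true_w
```

In a full RLHF pipeline the learned reward would then drive policy optimization (e.g., with PPO); this sketch stops at the reward model.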
no code implementations • 2 Mar 2023 • Peter Barnett, Rachel Freedman, Justin Svegliato, Stuart Russell
Reward learning algorithms use human feedback to infer a reward function, which is then used to train an AI system.
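A hedged sketch of that two-stage pipeline, assuming a Boltzmann-rational teacher giving pairwise comparisons and a small discrete set of candidate reward functions; all names and numbers are illustrative assumptions.

```python
import numpy as np

# Candidate reward functions over 3 actions (rows = hypotheses).
candidates = np.array([[1.0, 0.0, 0.5],
                       [0.0, 1.0, 0.5],
                       [0.5, 0.5, 1.0]])
prior = np.full(3, 1 / 3)
beta = 2.0  # teacher rationality: higher = more reliable feedback

def choice_prob(rewards, chosen, other, beta):
    """P(teacher prefers `chosen` over `other`) under a Boltzmann model."""
    return 1 / (1 + np.exp(-beta * (rewards[chosen] - rewards[other])))

# Stage 1: infer the reward from observed comparisons (chosen, other).
feedback = [(0, 1), (0, 2), (2, 1)]
posterior = prior.copy()
for chosen, other in feedback:
    likelihoods = np.array([choice_prob(r, chosen, other, beta)
                            for r in candidates])
    posterior *= likelihoods
    posterior /= posterior.sum()

# Stage 2: train/act against the inferred reward (here, the posterior mean).
inferred_reward = posterior @ candidates
best_action = int(np.argmax(inferred_reward))
print(posterior.round(3), best_action)
```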
no code implementations • 12 Nov 2022 • Oliver Daniels-Koch, Rachel Freedman
RLHF algorithms that learn from multiple teachers therefore face an expertise problem: the reliability of a given piece of feedback depends both on the teacher it comes from and on how specialized that teacher is in the relevant components of the task.
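One illustrative way to model this: weight each teacher's label by an estimated, component-specific reliability when aggregating feedback. The teachers, components, and reliability numbers below are invented for the sketch.

```python
import math

# Estimated probability that each teacher labels correctly, per task component.
expertise = {
    "clinician":  {"medical": 0.95, "style": 0.60},
    "copyeditor": {"medical": 0.55, "style": 0.90},
}

def log_odds_a_better(labels, component):
    """Aggregate noisy binary labels ('A' or 'B') into log-odds that A is better,
    weighting each teacher by their reliability on this component."""
    total = 0.0
    for teacher, label in labels:
        acc = expertise[teacher][component]
        llr = math.log(acc / (1 - acc))   # log-likelihood ratio of one label
        total += llr if label == "A" else -llr
    return total

# The clinician's vote dominates on a medical comparison...
print(log_odds_a_better([("clinician", "A"), ("copyeditor", "B")], "medical") > 0)  # True
# ...but the copyeditor's vote dominates on a style comparison.
print(log_odds_a_better([("clinician", "A"), ("copyeditor", "B")], "style") > 0)    # False
```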
no code implementations • 19 Jan 2021 • Rachel Freedman, Rohin Shah, Anca Dragan
A promising alternative to manually specifying reward functions is to enable robots to infer them from human feedback, such as demonstrations or corrections.
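A minimal sketch of inferring a reward from demonstrations, assuming a Boltzmann-rational demonstrator choosing among discrete actions; the feature vectors, candidate weights, and demonstration indices are made up for the example.

```python
import numpy as np

# Each action is described by a feature vector; reward = w . features.
action_features = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [0.7, 0.7]])
# Discrete hypothesis space over reward weights w.
candidate_ws = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
demos = [2, 2, 0]  # indices of actions the human demonstrated

def demo_likelihood(w, action, beta=3.0):
    """P(human demonstrates `action`) under Boltzmann rationality."""
    q = beta * action_features @ w
    q -= q.max()                        # numerical stability
    probs = np.exp(q) / np.exp(q).sum()
    return probs[action]

posterior = np.ones(len(candidate_ws))
for a in demos:
    posterior *= np.array([demo_likelihood(w, a) for w in candidate_ws])
posterior /= posterior.sum()
print(candidate_ws[np.argmax(posterior)])  # most probable reward weights
```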
no code implementations • 1 Jan 2021 • Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael D Dennis, Pieter Abbeel, Anca Dragan, Stuart Russell
By merging reward learning and control, assistive agents can reason about the impact of control actions on reward learning, leading to several advantages over agents that perform reward learning and control as separate phases.
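A toy illustration of that reasoning, reduced to a value-of-information calculation: an assistive agent weighs acting immediately under its current belief about the reward against first querying the human and then acting, at a small query cost. Everything here (hypotheses, payoffs, cost) is an invented assumption, not the paper's formalism.

```python
import numpy as np

# Two reward hypotheses over two actions; the agent holds a belief over them.
payoffs = np.array([[10.0, 0.0],    # if hypothesis 0 is true
                    [0.0, 10.0]])   # if hypothesis 1 is true
belief = np.array([0.6, 0.4])
query_cost = 1.0  # e.g. the human's time, or task delay

# Option 1: act now on the expected reward under the current belief.
act_now_value = (belief @ payoffs).max()

# Option 2: ask first. Assume the human reveals the true hypothesis,
# after which the agent picks the best action for it.
ask_value = sum(b * payoffs[h].max() for h, b in enumerate(belief)) - query_cost

# Merging learning and control means making exactly this comparison online.
print("ask" if ask_value > act_now_value else "act")  # -> "ask" (10 - 1 > 6)
```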
no code implementations • 16 Jun 2020 • Rachel Freedman
In this paper, we propose, implement, and evaluate a methodology for prioritizing patients based on heterogeneous moral preferences.
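A minimal sketch of one such methodology, assuming each survey respondent's moral preferences have already been distilled into feature weights; averaging those weights and scoring patients is just one simple aggregation choice, and the features and numbers are invented.

```python
import numpy as np

# Patient profiles as feature vectors: [years_of_life_gained, num_dependents, is_smoker]
patients = {
    "patient_a": np.array([20.0, 2.0, 0.0]),
    "patient_b": np.array([35.0, 0.0, 1.0]),
    "patient_c": np.array([15.0, 3.0, 0.0]),
}

# Per-respondent moral weights elicited from surveys (rows = respondents).
respondent_weights = np.array([[1.0, 0.5, -0.3],
                               [0.8, 1.2, -0.1],
                               [1.1, 0.2, -0.8]])

# Aggregate heterogeneous preferences by averaging weights, then rank patients.
mean_w = respondent_weights.mean(axis=0)
priority = sorted(patients, key=lambda p: patients[p] @ mean_w, reverse=True)
print(priority)
```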
1 code implementation • 19 May 2020 • Rachel Freedman, Jana Schaich Borg, Walter Sinnott-Armstrong, John P. Dickerson, Vincent Conitzer
In kidney exchanges, a central market maker allocates living kidney donors to patients in need of an organ.
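For intuition, here is a minimal sketch of the allocation step as maximum-weight matching on a compatibility graph, using networkx. Real kidney-exchange clearinghouses also allow longer cycles and altruist-initiated chains, typically solved with integer programming; the pair names and edge weights below are invented.

```python
import networkx as nx

# Nodes are incompatible patient-donor pairs; an edge means the two pairs can
# swap donors, weighted by the estimated quality/priority of the exchange.
G = nx.Graph()
G.add_weighted_edges_from([
    ("pair1", "pair2", 2.0),
    ("pair2", "pair3", 1.5),
    ("pair3", "pair4", 2.5),
    ("pair1", "pair4", 1.0),
])

# Two-way exchanges only: pick the set of swaps with maximum total weight.
matching = nx.max_weight_matching(G, maxcardinality=True)
print(sorted(tuple(sorted(e)) for e in matching))
# e.g. [('pair1', 'pair2'), ('pair3', 'pair4')] with total weight 4.5
```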