no code implementations • 30 Aug 2024 • Tan Zhi-Xuan, Micah Carroll, Matija Franklin, Hal Ashton
We first survey the limits of rational choice theory as a descriptive model, explaining how preferences fail to capture the thick semantic content of human values, and how utility representations neglect the possible incommensurability of those values.
no code implementations • 28 May 2024 • Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, Anca Dragan
Existing AI alignment approaches assume that preferences are static, which is unrealistic: our preferences change, and may even be influenced by our interactions with AI systems themselves.
no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.
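For intuition about the technique this entry surveys, here is a minimal sketch of the KL-regularized objective commonly used in RLHF's policy-optimization stage. It is a generic textbook formulation, not code from the paper; the tensors and the KL coefficient below are stand-ins.

```python
# Sketch of the standard RLHF policy objective (a common formulation, not
# this paper's code): maximize learned reward minus a KL penalty that keeps
# the policy close to a reference model.
import torch

def rlhf_objective(reward, logp_policy, logp_ref, kl_coef=0.1):
    # Per-sample: r(x, y) - kl_coef * (log pi(y|x) - log pi_ref(y|x)),
    # where the log-ratio is the usual per-sample KL estimate.
    return (reward - kl_coef * (logp_policy - logp_ref)).mean()

reward = torch.tensor([1.2, 0.3])          # scores from a learned reward model
logp_policy = torch.tensor([-5.0, -7.1])   # log-prob of outputs under the policy
logp_ref = torch.tensor([-5.5, -6.8])      # log-prob under the reference model
print(rlhf_objective(reward, logp_policy, logp_ref))
```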
no code implementations • 15 Jun 2023 • Niklas Lauffer, Ameesh Shah, Micah Carroll, Michael Dennis, Stuart Russell
We apply this algorithm to analyze the strategically relevant information for tasks in both a standard and a partially observable version of the Overcooked environment.
no code implementations • 30 Nov 2022 • David Zhang, Micah Carroll, Andreea Bobu, Anca Dragan
One of the most successful paradigms for reward learning uses human feedback in the form of comparisons.
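The standard recipe for fitting a reward model to such comparisons is the Bradley-Terry likelihood; below is a minimal, self-contained sketch of that general recipe. The network and the compared segments are hypothetical stand-ins, and the paper's actual method may differ.

```python
# Minimal sketch of learning rewards from pairwise comparisons via the
# Bradley-Terry model (a common formulation, not this paper's exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in reward model over 8-dimensional segment features.
reward_model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))

def comparison_loss(preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # P(preferred > rejected) = sigmoid(r_preferred - r_rejected)
    r_p = reward_model(preferred).squeeze(-1)
    r_r = reward_model(rejected).squeeze(-1)
    return -F.logsigmoid(r_p - r_r).mean()

# Stand-in batch of feature vectors for compared trajectory segments.
loss = comparison_loss(torch.randn(32, 8), torch.randn(32, 8))
loss.backward()  # gradients flow into the reward model's parameters
```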
1 code implementation • 20 Nov 2022 • Micah Carroll, Orr Paradise, Jessy Lin, Raluca Georgescu, Mingfei Sun, David Bignell, Stephanie Milani, Katja Hofmann, Matthew Hausknecht, Anca Dragan, Sam Devlin
Randomly masking and predicting word tokens has been a successful approach in pre-training language models for a variety of downstream tasks.
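For readers unfamiliar with the pre-training idea referenced here, this is a minimal sketch of BERT-style random masking; the token ids, mask rate, and MASK_ID are illustrative assumptions.

```python
# Toy sketch of random masking for masked prediction (BERT-style). The
# ignore label -100 follows a common convention; MASK_ID is an assumption.
import numpy as np

MASK_ID = 0  # reserved id standing in for the [MASK] token

def mask_tokens(tokens: np.ndarray, mask_prob: float = 0.15, rng=None):
    """Return (masked_tokens, labels); labels are -100 wherever the model
    is not asked to predict, so the loss ignores unmasked positions."""
    rng = rng or np.random.default_rng()
    is_masked = rng.random(tokens.shape) < mask_prob
    labels = np.where(is_masked, tokens, -100)
    masked = np.where(is_masked, MASK_ID, tokens)
    return masked, labels

tokens = np.array([5, 17, 9, 42, 3, 8])   # stand-in token ids
masked, labels = mask_tokens(tokens)
```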
no code implementations • 3 Nov 2022 • Mesut Yang, Micah Carroll, Anca Dragan
We show that using optimal behavior as a prior for human models makes these models vastly more data-efficient and able to generalize to new environments.
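One simple way to encode "optimal behavior as a prior" is a Boltzmann-rational model, in which the human is assumed to act noisily optimally. The sketch below shows this common construction with made-up Q-values; the paper's actual human model may differ.

```python
# Hedged sketch of a Boltzmann-rational human model built on optimal
# Q-values: P(a|s) is proportional to exp(beta * Q*(s, a)). The Q-values
# here are invented stand-ins.
import numpy as np

def boltzmann_policy(q_values: np.ndarray, beta: float = 2.0) -> np.ndarray:
    """beta -> infinity recovers the optimal policy; beta = 0 gives a
    uniform (maximally noisy) human."""
    logits = beta * (q_values - q_values.max())  # subtract max for stability
    probs = np.exp(logits)
    return probs / probs.sum()

q = np.array([1.0, 0.5, -0.2])  # hypothetical Q*(s, a) for three actions
print(boltzmann_policy(q))
```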
no code implementations • 28 Apr 2022 • Micah Carroll, Jessy Lin, Orr Paradise, Raluca Georgescu, Mingfei Sun, David Bignell, Stephanie Milani, Katja Hofmann, Matthew Hausknecht, Anca Dragan, Sam Devlin
Randomly masking and predicting word tokens has been a successful approach in pre-training language models for a variety of downstream tasks.
no code implementations • 25 Apr 2022 • Micah Carroll, Anca Dragan, Stuart Russell, Dylan Hadfield-Menell
These steps involve two challenging ingredients. Estimation requires anticipating how hypothetical algorithms would influence user preferences if deployed; we do this by using historical user interaction data to train a predictive user model that implicitly contains their preference dynamics. Evaluation and optimization additionally require metrics to assess whether such influences are manipulative or otherwise unwanted; for this we use the notion of "safe shifts", which defines a trust region within which behavior is safe. For instance, the natural way in which users' preferences would shift without interference from the system could be deemed "safe".
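A rough, self-contained sketch of that recipe follows. All interfaces and dynamics here are invented stand-ins: a learned user model is rolled out under a candidate recommender, and induced preference shifts are penalized when they leave the trust region around the user's natural drift.

```python
# Toy sketch of the "safe shift" idea, with made-up interfaces: simulate
# preference trajectories under a candidate recommender and penalize
# deviations from the user's natural (no-interference) drift.
import numpy as np

def rollout_preferences(user_model, recommender, prefs, steps=10):
    """Simulate how preferences evolve while the recommender is deployed."""
    traj = [prefs]
    for _ in range(steps):
        item = recommender(traj[-1])
        traj.append(user_model(traj[-1], item))
    return np.stack(traj)

def shift_penalty(induced, natural, tolerance=0.1):
    """Penalize induced shifts outside the trust region around natural drift."""
    deviation = np.linalg.norm(induced - natural, axis=-1)
    return np.maximum(deviation - tolerance, 0.0).sum()

# Stand-in dynamics: preferences drift toward whatever is recommended.
user_model = lambda p, item: 0.9 * p + 0.1 * item
recommender = lambda p: p / np.linalg.norm(p)   # candidate system
prefs = np.array([0.2, 0.8])
natural = rollout_preferences(user_model, lambda p: np.zeros_like(p), prefs)
induced = rollout_preferences(user_model, recommender, prefs)
print(shift_penalty(induced, natural))
```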
no code implementations • 14 Jan 2021 • Paul Knott, Micah Carroll, Sam Devlin, Kamil Ciosek, Katja Hofmann, Anca D. Dragan, Rohin Shah
We apply this methodology to build a suite of unit tests for the Overcooked-AI environment, and use this test suite to evaluate three proposals for improving robustness.
2 code implementations • NeurIPS 2019 • Micah Carroll, Rohin Shah, Mark K. Ho, Thomas L. Griffiths, Sanjit A. Seshia, Pieter Abbeel, Anca Dragan
While we would like agents that can coordinate with humans, current algorithms such as self-play and population-based training create agents that can coordinate with themselves.