1 code implementation • 3 Apr 2024 • Stephen Casper, Jieun Yun, Joonhyuk Baek, Yeseong Jung, Minhwan Kim, Kiwan Kwon, Saerom Park, Hayden Moore, David Shriver, Marissa Connor, Keltin Grimes, Angus Nicolson, Arush Tagade, Jessica Rumbelow, Hieu Minh Nguyen, Dylan Hadfield-Menell
Interpretability techniques are valuable for helping humans understand and oversee AI systems.
1 code implementation • 8 Mar 2024 • Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell
Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit harmful unintended behaviors.
no code implementations • 26 Feb 2024 • Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell
Machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (LLMs), but there are not yet standardized methods for rigorously evaluating it.
no code implementations • 25 Jan 2024 • Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, Dylan Hadfield-Menell
The effectiveness of an audit, however, depends on the degree of system access granted to auditors.
1 code implementation • 13 Dec 2023 • Anand Siththaranjan, Cassidy Laidlaw, Dylan Hadfield-Menell
We prove that standard applications of preference learning, including reinforcement learning from human feedback (RLHF), implicitly aggregate over hidden contexts according to a well-known voting rule called Borda count.
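The Borda count mentioned here can be illustrated with a minimal sketch (hypothetical rankings, not data from the paper): in each individual ranking, a candidate earns one point per candidate ranked below it, and the points are summed across rankings.

```python
from collections import defaultdict

def borda_scores(rankings):
    """Sum Borda points: a candidate gets one point for each
    candidate ranked below it in each individual ranking."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return dict(scores)

# Three annotators rank three responses (hypothetical data)
rankings = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
print(borda_scores(rankings))  # "a" gets the highest aggregate score
```

The paper's point is that preference-learning pipelines such as RLHF perform this kind of aggregation implicitly when annotators' judgments depend on hidden context.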
1 code implementation • 27 Nov 2023 • Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, Jacob Andreas
This has led some researchers to conclude that LMs "lie" or otherwise encode non-cooperative communicative intents.
no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.
no code implementations • 8 Jul 2023 • Stephen Casper, Zifan Guo, Shreya Mogulothu, Zachary Marinov, Chinmay Deshpande, Rui-Jie Yew, Zheng Dai, Dylan Hadfield-Menell
When Stable Diffusion is prompted to imitate an artist from this set, we find that the artist can be identified from the imitation with an average accuracy of 81.0%.
3 code implementations • 15 Jun 2023 • Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell
Using a pre-existing classifier does not allow for red-teaming to be tailored to the target model.
no code implementations • 13 Feb 2023 • Andreas Haupt, Dylan Hadfield-Menell, Chara Podimata
We model this user behavior as a two-stage noisy signalling game between the recommendation system and users: the system first commits to a recommendation policy, then presents content during a cold-start phase, which users strategically choose to consume in order to influence the content they are shown in the subsequent recommendation phase.
1 code implementation • 18 Nov 2022 • Stephen Casper, Kaivalya Hariharan, Dylan Hadfield-Menell
Some previous works have proposed using human-interpretable adversarial attacks including copy/paste attacks in which one natural image pasted into another causes an unexpected misclassification.
2 code implementations • 5 Sep 2022 • Stephen Casper, Taylor Killian, Gabriel Kreiman, Dylan Hadfield-Menell
In this work, we study white-box adversarial policies and show that having access to a target agent's internal state can be useful for identifying its vulnerabilities.
1 code implementation • 22 Aug 2022 • Andreas A. Haupt, Phillip J. K. Christoffersen, Mehul Damani, Dylan Hadfield-Menell
In this work, we draw upon the idea of formal contracting from economics to overcome diverging incentives between agents in MARL.
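A toy sketch of how a contract can realign incentives (the payoff matrix and transfer value are hypothetical, not the paper's MARL setup): in a prisoner's dilemma, a contract that obliges any defecting agent to pay a transfer to the other agent can make mutual cooperation an equilibrium.

```python
# Hypothetical prisoner's dilemma payoffs: (row, col) for actions C/D
payoffs = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def with_contract(payoffs, transfer):
    """Augment payoffs so that a defecting agent pays `transfer`
    to the other agent."""
    contracted = {}
    for (a1, a2), (r1, r2) in payoffs.items():
        if a1 == "D":
            r1, r2 = r1 - transfer, r2 + transfer
        if a2 == "D":
            r1, r2 = r1 + transfer, r2 - transfer
        contracted[(a1, a2)] = (r1, r2)
    return contracted

contracted = with_contract(payoffs, transfer=3)
# Unilateral defection no longer pays: the row player's payoff for
# (D, C) falls below the payoff for (C, C), so C is a best response to C.
```

Because the transfers are zero-sum between the agents, the contract changes individual incentives without changing total welfare at any outcome.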
no code implementations • 1 Aug 2022 • Mihaela Curmei, Andreas Haupt, Dylan Hadfield-Menell, Benjamin Recht
Second, we discuss implications of dynamic preference models for recommendation systems evaluation and design.
no code implementations • 27 Jul 2022 • Tilman Räuker, Anson Ho, Stephen Casper, Dylan Hadfield-Menell
The last decade of machine learning has seen drastic increases in scale and capabilities.
no code implementations • 20 Jul 2022 • Jonathan Stray, Alon Halevy, Parisa Assar, Dylan Hadfield-Menell, Craig Boutilier, Amar Ashar, Lex Beattie, Michael Ekstrand, Claire Leibowicz, Connie Moon Sehat, Sara Johansen, Lianne Kerlin, David Vickrey, Spandana Singh, Sanne Vrijenhoek, Amy Zhang, McKane Andrus, Natali Helberger, Polina Proutskova, Tanushree Mitra, Nina Vasan
We collect a set of values that seem most relevant to recommender systems operating across different domains, then examine them from the perspectives of current industry practice, measurement, product design, and policy approaches.
1 code implementation • 16 Jun 2022 • Theodore R Sumers, Robert D Hawkins, Mark K Ho, Thomas L Griffiths, Dylan Hadfield-Menell
We study two distinct types of language: $\textit{instructions}$, which provide information about the desired policy, and $\textit{descriptions}$, which provide information about the reward function.
no code implementations • 25 Apr 2022 • Micah Carroll, Anca Dragan, Stuart Russell, Dylan Hadfield-Menell
These steps involve two challenging ingredients. Estimation requires anticipating how hypothetical algorithms would influence user preferences if deployed; we do this by using historical user interaction data to train a predictive user model that implicitly captures their preference dynamics. Evaluation and optimization additionally require metrics to assess whether such influences are manipulative or otherwise unwanted; we use the notion of "safe shifts," which define a trust region within which behavior is safe. For instance, the natural way in which users' preferences would shift without interference from the system could be deemed safe.
no code implementations • 11 Apr 2022 • Theodore R. Sumers, Robert D. Hawkins, Mark K. Ho, Thomas L. Griffiths, Dylan Hadfield-Menell
We then define a pragmatic listener which performs inverse reward design by jointly inferring the speaker's latent horizon and rewards.
no code implementations • 6 Dec 2021 • Michael James McDonald, Dylan Hadfield-Menell
While modern policy optimization methods can do complex manipulation from sensory data, they struggle on problems with extended time horizons and multiple sub-goals.
2 code implementations • 7 Oct 2021 • Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman
We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale.
no code implementations • 22 Jul 2021 • Jonathan Stray, Ivan Vendrov, Jeremy Nixon, Steven Adler, Dylan Hadfield-Menell
We describe cases where real recommender systems were modified in the service of various human values such as diversity, fairness, well-being, time well spent, and factual accuracy.
no code implementations • NeurIPS 2020 • Simon Zhuang, Dylan Hadfield-Menell
We consider the cost of this incompleteness by analyzing a model of a principal and an agent in a resource constrained world where the $L$ attributes of the state correspond to different sources of utility for the principal.
no code implementations • 29 Dec 2020 • Arnaud Fickinger, Simon Zhuang, Andrew Critch, Dylan Hadfield-Menell, Stuart Russell
We introduce the concept of a multi-principal assistance game (MPAG), and circumvent an obstacle in social choice theory, Gibbard's theorem, by using a sufficiently collegial preference inference mechanism.
no code implementations • 19 Jul 2020 • Arnaud Fickinger, Simon Zhuang, Dylan Hadfield-Menell, Stuart Russell
Assistance games (also known as cooperative inverse reinforcement learning games) have been proposed as a model for beneficial AI, wherein a robotic agent must act on behalf of a human principal but is initially uncertain about the human's payoff function.
2 code implementations • 25 Jan 2020 • Raphael Köster, Dylan Hadfield-Menell, Gillian K. Hadfield, Joel Z. Leibo
How can societies learn to enforce and comply with social norms?
no code implementations • 6 Jun 2019 • Matthew Rahtz, James Fang, Anca D. Dragan, Dylan Hadfield-Menell
In deep reinforcement learning, for example, directly specifying a reward as a function of a high-dimensional observation is challenging.
no code implementations • 2 May 2019 • Marc Khoury, Dylan Hadfield-Menell
We show that adversarial training with Voronoi constraints produces robust models which significantly improve over the state-of-the-art on MNIST and are competitive on CIFAR-10.
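The Voronoi constraint can be sketched as a membership test (hypothetical 2-D points; the paper's training procedure searches for worst-case perturbations subject to this constraint): a perturbed example is admissible only while it stays inside the Voronoi cell of its clean source, i.e. while the source is still its nearest training point.

```python
from math import dist

def in_voronoi_cell(x_perturbed, x_source, train_points):
    """True if x_perturbed is still closer to its source point than to
    any other training point (i.e. inside x_source's Voronoi cell)."""
    d_source = dist(x_perturbed, x_source)
    d_others = min(
        dist(x_perturbed, p) for p in train_points if p != x_source
    )
    return d_source <= d_others

train = [(0.0, 0.0), (2.0, 0.0)]
x = train[0]
print(in_voronoi_cell((0.4, 0.0), x, train))  # small shift: still in the cell
print(in_voronoi_cell((1.5, 0.0), x, train))  # crossed into the neighbor's cell
```

Unlike a fixed epsilon-ball, the admissible region adapts to the local geometry of the data.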
3 code implementations • 26 Feb 2019 • Alexander Matt Turner, Dylan Hadfield-Menell, Prasad Tadepalli
Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment.
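One way to discourage irreversible change, in the spirit of attainable utility preservation, is to penalize an action by how much it shifts the agent's ability to attain auxiliary goals relative to doing nothing. The sketch below uses illustrative Q-values and a hypothetical penalty weight, not the paper's experiments.

```python
def penalized_reward(reward, q_aux_action, q_aux_noop, lam):
    """Primary reward minus a penalty proportional to how much the
    action changes attainable utility for each auxiliary goal,
    measured against a no-op baseline."""
    penalty = sum(abs(qa - qn) for qa, qn in zip(q_aux_action, q_aux_noop))
    return reward - lam * penalty / len(q_aux_action)

# Hypothetical Q-values for two auxiliary goals: an irreversible action
# sharply reduces attainable utility, so it is penalized heavily even
# though its primary reward is higher.
irreversible = penalized_reward(1.0, [0.0, 0.1], [0.9, 0.8], lam=1.0)
gentle = penalized_reward(0.8, [0.85, 0.75], [0.9, 0.8], lam=1.0)
```

Under this penalty the gentler action scores higher, steering the agent away from destroying its ability to pursue other objectives later.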
1 code implementation • 24 Jan 2019 • Lawrence Chan, Dylan Hadfield-Menell, Siddhartha Srinivasa, Anca Dragan
Learning preferences implicit in the choices humans make is a well-studied problem in both economics and computer science.
no code implementations • 4 Jan 2019 • Gokul Swamy, Jens Schulz, Rohan Choudhury, Dylan Hadfield-Menell, Anca Dragan
Fundamental to robotics is the debate between model-based and model-free learning: should the robot build an explicit model of the world, or learn a policy directly?
no code implementations • 21 Dec 2018 • Ravi Pandya, Sandy H. Huang, Dylan Hadfield-Menell, Anca D. Dragan
People frequently face challenging decision-making problems in which outcomes are uncertain or unknown.
no code implementations • 3 Nov 2018 • Dylan Hadfield-Menell, McKane Andrus, Gillian K. Hadfield
It has become commonplace to assert that autonomous agents will have to be built to follow human rules of behavior: social norms and laws.
no code implementations • ICLR 2019 • Marc Khoury, Dylan Hadfield-Menell
Adversarial examples are a pervasive phenomenon of machine learning models where seemingly imperceptible perturbations to the input lead to misclassifications for otherwise statistically accurate models.
1 code implementation • 9 Sep 2018 • Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell
We propose structuring this process as a series of queries asking the user to compare between different reward functions.
no code implementations • ICML 2018 • Dhruv Malik, Malayandi Palaniappan, Jaime F. Fisac, Dylan Hadfield-Menell, Stuart Russell, Anca D. Dragan
We apply this update to a variety of POMDP solvers and find that it enables us to scale CIRL to non-trivial problems, with larger reward parameter spaces, and larger action spaces for both robot and human.
no code implementations • 7 Jun 2018 • Ellis Ratner, Dylan Hadfield-Menell, Anca D. Dragan
Designing a good reward function is essential to robot planning and reinforcement learning, but it can also be challenging and frustrating.
no code implementations • 12 Apr 2018 • Dylan Hadfield-Menell, Gillian Hadfield
We suggest that the analysis of incomplete contracting developed by law and economics researchers can provide a useful framework for understanding the AI alignment problem and help to generate a systematic approach to finding solutions.
1 code implementation • NeurIPS 2017 • Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, Anca Dragan
When designing the reward, we might think of some specific training scenarios, and make sure that the reward will lead to the right behavior in those scenarios.
no code implementations • 20 Jul 2017 • Jaime F. Fisac, Monica A. Gates, Jessica B. Hamrick, Chang Liu, Dylan Hadfield-Menell, Malayandi Palaniappan, Dhruv Malik, S. Shankar Sastry, Thomas L. Griffiths, Anca D. Dragan
In robotics, value alignment is key to the design of collaborative robots that can integrate into human workflows, successfully inferring and adapting to their users' objectives as they go.
1 code implementation • 28 May 2017 • Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, Stuart Russell
We show that when a human is not perfectly rational then a robot that tries to infer and act according to the human's underlying preferences can always perform better than a robot that simply follows the human's literal order.
no code implementations • 24 Nov 2016 • Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell
We analyze a simple game between a human H and a robot R, where H can press R's off switch but R can disable the off switch.
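The value of leaving the off switch enabled can be sketched numerically (the belief distribution and utilities below are hypothetical, standing in for the paper's formal analysis): when R is uncertain about H's utility and H is rational, deferring to H weakly dominates both acting unilaterally and switching off.

```python
# R's belief over the utility of its proposed action to H (hypothetical)
outcomes = [(-1.0, 0.3), (0.5, 0.4), (2.0, 0.3)]  # (utility, probability)

act = sum(u * p for u, p in outcomes)        # act, bypassing the switch
switch_off = 0.0                             # disable itself: utility 0
# Defer: a rational H permits the action only when its utility >= 0,
# so negative-utility outcomes are replaced by the off-switch payoff.
defer = sum(max(u, 0.0) * p for u, p in outcomes)

assert defer >= max(act, switch_off)  # deferring weakly dominates
```

The gap between `defer` and `act` shrinks as R's uncertainty shrinks, matching the paper's observation that an agent confident in its objective has little incentive to let itself be switched off.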
2 code implementations • NeurIPS 2016 • Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell
For an autonomous system to be helpful to humans and to pose no unwarranted risks, it needs to align its values with those of the humans in its environment in such a way that its actions contribute to the maximization of value for the humans.