no code implementations • 13 Feb 2023 • Andreas Haupt, Dylan Hadfield-Menell, Chara Podimata
We model this user behavior as a two-stage noisy signalling game between the recommendation system and its users: the recommendation system first commits to a recommendation policy and presents content during a cold-start phase; users then strategically choose what to consume in that phase in order to influence the types of content they will be recommended in the subsequent recommendation phase.
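To make the two-phase structure concrete, here is a minimal toy sketch in which a user strategically chooses cold-start consumption given a committed recommendation policy. The content types, payoffs, and policy below are invented for illustration and are not taken from the paper.

```python
# Toy sketch of the two-phase interaction described above.
# All quantities (two content types, linear utilities, the committed policy)
# are illustrative assumptions, not the paper's model.

def committed_policy(frac_a_consumed_phase1: float) -> float:
    # The system commits to this mapping: fraction of type-A content recommended
    # in phase 2 as a function of type-A consumption observed in phase 1.
    return 0.2 + 0.8 * frac_a_consumed_phase1

def user_utility(frac_a: float, true_pref_a: float) -> float:
    # How well a content mix matches the user's true preferred mix.
    return 1.0 - abs(frac_a - true_pref_a)

def best_strategic_consumption(true_pref_a: float, phase2_weight: float = 3.0) -> float:
    # The user picks phase-1 consumption to maximize utility across both phases,
    # trading off immediate enjoyment against the recommendations it induces.
    candidates = [i / 100 for i in range(101)]
    def total_utility(frac_a_phase1: float) -> float:
        phase1 = user_utility(frac_a_phase1, true_pref_a)
        phase2 = user_utility(committed_policy(frac_a_phase1), true_pref_a)
        return phase1 + phase2_weight * phase2
    return max(candidates, key=total_utility)

if __name__ == "__main__":
    # A user who truly prefers a 50/50 mix under-consumes type A in phase 1
    # (prints 0.38 here) to steer phase-2 recommendations toward 0.5.
    print(best_strategic_consumption(true_pref_a=0.5))
```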
1 code implementation • 8 Feb 2023 • Stephen Casper, Yuxiao Li, Jiawei Li, Tong Bu, Kevin Zhang, Dylan Hadfield-Menell
Interpreting deep neural networks is the topic of much current research in AI.
1 code implementation • 18 Nov 2022 • Stephen Casper, Kaivalya Hariharan, Dylan Hadfield-Menell
Third, we compare this approach with other interpretability tools by attempting to rediscover trojans.
1 code implementation • 5 Sep 2022 • Stephen Casper, Dylan Hadfield-Menell, Gabriel Kreiman
Third, we show that training against white-box adversarial policies can be used to make learners in single-agent environments more robust to domain shifts.
no code implementations • 22 Aug 2022 • Phillip J. K. Christoffersen, Andreas A. Haupt, Dylan Hadfield-Menell
Replicating such cooperative behaviors in self-interested agents is an open problem in multi-agent reinforcement learning (MARL).
no code implementations • 1 Aug 2022 • Mihaela Curmei, Andreas Haupt, Dylan Hadfield-Menell, Benjamin Recht
Second, we discuss implications of dynamic preference models for recommendation systems evaluation and design.
no code implementations • 27 Jul 2022 • Tilman Räuker, Anson Ho, Stephen Casper, Dylan Hadfield-Menell
The last decade of machine learning has seen drastic increases in scale and capabilities.
no code implementations • 20 Jul 2022 • Jonathan Stray, Alon Halevy, Parisa Assar, Dylan Hadfield-Menell, Craig Boutilier, Amar Ashar, Lex Beattie, Michael Ekstrand, Claire Leibowicz, Connie Moon Sehat, Sara Johansen, Lianne Kerlin, David Vickrey, Spandana Singh, Sanne Vrijenhoek, Amy Zhang, McKane Andrus, Natali Helberger, Polina Proutskova, Tanushree Mitra, Nina Vasan
We collect a set of values that seem most relevant to recommender systems operating across different domains, then examine them from the perspectives of current industry practice, measurement, product design, and policy approaches.
1 code implementation • 16 Jun 2022 • Theodore R Sumers, Robert D Hawkins, Mark K Ho, Thomas L Griffiths, Dylan Hadfield-Menell
We study two distinct types of language: $\textit{instructions}$, which provide information about the desired policy, and $\textit{descriptions}$, which provide information about the reward function.
no code implementations • 25 Apr 2022 • Micah Carroll, Anca Dragan, Stuart Russell, Dylan Hadfield-Menell
These steps involve two challenging ingredients. Estimation requires anticipating how hypothetical algorithms would influence user preferences if deployed; we do this by training a predictive user model on historical user interaction data, which implicitly captures their preference dynamics. Evaluation and optimization additionally require metrics for assessing whether such influences are manipulative or otherwise unwanted; we use the notion of "safe shifts", which define a trust region within which behavior is considered safe: for instance, the natural way in which users' preferences would shift without interference from the system could be deemed "safe".
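As a rough sketch of how such a "safe shift" check could be operationalized, assuming predicted preference vectors for the candidate system and for a no-interference baseline are available (the function and inputs below are illustrative, not the paper's implementation):

```python
import numpy as np

def is_safe_shift(induced_prefs: np.ndarray,
                  natural_prefs: np.ndarray,
                  radius: float) -> bool:
    """Illustrative trust-region check (not the paper's implementation).

    induced_prefs: preference vector predicted under the candidate recommender.
    natural_prefs: preference vector predicted with no system interference
                   (the 'natural' shift, deemed safe by assumption).
    radius:        maximum allowed deviation from the natural shift.
    """
    return float(np.linalg.norm(induced_prefs - natural_prefs)) <= radius

def shift_penalty(induced_prefs: np.ndarray,
                  natural_prefs: np.ndarray,
                  radius: float,
                  weight: float = 10.0) -> float:
    # Usage sketch: penalize candidate policies whose predicted preference shift
    # leaves the trust region around the natural shift.
    excess = max(0.0, float(np.linalg.norm(induced_prefs - natural_prefs)) - radius)
    return weight * excess
```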
no code implementations • 11 Apr 2022 • Theodore R. Sumers, Robert D. Hawkins, Mark K. Ho, Thomas L. Griffiths, Dylan Hadfield-Menell
We then define a pragmatic listener which performs inverse reward design by jointly inferring the speaker's latent horizon and rewards.
no code implementations • 6 Dec 2021 • Michael James McDonald, Dylan Hadfield-Menell
While modern policy optimization methods can do complex manipulation from sensory data, they struggle on problems with extended time horizons and multiple sub-goals.
2 code implementations • 7 Oct 2021 • Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman
We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale.
no code implementations • 22 Jul 2021 • Jonathan Stray, Ivan Vendrov, Jeremy Nixon, Steven Adler, Dylan Hadfield-Menell
We describe cases where real recommender systems were modified in the service of various human values such as diversity, fairness, well-being, time well spent, and factual accuracy.
no code implementations • NeurIPS 2020 • Simon Zhuang, Dylan Hadfield-Menell
We consider the cost of this incompleteness by analyzing a model of a principal and an agent in a resource-constrained world, where the $L$ attributes of the state correspond to different sources of utility for the principal.
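As a rough toy illustration of that setup (a simplified construction, not the paper's formal model): the principal's utility depends on all $L$ attributes, the proxy reward credits only a subset, and the agent allocates a fixed resource budget to maximize the proxy.

```python
import numpy as np

L = 6                      # number of state attributes (sources of principal utility)
mentioned = [0, 1, 2]      # attributes included in the proxy reward (an assumption)
budget = 10.0              # total resources the agent can allocate

def principal_utility(x: np.ndarray) -> float:
    # Toy assumption: diminishing returns in every attribute.
    return float(np.sum(np.log1p(x)))

def proxy_utility(x: np.ndarray) -> float:
    # The proxy only credits the mentioned attributes.
    return float(np.sum(np.log1p(x[mentioned])))

# Agent: pour the entire budget into the mentioned attributes (an even split
# over the mentioned set is proxy-optimal here, by symmetry of log1p).
x_agent = np.zeros(L)
x_agent[mentioned] = budget / len(mentioned)

# Baseline: spread the budget evenly over all attributes.
x_even = np.full(L, budget / L)

# The proxy-optimizing allocation yields lower principal utility than the
# even allocation, illustrating the cost of the incomplete proxy.
print("principal utility, proxy-optimizing agent:", principal_utility(x_agent))
print("principal utility, even allocation:       ", principal_utility(x_even))
```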
no code implementations • 29 Dec 2020 • Arnaud Fickinger, Simon Zhuang, Andrew Critch, Dylan Hadfield-Menell, Stuart Russell
We introduce the concept of a multi-principal assistance game (MPAG), and circumvent an obstacle in social choice theory, Gibbard's theorem, by using a sufficiently collegial preference inference mechanism.
no code implementations • 19 Jul 2020 • Arnaud Fickinger, Simon Zhuang, Dylan Hadfield-Menell, Stuart Russell
Assistance games (also known as cooperative inverse reinforcement learning games) have been proposed as a model for beneficial AI, wherein a robotic agent must act on behalf of a human principal but is initially uncertain about the human's payoff function.
2 code implementations • 25 Jan 2020 • Raphael Köster, Dylan Hadfield-Menell, Gillian K. Hadfield, Joel Z. Leibo
How can societies learn to enforce and comply with social norms?
no code implementations • 6 Jun 2019 • Matthew Rahtz, James Fang, Anca D. Dragan, Dylan Hadfield-Menell
In deep reinforcement learning, for example, directly specifying a reward as a function of a high-dimensional observation is challenging.
no code implementations • 2 May 2019 • Marc Khoury, Dylan Hadfield-Menell
We show that adversarial training with Voronoi constraints produces robust models which significantly improve over the state-of-the-art on MNIST and are competitive on CIFAR-10.
3 code implementations • 26 Feb 2019 • Alexander Matt Turner, Dylan Hadfield-Menell, Prasad Tadepalli
Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment.
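One way to operationalize such caution is to penalize actions that change the agent's ability to optimize other, auxiliary reward functions. The sketch below is an illustrative attainable-utility-style penalty in that spirit, assuming auxiliary Q-functions and a designated no-op action are available; it is not necessarily the paper's exact formulation.

```python
def attainable_utility_penalty(q_aux, state, action, noop_action, lam: float = 0.1) -> float:
    """Illustrative attainable-utility-style penalty (not necessarily the
    paper's exact formulation).

    q_aux:       list of Q-functions for auxiliary reward functions,
                 each callable as q(state, action).
    noop_action: a designated "do nothing" action used as the baseline.
    lam:         penalty weight; larger values give more conservative behavior.
    """
    diffs = [abs(q(state, action) - q(state, noop_action)) for q in q_aux]
    return lam * sum(diffs) / max(len(q_aux), 1)

def shaped_reward(reward: float, q_aux, state, action, noop_action, lam: float = 0.1) -> float:
    # The agent optimizes the observed reward minus the penalty, discouraging
    # actions that change how much auxiliary utility remains attainable.
    return reward - attainable_utility_penalty(q_aux, state, action, noop_action, lam)
```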
1 code implementation • 24 Jan 2019 • Lawrence Chan, Dylan Hadfield-Menell, Siddhartha Srinivasa, Anca Dragan
Learning preferences implicit in the choices humans make is a well-studied problem in both economics and computer science.
no code implementations • 4 Jan 2019 • Gokul Swamy, Jens Schulz, Rohan Choudhury, Dylan Hadfield-Menell, Anca Dragan
Fundamental to robotics is the debate between model-based and model-free learning: should the robot build an explicit model of the world, or learn a policy directly?
no code implementations • 21 Dec 2018 • Ravi Pandya, Sandy H. Huang, Dylan Hadfield-Menell, Anca D. Dragan
People frequently face challenging decision-making problems in which outcomes are uncertain or unknown.
no code implementations • 3 Nov 2018 • Dylan Hadfield-Menell, McKane Andrus, Gillian K. Hadfield
It has become commonplace to assert that autonomous agents will have to be built to follow human rules of behavior--social norms and laws.
no code implementations • ICLR 2019 • Marc Khoury, Dylan Hadfield-Menell
Adversarial examples are a pervasive phenomenon in machine learning: seemingly imperceptible perturbations to the input cause otherwise statistically accurate models to misclassify.
no code implementations • 9 Sep 2018 • Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell
We propose structuring this process as a series of queries asking the user to compare between different reward functions.
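One illustrative way to choose such comparison queries is to pick the pair of reward functions whose answer most reduces uncertainty over a discrete set of hypotheses. The sketch below assumes a user answer model and a posterior over hypotheses; it is a toy illustration, not the paper's algorithm.

```python
import itertools
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def choose_query(reward_hypotheses, posterior, answer_model):
    """Pick the pair of reward functions whose comparison most reduces
    expected posterior entropy (an assumed selection rule, for illustration).

    reward_hypotheses:     list of candidate reward functions (opaque objects).
    posterior:             current probability of each hypothesis being true.
    answer_model(i, j, k): probability the user prefers hypothesis i over j
                           when hypothesis k is actually true (an assumed model).
    """
    best, best_gain = None, -1.0
    h_prior = entropy(posterior)
    n = len(reward_hypotheses)
    for i, j in itertools.combinations(range(n), 2):
        expected_h = 0.0
        for prefers_i in (True, False):
            # Probability of this answer, and the posterior it would induce.
            likelihoods = [answer_model(i, j, k) if prefers_i else 1 - answer_model(i, j, k)
                           for k in range(n)]
            p_answer = sum(l * p for l, p in zip(likelihoods, posterior))
            if p_answer == 0:
                continue
            new_post = [l * p / p_answer for l, p in zip(likelihoods, posterior)]
            expected_h += p_answer * entropy(new_post)
        gain = h_prior - expected_h
        if gain > best_gain:
            best, best_gain = (i, j), gain
    return best
```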
no code implementations • ICML 2018 • Dhruv Malik, Malayandi Palaniappan, Jaime F. Fisac, Dylan Hadfield-Menell, Stuart Russell, Anca D. Dragan
We apply this update to a variety of POMDP solvers and find that it enables us to scale CIRL to non-trivial problems, with larger reward parameter spaces, and larger action spaces for both robot and human.
no code implementations • 7 Jun 2018 • Ellis Ratner, Dylan Hadfield-Menell, Anca D. Dragan
Designing a good reward function is essential to robot planning and reinforcement learning, but it can also be challenging and frustrating.
no code implementations • 12 Apr 2018 • Dylan Hadfield-Menell, Gillian Hadfield
We suggest that the analysis of incomplete contracting developed by law and economics researchers can provide a useful framework for understanding the AI alignment problem and help to generate a systematic approach to finding solutions.
no code implementations • NeurIPS 2017 • Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, Anca Dragan
When designing the reward, we might think of some specific training scenarios, and make sure that the reward will lead to the right behavior in those scenarios.
no code implementations • 20 Jul 2017 • Jaime F. Fisac, Monica A. Gates, Jessica B. Hamrick, Chang Liu, Dylan Hadfield-Menell, Malayandi Palaniappan, Dhruv Malik, S. Shankar Sastry, Thomas L. Griffiths, Anca D. Dragan
In robotics, value alignment is key to the design of collaborative robots that can integrate into human workflows, successfully inferring and adapting to their users' objectives as they go.
1 code implementation • 28 May 2017 • Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, Stuart Russell
We show that when a human is not perfectly rational, a robot that tries to infer and act according to the human's underlying preferences can always perform better than a robot that simply follows the human's literal orders.
no code implementations • 24 Nov 2016 • Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell
We analyze a simple game between a human H and a robot R, where H can press R's off switch but R can disable the off switch.
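A toy numeric sketch of the incentive this game captures, under the common simplifying assumptions that R is uncertain about the utility U of its proposed action and that a rational H allows the action only when U is nonnegative (illustrative only, not the paper's full analysis):

```python
import random

random.seed(0)

def expected_values(utility_samples):
    """Compare the robot's three options given samples from its belief over U:
    act now (E[U]), switch itself off (0), or defer to a human who allows the
    action only when U >= 0 (E[max(U, 0)])."""
    n = len(utility_samples)
    act_now = sum(utility_samples) / n
    defer = sum(max(u, 0.0) for u in utility_samples) / n
    switch_off = 0.0
    return act_now, switch_off, defer

# The more uncertain the robot is about U's sign, the larger the advantage of
# deferring (keeping the off switch usable) over acting unilaterally.
beliefs = {
    "confident, positive U": [random.gauss(1.0, 0.1) for _ in range(10000)],
    "uncertain about sign":  [random.gauss(0.1, 1.0) for _ in range(10000)],
}
for name, samples in beliefs.items():
    act, off, defer = expected_values(samples)
    print(f"{name}: act={act:.2f}  switch_off={off:.2f}  defer={defer:.2f}")
```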
2 code implementations • NeurIPS 2016 • Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell
For an autonomous system to be helpful to humans and to pose no unwarranted risks, it needs to align its values with those of the humans in its environment in such a way that its actions contribute to the maximization of value for the humans.