1 code implementation • 3 Apr 2024 • Stephen Casper, Jieun Yun, Joonhyuk Baek, Yeseong Jung, Minhwan Kim, Kiwan Kwon, Saerom Park, Hayden Moore, David Shriver, Marissa Connor, Keltin Grimes, Angus Nicolson, Arush Tagade, Jessica Rumbelow, Hieu Minh Nguyen, Dylan Hadfield-Menell
Interpretability techniques are valuable for helping humans understand and oversee AI systems.
1 code implementation • 8 Mar 2024 • Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell
Despite extensive diagnostics and debugging by developers, AI systems sometimes exhibit harmful unintended behaviors.
no code implementations • 26 Feb 2024 • Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell
Machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (LLMs), but there are not yet standardized methods for rigorously evaluating it.
no code implementations • 25 Jan 2024 • Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, Dylan Hadfield-Menell
The effectiveness of an audit, however, depends on the degree of system access granted to auditors.
1 code implementation • 13 Dec 2023 • Anand Siththaranjan, Cassidy Laidlaw, Dylan Hadfield-Menell
We prove that standard applications of preference learning, including reinforcement learning from human feedback (RLHF), implicitly aggregate over hidden contexts according to a well-known voting rule called Borda count.
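The Borda count mentioned here can be illustrated with a minimal sketch (hypothetical rankings, not data from the paper): in each individual ranking, a candidate earns one point per candidate ranked below it, and the points are summed across rankings.

```python
from collections import defaultdict

def borda_scores(rankings):
    """Sum Borda points: a candidate gets one point for each
    candidate ranked below it in each individual ranking."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return dict(scores)

# Three annotators rank three responses (hypothetical data)
rankings = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
print(borda_scores(rankings))  # "a" gets the highest aggregate score
```

The paper's point is that preference-learning pipelines such as RLHF perform this kind of aggregation implicitly when annotators' judgments depend on hidden context.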
1 code implementation • 27 Nov 2023 • Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, Jacob Andreas
This has led some researchers to conclude that LMs "lie" or otherwise encode non-cooperative communicative intents.
no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.
no code implementations • 8 Jul 2023 • Stephen Casper, Zifan Guo, Shreya Mogulothu, Zachary Marinov, Chinmay Deshpande, Rui-Jie Yew, Zheng Dai, Dylan Hadfield-Menell
When Stable Diffusion is prompted to imitate an artist from this set, we find that the artist can be identified from the imitation with an average accuracy of 81.0%.
3 code implementations • 15 Jun 2023 • Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell
Using a pre-existing classifier does not allow for red-teaming to be tailored to the target model.
no code implementations • 13 Feb 2023 • Andreas Haupt, Dylan Hadfield-Menell, Chara Podimata
We model this user behavior as a two-stage noisy signalling game between the recommendation system and users: the system first commits to a recommendation policy, then presents content during a cold-start phase, which users strategically choose to consume in order to influence the content they are shown in the subsequent recommendation phase.
1 code implementation • 18 Nov 2022 • Stephen Casper, Kaivalya Hariharan, Dylan Hadfield-Menell
Some previous works have proposed using human-interpretable adversarial attacks including copy/paste attacks in which one natural image pasted into another causes an unexpected misclassification.
2 code implementations • 5 Sep 2022 • Stephen Casper, Taylor Killian, Gabriel Kreiman, Dylan Hadfield-Menell
In this work, we study white-box adversarial policies and show that having access to a target agent's internal state can be useful for identifying its vulnerabilities.
1 code implementation • 22 Aug 2022 • Andreas A. Haupt, Phillip J. K. Christoffersen, Mehul Damani, Dylan Hadfield-Menell
In this work, we draw upon the idea of formal contracting from economics to overcome diverging incentives between agents in MARL.
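A toy sketch of how a contract can realign incentives (the payoff matrix and transfer value are hypothetical, not the paper's MARL setup): in a prisoner's dilemma, a contract that obliges any defecting agent to pay a transfer to the other agent can make mutual cooperation an equilibrium.

```python
# Hypothetical prisoner's dilemma payoffs: (row, col) for actions C/D
payoffs = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def with_contract(payoffs, transfer):
    """Augment payoffs so that a defecting agent pays `transfer`
    to the other agent."""
    contracted = {}
    for (a1, a2), (r1, r2) in payoffs.items():
        if a1 == "D":
            r1, r2 = r1 - transfer, r2 + transfer
        if a2 == "D":
            r1, r2 = r1 + transfer, r2 - transfer
        contracted[(a1, a2)] = (r1, r2)
    return contracted

contracted = with_contract(payoffs, transfer=3)
# Unilateral defection no longer pays: the row player's payoff for
# (D, C) falls below the payoff for (C, C), so C is a best response to C.
```

Because the transfers are zero-sum between the agents, the contract changes individual incentives without changing total welfare at any outcome.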
no code implementations • 1 Aug 2022 • Mihaela Curmei, Andreas Haupt, Dylan Hadfield-Menell, Benjamin Recht
Second, we discuss implications of dynamic preference models for recommendation systems evaluation and design.
no code implementations • 27 Jul 2022 • Tilman Räuker, Anson Ho, Stephen Casper, Dylan Hadfield-Menell
The last decade of machine learning has seen drastic increases in scale and capabilities.
no code implementations • 20 Jul 2022 • Jonathan Stray, Alon Halevy, Parisa Assar, Dylan Hadfield-Menell, Craig Boutilier, Amar Ashar, Lex Beattie, Michael Ekstrand, Claire Leibowicz, Connie Moon Sehat, Sara Johansen, Lianne Kerlin, David Vickrey, Spandana Singh, Sanne Vrijenhoek, Amy Zhang, McKane Andrus, Natali Helberger, Polina Proutskova, Tanushree Mitra, Nina Vasan
We collect a set of values that seem most relevant to recommender systems operating across different domains, then examine them from the perspectives of current industry practice, measurement, product design, and policy approaches.
1 code implementation • 16 Jun 2022 • Theodore R Sumers, Robert D Hawkins, Mark K Ho, Thomas L Griffiths, Dylan Hadfield-Menell
We study two distinct types of language: $\textit{instructions}$, which provide information about the desired policy, and $\textit{descriptions}$, which provide information about the reward function.
no code implementations • 25 Apr 2022 • Micah Carroll, Anca Dragan, Stuart Russell, Dylan Hadfield-Menell
These steps involve two challenging ingredients. Estimation requires anticipating how hypothetical algorithms would influence user preferences if deployed; we do this by using historical user interaction data to train a predictive user model that implicitly captures their preference dynamics. Evaluation and optimization additionally require metrics to assess whether such influences are manipulative or otherwise unwanted; we use the notion of "safe shifts," which define a trust region within which behavior is safe. For instance, the natural way in which users' preferences would shift without interference from the system could be deemed safe.
no code implementations • 11 Apr 2022 • Theodore R. Sumers, Robert D. Hawkins, Mark K. Ho, Thomas L. Griffiths, Dylan Hadfield-Menell
We then define a pragmatic listener which performs inverse reward design by jointly inferring the speaker's latent horizon and rewards.
no code implementations • 6 Dec 2021 • Michael James McDonald, Dylan Hadfield-Menell
While modern policy optimization methods can do complex manipulation from sensory data, they struggle on problems with extended time horizons and multiple sub-goals.
2 code implementations • 7 Oct 2021 • Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman
We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale.
no code implementations • 22 Jul 2021 • Jonathan Stray, Ivan Vendrov, Jeremy Nixon, Steven Adler, Dylan Hadfield-Menell
We describe cases where real recommender systems were modified in the service of various human values such as diversity, fairness, well-being, time well spent, and factual accuracy.
no code implementations • NeurIPS 2020 • Simon Zhuang, Dylan Hadfield-Menell
We consider the cost of this incompleteness by analyzing a model of a principal and an agent in a resource constrained world where the $L$ attributes of the state correspond to different sources of utility for the principal.
no code implementations • 29 Dec 2020 • Arnaud Fickinger, Simon Zhuang, Andrew Critch, Dylan Hadfield-Menell, Stuart Russell
We introduce the concept of a multi-principal assistance game (MPAG), and circumvent an obstacle in social choice theory, Gibbard's theorem, by using a sufficiently collegial preference inference mechanism.
no code implementations • 19 Jul 2020 • Arnaud Fickinger, Simon Zhuang, Dylan Hadfield-Menell, Stuart Russell
Assistance games (also known as cooperative inverse reinforcement learning games) have been proposed as a model for beneficial AI, wherein a robotic agent must act on behalf of a human principal but is initially uncertain about the human's payoff function.
2 code implementations • 25 Jan 2020 • Raphael Köster, Dylan Hadfield-Menell, Gillian K. Hadfield, Joel Z. Leibo
How can societies learn to enforce and comply with social norms?
no code implementations • 6 Jun 2019 • Matthew Rahtz, James Fang, Anca D. Dragan, Dylan Hadfield-Menell
In deep reinforcement learning, for example, directly specifying a reward as a function of a high-dimensional observation is challenging.
no code implementations • 2 May 2019 • Marc Khoury, Dylan Hadfield-Menell
We show that adversarial training with Voronoi constraints produces robust models which significantly improve over the state-of-the-art on MNIST and are competitive on CIFAR-10.
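The Voronoi constraint can be sketched as a membership test (hypothetical 2-D points; the paper's training procedure searches for worst-case perturbations subject to this constraint): a perturbed example is admissible only while it stays inside the Voronoi cell of its clean source, i.e. while the source is still its nearest training point.

```python
from math import dist

def in_voronoi_cell(x_perturbed, x_source, train_points):
    """True if x_perturbed is still closer to its source point than to
    any other training point (i.e. inside x_source's Voronoi cell)."""
    d_source = dist(x_perturbed, x_source)
    d_others = min(
        dist(x_perturbed, p) for p in train_points if p != x_source
    )
    return d_source <= d_others

train = [(0.0, 0.0), (2.0, 0.0)]
x = train[0]
print(in_voronoi_cell((0.4, 0.0), x, train))  # small shift: still in the cell
print(in_voronoi_cell((1.5, 0.0), x, train))  # crossed into the neighbor's cell
```

Unlike a fixed epsilon-ball, the admissible region adapts to the local geometry of the data.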
3 code implementations • 26 Feb 2019 • Alexander Matt Turner, Dylan Hadfield-Menell, Prasad Tadepalli
Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment.
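One way to discourage irreversible change, in the spirit of attainable utility preservation, is to penalize an action by how much it shifts the agent's ability to attain auxiliary goals relative to doing nothing. The sketch below uses illustrative Q-values and a hypothetical penalty weight, not the paper's experiments.

```python
def penalized_reward(reward, q_aux_action, q_aux_noop, lam):
    """Primary reward minus a penalty proportional to how much the
    action changes attainable utility for each auxiliary goal,
    measured against a no-op baseline."""
    penalty = sum(abs(qa - qn) for qa, qn in zip(q_aux_action, q_aux_noop))
    return reward - lam * penalty / len(q_aux_action)

# Hypothetical Q-values for two auxiliary goals: an irreversible action
# sharply reduces attainable utility, so it is penalized heavily even
# though its primary reward is higher.
irreversible = penalized_reward(1.0, [0.0, 0.1], [0.9, 0.8], lam=1.0)
gentle = penalized_reward(0.8, [0.85, 0.75], [0.9, 0.8], lam=1.0)
```

Under this penalty the gentler action scores higher, steering the agent away from destroying its ability to pursue other objectives later.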
1 code implementation • 24 Jan 2019 • Lawrence Chan, Dylan Hadfield-Menell, Siddhartha Srinivasa, Anca Dragan
Learning preferences implicit in the choices humans make is a well-studied problem in both economics and computer science.
no code implementations • 4 Jan 2019 • Gokul Swamy, Jens Schulz, Rohan Choudhury, Dylan Hadfield-Menell, Anca Dragan
Fundamental to robotics is the debate between model-based and model-free learning: should the robot build an explicit model of the world, or learn a policy directly?
no code implementations • 21 Dec 2018 • Ravi Pandya, Sandy H. Huang, Dylan Hadfield-Menell, Anca D. Dragan
People frequently face challenging decision-making problems in which outcomes are uncertain or unknown.
no code implementations • 3 Nov 2018 • Dylan Hadfield-Menell, McKane Andrus, Gillian K. Hadfield
It has become commonplace to assert that autonomous agents will have to be built to follow human rules of behavior: social norms and laws.
no code implementations • ICLR 2019 • Marc Khoury, Dylan Hadfield-Menell
Adversarial examples are a pervasive phenomenon of machine learning models where seemingly imperceptible perturbations to the input lead to misclassifications for otherwise statistically accurate models.
1 code implementation • 9 Sep 2018 • Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell
We propose structuring this process as a series of queries asking the user to compare between different reward functions.
no code implementations • ICML 2018 • Dhruv Malik, Malayandi Palaniappan, Jaime F. Fisac, Dylan Hadfield-Menell, Stuart Russell, Anca D. Dragan
We apply this update to a variety of POMDP solvers and find that it enables us to scale CIRL to non-trivial problems, with larger reward parameter spaces, and larger action spaces for both robot and human.
no code implementations • 7 Jun 2018 • Ellis Ratner, Dylan Hadfield-Menell, Anca D. Dragan
Designing a good reward function is essential to robot planning and reinforcement learning, but it can also be challenging and frustrating.
no code implementations • 12 Apr 2018 • Dylan Hadfield-Menell, Gillian Hadfield
We suggest that the analysis of incomplete contracting developed by law and economics researchers can provide a useful framework for understanding the AI alignment problem and help to generate a systematic approach to finding solutions.
1 code implementation • NeurIPS 2017 • Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, Anca Dragan
When designing the reward, we might think of some specific training scenarios, and make sure that the reward will lead to the right behavior in those scenarios.
no code implementations • 20 Jul 2017 • Jaime F. Fisac, Monica A. Gates, Jessica B. Hamrick, Chang Liu, Dylan Hadfield-Menell, Malayandi Palaniappan, Dhruv Malik, S. Shankar Sastry, Thomas L. Griffiths, Anca D. Dragan
In robotics, value alignment is key to the design of collaborative robots that can integrate into human workflows, successfully inferring and adapting to their users' objectives as they go.
1 code implementation • 28 May 2017 • Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, Stuart Russell
We show that when a human is not perfectly rational then a robot that tries to infer and act according to the human's underlying preferences can always perform better than a robot that simply follows the human's literal order.
no code implementations • 24 Nov 2016 • Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell
We analyze a simple game between a human H and a robot R, where H can press R's off switch but R can disable the off switch.
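The value of leaving the off switch enabled can be sketched numerically (the belief distribution and utilities below are hypothetical, standing in for the paper's formal analysis): when R is uncertain about H's utility and H is rational, deferring to H weakly dominates both acting unilaterally and switching off.

```python
# R's belief over the utility of its proposed action to H (hypothetical)
outcomes = [(-1.0, 0.3), (0.5, 0.4), (2.0, 0.3)]  # (utility, probability)

act = sum(u * p for u, p in outcomes)        # act, bypassing the switch
switch_off = 0.0                             # disable itself: utility 0
# Defer: a rational H permits the action only when its utility >= 0,
# so negative-utility outcomes are replaced by the off-switch payoff.
defer = sum(max(u, 0.0) * p for u, p in outcomes)

assert defer >= max(act, switch_off)  # deferring weakly dominates
```

The gap between `defer` and `act` shrinks as R's uncertainty shrinks, matching the paper's observation that an agent confident in its objective has little incentive to let itself be switched off.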
2 code implementations • NeurIPS 2016 • Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell
For an autonomous system to be helpful to humans and to pose no unwarranted risks, it needs to align its values with those of the humans in its environment in such a way that its actions contribute to the maximization of value for the humans.