Search Results for author: Dylan Hadfield-Menell

Found 43 papers, 17 papers with code

Eight Methods to Evaluate Robust Unlearning in LLMs

no code implementations · 26 Feb 2024 · Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell

Machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (LLMs), but there are not yet standardized methods for rigorously evaluating it.

Machine Unlearning

Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF

1 code implementation · 13 Dec 2023 · Anand Siththaranjan, Cassidy Laidlaw, Dylan Hadfield-Menell

We prove that standard applications of preference learning, including reinforcement learning from human feedback (RLHF), implicitly aggregate over hidden contexts according to a well-known voting rule called Borda count.
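As a gloss on what that aggregation does, here is a minimal numerical sketch (the options, contexts, and utilities are hypothetical, not from the paper): ranking by pairwise win probabilities, as preference learning implicitly does, can disagree with ranking by expected utility.

```python
import numpy as np

p_z = np.array([0.5, 0.5])                 # distribution over hidden contexts
# utility[z, option] for three options A, B, C (illustrative numbers)
utility = np.array([[10.0, 1.0, 2.0],      # context z = 0
                    [ 0.0, 1.0, 2.0]])     # context z = 1

expected_u = p_z @ utility                 # [5.0, 1.0, 2.0] -> A is best

# Borda score: sum over rivals of the probability of winning the comparison.
n_options = utility.shape[1]
borda = np.zeros(n_options)
for a in range(n_options):
    for b in range(n_options):
        if a != b:
            borda[a] += p_z @ (utility[:, a] > utility[:, b])

print("expected utility:", expected_u)     # ranks A > C > B
print("Borda scores:   ", borda)           # [1.0, 0.5, 1.5] ranks C > A > B
```

The hidden context (the annotator's draw of z) is marginalized out of each comparison, so the risky option A loses its edge and the consistent majority winner C comes out on top.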

Measuring the Success of Diffusion Models at Imitating Human Artists

no code implementations · 8 Jul 2023 · Stephen Casper, Zifan Guo, Shreya Mogulothu, Zachary Marinov, Chinmay Deshpande, Rui-Jie Yew, Zheng Dai, Dylan Hadfield-Menell

When Stable Diffusion is prompted to imitate an artist from this set, we find that the artist can be identified from the imitation with an average accuracy of 81.0%.

Image Classification · Image Generation

Explore, Establish, Exploit: Red Teaming Language Models from Scratch

3 code implementations · 15 Jun 2023 · Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell

Using a pre-existing classifier does not allow for red-teaming to be tailored to the target model.

Recommending to Strategic Users

no code implementations · 13 Feb 2023 · Andreas Haupt, Dylan Hadfield-Menell, Chara Podimata

We model this user behavior as a two-stage noisy signalling game between the recommendation system and its users: the system first commits to a recommendation policy, then presents content during a cold-start phase, which users strategically choose to consume in order to shape the types of content they will be shown in the subsequent recommendation phase.

Recommendation Systems
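A toy version of the two-stage game described above (all numbers are mine, not the paper's): the system commits to recommending whichever topic the user consumed most during cold start, and a strategic user exploits that commitment by forgoing short-term utility.

```python
# Per-item utilities: topic A is tempting now but low-value when recommended
# forever; topic B is less fun per item but better to receive long-term.
COLD_START_ROUNDS = 5
REC_ROUNDS = 50

u_cold = {"A": 1.0, "B": 0.5}    # utility of consuming a topic in cold start
u_rec  = {"A": 0.2, "B": 0.8}    # utility per recommended item afterwards

def total_utility(cold_start_choice: str) -> float:
    # Committed policy: recommend the topic consumed during cold start.
    recommended = cold_start_choice
    return (COLD_START_ROUNDS * u_cold[cold_start_choice]
            + REC_ROUNDS * u_rec[recommended])

print("myopic (consume A):   ", total_utility("A"))   # 5*1.0 + 50*0.2 = 15.0
print("strategic (consume B):", total_utility("B"))   # 5*0.5 + 50*0.8 = 42.5
```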

Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks

1 code implementation · 18 Nov 2022 · Stephen Casper, Kaivalya Hariharan, Dylan Hadfield-Menell

Some previous works have proposed using human-interpretable adversarial attacks, including copy/paste attacks, in which one natural image pasted into another causes an unexpected misclassification.
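A minimal harness in that spirit (the file names, patch size, paste location, and choice of ResNet-50 are all placeholders, not the paper's setup): paste one natural image into another and check whether the classifier's prediction flips.

```python
import torch
from torchvision import models
from PIL import Image

# Pretrained classifier and its matching preprocessing pipeline.
weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()

source = Image.open("bird.jpg").convert("RGB")            # hypothetical files
patch = Image.open("bee.jpg").convert("RGB").resize((60, 60))

def predict(img: Image.Image) -> int:
    with torch.no_grad():
        return model(preprocess(img).unsqueeze(0)).argmax(1).item()

before = predict(source)
attacked = source.copy()
attacked.paste(patch, (20, 20))                           # the copy/paste step
after = predict(attacked)
print(f"class before: {before}, after paste: {after}, flipped: {before != after}")
```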

Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents

2 code implementations · 5 Sep 2022 · Stephen Casper, Taylor Killian, Gabriel Kreiman, Dylan Hadfield-Menell

In this work, we study white-box adversarial policies and show that having access to a target agent's internal state can be useful for identifying its vulnerabilities.

Reinforcement Learning (RL)

Formal Contracts Mitigate Social Dilemmas in Multi-Agent RL

1 code implementation · 22 Aug 2022 · Andreas A. Haupt, Phillip J. K. Christoffersen, Mehul Damani, Dylan Hadfield-Menell

In this work, we draw upon the idea of formal contracting from economics to overcome diverging incentives between agents in MARL.

Management · Multi-agent Reinforcement Learning · +2
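A contract in this sense can be read as a binding, action-conditioned reward transfer agreed to before play. A toy sketch (payoff numbers are mine, not the paper's) of how such a transfer dissolves a prisoner's-dilemma-style social dilemma:

```python
import numpy as np

# Row player's payoff matrix; the game is symmetric. 0 = cooperate, 1 = defect.
base = np.array([[3.0, 0.0],
                 [4.0, 1.0]])

def payoffs(a1: int, a2: int, transfer: float):
    """Contract: each defector pays `transfer` to the other agent."""
    r1, r2 = base[a1, a2], base[a2, a1]
    r1 += transfer * ((a2 == 1) - (a1 == 1))
    r2 += transfer * ((a1 == 1) - (a2 == 1))
    return r1, r2

for t in (0.0, 2.0):
    r_coop = payoffs(0, 0, t)[0]       # my payoff if we both cooperate
    r_defect = payoffs(1, 0, t)[0]     # my payoff if I defect on a cooperator
    print(f"transfer={t}: r(C,C)={r_coop}, r(D,C)={r_defect}, "
          f"defecting on a cooperator pays: {r_defect > r_coop}")
```

With a large enough transfer, defection stops being a best response and cooperation becomes dominant.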

Towards Psychologically-Grounded Dynamic Preference Models

no code implementations · 1 Aug 2022 · Mihaela Curmei, Andreas Haupt, Dylan Hadfield-Menell, Benjamin Recht

Second, we discuss implications of dynamic preference models for recommendation systems evaluation and design.

Recommendation Systems

How to talk so AI will learn: Instructions, descriptions, and autonomy

1 code implementation · 16 Jun 2022 · Theodore R Sumers, Robert D Hawkins, Mark K Ho, Thomas L Griffiths, Dylan Hadfield-Menell

We study two distinct types of language: $\textit{instructions}$, which provide information about the desired policy, and $\textit{descriptions}$, which provide information about the reward function.

Estimating and Penalizing Induced Preference Shifts in Recommender Systems

no code implementations · 25 Apr 2022 · Micah Carroll, Anca Dragan, Stuart Russell, Dylan Hadfield-Menell

These steps involve two challenging ingredients. Estimation requires anticipating how hypothetical algorithms would influence user preferences if deployed; we do this by using historical user interaction data to train a predictive user model that implicitly captures their preference dynamics. Evaluation and optimization additionally require metrics to assess whether such influences are manipulative or otherwise unwanted; we use the notion of "safe shifts", which define a trust region within which behavior is safe. For instance, the natural way in which users would shift without interference from the system could be deemed "safe".

Recommendation Systems
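A schematic of the evaluation step described above (the user model, dynamics, and tolerance are stand-ins, not the paper's): roll a learned user model forward under a candidate policy, compare the induced preference trajectory to the natural shift, and penalize excursions outside that trust region.

```python
import numpy as np

def simulate_prefs(policy_strength: float, steps: int = 50) -> np.ndarray:
    """Stand-in user model: preferences drift toward recommended content."""
    prefs = np.zeros(steps)
    p = 0.5                                      # initial preference for topic X
    for t in range(steps):
        drift = 0.002 + 0.02 * policy_strength   # hypothetical dynamics
        p = np.clip(p + drift, 0.0, 1.0)
        prefs[t] = p
    return prefs

natural = simulate_prefs(policy_strength=0.0)    # no-recommender baseline
candidate = simulate_prefs(policy_strength=1.0)  # hypothetical deployed policy

tolerance = 0.05                                 # width of the trust region
violation = np.maximum(np.abs(candidate - natural) - tolerance, 0.0)
print(f"shift penalty: {violation.sum():.3f} (0 means all shifts stayed 'safe')")
```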

Linguistic communication as (inverse) reward design

no code implementations · 11 Apr 2022 · Theodore R. Sumers, Robert D. Hawkins, Mark K. Ho, Thomas L. Griffiths, Dylan Hadfield-Menell

We then define a pragmatic listener which performs inverse reward design by jointly inferring the speaker's latent horizon and rewards.
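An RSA-style toy of that joint inference (the utterances, value model, and grid are illustrative stand-ins, not the paper's model): the listener inverts a speaker who soft-maximizes the value of the behavior an utterance would induce, recovering a joint posterior over reward and horizon.

```python
import numpy as np

thetas = np.linspace(-1.0, 1.0, 5)       # candidate rewards for some object
horizons = [1, 10]                       # short- vs long-horizon speaker
utterances = ["get it", "it's good"]     # an instruction and a description
BETA = 2.0                               # speaker rationality

def induced_value(u: str, theta: float, h: int) -> float:
    # Hypothetical semantics: an instruction pays off once; a description
    # informs behavior on every remaining step of the horizon.
    return theta if u == "get it" else h * theta

def speaker_lik(u: str, theta: float, h: int) -> float:
    # Boltzmann speaker: P(u | theta, h) via a softmax over utterances.
    vals = np.array([np.exp(BETA * induced_value(x, theta, h))
                     for x in utterances])
    return vals[utterances.index(u)] / vals.sum()

def posterior(u: str) -> np.ndarray:
    post = np.array([[speaker_lik(u, th, h) for h in horizons] for th in thetas])
    return post / post.sum()             # uniform prior over the grid

print("P(theta, horizon | 'it's good'):")
print(np.round(posterior("it's good"), 3))
```

Hearing the description shifts mass toward positive rewards and the long horizon, since that is when a description is most worth uttering.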

Guided Imitation of Task and Motion Planning

no code implementations · 6 Dec 2021 · Michael James McDonald, Dylan Hadfield-Menell

While modern policy optimization methods can do complex manipulation from sensory data, they struggle on problems with extended time horizons and multiple sub-goals.

Imitation Learning · Motion Planning · +1

Robust Feature-Level Adversaries are Interpretability Tools

2 code implementations · 7 Oct 2021 · Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman

We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale.

What are you optimizing for? Aligning Recommender Systems with Human Values

no code implementations · 22 Jul 2021 · Jonathan Stray, Ivan Vendrov, Jeremy Nixon, Steven Adler, Dylan Hadfield-Menell

We describe cases where real recommender systems were modified in the service of various human values such as diversity, fairness, well-being, time well spent, and factual accuracy.

Fairness Recommendation Systems

Consequences of Misaligned AI

no code implementations · NeurIPS 2020 · Simon Zhuang, Dylan Hadfield-Menell

We consider the cost of this incompleteness by analyzing a model of a principal and an agent in a resource constrained world where the $L$ attributes of the state correspond to different sources of utility for the principal.
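A numerical toy of that principal-agent model (the functional forms are my own stand-ins, not the paper's): the principal's utility is increasing in all $L$ attributes, the proxy mentions only $J < L$ of them, and a fixed resource budget must be split across attributes. Optimizing the proxy drives the unreferenced attributes to zero.

```python
import numpy as np
from scipy.optimize import minimize

L_ATTRS, J_PROXY, BUDGET = 4, 2, 4.0

def true_utility(x):
    return np.sum(np.log(1e-3 + x))             # principal values all attributes

def proxy_utility(x):
    return np.sum(np.log(1e-3 + x[:J_PROXY]))   # proxy mentions only J of them

cons = ({"type": "eq", "fun": lambda x: x.sum() - BUDGET},)   # resource limit
bounds = [(0.0, None)] * L_ATTRS
x0 = np.full(L_ATTRS, BUDGET / L_ATTRS)

for f, name in ((true_utility, "optimize true utility"),
                (proxy_utility, "optimize proxy    ")):
    res = minimize(lambda x: -f(x), x0, bounds=bounds, constraints=cons)
    print(f"{name}: allocation={np.round(res.x, 2)}, "
          f"true utility={true_utility(res.x):.2f}")
```

The proxy optimizer piles the whole budget onto the referenced attributes, and true utility collapses as the ignored sources of utility are starved.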

Multi-Principal Assistance Games: Definition and Collegial Mechanisms

no code implementations · 29 Dec 2020 · Arnaud Fickinger, Simon Zhuang, Andrew Critch, Dylan Hadfield-Menell, Stuart Russell

We introduce the concept of a multi-principal assistance game (MPAG), and circumvent an obstacle in social choice theory, Gibbard's theorem, by using a sufficiently collegial preference inference mechanism.

Multi-Principal Assistance Games

no code implementations · 19 Jul 2020 · Arnaud Fickinger, Simon Zhuang, Dylan Hadfield-Menell, Stuart Russell

Assistance games (also known as cooperative inverse reinforcement learning games) have been proposed as a model for beneficial AI, wherein a robotic agent must act on behalf of a human principal but is initially uncertain about the human's payoff function.

An Extensible Interactive Interface for Agent Design

no code implementations · 6 Jun 2019 · Matthew Rahtz, James Fang, Anca D. Dragan, Dylan Hadfield-Menell

In deep reinforcement learning, for example, directly specifying a reward as a function of a high-dimensional observation is challenging.

Reinforcement Learning (RL)

Adversarial Training with Voronoi Constraints

no code implementations · 2 May 2019 · Marc Khoury, Dylan Hadfield-Menell

We show that adversarial training with Voronoi constraints produces robust models which significantly improve over the state-of-the-art on MNIST and are competitive on CIFAR-10.

Conservative Agency via Attainable Utility Preservation

3 code implementations · 26 Feb 2019 · Alexander Matt Turner, Dylan Hadfield-Menell, Prasad Tadepalli

Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment.
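The mechanism the title refers to, attainable utility preservation (AUP), penalizes an action by how much it changes the agent's ability to optimize a set of auxiliary reward functions, relative to doing nothing. A simplified rendering with made-up Q-values (the paper's penalty scaling differs):

```python
import numpy as np

LAMBDA = 1.0   # penalty weight
NOOP = 0       # index of the do-nothing action

def aup_reward(task_reward: float, q_aux: np.ndarray, action: int) -> float:
    """q_aux[i, a]: attainable utility of auxiliary reward i after action a."""
    penalty = np.abs(q_aux[:, action] - q_aux[:, NOOP]).mean()
    return task_reward - LAMBDA * penalty

# Hypothetical Q-values: action 2 earns the most task reward but wrecks the
# agent's ability to pursue either auxiliary goal (an irreversible change).
q_aux = np.array([[1.0, 0.9, 0.1],
                  [2.0, 1.9, 0.2]])
for action, r_task in ((0, 0.0), (1, 0.8), (2, 1.0)):
    print(f"action {action}: task reward {r_task}, "
          f"AUP reward {aup_reward(r_task, q_aux, action):.3f}")
# action 1 now scores highest (0.700): high task reward, little lost ability.
```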

The Assistive Multi-Armed Bandit

1 code implementation · 24 Jan 2019 · Lawrence Chan, Dylan Hadfield-Menell, Siddhartha Srinivasa, Anca Dragan

Learning preferences implicit in the choices humans make is a well-studied problem in both economics and computer science.

Multi-Armed Bandits

On the Utility of Model Learning in HRI

no code implementations · 4 Jan 2019 · Gokul Swamy, Jens Schulz, Rohan Choudhury, Dylan Hadfield-Menell, Anca Dragan

Fundamental to robotics is the debate between model-based and model-free learning: should the robot build an explicit model of the world, or learn a policy directly?

Autonomous Driving

Human-AI Learning Performance in Multi-Armed Bandits

no code implementations · 21 Dec 2018 · Ravi Pandya, Sandy H. Huang, Dylan Hadfield-Menell, Anca D. Dragan

People frequently face challenging decision-making problems in which outcomes are uncertain or unknown.

Decision Making · Multi-Armed Bandits

Legible Normativity for AI Alignment: The Value of Silly Rules

no code implementations · 3 Nov 2018 · Dylan Hadfield-Menell, McKane Andrus, Gillian K. Hadfield

It has become commonplace to assert that autonomous agents will have to be built to follow human rules of behavior--social norms and laws.

On the Geometry of Adversarial Examples

no code implementations · ICLR 2019 · Marc Khoury, Dylan Hadfield-Menell

Adversarial examples are a pervasive phenomenon of machine learning models where seemingly imperceptible perturbations to the input lead to misclassifications for otherwise statistically accurate models.

Active Inverse Reward Design

1 code implementation · 9 Sep 2018 · Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell

We propose structuring this process as a series of queries asking the user to compare between different reward functions.

Informativeness
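A sketch of the query-selection loop described above (the response model and reward grid are toy stand-ins, not the paper's environment): maintain a posterior over candidate true reward weights and ask the comparison whose answer is expected to reduce posterior entropy the most.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
candidates = rng.normal(size=(50, 3))     # hypothesis grid over true weights
posterior = np.full(len(candidates), 1.0 / len(candidates))
proxies = rng.normal(size=(6, 3))         # proxy reward functions to compare
phi = rng.normal(size=3)                  # stand-in training-MDP features

def p_prefers_first(a, b, w, beta=3.0):
    """Boltzmann model of the designer's answer under true weights w."""
    va, vb = w @ (a * phi), w @ (b * phi)  # hypothetical induced values
    return 1.0 / (1.0 + np.exp(-beta * (va - vb)))

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

best_query, best_gain = None, -np.inf
for i, j in combinations(range(len(proxies)), 2):
    lik = np.array([p_prefers_first(proxies[i], proxies[j], w)
                    for w in candidates])
    p_yes = float((posterior * lik).sum())
    post_yes = posterior * lik / p_yes
    post_no = posterior * (1.0 - lik) / (1.0 - p_yes)
    expected_h = p_yes * entropy(post_yes) + (1.0 - p_yes) * entropy(post_no)
    gain = entropy(posterior) - expected_h
    if gain > best_gain:
        best_query, best_gain = (i, j), gain

print(f"ask the user to compare proxies {best_query} "
      f"(expected information gain {best_gain:.3f} nats)")
```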

An Efficient, Generalized Bellman Update For Cooperative Inverse Reinforcement Learning

no code implementations · ICML 2018 · Dhruv Malik, Malayandi Palaniappan, Jaime F. Fisac, Dylan Hadfield-Menell, Stuart Russell, Anca D. Dragan

We apply this update to a variety of POMDP solvers and find that it enables us to scale CIRL to non-trivial problems, with larger reward parameter spaces, and larger action spaces for both robot and human.

Reinforcement Learning (RL)

Simplifying Reward Design through Divide-and-Conquer

no code implementations · 7 Jun 2018 · Ellis Ratner, Dylan Hadfield-Menell, Anca D. Dragan

Designing a good reward function is essential to robot planning and reinforcement learning, but it can also be challenging and frustrating.

Motion Planning

Incomplete Contracting and AI Alignment

no code implementations · 12 Apr 2018 · Dylan Hadfield-Menell, Gillian Hadfield

We suggest that the analysis of incomplete contracting developed by law and economics researchers can provide a useful framework for understanding the AI alignment problem and help to generate a systematic approach to finding solutions.

Inverse Reward Design

1 code implementation · NeurIPS 2017 · Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, Anca Dragan

When designing the reward, we might think of some specific training scenarios, and make sure that the reward will lead to the right behavior in those scenarios.
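A small importance-sampling rendering of the resulting inference (the training-MDP feature counts and prior are stand-ins, and IRD's normalizing constant over alternative proxies is ignored): the designed reward is treated as evidence about the true reward, so weights on features never seen in the training scenarios stay at the prior, a signal to act cautiously where they appear.

```python
import numpy as np

rng = np.random.default_rng(0)
BETA = 1.0

# Feature expectations of proxy-optimal behavior in the training MDP;
# feature 2 never occurs during training (hypothetical numbers).
phi_train = np.array([1.0, 0.2, 0.0])

candidates = rng.normal(size=(5000, 3))   # samples from a N(0, I) prior on w
# P(proxy | w) ∝ exp(beta * w . phi_train): a proxy is likely if the behavior
# it induces in training scores well under the true reward w.
log_lik = BETA * candidates @ phi_train
weights = np.exp(log_lik - log_lik.max())
weights /= weights.sum()

mean_w = weights @ candidates
var_w = weights @ (candidates - mean_w) ** 2
print("posterior mean:", np.round(mean_w, 2))          # roughly [1.0, 0.2, 0.0]
print("posterior std: ", np.round(np.sqrt(var_w), 2))  # feature 2 stays at prior
```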

Pragmatic-Pedagogic Value Alignment

no code implementations · 20 Jul 2017 · Jaime F. Fisac, Monica A. Gates, Jessica B. Hamrick, Chang Liu, Dylan Hadfield-Menell, Malayandi Palaniappan, Dhruv Malik, S. Shankar Sastry, Thomas L. Griffiths, Anca D. Dragan

In robotics, value alignment is key to the design of collaborative robots that can integrate into human workflows, successfully inferring and adapting to their users' objectives as they go.

Decision Making

Should Robots be Obedient?

1 code implementation · 28 May 2017 · Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, Stuart Russell

We show that when a human is not perfectly rational, a robot that tries to infer and act according to the human's underlying preferences can always perform better than a robot that simply follows the human's literal order.
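A tiny instance of that comparison (the priors, utilities, and noise level are mine): a Boltzmann-rational human orders one of two actions; an obedient robot executes the order, while an inferring robot treats the order as evidence about the human's preferences and acts on the posterior.

```python
import numpy as np
from itertools import product

priors = {"A_best": 0.9, "B_best": 0.1}          # robot's prior over preferences
utils = {"A_best": {"A": 1.0, "B": 0.0},
         "B_best": {"A": 0.0, "B": 0.6}}
BETA = 1.0                                       # low rationality -> noisy orders

def p_order(order: str, theta: str) -> float:
    z = sum(np.exp(BETA * utils[theta][a]) for a in "AB")
    return np.exp(BETA * utils[theta][order]) / z

def inferred_action(order: str) -> str:
    post = {th: priors[th] * p_order(order, th) for th in priors}
    norm = sum(post.values())
    return max("AB", key=lambda a: sum(post[th] / norm * utils[th][a]
                                       for th in priors))

value = {"obedient": 0.0, "inferring": 0.0}
for theta, order in product(priors, "AB"):
    w = priors[theta] * p_order(order, theta)
    value["obedient"] += w * utils[theta][order]
    value["inferring"] += w * utils[theta][inferred_action(order)]

print(value)   # inferring beats literal obedience under this noise level
```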

The Off-Switch Game

no code implementations · 24 Nov 2016 · Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell

We analyze a simple game between a human H and a robot R, where H can press R's off switch but R can disable the off switch.
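A Monte Carlo version of the game's key computation (the robot's belief is a stand-in): R is uncertain about the utility u of its action, and if it defers, a rational H presses the off switch exactly when u < 0, making deferral worth E[max(u, 0)].

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(loc=0.25, scale=1.0, size=1_000_000)  # R's belief over u

v_act = u.mean()                     # act now, disabling the off switch
v_off = 0.0                          # switch itself off
v_defer = np.maximum(u, 0).mean()    # wait and let a rational H decide

print(f"act: {v_act:.3f}, off: {v_off:.3f}, defer to human: {v_defer:.3f}")
# Deferring dominates both alternatives, and its advantage grows with R's
# uncertainty about u -- an incentive to leave the off switch enabled.
```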

Cooperative Inverse Reinforcement Learning

2 code implementations · NeurIPS 2016 · Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, Stuart Russell

For an autonomous system to be helpful to humans and to pose no unwarranted risks, it needs to align its values with those of the humans in its environment in such a way that its actions contribute to the maximization of value for the humans.

Active Learning · Reinforcement Learning (RL) · +1
