Search Results for author: Adam Gleave

Found 19 papers, 12 papers with code

Exploiting Novel GPT-4 APIs

1 code implementation 21 Dec 2023 Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan McLean, Adam Gleave

Language model attacks typically assume one of two extreme threat models: full white-box access to model weights, or black-box access limited to a text generation API.

Language Modelling, Retrieval, +1

STARC: A General Framework For Quantifying Differences Between Reward Functions

no code implementations 26 Sep 2023 Joar Skalse, Lucy Farnik, Sumeet Ramesh Motwani, Erik Jenner, Adam Gleave, Alessandro Abate

This means that reward learning algorithms generally must be evaluated empirically, which is expensive, and that their failure modes are difficult to anticipate in advance.

On The Fragility of Learned Reward Functions

no code implementations 9 Jan 2023 Lev McKinney, Yawen Duan, David Krueger, Adam Gleave

Our work focuses on demonstrating and studying the causes of these relearning failures in the domain of preference-based reward learning.

Continuous Control

Adversarial Policies Beat Superhuman Go AIs

2 code implementations 1 Nov 2022 Tony T. Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell

The core vulnerability uncovered by our attack persists even in KataGo agents adversarially trained to defend against our attack.

Calculus on MDPs: Potential Shaping as a Gradient

no code implementations 20 Aug 2022 Erik Jenner, Herke van Hoof, Adam Gleave

In reinforcement learning, different reward functions can be equivalent in terms of the optimal policies they induce.

Math

Reducing Exploitability with Population Based Training

1 code implementation 10 Aug 2022 Pavel Czempin, Adam Gleave

Self-play reinforcement learning has achieved state-of-the-art, and often superhuman, performance in a variety of zero-sum games.

Preprocessing Reward Functions for Interpretability

1 code implementation 25 Mar 2022 Erik Jenner, Adam Gleave

In many real-world applications, the reward function is too complex to be manually specified.

A Primer on Maximum Causal Entropy Inverse Reinforcement Learning

no code implementations 22 Mar 2022 Adam Gleave, Sam Toyer

Inverse Reinforcement Learning (IRL) algorithms infer a reward function that explains demonstrations provided by an expert acting in the environment.

reinforcement-learning, Reinforcement Learning (RL)

Uncertainty Estimation for Language Reward Models

no code implementations 14 Mar 2022 Adam Gleave, Geoffrey Irving

However, to solve a particular problem (such as text summarization) it is typically necessary to fine-tune them on a task-specific dataset.

Active Learning, Reinforcement Learning (RL), +1

Understanding Learned Reward Functions

1 code implementation 10 Dec 2020 Eric J. Michaud, Adam Gleave, Stuart Russell

However, current techniques for reward learning may fail to produce reward functions which accurately reflect user preferences.

DERAIL: Diagnostic Environments for Reward And Imitation Learning

2 code implementations 2 Dec 2020 Pedro Freire, Adam Gleave, Sam Toyer, Stuart Russell

We evaluate a range of common reward and imitation learning algorithms on our tasks.

Imitation Learning

Quantifying Differences in Reward Functions

1 code implementation ICLR 2021 Adam Gleave, Michael Dennis, Shane Legg, Stuart Russell, Jan Leike

However, this method cannot distinguish between the learned reward function failing to reflect user preferences and the policy optimization process failing to optimize the learned reward.

Adversarial Policies: Attacking Deep Reinforcement Learning

2 code implementations ICLR 2020 Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell

Deep reinforcement learning (RL) policies are known to be vulnerable to adversarial perturbations to their observations, similar to adversarial examples for classifiers.

reinforcement-learning, Reinforcement Learning (RL)

Inverse reinforcement learning for video games

1 code implementation 24 Oct 2018 Aaron Tucker, Adam Gleave, Stuart Russell

Deep reinforcement learning achieves superhuman performance in a range of video game environments, but requires that a designer manually specify a reward function.

Continuous Control, reinforcement-learning, +1

Active Inverse Reward Design

1 code implementation 9 Sep 2018 Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell

We propose structuring this process as a series of queries asking the user to compare between different reward functions.

Informativeness

Multi-task Maximum Entropy Inverse Reinforcement Learning

1 code implementation 22 May 2018 Adam Gleave, Oliver Habryka

Multi-task Inverse Reinforcement Learning (IRL) is the problem of inferring multiple reward functions from expert demonstrations.

Imitation Learning, Meta-Learning, +2
