no code implementations • 19 Feb 2024 • Pedro Freire, ChengCheng Tan, Adam Gleave, Dan Hendrycks, Scott Emmons
Do language models implicitly learn a concept of human wellbeing?
1 code implementation • 21 Dec 2023 • Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan McLean, Adam Gleave
Language model attacks typically assume one of two extreme threat models: full white-box access to model weights, or black-box access limited to a text generation API.
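The two extremes above bracket a spectrum: deployed models often expose intermediate ("grey-box") access, such as a fine-tuning API. A toy Python sketch of this framing (the enum and capability names are hypothetical, not from the paper):

```python
from enum import Enum

class ThreatModel(Enum):
    WHITE_BOX = "white_box"   # full access to model weights and gradients
    GREY_BOX = "grey_box"     # e.g. a fine-tuning or function-calling API
    BLACK_BOX = "black_box"   # text-generation queries only

# Capabilities available to an attacker under each model (illustrative).
CAPABILITIES = {
    ThreatModel.WHITE_BOX: {"read_weights", "gradients", "fine_tune", "query"},
    ThreatModel.GREY_BOX: {"fine_tune", "query"},
    ThreatModel.BLACK_BOX: {"query"},
}

def allows(model: ThreatModel, capability: str) -> bool:
    """Return whether an attacker under `model` has the given capability."""
    return capability in CAPABILITIES[model]
```

The point of the intermediate row is that an attacker without weight access may still have strictly more power than the pure black-box threat model assumes.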
no code implementations • 26 Sep 2023 • Joar Skalse, Lucy Farnik, Sumeet Ramesh Motwani, Erik Jenner, Adam Gleave, Alessandro Abate
This means that reward learning algorithms generally must be evaluated empirically, which is expensive, and that their failure modes are difficult to anticipate in advance.
no code implementations • 9 Jan 2023 • Lev McKinney, Yawen Duan, David Krueger, Adam Gleave
Our work focuses on demonstrating and studying the causes of these relearning failures in the domain of preference-based reward learning.
2 code implementations • 22 Nov 2022 • Adam Gleave, Mohammad Taufeeque, Juan Rocamonde, Erik Jenner, Steven H. Wang, Sam Toyer, Maximilian Ernestus, Nora Belrose, Scott Emmons, Stuart Russell
The imitation library provides open-source implementations of imitation and reward learning algorithms in PyTorch.
2 code implementations • 1 Nov 2022 • Tony T. Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell
The core vulnerability uncovered by our attack persists even in KataGo agents adversarially trained to defend against it.
no code implementations • 20 Aug 2022 • Erik Jenner, Herke van Hoof, Adam Gleave
In reinforcement learning, different reward functions can be equivalent in terms of the optimal policies they induce.
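A classic instance of this equivalence is potential-based reward shaping (Ng et al., 1999), which provably preserves optimal policies. A minimal sketch on a toy chain MDP (the MDP and the potential function are illustrative):

```python
GAMMA = 0.9
N = 4  # states in a chain; action 0 = step left, action 1 = step right

def step(s, a):
    """Deterministic transition, clipped at the chain's ends."""
    return max(0, s - 1) if a == 0 else min(N - 1, s + 1)

def base_reward(s, a, s2):
    return 1.0 if s2 == N - 1 else 0.0  # +1 for reaching the right end

PHI = [0.0, 5.0, -3.0, 2.0]  # an arbitrary potential function over states

def shaped_reward(s, a, s2):
    # Potential-based shaping: R'(s,a,s') = R(s,a,s') + gamma*Phi(s') - Phi(s)
    return base_reward(s, a, s2) + GAMMA * PHI[s2] - PHI[s]

def greedy_policy(reward, iters=200):
    """Value iteration, then the greedy policy for the resulting values."""
    v = [0.0] * N
    for _ in range(iters):
        v = [max(reward(s, a, step(s, a)) + GAMMA * v[step(s, a)]
                 for a in (0, 1)) for s in range(N)]
    return [max((0, 1), key=lambda a: reward(s, a, step(s, a))
                + GAMMA * v[step(s, a)]) for s in range(N)]
```

Despite the shaped reward assigning very different per-step numbers, both reward functions induce the same greedy policy (always step right).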
1 code implementation • 10 Aug 2022 • Pavel Czempin, Adam Gleave
Self-play reinforcement learning has achieved state-of-the-art, and often superhuman, performance in a variety of zero-sum games.
1 code implementation • 25 Mar 2022 • Erik Jenner, Adam Gleave
In many real-world applications, the reward function is too complex to be manually specified.
no code implementations • 22 Mar 2022 • Adam Gleave, Sam Toyer
Inverse Reinforcement Learning (IRL) algorithms infer a reward function that explains demonstrations provided by an expert acting in the environment.
no code implementations • 14 Mar 2022 • Adam Gleave, Geoffrey Irving
However, to solve a particular problem (such as text summarization), it is typically necessary to fine-tune language models on a task-specific dataset.
no code implementations • 14 Mar 2022 • Joar Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, Adam Gleave
It is often very challenging to manually design reward functions for complex, real-world tasks.
1 code implementation • 10 Dec 2020 • Eric J. Michaud, Adam Gleave, Stuart Russell
However, current techniques for reward learning may fail to produce reward functions which accurately reflect user preferences.
2 code implementations • 2 Dec 2020 • Pedro Freire, Adam Gleave, Sam Toyer, Stuart Russell
We evaluate a range of common reward and imitation learning algorithms on our tasks.
1 code implementation • ICLR 2021 • Adam Gleave, Michael Dennis, Shane Legg, Stuart Russell, Jan Leike
However, evaluating a learned reward function by training a policy on it cannot distinguish between the learned reward failing to reflect user preferences and the policy optimization process failing to optimize the learned reward.
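One alternative this points toward is comparing reward functions directly, with no policy optimization in the loop. A simplified sketch: evaluate two hypothetical reward functions on sampled transition features and measure a Pearson-correlation distance between them (the paper's EPIC distance additionally canonicalizes rewards to strip away shaping; that step is omitted here):

```python
import math
import random

random.seed(0)

def reward_a(x):
    # Hypothetical "ground truth" reward over transition features.
    return 2.0 * x[0] - x[1]

def reward_b(x):
    # A positive affine transform of reward_a: same preference ordering.
    return 3.0 * reward_a(x) + 7.0

def pearson_distance(r1, r2):
    """sqrt((1 - rho)/2), a distance in [0, 1] from Pearson correlation rho."""
    n = len(r1)
    m1, m2 = sum(r1) / n, sum(r2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(r1, r2))
    var1 = sum((a - m1) ** 2 for a in r1)
    var2 = sum((b - m2) ** 2 for b in r2)
    rho = cov / math.sqrt(var1 * var2)
    return math.sqrt(max(0.0, (1.0 - rho) / 2.0))

xs = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]
d = pearson_distance([reward_a(x) for x in xs], [reward_b(x) for x in xs])
```

Because reward_b is a positive rescaling of reward_a, the distance is essentially zero: the two rewards agree on which behaviors are better, even though their raw values differ everywhere.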
2 code implementations • ICLR 2020 • Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell
Deep reinforcement learning (RL) policies are known to be vulnerable to adversarial perturbations to their observations, similar to adversarial examples for classifiers.
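A minimal illustration of such an observation-space perturbation, in the style of a fast-gradient-sign attack on a toy linear policy (the weights and epsilon are chosen purely for illustration):

```python
# Toy linear "policy": scores for 2 actions from a 4-dim observation.
W = [[1.0, -2.0, 0.5, 0.0],
     [-1.0, 1.0, 0.0, 2.0]]

def logits(obs):
    return [sum(w * o for w, o in zip(row, obs)) for row in W]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def sign(x):
    return (x > 0) - (x < 0)

def fgsm_observation_attack(obs, eps=0.3):
    """FGSM-style step: move the observation along the sign of the gradient
    of (runner-up logit - chosen logit), trying to flip the chosen action."""
    scores = logits(obs)
    best = argmax(scores)
    worst = min(range(len(scores)), key=scores.__getitem__)
    grad = [W[worst][i] - W[best][i] for i in range(len(obs))]
    return [o + eps * sign(g) for o, g in zip(obs, grad)]

obs = [1.0, 0.0, 0.0, 0.2]
adv = fgsm_observation_attack(obs)
```

A bounded, sign-only nudge of the observation is enough to change which action the policy selects, mirroring adversarial examples for classifiers.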
1 code implementation • 24 Oct 2018 • Aaron Tucker, Adam Gleave, Stuart Russell
Deep reinforcement learning achieves superhuman performance in a range of video game environments, but requires that a designer manually specify a reward function.
1 code implementation • 9 Sep 2018 • Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell
We propose structuring this process as a series of queries asking the user to compare between different reward functions.
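Such comparison queries can drive a Bayesian update over a hypothesis space of candidate reward functions. A toy sketch under a Boltzmann-rational user model (the scoring rule and hypothesis space here are illustrative, not the paper's):

```python
import math

# Toy hypothesis space over the user's true reward weights (2 features).
HYPOTHESES = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def p_prefers_a(true_w, option_a, option_b, beta=5.0):
    """Boltzmann-rational user: more likely to pick the candidate reward
    whose weights better align with the true reward (toy scoring rule)."""
    margin = dot(true_w, option_a) - dot(true_w, option_b)
    return 1.0 / (1.0 + math.exp(-beta * margin))

def update(prior, option_a, option_b, preferred_a):
    """Bayes update of the posterior over HYPOTHESES given one answer."""
    like = [p_prefers_a(w, option_a, option_b) for w in HYPOTHESES]
    if not preferred_a:
        like = [1.0 - p for p in like]
    post = [pr * li for pr, li in zip(prior, like)]
    z = sum(post)
    return [p / z for p in post]

prior = [1.0 / 3] * 3
# The user prefers candidate reward (1, 0) over (0, 1):
posterior = update(prior, (1.0, 0.0), (0.0, 1.0), preferred_a=True)
```

A single comparison shifts posterior mass toward hypotheses consistent with the answer; the indifferent hypothesis (0.7, 0.7) is left relatively untouched.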
1 code implementation • 22 May 2018 • Adam Gleave, Oliver Habryka
Multi-task Inverse Reinforcement Learning (IRL) is the problem of inferring multiple reward functions from expert demonstrations.