no code implementations • 18 Oct 2023 • Rohan Subramani, Marcus Williams, Max Heitmann, Halfdan Holm, Charlie Griffin, Joar Skalse
Although most RL algorithms require that the goal be formalised as a Markovian reward function, alternatives have been developed (such as Linear Temporal Logic and Multi-Objective Reinforcement Learning); a sketch of the underlying expressivity gap follows this entry.
Multi-Objective Reinforcement Learning
reinforcement-learning
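The sketch below (a hypothetical toy example of my own, not taken from the paper) illustrates one such expressivity gap: the objective "reach A before B" is not expressible as a Markovian reward over raw states, but becomes Markovian once the state is augmented with a single bit of memory, which is essentially what an LTL-style product construction provides.

    # Hypothetical toy example: "reach A before B" is non-Markovian over raw states,
    # but Markovian over states augmented with a memory bit recording whether A has
    # been visited yet.

    def augment(state, seen_a):
        """Augment the raw state with one bit of memory."""
        return (state, seen_a or state == "A")

    def markovian_reward(aug_state):
        state, seen_a = aug_state
        return 1.0 if state == "B" and seen_a else 0.0

    seen_a, total = False, 0.0
    for state in ["start", "A", "C", "B"]:   # a trajectory that visits A before B
        aug = augment(state, seen_a)
        seen_a = aug[1]
        total += markovian_reward(aug)
    print(total)  # 1.0, because A was visited before reaching B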
no code implementations • 13 Oct 2023 • Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, Joar Skalse
First, we propose a way to quantify the magnitude of this effect and show empirically that optimising an imperfect proxy reward often leads to the behaviour predicted by Goodhart's law for a wide range of environments and reward functions.
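A minimal numerical sketch of the kind of effect described above (my own simplification using correlated Gaussian returns, not the paper's MDP-based analysis): as selection on an imperfect proxy gets harsher, the gap between proxy return and true return widens.

    # Hypothetical stand-in for "optimisation pressure": select the top fraction of
    # candidates by proxy return and report their mean proxy and true returns.
    import numpy as np

    rng = np.random.default_rng(0)
    n, rho = 100_000, 0.9                 # rho: correlation between proxy and true return
    true = rng.standard_normal(n)
    proxy = rho * true + np.sqrt(1 - rho**2) * rng.standard_normal(n)

    for top_frac in (1.0, 0.1, 0.01, 0.001):
        selected = np.argsort(proxy)[-int(n * top_frac):]
        print(f"top {top_frac:>6}: proxy {proxy[selected].mean():.2f}, "
              f"true {true[selected].mean():.2f}")
    # The proxy-true gap grows with optimisation pressure, a simple Goodhart effect.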
no code implementations • 26 Sep 2023 • Joar Skalse, Lucy Farnik, Sumeet Ramesh Motwani, Erik Jenner, Adam Gleave, Alessandro Abate
This means that reward learning algorithms generally must be evaluated empirically, which is expensive, and that their failure modes are difficult to predict in advance.
1 code implementation • 28 Dec 2022 • Joar Skalse, Lewis Hammond, Charlie Griffin, Alessandro Abate
In this work we introduce reinforcement learning techniques for solving lexicographic multi-objective problems; a minimal sketch of the lexicographic selection idea follows this entry.
Multi-Objective Reinforcement Learning
reinforcement-learning
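The sketch below (my own illustration, not the algorithm from the paper) shows the lexicographic selection idea: maximise the highest-priority objective first, and use lower-priority objectives only to break near-ties.

    def lexicographic_argmax(q_primary, q_secondary, eps=1e-2):
        """Pick the action that is near-optimal for the primary objective and,
        among those, best for the secondary objective."""
        best = max(q_primary)
        admissible = [a for a, q in enumerate(q_primary) if q >= best - eps]
        return max(admissible, key=lambda a: q_secondary[a])

    # Actions 0 and 2 are near-optimal for the primary objective, so the
    # secondary objective decides between them.
    print(lexicographic_argmax([1.0, 0.2, 0.999], [0.1, 5.0, 2.0]))  # -> 2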
no code implementations • 6 Dec 2022 • Joar Skalse, Alessandro Abate
In this paper, we provide a mathematical analysis of how robust different IRL models are to misspecification, and answer precisely how the demonstrator policy may differ from each of the standard models before that model leads to faulty inferences about the reward function $R$.
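As one concrete example of a standard demonstrator model in this line of work, a Boltzmann-rational demonstrator is assumed to pick actions with probability proportional to $\exp(\beta \, Q(s,a))$; a minimal sketch (illustrative only, not the paper's analysis):

    import numpy as np

    def boltzmann_policy(q_values, beta=5.0):
        """pi(a | s) proportional to exp(beta * Q(s, a)); large beta approaches
        the optimal-demonstrator model, beta -> 0 a uniformly random one."""
        logits = beta * np.asarray(q_values, dtype=float)
        logits -= logits.max()               # numerical stability
        p = np.exp(logits)
        return p / p.sum()

    print(boltzmann_policy([1.0, 0.5, 0.0]))  # action probabilities for one state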
no code implementations • 27 Sep 2022 • Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, David Krueger
We provide the first formal definition of reward hacking, a phenomenon where optimizing an imperfect proxy reward function, $\tilde{\mathcal{R}}$, leads to poor performance according to the true reward function, $\mathcal{R}$.
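A rough sketch of the flavour of that definition (simplified; the paper's formal statement is over policies and their returns in an MDP): a proxy is being hacked relative to the true reward when some change of policy increases proxy return while decreasing true return.

    from itertools import combinations

    def exhibits_hacking(policy_returns):
        """policy_returns: list of (proxy_return, true_return) pairs, one per policy."""
        for (p1, t1), (p2, t2) in combinations(policy_returns, 2):
            if (p1 - p2) * (t1 - t2) < 0:    # the two reward functions disagree on this pair
                return True
        return False

    print(exhibits_hacking([(1.0, 1.0), (2.0, 0.2)]))  # True: proxy up, true down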
no code implementations • 14 Mar 2022 • Joar Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, Adam Gleave
It is often very challenging to manually design reward functions for complex, real-world tasks.
no code implementations • NeurIPS 2021 • James Bell, Linda Linsefors, Caspar Oesterheld, Joar Skalse
This gives us a powerful tool for reasoning about the limit behaviour of agents; for example, it lets us show that there are Newcomblike environments in which a reinforcement learning agent cannot converge to any optimal policy.
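A toy illustration of why convergence can fail in such environments (my own construction in the same spirit, not the paper's): when the environment contains an accurate predictor of the agent's current policy, naive best-responding can cycle forever.

    ACTIONS = ["heads", "tails"]

    def reward(action, predicted):
        # The agent is rewarded only for doing the opposite of what was predicted.
        return 1.0 if action != predicted else 0.0

    policy = "heads"
    for step in range(6):
        predicted = policy                                         # a perfect predictor
        policy = max(ACTIONS, key=lambda a: reward(a, predicted))  # greedy best response
        print(step, policy)                                        # oscillates: tails, heads, ...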
no code implementations • 1 Jan 2021 • Joar Skalse
In this paper I present an argument, together with a general schema for constructing a problem case for any given decision theory, which could be taken to show that one cannot formulate a decision theory that is never outperformed by any other decision theory.
no code implementations • 26 Jun 2020 • Chris Mingard, Guillermo Valle-Pérez, Joar Skalse, Ard A. Louis
Our main findings are that $P_{SGD}(f\mid S)$ correlates remarkably well with $P_B(f\mid S)$ and that $P_B(f\mid S)$ is strongly biased towards low-error and low-complexity functions.
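A rough sketch of how a quantity like $P_{SGD}(f\mid S)$ can be estimated in this spirit (a hypothetical miniature of my own, not the paper's setup): train the same small network many times from random initialisations and count how often each function, i.e. each pattern of predicted test labels, is found.

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(0)
    X_train = rng.standard_normal((20, 5))
    y_train = (X_train[:, 0] > 0).astype(float)[:, None]   # a simple target function
    X_test = rng.standard_normal((8, 5))

    def train_once(seed, steps=500, lr=0.1):
        r = np.random.default_rng(seed)
        W1 = 0.5 * r.standard_normal((5, 16))
        W2 = 0.5 * r.standard_normal((16, 1))
        for _ in range(steps):
            h = np.tanh(X_train @ W1)
            p = 1.0 / (1.0 + np.exp(-(h @ W2)))              # sigmoid output
            g = (p - y_train) / len(y_train)                 # grad of cross-entropy wrt logits
            W1 -= lr * X_train.T @ ((g @ W2.T) * (1.0 - h**2))
            W2 -= lr * h.T @ g
        h = np.tanh(X_test @ W1)
        return tuple((h @ W2 > 0).astype(int).ravel())       # the learned function on test inputs

    counts = Counter(train_once(seed) for seed in range(200))
    print(counts.most_common(3))                             # empirical estimate of P_SGD(f | S)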
no code implementations • 25 Sep 2019 • Chris Mingard, Joar Skalse, Guillermo Valle-Pérez, David Martínez-Rubio, Vladimir Mikulik, Ard A. Louis
Understanding the inductive bias of neural networks is critical to explaining their ability to generalise.
no code implementations • 5 Jun 2019 • Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant
We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer; we refer to this situation as mesa-optimization, a neologism we introduce in this paper.
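A schematic sketch of the distinction being drawn (entirely illustrative; the names and the internal objective below are hypothetical): the base optimiser tunes the model's parameters during training, whereas a mesa-optimiser is a learned model whose forward pass itself performs a search over outputs, guided by an internal ("mesa-") objective that need not match the base objective.

    def mesa_policy(observation, candidate_actions, mesa_objective):
        # The forward pass is itself an optimisation: search for the action that
        # scores best under the model's internal objective.
        return max(candidate_actions, key=lambda a: mesa_objective(observation, a))

    # Hypothetical internal objective; nothing guarantees it matches the objective
    # the training process selected the model for (the inner-alignment concern).
    mesa_objective = lambda obs, a: -(a - obs) ** 2
    print(mesa_policy(3.0, [0, 1, 2, 3, 4], mesa_objective))  # -> 3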