no code implementations • 15 Dec 2024 • Joar Skalse, Alessandro Abate
However, work so far has only considered agents that discount future reward exponentially: this is a serious limitation, especially given that extensive work in the behavioural sciences suggests that humans are better modelled as discounting hyperbolically.
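For reference, the two schemes contrasted above are typically written as $\gamma^t$ (exponential) and $1/(1+kt)$ (hyperbolic) for a reward received after a delay of $t$ steps. A minimal sketch comparing the two weights is below; the parameter values $\gamma = 0.95$ and $k = 0.5$ are illustrative choices, not taken from the paper.

```python
# Compare exponential and hyperbolic discounting of a unit reward
# received t steps in the future. Parameter values are illustrative.

def exponential_discount(t, gamma=0.95):
    """Standard RL discounting: weight gamma**t on a reward at delay t."""
    return gamma ** t

def hyperbolic_discount(t, k=0.5):
    """Hyperbolic discounting: weight 1 / (1 + k*t) on a reward at delay t."""
    return 1.0 / (1.0 + k * t)

if __name__ == "__main__":
    for t in [0, 1, 2, 5, 10, 50, 100, 200]:
        print(f"t={t:3d}  exp={exponential_discount(t):.4f}  "
              f"hyp={hyperbolic_discount(t):.4f}")
    # Hyperbolic discounting falls off faster at short delays but has a
    # much heavier tail: gamma**t decays geometrically while 1/(1+k*t)
    # decays only polynomially, so for long enough delays (here roughly
    # t > 70) the hyperbolic weight dominates. This heavier tail is one
    # source of the time-inconsistent preferences discussed in the
    # behavioural literature.
```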
no code implementations • 24 Nov 2024 • Joar Skalse, Alessandro Abate
We also provide necessary and sufficient conditions that describe precisely how the observed demonstrator policy may differ from each of the standard behavioural models before that model leads to faulty inferences about the reward function $R$.
no code implementations • 22 Jun 2024 • Lukas Fluri, Leon Lang, Alessandro Abate, Patrick Forré, David Krueger, Joar Skalse
We say that such a reward model has an error-regret mismatch.
no code implementations • 10 May 2024 • David "davidad" Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, Alessandro Abate, Joe Halpern, Clark Barrett, Ding Zhao, Tan Zhi-Xuan, Jeannette Wing, Joshua Tenenbaum
Ensuring that AI systems reliably and robustly avoid harmful or dangerous behaviours is a crucial challenge, especially for AI systems with a high degree of autonomy and general intelligence, or systems used in safety-critical contexts.
no code implementations • 11 Mar 2024 • Joar Skalse, Alessandro Abate
In addition, we characterise the conditions under which a behavioural model is robust to small perturbations of the observed policy, and we analyse how robust many behavioural models are to misspecification of their parameter values (such as the discount rate).
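The paper's formal robustness results are not reproduced here. As a toy illustration of what misspecifying the discount rate can do, the sketch below (my own construction, with illustrative parameter values) solves a small deterministic MDP by value iteration under two discount rates and shows that the greedy choice at the start state flips; this is the kind of discrepancy that an inference method assuming the wrong discount rate could mistakenly attribute to the reward function.

```python
# Toy illustration (not from the paper): in a small deterministic MDP,
# the greedy choice at the start state changes when the discount rate
# changes from 0.90 to 0.95.

# States: 0 is the start, 1..N form a corridor to a delayed reward,
# and state N+1 is absorbing.
N = 10
TERMINAL = N + 1

def step(state, action):
    """Deterministic transition: returns (next_state, reward)."""
    if state == 0:
        if action == "near":              # small reward now, then done
            return TERMINAL, 1.0
        return 1, 0.0                     # "far": enter the corridor
    if state < N:
        return state + 1, 0.0             # walk down the corridor
    if state == N:
        return TERMINAL, 2.0              # large delayed reward
    return TERMINAL, 0.0                  # absorbing

def value_iteration(gamma, sweeps=500):
    V = {s: 0.0 for s in range(TERMINAL + 1)}
    for _ in range(sweeps):
        for s in range(TERMINAL):
            actions = ["near", "far"] if s == 0 else ["go"]
            V[s] = max(r + gamma * V[s2]
                       for s2, r in (step(s, a) for a in actions))
    return V

for gamma in (0.90, 0.95):
    V = value_iteration(gamma)
    q = {}
    for a in ("near", "far"):
        s2, r = step(0, a)
        q[a] = r + gamma * V[s2]
    best = max(q, key=q.get)
    print(f"gamma={gamma}: Q(near)={q['near']:.3f}, "
          f"Q(far)={q['far']:.3f} -> {best}")
```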
no code implementations • 26 Jan 2024 • Joar Skalse, Alessandro Abate
Moreover, we find that scalar, Markovian rewards are unable to express most of the instances in each of these three classes.
no code implementations • 18 Oct 2023 • Rohan Subramani, Marcus Williams, Max Heitmann, Halfdan Holm, Charlie Griffin, Joar Skalse
However, it is well known that certain tasks cannot be expressed as an objective in the Markov rewards formalism, which motivates the study of alternative objective-specification formalisms in RL such as Linear Temporal Logic and Multi-Objective Reinforcement Learning (a small example of such a history-dependent objective is sketched below).
Multi-Objective Reinforcement Learning
reinforcement-learning
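As a concrete illustration (my own, not taken from the paper), the sketch below checks an LTL-style objective of the form "eventually visit A, and after that eventually visit B" over a whole trajectory. Satisfaction depends on the order of visits, i.e. on the history rather than on individual states in isolation, which is exactly the kind of property that motivates formalisms beyond a state-based Markov reward.

```python
# Illustrative example: an LTL-style objective "eventually visit A, and
# after that eventually visit B". Whether a trajectory satisfies it
# depends on the order of visits (the history), not on any single state.

def satisfies_a_then_b(trajectory, a="A", b="B"):
    """Return True iff the trajectory visits `a` and later visits `b`."""
    seen_a = False
    for state in trajectory:
        if state == a:
            seen_a = True
        elif state == b and seen_a:
            return True
    return False

print(satisfies_a_then_b(["s0", "A", "s1", "B"]))   # True
print(satisfies_a_then_b(["s0", "B", "s1", "A"]))   # False: wrong order
```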
no code implementations • 13 Oct 2023 • Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, Joar Skalse
First, we propose a way to quantify the magnitude of this effect and show empirically that optimising an imperfect proxy reward often leads to the behaviour predicted by Goodhart's law for a wide range of environments and reward functions.
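The paper's quantification is not reproduced here. As a toy illustration of the qualitative phenomenon (my own construction), the sketch below has an agent split a unit of effort between a measured activity and an unmeasured one. The proxy only sees the first activity, so pushing the proxy to its maximum overshoots the allocation that maximises the true reward.

```python
# Toy illustration (not the paper's setup): the proxy only measures one
# of two activities the true reward cares about. Maximising the proxy
# drives all effort into the measured activity, while the true reward
# peaks at an interior allocation and then falls, which is the
# qualitative pattern associated with Goodhart's law.

import math

def true_reward(x):
    # Diminishing returns on both activities; x is effort on activity 1.
    return math.sqrt(x) + math.sqrt(1.0 - x)

def proxy_reward(x):
    # The proxy only measures the first activity.
    return math.sqrt(x)

# Sweep allocations from no effort to all effort on the measured activity.
for x in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"x={x:.2f}  proxy={proxy_reward(x):.3f}  true={true_reward(x):.3f}")
# The proxy is maximised at x=1.0, but the true reward is maximised at
# x=0.5 and is strictly lower at x=1.0 than at x=0.5.
```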
no code implementations • 26 Sep 2023 • Joar Skalse, Lucy Farnik, Sumeet Ramesh Motwani, Erik Jenner, Adam Gleave, Alessandro Abate
This means that reward learning algorithms generally must be evaluated empirically, which is expensive, and that their failure modes are difficult to anticipate in advance.
1 code implementation • 28 Dec 2022 • Joar Skalse, Lewis Hammond, Charlie Griffin, Alessandro Abate
In this work we introduce reinforcement learning techniques for solving lexicographic multi-objective problems (a sketch of lexicographic action selection follows below).
Multi-Objective Reinforcement Learning
reinforcement-learning
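A minimal sketch of the general idea behind lexicographic action selection, under my own simplified reading: keep only the actions that are within a tolerance of optimal on the highest-priority objective, then break the remaining ties with the next objective, and so on down the priority order. The tabular Q-values and tolerance values are illustrative, not the paper's algorithm.

```python
# Lexicographic action selection over per-objective Q-values (sketch).
# Actions must be within a tolerance of optimal on objective 0 before
# objective 1 is consulted, and so on.

def lexicographic_action(q_values, actions, tolerances):
    """q_values[i][a]: Q-value of action a under objective i (highest
    priority first).  tolerances[i]: slack allowed on objective i."""
    candidates = list(actions)
    for q, tol in zip(q_values, tolerances):
        best = max(q[a] for a in candidates)
        candidates = [a for a in candidates if q[a] >= best - tol]
        if len(candidates) == 1:
            break
    return candidates[0]

# Example: objective 0 (e.g. safety) dominates objective 1 (e.g. reward).
q_safety = {"left": 1.0, "right": 0.98, "stay": 0.2}
q_reward = {"left": 0.1, "right": 5.0, "stay": 9.0}
action = lexicographic_action([q_safety, q_reward],
                              ["left", "right", "stay"],
                              tolerances=[0.05, 0.0])
print(action)  # "right": nearly as safe as "left", and much higher reward
```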
no code implementations • 6 Dec 2022 • Joar Skalse, Alessandro Abate
In this paper, we provide a mathematical analysis of how robust different IRL models are to misspecification, and answer precisely how the demonstrator policy may differ from each of the standard models before that model leads to faulty inferences about the reward function $R$.
1 code implementation • 27 Sep 2022 • Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, David Krueger
We provide the first formal definition of reward hacking, a phenomenon where optimizing an imperfect proxy reward function, $\tilde{\mathcal{R}}$, leads to poor performance according to the true reward function, $\mathcal{R}$.
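The paper gives the formal definition; below is a rough sketch of the flavour of such a condition, under my own paraphrase: the proxy $\tilde{\mathcal{R}}$ is "hackable" relative to $\mathcal{R}$, over a given set of policies, if some pair of policies is ranked one way by the proxy and the opposite way by the true reward. The policy returns used below are made-up numbers for illustration.

```python
# Rough sketch (my own paraphrase, with made-up numbers): check whether
# a proxy reward ranks some pair of policies opposite to the true reward.

from itertools import combinations

def is_hackable(true_returns, proxy_returns):
    """true_returns / proxy_returns: dict mapping policy name to expected
    return under the true / proxy reward function."""
    for p1, p2 in combinations(true_returns, 2):
        if (proxy_returns[p1] > proxy_returns[p2]
                and true_returns[p1] < true_returns[p2]):
            return True
        if (proxy_returns[p2] > proxy_returns[p1]
                and true_returns[p2] < true_returns[p1]):
            return True
    return False

true_J  = {"pi_safe": 1.0, "pi_exploit": 0.2}
proxy_J = {"pi_safe": 0.8, "pi_exploit": 1.5}   # proxy rewards the exploit
print(is_hackable(true_J, proxy_J))  # True: optimising the proxy selects
                                     # pi_exploit, which is worse under R.
```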
no code implementations • 14 Mar 2022 • Joar Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, Adam Gleave
It is often very challenging to manually design reward functions for complex, real-world tasks.
no code implementations • NeurIPS 2021 • James Bell, Linda Linsefors, Caspar Oesterheld, Joar Skalse
This gives us a powerful tool for reasoning about the limit behaviour of agents -- for example, it lets us show that there are Newcomblike environments in which a reinforcement learning agent cannot converge to any optimal policy.
no code implementations • 1 Jan 2021 • Joar Skalse
In this paper I present an argument and a general schema for constructing a problem case for any given decision theory; this could be taken to show that it is impossible to formulate a decision theory that is never outperformed by any other decision theory.
no code implementations • 26 Jun 2020 • Chris Mingard, Guillermo Valle-Pérez, Joar Skalse, Ard A. Louis
Our main findings are that $P_{SGD}(f\mid S)$ correlates remarkably well with $P_B(f\mid S)$ and that $P_B(f\mid S)$ is strongly biased towards low-error and low complexity functions.
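Here $P_B(f\mid S)$ denotes the probability, under the prior induced by randomly sampled network parameters, of obtaining the function $f$ conditioned on fitting the training set $S$, and $P_{SGD}(f\mid S)$ is the corresponding probability that SGD training on $S$ converges to $f$. The sketch below is a naive Monte-Carlo estimator of $P_B(f\mid S)$ on a tiny Boolean problem; the architecture, Gaussian prior, and dataset are illustrative choices, not the paper's methodology.

```python
# Naive Monte-Carlo estimate of P_B(f | S): sample network parameters
# from a prior, keep the samples whose network fits the training set S,
# and record which function f each kept sample computes on the held-out
# inputs. The setup is illustrative only.

import itertools
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)

inputs = np.array(list(itertools.product([0.0, 1.0], repeat=3)))  # 8 points
train_idx, test_idx = [0, 1, 2, 3], [4, 5, 6, 7]
train_labels = np.array([0, 1, 1, 0])   # the training set S

def sample_network():
    """One hidden layer, 16 tanh units, weights drawn i.i.d. from N(0, 1)."""
    w1 = rng.normal(size=(3, 16))
    b1 = rng.normal(size=16)
    w2 = rng.normal(size=16)
    b2 = rng.normal()
    def net(x):
        h = np.tanh(x @ w1 + b1)
        return (h @ w2 + b2 > 0).astype(int)
    return net

counts = Counter()
kept = 0
for _ in range(100_000):
    net = sample_network()
    preds = net(inputs)
    if np.array_equal(preds[train_idx], train_labels):  # fits S exactly
        counts[tuple(preds[test_idx])] += 1             # the function f on test inputs
        kept += 1

for f, c in counts.most_common(5):
    print(f"f on test inputs = {f}:  P_B(f|S) ~= {c / kept:.3f}")
```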
no code implementations • 25 Sep 2019 • Chris Mingard, Joar Skalse, Guillermo Valle-Pérez, David Martínez-Rubio, Vladimir Mikulik, Ard A. Louis
Understanding the inductive bias of neural networks is critical to explaining their ability to generalise.
no code implementations • 5 Jun 2019 • Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant
We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer, a situation we refer to as mesa-optimization (a neologism we introduce in this paper).