no code implementations • 28 Sep 2023 • Stuart Armstrong, Alexandre Maranhão, Oliver Daniels-Koch, Patrick Leask, Rebecca Gorman
Goal misgeneralisation is a key challenge in AI alignment -- the task of getting powerful artificial intelligences to align their goals with human intentions and human morality.
no code implementations • 19 Jun 2023 • Matija Franklin, Rebecca Gorman, Hal Ashton, Stuart Armstrong
This article is a primer on concept extrapolation: the ability to take a concept, feature, or goal defined in one context and extrapolate it safely to a more general context.
no code implementations • 20 Mar 2022 • Matija Franklin, Hal Ashton, Rebecca Gorman, Stuart Armstrong
We operationalize preference using concepts from various disciplines, outline the importance of meta-preferences and preference-change preferences, and propose a preliminary framework for how preferences change.
no code implementations • 28 Feb 2022 • Rebecca Gorman, Stuart Armstrong
For an artificial intelligence (AI) to be aligned with human values (or human preferences), it must first learn those values.
no code implementations • 6 Oct 2020 • James D. Miller, Roman Yampolskiy, Olle Haggstrom, Stuart Armstrong
To reduce the danger of powerful super-intelligent AIs, we might make the first such AIs oracles that can only send and receive messages.
no code implementations • 28 Apr 2020 • Stuart Armstrong, Jan Leike, Laurent Orseau, Shane Legg
We formally introduce two desirable properties: the first is 'unriggability', which prevents the agent from steering the learning process in the direction of a reward function that is easier to optimise.
no code implementations • 11 Jan 2018 • Stuart Armstrong
Partially Observable Markov Decision Processes (POMDPs) are rich environments often used in machine learning.
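For reference, a POMDP is standardly specified as a tuple (standard textbook notation, not taken from this paper):

```latex
\[
\langle S, A, T, R, \Omega, O, \gamma \rangle
\]
% S: states; A: actions; \Omega: observations
% T(s' \mid s, a): transition probabilities
% R(s, a): reward function; \gamma \in [0,1): discount factor
% O(o \mid s', a): probability of observing o after action a leads to state s'
```

The agent never sees the state $s$ directly, only observations $o$, so it must act on a belief distribution over $S$.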
no code implementations • 18 Dec 2017 • Stuart Armstrong, Xavier O'Rourke
'Indifference' refers to a class of methods used to control reward-based agents.
no code implementations • NeurIPS 2018 • Stuart Armstrong, Sören Mindermann
Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior.
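In the standard formulation (generic IRL notation, not specific to this paper), given an expert policy $\pi_E$ the task is to recover a reward $\hat{R}$ under which the observed behaviour is (near-)optimal:

```latex
\[
\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, \hat{R}(s_t, a_t) \,\middle|\, \pi_E\right]
\;\ge\;
\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, \hat{R}(s_t, a_t) \,\middle|\, \pi\right]
\quad \text{for all policies } \pi .
\]
```

This problem is underdetermined -- many rewards (including the zero reward) satisfy the constraint -- which is the ambiguity the paper's "no free lunch" result builds on.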
no code implementations • 15 Nov 2017 • Stuart Armstrong, Xavier O'Rorke
It is possible that powerful and potentially dangerous artificial intelligence (AI) might be developed in the future.
no code implementations • 30 May 2017 • Stuart Armstrong, Benjamin Levinstein
This paper looks at an alternative approach: defining a general concept of `low impact'.
no code implementations • 28 Oct 2011 • Stuart Armstrong
This paper sets out to resolve how agents ought to act in the Sleeping Beauty problem and various related anthropic (self-locating belief) problems, not through the calculation of anthropic probabilities, but through finding the correct decision to make.