no code implementations • 12 Jun 2023 • Andrew Critch, Stuart Russell
While several recent works have identified societal-scale and extinction-level risks to humanity arising from artificial intelligence, few have attempted an exhaustive taxonomy of such risks.
no code implementations • 15 Jul 2022 • Andrew Critch
Deepfakes can degrade the fabric of society by limiting our ability to trust video content from leaders, authorities, and even friends.
1 code implementation • 7 Jul 2022 • Scott Emmons, Caspar Oesterheld, Andrew Critch, Vincent Conitzer, Stuart Russell
In this work, we show that any locally optimal symmetric strategy profile is also a (global) Nash equilibrium.
no code implementations • 12 Nov 2021 • Lawrence Chan, Andrew Critch, Anca Dragan
More importantly, we show that an irrational human, when correctly modelled, can communicate more information about the reward than a perfectly rational human can.
3 code implementations • 13 Oct 2021 • Shlomi Hod, Daniel Filan, Stephen Casper, Andrew Critch, Stuart Russell
These results suggest that graph-based partitioning can reveal local specialization and that statistical methods can be used to automatically screen for sets of neurons that can be understood abstractly.
no code implementations • 29 Sep 2021 • Shlomi Hod, Stephen Casper, Daniel Filan, Cody Wild, Andrew Critch, Stuart Russell
These results suggest that graph-based partitioning can reveal modularity and help us understand how deep neural networks function.
2 code implementations • 4 Mar 2021 • Daniel Filan, Stephen Casper, Shlomi Hod, Cody Wild, Andrew Critch, Stuart Russell
We also exhibit novel methods to promote clusterability in neural network training, and find that in multi-layer perceptrons they lead to more clusterable networks with little reduction in accuracy.
no code implementations • 25 Jan 2021 • Charlotte Roman, Michael Dennis, Andrew Critch, Stuart Russell
Recent work on promoting cooperation in multi-agent learning has resulted in many methods which successfully promote cooperation at the cost of becoming more vulnerable to exploitation by malicious actors.
no code implementations • 1 Jan 2021 • Lawrence Chan, Andrew Critch, Anca Dragan
Surprisingly, we find that if we give the learner access to the correct model of the demonstrator's irrationality, these irrationalities can actually help reward inference.
no code implementations • ICLR 2021 • Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt
We show how to assess a language model’s knowledge of basic concepts of morality.
no code implementations • 1 Jan 2021 • Shlomi Hod, Stephen Casper, Daniel Filan, Cody Wild, Andrew Critch, Stuart Russell
We apply these methods on partitionings generated by a spectral clustering algorithm which uses a graph representation of the network's neurons and weights.
no code implementations • 29 Dec 2020 • Arnaud Fickinger, Simon Zhuang, Andrew Critch, Dylan Hadfield-Menell, Stuart Russell
We introduce the concept of a multi-principal assistance game (MPAG), and circumvent an obstacle in social choice theory, Gibbard's theorem, by using a sufficiently collegial preference inference mechanism.
6 code implementations • NeurIPS 2020 • Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, Sergey Levine
We call our technique Protagonist Antagonist Induced Regret Environment Design (PAIRED).
1 code implementation • NeurIPS 2020 • Sam Toyer, Rohin Shah, Andrew Critch, Stuart Russell
This rewards precise reproduction of demonstrations in one particular environment, but provides little information about how robustly an algorithm can generalise the demonstrator's intent to substantially different deployment settings.
2 code implementations • 5 Aug 2020 • Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt
We show how to assess a language model's knowledge of basic concepts of morality.
Ranked #1 on Average on hendrycks2020ethics
no code implementations • 30 May 2020 • Andrew Critch, David Krueger
Framed in positive terms, this report examines how technical AI research might be steered in a manner that is more attentive to humanity's long-term prospects for survival as a species.
1 code implementation • 10 Mar 2020 • Daniel Filan, Shlomi Hod, Cody Wild, Andrew Critch, Stuart Russell
To discern structure in these weights, we introduce a measurable notion of modularity for multi-layer perceptrons (MLPs), and investigate the modular structure of MLPs trained on datasets of small images.
1 code implementation • NeurIPS 2021 • Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli
Some researchers speculate that intelligent reinforcement learning (RL) agents would be incentivized to seek resources and power in pursuit of their objectives.
no code implementations • NeurIPS 2018 • Nishant Desai, Andrew Critch, Stuart J. Russell
To gain insight into the dynamics of this new framework, we implement a simple NRL agent and empirically examine its behavior in a simple environment.
no code implementations • 31 Oct 2017 • Andrew Critch, Stuart Russell
It is often argued that an agent making decisions on behalf of two or more principals who have different utility functions should adopt a Pareto-optimal policy, i.e., a policy that cannot be improved upon for one agent without making sacrifices for another.
no code implementations • 5 Jan 2017 • Andrew Critch
Observation (2) represents a substantial divergence from naïve linear utility aggregation (as in Harsanyi's utilitarian theorem, and existing MORL algorithms), which is shown here to be inadequate for Pareto optimal sequential decision-making on behalf of players with different beliefs.
no code implementations • 12 Sep 2016 • Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, Nate Soares, Jessica Taylor
For instance, if the language is Peano arithmetic, it assigns probabilities to all arithmetical statements, including claims about the twin prime conjecture, the outputs of long-running computations, and its own probabilities.