no code implementations • 18 Dec 2023 • Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, Lawrence Chan, Luke Harold Miles, Tao R. Lin, Hjalmar Wijk, Joel Burget, Aaron Ho, Elizabeth Barnes, Paul Christiano
We find that these language model agents can only complete the easiest tasks from this list, although they make some progress on the more challenging tasks.
1 code implementation • 6 Feb 2023 • Bilal Chughtai, Lawrence Chan, Neel Nanda
Universality is a key hypothesis in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks.
no code implementations • 12 Jan 2023 • Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt
Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup.
no code implementations • 21 Dec 2022 • Buck Shlegeris, Fabien Roger, Lawrence Chan, Euan McLean
Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code.
no code implementations • 30 Aug 2022 • Richard Ngo, Lawrence Chan, Sören Mindermann
In coming years or decades, artificial general intelligence (AGI) may surpass human capabilities at many critical tasks.
no code implementations • 3 May 2022 • Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas
We found that adversarial training increased robustness to the adversarial attacks that we trained on -- doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes) -- without affecting in-distribution performance.
no code implementations • 12 Nov 2021 • Lawrence Chan, Andrew Critch, Anca Dragan
More importantly, we show that an irrational human, when correctly modelled, can communicate more information about the reward than a perfectly rational human can.
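The intuition behind this claim can be illustrated with a toy example (the setup below is illustrative, not the paper's actual model): a perfectly rational demonstrator always picks the argmax action, so its behavior is identical for any two reward functions with the same ordering, whereas a noisily (Boltzmann) rational demonstrator's choice frequencies depend on reward magnitudes and therefore carry extra information.

```python
import math

def rational_choice(rewards):
    # A perfectly rational demonstrator deterministically picks the best action.
    return rewards.index(max(rewards))

def boltzmann_probs(rewards, beta=1.0):
    # A Boltzmann-rational demonstrator picks action a with probability
    # proportional to exp(beta * R(a)).
    weights = [math.exp(beta * r) for r in rewards]
    z = sum(weights)
    return [w / z for w in weights]

# Two reward hypotheses over two actions, with the same ordering but
# different gaps between the actions' rewards.
small_gap = [0.0, 0.1]
large_gap = [0.0, 5.0]

# The rational demonstrator behaves identically under both hypotheses,
# so its choices cannot distinguish them...
assert rational_choice(small_gap) == rational_choice(large_gap) == 1

# ...but the Boltzmann demonstrator's choice probabilities differ,
# so observing its choices reveals information about reward magnitudes.
p_small = boltzmann_probs(small_gap)[1]  # ≈ 0.52
p_large = boltzmann_probs(large_gap)[1]  # ≈ 0.99
```

Here the irrational (noisy) demonstrator's behavior discriminates between reward hypotheses that the rational demonstrator's behavior cannot, which is the sense in which modelled irrationality can communicate more about the reward.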
1 code implementation • 23 Apr 2021 • Avik Jain, Lawrence Chan, Daniel S. Brown, Anca D. Dragan
We test our approach in an autonomous driving domain where we find costs different from the ground truth that implicitly compensate for replanning, short horizon, incorrect dynamics models, and local minima issues.
no code implementations • 1 Jan 2021 • Lawrence Chan, Andrew Critch, Anca Dragan
Surprisingly, we find that if we give the learner access to the correct model of the demonstrator's irrationality, this irrationality can actually help reward inference.
no code implementations • 1 Jan 2021 • Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael D Dennis, Pieter Abbeel, Anca Dragan, Stuart Russell
By merging reward learning and control, assistive agents can reason about the impact of control actions on reward learning, leading to several advantages over agents that treat reward learning and control as separate phases.
no code implementations • 11 Nov 2020 • Harry Giles, Lawrence Chan
Inverse reinforcement learning (IRL) is a common technique for inferring human preferences from data.
1 code implementation • 24 Jan 2019 • Lawrence Chan, Dylan Hadfield-Menell, Siddhartha Srinivasa, Anca Dragan
Learning preferences implicit in the choices humans make is a well studied problem in both economics and computer science.