1 code implementation • 18 Mar 2025 • Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan
Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
no code implementations • 4 Dec 2024 • Chun Hei Yip, Rajashree Agrawal, Lawrence Chan, Jason Gross
In this work, we present the first case study in rigorously compressing nonlinear feature-maps, which are the leading asymptotic bottleneck to compressing small transformer models.
2 code implementations • 22 Nov 2024 • Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Josh Clymer, Jai Dhyani, Elena Ericheva, Katharyn Garcia, Brian Goodrich, Nikola Jurkovic, Megan Kinniment, Aron Lajko, Seraphina Nix, Lucas Sato, William Saunders, Maksym Taran, Ben West, Elizabeth Barnes
We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions.
no code implementations • 10 Aug 2024 • Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, Lawrence Chan
In this work, we present mathematical models of \emph{computation} in superposition, where superposition is actively helpful for efficiently accomplishing the task.
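The setting can be illustrated with a minimal sketch (my own toy construction, not the paper's models): when features are sparse, a layer can store many more features than it has dimensions by assigning each feature a nearly orthogonal random direction, recovering active features up to small interference noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model, n_active = 200, 100, 3  # more features than dimensions

# Random unit vectors are nearly orthogonal in high dimensions, so many
# sparse features can share a smaller space ("superposition").
E = rng.normal(size=(n_features, d_model))
E /= np.linalg.norm(E, axis=1, keepdims=True)

x = np.zeros(n_features)                     # sparse feature activations
active = rng.choice(n_features, n_active, replace=False)
x[active] = 1.0

h = x @ E          # superposed d_model-dimensional representation
readout = E @ h    # project back onto each feature direction

# Active features read out near 1; inactive ones see only small
# interference noise from the non-orthogonality.
print(readout[active])
```

The paper goes further than this storage picture, modelling how useful computation can be performed directly on such superposed representations.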
2 code implementations • 17 Jun 2024 • Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan
We propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance.
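The notion of a "proof" of model performance, as opposed to a sampled estimate, can be sketched in a deliberately tiny toy (my own illustration of the brute-force baseline such compact proofs improve on, not the paper's method): enumerate the entire input space and certify the accuracy exactly.

```python
import itertools

# Toy "model": computes the XOR of two bits via addition mod 2.
def model(a: int, b: int) -> int:
    return (a + b) % 2

def ground_truth(a: int, b: int) -> int:
    return a ^ b

# A brute-force "proof" of performance: check every input, yielding a
# certified accuracy bound rather than a statistical estimate.
def certify_accuracy() -> float:
    cases = list(itertools.product([0, 1], repeat=2))
    correct = sum(model(a, b) == ground_truth(a, b) for a, b in cases)
    return correct / len(cases)

print(certify_accuracy())  # → 1.0, certified over the whole input space
```

Brute-force enumeration is infeasible for real models; the point of the paper is that a mechanistic understanding of the weights can yield much more compact certificates.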
no code implementations • 18 Dec 2023 • Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, Lawrence Chan, Luke Harold Miles, Tao R. Lin, Hjalmar Wijk, Joel Burget, Aaron Ho, Elizabeth Barnes, Paul Christiano
We find that these language model agents can only complete the easiest tasks from this list, although they make some progress on the more challenging tasks.
1 code implementation • 6 Feb 2023 • Bilal Chughtai, Lawrence Chan, Neel Nanda
Universality is a key hypothesis in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks.
1 code implementation • 12 Jan 2023 • Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt
Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup.
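One intuition behind such progress measures can be sketched as follows (a stand-in metric of my own, not the paper's exact measure): the generalizing modular-arithmetic circuit embeds tokens as sinusoids at a few key frequencies, so the concentration of an embedding direction's Fourier spectrum distinguishes structured circuits from unstructured memorization.

```python
import numpy as np

# Sketch of a Fourier-based progress measure: how concentrated a vector's
# power spectrum is on its single strongest frequency.
def fourier_concentration(v: np.ndarray) -> float:
    power = np.abs(np.fft.rfft(v)) ** 2
    return power.max() / power.sum()

p = 113  # modulus used in modular-addition grokking setups

memorizing = np.random.default_rng(0).normal(size=p)      # unstructured
generalizing = np.cos(2 * np.pi * 17 * np.arange(p) / p)  # one key frequency

print(fourier_concentration(memorizing))    # low: power spread out
print(fourier_concentration(generalizing))  # near 1: concentrated
```

Tracked over checkpoints, a measure like this rises smoothly during circuit formation even while test accuracy still looks flat.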
1 code implementation • 21 Dec 2022 • Buck Shlegeris, Fabien Roger, Lawrence Chan, Euan McLean
Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code.
no code implementations • 30 Aug 2022 • Richard Ngo, Lawrence Chan, Sören Mindermann
In coming years or decades, artificial general intelligence (AGI) may surpass human capabilities at many critical tasks.
no code implementations • 3 May 2022 • Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas
We found that adversarial training increased robustness to the adversarial attacks that we trained on -- doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes) -- without affecting in-distribution performance.
no code implementations • 12 Nov 2021 • Lawrence Chan, Andrew Critch, Anca Dragan
More importantly, we show that an irrational human, when correctly modelled, can communicate more information about the reward than a perfectly rational human can.
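The intuition admits a small sketch (my own toy numbers, not the paper's experiments): a perfectly rational demonstrator reveals only the reward ordering, while a correctly modelled noisily rational (Boltzmann) demonstrator's choice frequencies also reveal reward magnitudes.

```python
import numpy as np

# Two reward hypotheses over three actions; both rank action 0 highest.
R = {"theta_A": np.array([2.0, 1.9, 0.0]),
     "theta_B": np.array([2.0, 0.1, 0.0])}

def rational(r: np.ndarray) -> np.ndarray:
    # Always picks the argmax action.
    p = np.zeros_like(r)
    p[np.argmax(r)] = 1.0
    return p

def boltzmann(r: np.ndarray, beta: float = 1.0) -> np.ndarray:
    # Noisily rational: chooses actions with softmax probabilities.
    e = np.exp(beta * r)
    return e / e.sum()

# A rational demonstrator acts identically under both hypotheses, so its
# choices carry no information to distinguish them...
print(rational(R["theta_A"]), rational(R["theta_B"]))

# ...while the Boltzmann demonstrator's choice distributions differ,
# exposing reward magnitudes rather than just the ordering.
print(boltzmann(R["theta_A"]), boltzmann(R["theta_B"]))
```

Inference that assumes the wrong demonstrator model loses this extra information, which is why correctly modelling the irrationality matters.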
1 code implementation • 23 Apr 2021 • Avik Jain, Lawrence Chan, Daniel S. Brown, Anca D. Dragan
We test our approach in an autonomous driving domain where we find costs different from the ground truth that implicitly compensate for replanning, short horizon, incorrect dynamics models, and local minima issues.
no code implementations • 1 Jan 2021 • Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael D Dennis, Pieter Abbeel, Anca Dragan, Stuart Russell
By merging reward learning and control, assistive agents can reason about the impact of control actions on reward learning, leading to several advantages over agents that treat reward learning as a separate phase preceding control.
no code implementations • 1 Jan 2021 • Lawrence Chan, Andrew Critch, Anca Dragan
Surprisingly, we find that if we give the learner access to the correct model of the demonstrator's irrationality, these irrationalities can actually help reward inference.
no code implementations • 11 Nov 2020 • Harry Giles, Lawrence Chan
Inverse reinforcement learning (IRL) is a common technique for inferring human preferences from data.
1 code implementation • 24 Jan 2019 • Lawrence Chan, Dylan Hadfield-Menell, Siddhartha Srinivasa, Anca Dragan
Learning preferences implicit in the choices humans make is a well studied problem in both economics and computer science.