Search Results for author: Lawrence Chan

Found 17 papers, 8 papers with code

Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration

no code implementations • 4 Dec 2024 • Chun Hei Yip, Rajashree Agrawal, Lawrence Chan, Jason Gross

In this work, we present the first case study in rigorously compressing nonlinear feature-maps, which are the leading asymptotic bottleneck to compressing small transformer models.

Numerical Integration

Mathematical Models of Computation in Superposition

no code implementations • 10 Aug 2024 • Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, Lawrence Chan

In this work, we present mathematical models of computation in superposition, where superposition is actively helpful for efficiently accomplishing the task.
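
A minimal sketch of the underlying phenomenon (my illustration, not the paper's construction): random near-orthogonal directions let a d-dimensional space carry far more than d features, and a sparse set of active features can still be read off with dot products.

```python
# Superposition sketch: pack k >> d features into d dimensions.
import numpy as np

rng = np.random.default_rng(0)
d, k = 256, 2000                        # 2000 features in a 256-dim space
E = rng.normal(size=(k, d))
E /= np.linalg.norm(E, axis=1, keepdims=True)   # unit-norm feature directions

# Pairwise interference stays small even though k >> d.
G = E @ E.T
np.fill_diagonal(G, 0.0)
print("max |cos| between distinct features:", np.abs(G).max())

# A sparse set of active features is recoverable from the superposed sum.
active = rng.choice(k, size=3, replace=False)
x = E[active].sum(axis=0)               # superposed representation
scores = E @ x
print("true active:", sorted(active))
print("top-3 readout:", sorted(np.argsort(scores)[-3:].tolist()))
```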

Compact Proofs of Model Performance via Mechanistic Interpretability

2 code implementations • 17 Jun 2024 • Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan

We propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance.

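For intuition, a minimal sketch (a toy stand-in, not the paper's transformer models) of the brute-force baseline that compact proofs aim to beat: on a finite input domain, an accuracy bound can always be certified by enumerating every input; the paper's contribution is using mechanistic understanding to certify comparable bounds with far less computation.

```python
# Brute-force certification of a toy "model" over its whole input domain.
import itertools
import numpy as np

p = 64
rng = np.random.default_rng(0)
# Stand-in model: a lookup table for max(a, b) with ~1% corrupted entries.
table = np.maximum.outer(np.arange(p), np.arange(p))
noise = rng.random((p, p)) < 0.01
model = np.where(noise, (table + 1) % p, table)

# Exhaustive check: cost scales with |domain|, here p*p forward passes.
correct = sum(model[a, b] == max(a, b)
              for a, b in itertools.product(range(p), repeat=2))
print(f"certified accuracy over all {p*p} inputs: {correct / p**2:.4f}")
```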

A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

1 code implementation • 6 Feb 2023 • Bilal Chughtai, Lawrence Chan, Neel Nanda

Universality is a key hypothesis in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks.
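
A minimal sketch of the task setup described in the paper, as I understand it: the full multiplication table of a small group (here S3 under composition) serves as the supervised dataset on which different networks are trained and compared.

```python
# Build the group-composition dataset for S3.
from itertools import permutations

elems = list(permutations(range(3)))            # the 6 elements of S3
index = {g: i for i, g in enumerate(elems)}

def compose(g, h):                              # (g∘h)(i) = g(h(i))
    return tuple(g[h[i]] for i in range(3))

# Each example: (index of g, index of h) -> index of g∘h.
dataset = [((index[g], index[h]), index[compose(g, h)])
           for g in elems for h in elems]
print(len(dataset), "examples, e.g.", dataset[7])
```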

Progress measures for grokking via mechanistic interpretability

1 code implementation • 12 Jan 2023 • Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt

Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup.

Memorization
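
A minimal sketch of the kind of run on which such progress measures are defined (a simplified MLP setup, not the paper's one-layer transformer; actual grokking may require different hyperparameters or far more steps): modular addition with a held-out split, tracking train vs. test accuracy.

```python
# Modular addition with a train/test split, the setting where grokking appears.
import torch
import torch.nn as nn

p = 53
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
n_train = int(0.3 * len(pairs))
tr, te = perm[:n_train], perm[n_train:]

emb = nn.Embedding(p, 64)
net = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(list(emb.parameters()) + list(net.parameters()),
                        lr=1e-3, weight_decay=1.0)   # weight decay matters here

def logits(idx):
    x = emb(pairs[idx]).reshape(len(idx), -1)   # concatenate both embeddings
    return net(x)

for step in range(5000):
    loss = nn.functional.cross_entropy(logits(tr), labels[tr])
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            acc = lambda i: (logits(i).argmax(-1) == labels[i]).float().mean().item()
            print(step, f"train acc {acc(tr):.2f}", f"test acc {acc(te):.2f}")
```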

Language models are better than humans at next-token prediction

1 code implementation • 21 Dec 2022 • Buck Shlegeris, Fabien Roger, Lawrence Chan, Euan McLean

Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code.

Question Answering
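
A minimal sketch (my illustration, assuming the HuggingFace transformers library and the GPT-2 checkpoint) of the kind of metric involved: a model's top-1 next-token accuracy on a text, which the paper compares against humans performing the same prediction task.

```python
# Measure top-1 next-token accuracy of GPT-2 on a short passage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = ("Current language models are considered to have sub-human "
        "capabilities at natural language tasks like question-answering.")
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits
preds = logits[0, :-1].argmax(-1)       # prediction for each next position
targets = ids[0, 1:]                    # the actual next tokens
print("top-1 next-token accuracy:", (preds == targets).float().mean().item())
```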

The Alignment Problem from a Deep Learning Perspective

no code implementations • 30 Aug 2022 • Richard Ngo, Lawrence Chan, Sören Mindermann

In coming years or decades, artificial general intelligence (AGI) may surpass human capabilities at many critical tasks.

Deep Learning

Adversarial Training for High-Stakes Reliability

no code implementations • 3 May 2022 • Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas

We found that adversarial training increased robustness to the adversarial attacks that we trained on -- doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes) -- without affecting in-distribution performance.

Text Generation

Human irrationality: both bad and good for reward inference

no code implementations • 12 Nov 2021 • Lawrence Chan, Andrew Critch, Anca Dragan

More importantly, we show that an irrational human, when correctly modelled, can communicate more information about the reward than a perfectly rational human can.
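
A toy illustration of that claim (my example, not the paper's experiments): two reward hypotheses agree on which of two options is best, so a perfectly rational demonstrator's choices can never distinguish them, while the choice frequencies of a correctly modelled Boltzmann-noisy demonstrator do.

```python
# Bayesian reward inference from a Boltzmann-noisy demonstrator.
import numpy as np

hyps = {"small gap": np.array([1.0, 0.5]),     # option 0 barely better
        "large gap": np.array([1.0, -2.0])}    # option 0 much better

def p_choice(r, c, beta=1.0):                  # Boltzmann choice model
    e = np.exp(beta * r)
    return (e / e.sum())[c]

obs = [0] * 10    # the demonstrator picked option 0 ten times in a row

# A rational demonstrator picks option 0 under *both* hypotheses, so these
# observations carry no information about the gap. Under the noisy model,
# ten mistake-free choices are evidence that the gap is large.
post = np.array([np.prod([p_choice(r, c) for c in obs]) for r in hyps.values()])
post /= post.sum()
print(dict(zip(hyps, post.round(3))))
```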

Optimal Cost Design for Model Predictive Control

1 code implementation • 23 Apr 2021 • Avik Jain, Lawrence Chan, Daniel S. Brown, Anca D. Dragan

We test our approach in an autonomous driving domain where we find costs different from the ground truth that implicitly compensate for replanning, short horizon, incorrect dynamics models, and local minima issues.

Autonomous Driving
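
A minimal sketch of the receding-horizon planner whose cost is being tuned (a 1-D toy, not the paper's driving domain; the weights w_pos and w_act are the kind of parameters the paper optimizes instead of handing the planner the ground-truth cost):

```python
# Receding-horizon MPC on a 1-D point mass with a parameterized cost.
import itertools

ACTIONS = [-1.0, 0.0, 1.0]
H = 4                                    # short planning horizon

def step(state, a, dt=0.1):
    pos, vel = state
    return (pos + vel * dt, vel + a * dt)

def rollout_cost(state, plan, w_pos, w_act):
    c = 0.0
    for a in plan:
        state = step(state, a)
        c += w_pos * state[0] ** 2 + w_act * a ** 2
    return c

def mpc_action(state, w_pos, w_act):
    plans = itertools.product(ACTIONS, repeat=H)
    best = min(plans, key=lambda p: rollout_cost(state, p, w_pos, w_act))
    return best[0]                       # execute first action, then replan

state = (1.0, 0.0)                       # start at pos 1, at rest
for t in range(50):
    state = step(state, mpc_action(state, w_pos=1.0, w_act=0.1))
print("final state:", state)
```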

Benefits of Assistance over Reward Learning

no code implementations • 1 Jan 2021 • Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael D Dennis, Pieter Abbeel, Anca Dragan, Stuart Russell

By merging reward learning and control, assistive agents can reason about the impact of control actions on reward learning, leading to several advantages over agents based on reward learning.

The impacts of known and unknown demonstrator irrationality on reward inference

no code implementations • 1 Jan 2021 • Lawrence Chan, Andrew Critch, Anca Dragan

Surprisingly, we find that if we give the learner access to the correct model of the demonstrator's irrationality, these irrationalities can actually help reward inference.

Accounting for Human Learning when Inferring Human Preferences

no code implementations • 11 Nov 2020 • Harry Giles, Lawrence Chan

Inverse reinforcement learning (IRL) is a common technique for inferring human preferences from data.
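
A minimal sketch of the basic IRL pattern the snippet refers to (my toy example): recover hidden preference weights by maximizing the likelihood of observed human choices under a Boltzmann choice model.

```python
# Maximum-likelihood preference inference from demonstrated choices.
import numpy as np

rng = np.random.default_rng(0)
features = np.array([[1.0, 0.0],               # three options a human can pick
                     [0.0, 1.0],
                     [0.0, 0.0]])
true_w = np.array([2.0, -1.0])                 # hidden preference weights

def choice_probs(w):                           # Boltzmann / logit choice model
    e = np.exp(features @ w)
    return e / e.sum()

demos = rng.choice(3, size=200, p=choice_probs(true_w))

# Grid-search MLE over candidate weight vectors.
grid = np.linspace(-3.0, 3.0, 61)
def log_lik(w):
    return np.log(choice_probs(np.array(w))[demos]).sum()
w_hat = max(((a, b) for a in grid for b in grid), key=log_lik)
print("true w:", true_w, " MLE:", w_hat)
```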

The Assistive Multi-Armed Bandit

1 code implementation • 24 Jan 2019 • Lawrence Chan, Dylan Hadfield-Menell, Siddhartha Srinivasa, Anca Dragan

Learning preferences implicit in the choices humans make is a well studied problem in both economics and computer science.

Multi-Armed Bandits
