Search Results for author: Lawrence Chan

Found 12 papers, 3 papers with code

Evaluating Language-Model Agents on Realistic Autonomous Tasks

no code implementations • 18 Dec 2023 • Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, Lawrence Chan, Luke Harold Miles, Tao R. Lin, Hjalmar Wijk, Joel Burget, Aaron Ho, Elizabeth Barnes, Paul Christiano

We find that these language model agents can only complete the easiest tasks from this list, although they make some progress on the more challenging tasks.

Language Modelling

A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations

1 code implementation • 6 Feb 2023 • Bilal Chughtai, Lawrence Chan, Neel Nanda

Universality is a key hypothesis in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks.

Progress measures for grokking via mechanistic interpretability

no code implementations • 12 Jan 2023 • Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt

Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup.

Memorization

Language models are better than humans at next-token prediction

no code implementations • 21 Dec 2022 • Buck Shlegeris, Fabien Roger, Lawrence Chan, Euan McLean

Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code.

Question Answering

The Alignment Problem from a Deep Learning Perspective

no code implementations • 30 Aug 2022 • Richard Ngo, Lawrence Chan, Sören Mindermann

In coming years or decades, artificial general intelligence (AGI) may surpass human capabilities at many critical tasks.

Adversarial Training for High-Stakes Reliability

no code implementations • 3 May 2022 • Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas

We found that adversarial training increased robustness to the adversarial attacks that we trained on -- doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes) -- without affecting in-distribution performance.

Text Generation • Vocal Bursts Intensity Prediction
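
The general shape of adversarial training can be pictured with a toy loop. The sketch below is illustrative only: it assumes a numpy logistic-regression setup with FGSM-style gradient perturbations, whereas the paper's setting is a text classifier attacked by human contractors, not gradient attacks.

```python
import numpy as np

# Generic adversarial-training sketch (illustrative; NOT the paper's setup,
# which uses human adversaries against a text classifier).
# Logistic regression trained on inputs perturbed in the worst-case
# direction within an L-inf ball of radius eps.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = (X @ w_true > 0).astype(float)

w = np.zeros(5)
eps, lr = 0.1, 0.1
for _ in range(300):
    # FGSM-style attack: step each input against its true-class margin
    grad_x = w * (2 * y - 1)[:, None]      # d(margin)/dx, per example
    X_adv = X - eps * np.sign(grad_x)      # reduce the true-class margin
    p = 1 / (1 + np.exp(-(X_adv @ w)))
    w -= lr * X_adv.T @ (p - y) / len(y)   # logistic-loss gradient step

p_clean = 1 / (1 + np.exp(-(X @ w)))
print("clean accuracy:", ((p_clean > 0.5) == y).mean())
```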

Human irrationality: both bad and good for reward inference

no code implementations • 12 Nov 2021 • Lawrence Chan, Andrew Critch, Anca Dragan

More importantly, we show that an irrational human, when correctly modelled, can communicate more information about the reward than a perfectly rational human can.
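
The flavor of this result can be illustrated with a toy Boltzmann-rational choice model, where the demonstrator's rationality is a temperature-like parameter beta. The sketch below is not the paper's code; the two reward hypotheses and the beta values are hypothetical.

```python
import numpy as np

# Illustrative sketch: Bayesian reward inference from a Boltzmann-rational
# demonstrator. The demonstrator picks action a with probability
# proportional to exp(beta * R(a)); the learner inverts this choice model
# to get a posterior over candidate reward functions.

rewards = {"R1": np.array([1.0, 0.0]), "R2": np.array([0.0, 1.0])}  # hypotheses

def choice_likelihood(action, reward_values, beta):
    """P(action | reward, beta) under a Boltzmann (softmax) choice model."""
    logits = beta * reward_values
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[action]

def posterior(action, beta):
    """Posterior over reward hypotheses after observing one action."""
    prior = {name: 0.5 for name in rewards}
    likes = {n: choice_likelihood(action, v, beta) for n, v in rewards.items()}
    z = sum(prior[n] * likes[n] for n in rewards)
    return {n: prior[n] * likes[n] / z for n in rewards}

# Modelling the demonstrator with the wrong beta distorts the posterior;
# modelling the right beta extracts more reward information per action.
print(posterior(action=0, beta=5.0))   # near-rational demonstrator
print(posterior(action=0, beta=0.5))   # noisier demonstrator, weaker evidence
```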

Optimal Cost Design for Model Predictive Control

1 code implementation • 23 Apr 2021 • Avik Jain, Lawrence Chan, Daniel S. Brown, Anca D. Dragan

We test our approach in an autonomous driving domain where we find costs different from the ground truth that implicitly compensate for replanning, short horizon, incorrect dynamics models, and local minima issues.

Autonomous Driving • Model Predictive Control
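
A minimal receding-horizon MPC loop makes the setup concrete. This sketch is illustrative, not the paper's implementation: the 1-D dynamics, the surrogate cost, and the weight `w_vel` are hypothetical stand-ins for the tunable cost the paper searches over.

```python
import numpy as np

# Minimal receding-horizon MPC sketch: a 1-D point mass tries to reach the
# origin. The planner optimizes a *surrogate* cost over a short horizon;
# tuning the surrogate's weights (here the hypothetical `w_vel`) can
# compensate for the short horizon relative to the true objective.

def rollout_cost(x, v, actions, w_vel):
    cost = 0.0
    for a in actions:
        v = v + 0.1 * a                  # simple double-integrator dynamics
        x = x + 0.1 * v
        cost += x**2 + w_vel * v**2      # surrogate running cost
    return cost

def mpc_step(x, v, w_vel, horizon=5, n_candidates=200):
    """Pick the first action of the best short-horizon action sequence."""
    candidates = [np.random.uniform(-1, 1, horizon) for _ in range(n_candidates)]
    best = min(candidates, key=lambda seq: rollout_cost(x, v, seq, w_vel))
    return best[0]

x, v = 1.0, 0.0
for _ in range(50):                      # replan every step (receding horizon)
    a = mpc_step(x, v, w_vel=0.5)
    v += 0.1 * a
    x += 0.1 * v
print(f"final state: x={x:.3f}, v={v:.3f}")
```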

The impacts of known and unknown demonstrator irrationality on reward inference

no code implementations • 1 Jan 2021 • Lawrence Chan, Andrew Critch, Anca Dragan

Surprisingly, we find that if we give the learner access to the correct model of the demonstrator's irrationality, these irrationalities can actually help reward inference.

Benefits of Assistance over Reward Learning

no code implementations • 1 Jan 2021 • Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael D Dennis, Pieter Abbeel, Anca Dragan, Stuart Russell

By merging reward learning and control, assistive agents can reason about the impact of control actions on reward learning, leading to several advantages over agents based on reward learning.

Accounting for Human Learning when Inferring Human Preferences

no code implementations • 11 Nov 2020 • Harry Giles, Lawrence Chan

Inverse reinforcement learning (IRL) is a common technique for inferring human preferences from data.

The Assistive Multi-Armed Bandit

1 code implementation • 24 Jan 2019 • Lawrence Chan, Dylan Hadfield-Menell, Siddhartha Srinivasa, Anca Dragan

Learning preferences implicit in the choices humans make is a well studied problem in both economics and computer science.

Multi-Armed Bandits
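
A stripped-down version of this preference-learning loop: a Beta-Bernoulli bandit that infers which arm a human prefers from noisy binary approval. This is an illustrative sketch, not the paper's algorithm; the approval probabilities are made up.

```python
import numpy as np

# Illustrative sketch: a bandit learner infers which arm a human prefers
# from noisy binary approval signals, then increasingly pulls the inferred
# favorite. Beta-Bernoulli posterior with Thompson sampling; the arm
# approval probabilities below are hypothetical.

rng = np.random.default_rng(0)
true_approval = np.array([0.3, 0.7, 0.5])   # P(human approves | arm)
alpha = np.ones(3)                          # Beta posterior parameters
beta = np.ones(3)

for t in range(500):
    samples = rng.beta(alpha, beta)         # Thompson sampling
    arm = int(np.argmax(samples))
    approved = rng.random() < true_approval[arm]   # noisy human feedback
    alpha[arm] += approved
    beta[arm] += 1 - approved

print("posterior means:", alpha / (alpha + beta))  # arm 1 should dominate
```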
