Search Results for author: Thomas Kwa

Found 5 papers, 5 papers with code

HCAST: Human-Calibrated Autonomy Software Tasks

1 code implementation · 21 Mar 2025 · David Rein, Joel Becker, Amy Deng, Seraphina Nix, Chris Canal, Daniel O'Connel, Pip Arnott, Ryan Bloom, Thomas Broadley, Katharyn Garcia, Brian Goodrich, Max Hasin, Sami Jawhar, Megan Kinniment, Thomas Kwa, Aron Lajko, Nate Rush, Lucas Jun Koba Sato, Sydney von Arx, Ben West, Lawrence Chan, Elizabeth Barnes

To understand and predict the societal impacts of highly autonomous AI systems, we need benchmarks with grounding, i.e., metrics that directly connect AI performance to real-world effects we care about.

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

2 code implementations · 19 Jul 2024 · Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso

Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the true algorithm is unknown.

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

1 code implementation · 19 Jul 2024 · Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso

When applying reinforcement learning from human feedback (RLHF), the reward is learned from data and, therefore, always has some error.
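The KL-regularized objective referenced in the title can be written in its standard form (the notation below is a generic formulation, not necessarily the paper's own): the policy π is optimized against a learned proxy reward r̂ while a KL penalty with coefficient β keeps it near a reference policy π₀.

```latex
\max_{\pi} \; \mathbb{E}_{x \sim \pi}\!\left[\hat{r}(x)\right] \;-\; \beta \, D_{\mathrm{KL}}\!\left(\pi \,\middle\|\, \pi_{0}\right)
```

The title's claim concerns the case where the reward error, i.e. the difference between the proxy r̂ and the true reward, is heavy-tailed: the KL penalty alone then does not prevent the misspecification from being exploited.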

Compact Proofs of Model Performance via Mechanistic Interpretability

2 code implementations · 17 Jun 2024 · Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan

We propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance.

