1 code implementation • 21 Mar 2025 • David Rein, Joel Becker, Amy Deng, Seraphina Nix, Chris Canal, Daniel O'Connel, Pip Arnott, Ryan Bloom, Thomas Broadley, Katharyn Garcia, Brian Goodrich, Max Hasin, Sami Jawhar, Megan Kinniment, Thomas Kwa, Aron Lajko, Nate Rush, Lucas Jun Koba Sato, Sydney von Arx, Ben West, Lawrence Chan, Elizabeth Barnes
To understand and predict the societal impacts of highly autonomous AI systems, we need benchmarks with grounding, i.e., metrics that directly connect AI performance to real-world effects we care about.
1 code implementation • 18 Mar 2025 • Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan
Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
2 code implementations • 19 Jul 2024 • Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso
Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the true algorithm is unknown.
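One way to see both the difficulty and the natural workaround is to construct a network whose algorithm is known by design, so that an interpretability claim can be checked against ground truth. A minimal sketch of that idea, using a hand-coded AND gate as my own toy example (not the paper's benchmark):

```python
import numpy as np

# Illustrative sketch: build a network whose true algorithm is known by
# construction, then validate an interpretability claim against it.

def build_and_gate():
    # Hand-coded weights: a single ReLU unit computing AND on {0,1} inputs.
    # relu(x0 + x1 - 1) equals 1 iff both inputs are 1.
    w = np.array([1.0, 1.0])
    b = -1.0
    return lambda x: max(0.0, float(w @ x + b))

net = build_and_gate()
claimed_algorithm = lambda x: float(x[0] and x[1])  # an interpretation to validate

# Because the true algorithm is known, the claim can be checked exhaustively.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    assert net(np.array(x)) == claimed_algorithm(x)
```

With real, trained networks the true algorithm is unknown, which is exactly what makes validating interpretability methods hard.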
1 code implementation • 19 Jul 2024 • Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso
When applying reinforcement learning from human feedback (RLHF), the reward is learned from data and, therefore, always has some error.
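A minimal simulation of the general phenomenon this points at (my own simplified stand-in using best-of-n selection, not the paper's KL-regularized RLHF analysis): when the learned reward's error is heavy-tailed, optimizing the proxy can inflate measured reward without gaining true reward.

```python
import numpy as np

# Sketch: optimize a *learned* reward r_hat = r_true + error and measure
# how much true reward the optimizer actually obtains.

rng = np.random.default_rng(0)
n_candidates = 10_000

r_true = rng.normal(0, 1, n_candidates)    # unobserved true reward
error = rng.standard_cauchy(n_candidates)  # heavy-tailed reward-model error
r_hat = r_true + error                     # reward model's estimate

best = np.argmax(r_hat)                    # policy optimizes the proxy
print(f"proxy reward at optimum:     {r_hat[best]:.1f}")
print(f"true reward at optimum:      {r_true[best]:.2f}")
print(f"best achievable true reward: {r_true.max():.2f}")
# With heavy-tailed error, the argmax of r_hat is dominated by the error
# term, so the selected candidate's true reward is close to average.
```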
2 code implementations • 17 Jun 2024 • Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan
We propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance.
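To give a flavor of what deriving a guarantee from reverse-engineered weights can mean, here is a minimal sketch on a toy network of my own (not taken from the paper): once the weights are understood, a short case analysis proves exact behavior on all inputs, a guarantee no finite test set alone could provide.

```python
import numpy as np

# Toy network whose weights implement the identity function.
# Mechanistic insight: relu(x) - relu(-x) = x for every real x.

def tiny_net(x):
    W1 = np.array([[1.0], [-1.0]])  # two hidden units: x and -x
    W2 = np.array([[1.0, -1.0]])    # combine: relu(x) - relu(-x)
    h = np.maximum(W1 @ x, 0.0)
    return (W2 @ h)[0]

# Compact proof sketch: for x >= 0, relu(x) - relu(-x) = x - 0 = x;
# for x < 0, it is 0 - (-x) = x. Hence tiny_net equals the identity
# everywhere. The checks below only spot-check the proven claim.
for x in np.linspace(-3, 3, 7):
    assert np.isclose(tiny_net(np.array([x])), x)
```

The point of the case analysis is that it covers all real inputs at once, which is what lets the resulting guarantee be both formal and compact.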