1 code implementation • 27 May 2024 • Xiaoman Delores Ding, Zifan Carl Guo, Eric J. Michaud, Ziming Liu, Max Tegmark
To investigate this Survival of the Fittest hypothesis, we conduct a case study on neural networks performing modular addition, and find that these networks' multiple circular representations at different Fourier frequencies undergo such competitive dynamics, with only a few circles surviving at the end.
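As a rough illustration (not code from the paper), the "circles" in question can be spotted by taking the discrete Fourier transform of the embedding matrix over the residue index and looking at which frequencies carry the power; in a trained modular-addition network only a few frequencies dominate. The embedding below is a random stand-in, so its spectrum will look flat rather than peaked.

```python
# Sketch: per-frequency Fourier power of an embedding matrix E of shape (p, d),
# where row a is the embedding of residue a mod p.  A trained modular-addition
# network concentrates power in a few frequencies (the surviving "circles");
# the random E here is only a stand-in and will not show that structure.
import numpy as np

p, d = 59, 128
rng = np.random.default_rng(0)
E = rng.normal(size=(p, d))            # stand-in for trained embedding weights

F = np.fft.rfft(E, axis=0)             # DFT over the residue index
power = (np.abs(F) ** 2).sum(axis=1)   # total power per frequency, summed over dims
power = power / power.sum()

top = 1 + np.argsort(power[1:])[::-1][:5]   # skip the DC (frequency-0) component
for k in top:
    print(f"frequency {k}: fraction of total power = {power[k]:.3f}")
```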
1 code implementation • 23 May 2024 • Joshua Engels, Isaac Liao, Eric J. Michaud, Wes Gurnee, Max Tegmark
Recent work has proposed the linear representation hypothesis: that language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space.
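A toy illustration of the hypothesis the paper examines (synthetic vectors, not activations from a real model): under the linear representation hypothesis, a feature is a direction in activation space, its value is read off with a dot product, and the model can be steered by adding a multiple of that direction.

```python
# Toy sketch of the linear representation hypothesis: a "feature" is a unit
# direction v; its value at an activation x is x @ v, and an intervention adds
# a multiple of v to x.  All quantities here are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512
v = rng.normal(size=d_model)
v /= np.linalg.norm(v)                 # hypothetical unit feature direction

x = rng.normal(size=d_model)           # stand-in for a residual-stream activation
feature_value = x @ v                  # linear "readout" of the feature
x_steered = x + 3.0 * v                # intervention: push the feature up by 3

print(f"before: {x @ v:.2f}   after steering: {x_steered @ v:.2f}")
```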
2 code implementations • 28 Mar 2024 • Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller
We introduce methods for discovering and applying sparse feature circuits.
1 code implementation • 7 Feb 2024 • Eric J. Michaud, Isaac Liao, Vedang Lad, Ziming Liu, Anish Mudide, Chloe Loughridge, Zifan Carl Guo, Tara Rezaei Kheirkhah, Mateja Vukelić, Max Tegmark
We present MIPS, a novel method for program synthesis that applies automated mechanistic interpretability to neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code.
no code implementations • 27 Jul 2023 • Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals.
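For context only (this is not the paper's contribution, which is a survey of open problems), the reward-modeling step at the core of RLHF is usually a preference-comparison fit; a toy version of the standard Bradley-Terry objective is sketched below with made-up scores.

```python
# Toy sketch of RLHF reward modeling: given (chosen, rejected) response pairs,
# a reward model r is fit with the Bradley-Terry objective
# -log sigmoid(r(chosen) - r(rejected)).  Scores below are hypothetical.
import numpy as np

def bradley_terry_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Mean negative log-likelihood that the chosen response beats the rejected one."""
    return float(np.mean(np.log1p(np.exp(-(r_chosen - r_rejected)))))

r_chosen = np.array([1.8, 0.4, 2.1])     # reward-model scores on preferred responses
r_rejected = np.array([0.2, 0.9, -1.0])  # scores on the rejected alternatives
print(f"preference loss: {bradley_terry_loss(r_chosen, r_rejected):.3f}")
```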
1 code implementation • NeurIPS 2023 • Eric J. Michaud, Ziming Liu, Uzay Girit, Max Tegmark
We tentatively find that the frequencies with which these quanta are used in the training distribution roughly follow a power law whose exponent matches the empirical scaling exponent for language models, as our theory predicts.
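A back-of-the-envelope version of the argument (notation assumed here rather than taken from the paper: quanta ranked by use frequency p_k, Zipf exponent written as alpha + 1, and a model that has learned the n most frequently used quanta):

$$
p_k \propto k^{-(\alpha+1)}, \qquad
L(n) \;\propto\; \sum_{k>n} p_k \;\approx\; \int_n^{\infty} x^{-(\alpha+1)}\,dx \;=\; \frac{n^{-\alpha}}{\alpha},
$$

so the loss contributed by the quanta not yet learned falls off as a power law $n^{-\alpha}$ whose exponent is set directly by the frequency distribution of the quanta.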
1 code implementation • 24 Oct 2022 • Eric J. Michaud, Ziming Liu, Max Tegmark
We explore unique considerations involved in fitting ML models to data with very high precision, as is often required for science applications.
2 code implementations • 3 Oct 2022 • Ziming Liu, Eric J. Michaud, Max Tegmark
Grokking, the unusual phenomenon in which generalization on algorithmic datasets occurs long after the training data has been overfit, has so far eluded explanation.
1 code implementation • 20 May 2022 • Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams
We aim to understand grokking, a phenomenon where models generalize long after overfitting their training set.
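For readers who have not seen grokking, a minimal sketch of the standard setup in which it is observed follows (modular addition, a small network, a small training fraction, strong weight decay); the architecture and hyperparameters are illustrative, not the paper's.

```python
# Minimal grokking setup sketch (hyperparameters illustrative): modular addition
# with a small MLP, a small training fraction, and strong weight decay.  Train
# accuracy saturates quickly; test accuracy may only rise much later.
import torch
import torch.nn as nn

p = 97
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p

perm = torch.randperm(len(pairs))
n_train = int(0.3 * len(pairs))                  # small training fraction
train_idx, test_idx = perm[:n_train], perm[n_train:]

class AddMLP(nn.Module):
    def __init__(self, p, d=128):
        super().__init__()
        self.embed = nn.Embedding(p, d)
        self.net = nn.Sequential(nn.Linear(2 * d, 256), nn.ReLU(), nn.Linear(256, p))

    def forward(self, ab):
        e = self.embed(ab)                       # (batch, 2, d)
        return self.net(e.flatten(1))

model = AddMLP(p)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

for step in range(20_000):                       # grokking can take many steps
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        print(f"step {step:6d}  train acc {accuracy(train_idx):.2f}  test acc {accuracy(test_idx):.2f}")
```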
1 code implementation • 10 Dec 2020 • Eric J. Michaud, Adam Gleave, Stuart Russell
However, current techniques for reward learning may fail to produce reward functions which accurately reflect user preferences.
1 code implementation • 26 Oct 2020 • Simon Mattsson, Eric J. Michaud, Erik Hoel
Specifically, we introduce the effective information (EI) of a feedforward DNN, the mutual information between a layer's input and output when the input is subjected to a maximum-entropy perturbation.
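A rough sketch of that definition for a single layer is given below: draw inputs from a maximum-entropy (uniform) distribution, push them through the layer, and estimate the mutual information between binned inputs and binned outputs. The layer, binning scheme, and sample size are illustrative and not the paper's exact procedure.

```python
# Rough EI sketch for one feedforward layer: EI ≈ I(input; output) with the
# inputs drawn from a maximum-entropy (uniform) distribution.  The plug-in
# histogram estimate below is crude and upward-biased; it is only a sketch.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, n_bins, n_samples = 2, 2, 8, 200_000

W = rng.normal(size=(n_in, n_out))               # stand-in layer weights
b = rng.normal(size=n_out)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = rng.uniform(0.0, 1.0, size=(n_samples, n_in))   # maximum-entropy input noise
y = sigmoid(x @ W + b)                               # layer output in (0, 1)

def to_state(a):
    """Discretize each coordinate into n_bins bins and fold into one state id."""
    bins = np.clip((a * n_bins).astype(int), 0, n_bins - 1)
    return bins @ (n_bins ** np.arange(a.shape[1]))

xs, ys = to_state(x), to_state(y)
joint = np.zeros((n_bins ** n_in, n_bins ** n_out))
np.add.at(joint, (xs, ys), 1.0)
joint /= joint.sum()
px, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
nz = joint > 0
ei_bits = np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz]))
print(f"estimated EI ≈ {ei_bits:.2f} bits")
```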