no code implementations • 31 Jan 2025 • Adam Scherlis, Nora Belrose
We present an algorithm for estimating the probability mass, under a Gaussian or uniform prior, of a region in neural network parameter space corresponding to a particular behavior, such as achieving test loss below some threshold.
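The quantity being estimated can be illustrated with a naive Monte Carlo baseline: draw parameters from the prior and count how often they land in the low-loss region. This is not the paper's algorithm (which targets regimes where such regions are far too rare for direct sampling); the loss function and threshold below are toy stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta):
    # Toy stand-in for "test loss as a function of parameters".
    return np.sum(theta**2, axis=-1)

def prob_mass_mc(threshold, n_samples=100_000, dim=2):
    # Draw parameters from a standard Gaussian prior and estimate the
    # probability mass of the region where the behavior holds,
    # i.e. the fraction of samples with loss below the threshold.
    theta = rng.standard_normal((n_samples, dim))
    return np.mean(loss(theta) < threshold)

estimate = prob_mass_mc(threshold=1.0)
```

For this 2-D toy case the loss is chi-squared distributed with two degrees of freedom, so the true mass is 1 − e^(−1/2) ≈ 0.39, which the estimate recovers.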
1 code implementation • 9 Dec 2024 • Nora Belrose, Adam Scherlis
We examine the geometry of neural network training using the Jacobian of trained network parameters with respect to their initial values.
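A minimal sketch of the central object, not the paper's implementation: treat training as a map from initial parameters to trained parameters and differentiate it. For a quadratic loss, each gradient step is linear, so the Jacobian of the training map has a closed form that a finite-difference estimate can be checked against.

```python
import numpy as np

# Quadratic toy loss L(theta) = 0.5 * theta^T A theta, so one gradient
# step is theta <- (I - eta*A) theta and the training map is linear with
# Jacobian (I - eta*A)^T (T-fold matrix power).
A = np.diag([1.0, 0.5, 0.1])
eta, T = 0.1, 50

def train(theta0):
    # Run T gradient-descent steps from the given initialization.
    theta = theta0.copy()
    for _ in range(T):
        theta = theta - eta * (A @ theta)
    return theta

def jacobian_fd(f, x, eps=1e-6):
    # Central finite-difference Jacobian of f at x.
    n = x.size
    J = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        J[:, i] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

theta0 = np.array([1.0, -2.0, 3.0])
J = jacobian_fd(train, theta0)
J_exact = np.linalg.matrix_power(np.eye(3) - eta * A, T)
```

The agreement of `J` with `J_exact` confirms the finite-difference estimator; for real networks the training map is nonlinear and the Jacobian would be computed with automatic differentiation instead.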
1 code implementation • 13 Nov 2024 • Thomas Marshall, Adam Scherlis, Nora Belrose
We propose affine concept editing (ACE) as an approach for steering language models' behavior by intervening directly in activations.
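A sketch of an affine activation intervention in the spirit described, though not necessarily the paper's exact formulation: given a (hypothetical) concept direction `v` and reference point `mu`, set the hidden state's coordinate along `v`, measured relative to `mu`, to a chosen target value. Setting the target to zero ablates the concept component; other targets steer it.

```python
import numpy as np

def affine_edit(h, v, mu, target):
    # Affine intervention on an activation vector h: move its component
    # along the unit concept direction (relative to reference point mu)
    # to `target`, leaving the orthogonal complement untouched.
    v_hat = v / np.linalg.norm(v)
    coeff = (h - mu) @ v_hat             # current concept coordinate
    return h + (target - coeff) * v_hat  # shift coordinate to target

# Illustrative values, not from the paper.
h = np.array([1.0, 2.0, 3.0])
v = np.array([0.0, 0.0, 1.0])
mu = np.zeros(3)
h_edited = affine_edit(h, v, mu, target=0.0)  # ablates the v-component
```

Because the map is affine rather than a plain projection, the reference point `mu` (e.g. a mean activation) determines where the "zero" of the concept lies, which is what distinguishes this from naive directional ablation.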
no code implementations • 4 Oct 2022 • Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, Buck Shlegeris
We show that in a toy model the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features (in proportion to their impact on the loss), and entirely ignore the least important features.
no code implementations • 3 May 2022 • Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas
We found that adversarial training increased robustness to the adversarial attacks that we trained on -- doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes) -- without affecting in-distribution performance.
no code implementations • 6 Jul 2018 • Stanislav Fort, Adam Scherlis
We observe this effect for fully-connected neural networks over a range of network widths and depths on the MNIST and CIFAR-10 datasets with the $\mathrm{ReLU}$ and $\tanh$ non-linearities, and a similar effect for convolutional networks.