Search Results for author: Adam Scherlis

Found 6 papers, 2 papers with code

Estimating the Probability of Sampling a Trained Neural Network at Random

no code implementations • 31 Jan 2025 • Adam Scherlis, Nora Belrose

We present an algorithm for estimating the probability mass, under a Gaussian or uniform prior, of a region in neural network parameter space corresponding to a particular behavior, such as achieving test loss below some threshold.

Inductive Bias • Language Modeling • +1
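
As an illustration of the kind of quantity the paper above estimates, here is a minimal sketch of a naive Monte Carlo estimator of the probability, under a Gaussian prior over parameters, that a network's loss falls below a threshold. This is not the paper's algorithm (which is designed to handle regions far too small for naive sampling); the function names and the toy loss are hypothetical.

```python
# Illustrative sketch only: naive Monte Carlo estimate of the prior probability
# that a parameter vector drawn from a Gaussian achieves loss below a threshold.
# The paper's actual estimator is more sophisticated; everything here is a toy.
import torch

def prob_loss_below_threshold(loss_fn, num_params, threshold,
                              n_samples=10_000, prior_std=1.0):
    """Fraction of Gaussian-prior samples whose loss falls below `threshold`."""
    hits = 0
    for _ in range(n_samples):
        theta = prior_std * torch.randn(num_params)  # sample from N(0, prior_std^2 I)
        if loss_fn(theta) < threshold:
            hits += 1
    return hits / n_samples

if __name__ == "__main__":
    # Toy "behavior": mean squared parameter value below 0.5.
    toy_loss = lambda theta: theta.pow(2).mean()
    print(prob_loss_below_threshold(toy_loss, num_params=10, threshold=0.5))
```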

Understanding Gradient Descent through the Training Jacobian

1 code implementation • 9 Dec 2024 • Nora Belrose, Adam Scherlis

We examine the geometry of neural network training using the Jacobian of trained network parameters with respect to their initial values.
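A minimal sketch of the object studied above, assuming a toy differentiable training loop: the "training Jacobian" is the Jacobian of the trained parameters with respect to their initial values, which can be obtained by differentiating through explicit gradient-descent steps. The toy quadratic loss and step counts here are hypothetical, not the paper's setup.

```python
# Sketch (not the paper's code): Jacobian of final parameters w.r.t. initial
# parameters, computed by backpropagating through a short, explicit training loop.
import torch

def train(theta0, lr=0.1, steps=20):
    """Plain gradient descent on a toy quadratic loss, kept differentiable."""
    theta = theta0
    A = torch.diag(torch.linspace(0.1, 2.0, theta0.numel()))  # toy curvature matrix
    for _ in range(steps):
        loss = 0.5 * theta @ A @ theta
        (grad,) = torch.autograd.grad(loss, theta, create_graph=True)
        theta = theta - lr * grad  # out-of-place update keeps the graph intact
    return theta

theta0 = torch.randn(5, requires_grad=True)
J = torch.autograd.functional.jacobian(train, theta0)  # d(theta_final) / d(theta_0)
print(torch.linalg.svdvals(J))  # singular values show expanding vs. contracting directions
```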

Refusal in LLMs is an Affine Function

1 code implementation • 13 Nov 2024 • Thomas Marshall, Adam Scherlis, Nora Belrose

We propose affine concept editing (ACE) as an approach for steering language models' behavior by intervening directly in activations.
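A minimal sketch of an affine activation edit in the spirit of ACE (not the authors' implementation): remove the component of an activation along a concept direction, measured relative to a reference point, so the intervention is affine rather than purely linear. The direction, reference point, and activation shapes below are illustrative assumptions.

```python
# Sketch of an affine activation edit: project out a concept direction from
# (activation - reference), then add the reference back. Illustrative only.
import torch

def affine_concept_edit(activation, direction, reference):
    """Remove the `direction` component of `activation`, measured from `reference`."""
    d = direction / direction.norm()
    centered = activation - reference
    edited = centered - (centered @ d).unsqueeze(-1) * d  # zero the component along d
    return edited + reference

# Toy usage on a batch of hidden states of width 16 (all tensors hypothetical).
h = torch.randn(4, 16)           # stand-in for residual-stream activations
refusal_dir = torch.randn(16)    # stand-in for a "refusal" concept direction
ref_point = torch.zeros(16)      # stand-in for a reference activation
h_edited = affine_concept_edit(h, refusal_dir, ref_point)
# Component along the direction (relative to the reference) is now ~0:
print(((h_edited - ref_point) @ (refusal_dir / refusal_dir.norm())).abs().max())
```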

Polysemanticity and Capacity in Neural Networks

no code implementations • 4 Oct 2022 • Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, Buck Shlegeris

We show that, in a toy model, the optimal capacity allocation tends to monosemantically represent the most important features, polysemantically represent less important features (in proportion to their impact on the loss), and entirely ignore the least important features.

Adversarial Training for High-Stakes Reliability

no code implementations • 3 May 2022 • Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas

We found that adversarial training increased robustness to the adversarial attacks that we trained on -- doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes) -- without affecting in-distribution performance.

Text Generation • Vocal Bursts Intensity Prediction

The Goldilocks zone: Towards better understanding of neural network loss landscapes

no code implementations • 6 Jul 2018 • Stanislav Fort, Adam Scherlis

We observe this effect for fully-connected neural networks over a range of network widths and depths on MNIST and CIFAR-10 datasets with the $\mathrm{ReLU}$ and $\tanh$ non-linearities, and a similar effect for convolutional networks.
