no code implementations • 9 Apr 2024 • Gonçalo Paulo, Thomas Marshall, Nora Belrose
Recent advances in recurrent neural network architectures, such as Mamba and RWKV, have enabled RNNs to match or exceed the performance of equal-size transformers in terms of language modeling perplexity and downstream evaluations, suggesting that future systems may be built on completely new architectures.
1 code implementation • 6 Feb 2024 • Nora Belrose, Quintin Pope, Lucia Quirke, Alex Mallen, Xiaoli Fern
The distributional simplicity bias (DSB) posits that neural networks learn low-order moments of the data distribution first, before moving on to higher-order correlations.
1 code implementation • 2 Dec 2023 • Alex Mallen, Madeline Brumley, Julia Kharchenko, Nora Belrose
Eliciting Latent Knowledge (ELK) aims to find patterns in a capable neural network's activations that robustly track the true state of the world, especially in hard-to-verify cases where the model's output is untrusted.
1 code implementation • NeurIPS 2023 • Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman
Concept erasure aims to remove specified features from a representation.
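The idea can be illustrated with a minimal linear-erasure sketch: plant a binary concept along one direction of a toy representation, then project that direction out so the class-conditional means coincide. This is a simplified mean-difference projection, not the paper's actual method; all names and shapes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy representations: 1000 samples, 16-dim, with a binary "concept" label
# linearly encoded along coordinate 0 (illustrative setup).
n, d = 1000, 16
z = rng.integers(0, 2, size=n)       # concept labels
x = rng.normal(size=(n, d))
x[:, 0] += 3.0 * z                   # plant the concept

# Mean-difference erasure: remove the direction separating the
# class-conditional means (a simplified stand-in for linear erasure).
delta = x[z == 1].mean(0) - x[z == 0].mean(0)
u = delta / np.linalg.norm(delta)
x_erased = x - np.outer(x @ u, u)

# Along u, the gap between class means is now zero (up to float error),
# so a linear probe can no longer read the concept off that direction.
gap = (x_erased[z == 1].mean(0) - x_erased[z == 0].mean(0)) @ u
```

Note the sketch only guards against probes along the single planted direction; erasing a concept that is encoded across several directions requires projecting out a whole subspace.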
2 code implementations • 14 Mar 2023 • Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt
We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer.
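The iterative-inference view can be sketched with a toy residual stream: each "layer" adds an update to a hidden state, and projecting the intermediate state through the unembedding matrix at every layer shows the prediction being refined. This logit-lens-style readout is a simplified illustration, not the paper's tuned decoder; all shapes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a 4-layer "residual stream" over a 10-token vocabulary
# (all sizes illustrative).
d_model, vocab, n_layers = 32, 10, 4
W_U = rng.normal(size=(d_model, vocab))   # unembedding matrix
target = 7                                # token the model converges toward
h = rng.normal(size=d_model) * 0.1        # initial hidden state

def decode(h):
    """Logit-lens-style readout: project a hidden state to token logits."""
    return W_U.T @ h

ranks = []
for layer in range(n_layers):
    h = h + 0.5 * W_U[:, target]          # residual update toward the target
    logits = decode(h)
    # Rank of the target token at this layer (0 = top prediction)
    ranks.append(int((logits > logits[target]).sum()))

# Decoding every intermediate state exposes how the prediction sharpens
# layer by layer; by the final layer the target token is ranked first.
```

In a real transformer the same readout is applied to each layer's residual-stream activations; a learned per-layer affine correction (as in a tuned decoder) makes the intermediate readouts substantially more faithful than the raw projection shown here.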
2 code implementations • 22 Nov 2022 • Adam Gleave, Mohammad Taufeeque, Juan Rocamonde, Erik Jenner, Steven H. Wang, Sam Toyer, Maximilian Ernestus, Nora Belrose, Scott Emmons, Stuart Russell
The imitation library provides open-source implementations of imitation learning and reward learning algorithms in PyTorch.
2 code implementations • 1 Nov 2022 • Tony T. Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell
The core vulnerability uncovered by our attack persists even in KataGo agents adversarially trained to defend against it.