no code implementations • 9 Apr 2024 • Gonçalo Paulo, Thomas Marshall, Nora Belrose
Recent advances in recurrent neural network architectures, such as Mamba and RWKV, have enabled RNNs to match or exceed the performance of equal-size transformers in terms of language modeling perplexity and downstream evaluations, suggesting that future systems may be built on completely new architectures.
1 code implementation • 6 Feb 2024 • Nora Belrose, Quintin Pope, Lucia Quirke, Alex Mallen, Xiaoli Fern
The distributional simplicity bias (DSB) posits that neural networks learn low-order moments of the data distribution first, before moving on to higher-order correlations.
1 code implementation • 2 Dec 2023 • Alex Mallen, Madeline Brumley, Julia Kharchenko, Nora Belrose
Eliciting Latent Knowledge (ELK) aims to find patterns in a capable neural network's activations that robustly track the true state of the world, especially in hard-to-verify cases where the model's output is untrusted.
1 code implementation • NeurIPS 2023 • Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman
Concept erasure aims to remove specified features from a representation.
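The idea can be illustrated with a minimal linear-erasure sketch: plant a binary concept along one direction of a toy representation, then project that direction out so the class-conditional means coincide. This is a simplified mean-difference projection, not the paper's actual method; all names and shapes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy representations: 1000 samples, 16-dim, with a binary "concept" label
# linearly encoded along coordinate 0 (illustrative setup).
n, d = 1000, 16
z = rng.integers(0, 2, size=n)       # concept labels
x = rng.normal(size=(n, d))
x[:, 0] += 3.0 * z                   # plant the concept

# Mean-difference erasure: remove the direction separating the
# class-conditional means (a simplified stand-in for linear erasure).
delta = x[z == 1].mean(0) - x[z == 0].mean(0)
u = delta / np.linalg.norm(delta)
x_erased = x - np.outer(x @ u, u)

# Along u, the gap between class means is now zero (up to float error),
# so a linear probe can no longer read the concept off that direction.
gap = (x_erased[z == 1].mean(0) - x_erased[z == 0].mean(0)) @ u
```

Note the sketch only guards against probes along the single planted direction; erasing a concept that is encoded across several directions requires projecting out a whole subspace.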
2 code implementations • 14 Mar 2023 • Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt
We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer.
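The iterative-inference view can be sketched with a toy residual stream: each "layer" adds an update to a hidden state, and projecting the intermediate state through the unembedding matrix at every layer shows the prediction being refined. This logit-lens-style readout is a simplified illustration, not the paper's tuned decoder; all shapes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a 4-layer "residual stream" over a 10-token vocabulary
# (all sizes illustrative).
d_model, vocab, n_layers = 32, 10, 4
W_U = rng.normal(size=(d_model, vocab))   # unembedding matrix
target = 7                                # token the model converges toward
h = rng.normal(size=d_model) * 0.1        # initial hidden state

def decode(h):
    """Logit-lens-style readout: project a hidden state to token logits."""
    return W_U.T @ h

ranks = []
for layer in range(n_layers):
    h = h + 0.5 * W_U[:, target]          # residual update toward the target
    logits = decode(h)
    # Rank of the target token at this layer (0 = top prediction)
    ranks.append(int((logits > logits[target]).sum()))

# Decoding every intermediate state exposes how the prediction sharpens
# layer by layer; by the final layer the target token is ranked first.
```

In a real transformer the same readout is applied to each layer's residual-stream activations; a learned per-layer affine correction (as in a tuned decoder) makes the intermediate readouts substantially more faithful than the raw projection shown here.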
2 code implementations • 22 Nov 2022 • Adam Gleave, Mohammad Taufeeque, Juan Rocamonde, Erik Jenner, Steven H. Wang, Sam Toyer, Maximilian Ernestus, Nora Belrose, Scott Emmons, Stuart Russell
The imitation library provides open-source implementations of imitation learning and reward learning algorithms in PyTorch.
2 code implementations • 1 Nov 2022 • Tony T. Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell
The core vulnerability uncovered by our attack persists even in KataGo agents adversarially trained to defend against it.