Search Results for author: Nora Belrose

Found 7 papers, 6 papers with code

Does Transformer Interpretability Transfer to RNNs?

no code implementations • 9 Apr 2024 • Gonçalo Paulo, Thomas Marshall, Nora Belrose

Recent advances in recurrent neural network architectures, such as Mamba and RWKV, have enabled RNNs to match or exceed the performance of equal-size transformers in terms of language modeling perplexity and downstream evaluations, suggesting that future systems may be built on completely new architectures.

Language Modelling

Neural Networks Learn Statistics of Increasing Complexity

1 code implementation • 6 Feb 2024 • Nora Belrose, Quintin Pope, Lucia Quirke, Alex Mallen, Xiaoli Fern

The distributional simplicity bias (DSB) posits that neural networks learn low-order moments of the data distribution first, before moving on to higher-order correlations.
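
As a rough illustration of what "low-order moments" means here (a hypothetical sketch, not the paper's code or experimental setup): one way to probe the DSB is to compare a classifier's behaviour on real inputs against a Gaussian surrogate that matches only the data's mean and covariance, so any higher-order structure is destroyed.

```python
# Hypothetical sketch (not the paper's code): compare a classifier's behaviour
# on real inputs vs. a Gaussian surrogate that matches only the data's
# first two moments (mean and covariance), i.e. its "low-order statistics".
import numpy as np
import torch

def gaussian_surrogate(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Sample inputs with the same mean and covariance as x, but no higher-order structure."""
    flat = x.reshape(len(x), -1)
    mean, cov = flat.mean(axis=0), np.cov(flat, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=len(x)).reshape(x.shape).astype(x.dtype)

@torch.no_grad()
def prediction_agreement(model: torch.nn.Module, real: np.ndarray, fake: np.ndarray) -> float:
    """Fraction of examples where the model gives the same top-1 prediction on both."""
    p_real = model(torch.from_numpy(real)).argmax(dim=-1)
    p_fake = model(torch.from_numpy(fake)).argmax(dim=-1)
    return (p_real == p_fake).float().mean().item()
```

Under the DSB, agreement between real and moment-matched inputs should be high early in training and drop as the network begins exploiting higher-order correlations the surrogate lacks.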

Eliciting Latent Knowledge from Quirky Language Models

1 code implementation • 2 Dec 2023 • Alex Mallen, Madeline Brumley, Julia Kharchenko, Nora Belrose

Eliciting Latent Knowledge (ELK) aims to find patterns in a capable neural network's activations that robustly track the true state of the world, especially in hard-to-verify cases where the model's output is untrusted.

Anomaly Detection, Math
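
A common baseline for this kind of latent-knowledge search is a linear probe on hidden activations. The sketch below is a hypothetical illustration of that baseline, not the paper's ELK method; the array names and the "easy vs. hard" split are assumptions for the example.

```python
# Hypothetical probing sketch (not the paper's method): fit a linear probe on
# hidden activations to predict a binary "is this statement true?" label,
# then check whether it still tracks truth where the model's output is untrusted.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_truth_probe(acts: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """acts: (n_examples, hidden_dim) activations; labels: (n_examples,) in {0, 1}."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(acts, labels)
    return probe

# Usage (hypothetical arrays): train on "easy" examples where the model's
# output is trustworthy, then evaluate on "hard" ones where it is not.
# probe = fit_truth_probe(easy_acts, easy_labels)
# hard_accuracy = probe.score(hard_acts, hard_labels)
```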

Eliciting Latent Predictions from Transformers with the Tuned Lens

2 code implementations • 14 Mar 2023 • Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt

We analyze transformers from the perspective of iterative inference, seeking to understand how model predictions are refined layer by layer.

Language Modelling
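
The tuned lens makes this layer-by-layer refinement concrete: it decodes each intermediate hidden state into a next-token distribution through a learned per-layer affine "translator" followed by the model's unembedding. The sketch below is an illustrative approximation using the Hugging Face GPT-2 layout, not the authors' released implementation; with identity-initialized translators and the training step omitted, it reduces to a logit-lens-style baseline.

```python
# Illustrative sketch of decoding intermediate layers (logit-lens-style baseline
# with per-layer affine translators, in the spirit of the tuned lens; this is
# not the authors' implementation, and translator training is omitted).
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
d_model = model.config.hidden_size
n_layers = model.config.num_hidden_layers

# One affine "translator" per layer, initialized to the identity map.
translators = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
for t in translators:
    nn.init.eye_(t.weight)
    nn.init.zeros_(t.bias)

@torch.no_grad()
def layerwise_predictions(text: str) -> list[str]:
    """Return the top predicted next token after each transformer layer."""
    ids = tok(text, return_tensors="pt").input_ids
    hidden = model(ids, output_hidden_states=True).hidden_states  # embeddings + each layer
    ln_f, unembed = model.transformer.ln_f, model.lm_head
    preds = []
    for layer, h in enumerate(hidden[1:]):
        logits = unembed(ln_f(translators[layer](h[:, -1])))
        preds.append(tok.decode(logits.argmax(-1)))
    return preds
```

Watching how the top prediction changes across layers is the "iterative inference" picture: early layers yield rough guesses that later layers progressively refine toward the final output distribution.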

Adversarial Policies Beat Superhuman Go AIs

2 code implementations • 1 Nov 2022 • Tony T. Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell

The core vulnerability uncovered by our attack persists even in KataGo agents adversarially trained to defend against our attack.
