1 code implementation • 5 Sep 2024 • Maheep Chaudhary, Atticus Geiger
A popular new method in mechanistic interpretability is to train high-dimensional sparse autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of analysis.
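As a rough sketch of that setup (the dimensions, sparsity penalty, and activation data below are placeholder assumptions, not values from the paper), an SAE can be trained on cached activations with a reconstruction loss plus an L1 sparsity term:

```python
# Minimal sparse autoencoder (SAE) sketch in PyTorch; all sizes and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

d_model, d_sae = 512, 4096          # SAE hidden layer is much wider than the model
l1_coeff = 1e-3                     # strength of the sparsity penalty (assumed)

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))   # sparse, non-negative SAE features
        return self.decoder(feats), feats

sae = SparseAutoencoder(d_model, d_sae)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(1024, d_model)             # stand-in for cached neuron activations
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
loss.backward()
opt.step()
```

Analyses of this kind then treat the individual `feats` dimensions, rather than raw neurons, as the units of analysis.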
1 code implementation • 20 Aug 2024 • Róbert Csordás, Christopher Potts, Christopher D. Manning, Atticus Geiger
In this paper, we present a counterexample to this strong LRH: when trained to repeat an input token sequence, gated recurrent neural networks (RNNs) learn to represent the token at each position with a particular order of magnitude, rather than a direction.
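A toy illustration of such a magnitude-based code (a hand-built construction, not the trained RNN from the paper; the vocabulary size, width, and decay factor are arbitrary):

```python
# Toy illustration of storing a sequence by order of magnitude rather than by
# direction alone: each position is written at a different scale and read off
# greedily from largest to smallest. Hand-built construction, not a trained RNN.
import numpy as np

rng = np.random.default_rng(0)
vocab, d, gamma = 8, 64, 0.1                     # vocabulary, width, magnitude decay (assumed)
E = np.linalg.qr(rng.standard_normal((d, vocab)))[0].T   # orthonormal token embeddings

tokens = [3, 1, 7, 2]
state = sum((gamma ** i) * E[t] for i, t in enumerate(tokens))  # position -> magnitude

decoded = []
for _ in tokens:
    scores = E @ state
    t = int(np.argmax(np.abs(scores)))           # largest-magnitude component = earliest token
    decoded.append(t)
    state = state - scores[t] * E[t]             # peel it off to expose the next position
print(decoded)                                   # [3, 1, 7, 2]
```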
1 code implementation • 12 Jun 2024 • Amir Zur, Elisa Kreiss, Karel D'Oosterlinck, Christopher Potts, Atticus Geiger
This model correlates with the judgements of blind and low-vision people while preserving transfer capabilities, and it has interpretable structure that sheds light on the caption–description distinction.

2 code implementations • 4 Apr 2024 • Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, Christopher Potts
We define a strong instance of the ReFT family, Low-rank Linear Subspace ReFT (LoReFT), and we identify an ablation of this method that trades some performance for increased efficiency.
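A minimal sketch of a low-rank linear subspace intervention of this general shape (the parametrization, shapes, and hyperparameters below are assumptions for illustration, not the released implementation):

```python
# Sketch of a low-rank linear subspace edit: overwrite the coordinates of a
# hidden state h inside an r-dimensional subspace (rows of R) with learned
# values, leaving the orthogonal complement untouched. Details are assumed.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class LowRankSubspaceEdit(nn.Module):
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.proj = orthogonal(nn.Linear(d_model, rank, bias=False))  # R with orthonormal rows
        self.edit = nn.Linear(d_model, rank)                          # learned targets W h + b

    def forward(self, h):
        # h + R^T((W h + b) - R h): replace the subspace coordinates of h
        delta = self.edit(h) - self.proj(h)
        return h + delta @ self.proj.weight

h = torch.randn(2, 16, 768)                       # (batch, seq, d_model), assumed shape
edited = LowRankSubspaceEdit(d_model=768, rank=4)(h)
```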
3 code implementations • 12 Mar 2024 • Zhengxuan Wu, Atticus Geiger, Aryaman Arora, Jing Huang, Zheng Wang, Noah D. Goodman, Christopher D. Manning, Christopher Potts
Interventions on model-internal states are fundamental operations in many areas of AI, including model editing, steering, robustness, and interpretability.
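A library-agnostic sketch of one such operation, using a PyTorch forward hook to steer an intermediate activation (the toy model and steering vector are placeholders, not any particular library's interface):

```python
# Minimal activation intervention: a forward hook adds a steering vector to
# one layer's output during the forward pass. Toy model and vector are assumed.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
steering = torch.zeros(16)
steering[3] = 5.0                          # push one internal unit toward a fixed value

def intervene(module, inputs, output):
    return output + steering               # returned value replaces the layer's output

handle = model[1].register_forward_hook(intervene)
with torch.no_grad():
    steered = model(torch.randn(2, 8))     # forward pass runs with the intervention applied
handle.remove()                            # restore the original computation
```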
1 code implementation • 27 Feb 2024 • Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger
Individual neurons participate in the representation of multiple high-level concepts.
1 code implementation • 23 Jan 2024 • Zhengxuan Wu, Atticus Geiger, Jing Huang, Aryaman Arora, Thomas Icard, Christopher Potts, Noah D. Goodman
We respond to the recent paper by Makelov et al. (2023), which reviews subspace interchange intervention methods like distributed alignment search (DAS; Geiger et al. 2023) and claims that these methods potentially cause "interpretability illusions".
1 code implementation • 23 Oct 2023 • Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda
Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs).
no code implementations • 19 Sep 2023 • Jing Huang, Atticus Geiger, Karel D'Oosterlinck, Zhengxuan Wu, Christopher Potts
Natural language is an appealing medium for explaining how large language models process and store information, but evaluating the faithfulness of such explanations is challenging.
1 code implementation • 30 May 2023 • Jingyuan Selena She, Christopher Potts, Samuel R. Bowman, Atticus Geiger
For in-context learning, we test InstructGPT models and find that most prompt strategies are not successful, including those using step-by-step reasoning.
1 code implementation • NeurIPS 2023 • Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, Noah D. Goodman
With Boundless DAS, we discover that Alpaca does this by implementing a causal model with two interpretable boolean variables.
1 code implementation • 5 Mar 2023 • Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, Noah D. Goodman
In DAS, we find the alignment between high-level and low-level models using gradient descent rather than conducting a brute-force search, and we allow individual neurons to play multiple distinct roles by analyzing representations in non-standard bases: distributed representations.
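The core operation can be sketched as an interchange intervention in a rotated basis (the rotation below is a fixed random orthogonal matrix purely for illustration; in DAS it is a parameter trained by gradient descent so that the swapped subspace lines up with a high-level variable):

```python
# Distributed interchange intervention sketch: rotate base and source hidden
# states into a (here random, in DAS learned) orthogonal basis, swap the first
# k coordinates, and rotate back. Sizes are illustrative assumptions.
import torch

d, k = 64, 4
Q, _ = torch.linalg.qr(torch.randn(d, d))        # stand-in for the learned rotation

def interchange(base, source):
    rb, rs = base @ Q, source @ Q                # coordinates in the rotated basis
    rb[..., :k] = rs[..., :k]                    # swap the aligned k-dimensional subspace
    return rb @ Q.T                              # back to the standard neuron basis

base, source = torch.randn(2, d), torch.randn(2, d)
patched = interchange(base, source)              # base state carrying the source's variable value
```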
no code implementations • 11 Jan 2023 • Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, Thomas Icard
Causal abstraction provides a theoretical foundation for mechanistic interpretability, the field concerned with providing intelligible algorithms that are faithful simplifications of the known but opaque low-level details of black-box AI models.
no code implementations • 22 Nov 2022 • Riccardo Massidda, Atticus Geiger, Thomas Icard, Davide Bacciu
Causal abstraction provides a theory describing how several causal models can represent the same system at different levels of detail.
1 code implementation • 28 Sep 2022 • Zhengxuan Wu, Karel D'Oosterlinck, Atticus Geiger, Amir Zur, Christopher Potts
The core of our proposal is the Causal Proxy Model (CPM).
1 code implementation • 27 May 2022 • Eldar David Abraham, Karel D'Oosterlinck, Amir Feder, Yair Ori Gat, Atticus Geiger, Christopher Potts, Roi Reichart, Zhengxuan Wu
We introduce CEBaB, a new benchmark dataset for assessing concept-based explanation methods in Natural Language Processing (NLP).
1 code implementation • NAACL 2022 • Zhengxuan Wu, Atticus Geiger, Josh Rozner, Elisa Kreiss, Hanson Lu, Thomas Icard, Christopher Potts, Noah D. Goodman
Distillation efforts have led to language models that are more compact and efficient without serious drops in performance.
2 code implementations • 1 Dec 2021 • Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah D. Goodman, Christopher Potts
In IIT, we (1) align variables in a causal model (e.g., a deterministic program or Bayesian network) with representations in a neural model and (2) train the neural model to match the counterfactual behavior of the causal model on a base input when aligned representations in both models are set to the values they would take for a source input.
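One IIT-style training step might look roughly like the following sketch (the toy causal model, the alignment to the first k hidden units, and the architecture are assumptions for illustration):

```python
# Sketch of an interchange intervention training (IIT) step on a toy task:
# the high-level causal model computes s = x1 + x2, then y = s + x3, and the
# variable s is aligned with the first k neurons of the network's hidden layer.
import torch
import torch.nn as nn

d_hidden, k = 32, 8
net_in = nn.Linear(3, d_hidden)
net_out = nn.Linear(d_hidden, 1)
opt = torch.optim.Adam(list(net_in.parameters()) + list(net_out.parameters()), lr=1e-3)

def causal_model(x, s_override=None):
    s = x[:, 0] + x[:, 1] if s_override is None else s_override
    return (s + x[:, 2]).unsqueeze(-1)

base, source = torch.rand(16, 3), torch.rand(16, 3)

# Run the network on the base input, but patch in the source's aligned neurons.
h_base, h_source = torch.relu(net_in(base)), torch.relu(net_in(source))
h_patched = torch.cat([h_source[:, :k], h_base[:, k:]], dim=-1)
pred = net_out(h_patched)

# Counterfactual target: the causal model on the base input with s set to the
# value it takes on the source input.
target = causal_model(base, s_override=source[:, 0] + source[:, 1])

loss = ((pred - target) ** 2).mean()   # in practice, combined with the ordinary task loss
opt.zero_grad()
loss.backward()
opt.step()
```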
1 code implementation • NeurIPS 2021 • Atticus Geiger, Hanson Lu, Thomas Icard, Christopher Potts
Structural analysis methods (e.g., probing and feature attribution) are increasingly important tools for neural network analysis.
no code implementations • NAACL 2021 • Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, Adina Williams
We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking.
1 code implementation • ACL 2021 • Christopher Potts, Zhengxuan Wu, Atticus Geiger, Douwe Kiela
We introduce DynaSent ('Dynamic Sentiment'), a new English-language benchmark task for ternary (positive/negative/neutral) sentiment analysis.
1 code implementation • 14 Jun 2020 • Atticus Geiger, Alexandra Carstensen, Michael C. Frank, Christopher Potts
In the latter two cases, our models perform tasks proposed in previous work to demarcate human-unique symbolic abilities.
1 code implementation • EMNLP (BlackboxNLP) 2020 • Atticus Geiger, Kyle Richardson, Christopher Potts
We address whether neural models for Natural Language Inference (NLI) can learn the compositional interactions between lexical entailment and negation, using four methods: the behavioral evaluation methods of (1) challenge test sets and (2) systematic generalization tasks, and the structural evaluation methods of (3) probes and (4) interventions.
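For reference, a probe in this sense is a supervised read-out trained on frozen representations; a minimal sketch, with random stand-ins for the cached activations and concept labels:

```python
# Minimal linear probe sketch: train a classifier to read a concept (e.g.,
# "is this token inside the scope of a negation?") off frozen hidden states.
# The activations and labels below are random stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

reps = torch.randn(512, 256)                     # frozen hidden states (assumed shape)
labels = torch.randint(0, 2, (512,))             # binary concept labels

probe = nn.Linear(256, 2)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for _ in range(200):
    loss = F.cross_entropy(probe(reps), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
# High probe accuracy shows the concept is decodable from the representation;
# interventions (method 4) test whether the model actually uses it causally.
```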
no code implementations • IJCNLP 2019 • Atticus Geiger, Ignacio Cases, Lauri Karttunen, Chris Potts
Deep learning models for semantics are generally evaluated using naturalistic corpora.
no code implementations • NAACL 2019 • Ignacio Cases, Clemens Rosenbaum, Matthew Riemer, Atticus Geiger, Tim Klinger, Alex Tamkin, Olivia Li, Sandhini Agarwal, Joshua D. Greene, Dan Jurafsky, Christopher Potts, Lauri Karttunen
The model jointly optimizes the parameters of the functions and the meta-learner's policy for routing inputs through those functions.
no code implementations • 30 Oct 2018 • Atticus Geiger, Ignacio Cases, Lauri Karttunen, Christopher Potts
Standard evaluations of deep learning models for semantics using naturalistic corpora are limited in what they can tell us about the fidelity of the learned representations, because the corpora rarely come with good measures of semantic complexity.