Search Results for author: Atticus Geiger

Found 26 papers, 19 papers with code

Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small

1 code implementation • 5 Sep 2024 • Maheep Chaudhary, Atticus Geiger

A popular new method in mechanistic interpretability is to train high-dimensional sparse autoencoders (SAEs) on neuron activations and use SAE features as the atomic units of analysis.
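
A minimal sketch of the kind of sparse autoencoder described above may help; the dimensions, expansion factor, and L1 coefficient below are illustrative assumptions on my part, not values from the paper.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Overcomplete autoencoder over model activations: a ReLU encoder yields
    # sparse, non-negative features, a linear decoder reconstructs the
    # activations, and an L1 penalty encourages sparsity.
    def __init__(self, d_model: int = 768, d_sae: int = 768 * 32, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))   # sparse feature code
        recon = self.decoder(features)               # reconstructed activations
        loss = (recon - acts).pow(2).mean() + self.l1_coeff * features.abs().mean()
        return recon, features, loss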

Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations

1 code implementation • 20 Aug 2024 • Róbert Csordás, Christopher Potts, Christopher D. Manning, Atticus Geiger

In this paper, we present a counterexample to this strong LRH: when trained to repeat an input token sequence, gated recurrent neural networks (RNNs) learn to represent the token at each position with a particular order of magnitude, rather than a direction.

Position
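
A toy construction (my own, not the paper's trained RNN) of what storing tokens by magnitude rather than by direction can look like: each position scales its token embedding by a different power of a small constant, and the sequence is recovered from magnitudes alone. The vocabulary size and scale factor are arbitrary assumptions.

import torch

vocab = torch.eye(4)          # four one-hot "token embeddings" (assumed)
scale = 0.1                   # per-position magnitude step (assumed)
tokens = [2, 0, 3]

# Encode: sum the token embeddings, each scaled by a position-dependent magnitude.
state = sum(vocab[t] * scale ** i for i, t in enumerate(tokens))

# Decode: peel off the largest remaining magnitude one position at a time.
decoded, residual = [], state.clone()
for i in range(len(tokens)):
    tok = int((residual / scale ** i).argmax())
    decoded.append(tok)
    residual = residual - vocab[tok] * scale ** i

assert decoded == tokens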

Updating CLIP to Prefer Descriptions Over Captions

1 code implementation • 12 Jun 2024 • Amir Zur, Elisa Kreiss, Karel D'Oosterlinck, Christopher Potts, Atticus Geiger

This model correlates with the judgements of blind and low-vision people while preserving transfer capabilities and has interpretable structure that sheds light on the caption–description distinction.

parameter-efficient fine-tuning

ReFT: Representation Finetuning for Language Models

2 code implementations • 4 Apr 2024 • Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, Christopher Potts

We define a strong instance of the ReFT family, Low-rank Linear Subspace ReFT (LoReFT), and we identify an ablation of this method that trades some performance for increased efficiency.

Arithmetic Reasoning
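
As a rough sketch of the low-rank linear subspace intervention behind LoReFT as I understand it, applied to a hidden state h as h <- h + R^T(Wh + b - Rh): the rank, dimensions, and the fact that R is only orthonormal at initialization here are simplifications on my part.

import torch
import torch.nn as nn

class LoReFTSketch(nn.Module):
    # Edits only a learned rank-r subspace of a d_model-dimensional hidden
    # state: the component of h in the row space of R is replaced by a learned
    # target Wh + b, while the orthogonal complement is left untouched.
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        # R has orthonormal rows at initialization only (a simplification).
        self.R = nn.Parameter(torch.linalg.qr(torch.randn(d_model, rank))[0].T)
        self.proj = nn.Linear(d_model, rank)   # learned target Wh + b

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + (self.proj(h) - h @ self.R.T) @ self.R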

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

3 code implementations • 12 Mar 2024 • Zhengxuan Wu, Atticus Geiger, Aryaman Arora, Jing Huang, Zheng Wang, Noah D. Goodman, Christopher D. Manning, Christopher Potts

Interventions on model-internal states are fundamental operations in many areas of AI, including model editing, steering, robustness, and interpretability.

Model Editing
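
The kind of model-internal intervention such a library standardizes can be illustrated with a plain PyTorch forward hook. The snippet below is not the pyvene API; the model, layer index, and replacement value are arbitrary choices for illustration.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def overwrite_position(layer_idx: int, position: int, new_value: torch.Tensor):
    # Forward hook that overwrites one token position's hidden state at the
    # output of one transformer block (GPT-2 blocks return a tuple).
    def hook(module, inputs, output):
        output[0][:, position, :] = new_value
        return output
    return model.transformer.h[layer_idx].register_forward_hook(hook)

inputs = tokenizer("The capital of France is", return_tensors="pt")
handle = overwrite_position(6, -1, torch.zeros(model.config.n_embd))  # layer and value assumed
with torch.no_grad():
    intervened_logits = model(**inputs).logits
handle.remove()   # remove the hook so later runs are unmodified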

A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments

1 code implementation • 23 Jan 2024 • Zhengxuan Wu, Atticus Geiger, Jing Huang, Aryaman Arora, Thomas Icard, Christopher Potts, Noah D. Goodman

We respond to the recent paper by Makelov et al. (2023), which reviews subspace interchange intervention methods like distributed alignment search (DAS; Geiger et al. 2023) and claims that these methods potentially cause "interpretability illusions".

Linear Representations of Sentiment in Large Language Models

1 code implementation • 23 Oct 2023 • Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda

Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs).

Zero-Shot Learning
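
One simple way to probe for such a linear representation is a difference-of-means direction over activations from positive and negative inputs. This is a generic illustration, not necessarily the exact procedure used in the paper, and the activation tensors are assumed to be collected beforehand.

import torch

def sentiment_direction(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    # pos_acts / neg_acts: (num_examples, d_model) activations from positive
    # and negative inputs at some layer (assumed to be gathered in advance).
    direction = pos_acts.mean(0) - neg_acts.mean(0)
    return direction / direction.norm()

def sentiment_score(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Projection onto the candidate direction: if sentiment is represented
    # linearly at this layer, higher scores should track more positive inputs.
    return acts @ direction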

Rigorously Assessing Natural Language Explanations of Neurons

no code implementations • 19 Sep 2023 • Jing Huang, Atticus Geiger, Karel D'Oosterlinck, Zhengxuan Wu, Christopher Potts

Natural language is an appealing medium for explaining how large language models process and store information, but evaluating the faithfulness of such explanations is challenging.

ScoNe: Benchmarking Negation Reasoning in Language Models With Fine-Tuning and In-Context Learning

1 code implementation • 30 May 2023 • Jingyuan Selena She, Christopher Potts, Samuel R. Bowman, Atticus Geiger

For in-context learning, we test InstructGPT models and find that most prompt strategies are not successful, including those using step-by-step reasoning.

Benchmarking • In-Context Learning • +3

Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

1 code implementation • NeurIPS 2023 • Zhengxuan Wu, Atticus Geiger, Thomas Icard, Christopher Potts, Noah D. Goodman

With Boundless DAS, we discover that Alpaca solves a simple numerical reasoning task by implementing a causal model with two interpretable boolean variables.
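
A sketch of the kind of two-boolean-variable high-level causal model referred to here, for a "say yes if the amount is between LOW and HIGH" style instruction; the exact task wording and variable names are my assumptions.

def high_level_model(amount: float, low: float, high: float) -> str:
    above_low = amount >= low       # interpretable boolean variable 1
    below_high = amount <= high     # interpretable boolean variable 2
    return "yes" if above_low and below_high else "no"

def interchange_at_high_level(base: float, source: float, low: float, high: float) -> str:
    # Counterfactual: variable 1 takes the value it has on the source input,
    # while variable 2 is still computed on the base input. Alignment search
    # asks whether some region of the network behaves the same way under swaps.
    above_low = source >= low
    below_high = base <= high
    return "yes" if above_low and below_high else "no"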

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations

1 code implementation • 5 Mar 2023 • Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, Noah D. Goodman

In DAS, we find the alignment between high-level and low-level models using gradient descent rather than conducting a brute-force search, and we allow individual neurons to play multiple distinct roles by analyzing representations in non-standard bases (distributed representations).

Explainable artificial intelligence
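
A sketch of the distributed interchange intervention at the heart of DAS: rotate a hidden state with a learned orthogonal matrix, swap a small block of rotated coordinates from a source run into a base run, and rotate back; the rotation is then trained by gradient descent (not shown) so the intervened network matches the high-level model's counterfactuals. The dimensions and subspace size are assumptions.

import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class RotatedSubspaceSwap(nn.Module):
    def __init__(self, d_model: int, k: int):
        super().__init__()
        # Learned orthogonal change of basis; the first k rotated coordinates
        # are the candidate distributed representation of a high-level variable.
        self.rotation = orthogonal(nn.Linear(d_model, d_model, bias=False))
        self.k = k

    def forward(self, base_h: torch.Tensor, source_h: torch.Tensor) -> torch.Tensor:
        R = self.rotation.weight
        base_rot, source_rot = base_h @ R.T, source_h @ R.T
        mixed = base_rot.clone()
        mixed[..., : self.k] = source_rot[..., : self.k]   # interchange the learned subspace
        return mixed @ R                                    # back to the original basis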

Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability

no code implementations • 11 Jan 2023 • Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, Thomas Icard

Causal abstraction provides a theoretical foundation for mechanistic interpretability, the field concerned with providing intelligible algorithms that are faithful simplifications of the known but opaque low-level details of black-box AI models.

Explainable Artificial Intelligence (XAI)

Causal Abstraction with Soft Interventions

no code implementations • 22 Nov 2022 • Riccardo Massidda, Atticus Geiger, Thomas Icard, Davide Bacciu

Causal abstraction provides a theory describing how several causal models can represent the same system at different levels of detail.

Inducing Causal Structure for Interpretable Neural Networks

2 code implementations • 1 Dec 2021 • Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah D. Goodman, Christopher Potts

In IIT, we (1) align variables in a causal model (e.g., a deterministic program or Bayesian network) with representations in a neural model and (2) train the neural model to match the counterfactual behavior of the causal model on a base input when the aligned representations in both models are set to the values they would take for a source input.

counterfactual • Data Augmentation • +1
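
A sketch of a single IIT training step under the assumptions spelled out in the comments; `model`, `causal_model`, and `cache_aligned` are hypothetical stand-ins for the aligned neural model, the high-level causal model, and an activation cache, not functions from the paper's released code.

import torch.nn.functional as F

def iit_step(model, causal_model, cache_aligned, base, source, idx, optimizer):
    # Assumed interfaces (hypothetical):
    #   causal_model(base, source) -> label the high-level model outputs on
    #       `base` after setting the aligned variable to its value on `source`;
    #   cache_aligned(x) -> the aligned slice of the neural representation on x;
    #   model(x, swap=(acts, idx)) -> forward pass with that slice overwritten.
    counterfactual_label = causal_model(base, source)
    source_acts = cache_aligned(source)
    logits = model(base, swap=(source_acts, idx))

    # Train the neural model to match the causal model's counterfactual behavior.
    loss = F.cross_entropy(logits, counterfactual_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()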

Causal Abstractions of Neural Networks

1 code implementation • NeurIPS 2021 • Atticus Geiger, Hanson Lu, Thomas Icard, Christopher Potts

Structural analysis methods (e.g., probing and feature attribution) are increasingly important tools for neural network analysis.

Natural Language Inference

DynaSent: A Dynamic Benchmark for Sentiment Analysis

1 code implementation • ACL 2021 • Christopher Potts, Zhengxuan Wu, Atticus Geiger, Douwe Kiela

We introduce DynaSent ('Dynamic Sentiment'), a new English-language benchmark task for ternary (positive/negative/neutral) sentiment analysis.

Sentiment Analysis

Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation

1 code implementation • EMNLP (BlackboxNLP) 2020 • Atticus Geiger, Kyle Richardson, Christopher Potts

We address whether neural models for Natural Language Inference (NLI) can learn the compositional interactions between lexical entailment and negation, using four methods: the behavioral evaluation methods of (1) challenge test sets and (2) systematic generalization tasks, and the structural evaluation methods of (3) probes and (4) interventions.

Lexical Entailment • Natural Language Inference • +2

Stress-Testing Neural Models of Natural Language Inference with Multiply-Quantified Sentences

no code implementations • 30 Oct 2018 • Atticus Geiger, Ignacio Cases, Lauri Karttunen, Christopher Potts

Standard evaluations of deep learning models for semantics using naturalistic corpora are limited in what they can tell us about the fidelity of the learned representations, because the corpora rarely come with good measures of semantic complexity.

Natural Language Inference
