TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Interpretability Techniques for Deep Learning	CausalGym	DAS	Log odds-ratio (pythia-6.9b)	9.95	# 1
Interpretability Techniques for Deep Learning	CausalGym	Random	Log odds-ratio (pythia-6.9b)	0.01	# 7
Interpretability Techniques for Deep Learning	CausalGym	LDA	Log odds-ratio (pythia-6.9b)	0.27	# 6
Interpretability Techniques for Deep Learning	CausalGym	k-means	Log odds-ratio (pythia-6.9b)	1.87	# 4
Interpretability Techniques for Deep Learning	CausalGym	PCA	Log odds-ratio (pythia-6.9b)	1.81	# 5
Interpretability Techniques for Deep Learning	CausalGym	Difference-in-means	Log odds-ratio (pythia-6.9b)	2.91	# 3
Interpretability Techniques for Deep Learning	CausalGym	Linear probe	Log odds-ratio (pythia-6.9b)	3.42	# 2

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

19 Feb 2024 · Aryaman Arora, Dan Jurafsky, Christopher Potts

Language models (LMs) have proven to be powerful tools for psycholinguistic research, but most prior work has focused on purely behavioural measures (e.g., surprisal comparisons). At the same time, research in model interpretability has begun to illuminate the abstract causal mechanisms shaping LM behaviour. To help bring these strands of research closer together, we introduce CausalGym. We adapt and expand the SyntaxGym suite of tasks to benchmark the ability of interpretability methods to causally affect model behaviour. To illustrate how CausalGym can be used, we study the pythia models (14M–6.9B) and assess the causal efficacy of a wide range of interpretability methods, including linear probing and distributed alignment search (DAS). We find that DAS outperforms the other methods, and so we use it to study the learning trajectory of two difficult linguistic phenomena in pythia-1b: negative polarity item licensing and filler–gap dependencies. Our analysis shows that the mechanisms implementing both of these tasks are learned in discrete stages, not gradually.
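The results on this page are reported as a log odds-ratio, which the page itself does not define. As a rough illustration only, the sketch below assumes one plausible reading: for each example, compare how strongly the unmodified model prefers the base-label continuation with how strongly the intervened model prefers the source-label continuation, and take the log of the product. The function names, argument names, and the probabilities used are all hypothetical; the paper and the official repository remain the reference for the actual definition.

```python
import math


def log_odds_ratio(p_base_before, p_src_before, p_base_after, p_src_after):
    """Log odds-ratio of an intervention on a single example (assumed definition).

    p_base_before, p_src_before: probabilities the unmodified model assigns to
        the base-label and source-label continuations on the base input.
    p_base_after, p_src_after: the same two probabilities after the intervention.
    """
    # How strongly the model preferred the base label before the intervention.
    odds_before = p_base_before / p_src_before
    # How strongly the model prefers the source label after the intervention.
    odds_after = p_src_after / p_base_after
    return math.log(odds_before * odds_after)


def mean_log_odds_ratio(examples):
    """Average the per-example log odds-ratio over an evaluation set."""
    return sum(log_odds_ratio(*ex) for ex in examples) / len(examples)


# Hypothetical probabilities, chosen only to show the two extremes.
no_effect = log_odds_ratio(0.9, 0.1, 0.9, 0.1)  # close to 0: prediction unchanged
full_flip = log_odds_ratio(0.9, 0.1, 0.1, 0.9)  # close to log(81): prediction flipped
```

Under this reading, a value near zero means the intervention left the prediction unchanged, and larger values mean it moved the model further towards the source label.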

Code

aryamanarora/causalgym official

Tasks

Benchmarking

Interpretability Techniques for Deep Learning

Datasets

Introduced in the Paper:

CausalGym

Results from the Paper

Ranked #1 on Interpretability Techniques for Deep Learning on CausalGym

Methods

Pythia
