Hungry Hungry Hippos: Towards Language Modeling with State Space Models

28 Dec 2022 · Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré

State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in some modalities, but underperform attention in language modeling. Moreover, despite scaling nearly linearly in sequence length instead of quadratically, SSMs are still slower than Transformers due to poor hardware utilization. In this paper, we make progress on understanding the expressivity gap between SSMs and attention in language modeling, and on reducing the hardware barrier between SSMs and attention. First, we use synthetic language modeling tasks to understand the gap between SSMs and attention. We find that existing SSMs struggle with two capabilities: recalling earlier tokens in the sequence and comparing tokens across the sequence. To understand the impact on language modeling, we propose a new SSM layer, H3, that is explicitly designed for these abilities. H3 matches attention on the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid 125M-parameter H3-attention model that retains two attention layers surprisingly outperforms Transformers on OpenWebText by 1.0 PPL. Next, to improve the efficiency of training SSMs on modern hardware, we propose FlashConv. FlashConv uses a fused block FFT algorithm to improve efficiency on sequences up to 8K, and introduces a novel state passing algorithm that exploits the recurrent properties of SSMs to scale to longer sequences. FlashConv yields 2$\times$ speedup on the long-range arena benchmark and allows hybrid language models to generate text 2.4$\times$ faster than Transformers. Using FlashConv, we scale hybrid H3-attention language models up to 2.7B parameters on the Pile and find promising initial results, achieving lower perplexity than Transformers and outperforming Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark.
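
To make the abstract's claims concrete: the H3 layer combines a shift SSM (which lets the model recall earlier tokens) with a diagonal SSM and two multiplicative interactions between projections of the input (which let it compare tokens across the sequence), and each SSM is applied as a long convolution computed with FFTs, which is the source of the near-linear scaling that FlashConv's fused block-FFT kernel accelerates. The following is a minimal sketch of that structure, not the authors' implementation; the projection matrices and convolution kernels are hypothetical stand-ins for the learned SSM parameterizations, and heads, normalization, and FlashConv's fused kernels are omitted.

```python
# Illustrative sketch of the H3 computation described in the paper:
# K goes through a "shift" SSM (recall of recent tokens), the result gates V,
# a diagonal SSM aggregates over the sequence, and Q gates the output
# (comparison of tokens). Both SSMs are applied as long convolutions via FFT.
import torch


def fft_conv(u, k):
    """Causal convolution of u (B, N, D) with a kernel k (N, D) via FFT."""
    N = u.shape[1]
    fft_len = 2 * N  # zero-pad so the circular FFT convolution is linear/causal
    u_f = torch.fft.rfft(u, n=fft_len, dim=1)
    k_f = torch.fft.rfft(k, n=fft_len, dim=0)
    return torch.fft.irfft(u_f * k_f, n=fft_len, dim=1)[:, :N]


def h3_sketch(x, Wq, Wk, Wv, k_shift, k_diag):
    """x: (B, N, D); Wq/Wk/Wv: (D, D); k_shift/k_diag: (N, D) SSM kernels.
    Parameter names are hypothetical; in the paper the kernels are produced
    by learned shift and diagonal state space models."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k = fft_conv(k, k_shift)      # shift SSM: remembers earlier tokens
    y = fft_conv(k * v, k_diag)   # diagonal SSM over the gated values
    return q * y                  # multiplicative gate by Q: token comparison
```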


Results from the Paper


Ranked #2 on Language Modelling on The Pile (Test perplexity metric)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|------|---------|-------|-------------|--------------|-------------|
| Question Answering | BoolQ | Hybrid H3 2.7B (3-shot, logit scoring) | Accuracy | 60.6 | # 49 |
| Question Answering | BoolQ | Hybrid H3 125M (3-shot, rank classification) | Accuracy | 56.1 | # 55 |
| Question Answering | BoolQ | Hybrid H3 125M (0-shot, logit scoring) | Accuracy | 59.6 | # 53 |
| Question Answering | BoolQ | Hybrid H3 1.3B (0-shot, logit scoring) | Accuracy | 61.7 | # 47 |
| Question Answering | BoolQ | Hybrid H3 125M (3-shot, logit scoring) | Accuracy | 56.1 | # 55 |
| Question Answering | COPA | Hybrid H3 2.7B (3-shot, logit scoring) | Accuracy | 77 | # 41 |
| Question Answering | COPA | Hybrid H3 125M (0-shot, rank classification) | Accuracy | 67 | # 51 |
| Question Answering | COPA | Hybrid H3 2.7B (0-shot, logit scoring) | Accuracy | 81 | # 35 |
| Question Answering | COPA | H3 125M (0-shot, rank classification) | Accuracy | 51 | # 59 |
| Question Answering | COPA | Hybrid H3 125M (0-shot, logit scoring) | Accuracy | 67 | # 51 |
| Long-range modeling | LRA | H3 | ListOps | 57.5 | # 13 |
| Long-range modeling | LRA | H3 | Text | 88.2 | # 8 |
| Long-range modeling | LRA | H3 | Retrieval | 91.0 | # 7 |
| Long-range modeling | LRA | H3 | Image | 87.3 | # 9 |
| Long-range modeling | LRA | H3 | Pathfinder | 93.0 | # 10 |
| Long-range modeling | LRA | H3 | Avg | 84.8 | # 10 |
| Long-range modeling | LRA | H3 | Pathfinder-X | 91.8 | # 10 |
| Question Answering | MultiRC | Hybrid H3 125M (3-shot, logit scoring) | EM | 48.9 | # 9 |
| Question Answering | MultiRC | Hybrid H3 355M (3-shot, logit scoring) | EM | 59.7 | # 6 |
| Question Answering | MultiRC | Hybrid H3 355M (0-shot, logit scoring) | EM | 59.5 | # 7 |
| Question Answering | MultiRC | Hybrid H3 125M (0-shot, logit scoring) | EM | 51.4 | # 8 |
| Natural Language Inference | RTE | H3 125M (0-shot, rank classification) | Accuracy | 53.1% | # 88 |
| Natural Language Inference | RTE | Hybrid H3 125M (3-shot, logit scoring) | Accuracy | 58.1% | # 76 |
| Natural Language Inference | RTE | H3 125M (3-shot, rank classification) | Accuracy | 52.3% | # 89 |
| Natural Language Inference | RTE | Hybrid H3 125M (3-shot, rank classification) | Accuracy | 58.1% | # 76 |
| Natural Language Inference | RTE | Hybrid H3 125M (0-shot, logit scoring) | Accuracy | 59.2% | # 73 |
| Language Modelling | The Pile | Transformer 125M | Test perplexity | 10.7 | # 4 |
| Language Modelling | The Pile | Hybrid H3 125M | Test perplexity | 10.2 | # 2 |
| Language Modelling | WikiText-103 | Hybrid H3 (355M) | Test perplexity | 16.9 | # 18 |
| Language Modelling | WikiText-103 | Hybrid H3 (355M) | Number of params | 355M | # 10 |
| Language Modelling | WikiText-103 | Hybrid H3 (125M) | Test perplexity | 23.7 | # 53 |
| Language Modelling | WikiText-103 | Hybrid H3 (125M) | Number of params | 125M | # 37 |
| Language Modelling | WikiText-103 | Hybrid H3 (1.3B) | Test perplexity | 12.5 | # 6 |
| Language Modelling | WikiText-103 | Hybrid H3 (1.3B) | Number of params | 1300M | # 7 |
| Language Modelling | WikiText-103 | Hybrid H3 125M | Test perplexity | 18.5 | # 37 |
| Language Modelling | WikiText-103 | Hybrid H3 (2.7B) | Test perplexity | 10.6 | # 2 |
| Language Modelling | WikiText-103 | Hybrid H3 (2.7B) | Number of params | 2700M | # 5 |
| Coreference Resolution | Winograd Schema Challenge | Hybrid H3 125M (3-shot, logit scoring) | Accuracy | 43.3 | # 78 |
| Coreference Resolution | Winograd Schema Challenge | H3 125M (3-shot, rank classification) | Accuracy | 63.5 | # 46 |
| Coreference Resolution | Winograd Schema Challenge | H3 125M (0-shot, rank classification) | Accuracy | 61.5 | # 54 |
| Word Sense Disambiguation | Words in Context | Hybrid H3 125M (0-shot, logit scoring) | Accuracy | 51.4 | # 29 |
| Word Sense Disambiguation | Words in Context | Hybrid H3 125M (3-shot, logit scoring) | Accuracy | 49.1 | # 37 |
| Word Sense Disambiguation | Words in Context | Hybrid H3 125M (0-shot, rank classification) | Accuracy | 51.4 | # 29 |
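
A note on the model labels above: "logit scoring" and "rank classification" refer to two likelihood-based ways of turning a language model into a zero-/few-shot classifier, both of which pick the candidate answer the model assigns the highest probability rather than generating free-form text. The sketch below illustrates the general idea only; it assumes a Hugging Face-style `model`/`tokenizer` interface (hypothetical here) and is not the paper's exact evaluation protocol, whose prompt templates and scoring details may differ.

```python
# Illustrative only: likelihood-based candidate ranking for zero-/few-shot
# classification, in the spirit of the "rank classification" setting above.
import torch
import torch.nn.functional as F


def score_candidate(model, tokenizer, prompt, candidate):
    """Sum of log-probabilities the model assigns to `candidate` given `prompt`.
    Assumes tokenizing `prompt` and `prompt + candidate` yields a consistent prefix."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits            # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]                      # next-token targets
    token_scores = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    # Count only the tokens belonging to the candidate continuation.
    return token_scores[0, n_prompt - 1:].sum().item()


def classify(model, tokenizer, prompt, candidates):
    """Return the candidate answer with the highest log-likelihood."""
    scores = [score_candidate(model, tokenizer, prompt, c) for c in candidates]
    return candidates[max(range(len(scores)), key=scores.__getitem__)]
```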

Methods


No methods listed for this paper.