TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Question Answering	COPA	Base Layers 10B (0-shot)	Accuracy	63	# 55
Question Answering	COPA	HASH Layers 10B (0-shot)	Accuracy	64	# 54
Question Answering	COPA	sMLP – deterministic 9.4B (0-shot)	Accuracy	79	# 38
Question Answering	COPA	Switch Transformer 9B	Accuracy	75	# 45
Question Answering	COPA	GShard 9B	Accuracy	76	# 44
Sentence Completion	HellaSwag	HASH Layers 10B (0-shot)	Accuracy	33	# 79
Sentence Completion	HellaSwag	GShard 9B	Accuracy	38	# 75
Sentence Completion	HellaSwag	Switch Transformer 9B	Accuracy	52.5	# 60
Sentence Completion	HellaSwag	sMLP – deterministic 9.4B (0-shot)	Accuracy	54.5	# 59
Sentence Completion	HellaSwag	Base Layers 10B (0-shot)	Accuracy	30.2	# 83
Question Answering	PIQA	sMLP – deterministic 9.4B (0-shot)	Accuracy	73	# 46
Question Answering	PIQA	GShard 9B	Accuracy	68.1	# 54
Question Answering	PIQA	HASH Layers 10B (0-shot)	Accuracy	63.8	# 58
Question Answering	PIQA	Base Layers 10B (0-shot)	Accuracy	63.8	# 58
Common Sense Reasoning	ReCoRD	Base Layers 10B (0-shot)	EM	60.7	# 30
Common Sense Reasoning	ReCoRD	GShard 9B	EM	72.4	# 24
Common Sense Reasoning	ReCoRD	sMLP – deterministic 9.4B (0-shot)	EM	73.4	# 22
Common Sense Reasoning	ReCoRD	Switch Transformer 9B	EM	79.9	# 19
Common Sense Reasoning	ReCoRD	HASH Layers 10B (0-shot)	EM	67.2	# 28
Question Answering	StoryCloze	Switch Transformer 9B	Accuracy	73.3	# 18
Question Answering	StoryCloze	sMLP – deterministic 9.4B (0-shot)	Accuracy	74.7	# 17
Question Answering	StoryCloze	Base Layers 10B (0-shot)	Accuracy	61.4	# 22
Question Answering	StoryCloze	HASH Layers 10B (0-shot)	Accuracy	64.7	# 21
Question Answering	StoryCloze	GShard 9B	Accuracy	67.9	# 20
Common Sense Reasoning	WinoGrande	Base Layers 10B (0-shot)	Accuracy	51	# 71
Common Sense Reasoning	WinoGrande	Switch Transformer 9B (0-shot)	Accuracy	53.4	# 65
Common Sense Reasoning	WinoGrande	GShard 9B (0-shot)	Accuracy	51.1	# 70
Common Sense Reasoning	WinoGrande	sMLP – deterministic 9.4B (0-shot)	Accuracy	54.3	# 64
Common Sense Reasoning	WinoGrande	HASH Layers 10B (0-shot)	Accuracy	51.7	# 69

Efficient Language Modeling with Sparse all-MLP

14 Mar 2022 · Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li

All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work like gMLP shows that all-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks. In this work, we analyze the limitations of MLPs in expressiveness, and propose sparsely activated MLPs with mixture-of-experts (MoEs) in both feature and input (token) dimensions. Such sparse all-MLPs significantly increase model capacity and expressiveness while keeping the compute constant. We address critical challenges in incorporating conditional computation with two routing strategies. The proposed sparse all-MLP improves language modeling perplexity and obtains up to 2$\times$ improvement in training efficiency compared to both Transformer-based MoEs (GShard, Switch Transformer, Base Layers and HASH Layers) as well as dense Transformers and all-MLPs. Finally, we evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
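The claim that sparse activation raises capacity at constant compute can be illustrated with a small sketch. The code below is generic top-1 token routing over feed-forward experts, not the paper's sMLP or either of its two routing strategies, which the abstract does not detail. All names (`make_expert`, `top1_route`, `sparse_mlp`, `param_count`, `flops_per_token`) are hypothetical, the ReLU activation is chosen only for simplicity, and the router cost and gate scaling used in real MoE layers are left out.

```python
import random


def make_expert(d_in, d_hidden, rng):
    """One feed-forward expert: two weight matrices stored as nested lists."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_in)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_hidden)]
    return w1, w2


def matvec(x, w):
    """Row vector x (length n) times matrix w (n x m)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def expert_forward(x, expert):
    """Apply one expert: linear, ReLU (an assumption), linear."""
    w1, w2 = expert
    hidden = [max(0.0, v) for v in matvec(x, w1)]
    return matvec(hidden, w2)


def top1_route(x, router):
    """Return the index of the expert with the largest router logit."""
    logits = matvec(x, router)
    return max(range(len(logits)), key=logits.__getitem__)


def sparse_mlp(tokens, experts, router):
    """Send each token through exactly one expert chosen by the router."""
    return [expert_forward(x, experts[top1_route(x, router)]) for x in tokens]


def param_count(experts):
    """Total number of expert weights (router weights excluded)."""
    return sum(len(w1) * len(w1[0]) + len(w2) * len(w2[0]) for w1, w2 in experts)


def flops_per_token(d_in, d_hidden):
    """Multiply-adds one token spends in a single expert (router ignored)."""
    return 2 * d_in * d_hidden


if __name__ == "__main__":
    rng = random.Random(0)
    d_in, d_hidden, num_experts = 4, 8, 4
    experts = [make_expert(d_in, d_hidden, rng) for _ in range(num_experts)]
    router = [[rng.gauss(0, 1) for _ in range(num_experts)] for _ in range(d_in)]
    tokens = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(3)]

    outputs = sparse_mlp(tokens, experts, router)
    print("dense parameters: ", param_count(experts[:1]))
    print("sparse parameters:", param_count(experts))
    print("multiply-adds per token in either case:", flops_per_token(d_in, d_hidden))
```

Under these assumptions, going from one expert to four multiplies the expert parameter count by four, while each token still passes through a single expert, so the per-token expert compute is unchanged.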

Code

No code implementations yet.

Tasks

Common Sense Reasoning

In-Context Learning

Language Modelling

Question Answering

Sentence Completion

Zero-Shot Learning

Datasets

HellaSwag

PIQA

WinoGrande

COPA

ReCoRD

CC100

StoryCloze

Results from the Paper

Ranked #17 on Question Answering on StoryCloze

Methods

Absolute Position Encodings • Adam • BASE • BPE • Dense Connections • Dropout • GELU • gMLP • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Spatial Gating Unit • Switch FFN • Switch Transformer • Transformer
