TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Common Sense Reasoning	ARC (Challenge)	GLaM 64B/64E (1 shot)	Accuracy	48.2	# 33
Common Sense Reasoning	ARC (Challenge)	GLaM 64B/64E (0 shot)	Accuracy	50.3	# 30
Common Sense Reasoning	ARC (Easy)	GLaM (64B/64E) (5-shot)	Accuracy	74.8	# 20
Common Sense Reasoning	ARC (Easy)	GLaM 64B/64E (0-shot)	Accuracy	68.0	# 36
Language Modelling	LAMBADA	GLaM 62B/64E (One-Shot)	Accuracy	80.9	# 10
Question Answering	Natural Questions	GLaM 62B/64E (One-Shot)	EM	26.3	# 31
Question Answering	Natural Questions	GLaM 62B/64E (Zero-Shot)	EM	24.7	# 35
Question Answering	Natural Questions	GLaM 62B/64E (Few-Shot)	EM	32.5	# 24
Question Answering	TriviaQA	GLaM 62B/64E (Few-shot)	EM	75.8	# 13
Question Answering	TriviaQA	GLaM 62B/64E (Zero-shot)	EM	71.3	# 22
Question Answering	TriviaQA	GLaM 62B/64E (One-shot)	EM	75.8	# 13
Question Answering	WebQuestions	GLaM 62B/64E (Zero-Shot)	EM	15.5	# 16

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glam-efficient-scaling-of-language-models/language-modelling-on-lambada)](https://paperswithcode.com/sota/language-modelling-on-lambada?p=glam-efficient-scaling-of-language-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glam-efficient-scaling-of-language-models/question-answering-on-triviaqa)](https://paperswithcode.com/sota/question-answering-on-triviaqa?p=glam-efficient-scaling-of-language-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glam-efficient-scaling-of-language-models/question-answering-on-webquestions)](https://paperswithcode.com/sota/question-answering-on-webquestions?p=glam-efficient-scaling-of-language-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glam-efficient-scaling-of-language-models/common-sense-reasoning-on-arc-easy)](https://paperswithcode.com/sota/common-sense-reasoning-on-arc-easy?p=glam-efficient-scaling-of-language-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glam-efficient-scaling-of-language-models/question-answering-on-natural-questions)](https://paperswithcode.com/sota/question-answering-on-natural-questions?p=glam-efficient-scaling-of-language-models)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/glam-efficient-scaling-of-language-models/common-sense-reasoning-on-arc-challenge)](https://paperswithcode.com/sota/common-sense-reasoning-on-arc-challenge?p=glam-efficient-scaling-of-language-models)`

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

13 Dec 2021 · Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui

Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
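To make "sparsely activated mixture-of-experts" concrete, here is a minimal sketch of routing a single token through such a layer. The abstract does not state the routing rule, so the top-2 softmax gate, the toy "experts" that simply scale their input, and every name below (`moe_layer`, `gate_weights`, `make_expert`) are illustrative assumptions rather than the paper's implementation.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(token, gate_weights, experts, k=2):
    """Route one token to its top-k experts and mix their outputs.

    Assumption: a softmax gate with top-k selection; the abstract only says
    the model is sparsely activated, not how experts are chosen.
    """
    # One gate logit per expert: dot product of a gate row with the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    # Keep the k highest-scoring experts; the rest do no work for this token.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(token)
    for i in top:
        weight = probs[i] / norm  # renormalise over the selected experts
        expert_out = experts[i](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out, top


def make_expert(scale):
    """Hypothetical stand-in for a feed-forward expert: scales its input."""
    return lambda v: [scale * x for x in v]


random.seed(0)
dim, num_experts = 4, 8
experts = [make_expert(s) for s in range(1, num_experts + 1)]
gate_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
token = [0.5, -1.0, 0.25, 2.0]

output, chosen = moe_layer(token, gate_weights, experts, k=2)
print("experts used:", chosen, "of", num_experts)
```

Only 2 of the 8 experts run for this token, which is the sense in which total parameter count can grow without a matching growth in per-token computation.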

Code

No code implementations yet.

Tasks

Common Sense Reasoning

In-Context Learning

Language Modelling

Question Answering

Datasets

Natural Questions

TriviaQA

SuperGLUE

DROP

WebQuestions

LAMBADA

ARC (AI2 Reasoning Challenge)

Results from the Paper

Ranked #10 on Language Modelling on LAMBADA

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Common Sense Reasoning	ARC (Challenge)	GLaM 64B/64E (1 shot)	Accuracy	48.2	# 33
Common Sense Reasoning	ARC (Challenge)	GLaM 64B/64E (0 shot)	Accuracy	50.3	# 30
Common Sense Reasoning	ARC (Easy)	GLaM (64B/64E) (5-shot)	Accuracy	74.8	# 20
Common Sense Reasoning	ARC (Easy)	GLaM 64B/64E (0-shot)	Accuracy	68.0	# 36
Language Modelling	LAMBADA	GLaM 62B/64E (One-Shot)	Accuracy	80.9	# 10
Question Answering	Natural Questions	GLaM 62B/64E (Few-Shot)	EM	32.5	# 24
Question Answering	TriviaQA	GLaM 62B/64E (Few-shot)	EM	75.8	# 13
Question Answering	TriviaQA	GLaM 62B/64E (Zero-shot)	EM	71.3	# 22
Question Answering	TriviaQA	GLaM 62B/64E (One-shot)	EM	75.8	# 13
Question Answering	WebQuestions	GLaM 62B/64E (Zero-Shot)	EM	15.5	# 16

Results from Other Papers

Task	Dataset	Model	Metric Name	Metric Value	Rank
Question Answering	Natural Questions	GLaM 62B/64E (One-Shot)	EM	26.3	# 31
Question Answering	Natural Questions	GLaM 62B/64E (Zero-Shot)	EM	24.7	# 35

Methods

Adam • Attention Dropout • BPE • Cosine Annealing • Dense Connections • Dropout • Fixed Factorized Attention • GELU • GPT-3 • Layer Normalization • Linear Layer • Linear Warmup With Cosine Annealing • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Strided Attention • Weight Decay
