Primer: Searching for Efficient Transformers for Language Modeling

17 Sep 2021 · David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le

Large Transformer models have been central to recent advances in natural language processing. The training and inference costs of these models, however, have grown rapidly and become prohibitively expensive. Here we aim to reduce the costs of Transformers by searching for a more efficient variant. Compared to previous approaches, our search is performed at a lower level, over the primitives that define a Transformer TensorFlow program. We identify an architecture, named Primer, that has a smaller training cost than the original Transformer and other variants for auto-regressive language modeling. Primer's improvements can be mostly attributed to two simple modifications: squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention. Experiments show Primer's gains over Transformer increase as compute scale grows and follow a power law with respect to quality at optimal model sizes. We also verify empirically that Primer can be dropped into different codebases to significantly speed up training without additional tuning. For example, at a 500M parameter size, Primer improves the original T5 architecture on C4 auto-regressive language modeling, reducing the training cost by 4X. Furthermore, the reduced training cost means Primer needs much less compute to reach a target one-shot performance. For instance, in a 1.9B parameter configuration similar to GPT-3 XL, Primer uses 1/3 of the training compute to achieve the same one-shot performance as Transformer. We open source our models and several comparisons in T5 to help with reproducibility.
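
For context, the sketch below illustrates the two modifications named in the abstract: the squared ReLU activation used in the feed-forward block, and a causal depthwise convolution applied per head after the Q, K, and V projections in self-attention. This is a minimal illustration, not the authors' released code; it assumes TensorFlow 2.x and static shapes, and the names `squared_relu` and `causal_depthwise_conv`, the kernel size of 3, and the toy tensor shapes are illustrative choices. The open-sourced models mentioned in the abstract remain the authoritative reference.

```python
import tensorflow as tf


def squared_relu(x):
    # Primer's feed-forward activation: ReLU followed by an element-wise square.
    return tf.square(tf.nn.relu(x))


def causal_depthwise_conv(x, kernel):
    # x:      [batch, length, heads, head_dim]
    # kernel: [kernel_size, heads, head_dim] -- one filter per channel
    #         (in a real model this would be a trainable variable).
    k = kernel.shape[0]
    length = x.shape[1]
    # Left-pad the sequence axis so position t only mixes positions <= t,
    # keeping the layer causal for auto-regressive language modeling.
    x_pad = tf.pad(x, [[0, 0], [k - 1, 0], [0, 0], [0, 0]])
    out = tf.zeros_like(x)
    for i in range(k):
        out += kernel[i] * x_pad[:, i:i + length, :, :]
    return out


# Illustrative usage: apply the convolution to each of Q, K, and V right
# after their linear projections inside self-attention.
batch, length, heads, head_dim = 2, 16, 8, 64
q = tf.random.normal([batch, length, heads, head_dim])
q_kernel = tf.random.normal([3, heads, head_dim]) * 0.02
q = causal_depthwise_conv(q, q_kernel)
```
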

Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Language Modelling | C4 | Original T5 | Steps | 1M | #1 |
| Language Modelling | C4 | Original T5 | TPUv3 Hours | 15.7K | #1 |
| Language Modelling | C4 | Original T5 | Perplexity | 13.25 | #4 |
| Language Modelling | C4 | T5++ | Steps | 1M | #1 |
| Language Modelling | C4 | T5++ | TPUv3 Hours | 16.5K | #1 |
| Language Modelling | C4 | T5++ | Perplexity | 12.69 | #3 |
| Language Modelling | C4 | Primer | Steps | 1M | #1 |
| Language Modelling | C4 | Primer | TPUv3 Hours | 17.3K | #1 |
| Language Modelling | C4 | Primer | Perplexity | 12.35 | #1 |
