TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Language Modelling	WikiText-103	Staged Training	Validation perplexity	16.89	# 10
Language Modelling	WikiText-103	Staged Training	Test perplexity	17.56	# 26
Language Modelling	WikiText-103	Staged Training	Number of params	247M	# 19
Language Modelling	WikiText-103	Shortformer	Validation perplexity	17.47	# 12
Language Modelling	WikiText-103	Shortformer	Test perplexity	18.15	# 31
Language Modelling	WikiText-103	Shortformer	Number of params	247M	# 19

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/shortformer-better-language-modeling-using/language-modelling-on-wikitext-103)](https://paperswithcode.com/sota/language-modelling-on-wikitext-103?p=shortformer-better-language-modeling-using)`

Shortformer: Better Language Modeling using Shorter Inputs

ACL 2021 · Ofir Press, Noah A. Smith, Mike Lewis

Increasing the input length has been a driver of progress in language modeling with transformers. We identify conditions where shorter inputs are not harmful, and achieve perplexity and efficiency improvements through two new methods that decrease input length. First, we show that initially training a model on short subsequences before moving on to longer ones both reduces overall training time and, surprisingly, substantially improves perplexity. Second, we show how to improve the efficiency of recurrence methods in transformers, which let models condition on previously processed tokens when generating sequences that exceed the maximal length the transformer can handle at once. Existing methods require computationally expensive relative position embeddings; we introduce a simple alternative of adding absolute position embeddings to queries and keys instead of to word embeddings, which efficiently produces superior results. We show that these recurrent models also benefit from short input lengths. Combining these techniques speeds up training by a factor of 1.65, reduces memory usage, and substantially improves perplexity on WikiText-103, without adding any parameters.
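The two ideas in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation (the official code is in ofirpress/shortformer). The function names, the identity projections, the sinusoidal embedding table, and the example lengths and switch epoch are all assumptions chosen for illustration. The sketch shows (1) a two-stage schedule that trains on short subsequences first and longer ones afterwards, and (2) a single-head causal attention step in which absolute position embeddings are added to the queries and keys but not to the values.

```python
import math


def staged_length(epoch, switch_epoch, short_len, long_len):
    """Hypothetical two-stage schedule: short inputs first, long inputs later."""
    return short_len if epoch < switch_epoch else long_len


def sinusoidal_positions(length, dim):
    """Absolute sinusoidal position embeddings, one vector per position."""
    table = []
    for pos in range(length):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def position_infused_attention(x, positions):
    """Single-head causal self-attention with identity projections (an
    assumption made for brevity). Queries and keys are built from
    x + positions; values are built from x alone, so the output carries
    no explicit position signal."""
    n, d = len(x), len(x[0])
    qk = [[xi + pi for xi, pi in zip(x_row, p_row)]
          for x_row, p_row in zip(x, positions)]
    out = []
    for t in range(n):
        # Causal mask: position t attends only to positions 0..t.
        scores = [sum(a * b for a, b in zip(qk[t], qk[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        out.append([sum(weights[s] * x[s][j] for s in range(t + 1))
                    for j in range(d)])
    return out


# Illustrative usage with made-up sizes (not the paper's settings).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pos = sinusoidal_positions(len(tokens), 2)
outputs = position_infused_attention(tokens, pos)
lengths = [staged_length(e, switch_epoch=2, short_len=64, long_len=512)
           for e in range(4)]
```

Because the values never receive position embeddings in this sketch, each output row is a weighted average of position-free token vectors, which is the property that makes it plausible to reuse previously computed representations at different offsets without recomputing them.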

Code

ofirpress/shortformer (official, 146 GitHub stars)

Tasks

Language Modelling

Position

Word Embeddings

Datasets

WikiText-2

WikiText-103

BookCorpus

Results from the Paper

Ranked #26 on Language Modelling on WikiText-103

Methods

GELU • Layer Normalization
