TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Language Modelling	Penn Treebank (Word Level)	AWD-LSTM + continuous cache pointer	Validation perplexity	53.9	# 13
Language Modelling	Penn Treebank (Word Level)	AWD-LSTM + continuous cache pointer	Test perplexity	52.8	# 17
Language Modelling	Penn Treebank (Word Level)	AWD-LSTM + continuous cache pointer	Params	24M	# 7
Language Modelling	Penn Treebank (Word Level)	AWD-LSTM	Validation perplexity	60.0	# 24
Language Modelling	Penn Treebank (Word Level)	AWD-LSTM	Test perplexity	57.3	# 30
Language Modelling	Penn Treebank (Word Level)	AWD-LSTM	Params	24M	# 7
Language Modelling	WikiText-2	AWD-LSTM	Validation perplexity	68.6	# 23
Language Modelling	WikiText-2	AWD-LSTM	Test perplexity	65.8	# 31
Language Modelling	WikiText-2	AWD-LSTM	Number of params	33M	# 23
Language Modelling	WikiText-2	AWD-LSTM + continuous cache pointer	Validation perplexity	53.8	# 11
Language Modelling	WikiText-2	AWD-LSTM + continuous cache pointer	Test perplexity	52.0	# 19
Language Modelling	WikiText-2	AWD-LSTM + continuous cache pointer	Number of params	33M	# 23

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/regularizing-and-optimizing-lstm-language/language-modelling-on-penn-treebank-word)](https://paperswithcode.com/sota/language-modelling-on-penn-treebank-word?p=regularizing-and-optimizing-lstm-language)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/regularizing-and-optimizing-lstm-language/language-modelling-on-wikitext-2)](https://paperswithcode.com/sota/language-modelling-on-wikitext-2?p=regularizing-and-optimizing-lstm-language)`

Regularizing and Optimizing LSTM Language Models

ICLR 2018 · Stephen Merity, Nitish Shirish Keskar, Richard Socher

Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modeling, and question answering. In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose the weight-dropped LSTM, which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization. Further, we introduce NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user. Using these and other regularization strategies, we achieve state-of-the-art word-level perplexities on two data sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the effectiveness of a neural cache in conjunction with our proposed model, we achieve an even lower state-of-the-art perplexity of 52.8 on Penn Treebank and 52.0 on WikiText-2.
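
The non-monotonic averaging trigger mentioned in the abstract can be illustrated with a small sketch. It assumes the trigger fires when the latest validation perplexity is worse than the best value recorded more than n checks earlier. The function name `nt_asgd_trigger` and the parameter name `n` are hypothetical, chosen here for illustration, and the sketch covers only the trigger decision, not the optimizer or the weight averaging.

```python
from typing import List, Optional, Sequence


def nt_asgd_trigger(val_perplexities: Sequence[float], n: int = 5) -> Optional[int]:
    """Return the index of the validation check at which averaging would start.

    Assumption: averaging is triggered at check t when t > n and the
    perplexity at t is higher than the minimum over checks 0 .. t-n-1.
    Returns None if the condition is never met.
    """
    logs: List[float] = []
    for t, v in enumerate(val_perplexities):
        # Compare against the best value seen more than n checks ago,
        # so short-lived fluctuations do not trigger averaging.
        if t > n and v > min(logs[: t - n]):
            return t
        logs.append(v)
    return None


if __name__ == "__main__":
    # Perplexity keeps improving: no trigger.
    print(nt_asgd_trigger([100, 90, 80, 70, 60, 50, 40], n=2))
    # Perplexity stalls above an earlier best: trigger at the last check.
    print(nt_asgd_trigger([100, 90, 80, 85, 86, 87], n=2))
```

In a training loop, plain SGD would run until this function returns an index, after which the weights would be averaged from that point on. The choice of n here is illustrative only.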

Code

Repository	Stars
salesforce/awd-lstm-lm (official)	1,956
google-research/google-research	32,745
fastai/fastai	25,572
dmlc/gluon-nlp	2,548
castorini/hedwig	584

See all 47 implementations

Tasks

Language Modelling

Translation

Datasets

Penn Treebank

WikiText-2

Results from the Paper

Ranked #17 on Language Modelling on Penn Treebank (Word Level)

Methods

Activation Regularization • AWD-LSTM • DropConnect • Dropout • Embedding Dropout • LSTM • Neural Cache • NT-ASGD • Sigmoid Activation • Tanh Activation • Temporal Activation Regularization • Variational Dropout • Weight Tying
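
As a rough illustration of the DropConnect entry above, the sketch below applies a dropout mask to a hidden-to-hidden weight matrix rather than to activations, and reuses that single mask for every timestep of a sequence. It is a toy linear recurrence in plain Python, not an LSTM and not the authors' implementation. The names `drop_connect` and `run_recurrence` are hypothetical, and rescaling the surviving weights by 1/(1-p) is an assumption borrowed from the common inverted-dropout convention.

```python
import random
from typing import List

Matrix = List[List[float]]


def drop_connect(weights: Matrix, p: float, rng: random.Random) -> Matrix:
    """Zero each weight independently with probability p.

    Assumption: surviving weights are rescaled by 1 / (1 - p).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [[w / keep if rng.random() >= p else 0.0 for w in row] for row in weights]


def run_recurrence(u: Matrix, inputs: List[List[float]], p: float, seed: int = 0) -> List[float]:
    """Toy recurrence h_t = U_dropped @ h_{t-1} + x_t (no gates, no nonlinearity)."""
    rng = random.Random(seed)
    # The mask is sampled once and the same dropped matrix is reused
    # at every timestep of the sequence.
    u_dropped = drop_connect(u, p, rng)
    h = [0.0] * len(u)
    for x in inputs:
        h = [sum(w * hj for w, hj in zip(row, h)) + xi for row, xi in zip(u_dropped, x)]
    return h


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(run_recurrence(identity, [[1.0, 1.0], [2.0, 2.0]], p=0.0))
    print(drop_connect([[1.0, 2.0], [3.0, 4.0]], p=0.5, rng=random.Random(0)))
```

Because the mask acts on the weights and not on the hidden state, the recurrence itself is left unchanged, which is what allows a weight-dropped layer to sit in front of an unmodified recurrent kernel.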
