On the State of the Art of Evaluation in Neural Language Models

ICLR 2018 · Gábor Melis, Chris Dyer, Phil Blunsom

Ongoing innovations in recurrent neural network architectures have provided a steady influx of apparently state-of-the-art results on language modelling benchmarks. However, these have been evaluated using differing code bases and limited computational resources, which represent uncontrolled sources of experimental variation. We reevaluate several popular architectures and regularisation methods with large-scale automatic black-box hyperparameter tuning and arrive at the somewhat surprising conclusion that standard LSTM architectures, when properly regularised, outperform more recent models. We establish a new state of the art on the Penn Treebank and Wikitext-2 corpora, as well as strong baselines on the Hutter Prize dataset.
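
The abstract treats hyperparameter tuning as a black box: propose a configuration, train a model, read off its validation perplexity, and repeat. The sketch below illustrates that loop with plain random search. Random search is a swapped-in technique, not the tuner used in the paper. The search space, the parameter names, and the synthetic `validation_perplexity` objective are hypothetical stand-ins for actually training an LSTM.

```python
import math
import random

# Hypothetical search space: names and ranges are illustrative assumptions,
# not the ones used in the paper.
SEARCH_SPACE = {
    "learning_rate": (1e-4, 1e-2),  # sampled log-uniformly
    "input_dropout": (0.0, 0.9),    # sampled uniformly
    "output_dropout": (0.0, 0.9),   # sampled uniformly
    "weight_decay": (1e-6, 1e-3),   # sampled log-uniformly
}
LOG_SCALE = {"learning_rate", "weight_decay"}


def sample_config(rng):
    """Draw one configuration from the (hypothetical) search space."""
    config = {}
    for name, (low, high) in SEARCH_SPACE.items():
        if name in LOG_SCALE:
            config[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        else:
            config[name] = rng.uniform(low, high)
    return config


def validation_perplexity(config):
    """Hypothetical stand-in for 'train with this config, report validation perplexity'.

    A synthetic bowl-shaped function (minimum 60.0) so the sketch runs without training.
    """
    lr_term = (math.log10(config["learning_rate"]) + 3.0) ** 2
    drop_term = (config["input_dropout"] - 0.5) ** 2 + (config["output_dropout"] - 0.5) ** 2
    wd_term = (math.log10(config["weight_decay"]) + 5.0) ** 2
    return 60.0 + 10.0 * (lr_term + drop_term + wd_term)


def random_search(num_trials, seed=0):
    """Keep the configuration with the lowest validation perplexity."""
    rng = random.Random(seed)
    best_config, best_ppl = None, float("inf")
    for _ in range(num_trials):
        config = sample_config(rng)
        ppl = validation_perplexity(config)  # only called, never inspected
        if ppl < best_ppl:
            best_config, best_ppl = config, ppl
    return best_config, best_ppl


if __name__ == "__main__":
    config, ppl = random_search(num_trials=200)
    print(f"best validation perplexity: {ppl:.2f}")
    print(config)
```

The objective is only ever called and never inspected, so a real training run or a more sample-efficient black-box optimiser could replace either piece without changing the loop. In a setup like this, the configuration is selected on validation perplexity, and test perplexity is reported once for the chosen configuration.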

Results from the Paper


Task: Language Modelling
Dataset: WikiText-2
Model: Melis et al. (2017) - 1-layer LSTM (tied)

Metric                   Value   Global rank
Validation perplexity    69.3    #24
Test perplexity          65.9    #32
Number of parameters     24M     #27
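
Perplexity, the main metric in the table, is the exponential of the average per-token negative log-likelihood, so lower is better. The sketch below computes it from a list of per-token natural-log probabilities using that standard definition. The function name and input format are illustrative assumptions, and the paper's exact evaluation setup (batching, how the test stream is split) is not described on this page.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities log p(w_t | w_<t)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Average negative log-likelihood per token, in nats.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_over_four = [math.log(0.25)] * 10
print(perplexity(uniform_over_four))  # approximately 4.0

# Going the other way: average loss implied by the reported test perplexity.
print(math.log(65.9))  # about 4.19 nats per token
```

By the same definition, the reported test perplexity of 65.9 corresponds to an average loss of about 4.19 nats (roughly 6.04 bits) per token.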
