Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck. Given that natural language is highly context-dependent, this further implies that in practice Softmax with distributed word embeddings does not have enough capacity to model natural language. We propose a simple and effective method to address this issue, and improve the state-of-the-art perplexities on Penn Treebank and WikiText-2 to 47.69 and 40.68 respectively. The proposed method also excels on the large-scale 1B Word dataset, outperforming the baseline by over 5.6 points in perplexity.

ICLR 2018
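The "simple and effective method" behind the AWD-LSTM-MoS results below is a mixture of softmaxes (MoS) output layer: instead of a single softmax over one logit matrix, the model mixes several softmaxes with context-dependent weights, which raises the rank of the log-probability matrix. The PyTorch sketch below illustrates this idea; the class name MixtureOfSoftmaxes, the tanh nonlinearity on the component contexts, and the hyperparameters d_model, vocab_size, and n_components are illustrative assumptions rather than the paper's exact AWD-LSTM-MoS configuration.

```python
# Minimal sketch of a mixture-of-softmaxes (MoS) output head.
# Hyperparameters (d_model, vocab_size, n_components) are illustrative
# assumptions, not the settings used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureOfSoftmaxes(nn.Module):
    """Replaces a single softmax output layer with a weighted mixture of
    K softmaxes, lifting the rank restriction of a single logit matrix."""

    def __init__(self, d_model: int, vocab_size: int, n_components: int = 5):
        super().__init__()
        self.n_components = n_components
        # Prior (mixture-weight) network: one weight per component.
        self.prior = nn.Linear(d_model, n_components)
        # Projects the context vector into K component-specific contexts.
        self.latent = nn.Linear(d_model, n_components * d_model)
        # Output projection shared by all components.
        self.decoder = nn.Linear(d_model, vocab_size)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, d_model) context vectors from the RNN.
        batch = hidden.size(0)
        # Mixture weights, normalized over the K components.
        pi = F.softmax(self.prior(hidden), dim=-1)          # (batch, K)
        # Component-specific context vectors.
        h = torch.tanh(self.latent(hidden))                 # (batch, K*d)
        h = h.view(batch, self.n_components, -1)            # (batch, K, d)
        # Per-component softmax over the vocabulary.
        probs = F.softmax(self.decoder(h), dim=-1)          # (batch, K, V)
        # Mix in probability space; the log of this mixture is high-rank.
        mixed = torch.einsum("bk,bkv->bv", pi, probs)       # (batch, V)
        return torch.log(mixed + 1e-8)                      # log-probabilities


# Usage example with random context vectors.
if __name__ == "__main__":
    mos = MixtureOfSoftmaxes(d_model=32, vocab_size=100, n_components=5)
    context = torch.randn(4, 32)
    log_probs = mos(context)
    print(log_probs.shape)  # torch.Size([4, 100])
```

In the full model this head sits on top of an AWD-LSTM; with a single component the layer reduces to the standard softmax, which is the low-rank case the paper identifies as the bottleneck.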
Task                 Dataset                      Model                         Metric                  Value   Global Rank
Language Modelling   Penn Treebank (Word Level)   AWD-LSTM-MoS + dynamic eval   Validation perplexity   48.33   #7
                                                                                Test perplexity         47.69   #11
                                                                                Params                  22M     #23
Language Modelling   Penn Treebank (Word Level)   AWD-LSTM-MoS                  Validation perplexity   56.54   #16
                                                                                Test perplexity         54.44   #20
                                                                                Params                  22M     #23
Language Modelling   WikiText-2                   AWD-LSTM-MoS                  Validation perplexity   63.88   #19
                                                                                Test perplexity         61.45   #26
                                                                                Number of params        35M     #12
Language Modelling   WikiText-2                   AWD-LSTM-MoS + dynamic eval   Validation perplexity   42.41   #8
                                                                                Test perplexity         40.68   #16
                                                                                Number of params        35M     #12

Methods