We show that dropout training is best understood as performing MAP estimation concurrently for a family of conditional models whose objectives are themselves lower bounded by the original dropout objective. This discovery allows us to pick any model from this family after training, which leads to a substantial improvement on regularisation-heavy language modelling.
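The "dropout tuning" named in the results below is this post-training selection step: sweep candidate dropout rates, score each resulting model on held-out data, and keep the best. A minimal sketch of that loop, where `dropout_tune` and `evaluate_perplexity` are hypothetical names and the toy perplexity curve merely stands in for re-scoring the validation set under each rate:

```python
import numpy as np

def dropout_tune(evaluate_perplexity, rates):
    """Grid-search the dropout family: score each candidate rate on
    held-out data and return the rate with the lowest perplexity."""
    scores = {p: evaluate_perplexity(p) for p in rates}
    best = min(scores, key=scores.get)
    return best, scores

# Toy stand-in for validation perplexity as a function of the dropout
# rate applied at evaluation time (hypothetical; a real implementation
# would re-score the validation set under each rate's model).
def toy_perplexity(p):
    return 60.0 + 50.0 * (p - 0.55) ** 2

if __name__ == "__main__":
    candidate_rates = np.linspace(0.3, 0.8, 11)
    best_rate, scores = dropout_tune(toy_perplexity, candidate_rates)
    print(f"best dropout rate: {best_rate:.2f} "
          f"(validation perplexity {scores[best_rate]:.2f})")
```

The selection is cheap because it reuses one trained model: only forward passes over the validation set are needed for each candidate rate, no retraining.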

PDF | Abstract | ICLR 2019
Task: Language Modelling
Dataset: Penn Treebank (Word Level)
Model: 2-layer skip-LSTM + dropout tuning

Metric                  Value   Global rank
Validation perplexity   57.1    #17
Test perplexity         55.3    #21
Params                  24M     #7

Methods used in the Paper