TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Language Modelling	One Billion Word	High-Budget MoE	PPL	28.0	# 14
Language Modelling	One Billion Word	High-Budget MoE	Number of params	5B	# 1
Language Modelling	One Billion Word	Low-Budget MoE	PPL	34.1	# 19
Language Modelling	One Billion Word	Low-Budget MoE	Number of params	5B	# 1
Machine Translation	WMT2014 English-French	MoE	BLEU score	40.56	# 30
Machine Translation	WMT2014 English-French	MoE	Hardware Burden	142G	# 1
Machine Translation	WMT2014 English-French	MoE	Operations per network pass	None	# 1
Machine Translation	WMT2014 English-German	MoE	BLEU score	26.03	# 65
Machine Translation	WMT2014 English-German	MoE	Hardware Burden	24G	# 1
Machine Translation	WMT2014 English-German	MoE	Operations per network pass	None	# 1

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/outrageously-large-neural-networks-the/language-modelling-on-one-billion-word)](https://paperswithcode.com/sota/language-modelling-on-one-billion-word?p=outrageously-large-neural-networks-the)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/outrageously-large-neural-networks-the/machine-translation-on-wmt2014-english-french)](https://paperswithcode.com/sota/machine-translation-on-wmt2014-english-french?p=outrageously-large-neural-networks-the)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/outrageously-large-neural-networks-the/machine-translation-on-wmt2014-english-german)](https://paperswithcode.com/sota/machine-translation-on-wmt2014-english-german?p=outrageously-large-neural-networks-the)`

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

23 Jan 2017 · Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.
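The gating idea in the abstract (a trainable gate that picks a sparse combination of experts per example) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `SparseMoE`, `Expert`, and `top_k_gates`, the layer sizes, and the random initialisation are all hypothetical, and the sketch uses plain top-k selection followed by a softmax over the selected logits. The noise and load-balancing terms of the paper's gating function, training, and the surrounding LSTM stack are left out.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gates(logits, k):
    """Return {expert_index: weight} for the k largest logits.

    The softmax is taken over the selected logits only, so every other
    expert has zero weight and never needs to be evaluated.
    """
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in chosen])
    return dict(zip(chosen, weights))


class Expert:
    """Tiny feed-forward expert (hypothetical): one ReLU hidden layer, no biases."""

    def __init__(self, d_model, d_hidden, rng):
        self.w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
        self.w_out = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]

    def __call__(self, x):
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in self.w_in]
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w_out]


class SparseMoE:
    """Sparsely gated mixture of experts over a single input vector."""

    def __init__(self, n_experts, d_model, d_hidden, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.experts = [Expert(d_model, d_hidden, rng) for _ in range(n_experts)]
        # Gating network: one linear map from the input to a logit per expert.
        self.w_gate = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(n_experts)]

    def gates(self, x):
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in self.w_gate]
        return top_k_gates(logits, self.k)

    def __call__(self, x):
        out = [0.0] * len(x)
        for idx, weight in self.gates(x).items():
            y = self.experts[idx](x)  # only the k selected experts are run
            out = [o + weight * yi for o, yi in zip(out, y)]
        return out


# Example: 8 experts, 2 of them active for this input.
moe = SparseMoE(n_experts=8, d_model=4, d_hidden=16, k=2)
x = [0.5, -1.0, 0.25, 2.0]
print(moe.gates(x))  # two expert indices with weights that sum to 1
print(moe(x))        # weighted sum of the two selected experts' outputs
```

In this sketch the number of expert parameters grows with `n_experts`, while the expert computation per example grows only with `k`, which is the sense in which capacity can increase without a proportional increase in computation.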

Code

davidmrau/mixture-of-experts (816 stars)

jsuarez5341/Efficient-Dynamic-Batch…

unconst/MACH

ma921/XRDidentifier

Tasks

Computational Efficiency

Language Modelling

Machine Translation

Translation

Datasets

WMT 2014

Billion Word Benchmark

Results from the Paper

Ranked #14 on Language Modelling on One Billion Word

Methods

LSTM • Sigmoid Activation • Tanh Activation
