Search Results for author: Noam Shazeer

Found 39 papers, 27 papers with code

ST-MoE: Designing Stable and Transferable Sparse Expert Models

2 code implementations 17 Feb 2022 Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus

But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning.

Coreference Resolution Decoder +7

Searching for Efficient Transformers for Language Modeling

no code implementations NeurIPS 2021 David So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc Le

For example, at a 500M parameter size, Primer improves the original T5 architecture on C4 auto-regressive language modeling, reducing the training cost by 4X.

Language Modelling

Primer: Searching for Efficient Transformers for Language Modeling

4 code implementations 17 Sep 2021 David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le

For example, at a 500M parameter size, Primer improves the original T5 architecture on C4 auto-regressive language modeling, reducing the training cost by 4X.

Language Modelling

Do Transformer Modifications Transfer Across Implementations and Applications?

1 code implementation EMNLP 2021 Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, Yanqi Zhou, Wei Li, Nan Ding, Jake Marcus, Adam Roberts, Colin Raffel

The research community has proposed copious modifications to the Transformer architecture since it was introduced over three years ago, relatively few of which have seen widespread adoption.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

6 code implementations 11 Jan 2021 William Fedus, Barret Zoph, Noam Shazeer

We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources.
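The sparsity here is top-1 ("switch") routing: each token is dispatched to a single expert chosen by a softmax gate, so compute stays constant as expert count grows. A minimal NumPy sketch under assumed shapes (the gate matrix `w_gate` and the toy experts are hypothetical, not the paper's Mesh TensorFlow implementation):

```python
import numpy as np

def switch_route(x, w_gate, experts):
    """Top-1 (switch) routing: each token goes to exactly one expert.

    x: [tokens, d_model]; w_gate: [d_model, n_experts];
    experts: list of callables, one per expert.
    """
    logits = x @ w_gate                          # [tokens, n_experts]
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)   # softmax gate
    choice = probs.argmax(axis=-1)               # top-1 expert per token
    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        mask = choice == e
        if mask.any():
            # Scale each expert's output by its gate probability.
            out[mask] = expert(x[mask]) * probs[mask, e:e+1]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
w_gate = rng.normal(size=(4, 2))
experts = [lambda h: h * 2.0, lambda h: h + 1.0]
y = switch_route(x, w_gate, experts)
print(y.shape)  # (8, 4)
```

Because only the chosen expert runs per token, adding experts grows parameters without growing per-token FLOPs.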

Language Modelling Question Answering

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

2 code implementations ICLR 2021 Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen

Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute.

Machine Translation Playing the Game of 2048 +1

Talking-Heads Attention

4 code implementations 5 Mar 2020 Noam Shazeer, Zhenzhong Lan, Youlong Cheng, Nan Ding, Le Hou

We introduce "talking-heads attention" - a variation on multi-head attention which includes linear projections across the attention-heads dimension, immediately before and after the softmax operation. While inserting only a small number of additional parameters and a moderate amount of additional computation, talking-heads attention leads to better perplexities on masked language modeling tasks, as well as better quality when transfer-learning to language comprehension and question answering tasks.
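Concretely, two extra learned matrices mix information across the heads axis, one applied to the attention logits and one to the attention weights. A hedged NumPy sketch (shapes and names are illustrative, not the paper's code):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def talking_heads_attention(q, k, v, p_logits, p_weights):
    """q, k: [h, n, d_k]; v: [h, n, d_v].
    p_logits: [h, h'] mixes logits across heads before the softmax;
    p_weights: [h', h] mixes attention weights after the softmax."""
    logits = np.einsum('hnd,hmd->hnm', q, k) / np.sqrt(q.shape[-1])
    logits = np.einsum('hnm,hg->gnm', logits, p_logits)   # talk before softmax
    weights = softmax(logits)
    weights = np.einsum('gnm,gh->hnm', weights, p_weights)  # talk after softmax
    return np.einsum('hnm,hmd->hnd', weights, v)
```

With identity projection matrices this reduces exactly to standard multi-head attention, which is why the change is a strict generalization.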

Language Modelling Masked Language Modeling +2

GLU Variants Improve Transformer

22 code implementations 12 Feb 2020 Noam Shazeer

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function.
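In symbols, GLU(x) = (xW + b) ⊗ σ(xV + c); the paper's variants swap the sigmoid for other nonlinearities (e.g. ReLU in ReGLU, GELU in GEGLU, Swish in SwiGLU). A small NumPy sketch (weight names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, W, V, b, c):
    """Gated Linear Unit: component-wise product of two linear
    projections, one of which is passed through a sigmoid gate."""
    return (x @ W + b) * sigmoid(x @ V + c)

def reglu(x, W, V, b, c):
    """ReGLU variant: the sigmoid gate is replaced by a ReLU."""
    return (x @ W + b) * np.maximum(x @ V + c, 0.0)
```

Dropped into a Transformer feed-forward layer, the gated product replaces the usual single activation between the two dense layers.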

How Much Knowledge Can You Pack Into the Parameters of a Language Model?

3 code implementations EMNLP 2020 Adam Roberts, Colin Raffel, Noam Shazeer

It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries.

Language Modelling Natural Language Queries

Faster Transformer Decoding: N-gram Masked Self-Attention

no code implementations 14 Jan 2020 Ciprian Chelba, Mia Chen, Ankur Bapna, Noam Shazeer

Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence $S=s_1, \ldots, s_S$, we propose truncating the target-side window used for computing self-attention by making an $N$-gram assumption.
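Under this assumption each target position attends only to itself and the previous N-1 positions, so the self-attention mask becomes a banded causal mask. A sketch of that mask (hypothetical helper, not the paper's code):

```python
import numpy as np

def ngram_causal_mask(seq_len, n):
    """True where position i may attend to position j: causal (j <= i)
    and within the last n tokens (j > i - n)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - n)

mask = ngram_causal_mask(6, 3)
# Row 5 attends to positions 3, 4, 5 only.
```

The payoff at decoding time is that only the last N-1 target activations need to be cached instead of the full prefix.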

Sentence

Fast Transformer Decoding: One Write-Head is All You Need

4 code implementations 6 Nov 2019 Noam Shazeer

Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences.
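The paper's proposal, multi-query attention, keeps multiple query heads but shares a single key and value head across all of them, shrinking the key/value cache that dominates memory bandwidth during incremental decoding. A hedged NumPy sketch (shapes illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_query_attention(q, k, v):
    """q: [h, n, d_k], one query projection per head; k: [m, d_k] and
    v: [m, d_v] are shared by all heads -- the single 'write head' that
    makes the decoding cache h times smaller."""
    logits = np.einsum('hnd,md->hnm', q, k) / np.sqrt(q.shape[-1])
    return np.einsum('hnm,md->hnd', softmax(logits), v)
```

Each head still computes its own attention distribution; only the memory being read from is shared.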

Language Modelling Large Language Model

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

52 code implementations arXiv 2019 Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP).

Answer Generation Common Sense Reasoning +12

High Resolution Medical Image Analysis with Spatial Partitioning

1 code implementation 6 Sep 2019 Le Hou, Youlong Cheng, Noam Shazeer, Niki Parmar, Yeqing Li, Panagiotis Korfiatis, Travis M. Drucker, Daniel J. Blezek, Xiaodan Song

It is infeasible to train CNN models directly on such high resolution images, because neural activations of a single image do not fit in the memory of a single GPU/TPU, and naive data and model parallelism approaches do not work.

Vocal Bursts Intensity Prediction

Blockwise Parallel Decoding for Deep Autoregressive Models

no code implementations NeurIPS 2018 Mitchell Stern, Noam Shazeer, Jakob Uszkoreit

Deep autoregressive sequence-to-sequence models have demonstrated impressive performance across a wide variety of tasks in recent years.

Decoder Image Super-Resolution +2

Weakly Supervised Grammatical Error Correction using Iterative Decoding

no code implementations 31 Oct 2018 Jared Lichtarge, Christopher Alberti, Shankar Kumar, Noam Shazeer, Niki Parmar

We describe an approach to Grammatical Error Correction (GEC) that is effective at making use of models trained on large amounts of weakly supervised bitext.

Grammatical Error Correction

Music Transformer

12 code implementations ICLR 2019 Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, Douglas Eck

This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length.

Music Generation Music Modeling

HydraNets: Specialized Dynamic Architectures for Efficient Inference

no code implementations CVPR 2018 Ravi Teja Mullapudi, William R. Mark, Noam Shazeer, Kayvon Fatahalian

On ImageNet, applying the HydraNet template improves accuracy up to 2.5% when compared to an efficient baseline architecture with similar inference cost.

Classification Computational Efficiency +2

Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

4 code implementations ICML 2018 Noam Shazeer, Mitchell Stern

In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients.
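Adafactor's memory saving comes from not storing that full second-moment accumulator: for a matrix-shaped parameter it keeps only per-row and per-column moving averages of the squared gradient and reconstructs the matrix as a rank-1 outer product. A simplified sketch of just the factored estimate (update clipping and relative step sizes omitted):

```python
import numpy as np

def factored_second_moment(grad_history, beta2=0.999):
    """Row and column EMAs of the squared gradient of a [rows, cols]
    parameter: memory is O(rows + cols) instead of O(rows * cols)."""
    rows, cols = grad_history[0].shape
    r = np.zeros(rows)   # EMA of row sums of squared gradients
    c = np.zeros(cols)   # EMA of column sums of squared gradients
    for g in grad_history:
        g2 = g * g
        r = beta2 * r + (1.0 - beta2) * g2.sum(axis=1)
        c = beta2 * c + (1.0 - beta2) * g2.sum(axis=0)
    # Rank-1 reconstruction; the r.sum() normalization makes it exact
    # when the true second-moment matrix is itself rank 1.
    return np.outer(r, c) / r.sum()
```

The reconstructed matrix then plays the role of the Adam-style accumulator: updates are divided by its element-wise square root.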

Machine Translation Stochastic Optimization +1

Tensor2Tensor for Neural Machine Translation

14 code implementations WS 2018 Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, Jakob Uszkoreit

Tensor2Tensor is a library for deep learning models that is well-suited for neural machine translation and includes the reference implementation of the state-of-the-art Transformer model.

Machine Translation Translation

Fast Decoding in Sequence Models using Discrete Latent Variables

no code implementations ICML 2018 Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, Noam Shazeer

Finally, we evaluate our model end-to-end on the task of neural machine translation, where it is an order of magnitude faster at decoding than comparable autoregressive models.

Machine Translation Translation

Image Transformer

no code implementations15 Feb 2018 Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, Dustin Tran

Image generation has been successfully cast as an autoregressive sequence generation or transformation problem.

Decoder Density Estimation +2

Generating Wikipedia by Summarizing Long Sequences

4 code implementations ICLR 2018 Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam Shazeer

We show that generating English Wikipedia articles can be approached as a multi-document summarization of source documents.

Decoder Document Summarization +3

One Model To Learn Them All

1 code implementation 16 Jun 2017 Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, Jakob Uszkoreit

We present a single model that yields good results on a number of problems spanning multiple domains.

Image Captioning Image Classification +3

Attention Is All You Need

575 code implementations NeurIPS 2017 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration.

Ranked #2 on Multimodal Machine Translation on Multi30K (BLEU (DE-EN) metric)

Abstractive Text Summarization Coreference Resolution +10

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

4 code implementations 23 Jan 2017 Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean

In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters.
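The conditional-computation mechanism is a sparse gating network: for each example a softmax is taken over only the top-k expert logits, so every other expert receives exactly zero gate weight and is never evaluated. A hedged NumPy sketch (the paper's noise term and load-balancing loss are omitted):

```python
import numpy as np

def topk_gates(x, w_gate, k=2):
    """Return [batch, n_experts] gate weights with exactly k nonzeros
    per row; zero-gated experts can be skipped entirely."""
    logits = x @ w_gate
    gates = np.zeros_like(logits)
    for i, row in enumerate(logits):
        top = np.argsort(row)[-k:]             # indices of top-k experts
        e = np.exp(row[top] - row[top].max())  # softmax over top-k only
        gates[i, top] = e / e.sum()
    return gates
```

The layer's output is then the gate-weighted sum of the k selected experts' outputs, which is how capacity can scale ~1000x while compute per example stays nearly flat.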

Computational Efficiency Language Modelling +2

NN-grams: Unifying neural network and n-gram language models for Speech Recognition

no code implementations 23 Jun 2016 Babak Damavandi, Shankar Kumar, Noam Shazeer, Antoine Bruguier

The model is trained using noise contrastive estimation (NCE), an approach that transforms the estimation problem of neural networks into one of binary classification between data samples and noise samples.
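In NCE the model's unnormalized score s(w) is trained with a logistic loss to separate true words from k samples drawn from a known noise distribution p_n; the classifier's logit is s(w) - log(k·p_n(w)). A hedged sketch of that objective (argument names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_loss(s_data, pn_data, s_noise, pn_noise, k):
    """Binary-classification loss: data samples should be classified
    as data, the k noise samples per position as noise."""
    logit_data = s_data - np.log(k * pn_data)
    logit_noise = s_noise - np.log(k * pn_noise)
    return -(np.log(sigmoid(logit_data)).sum()
             + np.log(1.0 - sigmoid(logit_noise)).sum())
```

The appeal for large-vocabulary LMs is that this loss never requires the full softmax normalizer over the vocabulary.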

Binary Classification Language Modelling +3

Exploring the Limits of Language Modeling

10 code implementations 7 Feb 2016 Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu

In this work we explore recent advances in Recurrent Neural Networks for large scale Language Modeling, a task central to language understanding.

Language Modelling

Swivel: Improving Embeddings by Noticing What's Missing

3 code implementations 6 Feb 2016 Noam Shazeer, Ryan Doherty, Colin Evans, Chris Waterson

We present Submatrix-wise Vector Embedding Learner (Swivel), a method for generating low-dimensional feature embeddings from a feature co-occurrence matrix.

Sparse Non-negative Matrix Language Modeling

no code implementations TACL 2016 Joris Pelemans, Noam Shazeer, Ciprian Chelba

We evaluate SNM language models on two corpora: the One Billion Word Benchmark and a subset of the LDC English Gigaword corpus.

Automatic Speech Recognition (ASR) Language Modelling +1

End-to-End Text-Dependent Speaker Verification

3 code implementations 27 Sep 2015 Georg Heigold, Ignacio Moreno, Samy Bengio, Noam Shazeer

In this paper we present a data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system's components using the same evaluation protocol and metric as at test time.

Text-Dependent Speaker Verification

Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks

9 code implementations NeurIPS 2015 Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer

Recurrent Neural Networks can be trained to produce sequences of tokens given some input, as exemplified by recent results in machine translation and image captioning.
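The training fix proposed here: when predicting token t, feed the ground-truth previous token with probability ε and the model's own previous prediction otherwise, annealing ε toward zero so training conditions gradually match inference. A minimal sketch (the inverse sigmoid schedule is one of several in the paper; helper names are illustrative):

```python
import numpy as np

def mix_inputs(gold_prev, model_prev, epsilon, rng):
    """Per position, use the gold previous token with probability
    epsilon, else the model's own previous prediction."""
    use_gold = rng.random(len(gold_prev)) < epsilon
    return np.where(use_gold, gold_prev, model_prev)

def inverse_sigmoid_decay(step, k=500.0):
    """epsilon = k / (k + exp(step / k)): starts near 1, decays to 0."""
    return k / (k + np.exp(step / k))
```

At ε = 1 this is ordinary teacher forcing; at ε = 0 the model trains entirely on its own predictions, matching how it will be run at test time.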

Constituency Parsing Image Captioning +2

Skip-gram Language Modeling Using Sparse Non-negative Matrix Probability Estimation

no code implementations 3 Dec 2014 Noam Shazeer, Joris Pelemans, Ciprian Chelba

We present a novel family of language model (LM) estimation techniques named Sparse Non-negative Matrix (SNM) estimation.

Language Modelling
