Search Results for author: Sam Shleifer

Found 10 papers, 8 papers with code

NormFormer: Improved Transformer Pretraining with Extra Normalization

1 code implementation • 18 Oct 2021 • Sam Shleifer, Jason Weston, Myle Ott

The extra operations incur negligible compute cost (+0.4% parameter increase), but improve pretraining perplexity and downstream task performance for both causal and masked language models ranging from 125 Million to 2.7 Billion parameters.

Language Modelling
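A rough PyTorch sketch of the kind of extra normalization the NormFormer excerpt refers to (per-head scaling of attention outputs plus additional LayerNorms inside a Pre-LN block). The dimensions are illustrative defaults, the exact placement of the extra norms follows the paper only approximately, and causal masking is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormFormerBlock(nn.Module):
    """Pre-LN transformer block with three extra NormFormer-style operations:
    (1) per-head scaling of attention outputs, (2) a LayerNorm on the attention
    output, and (3) a LayerNorm after the first feed-forward activation."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.head_scale = nn.Parameter(torch.ones(n_heads))  # extra (1): a handful of params
        self.pre_attn_ln = nn.LayerNorm(d_model)              # standard Pre-LN
        self.post_attn_ln = nn.LayerNorm(d_model)             # extra (2)
        self.pre_ffn_ln = nn.LayerNorm(d_model)                # standard Pre-LN
        self.mid_ffn_ln = nn.LayerNorm(d_ff)                   # extra (3)
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):                                      # x: (batch, seq, d_model)
        b, t, d = x.shape
        # Attention sub-block (causal masking omitted for brevity).
        h = self.pre_attn_ln(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1) @ v
        attn = attn * self.head_scale.view(1, -1, 1, 1)        # scale each head's output
        attn = attn.transpose(1, 2).reshape(b, t, d)
        x = x + self.post_attn_ln(self.out_proj(attn))         # extra LN before the residual add
        # Feed-forward sub-block with an extra LN after the first activation.
        return x + self.fc2(self.mid_ffn_ln(F.gelu(self.fc1(self.pre_ffn_ln(x)))))

block = NormFormerBlock()
print(block(torch.randn(2, 16, 768)).shape)   # torch.Size([2, 16, 768])
```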

8-bit Optimizers via Block-wise Quantization

1 code implementation • 6 Oct 2021 • Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer

To maintain stability and performance, we combine block-wise quantization with two additional changes: (1) dynamic quantization, a form of non-linear quantization that is precise for both large and small magnitude values, and (2) a stable embedding layer to reduce gradient variance that comes from the highly non-uniform distribution of input tokens in language models.

Language Modelling, Machine Translation +1
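A minimal sketch of the block-wise quantization idea from the 8-bit optimizer excerpt: each block of an optimizer-state tensor gets its own scale, so an outlier in one block cannot wash out precision everywhere else. The linear 256-value codebook and block size below are simplifications for illustration (the paper's dynamic quantization uses a non-linear codebook, and its released bitsandbytes implementation handles this far more efficiently):

```python
import torch

def blockwise_quantize(x: torch.Tensor, block_size: int = 2048):
    """Quantize a tensor to 8-bit codes, with one scale per block."""
    flat = x.flatten()
    pad = (-len(flat)) % block_size
    flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, block_size)

    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)  # per-block scale
    normed = blocks / absmax                                          # now in [-1, 1]

    codebook = torch.linspace(-1.0, 1.0, 256, device=x.device)        # linear, illustrative only
    codes = (normed.unsqueeze(-1) - codebook).abs().argmin(dim=-1)    # nearest code per value
    return codes.to(torch.uint8), absmax, codebook, x.shape, pad

def blockwise_dequantize(codes, absmax, codebook, shape, pad):
    """Recover an approximate float tensor from codes and per-block scales."""
    flat = (codebook[codes.long()] * absmax).flatten()
    flat = flat[: len(flat) - pad] if pad else flat
    return flat.view(shape)

state = torch.randn(3, 1000)                              # stand-in for an optimizer state tensor
q = blockwise_quantize(state)
print((blockwise_dequantize(*q) - state).abs().max())     # small reconstruction error
```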

Pre-trained Summarization Distillation

1 code implementation • 24 Oct 2020 • Sam Shleifer, Alexander M. Rush

A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning.

Knowledge Distillation, Machine Translation +1
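A sketch of the "shrink" half of shrink-and-fine-tune (SFT): the student is initialized by copying a subset of the teacher's layers, weights included, and is then simply fine-tuned on the task with no explicit distillation loss. The toy encoder and layer-selection indices below are illustrative, not the paper's exact recipe:

```python
import copy
import torch.nn as nn

def shrink(teacher_layers: nn.ModuleList, keep) -> nn.ModuleList:
    """Build a student layer stack by copying the teacher layers at `keep` indices."""
    return nn.ModuleList(copy.deepcopy(teacher_layers[i]) for i in keep)

# Toy example: a 12-layer "teacher" encoder shrunk to 4 layers by keeping
# roughly evenly spaced layers (the selection heuristic here is illustrative).
teacher = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True) for _ in range(12)
)
student = shrink(teacher, keep=[0, 4, 8, 11])
print(len(student))   # 4 layers, initialized from teacher weights, ready for fine-tuning
```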

Incrementally Improving Graph WaveNet Performance on Traffic Prediction

6 code implementations • 11 Dec 2019 • Sam Shleifer, Clara McCreery, Vamsi Chitters

We present a series of modifications which improve upon Graph WaveNet's previously state-of-the-art performance on the METR-LA traffic prediction task.

Traffic Prediction

Classification As Decoder: Trading Flexibility For Control In Neural Dialogue

no code implementations • 4 Oct 2019 • Sam Shleifer, Manish Chablani, Namit Katariya, Anitha Kannan, Xavier Amatriain

Only 12% of our discriminative approach's responses are worse than the doctor's response in the same conversational context, compared to 18% for the generative model.

Classification, General Classification

Using Small Proxy Datasets to Accelerate Hyperparameter Search

1 code implementation • 12 Jun 2019 • Sam Shleifer, Eric Prokop

These "easy" proxies are higher quality than training on the full dataset for a reduced number of epochs (but equivalent computational cost), and, unexpectedly, higher quality than proxies constructed from the hardest examples.

Low Resource Text Classification with ULMFit and Backtranslation

1 code implementation • 21 Mar 2019 • Sam Shleifer

A ULMFit model pretrained on wikitext103 and then fine-tuned on only 50 IMDB examples and 500 synthetic examples generated by backtranslation achieves 80.6% accuracy, an 8.1% improvement over the augmentation-free baseline, with only 9 minutes of additional training time.

Classification, Data Augmentation +2
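A sketch of the backtranslation augmentation the excerpt relies on: round-trip each labeled example through a pivot language to produce a synthetic paraphrase that inherits the original label. The MarianMT checkpoints and pivot language here are an illustrative choice; the excerpt does not say which translation system the paper used:

```python
# pip install transformers sentencepiece
from transformers import MarianMTModel, MarianTokenizer

def backtranslate(sentences, pivot="fr"):
    """Generate paraphrases by round-trip translation (en -> pivot -> en)."""
    def translate(texts, model_name):
        tok = MarianTokenizer.from_pretrained(model_name)
        model = MarianMTModel.from_pretrained(model_name)
        batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
        out = model.generate(**batch, max_new_tokens=128)
        return tok.batch_decode(out, skip_special_tokens=True)

    pivoted = translate(sentences, f"Helsinki-NLP/opus-mt-en-{pivot}")
    return translate(pivoted, f"Helsinki-NLP/opus-mt-{pivot}-en")

reviews = ["The movie was surprisingly good, with a strong final act."]
synthetic = backtranslate(reviews)   # paraphrased copies used as extra training data
print(synthetic)
```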
