Search Results for author: Sam Shleifer

Found 16 papers, 11 papers with code

Low Resource Text Classification with ULMFit and Backtranslation

1 code implementation • 21 Mar 2019 • Sam Shleifer

A ULMFit model pretrained on wikitext103 and then fine-tuned on only 50 IMDB examples and 500 synthetic examples generated by backtranslation achieves 80.6% accuracy, an 8.1% improvement over the augmentation-free baseline with only 9 minutes of additional training time.

Data Augmentation General Classification +2

Using Small Proxy Datasets to Accelerate Hyperparameter Search

1 code implementation • 12 Jun 2019 • Sam Shleifer, Eric Prokop

These "easy" proxies are higher quality than training on the full dataset for a reduced number of epochs (but equivalent computational cost), and, unexpectedly, higher quality than proxies constructed from the hardest examples.

Classification As Decoder: Trading Flexibility For Control In Neural Dialogue

no code implementations • 4 Oct 2019 • Sam Shleifer, Manish Chablani, Namit Katariya, Anitha Kannan, Xavier Amatriain

Only 12% of our discriminative approach's responses are worse than the doctor's response in the same conversational context, compared to 18% for the generative model.

Classification General Classification

HuggingFace's Transformers: State-of-the-art Natural Language Processing

9 code implementations • 9 Oct 2019 • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, Alexander M. Rush

Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks.

Text Generation Transfer Learning

124,889

Classification as Decoder: Trading Flexibility for Control in Medical Dialogue

no code implementations • 16 Nov 2019 • Sam Shleifer, Manish Chablani, Anitha Kannan, Namit Katariya, Xavier Amatriain

Generative seq2seq dialogue systems are trained to predict the next word in dialogues that have already occurred.

Classification General Classification +1

Incrementally Improving Graph WaveNet Performance on Traffic Prediction

4 code implementations • 11 Dec 2019 • Sam Shleifer, Clara McCreery, Vamsi Chitters

We present a series of modifications which improve upon Graph WaveNet's previously state-of-the-art performance on the METR-LA traffic prediction task.

Ranked #11 on Traffic Prediction on METR-LA

Traffic Prediction

545

Transformers: State-of-the-Art Natural Language Processing

2 code implementations • EMNLP 2020 • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, Alexander Rush

Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks.

Image Classification Object Recognition +1

124,889

Pre-trained Summarization Distillation

1 code implementation • 24 Oct 2020 • Sam Shleifer, Alexander M. Rush

A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning.

Knowledge Distillation Machine Translation +1

44,938

8-bit Optimizers via Block-wise Quantization

2 code implementations • ICLR 2022 • Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer

To maintain stability and performance, we combine block-wise quantization with two additional changes: (1) dynamic quantization, a form of non-linear optimization that is precise for both large and small magnitude values, and (2) a stable embedding layer to reduce gradient variance that comes from the highly non-uniform distribution of input tokens in language models.

Language Modelling Machine Translation +1

5,374

NormFormer: Improved Transformer Pretraining with Extra Normalization

1 code implementation • 18 Oct 2021 • Sam Shleifer, Jason Weston, Myle Ott

The extra operations incur negligible compute cost (+0.4% parameter increase), but improve pretraining perplexity and downstream task performance for both causal and masked language models ranging from 125 Million to 2.7 Billion parameters.

Language Modelling Masked Language Modeling

29,233

Efficient Large Scale Language Modeling with Mixtures of Experts

no code implementations • 20 Dec 2021 • Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, Ves Stoyanov

This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full-shot fine-tuning.

Language Modelling

Few-shot Learning with Multilingual Language Models

2 code implementations • 20 Dec 2021 • Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li

Large-scale generative language models such as GPT-3 are competitive few-shot learners.

Cross-Lingual Transfer Few-Shot Learning +5

29,233

Efficient Language Modeling with Sparse all-MLP

no code implementations • 14 Mar 2022 • Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li

All-MLP architectures have attracted increasing interest as an alternative to attention-based models.

Ranked #17 on Question Answering on StoryCloze

Common Sense Reasoning In-Context Learning +4

OPT: Open Pre-trained Transformer Language Models

7 code implementations • 2 May 2022 • Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer

Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning.

Ranked #2 on Stereotypical Bias Analysis on CrowS-Pairs