Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

23 Oct 2019 · Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice...

Evaluation Results from the Paper


| TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK |
|---|---|---|---|---|---|
| Question Answering | BoolQ | T5-11B | Accuracy | 91.0 | #1 |
| Document Summarization | CNN / Daily Mail | T5-11B | ROUGE-1 | 43.52 | #2 |
| Document Summarization | CNN / Daily Mail | T5-11B | ROUGE-2 | 21.55 | #1 |
| Document Summarization | CNN / Daily Mail | T5-11B | ROUGE-L | 40.69 | #1 |
| Linguistic Acceptability | CoLA | T5-Large | Accuracy | 61.2% | #9 |
| Linguistic Acceptability | CoLA | T5-Base | Accuracy | 51.1% | #12 |
| Linguistic Acceptability | CoLA | T5-Small | Accuracy | 41.0% | #14 |
| Linguistic Acceptability | CoLA | T5-11B | Accuracy | 70.8% | #1 |
| Linguistic Acceptability | CoLA | T5-3B | Accuracy | 67.1% | #6 |
| Natural Language Inference | CommitmentBank | T5-11B | F1 | 93.0 | #1 |
| Question Answering | COPA | T5-11B | Accuracy | 94.8 | #1 |
| Semantic Textual Similarity | MRPC | T5-3B | Accuracy | 89.2% | #9 |
| Semantic Textual Similarity | MRPC | T5-3B | F1 | 92.5% | #1 |
| Semantic Textual Similarity | MRPC | T5-Base | Accuracy | 87.5% | #10 |
| Semantic Textual Similarity | MRPC | T5-Base | F1 | 90.7% | #4 |
| Semantic Textual Similarity | MRPC | T5-Small | Accuracy | 86.6% | #12 |
| Semantic Textual Similarity | MRPC | T5-Small | F1 | 89.7% | #5 |
| Semantic Textual Similarity | MRPC | T5-11B | Accuracy | 90.0% | #6 |
| Semantic Textual Similarity | MRPC | T5-11B | F1 | 91.9% | #3 |
| Semantic Textual Similarity | MRPC | T5-Large | Accuracy | 89.9% | #7 |
| Semantic Textual Similarity | MRPC | T5-Large | F1 | 92.4% | #2 |
| Natural Language Inference | MultiNLI | T5-3B | Matched | 91.4 | #2 |
| Natural Language Inference | MultiNLI | T5-3B | Mismatched | 91.2 | #2 |
| Natural Language Inference | MultiNLI | T5-Base | Matched | 87.1 | #8 |
| Natural Language Inference | MultiNLI | T5-Base | Mismatched | 86.2 | #7 |
| Natural Language Inference | MultiNLI | T5-11B | Matched | 92 | #1 |
| Natural Language Inference | MultiNLI | T5-11B | Mismatched | 91.7 | #1 |
| Natural Language Inference | MultiNLI | T5-Small | Matched | 82.4 | #13 |
| Natural Language Inference | MultiNLI | T5-Small | Mismatched | 82.3 | #12 |
| Natural Language Inference | MultiNLI | T5-Large | Matched | 89.9 | #6 |
| Natural Language Inference | MultiNLI | T5-Large | Mismatched | 89.6 | #5 |
| Question Answering | MultiRC | T5-11B | F1a | 88.2 | #1 |
| Natural Language Inference | QNLI | T5-Small | Accuracy | 90.3% | #12 |
| Natural Language Inference | QNLI | T5-11B | Accuracy | 96.7% | #4 |
| Natural Language Inference | QNLI | T5-3B | Accuracy | 96.3% | #5 |
| Natural Language Inference | QNLI | T5-Large | Accuracy | 94.8% | #7 |
| Natural Language Inference | QNLI | T5-Base | Accuracy | 93.7% | #10 |
| Question Answering | Quora Question Pairs | T5-Small | Accuracy | 88.0% | #12 |
| Question Answering | Quora Question Pairs | T5-Base | Accuracy | 89.4% | #10 |
| Question Answering | Quora Question Pairs | T5-11B | Accuracy | 90.4% | #2 |
| Question Answering | Quora Question Pairs | T5-3B | Accuracy | 89.7% | #8 |
| Question Answering | Quora Question Pairs | T5-Large | Accuracy | 89.9% | #6 |
| Question Answering | ReCoRD | T5-11B | F1 | 93.3 | #1 |
| Natural Language Inference | RTE | T5-Small | Accuracy | 69.9% | #11 |
| Natural Language Inference | RTE | T5-11B | Accuracy | 92.5% | #1 |
| Natural Language Inference | RTE | T5-Large | Accuracy | 87.2% | #5 |
| Natural Language Inference | RTE | T5-3B | Accuracy | 91.1% | #2 |
| Natural Language Inference | RTE | T5-Base | Accuracy | 80.1% | #9 |
| Question Answering | SQuAD1.1 dev | T5-Small | EM | 79.10 | #10 |
| Question Answering | SQuAD1.1 dev | T5-Small | F1 | 87.24 | #12 |
| Question Answering | SQuAD1.1 dev | T5-11B | EM | 90.06 | #1 |
| Question Answering | SQuAD1.1 dev | T5-11B | F1 | 95.64 | #1 |
| Question Answering | SQuAD1.1 dev | T5-3B | EM | 88.53 | #3 |
| Question Answering | SQuAD1.1 dev | T5-3B | F1 | 94.95 | #2 |
| Question Answering | SQuAD1.1 dev | T5-Large | EM | 86.66 | #4 |
| Question Answering | SQuAD1.1 dev | T5-Large | F1 | 93.79 | #4 |
| Question Answering | SQuAD1.1 dev | T5-Base | EM | 85.44 | #5 |
| Question Answering | SQuAD1.1 dev | T5-Base | F1 | 92.08 | #5 |
| Sentiment Analysis | SST-2 Binary classification | T5-3B | Accuracy | 97.4 | #1 |
| Sentiment Analysis | SST-2 Binary classification | T5-Large | Accuracy | 96.3 | #6 |
| Sentiment Analysis | SST-2 Binary classification | T5-Base | Accuracy | 95.2 | #9 |
| Sentiment Analysis | SST-2 Binary classification | T5-Small | Accuracy | 91.8 | #16 |
| Sentiment Analysis | SST-2 Binary classification | T5-11B | Accuracy | 97.1 | #2 |
| Semantic Textual Similarity | STS Benchmark | T5-11B | Pearson Correlation | 0.925 | #1 |
| Semantic Textual Similarity | STS Benchmark | T5-11B | Spearman Correlation | 0.921 | #1 |
| Semantic Textual Similarity | STS Benchmark | T5-Small | Pearson Correlation | 0.856 | #10 |
| Semantic Textual Similarity | STS Benchmark | T5-Small | Spearman Correlation | 0.85 | #5 |
| Semantic Textual Similarity | STS Benchmark | T5-Base | Pearson Correlation | 0.894 | #8 |
| Semantic Textual Similarity | STS Benchmark | T5-Base | Spearman Correlation | 0.886 | #4 |
| Semantic Textual Similarity | STS Benchmark | T5-Large | Pearson Correlation | 0.899 | #7 |
| Semantic Textual Similarity | STS Benchmark | T5-Large | Spearman Correlation | 0.892 | #3 |
| Semantic Textual Similarity | STS Benchmark | T5-3B | Pearson Correlation | 0.906 | #6 |
| Semantic Textual Similarity | STS Benchmark | T5-3B | Spearman Correlation | 0.898 | #2 |
| Machine Translation | WMT2014 English-French | T5 | BLEU score | 43.4 | #3 |
| Machine Translation | WMT2014 English-German | T5-11B | BLEU score | 32.1 | #2 |
| Natural Language Inference | WNLI | T5-3B | Accuracy | 89.7% | #4 |
| Natural Language Inference | WNLI | T5-11B | Accuracy | 93.2% | #1 |
| Natural Language Inference | WNLI | T5-Small | Accuracy | 69.2% | #8 |
| Natural Language Inference | WNLI | T5-Base | Accuracy | 78.8% | #7 |
| Natural Language Inference | WNLI | T5-Large | Accuracy | 85.6% | #6 |
| Word Sense Disambiguation | Words in Context | T5-11B | Accuracy | 76.1 | #1 |