Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
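The core idea of the unified framework is that every task, whether classification, regression, or generation, is serialized into an input string with a task prefix and a target string. A minimal sketch of this conversion, using prefixes in the style of the paper's examples (the exact strings and the helper function below are illustrative, not the released preprocessing code):

```python
# Text-to-text framing: every task instance becomes (input text -> target text).
# Task prefixes follow the style of the paper's examples; a single model can
# then be trained on all tasks with the same maximum-likelihood objective.

def to_text_to_text(task: str, **fields) -> str:
    """Serialize a task instance into a single prefixed input string."""
    if task == "translate_en_de":
        return f"translate English to German: {fields['text']}"
    if task == "summarize":
        return f"summarize: {fields['text']}"
    if task == "cola":
        # Classification: the target is a label string, e.g. "acceptable".
        return f"cola sentence: {fields['sentence']}"
    if task == "stsb":
        # Regression cast as text: the target is the score rendered as a string.
        return f"stsb sentence1: {fields['s1']} sentence2: {fields['s2']}"
    raise ValueError(f"unknown task: {task}")

print(to_text_to_text("summarize", text="state authorities dispatched emergency crews ..."))
```

Because inputs and targets are always plain text, the same model, loss, and decoding procedure apply to every task; only the prefix tells the model which task it is performing.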

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Question Answering | BoolQ | T5-Small 60M (fine-tuned) | Accuracy | 76.4 | # 31 |
| Question Answering | BoolQ | T5-Base 220M (fine-tuned) | Accuracy | 81.4 | # 25 |
| Question Answering | BoolQ | T5-XXL 11B (fine-tuned) | Accuracy | 91.2 | # 4 |
| Question Answering | BoolQ | T5-Large 770M (fine-tuned) | Accuracy | 85.4 | # 15 |
| Document Summarization | CNN / Daily Mail | T5-11B | ROUGE-1 | 43.52 | # 11 |
| Document Summarization | CNN / Daily Mail | T5-11B | ROUGE-2 | 21.55 | # 2 |
| Document Summarization | CNN / Daily Mail | T5-11B | ROUGE-L | 40.69 | # 8 |
| Abstractive Text Summarization | CNN / Daily Mail | T5 | ROUGE-1 | 43.52 | # 22 |
| Abstractive Text Summarization | CNN / Daily Mail | T5 | ROUGE-2 | 21.55 | # 7 |
| Abstractive Text Summarization | CNN / Daily Mail | T5 | ROUGE-L | 40.69 | # 22 |
| Linguistic Acceptability | CoLA | T5-Large 770M | Accuracy | 61.2% | # 28 |
| Linguistic Acceptability | CoLA | T5-11B | Accuracy | 70.8% | # 12 |
| Linguistic Acceptability | CoLA | T5-XL 3B | Accuracy | 67.1% | # 22 |
| Linguistic Acceptability | CoLA | T5-Base | Accuracy | 51.1% | # 37 |
| Linguistic Acceptability | CoLA | T5-Small | Accuracy | 41.0% | # 41 |
| Natural Language Inference | CommitmentBank | T5-Base 220M (fine-tuned) | F1 | 86.2 | # 7 |
| Natural Language Inference | CommitmentBank | T5-Base 220M (fine-tuned) | Accuracy | 94 | # 9 |
| Natural Language Inference | CommitmentBank | T5-XXL 11B (fine-tuned) | F1 | 93.9 | # 5 |
| Natural Language Inference | CommitmentBank | T5-XXL 11B (fine-tuned) | Accuracy | 96.8 | # 7 |
| Natural Language Inference | CommitmentBank | T5-Large 770M (fine-tuned) | F1 | 90.3 | # 6 |
| Natural Language Inference | CommitmentBank | T5-Large 770M (fine-tuned) | Accuracy | 94.4 | # 8 |
| Question Answering | COPA | T5-XXL 11B (fine-tuned) | Accuracy | 94.8 | # 9 |
| Question Answering | COPA | T5-Base 220M (fine-tuned) | Accuracy | 71.2 | # 47 |
| Question Answering | COPA | T5-Large 770M (fine-tuned) | Accuracy | 83.4 | # 33 |
| Question Answering | COPA | T5-XL 3B (fine-tuned) | Accuracy | 92 | # 11 |
| Semantic Textual Similarity | MRPC | T5-Small | Accuracy | 86.6% | # 31 |
| Semantic Textual Similarity | MRPC | T5-Small | F1 | 89.7 | # 11 |
| Semantic Textual Similarity | MRPC | T5-11B | Accuracy | 90.0% | # 15 |
| Semantic Textual Similarity | MRPC | T5-11B | F1 | 91.9 | # 4 |
| Semantic Textual Similarity | MRPC | T5-Large | Accuracy | 89.9% | # 16 |
| Semantic Textual Similarity | MRPC | T5-Large | F1 | 92.4 | # 3 |
| Semantic Textual Similarity | MRPC | T5-Base | Accuracy | 87.5% | # 25 |
| Semantic Textual Similarity | MRPC | T5-Base | F1 | 90.7 | # 10 |
| Semantic Textual Similarity | MRPC | T5-3B | Accuracy | 89.2% | # 19 |
| Semantic Textual Similarity | MRPC | T5-3B | F1 | 92.5 | # 2 |
| Natural Language Inference | MultiNLI | T5-Large | Matched | 89.9 | # 11 |
| Natural Language Inference | MultiNLI | T5-XXL 11B (fine-tuned) | Matched | 92.0 | # 2 |
| Natural Language Inference | MultiNLI | T5-3B | Matched | 91.4 | # 4 |
| Natural Language Inference | MultiNLI | T5-3B | Mismatched | 91.2 | # 4 |
| Natural Language Inference | MultiNLI | T5-Base | Matched | 87.1 | # 21 |
| Natural Language Inference | MultiNLI | T5-Base | Mismatched | 86.2 | # 15 |
| Natural Language Inference | MultiNLI | T5-Small | Matched | 82.4 | # 36 |
| Natural Language Inference | MultiNLI | T5-Small | Mismatched | 82.3 | # 25 |
| Natural Language Inference | MultiNLI | T5-Large 770M | Mismatched | 89.6 | # 8 |
| Natural Language Inference | MultiNLI | T5-11B | Mismatched | 91.7 | # 2 |
| Question Answering | MultiRC | T5-XXL 11B (fine-tuned) | F1 | 88.1 | # 7 |
| Question Answering | MultiRC | T5-11B | EM | 63.3 | # 3 |
| Multimodal Intent Recognition | PhotoChat | T5-base | F1 | 58.1 | # 3 |
| Multimodal Intent Recognition | PhotoChat | T5-base | Precision | 58.2 | # 2 |
| Multimodal Intent Recognition | PhotoChat | T5-base | Recall | 57.9 | # 5 |
| Multimodal Intent Recognition | PhotoChat | T5-3B | F1 | 58.9 | # 2 |
| Multimodal Intent Recognition | PhotoChat | T5-3B | Precision | 54.1 | # 5 |
| Multimodal Intent Recognition | PhotoChat | T5-3B | Recall | 64.6 | # 2 |
| Natural Language Inference | QNLI | T5-Base | Accuracy | 93.7% | # 19 |
| Natural Language Inference | QNLI | T5-Small | Accuracy | 90.3% | # 35 |
| Natural Language Inference | QNLI | T5-3B | Accuracy | 96.3% | # 7 |
| Natural Language Inference | QNLI | T5-11B | Accuracy | 96.7% | # 6 |
| Natural Language Inference | QNLI | T5-Large 770M | Accuracy | 94.8% | # 12 |
| Question Answering | Quora Question Pairs | T5-Large 770M | Accuracy | 89.9% | # 9 |
| Question Answering | Quora Question Pairs | T5-11B | Accuracy | 90.4% | # 4 |
| Question Answering | Quora Question Pairs | T5-Small | Accuracy | 88.0% | # 16 |
| Question Answering | Quora Question Pairs | T5-Base | Accuracy | 89.4% | # 12 |
| Question Answering | Quora Question Pairs | T5-3B | Accuracy | 89.7% | # 11 |
| Common Sense Reasoning | ReCoRD | T5-11B | F1 | 94.1 | # 5 |
| Common Sense Reasoning | ReCoRD | T5-XXL 11B (fine-tuned) | EM | 93.4 | # 6 |
| Natural Language Inference | RTE | T5-Small | Accuracy | 69.9% | # 54 |
| Natural Language Inference | RTE | T5-XXL 11B (fine-tuned) | Accuracy | 92.5% | # 8 |
| Natural Language Inference | RTE | T5-Base 220M | Accuracy | 80.1% | # 36 |
| Natural Language Inference | RTE | T5-XL 3B | Accuracy | 91.1% | # 14 |
| Natural Language Inference | RTE | T5-Large 770M | Accuracy | 87.2% | # 21 |
| Question Answering | SQuAD1.1 dev | T5-Large 770M | EM | 86.66 | # 6 |
| Question Answering | SQuAD1.1 dev | T5-Large 770M | F1 | 93.79 | # 6 |
| Question Answering | SQuAD1.1 dev | T5-Base | EM | 85.44 | # 8 |
| Question Answering | SQuAD1.1 dev | T5-Base | F1 | 92.08 | # 8 |
| Question Answering | SQuAD1.1 dev | T5-3B | EM | 88.53 | # 5 |
| Question Answering | SQuAD1.1 dev | T5-3B | F1 | 94.95 | # 5 |
| Question Answering | SQuAD1.1 dev | T5-Small | EM | 79.1 | # 16 |
| Question Answering | SQuAD1.1 dev | T5-Small | F1 | 87.24 | # 18 |
| Question Answering | SQuAD1.1 dev | T5-11B | EM | 90.06 | # 1 |
| Question Answering | SQuAD1.1 dev | T5-11B | F1 | 95.64 | # 2 |
| Sentiment Analysis | SST-2 Binary classification | T5-11B | Accuracy | 97.5 | # 1 |
| Sentiment Analysis | SST-2 Binary classification | T5-Large 770M | Accuracy | 96.3 | # 17 |
| Sentiment Analysis | SST-2 Binary classification | T5-3B | Accuracy | 97.4 | # 3 |
| Sentiment Analysis | SST-2 Binary classification | T5-Base | Accuracy | 95.2 | # 24 |
| Sentiment Analysis | SST-2 Binary classification | T5-Small | Accuracy | 91.8 | # 47 |
| Semantic Textual Similarity | STS Benchmark | T5-11B | Pearson Correlation | 0.925 | # 4 |
| Semantic Textual Similarity | STS Benchmark | T5-11B | Spearman Correlation | 0.921 | # 4 |
| Semantic Textual Similarity | STS Benchmark | T5-Large 770M | Spearman Correlation | 0.886 | # 12 |
| Semantic Textual Similarity | STS Benchmark | T5-Small | Pearson Correlation | 0.856 | # 25 |
| Semantic Textual Similarity | STS Benchmark | T5-Small | Spearman Correlation | 0.85 | # 24 |
| Semantic Textual Similarity | STS Benchmark | T5-Base | Pearson Correlation | 0.894 | # 22 |
| Semantic Textual Similarity | STS Benchmark | T5-Large | Pearson Correlation | 0.899 | # 20 |
| Semantic Textual Similarity | STS Benchmark | T5-3B | Pearson Correlation | 0.906 | # 17 |
| Semantic Textual Similarity | STS Benchmark | T5-3B | Spearman Correlation | 0.898 | # 6 |
| Question Answering | WebQuestions | T5.1.1-XXL+SSM | EM | 42.8 | # 6 |
| Semantic Parsing | WebQuestionsSP | T5-11B (Raffel et al., 2020) | Accuracy | 56.5 | # 5 |
| Poll Generation | WeiboPolls | T5 | ROUGE-1 | 45.33 | # 2 |
| Poll Generation | WeiboPolls | T5 | ROUGE-L | 42.69 | # 2 |
| Poll Generation | WeiboPolls | T5 | BLEU-1 | 37.34 | # 2 |
| Poll Generation | WeiboPolls | T5 | BLEU-3 | 21.06 | # 2 |
| Answer Generation | WeiboPolls | T5 | ROUGE-1 | 46.20 | # 2 |
| Answer Generation | WeiboPolls | T5 | ROUGE-L | 43.32 | # 2 |
| Answer Generation | WeiboPolls | T5 | BLEU-1 | 37.77 | # 2 |
| Answer Generation | WeiboPolls | T5 | BLEU-3 | 25.86 | # 1 |
| Question Generation | WeiboPolls | T5 | ROUGE-1 | 44.46 | # 2 |
| Question Generation | WeiboPolls | T5 | ROUGE-L | 42.06 | # 2 |
| Question Generation | WeiboPolls | T5 | BLEU-1 | 36.91 | # 2 |
| Question Generation | WeiboPolls | T5 | BLEU-3 | 16.26 | # 2 |
| Coreference Resolution | Winograd Schema Challenge | T5-XXL 11B (fine-tuned) | Accuracy | 93.8 | # 7 |
| Machine Translation | WMT2014 English-French | T5 | BLEU score | 43.4 | # 9 |
| Machine Translation | WMT2014 English-German | T5-11B | BLEU score | 32.1 | # 4 |
| Machine Translation | WMT2014 English-German | T5-11B | Number of Params | 11110M | # 1 |
| Natural Language Inference | WNLI | T5-Base 220M | Accuracy | 78.8 | # 12 |
| Natural Language Inference | WNLI | T5-Small 60M | Accuracy | 69.2 | # 18 |
| Natural Language Inference | WNLI | T5-XXL 11B | Accuracy | 93.2 | # 3 |
| Natural Language Inference | WNLI | T5-XL 3B | Accuracy | 89.7 | # 6 |
| Natural Language Inference | WNLI | T5-Large 770M | Accuracy | 85.6 | # 10 |
| Word Sense Disambiguation | Words in Context | T5-XXL 11B | Accuracy | 76.9 | # 8 |