Better Fine-Tuning by Reducing Representational Collapse

Although widely adopted, existing approaches for fine-tuning pre-trained language models have been shown to be unstable across hyper-parameter settings, motivating recent work on trust region methods. In this paper, we present a simplified and efficient method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise (sampled from either a normal or uniform distribution), thereby discouraging representation change during fine-tuning where possible without hurting performance. We also introduce a new analysis to motivate the use of trust region methods more generally, by studying representational collapse: the degradation of generalizable representations from pre-trained models as they are fine-tuned for a specific end task. Extensive experiments show that our fine-tuning method matches or exceeds the performance of previous trust region methods on a range of understanding and generation tasks (including DailyMail/CNN, Gigaword, Reddit TIFU, and the GLUE benchmark), while also being much faster. We also show that it is less prone to representational collapse: the pre-trained models maintain more generalizable representations every time they are fine-tuned.
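The core objective described above can be sketched as a task loss plus a consistency term that penalizes divergence between the model's output on clean inputs and on inputs perturbed with parametric noise. The following is a minimal NumPy sketch under stated assumptions: the toy linear-softmax "model", the noise scale `sigma`, and the weight `lam` are illustrative placeholders, not the paper's actual architecture or hyper-parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sym_kl(p, q, eps=1e-12):
    # Symmetric KL divergence KL(p||q) + KL(q||p) between two
    # probability vectors; eps guards against log(0).
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q) + q * np.log(q / p)))

def noise_consistency_loss(W, emb, label, sigma=0.1, lam=1.0):
    # Cross-entropy on the clean embedding, plus a symmetric-KL
    # consistency term between the clean forward pass and a forward
    # pass on the embedding perturbed with Gaussian noise.
    noise = rng.normal(0.0, sigma, size=emb.shape)  # parametric noise
    p_clean = softmax(emb @ W)
    p_noised = softmax((emb + noise) @ W)
    ce = -np.log(p_clean[label])
    return ce + lam * sym_kl(p_clean, p_noised)
```

A small usage example: with a random 4-dimensional embedding and a 3-class linear head, the loss is a single positive scalar, and the consistency term vanishes when the clean and noised outputs agree.

```python
W = rng.normal(size=(4, 3))
emb = rng.normal(size=4)
loss = noise_consistency_loss(W, emb, label=1)
```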

ICLR 2021
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Abstractive Text Summarization | CNN / Daily Mail | BART+R3F | ROUGE-1 | 44.38 | #14 |
| Abstractive Text Summarization | CNN / Daily Mail | BART+R3F | ROUGE-2 | 21.53 | #10 |
| Abstractive Text Summarization | CNN / Daily Mail | BART+R3F | ROUGE-L | 41.17 | #18 |
| Text Summarization | GigaWord | BART-RXF | ROUGE-1 | 40.45 | #2 |
| Text Summarization | GigaWord | BART-RXF | ROUGE-2 | 20.69 | #2 |
| Text Summarization | GigaWord | BART-RXF | ROUGE-L | 36.56 | #12 |
| Text Summarization | Reddit TIFU | BART+R3F | ROUGE-1 | 30.31 | #2 |
| Text Summarization | Reddit TIFU | BART+R3F | ROUGE-2 | 10.98 | #3 |
| Text Summarization | Reddit TIFU | BART+R3F | ROUGE-L | 24.74 | #3 |
| Cross-Lingual Natural Language Inference | XNLI Zero-Shot English-to-French | XLM-R R4F | Accuracy | 84.7% | #1 |
| Cross-Lingual Natural Language Inference | XNLI Zero-Shot English-to-German | XLM-R R4F | Accuracy | 84.2% | #1 |
| Cross-Lingual Natural Language Inference | XNLI Zero-Shot English-to-Spanish | XLM-R R4F | Accuracy | 85.2% | #1 |
