Deep Neural Networks (DNNs) have been widely employed in industry to address various Natural Language Processing (NLP) tasks.
However, BERT requires both sentences to be fed into the network jointly, which causes a massive computational overhead: finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours).
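The ~50 million figure is simply the number of unordered sentence pairs a cross-encoder must score, n(n-1)/2. A minimal back-of-the-envelope sketch of the arithmetic, contrasting it with the n single passes a bi-encoder would need (the ~215 passes/second throughput is an illustrative assumption, not a figure from the source):

```python
from math import comb

n = 10_000  # sentences in the collection

# Cross-encoder: every unordered pair must be fed through the network jointly.
pair_passes = comb(n, 2)   # n*(n-1)/2 = 49,995,000 -> "about 50 million"

# Bi-encoder alternative: one encoder pass per sentence, then cheap
# vector similarity over the cached embeddings.
single_passes = n          # 10,000

# Illustrative throughput assumption: ~215 forward passes per second.
hours = pair_passes / 215 / 3600
print(f"{pair_passes:,} pairwise passes ~= {hours:.0f} hours")  # ~65 hours
```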
Research in natural language processing proceeds, in part, by demonstrating that new models achieve superior performance (e.g., accuracy) on held-out test data, compared to previous results.
We conclude that a variety of methods is necessary to reveal all relevant aspects of a model's grammatical knowledge in a given domain.
Prior work, including ELMo and BERT, has demonstrated the importance of pre-training for NLP tasks.
For abstractive summarization, we propose a new fine-tuning schedule that adopts different optimizers for the encoder and the decoder, as a means of alleviating the mismatch between the two (the former is pretrained while the latter is not).
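A minimal sketch of such a two-optimizer schedule in PyTorch; the toy model, learning rates, and warmup lengths are illustrative assumptions standing in for a pretrained BERT encoder paired with a randomly initialized Transformer decoder:

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

# Toy stand-in for an encoder-decoder summarizer (assumption: in the real
# setting, encoder = pretrained BERT, decoder = untrained Transformer).
class Summarizer(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, 4, batch_first=True), 2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d, 4, batch_first=True), 2)

    def forward(self, src, tgt):
        return self.decoder(tgt, self.encoder(src))

model = Summarizer()

# Separate optimizers: a small learning rate and long warmup for the
# pretrained encoder (so its weights are not destroyed early in training),
# a larger rate and shorter warmup for the untrained decoder.
enc_opt = Adam(model.encoder.parameters(), lr=2e-3)
dec_opt = Adam(model.decoder.parameters(), lr=0.1)

def noam(warmup):
    # Noam-style schedule: linear warmup, then inverse-square-root decay.
    return lambda step: min((step + 1) ** -0.5, (step + 1) * warmup ** -1.5)

enc_sched = LambdaLR(enc_opt, noam(20_000))
dec_sched = LambdaLR(dec_opt, noam(10_000))

# One training step on dummy data, just to make the sketch executable.
src = torch.randn(2, 16, 128)
tgt = torch.randn(2, 8, 128)
loss = model(src, tgt).pow(2).mean()
enc_opt.zero_grad()
dec_opt.zero_grad()
loss.backward()
enc_opt.step()
dec_opt.step()
enc_sched.step()
dec_sched.step()
```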
SOTA for Extractive Document Summarization on CNN/Daily Mail (using extra training data)