With the capability of modeling bidirectional contexts, denoising-autoencoding-based pretraining such as BERT achieves better performance than pretraining approaches based on autoregressive language modeling.
Ranked #1 on Sentiment Analysis on Yelp Binary classification
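The contrast drawn in the entry above can be made concrete with a toy sketch: the same token sequence scored once with a left-to-right (autoregressive) loss and once with a masked, denoising loss over bidirectional context. The shapes, vocabulary size, and masking rate below are illustrative assumptions, not details taken from the paper.

```python
# Toy contrast of autoregressive LM vs. denoising (masked) LM objectives.
# Shapes, vocabulary size, and masking rate are illustrative assumptions.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 8
tokens = torch.randint(0, vocab_size, (1, seq_len))   # a toy token sequence
logits = torch.randn(1, seq_len, vocab_size)          # stand-in for model outputs

# Autoregressive objective: predict each token from the tokens to its left only.
ar_loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                          tokens[:, 1:].reshape(-1))

# Denoising-autoencoding objective (BERT-style): mask some positions and predict
# them from the full bidirectional context of the corrupted sequence.
mask = torch.rand(1, seq_len) < 0.15
mask[0, 0] = True                                     # ensure at least one masked position
targets = tokens.clone()
targets[~mask] = -100                                 # ignore unmasked positions in the loss
mlm_loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1),
                           ignore_index=-100)
print(float(ar_loss), float(mlm_loss))
```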
With the success of language pretraining, it is highly desirable to develop more efficient architectures of good scalability that can exploit the abundant unlabeled data at a lower cost.
Ranked #6 on Reading Comprehension on RACE
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers.
Ranked #1 on Question Answering on CoQA
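As a hedged illustration of the entry above, the sketch below extracts bidirectional contextual representations from a pretrained BERT checkpoint through the Hugging Face transformers library; that library and the bert-base-uncased checkpoint name are assumptions of this example rather than details from the entry itself.

```python
# Minimal sketch: bidirectional contextual embeddings from a pretrained BERT
# checkpoint via Hugging Face transformers (assumed dependency; checkpoint name
# is illustrative).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT encodes each token using both left and right context.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One hidden vector per token, each conditioned on the whole sentence.
print(outputs.last_hidden_state.shape)  # (batch, num_tokens, hidden_size)
```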
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks.
Ranked #1 on Natural Language Inference on QNLI
We introduce fairseq S2T, a fairseq extension for speech-to-text (S2T) modeling tasks such as end-to-end speech recognition and speech-to-text translation.
Ranked #3 on Speech-to-Text Translation on MuST-C EN->DE
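fairseq S2T itself is driven through fairseq's command-line tools and config YAMLs; as a Python-only sketch of the end-to-end speech-to-text idea, the example below runs an ASR checkpoint trained with fairseq S2T that has been ported to Hugging Face transformers. The checkpoint name, the transformers and torchaudio dependencies, and the dummy audio are assumptions of this example.

```python
# Hedged sketch: speech-to-text inference with a fairseq-S2T-trained checkpoint
# ported to Hugging Face transformers (transformers + torchaudio assumed installed;
# checkpoint name and dummy audio are illustrative).
import numpy as np
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration

name = "facebook/s2t-small-librispeech-asr"
processor = Speech2TextProcessor.from_pretrained(name)
model = Speech2TextForConditionalGeneration.from_pretrained(name)

audio = np.zeros(16_000, dtype=np.float32)  # one second of dummy 16 kHz audio
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

generated_ids = model.generate(inputs["input_features"],
                               attention_mask=inputs["attention_mask"])
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```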
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks.
Ranked #7 on Question Answering on Natural Questions (short)
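The first clause of the entry above, that pretrained language models store factual knowledge in their parameters, can be probed directly with a cloze-style query; the sketch below uses the Hugging Face fill-mask pipeline as an assumed tool for doing so, with an illustrative checkpoint and prompt.

```python
# Hedged sketch: probing factual knowledge stored in a pretrained LM's parameters
# via a cloze query (transformers assumed installed; checkpoint and prompt illustrative).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The capital of France is [MASK]."):
    print(f"{candidate['token_str']:>10}  {candidate['score']:.3f}")
```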
On unsupervised machine translation, we obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU.
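For reference, a corpus-level BLEU figure like the 34.3 above is computed over a full test set of system outputs against references; a minimal sketch using the sacrebleu package (an assumed dependency, with toy data) follows.

```python
# Minimal sketch of a corpus-level BLEU computation with sacrebleu
# (assumed dependency; the two-sentence corpus is toy data).
import sacrebleu

hypotheses = ["the cat sat on the mat", "a quick brown fox"]
references = [["the cat sat on the mat", "the quick brown fox"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 1))  # BLEU on the 0-100 scale used in results like the one above
```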
Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages.
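A single many-to-many model of this kind can be exercised with the publicly released M2M-100 checkpoint; the sketch below goes through Hugging Face transformers, and the checkpoint name and the German-to-French pair are illustrative assumptions rather than details taken from the entry above.

```python
# Hedged sketch: one model translating between arbitrary language pairs, using an
# M2M-100 checkpoint via Hugging Face transformers (assumed dependency; checkpoint
# and language pair are illustrative).
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "de"
encoded = tokenizer("Das Wetter ist heute schön.", return_tensors="pt")
# Forcing the first generated token to the target-language id selects the output language.
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```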