Assessing the Coherence Modeling Capabilities of Pretrained Transformer-based Language Models

ACL ARR November 2021 · Anonymous

The task of ordering a shuffled set of sentences into a coherent text is used to evaluate a model's capacity to understand causal and temporal relations between entities and events. Recent approaches rely on pretrained Transformer-based models, but it remains unknown whether differences between them, such as size, pretraining data, and pretraining objectives, affect their coherence modeling capacity. We present a simple architecture for sentence ordering that relies exclusively on pretrained Transformer-based encoder-only models. This allows us to compare the coherence modeling capabilities of the monolingual and multilingual versions of BERT, RoBERTa, and DistilBERT. We show that RoBERTa-based models outperform BERT-based models and are more robust when ordering longer documents of more than 10 sentences. Thus, the intuitive advantage offered by sentence-level objectives such as BERT's Next Sentence Prediction is effectively compensated for by the larger amount and greater diversity of RoBERTa's training data. However, the gap between the multilingual versions of BERT and RoBERTa is narrower, suggesting that exposure to multiple languages partially makes up for the benefits of larger and more diverse training data.
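
The abstract does not specify the sentence-ordering architecture beyond its reliance on pretrained encoder-only models, so the sketch below is only one plausible instantiation, not the authors' method: a pairwise-precedence model that scores whether one sentence should precede another using a pretrained encoder, then greedily orders sentences by their total precedence score. The PairwiseOrderer class, the pairwise formulation, and the greedy ordering are illustrative assumptions; the classification head is untrained here and would need fine-tuning on a sentence-ordering corpus before the output is meaningful.

# Minimal, hypothetical sketch of a sentence-ordering model built on a
# pretrained encoder-only model (not the paper's exact architecture).
import itertools

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class PairwiseOrderer(nn.Module):
    """Scores ordered sentence pairs with a pretrained encoder (first-token pooling)."""

    def __init__(self, model_name: str = "roberta-base"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Binary head: score for "sent_a precedes sent_b". Untrained; illustration only.
        self.head = nn.Linear(hidden, 1)

    def score_pair(self, sent_a: str, sent_b: str) -> torch.Tensor:
        # Encode the two sentences as a single sequence pair.
        inputs = self.tokenizer(sent_a, sent_b, return_tensors="pt", truncation=True)
        cls = self.encoder(**inputs).last_hidden_state[:, 0]  # first-token representation
        return self.head(cls).squeeze(-1)

    @torch.no_grad()
    def order(self, sentences: list[str]) -> list[int]:
        n = len(sentences)
        # Accumulate, for each sentence, how strongly it "wants" to come earlier.
        precedence = torch.zeros(n)
        for i, j in itertools.permutations(range(n), 2):
            precedence[i] += self.score_pair(sentences[i], sentences[j]).item()
        # Sentences with a higher total precedence score are placed first.
        return sorted(range(n), key=lambda i: -precedence[i].item())


if __name__ == "__main__":
    shuffled = [
        "He poured the coffee.",
        "John woke up early.",
        "Then he left for work.",
    ]
    model = PairwiseOrderer("roberta-base")
    print([shuffled[i] for i in model.order(shuffled)])

Under this setup, the comparison described in the abstract would amount to swapping model_name between checkpoints such as bert-base-uncased, roberta-base, distilbert-base-uncased, and their multilingual counterparts, keeping the rest of the architecture fixed.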
