CDLM: Cross-Document Language Modeling

We introduce a new pretraining approach geared to multi-document language modeling, incorporating two key ideas into the masked language modeling self-supervised objective. First, instead of considering documents in isolation, we pretrain over sets of multiple related documents, encouraging the model to learn cross-document relationships. Second, we improve over recent long-range transformers by introducing dynamic global attention that has access to the entire input when predicting masked tokens. We release CDLM (Cross-Document Language Model), a new general language model for the multi-document setting that can be easily applied to downstream tasks. Our extensive analysis shows that both ideas are essential to the success of CDLM and work in synergy to set new state-of-the-art results on several multi-text tasks. Code and models are available at https://github.com/aviclu/CDLM.
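To make the dynamic global attention idea concrete, here is a minimal sketch using the HuggingFace Longformer masked-LM API: the masked position is given global attention, so its prediction can condition on the entire concatenated multi-document input rather than only a local window. The base Longformer checkpoint, the example documents, and the masked position are placeholders standing in for the released CDLM weights and pretraining data.

```python
# Minimal sketch: masked-token prediction with global attention at the
# masked position, over a concatenation of related documents.
# Checkpoint, documents, and masked position are illustrative placeholders.
import torch
from transformers import LongformerForMaskedLM, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")

# Concatenate a set of related documents into a single input sequence.
docs = ["First related document about some event.",
        "Second related document covering the same event."]
text = f" {tokenizer.sep_token} ".join(docs)
inputs = tokenizer(text, return_tensors="pt")
input_ids = inputs["input_ids"]

# Mask one token and grant that position *global* attention, so the model
# can attend across all documents when filling it in.
masked_pos = 3
input_ids[0, masked_pos] = tokenizer.mask_token_id
global_attention_mask = torch.zeros_like(input_ids)
global_attention_mask[0, masked_pos] = 1  # global attention at the masked token

with torch.no_grad():
    logits = model(input_ids,
                   attention_mask=inputs["attention_mask"],
                   global_attention_mask=global_attention_mask).logits

predicted_id = logits[0, masked_pos].argmax(-1)
print(tokenizer.decode(predicted_id))
```

In standard Longformer usage, global attention is a fixed property of a few special positions; the point of the sketch is that here it is assigned dynamically, to whichever positions are masked in the current example.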

Findings of EMNLP 2021

Results from the Paper


| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Citation Recommendation | AAN (test) | CDLM | F1 | 88.8 | #1 |
| Citation Recommendation | AAN (test) | Rand CDLM | F1 | 85.7 | #2 |
| Citation Recommendation | AAN (test) | Longformer | F1 | 85.4 | #3 |
| Entity Cross-Document Coreference Resolution | ECB+ (test) | CDLM | CoNLL F1 | 82.9 | #1 |
| Entity Cross-Document Coreference Resolution | ECB+ (test) | Longformer | CoNLL F1 | 80.4 | #3 |
| Event Cross-Document Coreference Resolution | ECB+ (test) | CDLM | CoNLL F1 | 85.6 | #2 |
| Event Cross-Document Coreference Resolution | ECB+ (test) | Yu et al. | CoNLL F1 | 84.4 | #4 |
| Cross-Document Language Modeling | MultiNews (test) | CDLM | Perplexity | 1.76 | #1 |
| Cross-Document Language Modeling | MultiNews (test) | Rand CDLM | Perplexity | 1.93 | #2 |
| Cross-Document Language Modeling | MultiNews (test) | Longformer | Perplexity | 2.34 | #3 |
| Cross-Document Language Modeling | MultiNews (val) | CDLM | Perplexity | 1.69 | #1 |
| Cross-Document Language Modeling | MultiNews (val) | Rand CDLM | Perplexity | 1.88 | #2 |
| Cross-Document Language Modeling | MultiNews (val) | Longformer | Perplexity | 2.03 | #3 |
| Citation Recommendation | OC | CDLM | F1 | 95.3 | #1 |
| Citation Recommendation | OC | Rand CDLM | F1 | 93.5 | #2 |
| Citation Recommendation | OC | Longformer | F1 | 93.4 | #3 |
| Citation Recommendation | PAN | CDLM | F1 | 82.9 | #1 |
| Citation Recommendation | PAN | Longformer | F1 | 80.4 | #2 |
| Citation Recommendation | PAN | Rand CDLM | F1 | 79.4 | #3 |
| Citation Recommendation | S2ORC | CDLM | F1 | 96.5 | #1 |
| Citation Recommendation | S2ORC | Longformer | F1 | 95.8 | #2 |
| Citation Recommendation | S2ORC | Rand CDLM | F1 | 94.6 | #3 |

Perplexity is lower-is-better; F1 and CoNLL F1 are higher-is-better.
