LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks, owing to its effective model architecture and the availability of large-scale unlabeled scanned and digital-born documents. We propose the LayoutLMv2 architecture, with new pre-training tasks that model the interaction among text, layout, and image in a single multi-modal framework. Specifically, with a two-stream multi-modal Transformer encoder, LayoutLMv2 uses not only the existing masked visual-language modeling task but also new text-image alignment and text-image matching tasks, which help it better capture cross-modality interaction during pre-training. It also integrates a spatial-aware self-attention mechanism into the Transformer architecture so that the model can capture the relative positional relationships among different text blocks. Experimental results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 $\to$ 0.8420), CORD (0.9493 $\to$ 0.9601), SROIE (0.9524 $\to$ 0.9781), Kleister-NDA (0.8340 $\to$ 0.8520), RVL-CDIP (0.9443 $\to$ 0.9564), and DocVQA (0.7295 $\to$ 0.8672). We have made our model and code publicly available at \url{}.
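The spatial-aware self-attention mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes the mechanism adds relative-position bias terms (one for the token index offset, one each for the horizontal and vertical box offsets) to the raw attention scores before the softmax. The function names, the dictionary-based bias tables, and the clipping range `limit` are all hypothetical; in a real model the biases would be learned.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of attention logits.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def clip(value, limit):
    # Clamp a relative offset to [-limit, limit] so it indexes a small bias table.
    return max(-limit, min(limit, value))

def spatial_aware_attention(scores, xs, ys, bias_1d, bias_x, bias_y, limit):
    """Add relative-position biases to raw scores, then normalise each row.

    scores[i][j] : raw attention logit from token i to token j
    xs, ys       : integer box coordinates of each token (assumed layout input)
    bias_*       : hypothetical tables mapping a clipped offset to a scalar bias
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = [
            scores[i][j]
            + bias_1d[clip(j - i, limit)]
            + bias_x[clip(xs[j] - xs[i], limit)]
            + bias_y[clip(ys[j] - ys[i], limit)]
            for j in range(n)
        ]
        weights.append(softmax(row))
    return weights

# Toy input: three tokens with identical raw scores; tokens 0 and 1 share a row (y = 0).
limit = 2
zero = {d: 0.0 for d in range(-limit, limit + 1)}
same_row_bonus = {d: (1.0 if d == 0 else 0.0) for d in range(-limit, limit + 1)}
scores = [[0.0] * 3 for _ in range(3)]
xs, ys = [0, 1, 0], [0, 0, 1]
w = spatial_aware_attention(scores, xs, ys, zero, zero, same_row_bonus, limit)
```

With these made-up biases, token 0 attends more strongly to token 1 (same row) than to token 2 (next row), even though the raw scores are identical. This is the kind of layout-dependent preference the mechanism is meant to express.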

ACL 2021

Results from the Paper

| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Key Information Extraction | CORD | LayoutLMv2 BASE | F1 | 94.95 | # 7 |
| Key Information Extraction | CORD | LayoutLMv2 LARGE | F1 | 96.01 | # 6 |
| Visual Question Answering (VQA) | DocVQA test | LayoutLMv2 BASE | ANLS | 0.7808 | # 24 |
| Visual Question Answering (VQA) | DocVQA test | LayoutLMv2 LARGE | ANLS | 0.8672 | # 14 |
| Semantic entity labeling | FUNSD | LayoutLMv2 BASE | F1 | 82.76 | # 12 |
| Relation Extraction | FUNSD | LayoutLMv2 large | F1 | 70.57 | # 6 |
| Semantic entity labeling | FUNSD | LayoutLMv2 LARGE | F1 | 84.2 | # 10 |
| Key Information Extraction | Kleister NDA | LayoutLMv2 BASE | F1 | 83.3 | # 2 |
| Key Information Extraction | Kleister NDA | LayoutLMv2 LARGE | F1 | 85.2 | # 1 |
| Document Image Classification | RVL-CDIP | LayoutLMv2 LARGE | Accuracy | 95.64% | # 6 |
| Document Image Classification | RVL-CDIP | LayoutLMv2 BASE | Accuracy | 95.25% | # 11 |
| Document Image Classification | RVL-CDIP | LayoutLMv2 BASE | Parameters | 200M | # 24 |
| Key Information Extraction | SROIE | LayoutLMv2 LARGE | F1 | 96.61 | # 2 |
| Key Information Extraction | SROIE | LayoutLMv2 LARGE (Excluding OCR mismatch) | F1 | 97.81 | # 1 |
| Key Information Extraction | SROIE | LayoutLMv2 BASE | F1 | 96.25 | # 3 |
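
The DocVQA results are reported in ANLS (Average Normalized Levenshtein Similarity). As a rough illustration of how such a score is computed, the sketch below uses the commonly cited definition: for each question, take the best similarity over the accepted answers, zero it out if the normalized edit distance reaches the threshold of 0.5, then average over questions. The function names are hypothetical, and the lowercasing and whitespace stripping are assumptions; the official evaluation script may normalize strings differently.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, keeping only two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def anls(predictions, references, threshold=0.5):
    """predictions: one predicted string per question.
    references: one list of accepted answer strings per question."""
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for ans in answers:
            a = ans.strip().lower()
            longest = max(len(p), len(a))
            nl = levenshtein(p, a) / longest if longest else 0.0
            # Answers that are too far off contribute nothing.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)

# Toy usage with made-up answers (not DocVQA data).
example = anls(["invoice", "kitten"], [["Invoice"], ["sitten", "mitten"]])
```

Because of the threshold, a prediction with a small OCR-style error still earns partial credit, while an unrelated answer scores zero instead of a small positive value.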